At ProPublica Illinois, we’ve just restarted a data collection project to get new information about what happens to inmates at one of the country’s largest and most notorious jails.
Cook County Jail has been the subject of national attention and repeated reform efforts since its earliest days. Al Capone famously had “VIP accommodations” there in 1931, with homemade meals and a large cell in the hospital ward that he shared only with his bodyguard. Other prisoners have been more poorly accommodated: In the 1970s, the warden was fired for allegedly beating inmates with his own hands, and the facility was placed under federal oversight in 1974. During the 1980s, the federal government forced the jail to release inmates because of overcrowding. In 2008, the Department of Justice found systematic violation of inmates’ 8th Amendment rights and once again pushed for reforms.
These days, the jail, which has just recently been taken out of the federal oversight program, is under new management. Tom Dart, the charismatic and media-savvy sheriff of Cook County, oversees the facility. Dart has argued publicly for reducing the population and improving conditions at the jail. He’s also called the facility a de facto mental hospital, and said inmates should be considered more like patients, even hiring a clinical psychologist to run the jail.
Efforts to study the jail’s problems date back decades. A 1923 report on the jail by the Chicago Community Trust says, “Indifference of the public to jail conditions is responsible for Chicago’s jail being forty years behind the time.”
The promises to fix it go back just as far. The same 1923 report continues, “But at last the scientific method which has revolutionized our hospitals and asylums is making inroads in our prisons, and Chicago will doubtless join in the process.”
Patterns in the data about the inmate population could shed light on the inner workings of the jail, and help answer urgent questions, such as: How long are inmates locked up? How many court dates do they have? What are the most common charges? Are there disparities in the way inmates are housed or disciplined?
Such detailed data about the inmate population has been difficult to obtain, even though it is a matter of public record. The Sheriff’s Department never answered my FOIA request in 2012 when I worked for the Chicago Tribune.
Around the same time, I started a project at FreeGeek Chicago to teach basic coding skills to Chicagoans through working on data journalism projects. Our crew of aspiring coders and pros wrote code that scraped data from the web we couldn’t get other ways. Our biggest scraping project was the Cook County Jail website.
Over the years, the project lost momentum. I moved on and out of Chicago and the group dispersed. I turned off the scraper, which had broken for inexplicable reasons, last August.
I moved back home to Chicago earlier this month and found the data situation has improved a little. The Chicago Data Cooperative, a coalition of local newsrooms and civic-data organizations, is starting to get detailed inmate data via Freedom of Information requests. But there’s even more information to get.
So for my first project at ProPublica Illinois, I’m bringing back the Cook County Jail scraper. Wilberto Morales, one of the original developers, and I are rebuilding the scraper from scratch to be faster, more accurate and more useful to others interested in jail data. The scraper tracks inmates’ court dates over time and when and where they are moved within the jail complex, among other facts.
Our project complements the work of the Data Cooperative. Their efforts enable the public to understand the flow of people caught up in the system from arrest to conviction. What we’re adding will extend that understanding to what happens during inmates’ time in jail. It’s not clear yet if we’ll be able to track an individual inmate from the Data Cooperative’s data into ours. There’s no publicly available, stable and universal identifier for people arrested in Cook County.
The old scraper ran from June 5, 2014, until July 24, 2016. The new scraper has been running consistently since June 20, 2017. It is nearly feature-complete and robust, writing raw CSVs with all inmates found in the jail on a given day.
Wilberto will lead the effort to develop scripts and tools to take the daily CSVs and load them into a relational database.
We plan to analyze the data with tools such as Jupyter and R and use the data for reporting.
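As a concrete starting point, here is a minimal sketch of how one daily snapshot CSV could be appended to a SQLite database with pandas. The file name, table name and database path are placeholders, not the project’s actual loading scripts.

```python
# A minimal sketch of loading one daily snapshot CSV into SQLite.
# File name, table name and database path are hypothetical.
import sqlite3

import pandas as pd

snapshot = pd.read_csv("daily-snapshots/2017-06-20.csv")  # hypothetical path

with sqlite3.connect("cook-county-jail.db") as conn:
    # Append so each day's snapshot accumulates into one table.
    snapshot.to_sql("inmate_snapshots", conn, if_exists="append", index=False)
```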
A manifest of daily snapshot files (for more information about those, read on) is available at https://s3.amazonaws.com/cookcountyjail.il.propublica.org/manifest.csv
How Our Scraper Works, a High-Level Overview
The scraper harvests the original inmate pages from the Cook County Jail website, mirrors those pages and processes them to create daily snapshots of the jail population. Each record in the daily snapshots data represents a single inmate on a single day.
The daily snapshots are anonymized. Names are stripped out, date of birth is converted to age at booking, and a one-way hash is generated from name, birth date and other personal details so researchers can study recidivism. The snapshot data also contains the booking ID, booking date, race, gender, height, weight, housing location, charges, bail amount, next court date and next court location.
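As a rough illustration of what that anonymization step involves, here is a sketch of a one-way identifier hash and an age-at-booking conversion. The field names and hashing recipe are assumptions for the example, not the scraper’s actual code.

```python
# Illustrative sketch of anonymization: a one-way hash plus age at booking.
# The exact fields and recipe are assumptions, not the project's implementation.
import hashlib


def inmate_hash(first_name, last_name, birth_date, salt=""):
    """Return a stable, non-reversible identifier for recidivism research."""
    raw = "|".join([first_name.strip().upper(),
                    last_name.strip().upper(),
                    birth_date.isoformat(),
                    salt])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def age_at_booking(birth_date, booking_date):
    """Convert date of birth to the inmate's age on the booking date."""
    had_birthday = (booking_date.month, booking_date.day) >= (birth_date.month, birth_date.day)
    return booking_date.year - birth_date.year - (0 if had_birthday else 1)
```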
We don’t make the mirrored inmate pages public, to avoid misuse of personal data for things like predatory mugshot or background check websites.
How Our Scraper Works, the Nerdy Parts
The new scraper code is available on GitHub. It’s written in Python 3 and uses the Scrapy library for scraping.
Data Architecture
When we built our first version of the scraper in 2012, we could use the web interface to search for all inmates whose last name started with a given letter. Our code took advantage of this to collect the universe of inmates in the data management system, simply by running 26 searches and stashing the results.
Later, the Sheriff's Department tightened the possible search inputs and added a CAPTCHA. However, we were still able to access individual inmate pages via their Booking ID. This identifier follows a simple and predictable pattern: YYYY-MMDDXXX where XXX is a zero-padded number corresponding to the order that the inmate arrived that day. For example, an inmate with Booking ID “2017-0502016” would be the 16th inmate booked on May 2, 2017. When an inmate leaves the jail, the URL with that Booking ID starts returning a 500 HTTP status code.
The old scraper scanned the inmate locator and harvested URLs by checking all of the inmate URLs it already knew about and then incrementing the Booking ID until the server returned a 500 response. The new scraper works much the same way, though we’ve added some failsafes in case our scraper misses one or more days.
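Here is a simplified sketch of that booking-ID logic. The URL template is a placeholder rather than the jail’s real endpoint, and a production run would need the failsafes described above, but it shows how the IDs are constructed and walked.

```python
# Simplified sketch of generating booking IDs and walking them until a 500.
# The inmate-page URL template is a placeholder, not the jail's real endpoint.
import requests

URL_TEMPLATE = "https://example.com/inmate/{booking_id}"  # placeholder


def booking_id(booking_date, sequence):
    """Build an ID like 2017-0502016: date prefix plus zero-padded sequence."""
    return "{:%Y-%m%d}{:03d}".format(booking_date, sequence)


def ids_for_day(booking_date, session=None):
    """Yield booking IDs for one day, stopping when the server returns a 500."""
    session = session or requests.Session()
    sequence = 1
    while True:
        candidate = booking_id(booking_date, sequence)
        response = session.get(URL_TEMPLATE.format(booking_id=candidate))
        if response.status_code == 500:  # no page for this ID
            break
        yield candidate
        sequence += 1
```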
The new scraper can also use older data to seed scrapes. This reduces the number of requests we need to send and gives us the ability to compare newer records to older ones, even if our data set has missing days.
Scraping With Scrapy
We’ve migrated from a hodgepodge of Python libraries and scripts to Scrapy. Scrapy’s architecture makes scraping remarkably fast, and it includes safeguards to avoid overwhelming the servers we’re scraping.
Most of the processing is handled by inmate_spider.py. Spiders are perhaps the most fundamental elements that Scrapy helps you create. A spider handles generating URLs for scraping, follows links and parses HTML into structured data.
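For anyone new to Scrapy, here is a toy spider that shows the shape of that workflow. The start URL and CSS selectors are placeholders, and it is far simpler than the real inmate_spider.py.

```python
# An illustrative Scrapy spider; selectors and URLs are placeholders.
import scrapy


class InmateSpider(scrapy.Spider):
    name = "inmates"
    start_urls = ["https://example.com/inmates?page=1"]  # placeholder

    def parse(self, response):
        # Turn each inmate row into a structured record.
        for row in response.css("table.inmates tr.inmate"):
            yield {
                "booking_id": row.css("td.booking-id::text").extract_first(),
                "housing_location": row.css("td.housing::text").extract_first(),
            }
        # Follow pagination links, if any.
        next_page = response.css("a.next::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```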
Scrapy also has a way to create data models, which it calls “Items.” Items are roughly analogous to Django models, but I found Scrapy’s system underdeveloped and difficult to test. It was never clear to me if Items should be used to store raw data and to process data during serialization or if they were basically fancy dicts that I should put clean data into.
Instead, I used a pattern I learned from Norbert Winklareth, one of the collaborators on the original scraper. I wrote about the technique in detail for NPR last year. Essentially, you create an object class that takes a raw HTML string in its constructor. The data model object then exposes parsed and calculated fields suitable for storage.
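In sketch form, the pattern looks something like this, with hypothetical element IDs and field names standing in for the real ones:

```python
# Sketch of the "raw HTML in, parsed fields out" pattern described above.
# Element IDs, field names and date formats are hypothetical.
from datetime import datetime

import lxml.html


class Inmate:
    """Wraps one raw inmate page and exposes parsed, calculated fields."""

    def __init__(self, raw_html):
        self._doc = lxml.html.fromstring(raw_html)

    def _text(self, element_id):
        # Pull the text content of an element by id.
        return self._doc.xpath('string(//*[@id="{0}"])'.format(element_id)).strip()

    @property
    def booking_id(self):
        return self._text("booking-id")

    @property
    def booking_date(self):
        return datetime.strptime(self._text("booking-date"), "%m/%d/%Y").date()

    @property
    def bail_amount(self):
        return self._text("bail-amount")
```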
Despite several of its systems being a bit clumsy, Scrapy really shines due to its performance. Our original scraper worked sequentially and could harvest pages for the approximately 10,000 inmates under jail supervision in about six hours, though sometimes it took longer. Improvements we made to the scraper got this down to a couple hours. But in my tests, Scrapy was able to scrape 10,000 URLs in less than 30 minutes.
We follow the golden rule at ProPublica when we’re web scraping: “Do unto other people’s servers as you’d have them do unto yours.” Scrapy’s “autothrottle” system will back off if the server starts to lag, though we haven’t seen any effect so far on the server we’re scraping.
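AutoThrottle is configured in a Scrapy project’s settings. The values below are illustrative, not our production configuration:

```python
# Example AutoThrottle configuration in a Scrapy settings.py.
# These specific values are illustrative.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay between requests, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # back off up to this much if the server lags
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight
DOWNLOAD_DELAY = 0.5                   # polite baseline delay even when the server is healthy
```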
Scrapy’s speed gains are remarkable. It’s possible that these are due in part to increases in bandwidth, server capacity and in web caching at the Cook County Jail’s site, but in any event, it’s now possible to scrape the data multiple times every day for even higher accuracy.
Pytest
For this project, I also started using a test framework I hadn’t used before.
I’ve mostly used Nose, Unittest and occasionally Doctests for testing in Python. But people seem to like Pytest, including several of the original jail scraper developers, and its output is very nice, so I tried it this time around.
Pytest is pretty slick! You don’t have to write any boilerplate code, so it’s easy to start writing tests quickly. What I found particularly useful is the ability to parameterize tests over multiple inputs.
Take this abbreviated code sample:
```python
testdata = (
    (get_inmate('2015-0904292'), {
        'bail_amount': '50,000.00',
    }),
    (get_inmate('2017-0608010'), {
        'bail_amount': '*NO BOND*',
    }),
    (get_inmate('2017-0611015'), {
        'bail_amount': '25,000',
    }),
)


@pytest.mark.parametrize("inmate,expected", testdata)
def test_bail_amount(inmate, expected):
    assert inmate.bail_amount == expected['bail_amount']
```
In the testdata assignment, the get_inmate function loads an Inmate model instance from sample data, and each tuple pairs that instance with expected values based on direct observation of the scraped pages. Then, by passing the testdata variable to the @pytest.mark.parametrize(...) decorator, the test function is run once for each of the defined cases.
There might be a more effective way to do this with Pytest fixtures. Even so, this is a significant improvement over using metaclasses and other fancy Python techniques to parameterize tests as I did here. Those techniques yield practically unreadable test code, even if they do manage to provide good test coverage for real-world scenarios.
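One fixture-based alternative might look like the sketch below, reusing the same get_inmate helper from the tests above; whether it is actually clearer than parametrizing the test directly is debatable.

```python
# A possible fixture-based version of the parameterized test above.
# Assumes the same get_inmate helper used in the earlier sample.
import pytest

CASES = (
    ('2015-0904292', '50,000.00'),
    ('2017-0608010', '*NO BOND*'),
    ('2017-0611015', '25,000'),
)


@pytest.fixture(params=CASES, ids=lambda case: case[0])
def inmate_case(request):
    booking_id, expected_bail = request.param
    return get_inmate(booking_id), expected_bail


def test_bail_amount(inmate_case):
    inmate, expected_bail = inmate_case
    assert inmate.bail_amount == expected_bail
```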
In the future, we hope to use the Moto library to mock out the complex S3 interactions used by the scraper.
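As a rough idea of what that would enable, here is the kind of test Moto makes possible, assuming its mock_s3 decorator; the bucket name, object key and uploaded body are hypothetical.

```python
# A minimal sketch of an S3 test with Moto; bucket name and key are hypothetical.
import boto3
from moto import mock_s3


@mock_s3
def test_upload_snapshot():
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="example-jail-data-bucket")

    # The code under test would upload a daily snapshot here; we fake it inline.
    s3.put_object(Bucket="example-jail-data-bucket",
                  Key="daily/2017-06-20.csv",
                  Body=b"booking_id,age_at_booking\n2017-0502016,33\n")

    body = s3.get_object(Bucket="example-jail-data-bucket",
                         Key="daily/2017-06-20.csv")["Body"].read()
    assert body.startswith(b"booking_id")
```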
How You Can Contribute
We welcome collaborators! Check out the contributing section of the project README for the latest guidance. You can browse the issue queue, fork the project, make your changes and submit a pull request on GitHub!
And if you’re not a coder but you notice something in our approach to the data that we could be doing better, don’t be shy about submitting an issue.