One of the most basic functions of data journalism is to independently verify government officials’ claims.
So when Texas Gov. Greg Abbott and state officials started to boast that Operation Lone Star — a now multibillion-dollar initiative launched in March 2021 to battle cross-border crime — had resulted in thousands of arrests, multiple drug seizures and numerous referrals of unauthorized immigrants to the federal government for deportation, we asked for the underlying data. It was immediately clear that examining the operation’s achievements would be a challenge.
ProPublica’s joint investigative unit with The Texas Tribune partnered with The Marshall Project to collect and analyze the data state agencies were keeping about the operation and report out the findings. I chatted with Marshall Project data reporter Andrew Rodriguez Calderón to learn more about how the newsrooms used data to identify questions about the state’s narrative surrounding Operation Lone Star.
(By the way, The Marshall Project is an investigative newsroom focused on criminal justice. It produces incredible work that you can get in your inbox by signing up for one of these newsletters.)
As Calderón told me, Texas’ Department of Public Safety sent records about Operation Lone Star last summer in response to requests from our reporters. But the data the agency initially gave us was a mess.
The two releases came from three different departments, each with its own way of recording arrests and charges. In most cases, each row of the data represented one charge (and it’s important to note that an arrest can result in multiple charges), but the way the charges were entered was not standardized or easy to understand. This made the data nearly impossible to analyze.
“We were trying to reconcile all of those datasets to turn them into one master representation,” Calderón said.
Four reporters — Calderón and his Marshall Project colleague Keri Blakinger, along with Lomi Kriel and Perla Trevizo from the ProPublica-Texas Tribune partnership — spent hours combing through thousands of rows and manually comparing the arrest data with the FBI’s Uniform Crime Report and the Texas penal code in order to standardize the charges and then group similar charges into buckets.
That meticulous data review was both frustrating and galvanizing for the team of reporters, who wondered how Abbott and other state officials had drawn the conclusions underlying their nearly weekly public statements.
“We hadn’t been able to shape this data to be able to say something about it. So how were they doing it?” Calderón asked as he explained the process.
In November, the state sent the reporters new data that standardized the charges, making it easier to analyze. DPS officials asked reporters to ignore the previous data the agency released, saying it only included charges from some of the counties conducting arrests. The new dataset covered charges from Operation Lone Star’s launch in March 2021 through November, and it included counties beyond the border.
Now, the rows included a column classifying each charge as a felony or a misdemeanor, and there was a column with standardized charges. This meant that Calderón could run a simple script to help categorize the charges into groupings.
“Part of the goal of cleaning the data was for us to be able to classify them as drug charges or vehicle charges or violent charges or traffic offenses, that sort of thing,” Calderón said.
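A script like the one Calderón describes could take many forms; the original isn't published with the story. As a rough sketch of the approach, the snippet below matches each standardized charge description against keyword lists to assign a bucket. The charge strings, bucket names, and keywords here are invented for illustration, not taken from the state's data.

```python
# Hypothetical sketch of charge bucketing: map each standardized charge
# description to a category by keyword matching. All values are illustrative.
CATEGORIES = {
    "drug": ["POSS CS", "MARIJUANA", "DELIVERY CS"],
    "vehicle": ["UNAUTH USE OF VEHICLE", "EVADING ARREST W/VEHICLE"],
    "violent": ["ASSAULT", "AGG ROBBERY"],
    "traffic": ["DWI", "DRIVING W/LIC INV"],
}

def categorize(charge: str) -> str:
    """Return the first bucket whose keyword appears in the charge text."""
    text = charge.upper()
    for bucket, keywords in CATEGORIES.items():
        if any(kw in text for kw in keywords):
            return bucket
    return "other"

# Example: run a few made-up charge descriptions through the classifier.
charges = ["POSS CS PG 1 <1G", "ASSAULT FAMILY MEMBER", "SMUGGLING OF PERSONS"]
buckets = [categorize(c) for c in charges]
```

With standardized charge text in its own column, even this simple keyword pass can sort thousands of rows in seconds — the hard part, as the reporters found, was getting the text standardized in the first place.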
The state later sent us a second comprehensive dataset, which went through December, and then a third, which expanded the time period through January.
We used the third dataset to conduct the analysis. Bolstering this data with additional reporting led us to these conclusions about the state’s claims:
- The state’s data includes arrests and charges that had no connection to the border.
- The arrest data includes work done by troopers stationed in the targeted counties before Operation Lone Star’s launch.
- Arrest and drug seizure data does not show how the operation’s work is distinguished from that of other law enforcement agencies.
In response to the findings, the governor’s office maintained that “dangerous individuals, deadly drugs, and other illegal contraband have been taken off our streets or prevented from entering the State of Texas altogether thanks to the men and women of Operation Lone Star.” But DPS and Abbott have provided little proof to substantiate such statements.
And the team found another wrinkle in the state’s narrative as it conducted its analysis. Reporters compared the three different datasets with one another, looking at how the data changed over time.
Calderón compared the different datasets the team had received from the state using what’s called an anti-join function.
A join function takes two datasets, finds matching rows and combines their columns. An anti-join does the opposite. Instead of adding sheets together, it analyzes two sets of similar data and outputs only the rows that are different between them.
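The story doesn't say which tool Calderón used ("anti-join" is the name dplyr gives the function in R, and pandas can do the same thing with a flagged merge). As an illustration of the concept, here is a small pandas sketch with invented column names and rows: it keeps only the charges present in an earlier release but missing from a later one.

```python
# Illustrative anti-join in pandas. The data and column names are invented;
# the technique is the point: find rows in `earlier` with no match in `later`.
import pandas as pd

earlier = pd.DataFrame({
    "arrest_id": [101, 102, 103, 104],
    "charge": ["POSS CS", "ASSAULT", "DWI", "SMUGGLING"],
})
later = pd.DataFrame({
    "arrest_id": [101, 103],
    "charge": ["POSS CS", "DWI"],
})

# A left merge with indicator=True tags each row "both" (matched) or
# "left_only" (present only in the earlier release).
merged = earlier.merge(later, on=["arrest_id", "charge"],
                       how="left", indicator=True)

# The anti-join keeps only the unmatched rows: charges that disappeared.
dropped = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Running the same comparison in the other direction (later against earlier) would surface charges that were added between releases.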
Using this function, Calderón found that by the time DPS gave the news organizations the third dataset in January, more than 2,000 charges had been removed. The state stopped counting them toward Operation Lone Star after the news organizations started asking questions.
Asked by the news organizations why such charges were not excluded from the operation’s metrics at the start, DPS officials said they are continuously improving how they collect and report the data “to better reflect the mission” of securing the border. The officials said it wasn’t valid to say charges had been removed.
But in the explanation at the bottom of the story, reporters clarify:
The constantly changing nature of the database is not unique to Operation Lone Star. Methods for comparing datasets are commonly used and actively studied. It is valid to analyze changes in such databases (with the appropriate caveats) and to describe them as additions or removals. DPS itself told reporters the department “identified offenses that should be removed” in a December 2021 email about changes to Operation Lone Star data collection.
Basically: We stand by our analysis.
Being systematic about data and documenting every step might take time on the front end, but it makes the process more transparent, easier to replicate and stronger overall.