Tutorial

Learn how to use this data in your own work.

In this tutorial, learn how to analyze the jail booking and warrant data and replicate our main analyses, starting from the raw data. See our code on GitHub for more examples.

Debtors’ Prisons Project Tutorial

The basics

Working in R starts with the dataframe. Dataframes are like very organized spreadsheets. They have rows (also called “observations”) and columns (also called “variables”). The spreadsheet we’ve given you is structured like this: each row represents a booking in the Liberty County, TX, jail — or, more accurately, part of a booking (more on that later) — and each column represents something about that booking, for instance the zip code where the booked individual lived, their age, or one of their charges. Let’s take a look at the dataframe now.

colnames(bookings)
##  [1] "race"           "ethnicity"      "sex"            "age"           
##  [5] "state"          "zip"            "booking_date"   "release_date"  
##  [9] "charge"         "severity"       "release_type"   "bond"          
## [13] "fine"           "note"           "length_of_stay" "ftp"           
## [17] "all_ftp"        "any_ftp"        "booking_id"     "raw_row_number"

One thing we might be interested in immediately is what date range the data cover, that is, what the earliest and latest bookings in this dataset are.

We know from above that there are two different columns in the dataframe that deal with dates: booking_date and release_date. Finding the minimum and maximum in R is easy — R provides the min() and max() functions to us for this exact purpose. The problem is that we have to access the date data in our dataframe separately from the rest of the dataframe. If we try to use the min() function on a dataframe directly, it’ll produce an error.
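
To see the problem concretely, here’s a minimal sketch using a tiny made-up data frame (the columns are hypothetical stand-ins for the real ones):

```r
# Toy data frame mixing a factor and dates, loosely like our bookings data
df <- data.frame(
  race         = factor(c("white", "black")),
  booking_date = as.Date(c("2005-02-18", "2005-07-02"))
)

# Asking for the minimum of the whole data frame fails...
tryCatch(min(df), error = function(e) conditionMessage(e))

# ...but the minimum of a single column works fine
min(df$booking_date)  # "2005-02-18"
```

The tryCatch() wrapper just captures the error message so the example keeps running instead of halting.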

Instead, we can use the $ to separate out a column that we want to interact with on its own. Compare the difference between our dataframe as a whole:

print(bookings)
## # A tibble: 107,173 x 20
##    race  ethnicity sex     age state zip   booking_date release_date charge
##    <fct> <fct>     <fct> <int> <chr> <chr> <date>       <date>       <chr> 
##  1 white non_hisp… fema…    NA <NA>  <NA>  2005-07-02   2005-07-03   FAMIL…
##  2 white hispanic  fema…    45 tx    77575 2005-02-18   2005-02-18   DRIVI…
##  3 white non_hisp… male     56 tx    77074 2005-03-01   2006-05-18   HOLD …
##  4 white non_hisp… fema…    60 tx    77502 2005-05-17   2005-05-17   THEFT…
##  5 white non_hisp… fema…    60 tx    77502 2005-05-17   2005-05-17   FTA-T…
##  6 white non_hisp… fema…    60 tx    77502 2005-05-17   2005-05-17   THEFT…
##  7 white non_hisp… fema…    49 tx    77533 2005-05-24   2005-05-25   PUBLI…
##  8 white non_hisp… male     31 tx    77320 2005-05-30   2005-05-31   DRIVI…
##  9 white non_hisp… fema…    49 tx    77532 2005-08-03   2005-08-04   PUBLI…
## 10 white non_hisp… fema…    20 tx    <NA>  2005-06-06   2005-09-17   ARRES…
## # … with 107,163 more rows, and 11 more variables: severity <chr>,
## #   release_type <chr>, bond <dbl>, fine <dbl>, note <chr>,
## #   length_of_stay <dbl>, ftp <lgl>, all_ftp <lgl>, any_ftp <lgl>,
## #   booking_id <chr>, raw_row_number <int>

and just the booking dates:

head(bookings$booking_date)
## [1] "2005-07-02" "2005-02-18" "2005-03-01" "2005-05-17" "2005-05-17"
## [6] "2005-05-17"

NOTE: The head() function makes sure we just see the first few entries in booking_date rather than all 107173 booking dates in our dataframe. Try it yourself to see what happens if we just print bookings$booking_date directly! (Remember that CTRL-C halts execution if you end up with more than you bargained for.)

Using the $ accessor, we can see the date range that our booking data cover.

min(bookings$booking_date)
## [1] NA
max(bookings$booking_date)
## [1] NA

Oops: we didn’t get an answer! That’s because our dataframe contains some missing data (people whose booking dates were not recorded), which R records as NA. To get the largest and smallest booking dates, we have to add an extra argument to the max() and min() functions so that they ignore the missing values and tell us the largest and smallest dates we do know.
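
The same thing happens with any vector that contains an NA; here’s a tiny made-up example:

```r
# A toy vector of dates with one missing value
dates <- as.Date(c("2005-07-02", NA, "2003-09-17"))

min(dates)                # NA: with missing values, R won't commit to a minimum
min(dates, na.rm = TRUE)  # "2003-09-17": drop the NAs, then take the minimum
```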

min(bookings$booking_date, na.rm = TRUE)
## [1] "2003-09-17"
max(bookings$booking_date, na.rm = TRUE)
## [1] "2018-07-31"

Cool! So, it looks like we have about 15 years of data in our dataset.

But we only have the first seven months of 2018, so we’ll want to drop that year.

To do that, we’ll use the filter function, from the dplyr package, to keep exactly those rows where the year is before 2018. Then we’ll use the assignment operator <- to overwrite bookings. (If you’ve used other programming languages, like Python, just think of <- as R’s equivalent of =.)

We have just one problem: there’s no year column in the dataframe! There’s only booking_date. Fortunately, that’s not a problem: the year function in the lubridate package allows us to easily turn a date into a year.

So, to recap, we want to filter all of the bookings where the year of the booking is less than (<) 2018. Then, we need to assign the result back to bookings with <-. Let’s try it!

bookings <- filter(bookings, year(booking_date) < 2018)

You might be wondering why we were able to write year(booking_date) instead of year(bookings$booking_date). If we try to run year(booking_date) < 2018 normally, we’ll get an error:

year(booking_date) < 2018
## Error in year(booking_date): object 'booking_date' not found

The reason is that filter is special: dplyr verbs make the column names into variables. Since there’s a booking_date variable in bookings, inside of filter(), we can use the booking_date variable. Basically, it lets us drop the $. We also could have written filter(bookings, year(bookings$booking_date) < 2018), but it’s a lot less concise.

Let’s explore some of the other columns in the dataframe. I want to highlight two in particular: the charge column, and the booking_id column. Instead of just saying what they are, let’s see if we can explore a little bit and find out for ourselves.

To do this, we’ll use the select() function, which works a lot like filter(), except it operates on columns instead of rows.

select(bookings, charge, booking_id)
## # A tibble: 103,742 x 2
##    charge                                      booking_id                      
##    <chr>                                       <chr>                           
##  1 FAMILY VIOLENCE                             d7ceb4d59f189f674b056354c398db1c
##  2 DRIVING WHILE INTOXICATED                   74af629059cd75c97283b1dd495226c7
##  3 HOLD FOR USM E-TX                           4ea0ddab5c5a3eb6332514263fa1e44c
##  4 THEFT BY CHECK                              4279024ea5137e880bf8758283a1cf49
##  5 FTA-THEFT                                   4279024ea5137e880bf8758283a1cf49
##  6 THEFT BY CHECK                              4279024ea5137e880bf8758283a1cf49
##  7 PUBLIC INTOXICATION                         19fea7c40e633e4fc988c0fdb5d09954
##  8 DRIVING WHILE INTOXICATED 2ND               09c9a2d058929f51d6f5620f2ff75248
##  9 PUBLIC INTOXICATION                         4bd6095db90e3fba2e88ac1ea723915d
## 10 ARREST WARRANT(PROB VIOL-CREDIT CARD ABUSE) 2e05f2a9b891c8937e521d41e2c44a0d
## # … with 103,732 more rows

So, we can see that select() took bookings, which is a dataframe with 20 columns, and made it a two-column dataframe. We can also see that the charge column, perhaps unsurprisingly, contains a description of the charges for which different people were booked. The booking_id column, however, is a little more mysterious. It looks like a long string of basically random letters and numbers. (In fact, that’s exactly what it is: this is known as a “hash” in computer science.) What’s important about it is not the exact contents, however, but rather the fact that it is not different for every row. Unsurprisingly, booking_id can be thought of as an identifier for each booking, and every row with the same booking_id represents information from the same booking. If we look at the fourth booking_id, which appears in several rows, we’ll see that while certain columns, like charge and bond, change from row to row, other information, like the demographic information for the booking, stays the same.

fourth_booking_id <- bookings$booking_id[[4]]
bookings %>%
  filter(booking_id == fourth_booking_id) %>%
  select(sex, age, booking_date, charge, length_of_stay, booking_id)
## # A tibble: 3 x 6
##   sex      age booking_date charge       length_of_stay booking_id              
##   <fct>  <int> <date>       <chr>                 <dbl> <chr>                   
## 1 female    60 2005-05-17   THEFT BY CH…              0 4279024ea5137e880bf8758…
## 2 female    60 2005-05-17   FTA-THEFT                 0 4279024ea5137e880bf8758…
## 3 female    60 2005-05-17   THEFT BY CH…              0 4279024ea5137e880bf8758…

Ok, a few new things happened here. First, we used [[4]] to get the fourth element of a list (technically, in R jargon, a “vector”). Then, we filtered the bookings to just those rows where the booking_id equaled (==) that specific booking_id. Finally, we select()-ed some demographic columns and some charge columns.

What we saw is that the booking_id “4279024ea5137e880bf8758283a1cf49” corresponds to a 60-year-old female who was booked on the following charges: THEFT BY CHECK, FTA-THEFT, THEFT BY CHECK.

This raises an interesting question. How many bookings are there in our data? It’s not just the number of rows, which we could determine by running nrow() on the dataframe. The reason it’s not is that some of the bookings involved multiple charges, and as a result are spread across several rows. Instead, it’s the number of distinct booking_ids. To find that out, we can use the n_distinct() function in dplyr.

n_distinct(bookings$booking_id)
## [1] 57551

So, there are 57551 bookings spread across 103742 rows. That means a good portion of bookings involved more than one charge.

The kind of charge we are particularly interested in is failure-to-pay charges. We have precomputed these for you. The results are stored in three columns: ftp, all_ftp, and any_ftp. The first of these, ftp, varies by row. It indicates, for an individual charge, whether the charge was failure-to-pay–related. The second and third, all_ftp and any_ftp, vary by booking. That means that for any two rows with the same value of booking_id, the values of all_ftp and any_ftp are the same. If all_ftp has the value TRUE, that means that all of the charges associated with that booking are failure-to-pay–related; if any_ftp is TRUE, that means that at least one of the charges associated with the booking is failure-to-pay–related.
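
To make the relationship concrete, here’s a sketch, with made-up data rather than our actual precomputation, of how all_ftp and any_ftp line up with the row-level ftp column (it uses two dplyr functions, group_by() and mutate(), that we’ll meet again later):

```r
library(dplyr)

# Hypothetical mini-dataframe: booking "a" has one FTP charge and one
# non-FTP charge; booking "b" has a single FTP charge
toy <- tibble(
  booking_id = c("a", "a", "b"),
  ftp        = c(TRUE, FALSE, TRUE)
)

toy <- toy %>%
  group_by(booking_id) %>%
  mutate(all_ftp = all(ftp), any_ftp = any(ftp)) %>%
  ungroup()

# For booking "a", all_ftp is FALSE (one charge isn't FTP) but any_ftp is
# TRUE; for booking "b", both are TRUE
toy
```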

We can look at some of the charges classified as failure-to-pay just by filtering on ftp. (Do you see why we don’t need to use ftp == TRUE?)

bookings %>%
  filter(ftp) %>%
  select(charge)
## # A tibble: 6,869 x 1
##    charge                            
##    <chr>                             
##  1 CAPIAS PRO FINE(CRIMINAL TRESPASS)
##  2 ASSAULT FV                        
##  3 CAPIAS PRO FINE (DIS. CONDUCT)    
##  4 EXHIBITION OF SPEED               
##  5 FTA                               
##  6 PUBLIC INTOXICATION               
##  7 CAPIAS PRO FINE(RECKLESS DRIVING) 
##  8 PUBLIC INTOXICATION               
##  9 CAPIAS PRO WARRANT                
## 10 PUBLIC INTOXICATION               
## # … with 6,859 more rows

We can see that in Liberty County, many of the failure-to-pay charges have “CAPIAS PRO FINE” (often abbreviated “CPF”) in them; a capias pro fine is a kind of warrant issued in cases in Texas in which someone has failed to pay court debts.

We can also see that sometimes people are booked on a failure to pay charge and a more serious charge:

bookings %>%
  filter(any_ftp, ! ftp) %>%
  select(charge, booking_id)
## # A tibble: 3,440 x 2
##    charge                                 booking_id                      
##    <chr>                                  <chr>                           
##  1 POSS CS PG 1 >=4G<200G                 4a388a1554a036d9e8781bde0cf077e8
##  2 ARREST WARRANT(PROB-POSS C/S)          4a388a1554a036d9e8781bde0cf077e8
##  3 REVOCATION OF PROBATION                4a388a1554a036d9e8781bde0cf077e8
##  4 ARREST WARRANT (AGG ROBBERY)           5575cda94cf33cc22604771784b2ced2
##  5 REVOCATION COMMUNITY SUPERVISON        5575cda94cf33cc22604771784b2ced2
##  6 REVOCATION PROBATION(RECKLESS DRIVING) 87094c0aa8e1d48bf781df61c8f6d6c6
##  7 BLUE WARRANT CHANGED 05.23.05          795eb56490361543008b73377be34eab
##  8 ARREST WRT (DWI)                       f38810bd282e11e64f4dbf2dcc562ba9
##  9 REVOCATION PROBATION                   215ab20dd81b74e040285db4874ec0a0
## 10 REVOCATION PROBATION                   352bdcc28ac34a64505263f3ba4f5e6d
## # … with 3,430 more rows

Here, we used the ! operator (pronounced “bang” or “not”) to filter for rows that met two criteria: any_ftp is TRUE, meaning that at least one charge associated with the given booking_id, possibly in a different row, is failure-to-pay–related; and ! ftp, meaning that the charge in the given row is not failure-to-pay–related.

So, how many instances of failure-to-pay bookings do we have in this data set? Here, we’ll count a booking as “failure to pay” only if all the associated charges are failure-to-pay–related. We can find the answer by combining filter() and n_distinct():

bookings %>%
  filter(all_ftp) %>%
  with(n_distinct(booking_id))
## [1] 3115

So between 2003-09-17 and 2017-12-31, there were 3115 bookings in Liberty County, TX for failure to pay. We’ll save these bookings as a new variable, called ftp_bookings, so we don’t have to keep running filter() every time.

ftp_bookings <- filter(bookings, all_ftp)

Basic Plotting

Now, knowing the total number of bookings is useful, but it’s also useful to know how the number of bookings is changing over time. For that, we’ll have to use group_by() and summarize(). What group_by() does is, effectively, split our dataframe into different pieces. We know how to get the number of bookings from a whole dataframe, so if we had different dataframes for each year, we could figure out the year-by-year change. However, all of the years are put together in our ftp_bookings dataframe. The group_by() function will — under the hood — separate them all out. Then, we can apply the summarize() function to get what we want from each dataframe, namely, n_distinct(booking_id). Let’s try it:

ftp_bookings %>%
  group_by(year = year(booking_date)) %>%
  summarize(n_ftp_bookings = n_distinct(booking_id))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 14 x 2
##     year n_ftp_bookings
##    <dbl>          <int>
##  1  2004              1
##  2  2005            301
##  3  2006            295
##  4  2007            193
##  5  2008            172
##  6  2009            188
##  7  2010            175
##  8  2011            207
##  9  2012            249
## 10  2013            294
## 11  2014            296
## 12  2015            243
## 13  2016            276
## 14  2017            225

Looking at this, we might notice two things. First, only one FTP booking is recorded for 2004. Let’s look at what’s going on there:

ftp_bookings %>%
  filter(year(booking_date) == 2004)
## # A tibble: 1 x 20
##   race  ethnicity sex     age state zip   booking_date release_date charge
##   <fct> <fct>     <fct> <int> <chr> <chr> <date>       <date>       <chr> 
## 1 white hispanic  male     33 tx    75901 2004-12-31   2005-01-03   PUBLI…
## # … with 11 more variables: severity <chr>, release_type <chr>, bond <dbl>,
## #   fine <dbl>, note <chr>, length_of_stay <dbl>, ftp <lgl>, all_ftp <lgl>,
## #   any_ftp <lgl>, booking_id <chr>, raw_row_number <int>

Ah! Someone was booked on New Year’s Eve in 2004 and then released in early January 2005, so they made it into the data, even though most of the bookings from 2004 didn’t. So, to make sure we’re only using years for which everyone booked appears in the data, we’ll filter 2004 out. (We already filtered out 2018.)

ftp_bookings <- filter(ftp_bookings, year(booking_date) > 2004)

The second thing to notice is that it looks like the number of failure-to-pay bookings has gone up and down, but it’s a little hard to see how just by reading off this table. Instead, it’s better to visualize it as a graph. For that, we’ll use ggplot. We’ll need to tell ggplot the value that goes along the \(x\)-axis (viz., year), the value that goes along the \(y\)-axis (viz., n_ftp_bookings), and what kind of graph we want; in this case, we’ll do a line graph. (We’ll also adjust the defaults a little, so that the \(y\)-axis includes 0 and the labels are a little easier to read.) Putting it all together, we get:

ftp_bookings %>%
  group_by(year = year(booking_date)) %>%
  summarize(n_ftp_bookings = n_distinct(booking_id), .groups = "drop") %>%
  ggplot(aes(x = year, y = n_ftp_bookings)) +
  geom_line() +
  expand_limits(y = 0) +
  scale_x_continuous(breaks = breaks_pretty(13)) +
  labs(
    x = "Year",
    y = "Number of Failure to Pay Jailings",
    title = "Liberty County, TX"
  ) +
  theme_bw()

Length of Stay

The next analysis we’ll do is looking at lengths of stays in jail for people held on failure to pay alone. One of the best ways to get a handle on this sort of question (“How long do court debtors spend in jail?”) is through a histogram. The ggplot package makes plotting histograms easy. All we do is make the variable whose distribution we want to visualize—in this case, length_of_stay—the x aesthetic. However, we need to be careful not to double-count: people who were booked on more than one charge will show up in more than one row, even though they only stayed in jail for a certain length of time once. To help deal with this, we can use the dplyr function distinct(), which will pull out exactly one row for each booking_id.

ftp_bookings %>%
  distinct(booking_id, length_of_stay) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram() +
  labs(
    x = "Length of stay (days)",
    y = "Number of bookings",
    title = "Liberty County, TX"
  ) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ok! A good first pass. One thing to note is that R is complaining that we haven’t chosen the number of bins in our histogram, so it automatically picked thirty bins that roughly fit the data. These bins are roughly 2 days wide. But the data are a little sparse at the higher end, so it probably makes sense to use wider bins, say, 5-day bins.

ftp_bookings %>%
  distinct(booking_id, length_of_stay) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(binwidth = 5) +
  labs(
    x = "Length of stay (days)",
    y = "Number of bookings",
    title = "Liberty County, TX"
  ) +
  theme_bw()

That’s a little better. But our plot is still hard to interpret because it’s covering so many different scales. Close to 0 days, there are thousands of bookings, but on the high end of the scale, there are almost none. When data cover multiple orders of magnitude, it’s often useful to visualize them on a logarithmic scale. We can achieve this by adding scale_y_log10() to our plot.

ftp_bookings %>%
  distinct(booking_id, length_of_stay) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(binwidth = 5) +
  scale_y_log10() +
  labs(
    x = "Length of stay (days)",
    y = "Number of bookings",
    title = "Liberty County, TX"
  ) +
  theme_bw()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).

Now we can see the trend in our data slightly more clearly. It looks like the number of bookings decreases roughly exponentially with the number of days spent in jail. However, R again raised a couple of warnings. These come from the fact that some bins are empty, but the logarithm of 0 is undefined. We can use one last trick to help us visualize the data. Instead of using the transformation \(y \mapsto \log_{10}(y)\), we’ll use the transformation \(y \mapsto \log_{10}(y + 1)\). Then 0 will get sent to 0 instead of negative infinity, but larger numbers will be barely affected. (The logarithms of 1,000 and 1,001 are approximately the same.)
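
A quick check on a few made-up counts shows why this transformation does what we want:

```r
counts <- c(0, 9, 999)

log10(counts)      # -Inf 0.95 3.00 (roughly): zero becomes negative infinity
log10(counts + 1)  # 0 1 3: zero stays at zero

# For large counts, adding 1 barely matters:
log10(1001) - log10(1000)  # about 0.0004
```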

To manipulate the height of the columns, we’ll have to use ggplot’s stat syntax. When we called geom_histogram() in the previous plots, it effectively had a tacit aesthetic: aes(x = length_of_stay, y = stat(count)). Now, we want to change that aesthetic to aes(x = length_of_stay, y = log10(stat(count) + 1)). We have to put this aesthetic in the call to geom_histogram() rather than the call to ggplot() because the counts aren’t calculated until the histogram is built.

ftp_bookings %>%
  distinct(booking_id, length_of_stay) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(aes(y = log10(stat(count) + 1)), binwidth = 5) +
  # This puts the y-axis labels back on the original scale.
  scale_y_continuous(
    breaks = 0:4,
    labels = c("0", "9", "99", "999", "9,999")
  ) +
  labs(
    x = "Length of stay (days)",
    y = "Number of bookings",
    title = "Liberty County, TX"
  ) +
  theme_bw()

Per Capita Rates

Our earlier counts of FTP bookings are a little hard to make sense of. Does 301 failure to pay jailings in a year mean that the issue is widespread or not? It would be easier to make sense of if we could compare it to some sort of baseline, like the population of Liberty County, TX. Fortunately, we can get that data directly from the Census Bureau using the tidycensus package. (NOTE: To use the tidycensus package, you’ll need to set up a Census API key.)

To use tidycensus, we’ll need to know a few pieces of information:

  1. Geography: What level of geography do we want information on? Geographies range from census blocks, which are the smallest geography you can generally get demographic information on, all the way up to entire states or the whole U.S.
  2. Variables: What demographic information do we actually want to get from the census? Many census data sources have thousands of variables, ranging from total population to “means of transportation to work by workers’ earnings in the past 12 months (in 2013 inflation-adjusted dollars) for workplace geography.”
  3. Years: Which years do we want the data to reflect?
  4. Survey: Which data source do we want to use? The decennial census? Three- or five-year estimates from the American Community Survey? Something else?

In our case, we’re only interested in aggregate numbers at the county level, so we can make do with "county" geography. We’re also just going to use data from the year 2013, which should be good enough for a first pass. That means we want to get our data from the ACS. (If we wanted to use the census, we would have to use 2010 or 2000.) To keep things simple, we’ll use the ACS 5-year estimates. The hard part is knowing which variables to get, especially since tidycensus will require us to request our variables using their official names, multi-digit alphanumeric codes. Fortunately, tidycensus helps us work with variables by loading them into a dataframe for us. We can load them with the load_variables() function, which takes a year and a census dataset, and tells us all of the variables that dataset contains.

# NOTE: "acs5" is the abbreviation for "ACS 5 year estimates"
acs_vars <- load_variables(2013, "acs5")

acs_vars
## # A tibble: 22,711 x 3
##    name       label                           concept                           
##    <chr>      <chr>                           <chr>                             
##  1 B00001_001 Estimate!!Total                 UNWEIGHTED SAMPLE COUNT OF THE PO…
##  2 B00002_001 Estimate!!Total                 UNWEIGHTED SAMPLE HOUSING UNITS   
##  3 B01001_001 Estimate!!Total                 SEX BY AGE                        
##  4 B01001_002 Estimate!!Total!!Male           SEX BY AGE                        
##  5 B01001_003 Estimate!!Total!!Male!!Under 5… SEX BY AGE                        
##  6 B01001_004 Estimate!!Total!!Male!!5 to 9 … SEX BY AGE                        
##  7 B01001_005 Estimate!!Total!!Male!!10 to 1… SEX BY AGE                        
##  8 B01001_006 Estimate!!Total!!Male!!15 to 1… SEX BY AGE                        
##  9 B01001_007 Estimate!!Total!!Male!!18 and … SEX BY AGE                        
## 10 B01001_008 Estimate!!Total!!Male!!20 years SEX BY AGE                        
## # … with 22,701 more rows

As we can see, there’s a lot here. Fortunately, for now, at least, all we’re really interested in is the total population. That variable is on the third line (the first two aren’t exactly what we need), and its name is B01001_001. Now, we can get the population data we need. To do this, we’ll use the get_acs() function, which takes the information we’ve been assembling as arguments.

liberty_pop <- get_acs(
  geography = "county",
  variables = "B01001_001",
  # NOTE: "acs5" is the abbreviation for "ACS 5 year estimates"
  year = 2013,
  state = "Texas",
  county = "Liberty County",
  survey = "acs5"
)
## Getting data from the 2009-2013 5-year ACS
## Using FIPS code '48' for state 'Texas'
## Using FIPS code '291' for 'Liberty County'
liberty_pop
## # A tibble: 1 x 5
##   GEOID NAME                  variable   estimate   moe
##   <chr> <chr>                 <chr>         <dbl> <dbl>
## 1 48291 Liberty County, Texas B01001_001    76013    NA

Now, all we need to do is divide our counts by the total population, and we can tell what the per capita booking rate is. To do this, we’ll overwrite the dataframe tidycensus gave us with just the value from the estimate column, and then use that to calculate the per capita booking rate in our plot. To make the scale a little easier to understand, we’ll look at bookings per year per 1,000 residents.

liberty_pop <- liberty_pop$estimate

ftp_bookings %>%
  group_by(year = year(booking_date)) %>%
  summarize(
    per_cap_ftp_bookings = n_distinct(booking_id) * (1000 / liberty_pop),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = year, y = per_cap_ftp_bookings)) +
  geom_line() +
  expand_limits(y = 0) +
  scale_x_continuous(breaks = breaks_pretty(13)) +
  labs(
    x = "Year",
    y = "Number of Failure to Pay Jailings\nPer 1,000 Residents",
    title = "Liberty County, TX"
  ) +
  theme_bw()

Doing this required us to know quite a lot about tidycensus. Fortunately, all the information about how to use get_acs() is easier to discover than it might at first seem. We can see exactly how to use get_acs() by running ?get_acs in the R console. That will bring up get_acs()’s help page, which explains all of the function’s arguments, what it returns, and so on. If we had never used tidycensus at all before, we could have run ?tidycensus to get a brief overview of the package.

Disparities

Another important thing to look into is not just the per capita rate at which debt imprisonment happens, but whether that rate varies between different groups. Fortunately, we can use tidycensus to get the demographic information we’re interested in.

To keep things simple, we will compare the per capita failure-to-pay jailing rates for white and non-white people in Liberty County, TX. As a first pass, we’ll just use the population of non-Hispanic white people in Liberty County as a whole. We can find this in the 2013 ACS 5-year estimates the same way we did before, with a little more work. We’ll look for the words “NOT HISPANIC” and “WHITE” in the concept column of acs_vars and see what we can find.

acs_vars %>%
  filter(
    str_detect(concept, "NOT HISPANIC"),
    str_detect(concept, "WHITE")
  )
## # A tibble: 623 x 3
##    name       label                         concept                             
##    <chr>      <chr>                         <chr>                               
##  1 B01001H_0… Estimate!!Total               SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  2 B01001H_0… Estimate!!Total!!Male         SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  3 B01001H_0… Estimate!!Total!!Male!!Under… SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  4 B01001H_0… Estimate!!Total!!Male!!5 to … SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  5 B01001H_0… Estimate!!Total!!Male!!10 to… SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  6 B01001H_0… Estimate!!Total!!Male!!15 to… SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  7 B01001H_0… Estimate!!Total!!Male!!18 an… SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  8 B01001H_0… Estimate!!Total!!Male!!20 to… SEX BY AGE (WHITE ALONE, NOT HISPAN…
##  9 B01001H_0… Estimate!!Total!!Male!!25 to… SEX BY AGE (WHITE ALONE, NOT HISPAN…
## 10 B01001H_0… Estimate!!Total!!Male!!30 to… SEX BY AGE (WHITE ALONE, NOT HISPAN…
## # … with 613 more rows

Here we use the str_detect function, which looks for instances of its second argument (i.e., "NOT HISPANIC" or "WHITE") in its first argument (i.e., the concept column of acs_vars). (Actually, str_detect interprets its second argument as a regular expression, or regex, but that’s a more advanced topic we won’t deal with here.) After doing this, the very first row had what we needed: "B01001H_001".
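
Here’s a tiny, self-contained illustration of str_detect on some made-up concept strings:

```r
library(stringr)

# Two hypothetical concept strings
concepts <- c(
  "SEX BY AGE",
  "SEX BY AGE (WHITE ALONE, NOT HISPANIC OR LATINO)"
)

# One TRUE/FALSE per element: does the pattern appear anywhere in it?
str_detect(concepts, "NOT HISPANIC")  # FALSE TRUE
str_detect(concepts, "WHITE")         # FALSE TRUE
```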

liberty_pop <- get_acs(
  geography = "county",
  variables = c("B01001_001", "B01001H_001"),
  # NOTE: "acs5" is the abbreviation for "ACS 5 year estimates"
  year = 2013,
  state = "Texas",
  county = "Liberty County",
  survey = "acs5"
)
## Getting data from the 2009-2013 5-year ACS
## Using FIPS code '48' for state 'Texas'
## Using FIPS code '291' for 'Liberty County'
liberty_pop
## # A tibble: 2 x 5
##   GEOID NAME                  variable    estimate   moe
##   <chr> <chr>                 <chr>          <dbl> <dbl>
## 1 48291 Liberty County, Texas B01001_001     76013    NA
## 2 48291 Liberty County, Texas B01001H_001    52098    34

Since we wanted two variables, we had to use the c() function, which is the most basic way of combining values into a vector in R. To get elements out of a vector, we put a [[]] after it, with the position of the element we want inside. For example:

basic_vector <- c("first", "second", "third")

basic_vector[[1]]
## [1] "first"
basic_vector[[2]]
## [1] "second"
basic_vector[[3]]
## [1] "third"

If we had tried to put 0 or 4 in between the square brackets, R would have thrown an error, since basic_vector only has elements in the first three positions.
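
We can check that safely with tryCatch(), which captures the error instead of halting the session:

```r
basic_vector <- c("first", "second", "third")

# Position 4 is past the end of the vector, so [[ ]] throws an error;
# tryCatch() catches it and returns the error message instead
tryCatch(basic_vector[[4]], error = function(e) conditionMessage(e))
```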

In the same way, we can use vectors to manipulate our data a little bit. We have the total population of Liberty County, and the white non-Hispanic population, but not the non-white (i.e., non-white or Hispanic) population. However, we can compute that using vectors.

new_estimate <- liberty_pop$estimate

new_estimate[[1]] <- new_estimate[[1]] - new_estimate[[2]]

pct_nw <- new_estimate[[1]] / (new_estimate[[1]] + new_estimate[[2]])

pct_nw
## [1] 0.3146172

Now, to see if there are disparities, we can compare the percentage of FTP bookings that are of non-white individuals to the population of the county as a whole. We’ll do this by adding geom_hline()s to our graph: horizontal lines showing what the county’s baseline demographics look like. The one issue is that we need to know not only the year in which the booking occurred, but also whether or not the individual booked is white. To do that, we’ll create a new variable, is_wnh, using the dplyr function mutate(), which allows us to use old columns to construct new columns. A second issue, however, is that we can no longer just count the number of distinct bookings: we have to know which bookings are of white individuals, and which are of non-white individuals. To help with that, we’ll use dplyr’s distinct() function. Since is_wnh will be the same for every row belonging to a given booking_id, if we count the number of unique combinations of booking_id and is_wnh values, then we don’t have to worry about double-counting. That is exactly what distinct() does: it looks for all of the unique combinations of the columns we give it, and keeps just those columns. We also need the booking date, so we’ll keep that as well.

ftp_bookings %>%
  mutate(is_wnh = (race == "white" & ethnicity != "hispanic")) %>%
  # NOTE: We also need the year of the booking to group by!
  distinct(year = year(booking_date), is_wnh, booking_id) %>%
  group_by(year) %>%
  # The `n()` function just counts the number of rows in each group, i.e., the
  # total number of bookings.
  summarize(ftp_pct_nw = sum(!is_wnh, na.rm = TRUE) / n(), .groups = "drop") %>%
  ggplot(aes(x = year, y = ftp_pct_nw)) +
  geom_line() +
  geom_hline(yintercept = pct_nw, color = "red") +
  expand_limits(y = 0) +
  scale_x_continuous(breaks = breaks_pretty(13)) +
  labs(
    x = "Year",
    y = "Proportion of FTP Bookings of Non-White Individuals",
    title = "Liberty County, TX"
  ) +
  theme_bw()

Clearly the percentage of people booked into the Liberty County jail who are not white is much higher than their share of the county’s population as a whole. However, there’s still an important issue, which is sometimes called the “denominator problem.” Debt imprisonment is an issue that primarily affects low-income people. If non-white people make up a disproportionate number of the low-income people in Liberty County, what we are seeing may not be a disparity in FTP jailings, per se, but rather a disparity in income. We can try to get a handle on this by benchmarking against a different population. Rather than looking at the population of the county as a whole, we can look at the demographics of the population of the county that lives below the poverty line as a better proxy. The variables we need are:

  • B17001_002: The total number of individuals with income at or below the poverty line in the last twelve months.
  • B17001H_002: The total number of non-Hispanic white individuals with income at or below the poverty line in the last twelve months.

liberty_pop <- get_acs(
  geography = "county",
  variables = c("B17001_002", "B17001H_002"),
  # NOTE: "acs5" is the abbreviation for "ACS 5 year estimates"
  year = 2013,
  state = "Texas",
  county = "Liberty County",
  survey = "acs5"
)
## Getting data from the 2009-2013 5-year ACS
## Using FIPS code '48' for state 'Texas'
## Using FIPS code '291' for 'Liberty County'
# Remember, we can use `with` to avoid writing `$`
pct_nw <- with(liberty_pop, (estimate[[1]] - estimate[[2]]) / estimate[[1]])

ftp_bookings %>%
  mutate(is_wnh = (race == "white" & ethnicity != "hispanic")) %>%
  # NOTE: We also need the year of the booking to group by!
  distinct(year = year(booking_date), is_wnh, booking_id) %>%
  group_by(year) %>%
  # The `n()` function just counts the number of rows in each group, i.e., the
  # total number of bookings.
  summarize(ftp_pct_nw = sum(!is_wnh, na.rm = TRUE) / n(), .groups = "drop") %>%
  ggplot(aes(x = year, y = ftp_pct_nw)) +
  geom_line() +
  geom_hline(yintercept = pct_nw, color = "red") +
  expand_limits(y = 0) +
  scale_x_continuous(breaks = breaks_pretty(13)) +
  labs(
    x = "Year",
    y = "Proportion of FTP Bookings of Non-White Individuals",
    title = "Liberty County, TX"
  ) +
  theme_bw()

So some, but by no means all, of the disparity we observed is explained by demographic differences in Liberty County between the county as a whole and those living below the poverty line.