This is a study of gerrymandering in Alabama. We will test different metrics of spatial compactness and diversity to assess their efficacy in predicting the representativeness of different voting districts. We will then extend the work of prior studies by calculating a representativeness metric that combines social and geographic measures of ‘fairness’.
Key words
: Political Representation, Gerrymandering, Alabama, Convex Hull, Elections

Subject
: Social and Behavioral Sciences: Geography: Geographic Information Sciences

Date created
: 2025-02-17

Date modified
: 2020-02-17

Spatial Coverage
: Alabama (State)

Spatial Resolution
: Census block groups

Spatial Reference System
: EPSG:4269 NAD 1983 Geographic Coordinate System

Temporal Coverage
: 2020-2023

Temporal Resolution
: Decennial Census

This is an original, exploratory study comparing the findings of metrics commonly used to quantify the degree of congressional district gerrymandering. We will also assess the usefulness of a new gerrymandering metric based on the convex hull of a congressional district, comparing the representativeness inside the convex hull with that of the congressional district writ large.
Enumerate specific hypotheses to be tested or research questions to be investigated here, and specify the type of method, statistical test or model to be used on the hypothesis or question.
# record all the packages you are using here
# this includes any calls to library(), require(),
# and double colons such as here::i_am()
packages <- c("tidyverse", "here", "sf", "tmap", "tidycensus", "lwgeom")
# force all conflicts to become errors
# if you load dplyr and use filter(), R has to guess whether you mean dplyr::filter() or stats::filter()
# the conflicted package forces you to be explicit about this
# disable at your own peril
# https://conflicted.r-lib.org/
require(conflicted)
## Loading required package: conflicted
# load and install required packages
# https://groundhogr.com/
if (!require(groundhog)) {
install.packages("groundhog")
require(groundhog)
}
## Loading required package: groundhog
## groundhog says: No default repository found, setting to 'http://cran.r-project.org/'
## Attached: 'Groundhog' (Version: 3.2.2)
## Tips and troubleshooting: https://groundhogR.com
# this date will be used to determine the versions of R and your packages
# it is best practice to keep R and its packages up to date
groundhog.day <- "2025-02-19"
# this replaces any library() or require() calls
groundhog.library(packages, groundhog.day)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## here() starts at /Users/lucas/Documents/GitHub.nosync/gerrymanderAL
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
## Linking to liblwgeom 3.0.0beta1 r16016, GEOS 3.11.0, PROJ 9.1.0
## Warning in fun(libname, pkgname): GEOS versions differ: lwgeom has 3.11.0 sf
## has 3.13.0
## Warning in fun(libname, pkgname): PROJ versions differ: lwgeom has 9.1.0 sf has
## 9.5.1
##
## Attaching package: 'lwgeom'
## The following object is masked from 'package:sf':
##
## st_perimeter
## Successfully attached 'tidyverse_2.0.0'
## Successfully attached 'here_1.0.1'
## Successfully attached 'sf_1.0-19'
## Successfully attached 'tmap_4.0'
## Successfully attached 'tidycensus_1.7.1'
## Successfully attached 'lwgeom_0.2-14'
# you may need to install a correct version of R
# you may need to respond OK in the console to permit groundhog to install packages
# you may need to restart R and rerun this code to load installed packages
# In RStudio, restart r with Session -> Restart Session
# record the R processing environment
# alternatively, use devtools::session_info() for better results
writeLines(
capture.output(sessionInfo()),
here("procedure", "environment", paste0("r-environment-", Sys.Date(), ".txt"))
)
# save package citations
knitr::write_bib(c(packages, "base"), file = here("software.bib"))
# set up default knitr parameters
# https://yihui.org/knitr/options/
knitr::opts_chunk$set(
echo = FALSE, # Run code, show outputs (don't show code)
fig.retina = 4,
fig.width = 8,
fig.path = paste0(here("results", "figures"), "/")
)
Describe the data sources and variables to be used. Data sources may include plans for observing and recording primary data or descriptions of secondary data. For secondary data sources with numerous variables, the analysis plan authors may focus on documenting only the variables intended for use in the study.
The main data sources for the study are census block groups, Alabama congressional districts, and presidential vote totals from the 2020 election.
Each of the next subsections describes one data source.
Abstract
: Vector polygon geopackage layer of Census block groups and demographic data.

Spatial Coverage
: Alabama (State). OSM link: [https://www.openstreetmap.org/relation/161950]

Spatial Resolution
: Census block groups

Spatial Reference System
: EPSG 4269 NAD 1983 geographic coordinate system

Temporal Coverage
: 2020 census

Temporal Resolution
: Single census survey period

Lineage
: Downloaded from the U.S. Census API “pl” public law summary file using ‘tidycensus’ in R

Distribution
: US Census API

Constraints
: Public Domain data free for use and redistribution.

Acquiring data using tidycensus in R:
blockgroups <- get_decennial(geography = "block group",
sumfile = "pl",
table = "P3",
year = 2020,
state = "Alabama",
output = "wide",
geometry = TRUE,
keep_geo_vars = TRUE)
Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|---|
GEOID | ID Code | Code that uniquely identifies census block groups | Numeric | N/A | … | … | … |
P4_001N | Total population over 18 | Total population over 18 years old in the 2020 census, divided by block group | Numeric | Generally Accurate | … | … | … |
P4_006N | Total black population over 18 | Total black population over 18 years old in the 2020 census, divided by block group | Numeric | The US Census tends to overcount white populations and undercount those of minorities (US Census) | … | … | … |
P5_003N | Institutionalized population | Total institutionalized population in correctional facilities for adults during the 2020 census, 18 years or older divided by block group | Numeric | The US Census tends to overcount white populations and undercount those of minorities (US Census) | … | … | … |
Abstract
: Voting data by precinct

Spatial Coverage
: Alabama (State). OSM link: [https://www.openstreetmap.org/relation/161950]

Spatial Resolution
: Voting Precincts

Spatial Reference System
: EPSG 4269 NAD 1983 Geographic Coordinate System

Temporal Coverage
: One Year

Temporal Resolution
: 2020

Lineage
: Downloaded as a sgpkg. Prior processing information is available in al_vest_20_validation_report.pdf and readme_al_vest_20.txt

Distribution
: Publicly available at the Redistricting Data Hub website with free login.

Constraints
: Permitted for noncommercial and nonpartisan use only, as per original data access agreement. Copyright information found in redistrictingdatahub_legal.txt

Data Quality
: Complete

Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|---|
VTDST20 | District ID | Voting District ID | Numeric | … | … | … | … |
GEOID20 | Location | Unique Geographic ID | Coordinate | … | … | … | … |
G20PRERTRU | Republican Voters | Total votes for Donald Trump in 2020 | Numeric | … | … | … | … |
G20PREDBID | Democratic Voters | Total votes for Joe Biden in 2020 | Numeric | … | … | … | … |
Abstract
: Spatial bounds and characteristics of U.S. Congressional districts in Alabama

Spatial Coverage
: Alabama (State). OSM link: [https://www.openstreetmap.org/relation/161950]

Spatial Resolution
: U.S. Congressional Districts

Spatial Reference System
: EPSG 3857 WGS 1984 Web Mercator Projection

Temporal Coverage
: Districts approved in 2023 for use in the 2024 elections.

Temporal Resolution
: N/A

Lineage
: Loaded into QGIS as ArcGIS feature service layer and saved in geopackage format. Extraneous data fields were removed and the FIX GEOMETRIES tool was used to correct geometry errors.

Distribution
: Available from the Alabama State GIS via ESRI feature service

Constraints
: Public Domain data free for use and redistribution.

Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|---|
DISTRICT | District Number | U.S. Congressional District Number | Numeric | N/A | N/A | N/A | N/A |
POPULATION | Population | Number of people residing in each congressional district (2020 census) | Numeric | Generally accurate on a full-population scale | … | … | … |
WHITE | Number of white residents | Total number of white residents (2020 census) | Numeric | The US Census tends to overcount white populations and undercount those of minorities (US Census) | … | … | … |
BLACK | Number of black residents | Total number of black residents (US Census) | Numeric | The US Census tends to overcount white populations and undercount those of minorities (US Census) | … | … | … |
At the time of this study pre-registration, the authors had very little prior knowledge of the geography of the study region with regard to the potential gerrymandering of congressional districts. The study authors have some prior knowledge of the racial distribution of populations in the state as it pertains to historical (oftentimes involuntary) settlement patterns.
For each secondary source, declare the extent to which authors had already engaged with the data:
### Alabama Census Block Groups

- [ ] data is not available yet
- [x] data is available, but only metadata has been observed
- [ ] metadata and descriptive statistics have been observed
- [ ] metadata and a pilot test subset or sample of the full dataset have been observed
- [ ] the full dataset has been observed. Explain how authors have already manipulated / explored the data.

### 2020 Presidential Election Voting Precincts

- [ ] data is not available yet
- [x] data is available, but only metadata has been observed
- [ ] metadata and descriptive statistics have been observed
- [ ] metadata and a pilot test subset or sample of the full dataset have been observed
- [ ] the full dataset has been observed. Explain how authors have already manipulated / explored the data.

### Districts23 layer of districts.gpkg

- [ ] data is not available yet
- [x] data is available, but only metadata has been observed
- [ ] metadata and descriptive statistics have been observed
- [ ] metadata and a pilot test subset or sample of the full dataset have been observed
- [ ] the full dataset has been observed. Explain how authors have already manipulated / explored the data.
Because primary data is not being incorporated in this study, potential sources of bias are limited. The data used in this study are generally considered reputable (census counts, vote totals), although at larger scales the 2020 census has been seen to systematically undercount minorities. This trend may affect the racial distribution section of this study by giving an inaccurate measure of the relative diversity of different block groups. Because it’s difficult to know how this systematic undercounting might affect areas differently, I will not attempt to make any corrections for it.
Transform the Census coordinate system to match that of the districts and precincts layers
library(tidyverse)
blockgroups<-blockgroups%>%
st_transform(crs = 3857)
The Census makes it tricky to pull the ‘black’ population data because respondents can describe their racial identity with many different combinations of race designations. For example, someone who responds that they are both Hispanic and Black will have a different designation than someone who responds as only Black. For this study, we’re going to count an individual who is both Hispanic and Black as Black, so that designation’s population total will need to be added to the overall Black population total.
To gather this data, I’ll build a list of all the census variables whose labels describe a Black individual. Code courtesy of Joseph Holler’s GitHub, because I had no idea how to do this.
pulled_metadata <- load_variables(2020, "pl")
black_vars <- pulled_metadata |>
  dplyr::filter(str_detect(name, "P3"), #P3 variables are population counts that include race designations
                str_detect(label, "Black")) |> #keeps only the variables whose label column includes 'Black'
  select(-concept) #drops the concept column, keeping name and label
Next, I’ll use this list to aggregate population data from the columns that are included in the ‘black_vars’ list.
blockgroups2<-blockgroups%>%
mutate(BlackPopulation = rowSums(across(all_of(black_vars$name))))
final_population <- blockgroups2 %>%
mutate(
Total_POP = P3_001N,
Black_POP = BlackPopulation,
Black_Percentage = BlackPopulation / P3_001N
) %>%
select(GEOID, Total_POP, Black_POP, Black_Percentage)
This code chunk will output a table named ‘final_population’ with four columns plus the block group geometry. Their names and descriptions are below.

- GEOID: Code that uniquely identifies each census block group
- Total_POP: Total population in each census block group
- Black_POP: Total black population in each census block group
- Black_Percentage: The proportion (0 to 1) of each census block group that at minimum partially identifies as black
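The label-matching and row-summing logic above can be sketched in base R on a toy table. This is only an illustration: the variable names, simplified labels, and counts below are made up and do not reproduce the real “pl” metadata.

```r
# Toy illustration: hypothetical variable names, simplified labels, made-up counts
toy_meta <- data.frame(
  name  = c("P3_001N", "P3_003N", "P3_004N", "P3_011N"),
  label = c("Total", "White alone", "Black or African American alone",
            "White; Black or African American")
)
toy_counts <- data.frame(
  P3_001N = c(100, 200),
  P3_003N = c(60, 50),
  P3_004N = c(30, 120),
  P3_011N = c(10, 30)
)

# Keep the names of every variable whose label mentions "Black"
toy_black_names <- toy_meta$name[grepl("Black", toy_meta$label)]

# Sum those columns row by row, then divide by the total column
toy_counts$Black_POP <- rowSums(toy_counts[, toy_black_names, drop = FALSE])
toy_counts$Black_Percentage <- toy_counts$Black_POP / toy_counts$P3_001N
print(toy_counts)
```

Anyone counted in any matching column contributes to `Black_POP`, which is why the resulting share reflects residents who at minimum partially identify as black.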
I’ll calculate three separate compactness metrics: the ratio of area to perimeter (the Polsby-Popper metric), the ratio of district area to convex hull area, and the ratio of district area to minimum bounding circle area.
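For reference, the three scores can be written as follows, where $A_d$ and $P_d$ are the area and perimeter of district $d$, and $H_d$ and $C_d$ are its convex hull and minimum bounding circle. The symbols are notation introduced here for illustration; the three ratios correspond to the `polsby_popper`, `compact_hull`, and `compact_circ` columns.

```latex
\begin{aligned}
\text{Polsby-Popper}_d &= \frac{4\pi A_d}{P_d^{2}} \\
\text{Convex hull ratio}_d &= \frac{A_d}{\operatorname{area}(H_d)} \\
\text{Bounding circle ratio}_d &= \frac{A_d}{\operatorname{area}(C_d)}
\end{aligned}
```

Each score lies between 0 and 1, with values near 1 indicating a more compact shape. The bounding circle ratio is often called the Reock score.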
#sf_use_s2 is set to FALSE to calculate ellipsoidal area instead of spherical
sf_use_s2(FALSE)
#read in districts
districts <- st_read(here("data", "raw", "public", "alabama", "districts.gpkg"), layer = "districts23")
#calculate area/perimeter metric (polsby-popper)
districts1 <- mutate(districts,
districts_area = st_area(geom),
districts_perim = st_length(st_cast(st_cast(geom, "MULTIPOLYGON"), "MULTILINESTRING")),
polsby_popper = round(
as.numeric(
(4 * pi * districts_area) / districts_perim^2),
2))%>%
select(DISTRICT, districts_area, districts_perim,polsby_popper)
#create a separate layer, districts_convex, that stores a convex hull for each district
districts_convex <- districts1%>%
st_convex_hull()
districts_convex <- districts_convex %>%
mutate(hullarea = st_area(geom),
compact_hull = round(as.numeric(districts_area / hullarea), 2))
#create a third layer, bound_circle, that will store the minimum bounding circle metric
bound_circle<- districts1%>%
st_minimum_bounding_circle()
bound_circle <- bound_circle %>%
  mutate(mbcarea = st_area(geom), #the piped data frame is already the first argument
         compact_circ = round(as.numeric(districts_area / mbcarea), 2))
#combine all into a table
compactness_summary<- tibble(districts$DISTRICT, districts1$polsby_popper, districts_convex$compact_hull, bound_circle$compact_circ, districts$geom)
#change the column names
colnames(compactness_summary)<- c("District", "Polsby_Popper", "Convex_Hull", "Minimum_Bounding_Circle", "geom")
This gives us the compactness scores from each of our three metrics in the same data frame, ready to be mapped. We also retain the three separate geometries used to calculate area for the compactness metrics, which may be useful down the road.
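As a sanity check on how these scores behave, here is a minimal base R sketch that computes the Polsby-Popper score and the convex hull ratio for two toy polygons. The helper functions and shapes are hypothetical and exist only for this illustration; the study itself relies on `sf` for these calculations.

```r
# Toy compactness check in base R (no sf needed).
# shoelace_area(), ring_perimeter() and the shapes are illustrative helpers.

# Area of a simple polygon from its vertices (shoelace formula)
shoelace_area <- function(xy) {
  x <- xy[, 1]
  y <- xy[, 2]
  j <- c(2:nrow(xy), 1) # index of the next vertex, wrapping around
  abs(sum(x * y[j] - x[j] * y)) / 2
}

# Perimeter of a closed ring given its vertices
ring_perimeter <- function(xy) {
  j <- c(2:nrow(xy), 1)
  sum(sqrt((xy[j, 1] - xy[, 1])^2 + (xy[j, 2] - xy[, 2])^2))
}

# Polsby-Popper: 4 * pi * area / perimeter^2
polsby_popper_toy <- function(xy) {
  4 * pi * shoelace_area(xy) / ring_perimeter(xy)^2
}

# Ratio of polygon area to the area of its convex hull
hull_ratio_toy <- function(xy) {
  hull <- xy[grDevices::chull(xy), , drop = FALSE]
  shoelace_area(xy) / shoelace_area(hull)
}

square  <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1))
l_shape <- rbind(c(0, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 2), c(0, 2))

pp_square <- polsby_popper_toy(square)   # pi / 4
pp_l      <- polsby_popper_toy(l_shape)  # 4 * pi * 3 / 8^2
hull_l    <- hull_ratio_toy(l_shape)     # 3 / 3.5
print(c(pp_square = pp_square, pp_l = pp_l, hull_l = hull_l))
```

The L-shaped polygon scores lower than the square on Polsby-Popper because its notch adds perimeter without adding area, and its hull ratio falls below 1 because the hull fills in part of the notch.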
### Race
To gather the racial breakdown of the population for each district, we’ll use the block group data that we cleaned earlier. The block group data needs to be split by district, convex hull, and minimum bounding circle so that we can get an accurate measure of population. Where a boundary splits a block group, area weighted reaggregation apportions its population to either side of the split in proportion to area. This introduces more error into our calculations, as people aren’t evenly distributed across space, but it’s an acceptable amount of error given the fine spatial resolution of the block groups we’re working with.
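A minimal base R sketch of area weighted reaggregation, using made-up fragments rather than real block groups, may help make the apportioning step concrete. The `toy_` objects and all numbers are hypothetical.

```r
# Toy area-weighted reaggregation; all numbers are made up for illustration
toy_fragments <- data.frame(
  GEOID     = c("A", "A", "B"),   # block group each fragment came from
  DISTRICT  = c(1, 2, 1),         # district each fragment falls in
  area      = c(10, 10, 4),       # full block group area
  new_area  = c(7.5, 2.5, 4),     # fragment area after intersection
  Total_POP = c(1000, 1000, 500),
  Black_POP = c(400, 400, 100)
)

# Weight = share of the original block group area that lies in the fragment
toy_fragments$aw       <- toy_fragments$new_area / toy_fragments$area
toy_fragments$aw_pop   <- toy_fragments$aw * toy_fragments$Total_POP
toy_fragments$aw_black <- toy_fragments$aw * toy_fragments$Black_POP

# Sum the weighted counts by district, then take the ratio of the sums
toy_district_pop <- aggregate(cbind(aw_pop, aw_black) ~ DISTRICT,
                              data = toy_fragments, FUN = sum)
toy_district_pop$black_pct <- toy_district_pop$aw_black / toy_district_pop$aw_pop
print(toy_district_pop)
```

Block group A is split 75/25 between the two districts, so 750 of its 1,000 residents are assigned to district 1 and 250 to district 2, regardless of where within the block group people actually live.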
#calculate the original area of each block group, used later for area weighting
final_population$area<-final_population%>%
st_area()
st_crs(final_population)
st_crs(districts1)
#segmenting block groups by district, convex hull boundaries, and boundary circles.
district_fragments <- st_intersection(final_population, districts1)
chull_fragments <- st_intersection(final_population, districts_convex)
boundcirc_fragments <- st_intersection(final_population, bound_circle)
#calculating area weighted aggregation and re-grouping by district
district_fragments <- district_fragments%>%
mutate(
new_area = st_area(geometry),
aw = as.numeric(new_area / area),
aw_pop = aw * Total_POP,
aw_black = aw * Black_POP,
aw_blackpct = aw_black / aw_pop) #black share of each fragment's population
district_pop <- district_fragments %>%
group_by(DISTRICT)%>%
summarize(
sumpop = sum(aw_pop),
sumblack = sum(aw_black),
black_pct = sumblack/sumpop,
geom = st_union(geometry)
)
#calculating area weighted aggregation and re-grouping by convex hull
chull_fragments <- chull_fragments%>%
mutate(
new_area = st_area(geometry),
aw = as.numeric(new_area / area),
aw_pop = aw * Total_POP,
aw_black = aw * Black_POP,
aw_blackpct = aw_black / aw_pop) #black share of each fragment's population
chull_pop <- chull_fragments %>%
group_by(DISTRICT) %>%
summarize(
sumpop = sum(aw_pop),
sumblack = sum(aw_black),
black_pct = sumblack/sumpop,
geom = st_union(geometry)
)
##calculating area weighted aggregation and re-grouping by minimum bounding circle
boundcirc_fragments <- boundcirc_fragments%>%
mutate(
new_area = st_area(geometry),
aw = as.numeric(new_area / area),
aw_pop = aw * Total_POP,
aw_black = aw * Black_POP,
aw_blackpct = aw_black / aw_pop) #black share of each fragment's population
bound_pop <- boundcirc_fragments %>%
group_by(DISTRICT)%>%
summarize(
sumpop = sum(aw_pop),
sumblack = sum(aw_black),
black_pct = sumblack/sumpop,
geom = st_union(geometry)
)
Next, we conduct the same intersection, area weighting, and reaggregation steps for the precinct data, saving the Democratic share of the vote for each geometry.
precincts <- st_read(here("data", "raw", "public", "alabama", "districts.gpkg"), layer = "precincts20")
# 15 precincts have geometry issues, so repair them.
precincts <- st_make_valid(precincts)%>%
mutate(area= st_area(geom))
#segmenting precincts by district, convex hull boundaries, and boundary circles.
district_fragments <- st_intersection(precincts, districts1)
chull_precincts <- st_intersection(precincts, districts_convex)
boundcirc_precincts <- st_intersection(precincts, bound_circle)
#calculating area weighted aggregation and re-grouping by district
district_fragments <- district_fragments%>%
mutate(
new_area = st_area(geom),
aw = as.numeric(new_area / area),
aw_total = aw * (G20PREDBID+G20PRERTRU+ G20PRELJOR + G20PREOWRI),
aw_total_democrat = aw * G20PREDBID,
aw_percent_dem = aw_total_democrat/aw_total)
district_votes <- district_fragments %>%
group_by(DISTRICT)%>%
summarize(
total_votes = sum(aw_total),
total_dem = sum(aw_total_democrat),
percent_dem = total_dem/total_votes)
#calculating area weighted aggregation and re-grouping by convex hull geometry
chull_precincts <- chull_precincts%>%
mutate(
new_area = st_area(geom),
aw = as.numeric(new_area / area),
aw_total = aw * (G20PREDBID+G20PRERTRU+ G20PRELJOR + G20PREOWRI),
aw_total_democrat = aw * G20PREDBID,
aw_percent_dem = aw_total_democrat/aw_total)
chull_votes <- chull_precincts %>%
group_by(DISTRICT)%>%
summarize(
total_votes = sum(aw_total),
total_dem = sum(aw_total_democrat),
percent_dem = total_dem/total_votes)
#calculating area weighted aggregation and re-grouping by minimum bounding circle geometry
boundcirc_precincts <- boundcirc_precincts%>%
mutate(
new_area = st_area(geom),
aw = as.numeric(new_area / area),
aw_total = aw * (G20PREDBID+G20PRERTRU+ G20PRELJOR + G20PREOWRI),
aw_total_democrat = aw * G20PREDBID,
aw_percent_dem = aw_total_democrat/aw_total)
boundcirc_votes <- boundcirc_precincts %>%
group_by(DISTRICT)%>%
summarize(
total_votes = sum(aw_total),
total_dem = sum(aw_total_democrat),
percent_dem = total_dem/total_votes)
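In equation form, the share computed for each district, hull, or circle $g$ is a ratio of area-weighted sums rather than an average of per-precinct shares. Here $w_{ig}$ is the fraction of precinct $i$'s area that falls inside $g$ (the `aw` column), $D_i$ is the Biden vote count, and $T_i$ is the total of the four presidential vote columns. The symbols are notation introduced here for illustration.

```latex
\text{percent\_dem}_g = \frac{\sum_i w_{ig}\, D_i}{\sum_i w_{ig}\, T_i},
\qquad
w_{ig} = \frac{\operatorname{area}(i \cap g)}{\operatorname{area}(i)}
```

Taking the ratio of sums means large precincts count in proportion to their votes, so a handful of tiny precincts cannot dominate the district-level share.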
Having created everything, I’m going to append data to a final table with attached geometry. I’ll conduct the final calculations in this table.
#creating the full table from derived data products
#note: this assumes every input table lists the districts in the same order
gerrymander_data<- tibble(districts1$DISTRICT,
compactness_summary$`Polsby_Popper`,
compactness_summary$`Convex_Hull`,
compactness_summary$`Minimum_Bounding_Circle`,
district_pop$black_pct,
chull_pop$black_pct,
bound_pop$black_pct,
district_votes$percent_dem,
chull_votes$percent_dem,
boundcirc_votes$percent_dem)
To craft an index that combines the geographic compactness measurements with demographic information, I’m going to take the percentage of the population that is black within each district and subtract the percentage that is black within that district’s convex hull and, separately, within its minimum bounding circle. If the resulting number is negative, the district is whiter than the surrounding area; if it’s positive, the district has a higher percentage of black residents than the surrounding area. A district whose population was exactly as black as the surrounding area (or whose geometry meant that its corresponding convex hull or minimum bounding circle was exactly the size of the district itself) would have a score of 0.
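A small worked example of the subtraction, using three hypothetical districts with made-up percentages, shows how the sign of the index should be read.

```r
# Toy Nerbonne metric; districts and percentages are made up for illustration
toy_nerbonne <- data.frame(
  District           = c(1, 2, 3),
  black_pct_district = c(0.55, 0.20, 0.30),
  black_pct_chull    = c(0.40, 0.28, 0.30)
)

# District share minus the share inside the district's convex hull
toy_nerbonne$Nerbonne_convex <-
  toy_nerbonne$black_pct_district - toy_nerbonne$black_pct_chull

# Positive: district is blacker than its hull; negative: whiter; zero: no difference
toy_nerbonne$reading <- ifelse(
  toy_nerbonne$Nerbonne_convex > 1e-9, "blacker than hull",
  ifelse(toy_nerbonne$Nerbonne_convex < -1e-9, "whiter than hull", "no difference")
)
print(toy_nerbonne)
```

The same subtraction applies to the minimum bounding circle by swapping in the circle's percentage.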
#calculate the nerbonne metric
gerrymander_data<- gerrymander_data%>%
mutate(Nerbonne_convex = district_pop$black_pct-chull_pop$black_pct,
Nerbonne_minimum_bounding = district_pop$black_pct - bound_pop$black_pct)
#append geometry to districts
gerrymander_data$Geometry<- districts$geom
#rename columns for readability
colnames(gerrymander_data)<- c("District", "Polsby-Popper", "Convex Hull", "Minimum Bounding Circle", "Percent Black Within District", "Percent Black Within CHull", "Percent Black Within MinBounds", "Percent Dem Votes in District", "Percent Dem Votes in CHull", "Percent Dem Votes in MinBounds", "Nerbonne CHull", "Nerbonne MinBounds", "Geometry")
Given the relatively small number of districts (7), much of the analysis will be purely observational. This will include comparing metrics across districts in a table, as well as mapping those metrics so they can be put in the context of the actual district shape.
Results are to be presented in a full table – gerrymander_data – that combines compactness metrics, voting information, percentage black residents, and the Nerbonne metric for each district. This data will still be associated with a geometry, so it’s able to be mapped.
District geometry is to be compared to its compactness scores, which can then be interpreted through the lens of percentage black population. The effectiveness of the gerrymander should be visible in the relative difference in Democratic votes between districts with low compactness (high Democratic percentage) and districts with high compactness (less Democratic). The novel Nerbonne metric should reflect the relative difference in black voter percentage within and outside a district, which should be high in districts that have been specifically designed to concentrate traditionally Democratic voters in single districts.
This is the only preregistration for this research project.
This project is part of class-based undergraduate research, and as such does not have any funding sources.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)
Discrete Geometry for Electoral Geography; Duchin and Tenner 2024.
Gerrymandering and Compactness: Implementation Flexibility and Abuse; Barnes and Solomon 2020.
Practical Application of District Compactness; Horn, Hampton and Vandenberg 1993.