Week 5 - Demographics

Shriya Yarlagadda

2024/10/02

This week, we explore the impact of demographics on voter turnout, asking “What are the connections, if any, between demographics and voter turnout? Can demographics and turnout be used to predict election outcomes? If so, how and how well?” Thank you, as always, to Matthew Dardet for his starter code and coding assistance. I also thank Jacqui Schlesinger, my partner for this week’s in-class presentation, for her valuable thoughts and advice on interesting visualizations.

To answer these questions, we partially replicated the work of Kim and Zilinsky (2024) (3), who sought to determine whether certain demographic factors are actually predictive of voter behavior. While they also considered “tree-based machine learning models” (pg. 67), some code for which is included in a comment in my code for this blog, I focused on building logistic regression models here. In particular, I trained models to predict both presidential vote choice and the act of turning out to vote itself, based on available demographic data from the 2016 American National Election Studies (ANES).
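To make the idea of a logistic regression concrete, here is a minimal Python sketch. It is not the code used for this blog: the data are made up, and the model is fit with plain gradient ascent on the log-likelihood, a simple stand-in for the fitting routine a statistics package would use. The function names `fit_logistic` and `predict_proba` are hypothetical.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, xi):
    """Predicted probability for one row; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit intercept and weights by batch gradient ascent (illustrative only)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)  # residual on the probability scale
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Made-up data: one numeric predictor and a 0/1 outcome (not ANES data).
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 1, 0, 1, 1]
w = fit_logistic(X, y)
print(predict_proba(w, [1]), predict_proba(w, [6]))
```

In this toy example the fitted slope is positive, so higher values of the predictor correspond to a higher predicted probability of the outcome.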

I separated this data into two datasets, one for each model. The first, used for my presidential vote choice model, dropped all respondents whose vote was not available (which I took to indicate that they did not vote). The second, used for my turnout model, kept every respondent from the ANES data that we used and added a separate turnout variable: respondents with no recorded candidate choice were marked as not having turned out, and those with any recorded choice were marked as having turned out.
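The split described above can be sketched as follows. This is an illustrative Python example with made-up rows, not the code or data behind this blog; the field names `vote`, `educ`, and `turned_out` are assumptions.

```python
# Hypothetical survey rows: `vote` is None when no candidate choice was recorded.
respondents = [
    {"id": 1, "vote": "Democrat", "educ": 4},
    {"id": 2, "vote": None, "educ": 2},
    {"id": 3, "vote": "Republican", "educ": 3},
    {"id": 4, "vote": None, "educ": 1},
]

# Vote-choice dataset: keep only respondents with a recorded vote.
vote_choice_data = [r for r in respondents if r["vote"] is not None]

# Turnout dataset: keep everyone and add a 0/1 turnout indicator.
turnout_data = [
    {**r, "turned_out": int(r["vote"] is not None)} for r in respondents
]

print(len(vote_choice_data), [r["turned_out"] for r in turnout_data])
```

Note that the turnout indicator here is derived entirely from whether a vote choice was recorded, which mirrors the simplifying assumption described above.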

Importantly, building these models required creating numerous factor variables. Demographic variables like gender naturally exist as categories, rather than as numeric quantities that can increase in value alongside an outcome like vote choice. I therefore created separate dummy variables to represent each category of several model inputs, namely gender, race, religion, marital status, homeownership status, and work status.
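As a rough illustration of dummy coding, the Python sketch below turns one categorical variable into 0/1 columns, leaving out a reference category. The helper `dummy_code`, the example values, and the choice of "Employed" as the reference level are all assumptions for illustration, not details taken from my actual models.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per category, excluding the reference level."""
    levels = sorted(set(values) - {reference})
    return {lvl: [int(v == lvl) for v in values] for lvl in levels}

# Made-up work-status responses.
work_status = ["Employed", "Retired", "Student", "Employed", "Unemployed"]
dummies = dummy_code(work_status, reference="Employed")

for level, column in dummies.items():
    print(level, column)
```

Each coefficient on a dummy is then read relative to the omitted reference category, which is why one level of each factor does not get a row of its own in a regression table.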

Logistic regression coefficients (standard errors in parentheses)

Term                      Turnout                Presidential Vote
(Intercept)               13.518 (196.969)       3.872*** (0.976)
genderMale                0.225 (0.569)          1.301 (0.879)
genderFemale              0.165 (0.568)          0.895 (0.878)
genderOther               −0.139 (0.935)         −13.229 (459.700)
raceBlack                 −0.339* (0.144)        −4.262*** (0.391)
raceHispanic              0.110 (0.134)          −1.979*** (0.233)
raceOther/Multiple        0.646*** (0.140)       −0.954*** (0.232)
raceMissing               −0.017 (0.688)         −0.714 (0.982)
educ                      −0.379*** (0.050)      −0.411*** (0.073)
income                    −0.177*** (0.040)      −0.100+ (0.058)
religionProtestant        −12.324 (196.968)
religionCatholic          −12.331 (196.968)      −0.354* (0.139)
religionJewish            −12.885 (196.968)      −1.441*** (0.376)
religionOther             −12.103 (196.968)      −0.728*** (0.144)
attend_church             0.047+ (0.028)         −0.317*** (0.039)
southern                  −0.061 (0.087)         −0.707*** (0.126)
work_statusUnemployed     0.287* (0.136)         −0.007 (0.233)
work_statusRetired        0.040 (0.147)          −0.259 (0.180)
work_statusHomemaker      −0.074 (0.184)         0.882** (0.275)
work_statusStudent        −0.325 (0.270)         −0.253 (0.454)
work_statusMissing        −0.227 (1.303)         −13.455 (882.744)
homeownerNo               0.188+ (0.097)         −0.342* (0.140)
homeownerMissing          0.851 (0.601)          1.381 (1.253)
marriedNever married      −0.098 (0.127)         −0.525** (0.190)
marriedDivorced           −0.087 (0.131)         0.199 (0.173)
marriedSeparated          0.518+ (0.273)         0.307 (0.482)
marriedWidowed            0.013 (0.178)          −0.250 (0.235)
marriedPartners           0.030 (0.142)          −0.269 (0.225)
marriedMissing            −0.555 (1.318)         1.596 (1.667)
age_subset30-44           −0.085 (0.134)         −0.279 (0.232)
age_subset45-59           −0.529*** (0.142)      −0.029 (0.230)
age_subset60+             −0.990*** (0.173)      −0.115 (0.256)

Num.Obs.                  3116                   1924
AIC                       3868.7                 2098.4
BIC                       4062.1                 2270.8
Log.Lik.                  −1902.355              −1018.196
RMSE                      0.46                   0.42

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Model                     In-Sample Accuracy     Out-of-Sample Accuracy
Turnout                   0.6711                 0.6615
Presidential Vote         0.7370                 0.7354
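
For readers unfamiliar with these accuracy measures, the Python sketch below shows one common way to compute them: turn each predicted probability into a 0/1 prediction and count the share that match the observed outcome. The 0.5 cutoff, the function name `accuracy`, and all of the numbers are assumptions for illustration, not values from my models.

```python
def accuracy(predicted_probs, actual, threshold=0.5):
    """Share of observations whose thresholded prediction matches the outcome."""
    predicted = [int(p >= threshold) for p in predicted_probs]
    correct = sum(int(p == a) for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up predictions on the data used to fit the model (in-sample)...
train_probs, train_actual = [0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]
# ...and on held-out data the model never saw (out-of-sample).
test_probs, test_actual = [0.6, 0.3], [1, 1]

in_sample = accuracy(train_probs, train_actual)
out_of_sample = accuracy(test_probs, test_actual)
print(in_sample, out_of_sample)
```

Comparing the two numbers is a quick check for overfitting: a model that does much better in-sample than out-of-sample has likely learned noise in the training data.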

It appears that logistic regressions on these selected variables correctly predict a voter’s decision to turn out only about 66-67% of the time, and their vote for a particular party only about 74% of the time. Although these models differ slightly from Kim and Zilinsky’s and are slightly more accurate, they nonetheless provide relatively low certainty of an accurate prediction. However, both regressions have some statistically significant coefficients. Interestingly, education level, identification as Black, and identification as someone of multiple races (or of a race not otherwise identified by the ANES) were the only variables that were statistically significant predictors of both turnout and vote choice (assuming an alpha level of 0.05). Turnout was also significantly predicted by income, unemployment status, and being aged 45 or older, while vote choice was predicted by identification as Hispanic, Catholicism, Judaism, other religion, church attendance frequency, location in the US South, status as a homemaker, homeownership, and marital status. Most interesting to me, though, is that gender does not appear to significantly predict either outcome.
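To show how a coefficient and its standard error translate into a significance level, here is a short Python sketch. It assumes the stars in the table come from the usual two-sided Wald z-test (coefficient divided by standard error, compared against a standard normal distribution), which is my assumption about the table output rather than something verified here. The helper name `wald_test` is hypothetical; the numbers plugged in are taken from the turnout column of the table.

```python
import math

def wald_test(coef, se):
    """Return the z statistic and two-sided p-value under a normal approximation."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# raceBlack in the turnout model: coefficient -0.339, standard error 0.144.
z_black, p_black = wald_test(-0.339, 0.144)

# genderMale in the turnout model: coefficient 0.225, standard error 0.569.
z_male, p_male = wald_test(0.225, 0.569)

# Exponentiating a logit coefficient gives an odds ratio (educ, turnout model).
educ_odds_ratio = math.exp(-0.379)

print(round(p_black, 3), round(p_male, 3), round(educ_odds_ratio, 2))
```

Under this assumption, the raceBlack p-value falls between 0.01 and 0.05, matching its single star, while the genderMale p-value is far above 0.1. The odds ratio of roughly 0.68 means each one-unit increase in educ multiplies the odds of the modeled outcome by about 0.68, holding the other inputs fixed.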

Given these results, I wanted to visualize how several of these characteristics have played out in key battleground states, using voter roll data generously provided by Statara. These data could, in theory, show how my findings from survey data translate into the real world. I produced graphs for Arizona, Michigan, Georgia, and Nevada, while my partner for this week’s in-class presentation, Jacqui, focused on making graphs for Wisconsin, Pennsylvania, and North Carolina.

While I could not visualize how these demographics related to vote choice, as this information is not captured in voter rolls, I visualized variations in the proportion of people in each group who turned out to vote in the 2020 presidential election. This is not a perfect measure of turnout, especially for the 2024 election. For one, the 2020 election was affected by the COVID-19 pandemic. In addition, voter roll data cannot give the most accurate measure of national turnout, because those who take the time to register (a necessary step to vote at all) are logically more likely to turn out at any given election than those who do not (6). However, keeping in mind that this data only represents those who have registered to vote, these results suggest some interesting patterns.
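The turnout proportions behind these graphs amount to a grouped average. The Python sketch below illustrates the calculation with made-up records; the field names `state`, `gender`, and `voted_2020` and the helper `turnout_rate_by_group` are assumptions, and none of the rows come from the Statara data.

```python
from collections import defaultdict

def turnout_rate_by_group(records, group_key):
    """Share of registered voters in each group who voted in 2020."""
    counts = defaultdict(lambda: [0, 0])  # group -> [voted, registered]
    for r in records:
        counts[r[group_key]][0] += r["voted_2020"]
        counts[r[group_key]][1] += 1
    return {g: voted / total for g, (voted, total) in counts.items()}

# Made-up voter roll rows for a single state.
voter_roll = [
    {"state": "AZ", "gender": "F", "voted_2020": 1},
    {"state": "AZ", "gender": "F", "voted_2020": 1},
    {"state": "AZ", "gender": "M", "voted_2020": 1},
    {"state": "AZ", "gender": "M", "voted_2020": 0},
]

rates = turnout_rate_by_group(voter_roll, "gender")
print(rates)
```

Because the denominator is registered voters rather than all eligible adults, rates computed this way will tend to sit above population-wide turnout estimates.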

These results show that women and older people consistently turn out at higher rates than men and younger people in each of the selected states. This is interesting given that the logistic regression above found no statistically significant gender difference in turnout in a national sample, even though it did identify age as a significant predictor. Similarly, voters in a racial category other than the ones identified in voter files appear to have higher turnout rates than any other category in all of these states. Hispanic voters have the lowest turnout rates in each of these states except Nevada, voters with a vocational education tend to turn out more than those with only some college education, and there does not appear to be a strong difference in turnout between rural and suburban voters in any of these states.

References:

“Answer to ‘How Do I Undo the Most Recent Local Commits in Git?’” 2009. Stack Overflow. https://stackoverflow.com/a/927386.

(1) Ahorn. 2020. “Answer to ‘Export R Data to Csv.’” Stack Overflow. https://stackoverflow.com/a/62006344.

(2) dickoa. 2012. “Answer to ‘Read All Files in a Folder and Apply a Function to Each Data Frame.’” Stack Overflow. https://stackoverflow.com/a/9565134; Guidance from Matt Dardet in office hours

(3) Kim, Seo-young Silvia, and Jan Zilinsky. 2024. “Division Does Not Imply Predictability: Demographics Continue to Reveal Little About Voting and Partisanship.” Political Behavior 46 (1): 67–87. https://doi.org/10.1007/s11109-022-09816-z.

(4) GLM help page

(5) Tibble Row help page - as used in previous blogs

(6) Hartig, Hannah, Andrew Daniller, Scott Keeter, and Ted Van Green. 2023. “1. Voter Turnout, 2018-2022.” Pew Research Center (blog). July 12, 2023. https://www.pewresearch.org/politics/2023/07/12/voter-turnout-2018-2022/.

(7) aosmith. 2014. “Answer to ‘Why Are My Dplyr Group_by & Summarize Not Working Properly? (Name-Collision with Plyr).’” Stack Overflow. https://stackoverflow.com/a/26933112; GGPlot help page; stevec. 2020. “Answer to ‘Adding Data Labels above Geom_col() Chart with Ggplot2.’” Stack Overflow. https://stackoverflow.com/a/61574728; GeomText Help Page; Geom_Col help page; duhaime. 2018. “Answer to ‘Remove Legend Ggplot 2.2.’” Stack Overflow. https://stackoverflow.com/a/51923574.

(8) Scale Manual Help Page; Geom Label Help Page; stefan. 2021. “Answer to ‘How to Center Labels over Side-by-Side Bar Chart in Ggplot2.’” Stack Overflow. https://stackoverflow.com/a/70197229; duhaime. 2018. “Answer to ‘Remove Legend Ggplot 2.2.’” Stack Overflow. https://stackoverflow.com/a/51923574.

Data Sources

American National Election Studies (ANES)

Subsetted version of state voter files (Statara)