!pip install pyreadr
!pip install rpy2
import pandas as pd
import pyreadr
from rpy2.robjects import r, pandas2ri
The goal of this project is to analyze changing voting patterns in the US based on economic, educational, and demographic factors. We will focus on county-level election results from the past 40 years. Our demographic factors are race, origin, gender, and age; our educational factor is educational attainment; and our economic factor is the poverty rate. Based on our analysis, we will use a machine learning model to predict the upcoming election results from current state-level data.
Some initial hypotheses:
We meet every Thursday at 1:00 p.m. to discuss the work we did that week and what needs to be done the next week, and to work through any coding we can do together. We also meet over Zoom every other week to collaborate on code and analysis. A shared Google Drive folder keeps everything organized and accessible for both of us, and a private repository branch lets us share code easily.
For the presidential election results we will be using a data set from Harvard Dataverse containing the presidential election results per county from 1868-2020. The dataset can be found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DGUMFI
#reset to /content root folder and run these in order
%cd /content/
!git clone https://github.com/cathbrooks/cathbrooks.github.io.git
%cd /content/cathbrooks.github.io/data
!ls
# Activate automatic conversion
pandas2ri.activate()
#load in Rdata file
r['load']('dataverse_shareable_presidential_county_returns_1868_2020.Rdata')
#get dataframe from file
pres_elections_release_r = r['pres_elections_release']
pres_elections_release_df = pd.DataFrame(pres_elections_release_r)
pres_elections_release_df
# Select the necessary columns
selected_cols = ['election_year','fips','county_name','state','sfips','democratic_raw_votes','republican_raw_votes','raw_county_vote_totals']
pres_elections_release_df = pres_elections_release_df[selected_cols].copy()
# Create new columns for the Democratic and Republican share of the county vote
pres_elections_release_df['percent_democrat'] = pres_elections_release_df['democratic_raw_votes'] / pres_elections_release_df['raw_county_vote_totals']
pres_elections_release_df['percent_republican'] = pres_elections_release_df['republican_raw_votes'] / pres_elections_release_df['raw_county_vote_totals']
# Set the election_year column to an int64 data type
pres_elections_release_df['election_year'] = pres_elections_release_df['election_year'].astype('int64')
#filter for election years 1988, 2000, 2012, 2020
pres_election_filtered_df = pres_elections_release_df[pres_elections_release_df['election_year'].isin([1988,2000,2012,2020])].copy()
pres_election_filtered_df.rename(columns={'fips': 'FIPS_code', 'election_year': 'Year'}, inplace=True)
pres_election_filtered_df.drop(columns=['sfips'], inplace=True)
pres_elections_release_df
A data set with county-level educational attainment, given as the percentage of each county's adult population in each attainment category. These categories will be used as the education factors in our model. The data set can be found here: https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data/
# pivot the df according to tidy data standards
df_education = pd.read_csv("EducationData.csv")
years = [1990, 2000, 2008, 2017]
df_education_new = pd.DataFrame()
for year in years:
    col_names = ["Federal Information Processing Standard (FIPS) Code", "State" , "Area name"]
    col_new = ["Federal Information Processing Standard (FIPS) Code", "State" , "Area name"]
    for col in df_education.columns:
        if str(year) in col:
            col_names.append(col)
            col_new.append(col.split(',')[0])
    df_temp = df_education[col_names].set_axis(col_new, axis=1)
    df_temp['Year'] = year
    df_education_new = pd.concat([df_education_new, df_temp])
#fill fips code
df_education_new = df_education_new.rename(columns={'Federal Information Processing Standard (FIPS) Code': 'FIPS_code'})
df_education_new['FIPS_code'] = df_education_new['FIPS_code'].astype(str).str.zfill(5)
df_education_new.drop(columns=['Less than a high school diploma', 'High school diploma only', 'Some college or associate\'s degree', 'Bachelor\'s degree or higher'], inplace=True)
#match election years (assign the result back; in-place replace on a selected column is unreliable in recent pandas)
df_education_new['Year'] = df_education_new['Year'].replace({1990: 1988, 2008: 2012, 2017: 2020})
df_education_new
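The reshaping loop above can also be expressed with `melt` and `pivot_table`. The following is a minimal sketch on a hypothetical two-county table, assuming every measure column is named "<measure>, <year>" as in the loop; the values and the `wide`, `long_df`, and `tidy` names are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the wide education table (values are made up).
wide = pd.DataFrame({
    "FIPS_code": ["01001", "01003"],
    "Percent of adults with a bachelor's degree or higher, 1990": [14.5, 16.8],
    "Percent of adults with a bachelor's degree or higher, 2000": [18.0, 23.1],
})

# Turn every "<measure>, <year>" column into rows, then split the label on its last comma.
long_df = wide.melt(id_vars="FIPS_code", var_name="label", value_name="value")
long_df[["measure", "Year"]] = long_df["label"].str.rsplit(",", n=1, expand=True)
long_df["Year"] = long_df["Year"].str.strip().astype(int)

# Pivot back so each measure is a column and each row is one county-year.
tidy = (long_df.pivot_table(index=["FIPS_code", "Year"], columns="measure", values="value")
               .reset_index())
tidy.columns.name = None
print(tidy)
```

Each row of `tidy` is one county in one year, which is the shape the later merges on `FIPS_code` and `Year` expect.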
A data set containing county-level demographic data by race, origin, sex, and age group. These categories will be the demographic factors we use in our model. The data can be found here: https://seer.cancer.gov/popdata/download.html#single
df_demographics = pd.read_csv("demographic_data.csv")
df_demographics['County FIPS code'] = df_demographics['County FIPS code'].astype(str).str.zfill(3) #make the county code three digits
# Convert 'state_fips' and 'county_fips' columns to strings so we can combine them into one FIPS_code column for merging purposes
df_demographics['State FIPS code'] = df_demographics['State FIPS code'].astype(str)
df_demographics['County FIPS code'] = df_demographics['County FIPS code'].astype(str)
df_demographics['FIPS_code'] = df_demographics['State FIPS code'] + df_demographics['County FIPS code']
#drop unneeded columns
df_demographics.drop(columns=['Registry (Geographic Definition Only)', 'Unnamed: 0', 'State FIPS code', 'County FIPS code'], inplace=True)
df_demographics['FIPS_code'] = df_demographics['FIPS_code'].str.zfill(5)
# convert race, origin, and sex codes to categorical values
df_demographics['Race'] = df_demographics['Race'].map({1:'White', 2:'Black', 3:"American Indian/Alaska Native", 4:"Asian or Pacific Islander"})
df_demographics["Origin"] = df_demographics["Origin"].map({0:'Non-Hispanic', 1:'Hispanic'})
df_demographics['Sex'] = df_demographics['Sex'].map({1:'Male',2:'Female'})
# convert to the matching election years (assign the result back instead of replacing in place)
df_demographics['Year'] = df_demographics['Year'].replace({1990:1988, 2000:2000, 2010:2012, 2020:2020})
df_demographics
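A five-digit county FIPS code is the two-digit state code followed by the three-digit county code. As a small sketch of the padding step, using two hypothetical rows (the `codes` frame is illustrative, not the project data):

```python
import pandas as pd

# Hypothetical rows: Autauga County, AL (state 1, county 1) and Los Angeles County, CA (state 6, county 37).
codes = pd.DataFrame({"State FIPS code": [1, 6], "County FIPS code": [1, 37]})

codes["FIPS_code"] = (
    codes["State FIPS code"].astype(str).str.zfill(2)      # 2-digit state part
    + codes["County FIPS code"].astype(str).str.zfill(3)   # 3-digit county part
)
print(codes["FIPS_code"].tolist())  # ['01001', '06037']
```

Keeping the code as a zero-padded string matters because the leading zero is lost whenever the column is read as an integer.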
Multiple data sets with county level poverty rates. The individual data sets can be downloaded here: https://data.ers.usda.gov/reports.aspx?ID=17826
Tidy 2021 dataframe:
df_poverty_2021 = pd.read_csv("PovertyEstimates 2021.csv")
df_poverty_2021.drop(columns=['Rural-urban_Continuum_Code_2003',
'Urban_Influence_Code_2003',
'Rural-urban_Continuum_Code_2013',
'Urban_Influence_Code_ 2013',
'CI90LBALL_2021',
'CI90UBALL_2021',
'CI90LBALLP_2021',
'CI90UBALLP_2021',
'CI90LB017_2021',
'CI90UB017_2021',
'CI90LB017P_2021',
'CI90UB017P_2021',
'CI90LB517_2021',
'CI90UB517_2021',
'CI90LB517P_2021',
'CI90UB517P_2021',
'CI90LBINC_2021',
'CI90UBINC_2021',
'CI90LB04_2021',
'CI90UB04_2021',
'CI90LB04P_2021',
'CI90UB04P_2021',
'POVALL_2021', #POVALL_2021 = Estimate of people of all ages in poverty 2021
'POV017_2021',
'PCTPOV017_2021',
'POV517_2021',
'PCTPOV517_2021',
'MEDHHINC_2021',
'POV04_2021',
'PCTPOV04_2021'], inplace=True)
#rename poverty_rate column for clarity
df_poverty_2021.rename(columns={'PCTPOVALL_2021': 'Poverty_rate', 'FIPS_Code': 'FIPS_code'}, inplace=True)
# Add a new column 'year'
df_poverty_2021['Year'] = 2021
Tidy 1989 and 1999 dataframe:
df_poverty_8999 = pd.read_csv("povertyrates1989-1999.csv")
#Note: the first columns dropped below do not appear in the original Excel file (origin unknown)
#The last two columns of the Excel file are also missing from the CSV (reason unknown)
df_poverty_8999.drop(columns=['BEALE93',
'FPVDET89',
'FINPOV89',
'FPOV89',
'FPVDET99',
'FINPOV99',
'FPOV99',
'Population base 1989',
'Population in poverty 1989',
'Population base 1999',
'Population in poverty 1999',
'Children <18 1989 base',
'Children <18 1989 in pov',
'Children <18 1989 pov rate',
'Children <18 1999 base',
'Children <18 1999 in pov',
'Children <18 1999 pov rate'], inplace=True)
#make one column for poverty_rate and create a year column
df_1989 = df_poverty_8999[['FIPS', 'STABR', 'AREA', 'Poverty rate 1989']].copy()
df_1989.rename(columns={'Poverty rate 1989': 'Poverty_rate', 'FIPS': 'FIPS_code', 'STABR': 'Stabr', 'AREA': 'Area_name'}, inplace=True)
df_1989['Year'] = 1989
# Create a new DataFrame with the 'Poverty rate 1999' data
df_1999 = df_poverty_8999[['FIPS', 'STABR', 'AREA', 'Poverty rate 1999']].copy()
df_1999.rename(columns={'Poverty rate 1999': 'Poverty_rate', 'FIPS': 'FIPS_code', 'STABR': 'Stabr', 'AREA': 'Area_name'}, inplace=True)
df_1999['Year'] = 1999
# Concatenate the two DataFrames
poverty_8999_df = pd.concat([df_1989, df_1999], ignore_index=True)
poverty_8999_df
Tidy 2009 dataframe:
df_poverty_2009 = pd.read_csv("poverty_2009.csv")
df_poverty_2009.drop(columns=['RUC',
'Lower bound',
'Upper bound',
'Children en Poverty Percent',
'Lower bound.1',
'Upper bound.1'
], inplace=True)
df_poverty_2009.rename(columns={'FIPS': 'FIPS_code', 'All People In Poverty Percent': 'Poverty_rate', 'Name': 'Area_name'}, inplace=True)
df_poverty_2009['Year'] = 2009
#There was one weird row with wrong values index=2948 FIPS_code=Adams County Area_name=6 Poverty_rate=14.5 Year=2009 so I dropped it
# df_poverty_2009[~df_poverty_2009['FIPS_code'].astype(str).str.isdigit()]
df_poverty_2009 = df_poverty_2009.drop(2948)
#convert FIPS_code to ints so we can merge
df_poverty_2009['FIPS_code'] = df_poverty_2009['FIPS_code'].astype(int)
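#Get Stabr column for 2009. poverty_8999_df holds two rows per county (1989 and 1999),
#so keep one row per FIPS_code first; otherwise the merge would repeat every 2009 row twice
stabr_lookup = poverty_8999_df[['FIPS_code', 'Stabr']].drop_duplicates(subset='FIPS_code')
poverty_2009_df = pd.merge(stabr_lookup, df_poverty_2009, how='inner', on='FIPS_code')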
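Because `poverty_8999_df` stacks the 1989 and 1999 rows, each county appears in it twice, and merging against a table like that repeats rows on the other side. The sketch below shows the effect on hypothetical data and one way to guard against it with the `validate` argument of `merge`; the frames and values are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup with two rows per county (one per year), like a stacked 1989/1999 table.
lookup = pd.DataFrame({"FIPS_code": [1001, 1001, 1003, 1003],
                       "Stabr": ["AL", "AL", "AL", "AL"],
                       "Year": [1989, 1999, 1989, 1999]})
rates_2009 = pd.DataFrame({"FIPS_code": [1001, 1003], "Poverty_rate": [11.2, 13.3]})

# A plain inner merge repeats every 2009 row once per matching lookup row.
duplicated = rates_2009.merge(lookup[["FIPS_code", "Stabr"]], on="FIPS_code")

# Deduplicating the lookup first keeps one row per county; validate="one_to_one"
# raises pandas.errors.MergeError if either side still has repeated keys.
unique_lookup = lookup[["FIPS_code", "Stabr"]].drop_duplicates()
clean = rates_2009.merge(unique_lookup, on="FIPS_code", validate="one_to_one")
print(len(duplicated), len(clean))  # 4 2
```

Repeated rows matter downstream: they inflate the row count for that year and can place copies of the same county in both the training and test sets.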
Concatenate all poverty dataframes:
#concatenate all the poverty data
complete_poverty_df = pd.concat([df_poverty_2021, poverty_8999_df, poverty_2009_df], ignore_index=True)
complete_poverty_df['FIPS_code'] = complete_poverty_df['FIPS_code'].astype(str).str.zfill(5)
#reset years to the nearest election year (assign the result back instead of replacing in place)
complete_poverty_df['Year'] = complete_poverty_df['Year'].replace({1989: 1988, 1999: 2000, 2009: 2012, 2021: 2020})
complete_poverty_df
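The year alignment relies on a fixed correspondence between each source year and the election year it stands in for. A small sketch, assuming the mapping used for the poverty data above, shows how `Series.map` exposes any year that has no assigned election (the `years` column here is hypothetical):

```python
import pandas as pd

# Source years used for the poverty data and the election year each one stands in for.
YEAR_TO_ELECTION = {1989: 1988, 1999: 2000, 2009: 2012, 2021: 2020}

years = pd.Series([1989, 1999, 2009, 2021, 1989], name="Year")  # hypothetical column
aligned = years.map(YEAR_TO_ELECTION)

# map() yields NaN for any year missing from the dictionary, so gaps are easy to spot.
assert aligned.notna().all()
print(aligned.tolist())  # [1988, 2000, 2012, 2020, 1988]
```

Unlike `replace`, which leaves unknown values untouched, `map` turns them into missing values, so an unexpected source year cannot slip through unnoticed.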
# Filter counties that voted Democrat
democrat_counties = pres_election_filtered_df[pres_election_filtered_df['percent_democrat'] > pres_election_filtered_df['percent_republican']]
# Filter counties that voted Republican
republican_counties = pres_election_filtered_df[pres_election_filtered_df['percent_republican'] > pres_election_filtered_df['percent_democrat']]
# Make a party column
pres_election_filtered_df['party'] = ''
# Set 'democrat' for counties in democrat_counties
pres_election_filtered_df.loc[democrat_counties.index, 'party'] = 'Democrat'
# Set 'republican' for counties in republican_counties
pres_election_filtered_df.loc[republican_counties.index, 'party'] = 'Republican'
#reset these with the party column
democrat_counties = pres_election_filtered_df[pres_election_filtered_df['percent_democrat'] > pres_election_filtered_df['percent_republican']]
republican_counties = pres_election_filtered_df[pres_election_filtered_df['percent_republican'] > pres_election_filtered_df['percent_democrat']]
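The two-filter labeling above can also be written in a single step. This is a minimal sketch using `numpy.select` on hypothetical vote shares; the `votes` frame is illustrative, and ties keep the empty label just as in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical vote shares; the third county is an exact tie.
votes = pd.DataFrame({"percent_democrat": [0.55, 0.30, 0.45],
                      "percent_republican": [0.40, 0.65, 0.45]})

conditions = [votes["percent_democrat"] > votes["percent_republican"],
              votes["percent_republican"] > votes["percent_democrat"]]
# The first matching condition decides the label; rows matching neither get the default.
votes["party"] = np.select(conditions, ["Democrat", "Republican"], default="")
print(votes)
```

Rows with a missing vote share also fall through to the default, since comparisons with NaN are false.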
import matplotlib.pyplot as plt
# Group by year and party, count the number of counties for each party in each year
county_counts = pres_election_filtered_df.groupby(['Year', 'party']).size().unstack(fill_value=0)
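# Plotting: select the party columns explicitly so that each color matches its party
bar_graph = county_counts[['Democrat', 'Republican']].plot(kind='bar', stacked=True, color=['blue', 'red'])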
# Add labels and title
plt.title('Counties Voting Democrat vs Republican by Year')
plt.xlabel('Year')
plt.ylabel('Number of Counties')
# Show plot
plt.legend(title='Party')
plt.show()
Graph Analysis: This graph shows the number of counties that voted Democratic vs. Republican in the four election years we are studying. In each year, more counties voted Republican than Democratic.
Average Education Level by Political Party:
#merge pres_election and education and poverty
election_and_education_df = pd.merge(df_education_new, pres_election_filtered_df, how='inner', on=['FIPS_code','Year'])
#merge poverty
election_education_poverty_df = pd.merge(election_and_education_df, complete_poverty_df, how='inner', on=['FIPS_code','Year'])
election_education_poverty_df.columns
election_education_poverty_df.drop(columns=['Area name', 'state', 'Area_name', 'Stabr',], inplace=True)
import matplotlib.pyplot as plt
years = election_education_poverty_df['Year'].unique()
education_categories = ['Percent of adults with less than a high school diploma', 'Percent of adults with a high school diploma only',
"Percent of adults completing some college or associate's degree", "Percent of adults with a bachelor's degree or higher"]
fig, axes = plt.subplots(len(years), 1, figsize=(10, 6 * len(years)), sharex=True)
for i, year in enumerate(years):
    # Get current year data
    data_year = election_education_poverty_df[election_education_poverty_df['Year'] == year]
    # Calculate average education levels for Democrat and Republican counties
    avg_education_democrat = data_year[data_year['party'] == 'Democrat'][education_categories].mean()
    avg_education_republican = data_year[data_year['party'] == 'Republican'][education_categories].mean()
    # Plot
    index = range(len(education_categories))
    bar_width = 0.35
    axes[i].bar(index, avg_education_democrat, bar_width, label='Democrat', color='blue', alpha=0.6)
    axes[i].bar([x + bar_width for x in index], avg_education_republican, bar_width, label='Republican', color='red', alpha=0.6)
    axes[i].set_title(f'Average Education Level by Political Party - {year}')
    axes[i].set_ylabel('Average Percent of Adults')
    axes[i].set_xticks([x + bar_width / 2 for x in index])
    axes[i].set_xticklabels(['Less than a High School Diploma', 'High School Diploma only', 'Some College or Associate\'s Degree', "Bachelor\'s Degree or higher"], rotation=45, ha='right')
    axes[i].legend()
axes[-1].set_xlabel('Education Level')
plt.tight_layout()
plt.show()
Graph Analysis: These graphs show the average percent of the adult population in each of the four educational attainment categories, for Democratic and Republican counties, in each of the four election years. In the most recent election year, counties that voted Democratic have a considerably higher average share of adults in the "Bachelor's Degree or higher" category.
Average Poverty Rate by Political Party:
import matplotlib.pyplot as plt
average_poverty_by_year_party = election_education_poverty_df.groupby(['Year', 'party'])['Poverty_rate'].mean().unstack()
plt.figure(figsize=(8, 6))
bar_width = 0.8
# Plot Democrat average poverty rate
plt.bar(average_poverty_by_year_party.index - bar_width/2, average_poverty_by_year_party['Democrat'], color='blue', alpha=0.7, width=bar_width, label='Democrat')
# Plot Republican average poverty rate
plt.bar(average_poverty_by_year_party.index + bar_width/2, average_poverty_by_year_party['Republican'], color='red', alpha=0.7, width=bar_width, label='Republican')
plt.title('Average Poverty Rate by Political Party and Year')
plt.xlabel('Year')
plt.ylabel('Average Poverty Rate')
plt.xticks(average_poverty_by_year_party.index, rotation=45) # Rotate x-axis labels for better readability
plt.legend()
plt.tight_layout()
plt.show()
Graph Analysis: This bar chart shows the average poverty rate of the counties that voted Republican vs. Democratic in each of the four election years. In each year, the average poverty rate is higher in Democratic counties than in Republican counties.
Distribution of Poverty Rates by Political Party:
# Plot the distribution of poverty rates by year based on if the county voted democrat or republican
years = election_education_poverty_df['Year'].unique()
fig, axes = plt.subplots(len(years), 1, figsize=(6, 10), sharex=True)
for i, year in enumerate(years):
    axes[i].hist(election_education_poverty_df[(election_education_poverty_df['Year'] == year) & (election_education_poverty_df['party'] == 'Democrat')]['Poverty_rate'],
                 alpha=0.5, color='blue', label='Democrat')
    axes[i].hist(election_education_poverty_df[(election_education_poverty_df['Year'] == year) & (election_education_poverty_df['party'] == 'Republican')]['Poverty_rate'],
                 alpha=0.5, color='red', label='Republican')
    axes[i].set_title('Distribution of Poverty Rates by Political Party in ' + str(year))
    axes[i].set_ylabel('Number of Counties')
    axes[i].legend()
axes[-1].set_xlabel('County Percent Poverty Rate')
plt.show()
Graph Analysis: These graphs show the distribution of county-level poverty rates for counties that voted Republican or Democratic in the four election years. There is no clear difference in the general shape of the distributions; the Republican histogram is taller simply because more counties voted Republican.
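When one group has far more counties than the other, raw counts make the shapes hard to compare. One option is to normalize each histogram so that it integrates to 1. The sketch below uses `numpy.histogram` with `density=True` on randomly generated, hypothetical poverty rates; the group sizes and the distribution are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical poverty rates: many Republican-won counties, fewer Democratic-won ones.
rep_rates = rng.normal(15, 5, size=2400).clip(0, 60)
dem_rates = rng.normal(15, 5, size=700).clip(0, 60)

bins = np.linspace(0, 60, 31)  # shared 2-point-wide bins so the two groups are comparable
rep_density, _ = np.histogram(rep_rates, bins=bins, density=True)
dem_density, _ = np.histogram(dem_rates, bins=bins, density=True)

# With density=True each histogram integrates to 1 regardless of group size.
bin_width = np.diff(bins)
print((rep_density * bin_width).sum(), (dem_density * bin_width).sum())
```

The same normalization is available in matplotlib by passing `density=True` and a shared `bins` array to `hist`.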
# Create a Scatter Plot of the Black Percentage Population vs Democratic Vote for every year
years = df_demographics['Year'].unique()
fig, axes = plt.subplots(len(years), 1, figsize=(6, 15), sharex=True)
for i, year in enumerate(years):
    # Calculate the black population and the total population in each county
    black_pop = df_demographics[(df_demographics['Year'] == year) & (df_demographics['Race'] == 'Black')].groupby(['FIPS_code'])['Population'].sum()
    total_pop = df_demographics[(df_demographics['Year'] == year)].groupby(['FIPS_code'])['Population'].sum()
    # Merge the population series into a DataFrame and calculate the percent black population
    pop_demographics_df = pd.merge(total_pop.to_frame('Total Population'),
                                   black_pop.to_frame('Black Population'),
                                   how='left', left_index=True, right_index=True)
    pop_demographics_df['Black Population'] = pop_demographics_df['Black Population'].fillna(0)
    pop_demographics_df['Percent Black Population'] = pop_demographics_df['Black Population'] / pop_demographics_df['Total Population']
    # Merge with the election results
    election_categories = ['percent_democrat', "percent_republican"]
    election_pop_demographics_df = pd.merge(pop_demographics_df,
                                            pres_elections_release_df[pres_elections_release_df['election_year'] == year].set_index('fips')[election_categories],
                                            how='left', left_index=True, right_index=True)
    # Create Scatter Plot
    axes[i].scatter(election_pop_demographics_df['Percent Black Population'], election_pop_demographics_df['percent_democrat'], s=.75)
    axes[i].set_ylabel('Democratic Percentage Vote', fontsize=10)
    axes[i].set_title('Black Percentage Population vs Democratic Vote in ' + str(year))
axes[-1].set_xlabel('Percent Black Population')
plt.show()
Graph Analysis: These scatter plots show the percent Black population plotted against the percent Democratic vote in the four election years. In every year, as the Black share of a county's population increases, the Democratic vote share increases, suggesting a strong positive relationship between these variables.
# Create a Scatter Plot of the White Percentage Population vs Democratic Vote for every year
years = df_demographics['Year'].unique()
fig, axes = plt.subplots(len(years), 1, figsize=(6, 15), sharex=True)
for i, year in enumerate(years):
    # Calculate the white population and the total population in each county
    white_pop = df_demographics[(df_demographics['Year'] == year) & (df_demographics['Race'] == 'White')].groupby(['FIPS_code'])['Population'].sum()
    total_pop = df_demographics[(df_demographics['Year'] == year)].groupby(['FIPS_code'])['Population'].sum()
    # Merge the population series into a DataFrame and calculate the percent white population
    pop_demographics_df = pd.merge(total_pop.to_frame('Total Population'),
                                   white_pop.to_frame('White Population'),
                                   how='left', left_index=True, right_index=True)
    pop_demographics_df['White Population'] = pop_demographics_df['White Population'].fillna(0)
    pop_demographics_df['Percent White Population'] = pop_demographics_df['White Population'] / pop_demographics_df['Total Population']
    # Merge with the election results
    election_categories = ['percent_democrat', "percent_republican"]
    election_pop_demographics_df = pd.merge(pop_demographics_df,
                                            pres_elections_release_df[pres_elections_release_df['election_year'] == year].set_index('fips')[election_categories],
                                            how='left', left_index=True, right_index=True)
    # Create Scatter Plot
    axes[i].scatter(election_pop_demographics_df['Percent White Population'], election_pop_demographics_df['percent_democrat'], s=.75)
    axes[i].set_ylabel('Democratic Percentage Vote', fontsize=10)
    axes[i].set_title('White Percentage Population vs Democratic Vote in ' + str(year))
axes[-1].set_xlabel('Percent White Population')
plt.show()
Graph Analysis: We can essentially see the opposite happening here. These scatter plots show the percent white population plotted against the percent democratic vote in the four election years. We can see from the graphs that in every year as the white population increases in the county, the democratic vote decreases, suggesting a strong relationship between these variables.
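The strength of the relationships read off the scatter plots can be put into a number with a correlation coefficient. The following sketch computes Pearson's r with `Series.corr` on hypothetical county-level values; the `toy` frame is made up and is not the project data.

```python
import pandas as pd

# Hypothetical county-level values, not taken from the project data.
toy = pd.DataFrame({
    "percent_black": [0.02, 0.10, 0.25, 0.40, 0.60],
    "percent_democrat": [0.30, 0.38, 0.45, 0.55, 0.70],
})

# Pearson's r ranges from -1 to 1; values near either end indicate a strong linear relationship.
r = toy["percent_black"].corr(toy["percent_democrat"])
print(f"Pearson r = {r:.2f}")
```

Applied per election year, the same call would give one coefficient for each scatter plot, which makes the year-to-year comparison explicit.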
In conclusion, these graphs support all of our initial hypotheses.
#more data cleaning
final_demographics_df = pd.DataFrame(columns = ['FIPS_code','Year'])
# calculate percent race for each county in each year
races = df_demographics['Race'].unique()
last_race = ''
for race in races:
    temp_race_df = (df_demographics[(df_demographics['Race'] == race)].groupby(['FIPS_code','Year'])['Population'].sum() /
                    df_demographics.groupby(['FIPS_code','Year'])['Population'].sum()).reset_index()
    final_demographics_df = pd.merge(final_demographics_df, temp_race_df, how='outer', left_on = ['FIPS_code','Year'], right_on=['FIPS_code','Year'], suffixes=(last_race, race))
    last_race = race
# calculate percent origin for each county in each year
origins = df_demographics['Origin'].unique()
last_origin = ''
for origin in origins:
    temp_origin_df = (df_demographics[(df_demographics['Origin'] == origin)].groupby(['FIPS_code','Year'])['Population'].sum() /
                      df_demographics.groupby(['FIPS_code','Year'])['Population'].sum()).reset_index()
    final_demographics_df = pd.merge(final_demographics_df, temp_origin_df, how='outer', left_on = ['FIPS_code','Year'], right_on=['FIPS_code','Year'], suffixes=(last_origin, origin))
    last_origin = origin
# calculate percent sex for each county in each year
sexes = df_demographics['Sex'].unique()
last_sex = ''
for sex in sexes:
    temp_sex_df = (df_demographics[(df_demographics['Sex'] == sex)].groupby(['FIPS_code','Year'])['Population'].sum() /
                   df_demographics.groupby(['FIPS_code','Year'])['Population'].sum()).reset_index()
    final_demographics_df = pd.merge(final_demographics_df, temp_sex_df, how='outer', left_on = ['FIPS_code','Year'], right_on=['FIPS_code','Year'], suffixes=(last_sex, sex))
    last_sex = sex
final_demographics_df
final_demographics_df = final_demographics_df.fillna(0)
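The merge loops above name their columns through merge suffixes, which works only because each variable has an even number of categories. A possible alternative is to compute the shares with `groupby` and `unstack`. The sketch below is not the notebook's method: `group_shares` is a hypothetical helper and the `pop` table is made up, but it produces the same `Population<category>` column names.

```python
import pandas as pd

# Hypothetical long-format population table (two counties, one year).
pop = pd.DataFrame({
    "FIPS_code": ["01001"] * 4 + ["01003"] * 4,
    "Year": [2020] * 8,
    "Race": ["White", "White", "Black", "Black"] * 2,
    "Sex": ["Male", "Female"] * 4,
    "Population": [300, 320, 90, 100, 500, 520, 40, 45],
})

def group_shares(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return each category's share of the county-year population, one column per category."""
    counts = frame.groupby(["FIPS_code", "Year", column])["Population"].sum().unstack(fill_value=0)
    shares = counts.div(counts.sum(axis=1), axis=0)
    return shares.add_prefix("Population")

# One row per county-year, with columns such as PopulationWhite and PopulationMale.
shares = pd.concat([group_shares(pop, "Race"), group_shares(pop, "Sex")], axis=1).reset_index()
print(shares)
```

Because `unstack(fill_value=0)` fills categories that are absent from a county, no separate `fillna` step is needed afterwards.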
Model Question: Classify counties as Democratic or Republican based on economic, demographic, and educational factors
Features:
Dependent Variable: Presidential Party
# Merge all of the data frames
df = pd.merge(election_education_poverty_df, final_demographics_df, how='inner', on=['FIPS_code', 'Year'])
# drop unnecessary columns
df = df.drop(columns=['county_name', 'democratic_raw_votes', 'republican_raw_votes',
'raw_county_vote_totals', 'percent_democrat', 'percent_republican']).copy()
#remove na rows
df = df.dropna()
#remove rows with no party
df = df.drop(df[df['party'] == ''].index).copy()
# Ensure all the percentages are in the form of decimals
divide_columns = ['Percent of adults with less than a high school diploma',
'Percent of adults with a high school diploma only',
'Percent of adults completing some college or associate\'s degree',
'Percent of adults with a bachelor\'s degree or higher', 'Poverty_rate']
df[divide_columns] = df[divide_columns] / 100
df
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
# from sklearn.metrics import plot_confusion_matrix
features = ['Percent of adults with less than a high school diploma',
'Percent of adults with a high school diploma only',
'Percent of adults completing some college or associate\'s degree',
'Percent of adults with a bachelor\'s degree or higher', 'Poverty_rate', 'PopulationWhite', 'PopulationBlack',
'PopulationAmerican Indian/Alaska Native',
'PopulationAsian or Pacific Islander', 'PopulationNon-Hispanic',
'PopulationHispanic', 'PopulationMale', 'PopulationFemale']
X = df[features]
y = df['party']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
#train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
#evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred, labels=df['party'].unique())
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
import seaborn as sns
import matplotlib.pyplot as plt
#plot confusion matrix
plt.figure(figsize=(10, 8))
sns.set(font_scale=1.4)
# Plot using seaborn
sns.heatmap(conf_matrix, annot=True, fmt="d", annot_kws={"size": 16}, cmap='Blues',
xticklabels=df['party'].unique(), yticklabels=df['party'].unique()) # Font size
plt.xlabel('Predicted Labels', fontsize=18)
plt.ylabel('True Labels', fontsize=18)
plt.title('Confusion Matrix', fontsize=20)
plt.show()
Accuracy: 0.84
Confusion Matrix:
[[2355   71]
 [ 439  240]]
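Overall accuracy hides how differently the model treats the two classes. As a small worked example, per-class recall and precision can be computed directly from the matrix printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels. Which party each row corresponds to depends on the order of `df['party'].unique()`, so the classes are referred to only by position here.

```python
import numpy as np

# Confusion matrix printed above; rows are true labels, columns are predicted labels.
conf_matrix = np.array([[2355, 71],
                        [439, 240]])

correct = np.diag(conf_matrix)
recall = correct / conf_matrix.sum(axis=1)      # share of each true class that was recovered
precision = correct / conf_matrix.sum(axis=0)   # share of each predicted class that was right

print("recall per class:   ", recall.round(3))     # [0.971 0.353]
print("precision per class:", precision.round(3))  # [0.843 0.772]
```

The second class is recovered only about a third of the time, so most of the errors come from counties in the smaller class being assigned to the larger one.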
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np
# Assuming df is your DataFrame
X = df[features]
y = df['party']
# Create a logistic regression model
model = LogisticRegression(max_iter=1000) # Increase max_iter if convergence issues occur
# Perform cross-validation to get predictions
y_pred = cross_val_predict(model, X, y, cv=5) # 5-fold cross-validation
# Calculate metrics based on the predictions
accuracy = accuracy_score(y, y_pred)
conf_matrix = confusion_matrix(y, y_pred, labels=df['party'].unique())
# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
#plot confusion matrix
plt.figure(figsize=(10, 8))
sns.set(font_scale=1.4)
# Plot using seaborn
sns.heatmap(conf_matrix, annot=True, fmt="d", annot_kws={"size": 16}, cmap='Blues',
xticklabels=df['party'].unique(), yticklabels=df['party'].unique()) # Font size
plt.xlabel('Predicted Labels', fontsize=18)
plt.ylabel('True Labels', fontsize=18)
plt.title('Confusion Matrix', fontsize=20)
plt.show()
Accuracy: 0.83
Confusion Matrix:
[[11770   358]
 [ 2223  1174]]
# Create and evaluate a model for each election year
years = df['Year'].unique()
for year in years:
    df_year = df[df['Year'] == year]
    # Restrict the features and labels to the current year
    X = df_year[features]
    y = df_year['party']
    # Create a logistic regression model
    model = LogisticRegression(max_iter=1000) # Increase max_iter if convergence issues occur
    # Perform cross-validation to get predictions
    y_pred = cross_val_predict(model, X, y, cv=5) # 5-fold cross-validation
    # Calculate metrics based on the predictions
    accuracy = accuracy_score(y, y_pred)
    conf_matrix = confusion_matrix(y, y_pred, labels=df['party'].unique())
    # Print the results
    print(f"{year}'s Accuracy: {accuracy:.2f}")
    print(f"{year}'s Confusion Matrix:")
    print(conf_matrix)
    #plot confusion matrix
    plt.figure(figsize=(10, 8))
    sns.set(font_scale=1.4)
    # Plot using seaborn
    sns.heatmap(conf_matrix, annot=True, fmt="d", annot_kws={"size": 16}, cmap='Blues',
                xticklabels=df['party'].unique(), yticklabels=df['party'].unique()) # Font size
    plt.xlabel('Predicted Labels', fontsize=18)
    plt.ylabel('True Labels', fontsize=18)
    plt.title(f'{year} Confusion Matrix', fontsize=20)
    plt.show()
1988's Accuracy: 0.76
1988's Confusion Matrix:
[[2235   51]
 [ 698  112]]
2000's Accuracy: 0.83
2000's Confusion Matrix:
[[2381   54]
 [ 489  181]]
2012's Accuracy: 0.85
2012's Confusion Matrix:
[[4654  180]
 [ 746  634]]
2020's Accuracy: 0.91
2020's Confusion Matrix:
[[2514   59]
 [ 232  305]]
Our model is less accurate than we had hoped. However, when we fit a separate model for each election year, accuracy varies considerably, from 0.76 in 1988 to 0.91 in 2020. One possible explanation is that demographics, poverty level, and education level were more strongly correlated with voting patterns in some years than in others, which may in turn reflect the candidates, the economy, and the political climate of those years.
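One way to put the per-year accuracies in context is to compare each with the accuracy of always predicting the larger class. The sketch below does this arithmetic from the four confusion matrices printed above (rows are true labels, columns are predictions); the `matrices` and `gains` names are introduced only for this illustration.

```python
import numpy as np

# Per-year confusion matrices as printed above (rows: true labels, columns: predictions).
matrices = {
    1988: [[2235, 51], [698, 112]],
    2000: [[2381, 54], [489, 181]],
    2012: [[4654, 180], [746, 634]],
    2020: [[2514, 59], [232, 305]],
}

gains = {}
for year, values in matrices.items():
    m = np.array(values)
    accuracy = np.trace(m) / m.sum()
    baseline = m.sum(axis=1).max() / m.sum()  # accuracy of always predicting the larger class
    gains[year] = accuracy - baseline
    print(f"{year}: accuracy={accuracy:.3f} baseline={baseline:.3f} gain={gains[year]:+.3f}")
```

On these numbers the model improves on the baseline by roughly 2 percentage points in 1988 and roughly 8 in 2020, so part of the higher 2020 accuracy reflects a larger majority class rather than better predictions alone.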