Using Principal Component Analysis to Produce a Composite Variable for Socioeconomic Analysis
by Jacob Dichter
February 19, 2025
I will implement a simple step-by-step outline of a basic ETL project in which we load a CSV into Python, clean it, and then load the data into SQLite, where it can be queried.
Motivation
Literature and problem-scope basis for PCA on target variable.
Paper
The target variable, Socioeconomic Status (SES) Change (\(Y = \Delta SES_{2018-2023}\)), is operationalized using Principal Component Analysis (PCA) to address the endogeneity and multicollinearity concerns that a simple weighted average may introduce. Reades (2018) used PCA to reduce four dimensions to a single principal component vector (PC1), which served as the target variable in their analysis. This analysis takes the same approach: PCA reduces the three input variables to a single principal component vector, which serves as the target variable. The data are first standardized so that variables measured on different scales contribute comparably, and the covariance matrix of the standardized data is then decomposed to derive its eigenvalues and eigenvectors. PC1, the component that explains the largest proportion of variance, is selected as the composite index of SES change.
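The steps described above (standardize, decompose the covariance matrix, keep PC1) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the analysis: the function name `first_principal_component` is hypothetical, and the three columns are synthetic stand-ins for the SES change variables, which are not listed here.

```python
import numpy as np

def first_principal_component(X):
    """Return PC1 scores, PC1 loadings, and PC1's share of total variance.

    X is an (n_samples, n_features) array. Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)

    # Standardize each column to mean 0 and (sample) standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Covariance matrix of the standardized data (the correlation matrix of X).
    C = np.cov(Z, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]  # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # The sign of an eigenvector is arbitrary; orient PC1 so its loadings sum to a positive value.
    loadings = eigvecs[:, 0]
    if loadings.sum() < 0:
        loadings = -loadings

    scores = Z @ loadings                    # one composite value per observation
    explained = eigvals[0] / eigvals.sum()   # proportion of variance captured by PC1
    return scores, loadings, explained

# Synthetic data: three noisy indicators driven by one shared latent factor (an assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(3)])

scores, loadings, explained = first_principal_component(X)
print("PC1 loadings:", loadings)
print("PC1 explained variance ratio:", explained)
```

The `scores` vector plays the role of the composite target variable. Because the sign of PC1 is arbitrary, it is worth fixing the orientation explicitly (as above) so that higher scores consistently mean an increase in SES.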
With the PCA-based approach, the loading of each variable on PC1 reflects its contribution to the overall variance, offering a more empirically grounded measure of SES change than an arbitrarily weighted average. Validation steps, such as correlation analysis and explained-variance checks, should be used to confirm the robustness of the selected component and to keep the analysis aligned with best practices in dimensionality reduction and quantitative socioeconomic research.
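As a rough sketch of what such validation might look like, the snippet below reports the share of variance explained by each component and the correlation between PC1 and each standardized input. The data are synthetic and the variable names are assumptions made for illustration. It does not reproduce the actual analysis.

```python
import numpy as np

# Synthetic stand-in for three SES change indicators sharing one common factor (an assumption).
rng = np.random.default_rng(1)
n = 300
common = rng.normal(size=n)
X = np.column_stack([
    common + 0.4 * rng.normal(size=n),
    common + 0.6 * rng.normal(size=n),
    common + 0.8 * rng.normal(size=n),
])

# Standardize, then decompose the covariance matrix of the standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
pc1 = Z @ eigvecs[:, 0]

# Variance check: how much of the total variance does each component explain?
explained_ratio = eigvals / eigvals.sum()

# Correlation check: how strongly does PC1 track each input variable?
pc1_correlations = np.array(
    [np.corrcoef(pc1, Z[:, j])[0, 1] for j in range(Z.shape[1])]
)

# For standardized inputs, corr(PC1, z_j) equals loading_j * sqrt(lambda_1).
expected = eigvecs[:, 0] * np.sqrt(eigvals[0])

print("Explained variance ratio:", explained_ratio)
print("Correlation of PC1 with each variable:", pc1_correlations)
```

A PC1 that explains only a little more than an equal share of the variance, or that correlates weakly with one of the inputs, would suggest that a single component is a poor summary of the indicators.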