Generalizing Regression: t-tests and OLS
by Jacob Dichter
December 23, 2024
Regression analysis is one of the most foundational tools in statistics, enabling us to model and understand relationships between variables. I was surprised to learn that the t-test, a staple technique for comparing group means in statistics, is actually a simple case of Ordinary Least Squares (OLS) regression. Specifically, a t-test can be thought of as a linear regression where the predictor variable is binary (representing the two groups) and its coefficient represents the difference in means between the two groups.
Key Concepts
- t-tests: Used to compare the means of two groups.
- OLS Regression: A method for modeling the relationship between a dependent variable and one or more independent variables.
At its core, a t-test is designed to compare the means of two groups, determining whether the observed differences are statistically significant. On the other hand, OLS regression is a method for modeling the relationship between a dependent variable and one or more independent variables. While these two techniques may seem distinct, they are deeply interconnected. In fact, a t-test can be viewed as a simplified form of OLS regression, where the predictor variable is binary (e.g., representing two groups).
The Connection
When you perform a t-test, you’re essentially running an OLS regression model with a single binary predictor. Here’s how it works:
- The binary predictor (e.g., group A vs. group B) is coded as 0 or 1, where 0 represents one group and 1 represents the other.
- The regression coefficient represents the difference in means between the two groups.
- The t-statistic from the regression output is identical to the t-statistic from a traditional t-test.
This equivalence demonstrates that t-tests are a specific application of the broader framework of regression analysis.
Example
Let’s say we want to compare the average test scores of two groups: Group A and Group B. We can use either a t-test or OLS regression to answer this question.
Using a t-test (with made-up example scores):
from scipy.stats import ttest_ind
group_a_scores = [82, 75, 90, 68, 85, 79]  # hypothetical data for illustration
group_b_scores = [88, 92, 80, 95, 86, 91]
t_stat, p_value = ttest_ind(group_a_scores, group_b_scores)
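For comparison, here is a minimal sketch of the OLS route to the same answer, regressing the scores on a 0/1 group dummy via `scipy.stats.linregress`. The example scores are made up for illustration; the slope’s t-statistic and p-value coincide with the (equal-variance) t-test’s.

```python
from scipy.stats import linregress, ttest_ind

# Made-up example scores (hypothetical data for illustration)
group_a_scores = [82, 75, 90, 68, 85, 79]
group_b_scores = [88, 92, 80, 95, 86, 91]

# Stack the scores and build a 0/1 dummy for group membership
scores = group_a_scores + group_b_scores
group = [0] * len(group_a_scores) + [1] * len(group_b_scores)

# Regress scores on the dummy: the slope estimates mean(B) - mean(A)
fit = linregress(group, scores)
t_regression = fit.slope / fit.stderr

# Equal-variance t-test in the same direction (B minus A)
t_test, p_test = ttest_ind(group_b_scores, group_a_scores)

print(t_regression, t_test)  # same t-statistic
print(fit.pvalue, p_test)    # same p-value
```

Note the ordering: with the dummy coded 0 for Group A and 1 for Group B, the slope measures B minus A, so the matching t-test is `ttest_ind(group_b_scores, group_a_scores)`.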
By understanding the connection between t-tests and OLS regression, we gain a deeper appreciation for the flexibility and power of regression analysis. Whether you’re comparing means or modeling complex relationships, regression provides a unified framework for tackling a wide range of statistical problems. This insight not only simplifies your toolkit but also enhances your ability to interpret and apply statistical methods effectively.
Old Post
Is a t-Test Equivalent to Regression on a Group Dummy Variable?
Yes, running a t-test comparing two groups is mathematically equivalent to performing a simple linear regression where the outcome variable is regressed on a single binary (dummy) variable indicating group membership. Here’s why:
The t-Test
A t-test for two independent groups compares the means of a dependent variable (\(Y\)) between two groups (e.g., Group A and Group B). The null hypothesis is that the means of the two groups are equal \((H_0: \mu_A = \mu_B)\).
Simple Linear Regression
In simple linear regression, you can model the dependent variable \(Y\) as:
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
where:
- \(X_i\) is a binary variable (e.g., 0 for Group A and 1 for Group B),
- \(\beta_0\) is the mean of \(Y\) for Group A,
- \(\beta_1\) is the difference in means between Group B and Group A \((\mu_B - \mu_A)\),
- \(\epsilon_i\) is the error term.
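To make the coefficient interpretation concrete, here is a small numerical check with made-up scores, solving the OLS problem directly with NumPy’s least-squares routine: the fitted intercept recovers Group A’s mean, and the slope recovers the mean difference.

```python
import numpy as np

# Hypothetical scores for two groups (made up for illustration)
group_a = np.array([82.0, 75, 90, 68, 85, 79])
group_b = np.array([88.0, 92, 80, 95, 86, 91])

# Stack outcomes and build the 0/1 dummy (0 = Group A, 1 = Group B)
y = np.concatenate([group_a, group_b])
x = np.concatenate([np.zeros(len(group_a)), np.ones(len(group_b))])

# Design matrix with an intercept column; solve OLS by least squares
X = np.column_stack([np.ones_like(x), x])
(beta0, beta1), *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta0, group_a.mean())                   # beta0 = mean of Group A
print(beta1, group_b.mean() - group_a.mean())  # beta1 = mean difference
```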
Equivalence
- The t-statistic for the slope (\(\beta_1\)) in this regression is the same as the t-statistic from the t-test comparing the two groups.
- The regression’s \(p\)-value for testing whether \(\beta_1 = 0\) is identical to the \(p\)-value from the t-test for the null hypothesis that the two group means are equal.
- The estimated coefficients in the regression (\(\beta_0\) and \(\beta_1\)) correspond directly to the group means and the mean difference.
Assumptions
For the equivalence to hold, the assumptions underlying both methods must be satisfied:
- The dependent variable is approximately normally distributed within each group.
- The variances of the dependent variable in the two groups are equal (homoscedasticity).
- Observations are independent.
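When the equal-variance assumption is doubtful, a common alternative is Welch’s t-test (`equal_var=False` in SciPy’s `ttest_ind`), which relaxes homoscedasticity and then no longer matches the plain OLS output exactly. A quick sketch with made-up scores of visibly different spread and unequal group sizes:

```python
from scipy.stats import ttest_ind

# Made-up scores: Group A is more spread out and larger than Group B
group_a = [82, 75, 90, 68, 85, 79]
group_b = [89, 90, 88, 91]

# Student's t-test (pooled variance, the OLS-equivalent default)
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test (no equal-variance assumption)
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(t_pooled, t_welch)  # differ when variances and group sizes differ
```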
Why Use One Over the Other?
- t-Test: Direct and simpler for comparing two group means.
- Regression: More flexible, especially when additional covariates are included (e.g., adjusting for confounders or interacting variables).
