Endogeneity Test: Detecting Error Term Correlation
Endogeneity can be a major buzzkill in regression analysis, potentially leading to biased and inconsistent estimates. Essentially, it means that one or more of your independent variables are correlated with the error term. This violates a key assumption of ordinary least squares (OLS) regression, making your results unreliable. So, how do you know if you have an endogeneity problem? That's where statistical tests come in handy! Let's dive into the nitty-gritty of how to test for endogeneity, particularly focusing on the correlation between the error term and independent variables.
Understanding the Regression Model and Endogeneity
First, let's lay the groundwork with a typical regression model. We represent it as:
\begin{align} \boldsymbol{y} & =\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \end{align}
Where:
- $\boldsymbol{y}$ is the dependent variable.
- $\boldsymbol{X}$ is the matrix of independent variables.
- $\boldsymbol{\beta}$ is the vector of coefficients.
- $\boldsymbol{\epsilon}$ is the error term.
The heart of the issue lies in the relationship between $\boldsymbol{X}$ and $\boldsymbol{\epsilon}$. OLS regression assumes that these are uncorrelated. When this assumption is violated, that is, when $\boldsymbol{X}$ and $\boldsymbol{\epsilon}$ are correlated, we say that endogeneity exists. This correlation can arise from several sources, including omitted variables, simultaneity, or measurement error. For instance, imagine you're trying to figure out how education affects income. Seems straightforward, right? But what if smarter people tend to get more education and also are better at making money, no matter their education level? That innate ability is tough to measure and probably hangs out in the error term. Now your education variable is dancing with the error term, and that's an endogeneity party you don't want to be at!
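To see what that bias looks like in practice, here's a quick simulation sketch in Python (made-up variable names, and it assumes numpy and statsmodels are installed): "ability" drives both education and income, gets left out of the model, and the OLS coefficient on education overshoots the true value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5_000

ability = rng.normal(size=n)                     # unobserved; ends up in the error term
education = ability + rng.normal(size=n)         # correlated with ability
income = 2.0 * education + 3.0 * ability + rng.normal(size=n)  # true effect of education is 2.0

# OLS that omits ability: education is now correlated with the error term
biased = sm.OLS(income, sm.add_constant(education)).fit()
print(biased.params[1])   # drifts toward ~3.5, well above the true 2.0

# Controlling for ability recovers the true effect (only possible here because we simulated it)
full = sm.OLS(income, sm.add_constant(np.column_stack([education, ability]))).fit()
print(full.params[1])     # back near 2.0
```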
Sources of Endogeneity
- Omitted Variables: When a relevant variable is left out of the model, and this variable is correlated with both the dependent variable and one or more independent variables, it can cause endogeneity. For example, if we are trying to estimate the effect of exercise on weight loss but fail to account for diet, and diet is correlated with both exercise and weight loss, the exercise variable may be endogenous.
- Simultaneity: This occurs when the dependent variable affects one or more of the independent variables, creating a feedback loop. For instance, in a supply and demand model, price affects quantity demanded, and quantity demanded, in turn, affects price. This simultaneity can lead to endogeneity.
- Measurement Error: If an independent variable is measured with error, and this error is correlated with the true value of the independent variable, it can induce endogeneity. For example, if we are using self-reported income as an independent variable, and individuals tend to underreport their income, this measurement error can be correlated with the true income level, leading to endogeneity.
Why Testing for Endogeneity Matters
So, why should you care about endogeneity? Well, if you ignore it, your regression results might be totally bogus! The estimated coefficients won't reflect the true causal effect of the independent variables on the dependent variable. This can lead to incorrect conclusions and flawed policy recommendations. Imagine basing important business decisions on faulty data – yikes! Therefore, detecting and addressing endogeneity is crucial for reliable and valid regression analysis.
Common Statistical Tests for Endogeneity
Alright, let's get to the good stuff: how to actually test for endogeneity. Several statistical tests can help you determine if your independent variables are flirting with the error term. Here are some of the most common ones:
1. Durbin-Wu-Hausman Test
The Durbin-Wu-Hausman (DWH) test is probably the most widely used test for endogeneity. It compares the estimates from ordinary least squares (OLS) regression with those from an instrumental variables (IV) regression. The basic idea is this: if there's no endogeneity, OLS estimates are consistent and efficient. However, if endogeneity is present, OLS estimates are biased and inconsistent, while IV estimates (using valid instruments) are consistent. The DWH test essentially checks whether the OLS and IV estimates are significantly different. A significant difference suggests endogeneity.
How the DWH Test Works:
- Estimate the OLS regression: $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
- Find instrumental variables ($\boldsymbol{Z}$): These variables should be correlated with the endogenous independent variables but uncorrelated with the error term.
- Estimate the first-stage regression: Regress the endogenous independent variables on the instrumental variables and any exogenous variables in the original model. This gives you predicted values for the endogenous variables.
- Estimate the IV regression: Replace the endogenous independent variables in the original model with their predicted values from the first-stage regression. This gives you the IV estimates.
- Perform the Hausman test: Compare the OLS and IV estimates. The test statistic is calculated as:
\begin{align} H = (\hat{\boldsymbol{\beta}}_{IV} - \hat{\boldsymbol{\beta}}_{OLS})^T [Var(\hat{\boldsymbol{\beta}}_{IV}) - Var(\hat{\boldsymbol{\beta}}_{OLS})]^{-1} (\hat{\boldsymbol{\beta}}_{IV} - \hat{\boldsymbol{\beta}}_{OLS}) \end{align}
Where $\hat{\boldsymbol{\beta}}_{IV}$ and $\hat{\boldsymbol{\beta}}_{OLS}$ are the IV and OLS coefficient estimates, respectively, and $Var(\hat{\boldsymbol{\beta}}_{IV})$ and $Var(\hat{\boldsymbol{\beta}}_{OLS})$ are their respective variance-covariance matrices.
The test statistic follows a chi-squared distribution with degrees of freedom equal to the number of endogenous variables. If the p-value of the test is below a predetermined significance level (e.g., 0.05), you reject the null hypothesis of no endogeneity.
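To make the mechanics concrete, here's a hedged sketch on simulated data (variable names are made up, and it assumes statsmodels, scipy, and the optional linearmodels package are installed). OLS comes from statsmodels, the IV estimate and its covariance come from linearmodels' IV2SLS, and the Hausman statistic is assembled by hand following the formula above for the single endogenous coefficient.

```python
import numpy as np
import statsmodels.api as sm
from linearmodels.iv import IV2SLS
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000
z = rng.normal(size=n)                  # instrument: relevant and (by construction) exogenous
u = rng.normal(size=n)                  # shared shock that creates endogeneity
x = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)    # true coefficient on x is 1.5

# OLS estimate (biased, because x is endogenous here)
ols = sm.OLS(y, sm.add_constant(x)).fit()

# IV / 2SLS estimate, using z as the instrument for x
iv = IV2SLS(dependent=y, exog=np.ones((n, 1)),
            endog=x.reshape(-1, 1), instruments=z.reshape(-1, 1)).fit(cov_type="unadjusted")

# Hausman statistic for the single endogenous coefficient (scalar case of the formula above)
b_ols, v_ols = ols.params[1], ols.cov_params()[1, 1]
b_iv, v_iv = iv.params.iloc[-1], iv.cov.iloc[-1, -1]
H = (b_iv - b_ols) ** 2 / (v_iv - v_ols)
p_value = stats.chi2.sf(H, df=1)        # df = number of endogenous variables
print(f"H = {H:.2f}, p-value = {p_value:.4f}")   # small p-value -> reject 'no endogeneity'
```

In practice, most IV/2SLS packages expose ready-made Durbin/Wu-Hausman style diagnostics on the fitted results, which is usually the less error-prone route than assembling the statistic yourself.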
2. Two-Stage Least Squares (2SLS) with a Test for Overidentification
Two-Stage Least Squares (2SLS) is a method used to address endogeneity, but it can also be part of a testing procedure. After performing 2SLS, you can conduct a test for overidentification to assess the validity of your instruments. This is particularly useful when you have more instruments than endogenous variables.
How 2SLS Works:
- First Stage: Regress each endogenous variable on all the instruments and exogenous variables. Obtain the predicted values of the endogenous variables.
- Second Stage: Regress the dependent variable on the predicted values from the first stage and the exogenous variables.
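Here's a minimal sketch of those two stages done by hand with statsmodels on simulated data (names are illustrative, not from any real dataset):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2_000
z = rng.normal(size=n)                  # instrument
u = rng.normal(size=n)                  # shock that makes x endogenous
x = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)    # true coefficient on x is 1.5

# First stage: regress the endogenous x on the instrument (plus a constant)
first_stage = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = first_stage.fittedvalues

# Second stage: regress y on the first-stage fitted values
second_stage = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(second_stage.params[1])           # close to 1.5; plain OLS of y on x would not be

# Caveat: the standard errors printed by the second-stage OLS are NOT valid 2SLS
# standard errors; dedicated 2SLS routines adjust them properly.
```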
Testing for Overidentification:
If you have more instruments than endogenous variables, you can perform a test for overidentification to check whether the instruments are valid (i.e., uncorrelated with the error term). A common test is the Sargan test or the Hansen J-test.
- Sargan Test: This test calculates a test statistic based on the 2SLS residuals (computed with the actual endogenous regressors, not their fitted values). The test statistic is:
\begin{align} \text{Sargan} = n \cdot R^2 \end{align}
Where $n$ is the sample size and $R^2$ is the R-squared from regressing the 2SLS residuals on all the exogenous variables and instruments. The Sargan statistic follows a chi-squared distribution with degrees of freedom equal to the number of overidentifying restrictions (i.e., the number of instruments minus the number of endogenous variables). A high p-value (typically > 0.05) indicates that the instruments are valid. A worked sketch follows this list.
- Hansen J-Test: The Hansen J-test is a more general version of the Sargan test that is robust to heteroskedasticity. It also tests the null hypothesis that the instruments are uncorrelated with the error term. A high p-value supports the validity of the instruments.
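And here's the promised sketch of the Sargan test, again on simulated data with made-up names: one endogenous regressor, two instruments, so one overidentifying restriction (statsmodels and scipy assumed).

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)     # two valid instruments
u = rng.normal(size=n)
x = 0.7 * z1 + 0.5 * z2 + u + rng.normal(size=n)    # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)

# 2SLS by hand: first stage, then second stage on the fitted values
Z = sm.add_constant(np.column_stack([z1, z2]))
x_hat = sm.OLS(x, Z).fit().fittedvalues
beta_2sls = sm.OLS(y, sm.add_constant(x_hat)).fit().params

# 2SLS residuals must use the *actual* x, not x_hat
resid = y - beta_2sls[0] - beta_2sls[1] * x

# Auxiliary regression of the 2SLS residuals on all exogenous variables and instruments
aux = sm.OLS(resid, Z).fit()
sargan = n * aux.rsquared
p_value = stats.chi2.sf(sargan, df=1)               # instruments - endogenous = 2 - 1
print(f"Sargan = {sargan:.2f}, p-value = {p_value:.3f}")  # high p-value -> instruments look valid
```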
3. Control Function Approach
The Control Function (CF) approach is another way to deal with endogeneity and test for it simultaneously. This method involves estimating an additional equation to capture the part of the endogenous variable that is correlated with the error term. In practice, you regress the endogenous variable on the instruments (and any exogenous variables) in a first-stage regression, save the residuals, and then include those residuals as an extra regressor in the original model. A simple t-test on the coefficient of that residual term doubles as an endogeneity test: if it's statistically significant, you have evidence that the variable is endogenous; if not, plain OLS is probably fine.
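A minimal sketch of the idea in Python (simulated data, illustrative names, statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000
z = rng.normal(size=n)                  # instrument
u = rng.normal(size=n)                  # shock that makes x endogenous
x = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)    # true coefficient on x is 1.5

# Step 1: first-stage regression of x on the instrument; keep the residuals
v_hat = sm.OLS(x, sm.add_constant(z)).fit().resid

# Step 2: add those residuals to the original model as a control function
cf = sm.OLS(y, sm.add_constant(np.column_stack([x, v_hat]))).fit()
print(cf.params[1])    # coefficient on x: matches the 2SLS estimate in this linear setup
print(cf.pvalues[2])   # small p-value on the residual term -> evidence of endogeneity
```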