Simple Linear Regression


In this post we will review simple linear regression and the assumptions that go with it.

Least Squares Estimator

By minimizing the Sum of Squares Error:

\[\begin{aligned} \mathrm{SSE}(\beta_{0}, \beta_{1}) &= \sum_{i = 1}^{N}\hat{e}_{i}^{2}\\ &= \sum_{i = 1}^{N}(y_{i} - \hat{y}_{i})^{2}\\ &= \sum_{i = 1}^{N}(y_{i} - \beta_{0} - \beta_{1}x_{i})^{2}\\ \end{aligned}\]

Because the function is convex (it is a quadratic function), we can take the derivative and set it to 0 to find the critical point:

\[\begin{aligned} \frac{\partial SSE}{\partial \beta_{0}} &= 2N\beta_{0} - 2\sum_{i} y_{i} + 2\beta_{1}\sum_{i} x_{i}\\ &= 0\\ \frac{\partial SSE}{\partial \beta_{1}} &= 2\beta_{1}\sum_{i} x_{i}^{2} - 2\sum_{i} x_{i}y_{i} + 2\beta_{0}\sum_{i} x_{i}\\ &= 0 \end{aligned}\]

Solving the simultaneous equations results in:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{N\sum_{i}x_{i}y_{i} - \sum_{i}x_{i}\sum_{i}y_{i}}{N\sum_{i}x_{i}^{2} - \Big(\sum_{i}x_{i}\Big)^{2}}\\[5pt] \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1}\bar{x} \end{aligned}\]

To simplify the result, note that:

\[\begin{aligned} \sum_{i}(x_{i} - \bar{x})^{2} &= \sum_{i}x_{i}^{2} - 2\bar{x}\sum_{i}x_{i} + N\bar{x}^{2}\\ &= \sum_{i}x_{i}^{2} - 2\bar{x}\Big(N \times \frac{1}{N}\sum_{i}x_{i}\Big) + N\bar{x}^{2}\\ &= \sum_{i}x_{i}^{2} - 2N\bar{x}^{2} + N\bar{x}^{2}\\ &= \sum_{i}x_{i}^{2} - N\bar{x}^{2} \end{aligned}\]

And:

\[\begin{aligned} \sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y}) &= \sum_{i}x_{i}y_{i} - N\bar{x}\bar{y}\\ &= \sum_{i}x_{i}y_{i} - \frac{\sum_{i}x_{i}\sum_{i}y_{i}}{N} \end{aligned}\]

Hence we get the least squares estimators as:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{\sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i} (x_{i} - \bar{x})^{2}}\\[5pt] \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1}\bar{x} \end{aligned}\]
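To make the formulas concrete, here is a minimal Python sketch (not part of the original derivation) that computes \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) from simulated data and cross-checks the result against NumPy's built-in polynomial fit. The data and true coefficients are made up for illustration.

```python
import numpy as np

# A minimal sketch (not from the post): compute the closed-form least squares
# estimates on simulated data. The true coefficients below are assumptions.
rng = np.random.default_rng(42)
N = 100
x = rng.uniform(0, 10, N)
y = 2.0 + 0.5 * x + rng.normal(0, 1, N)   # true beta0 = 2.0, beta1 = 0.5

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check against numpy's built-in least squares polynomial fit
slope, intercept = np.polyfit(x, y, 1)    # for deg=1, returns [slope, intercept]
print(beta0_hat, beta1_hat, intercept, slope)
```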

We can rewrite \(\hat{\beta}_{1}\) as:

\[\begin{aligned} \hat{\beta}_{1} &= \sum_{i = 1}^{N}w_{i}y_{i}\\[5pt] w_{i} &= \frac{x_{i} - \bar{x}}{\sum_{i} (x_{i} - \bar{x})^{2}} \end{aligned}\]

This is true because:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{\sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i} (x_{i} - \bar{x})^{2}}\\[10pt] &= \frac{\sum_{i} (x_{i} - \bar{x})y_{i} - \bar{y}\sum_{i} (x_{i} - \bar{x})}{\sum_{i}(x_{i} - \bar{x})^{2}}\\[10pt] &= \frac{\sum_{i} (x_{i} - \bar{x})y_{i} - 0}{\sum_{i}(x_{i} - \bar{x})^{2}}\\[10pt] &= \frac{\sum_{i} (x_{i} - \bar{x})y_{i}}{\sum_{i}(x_{i} - \bar{x})^{2}}\\[10pt] &= \sum_{i}\Big[ \frac{(x_{i} - \bar{x})}{\sum_{i}(x_{i} - \bar{x})^{2}}\Big]y_{i}\\[10pt] &= \sum_{i} w_{i}y_{i} \end{aligned}\]

So in other words, \(\hat{\beta}_{1}\) is a linear combination (a weighted sum) of the \(y_{i}\).

Substituting \(y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\) and simplifying further (using \(\sum_{i}w_{i} = 0\) and \(\sum_{i}w_{i}x_{i} = 1\)), we get:

\[\begin{aligned} \hat{\beta}_{1} &= \beta_{1} + \sum_{i}w_{i}e_{i} \end{aligned}\]

Unbiased Estimator

In this form, the analysis becomes easier. For example, we can show that the estimator is unbiased given the sample \(\vec{\textbf{x}}\):

\[\begin{aligned} E[\hat{\beta}_{1}|\vec{\textbf{x}}] &= E\Big[\beta_{1} + \sum_{i}w_{i}e_{i}|\vec{\textbf{x}}\Big]\\ &= \beta_{1} + \sum_{i}E\Big[w_{i}e_{i}|\vec{\textbf{x}}\Big]\\ &= \beta_{1} + \sum_{i}w_{i}E\Big[e_{i}|\vec{\textbf{x}}\Big]\\ &= \beta_{1} \end{aligned}\]

We can take \(w_{i}\) out of the expectation because we are conditioning on the sample \(\vec{\textbf{x}}\). What this means is that we are holding \(\vec{\textbf{x}}\) constant, and because \(w_{i}\) depends only on \(\vec{\textbf{x}}\), it is a constant as well. Another way to think of conditioning on \(\vec{\textbf{x}}\) is treating \(\vec{\textbf{x}}\) as given in a controlled, repeatable experiment.

One reason why \(E\Big[e_{i}\mid \vec{\textbf{x}}\Big] = 0\) might not be true is due to omitted variables. If we have omitted a variable that is correlated with \(\vec{\textbf{x}}\), then \(E\Big[e_{i}\mid \vec{\textbf{x}}\Big] \neq 0\).
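To illustrate, here is a hedged Monte Carlo sketch in Python: when an omitted variable correlated with \(x\) ends up in the error term, the average of \(\hat{\beta}_{1}\) over many replications drifts away from the true \(\beta_{1}\); when the omitted variable is uncorrelated with \(x\), the estimator stays centered on \(\beta_{1}\). All parameter values are assumptions chosen for illustration.

```python
import numpy as np

# Monte Carlo sketch: an omitted variable z correlated with x biases the simple
# regression slope; an uncorrelated z does not. All values are assumptions.
rng = np.random.default_rng(0)
beta0, beta1, gamma = 1.0, 2.0, 1.5
N, reps = 200, 2000

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

est_corr, est_uncorr = [], []
for _ in range(reps):
    x = rng.normal(0, 1, N)
    z_corr = 0.8 * x + rng.normal(0, 1, N)   # omitted variable correlated with x
    z_unc = rng.normal(0, 1, N)              # omitted variable uncorrelated with x
    eps = rng.normal(0, 1, N)
    est_corr.append(slope(x, beta0 + beta1 * x + gamma * z_corr + eps))
    est_uncorr.append(slope(x, beta0 + beta1 * x + gamma * z_unc + eps))

print(np.mean(est_corr))    # noticeably above beta1 = 2 (positive bias here)
print(np.mean(est_uncorr))  # close to beta1 = 2
```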

Recall the covariance formula and apply iterated expectations (here \(X = e_{i}\) and \(Y = x_{i}\)):

\[\begin{aligned} \mathrm{Cov}(X, Y) &= E[XY] - E[X]E[Y]\\ &= E_{Y}\Big[E[XY|Y]\Big] - E_{Y}\Big[E[X|Y]\Big]E[Y]\\ &= E_{Y}\Big[YE[X|Y]\Big] - E_{Y}\Big[E[X|Y]\Big]E[Y] \end{aligned}\]

If \(E[X\mid Y] = 0\), the covariance will be zero as well.

Variances and Covariances of \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)

The conditional variance of \(\hat{\beta}_{1}\) is defined as:

\[\begin{aligned} \mathrm{Var}(\hat{\beta}_{1}|\vec{\textbf{x}}) &= E\Big[(\hat{\beta}_{1} - E[\hat{\beta}_{1}|\vec{\textbf{x}}])^{2}\Big|\vec{\textbf{x}}\Big] \end{aligned}\]

If the assumptions SR1 to SR5 hold (SR6 is not required), then:

\[\begin{aligned} \mathrm{Var}(\hat{\beta}_{0}|\vec{\textbf{x}}) &= \sigma^{2} \times \frac{\sum x_{i}^{2}}{N \sum (x_{i} - \bar{x})^{2}}\\[5pt] \mathrm{Var}(\hat{\beta}_{1}|\vec{\textbf{x}}) &= \frac{\sigma^{2}}{\sum (x_{i} - \bar{x})^{2}}\\[5pt] \mathrm{Cov}(\hat{\beta}_{0}, \hat{\beta}_{1}|\vec{\textbf{x}}) &= \sigma^{2}\frac{-\bar{x}}{\sum (x_{i} - \bar{x})^{2}} \end{aligned}\]

Note that the derivations for the above are tedious and not shown here.

Central Limit Theorem

According to the Central Limit Theorem, if SR1 to SR5 hold and \(N\) is sufficiently large, then the least squares estimators have distributions that approximate the normal distribution:

\[\begin{aligned} \hat{\beta}_{0}|\vec{\textbf{x}} \sim N\Bigg(\beta_{0}, \frac{\sigma^{2}\sum_{i}x_{i}^{2}}{N\sum_{i}(x_{i} - \bar{x})^{2}} \Bigg)\\ \hat{\beta}_{1}|\vec{\textbf{x}} \sim N\Bigg(\beta_{1}, \frac{\sigma^{2}}{\sum_{i}(x_{i} - \bar{x})^{2}} \Bigg)\\ \end{aligned}\]

Variance of the Error Term

The conditional variance of the random error:

\[\begin{aligned} \mathrm{Var}(e_{i}|\vec{\textbf{x}}) &= \sigma^{2}\\ &= E\Big[(e_{i} - E[e_{i}|\vec{\textbf{x}}])^{2}|\vec{\textbf{x}}\Big]\\ &= E[e_{i}^{2}|\vec{\textbf{x}}] \end{aligned}\]

This is because \(E[e_{i}\mid \vec{\textbf{x}}] = 0\).

Since expectation is just an average value, the sample analog is:

\[\begin{aligned} \hat{\sigma}^{2} &= \frac{\sum_{i}e_{i}^{2}}{N} \end{aligned}\]

Since the true errors \(e_{i}\) are unobservable, we replace them with the residuals \(\hat{e}_{i}\) and divide by \(N - 2\) in order to get an unbiased estimator:

\[\begin{aligned} \hat{\sigma}^{2} &= \frac{\sum_{i}\hat{e}_{i}^{2}}{N - 2} \end{aligned}\]
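A small sketch (with simulated data, values assumed) of estimating \(\sigma^{2}\) using the \(N - 2\) divisor and plugging the estimate into the variance and covariance formulas from the previous section:

```python
import numpy as np

# Sketch: estimate sigma^2 with the N - 2 divisor, then compute the coefficient
# variances/covariance from the formulas above. Simulated data, assumed values.
rng = np.random.default_rng(1)
N = 80
x = rng.uniform(0, 5, N)
y = 1.0 + 3.0 * x + rng.normal(0, 2.0, N)

x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar

resid = y - b0 - b1 * x
sigma2_hat = np.sum(resid ** 2) / (N - 2)          # unbiased estimator of sigma^2

sxx = np.sum((x - x_bar) ** 2)
var_b1 = sigma2_hat / sxx
var_b0 = sigma2_hat * np.sum(x ** 2) / (N * sxx)
cov_b0_b1 = -sigma2_hat * x_bar / sxx
print(sigma2_hat, np.sqrt(var_b0), np.sqrt(var_b1), cov_b0_b1)
```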

Gauss-Markov Theorem

Given assumptions SR1-SR5, the Gauss-Markov Theorem states that the estimators \(\hat{\beta}_{0}, \hat{\beta}_{1}\) have the smallest variance of all linear and unbiased estimators of \(\beta_{0}, \beta_{1}\). In other words, they are the Best Linear Unbiased Estimators (BLUE).

Data Generating Process

Let us assume we randomly draw a sample of \(N\) data pairs \((x_{i}, y_{i})\) from a population at a point in time (cross-sectional data). The act of randomly drawing from the population results in statistically independent data pairs. We also assume there is a joint pdf \(f(y_{i}, x_{i})\) that describes their common distribution (it could be bivariate normal, but we do not make this assumption), so the pairs are identically distributed. Together, the pairs are independent and identically distributed (i.i.d.) and the data is said to be a random sample.

Each data pair is assumed to be generated by the regression model below:

\[\begin{aligned} y_{i} &= \beta_{0} + \beta_{1}x_{i} + e_{i} \end{aligned}\]

This is also known as a data generating process (DGP).

Exogeneity vs Endogeneity

The DGP includes a random error term \(e_{i}\), which encompasses all factors other than \(x_{i}\). While \(y_{i}\) and \(x_{i}\) are observable, \(e_{i}\) is unobservable.

The assumption that \(x_{i}\) cannot be used to predict \(e_{i}\) is:

\[\begin{aligned} E[e_{i}|x_{i}] &= 0 \end{aligned}\]

Recall from the Iterated Expectations section of the Probability Primer:

\[\begin{aligned} E[Y] &= E_{X}\Big[ E[Y|X]\Big] \end{aligned}\]

Given \(E[e_{i} \mid x_{i}] = 0\), this implies that:

\[\begin{aligned} E[e_{i}] = 0 \end{aligned}\]

Similarly, from the Covariance Decomposition section:

\[\begin{aligned} \mathrm{Cov}(X, Y) &= E_{X}\Big[(X - \mu_{X})E[Y|X]\Big] \end{aligned}\]

Given \(E[e_{i} \mid x_{i}] = 0\), this implies that:

\[\begin{aligned} \mathrm{Cov}(e_{i}, x_{i}) &= 0 \end{aligned}\]

It is very hard to ascertain that \(E[e_{i} \mid x_{i}] = 0\) is true, and in fact, in most cases it is not strictly true. If \(E[e_{i} \mid x_{i}] = 0\) holds for the DGP, then \(x\) is said to be "strictly exogenous". If only \(\mathrm{Cov}(e_{i}, x_{i}) = 0\) holds, \(x\) is said to be simply "exogenous". If \(\mathrm{Cov}(e_{i}, x_{i}) \neq 0\), then \(x\) is said to be "endogenous".

For time-series data, a lack of independence is expected. In other words, the assumption that the pairs \((y_{t}, x_{t})\) are i.i.d. is unrealistic. To extend the exogeneity assumption to time-series data, strict exogeneity is stated by conditioning on the entire sample:

\[\begin{aligned} E[e_{t}|\vec{\textbf{x}}] &= 0 \end{aligned}\]

Error Correlation

In addition to the random error component being correlated with the explanatory variable, it is possible for the error components to be correlated with each other. This usually occurs in time-series data and is known as serial correlation (autocorrelation):

\[\begin{aligned} \mathrm{Cov}(e_{t}, e_{t-1}) \neq 0 \end{aligned}\]

In a spatial context, observations in the same neighborhood tend to be similar to each other, which causes clusters of observations with correlated errors.

Simple Linear Regression Assumptions

Assumptions taken from Carter R., Griffiths W., Lim G. (2018) Principles of Econometrics

SR1: Econometric Model

All data pairs \((y_{i}, x_{i})\) collected from a population satisfy the linear relationship:

\[\begin{aligned} y_{i} &= \beta_{0} + \beta_{1}x_{i} + e_{i} \end{aligned}\]

SR2: Strict Exogeneity

Given \(\vec{\textbf{x}} = (x_{1}, \cdots, x_{N})\), the strict exogeneity assumption is \(E[e_{i}|\vec{\textbf{x}}] = 0\) for the model:

\[\begin{aligned} y_{i} &= \beta_{0} + \beta_{1}x_{i} + e_{i} \end{aligned}\]

If strict exogeneity holds:

\[\begin{aligned} E[y_{i}|\vec{\textbf{x}}] &= \beta_{0} + \beta_{1}x_{i} \end{aligned}\] \[\begin{aligned} y_{i} &= E[y_{i}|\vec{\textbf{x}}] + e_{i} \end{aligned}\]

SR3: Conditional Homoskedasticity

The conditional variance of the random error is constant:

\[\begin{aligned} \mathrm{Var}(e_{i}|\vec{\textbf{x}}) = \sigma^{2} \end{aligned}\]

SR4: Conditionally Uncorrelated Errors

\[\begin{aligned} \mathrm{Cov}(e_{i}, e_{j}|\vec{\textbf{x}}) = 0 \end{aligned}\]

SR5: Explanatory Variable Must Vary

There must be enough variation in the explanatory variable: \(x_{i}\) must take at least two distinct values, otherwise \(\sum_{i}(x_{i} - \bar{x})^{2} = 0\) and the estimators are undefined.

SR6: Error Normality

\[\begin{aligned} e_{i}|\vec{\textbf{x}} \sim N(0, \sigma^{2}) \end{aligned}\]

The random error \(e\) represents all factors affecting \(y\) not included in \(x\).

Elasticities

The elasticity of \(y\) w.r.t \(x\):

\[\begin{aligned} \epsilon &= \frac{\text{percentage change in y}}{\text{percentage change in x}}\\ &= \frac{100 \times \frac{\Delta y} {y}}{100 \times \frac{\Delta x}{x}}\\ &= \frac{\Delta y}{\Delta x}\times \frac{x}{y} \end{aligned}\]

Elasticity in terms of mean of \(y\):

\[\begin{aligned} \epsilon &= \frac{\Delta E[y|\vec{\textbf{x}}]}{\Delta x} \times \frac{x}{E[y|\vec{\textbf{x}}]}\\ &= \beta_{1}\times \frac{x}{E[y|\vec{\textbf{x}}]} \end{aligned}\]

More commonly, the elasticity is calculated at the means \((\bar{x}, \bar{y})\) as it is a representative point:

\[\begin{aligned} \hat{\epsilon} &= \hat{\beta}_{1}\frac{\bar{x}}{\bar{y}} \end{aligned}\]

For example, if \(\hat{\epsilon} = 0.71\), it means that a 1% increase in \(x\) will lead to a 0.71% increase in \(y\).
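A short illustrative sketch (simulated data, assumed coefficients) of computing the elasticity at the sample means from a fitted slope:

```python
import numpy as np

# Sketch: elasticity evaluated at the sample means using the fitted slope.
# The data generating process is an assumption for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(5, 15, 100)
y = 10 + 0.9 * x + rng.normal(0, 1, 100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
elasticity = b1 * x.mean() / y.mean()
print(elasticity)   # roughly 0.5 here: a 1% rise in x goes with about a 0.5% rise in y
```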

Transformation

Log Linear Models

The log-linear model:

\[\begin{aligned} \mathrm{ln}(y) &= \beta_{0} + \beta_{1}x \end{aligned}\]

Taking the exponential of both sides:

\[\begin{aligned} y &= e^{\beta_{0} + \beta_{1}x} \end{aligned}\]

We can see that \(y\) is in fact an exponential function of \(x\).

Log transformations are used for variables that are positive and have positively skewed distributions. Some examples are:

  • Prices
  • Wages/Salaries
  • Income
  • Sales
  • Expenditures

The change in \(\mathrm{ln}(y)\) is approximately the relative change in \(y\):

\[\begin{aligned} \mathrm{ln}(y_{1}) - \mathrm{ln}(y_{0}) &\cong \frac{y_{1} - y_{0}}{y_{0}} \end{aligned}\]

This can be derived from Taylor series of \(\mathrm{ln}(y_{1})\) given that \(y_{1}\) is close to \(y_{0}\):

\[\begin{aligned} \mathrm{ln}(y_{1}) &\cong \mathrm{ln}(y_{0}) + \frac{1}{y_{0}}(y_{1} - y_{0}) \end{aligned}\]

A related useful result, for small \(x\):

\[\begin{aligned} \mathrm{ln}(1 + x) &\cong \mathrm{ln}(1) + \frac{x}{1}\\ &= x \end{aligned}\]

Hence, \(\beta_{1}\) can be interpreted as approximate percentage change in \(y\):

\[\begin{aligned} \frac{\Delta y}{y_{0}} &\cong \mathrm{ln}(y_{1}) - \mathrm{ln}(y_{0})\\ &= (\beta_{0} + \beta_{1}x_{1}) - (\beta_{0} + \beta_{1}x_{0})\\ &= \beta_{1}(x_{1} - x_{0})\\ &= \beta_{1}\Delta x \end{aligned}\]

Example: Compound Interest

Starting with initial principal \(P_{0}\), the accumulated value with a growth rate of \(r\) in \(t\) years:

\[\begin{aligned} P_{t} &= P_{0}(1 + r)^{t}\\ ln(P_{t}) &= ln(P_{0}) + ln((1+r)^{t})\\ &= ln(P_{0}) + ln(1 + r)t\\ &= \beta_{0} + \beta_{1}t \end{aligned}\]

Where:

\[\begin{aligned} \beta_{1} &= ln(1 + r) \end{aligned}\]
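A quick sketch of this example: generate a compound-interest series with an assumed growth rate, fit \(\mathrm{ln}(P_{t})\) on \(t\) by least squares, and recover \(r\) from the slope. The principal and rate are assumptions for illustration.

```python
import numpy as np

# Sketch of the compound-interest example: ln(P_t) is linear in t with slope
# ln(1 + r), so r can be recovered as exp(slope) - 1. Values are assumed.
P0, r = 100.0, 0.05
t = np.arange(0, 30)
P = P0 * (1 + r) ** t

beta1 = np.polyfit(t, np.log(P), 1)[0]   # slope of ln(P_t) on t
print(np.exp(beta1) - 1)                 # recovers r = 0.05
```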

Prediction in the Log-Linear Model

To get back \(\hat{y}\) from \(ln(\hat{y})\), we would take the exponent:

\[\begin{aligned} \hat{y} &= e^{ln(\hat{y})}\\ &= e^{\hat{\beta}_{0} + \hat{\beta}_{1}x} \end{aligned}\]

Due to the nature of the log-normal distribution, the above yields the median and not the mean of \(y\). To get the mean of \(y\), we need to add an adjustment:

\[\begin{aligned} \widehat{E[y]} &= e^{\hat{\beta}_{0} + \hat{\beta}_{1}x + \frac{\hat{\sigma}^{2}}{2}}\\ &= \hat{y}e^{\frac{\hat{\sigma}^{2}}{2}} \end{aligned}\]
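A hedged sketch of the correction on simulated log-normal data (parameters assumed): the "natural" predictor \(e^{\hat{\beta}_{0} + \hat{\beta}_{1}x}\) targets the median, while multiplying by \(e^{\hat{\sigma}^{2}/2}\) targets the mean.

```python
import numpy as np

# Sketch: median vs mean prediction in a log-linear model. Simulated data,
# assumed coefficients and error standard deviation.
rng = np.random.default_rng(3)
N = 500
x = rng.uniform(0, 2, N)
ln_y = 0.5 + 1.2 * x + rng.normal(0, 0.4, N)
y = np.exp(ln_y)

b1, b0 = np.polyfit(x, np.log(y), 1)          # slope, intercept of ln(y) on x
resid = np.log(y) - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (N - 2)

x0 = 1.0
y_median = np.exp(b0 + b1 * x0)               # "natural" predictor (median)
y_mean = y_median * np.exp(sigma2_hat / 2)    # corrected predictor (mean)
print(y_median, y_mean)
```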

Log-Log Models

The log-log model is widely used in demand equations and production functions. Both x and y have to be positive.

\[\begin{aligned} \mathrm{ln}(y) &= \beta_{0} + \beta_{1}\mathrm{ln}(x) \end{aligned}\]

Taking a difference of \(\mathrm{ln}(y_{1})\) and \(\mathrm{ln}(y_{0})\):

\[\begin{aligned} \mathrm{ln}(y_{1}) - \mathrm{ln}(y_{0}) &= \beta_{0} + \beta_{1}\mathrm{ln}(x_{1}) - (\beta_{0} + \beta_{1}\mathrm{ln}(x_{0}))\\[5pt] &= \beta_{1}(\mathrm{ln}(x_{1}) - \mathrm{ln}(x_{0}))\\[5pt] \beta_{1} &= \frac{\mathrm{ln}(y_{1}) - \mathrm{ln}(y_{0})}{\mathrm{ln}(x_{1}) - \mathrm{ln}(x_{0})} \end{aligned}\]

Recall that:

\[\begin{aligned} \mathrm{ln}(y_{1}) - \mathrm{ln}(y_{0}) &\cong \frac{y_{1} - y_{0}}{y_{0}}\\ \mathrm{ln}(x_{1}) - \mathrm{ln}(x_{0}) &\cong \frac{x_{1} - x_{0}}{x_{0}} \end{aligned}\] \[\begin{aligned} \beta_{1} &= \frac{\%\Delta y}{\%\Delta x}\\ &= \epsilon_{yx} \end{aligned}\]

Thus \(\beta_{1}\) is equivalent to the elasticity of y with respect to a change in x, and this elasticity is constant over the entire curve.

Another way to look at this is:

\[\begin{aligned} \beta_{1} &= \frac{\frac{\Delta y}{y}}{\frac{\Delta x}{x}}\\ &= \frac{\frac{dy}{y}}{\frac{dx}{x}}\\ \end{aligned}\]

In other words, the slope \(\beta_{1}\) is constant in log-log space: it is the constant elasticity. Converting the model back to levels:

\[\begin{aligned} \mathrm{ln}(y) &= \beta_{0} + \beta_{1}\mathrm{ln}(x)\\ y &= e^{\beta_{0} + \beta_{1}\mathrm{ln}(x)}\\ &= e^{\beta_{0} + \mathrm{ln}(x^{\beta_{1}})}\\ &= e^{\beta_{0}} e^{\mathrm{ln}(x^{\beta_{1}})}\\ &= e^{\beta_{0}} x^{\beta_{1}}\\ &= Ax^{\beta_{1}} \end{aligned}\]

Where \(A = e^{\beta_{0}}\).

If \(\beta_{1} > 0\), y is an increasing function of x.

If \(\beta_{1} > 1\), the slope is increasing as well.

If \(0 < \beta_{1} < 1\), y is increasing but the slope is decreasing.

If \(\beta_{1} < 0\), there is an inverse relationship between y and x. For example, if \(\beta_{1} = -1\), then the curve has unit elasticity:

\[\begin{aligned} y &= Ax^{-1} \end{aligned}\]

This means that a 1% increase in x is associated with a 1% decrease in y.

Prediction Interval

The forecast error:

\[\begin{aligned} y_{0} - \hat{y}_{0} &= (\beta_{0} + \beta_{1}x_{0} + e_{0}) - (\hat{\beta}_{0} + \hat{\beta}_{1}x_{0}) \end{aligned}\]

To find the variance, we first find the variance for \(\hat{y}_{0}\):

\[\begin{aligned} \text{Var}(\hat{y}_{0}|\mathbf{x}) &= \text{Var}\Big((\hat{\beta}_{0} + \hat{\beta}_{1}x_{0})|\mathbf{x}\Big)\\ &= \text{Var}(\hat{\beta}_{0}|\mathbf{x}) + x_{0}^{2}\text{Var}(\hat{\beta}_{1}|\mathbf{x}) + 2x_{0}\text{Cov}(\hat{\beta}_{0}, \hat{\beta}_{1}|\mathbf{x}) \end{aligned}\]

Given the earlier derivation of variance and covariance of the coefficients:

\[\begin{aligned} \text{Var}(\hat{y}_{0}|\mathbf{x}) &= \frac{\sigma^{2}}{N\sum_{i}(x_{i} - \bar{x})^{2}}\Big(\sum_{i}x_{i}^{2} + N x_{0}^{2} - 2N\bar{x}x_{0}\Big)\\ &= \frac{\sigma^{2}}{N\sum_{i}(x_{i} - \bar{x})^{2}}\Big(\sum_{i}x_{i}^{2} - N\bar{x}^{2} + N(x_{0}^{2} - 2\bar{x}x_{0} + \bar{x}^{2})\Big)\\ &= \frac{\sigma^{2}}{N\sum_{i}(x_{i} - \bar{x})^{2}}\Big(\sum_{i}(x_{i} - \bar{x})^{2} + N(x_{0} - \bar{x})^{2}\Big)\\ &= \sigma^{2}\Bigg(\frac{1}{N} + \frac{(x_{0}-\bar{x})^{2}}{\sum_{i}(x_{i} - \bar{x})^{2}}\Bigg) \end{aligned}\]

And the variance of \(y_{0}\):

\[\begin{aligned} \text{Var}(y_{0}|\mathbf{x}) &= \text{Var}(\beta_{0} + \beta_{1}x_{0} + e_{0}|\mathbf{x})\\ &= 0 + 0 + \text{Var}(e_{0}|\mathbf{x})\\ &= \sigma^{2} \end{aligned}\]

Hence:

\[\begin{aligned} \text{Var}(y_{0} - \hat{y}_{0}) &= \text{Var}(\hat{y}_{0}|\mathbf{x}) + \sigma^{2}\\ &= \sigma^{2}\Bigg(1 + \frac{1}{N} + \frac{(x_{0}-\bar{x})^{2}}{\sum_{i}(x_{i} - \bar{x})^{2}}\Bigg) \end{aligned}\]
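Putting the pieces together, here is a sketch of a 95% prediction interval at a point \(x_{0}\), using the forecast-error variance above; the data is simulated and SciPy is assumed to be available for the t critical value.

```python
import numpy as np
from scipy import stats

# Sketch: 95% prediction interval for y at x0 using the forecast-error variance.
# Simulated data; coefficients and error scale are assumptions.
rng = np.random.default_rng(4)
N = 50
x = rng.uniform(0, 10, N)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, N)

x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar
sigma2_hat = np.sum((y - b0 - b1 * x) ** 2) / (N - 2)

x0 = 7.0
y0_hat = b0 + b1 * x0
se_forecast = np.sqrt(sigma2_hat * (1 + 1 / N + (x0 - x_bar) ** 2 / np.sum((x - x_bar) ** 2)))
t_crit = stats.t.ppf(0.975, df=N - 2)
print(y0_hat - t_crit * se_forecast, y0_hat + t_crit * se_forecast)
```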

Endogenous Regressors

When the regressors (explanatory variables) are endogenous, the least squares estimator is neither unbiased nor consistent, and the confidence intervals and test statistics do not have their anticipated properties, even in large samples.

Recall the strict exogeneity assumption:

\[\begin{aligned} E[e_{i}|\mathbf{x}] &= 0 \end{aligned}\]

And the simpler contemporaneous exogeneity assumption:

\[\begin{aligned} E[e_{i}|x_{i}] &= 0 \end{aligned}\]

Note that if the samples are i.i.d., then contemporaneous exogeneity is equivalent to strict exogeneity.

The “gold standard” in research is a randomized controlled experiment. In an ideal world, we would randomly assign \(x_{i}\) values (the treatment) and examine changes in outcomes \(y_{i}\) (the effect). Other random factors are isolated to the error term \(e\), and we can isolate the effect of changes in \(x\) alone and claim that changes in \(x\) cause changes in the outcome \(y\):

\[\begin{aligned} \beta_{1} &= \frac{\Delta E[y_{i}|x_{i}]}{\Delta x_{i}} \end{aligned}\]

If there is strict exogeneity, \(x\) is as good as randomly assigned. It is as if we had randomly assigned the treatments \(x_{i}\) to experimental subjects.

Large Sample Properties

However, with large samples of data, strict exogeneity is not required to identify and estimate a causal effect. Instead, the following assumptions would suffice:

\[\begin{aligned} E[e_{i}] &= 0\\ \text{Cov}(x_{i}, e_{i}) &= 0 \end{aligned}\]

The above is known as being "contemporaneously uncorrelated", and it is weaker than contemporaneous exogeneity. Note that contemporaneous exogeneity implies both conditions:

\[\begin{aligned} E[e_{i}|x_{i}] = 0 &\implies \text{Cov}(x_{i}, e_{i}) = 0\\ E[e_{i}|x_{i}] = 0 &\implies E[e_{i}] = 0 \end{aligned}\]

Under these assumptions and the other linear regression assumptions, the least squares estimators:

  • are consistent: they converge in probability to the true parameter values as \(N \rightarrow \infty\)
  • have approximately normal distributions in large samples
  • provide interval estimators and test statistics that are valid in large samples

Note that if the sample size is not large enough, the asymptotic properties of the estimators may be misleading: estimates may appear statistically significant when they are not, and confidence intervals may be too narrow or too wide.

The consequence of contemporaneous correlation between \(x_{i}\) and \(e_{i}\) is that the least squares estimator is biased, and it will stay biased no matter how large the sample is; hence it is inconsistent. However, endogenous regressors are still useful for prediction. What we cannot do is interpret the slope as a causal effect.

Measurement Error

If we measure an explanatory variable with error, it will be correlated with the error term. Let \(y_{i}\) be the annual savings of the ith person, and let \(x_{i}^{*}\) be the permanent annual income of the ith person. A simple linear regression model would be:

\[\begin{aligned} y_{i} &= \beta_{1} + \beta_{2}x_{i}^{*} + \nu_{i} \end{aligned}\]

Because permanent income is hard to measure, the measure of current income is used instead. Let’s define current income as permanent income with a random error \(u_{i}\):

\[\begin{aligned} x_{i} &= x_{i}^{*} + u_{i} \end{aligned}\]

Substituting into the regression model:

\[\begin{aligned} y_{i} &= \beta_{1} + \beta_{2}(x_{i} - u_{i}) + \nu_{i}\\ &= \beta_{1} + \beta_{2}x_{i} + (\nu_{i} - \beta_{2}u_{i})\\ &= \beta_{1} + \beta_{2}x_{i} + e_{i} \end{aligned}\]

Where the error is in fact:

\[\begin{aligned} e_{i} &= \nu_{i} - \beta_{2}u_{i} \end{aligned}\]

Let's check whether \(x_{i}\) is contemporaneously correlated with \(e_{i}\):

\[\begin{aligned} \text{Cov}(x_{i}, e_{i}) &= E[x_{i}e_{i}] - E[x_{i}]E[e_{i}]\\ &= E[x_{i}e_{i}] - E[x_{i}]\times 0\\ &= E[x_{i}e_{i}]\\ &= E[(x_{i}^{*} + u_{i})(\nu_{i} - \beta_{2}u_{i})]\\ &= E[x_{i}^{*}\nu_{i} - \beta_{2}x_{i}^{*}u_{i} + u_{i}\nu_{i} - \beta_{2}u_{i}^{2}] \end{aligned}\]

Since the assumption is that \(x_{i}^{*}\) is exogenous (and the measurement error \(u_{i}\) is uncorrelated with both \(x_{i}^{*}\) and \(\nu_{i}\)), then:

\[\begin{aligned} E[x_{i}^{*}\nu_{i}] &= 0 \end{aligned}\]

So:

\[\begin{aligned} E[x_{i}^{*}\nu_{i} - \beta_{2}x_{i}^{*}u_{i} + u_{i}\nu_{i} - \beta_{2}u_{i}^{2}] &= E[x_{i}^{*}\nu_{i}] - \beta_{2}E[x_{i}^{*}u_{i}] + E[u_{i}\nu_{i}] - \beta_{2}E[u_{i}^{2}]\\ &= 0 - 0 + 0 - \beta_{2}E[u_{i}^{2}]\\ &= -\beta_{2}\sigma_{u}^{2}\\ &\neq 0 \end{aligned}\]

If \(\beta_{2} > 0\), the least squares estimator will underestimate \(\beta_{2}\); this is known as "attenuation bias".
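A Monte Carlo sketch of attenuation bias (all values are assumptions for illustration): measuring the regressor with error pulls the estimated slope toward zero, here by roughly the factor \(\mathrm{Var}(x^{*})/(\mathrm{Var}(x^{*}) + \sigma_{u}^{2})\).

```python
import numpy as np

# Monte Carlo sketch of attenuation bias: the slope estimated on the error-ridden
# regressor is pulled toward zero. All values are assumptions for illustration.
rng = np.random.default_rng(5)
beta2 = 2.0                      # true slope on the correctly measured x*
N, reps = 300, 2000
sigma_u = 1.0                    # std dev of the measurement error

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

estimates = []
for _ in range(reps):
    x_star = rng.normal(0, 2, N)                 # true (permanent) regressor
    y = 1.0 + beta2 * x_star + rng.normal(0, 1, N)
    x_obs = x_star + rng.normal(0, sigma_u, N)   # observed (current) regressor
    estimates.append(slope(x_obs, y))

print(np.mean(estimates))   # noticeably below 2.0, around 2 * 4/(4 + 1) = 1.6 here
```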

Simultaneous Equations Bias

Simultaneous equations describe variables that are jointly determined. For example, if \(P_{i}\) is the equilibrium price and \(Q_{i}\) is the equilibrium quantity, both \(P_{i}\) and \(Q_{i}\) are endogenous. There is one equation for the demand curve and one for the supply curve:

\[\begin{aligned} Q_{di} &= \beta_{d0} - \beta_{d1}P_{i} + e_{i}\\ Q_{si} &= \beta_{s0} + \beta_{s1}P_{i} + \nu_{i} \end{aligned}\]

Changes in price affect the quantities supplied and demanded. Similarly, changes in the quantities supplied and demanded affect price. This feedback causes the endogeneity problem:

\[\begin{aligned} \text{Cov}(P_{i}, e_{i}) \neq 0 \end{aligned}\]

This is known as the “simultaneous equations bias”.

Serial Correlation

Recall the dynamic models with stationary variables:

\[\begin{aligned} y_{t} &= \beta_{0} + \beta_{1}y_{t-1} + \beta_{2}x_{t} + e_{t} \end{aligned}\]

As long as \(y_{t-1}\) is not correlated with the error term:

\[\begin{aligned} \text{Cov}(y_{t-1}, e_{t}) = 0 \end{aligned}\]

Then the least squares estimator is consistent. But if the errors follow an AR(1) process:

\[\begin{aligned} e_{t} = \rho e_{t-1} + \nu_{t} \end{aligned}\]

Because \(y_{t-1}\) depends on \(e_{t-1}\), then:

\[\begin{aligned} \text{Cov}(y_{t-1}, e_{t}) \neq 0 \end{aligned}\]

This is known as “serial correlation”.

Omitted Variables

When a variable that is correlated with an existing explanatory variable is omitted from the model, the regression error will be correlated with the explanatory variable, because the error term absorbs the omitted variable.

Method of Moments

The kth moment of a random variable \(Y\):

\[\begin{aligned} E[Y^{k}] &= \mu_{k}\\ &= \text{kth moment of } Y \end{aligned}\]

The law of large numbers states that if:

\[\begin{aligned} E[X_{i}] &= \mu\\ &< \infty\\ \text{Var}(X_{i}) &= \sigma^{2}\\ &< \infty \end{aligned}\]

Then the sample mean:

\[\begin{aligned} \bar{X} &= \frac{\sum_{i}X_{i}}{N} \end{aligned}\]

Converges in probability to \(\mu\) as \(N\) increases. Hence the sample mean is a consistent estimator of \(\mu\). Similarly, it can be shown using the law of large numbers that the sample moments converge to the population moments:

\[\begin{aligned} \hat{\mu}_{k} &= \frac{\sum_{i}Y_{i}^{k}}{N}\\ &\overset{p}{\rightarrow} \mu_{k} \end{aligned}\]

To estimate the variance using method of moments:

\[\begin{aligned} \hat{\sigma}^{2} &= \frac{\sum_{i}Y_{i}^{2}}{N} - \bar{Y}^{2} \end{aligned}\]

Method of Moments Estimation in Simple Regression Model

Given the contemporaneously uncorrelated assumptions \(E[e_{i}] = 0\) and \(E[x_{i}e_{i}] = 0\):

\[\begin{aligned} E[e_{i}] &= 0\\ E[y_{i} - \beta_{0} - \beta_{1}x_{i}] &= 0 \end{aligned}\] \[\begin{aligned} E[x_{i}e_{i}] &= 0\\ E[x_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i})] &= 0 \end{aligned}\]

The LLN says that under random sampling, sample moments converge to population moments:

\[\begin{aligned} \frac{1}{N}\sum_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i})&\overset{p}{\rightarrow}E[y_{i} - \beta_{0} - \beta_{1}x_{i}] &= 0\\ \frac{1}{N}\sum_{i}[x_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i})]&\overset{p}{\rightarrow}E[x_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i})] &= 0 \end{aligned}\]

Setting the sample moments to zero and solving the resulting simultaneous (normal) equations gives:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i}(x_{i} - \bar{x})^{2}}\\ \hat{\beta}_{0} &= \bar{y}- \hat{\beta}_{1}\bar{x} \end{aligned}\]

These are the same as the OLS estimates and are consistent estimators.

Instrumental Variables Estimation

When \(x_{i}\) is random and contemporaneously correlated (endogenous):

\[\begin{aligned} \text{Cov}(x_{i}, e_{i}) \neq 0 \end{aligned}\]

Suppose there is another variable \(z_{i}\) such that:

  • \(z_{i}\) does not have a direct effect on \(y_{i}\)
  • \[\text{Cov}(z_{i}, e_{i}) = 0\]
  • \(z_{i}\) is strongly (not weakly) correlated with \(x_{i}\)

A variable \(z_{i}\) with these properties is known as an instrumental variable, or simply an "instrument".

Given:

\[\begin{aligned} E[e_{i}] &= E[y_{i} - \beta_{0} - \beta_{1}x_{i}]\\ &= 0\\ E[z_{i}e_{i}] &= E[z_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i})]\\ &= 0 \end{aligned}\]

Invoking the LLN, we set the sample moments to zero:

\[\begin{aligned} \frac{1}{N}\sum_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0\\ \frac{1}{N}\sum_{i}z_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0 \end{aligned}\]

Solving the simultaneous equations, we get the “instrumental variable (IV)” estimators:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{N\sum_{i}z_{i}y_{i} - \sum_{i}z_{i}\sum_{i}y_{i}}{N\sum_{i}z_{i}x_{i} - \sum_{i}z_{i}\sum_{i}x_{i}}\\[5pt] &= \frac{\sum_{i}(z_{i} - \bar{z})(y_{i} - \bar{y})}{\sum_{i}(z_{i} - \bar{z})(x_{i} - \bar{x})}\\[5pt] \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1}\bar{x} \end{aligned}\]

If the above 3 properties hold, the IV estimators are consistent, converging to the true parameters as \(N \rightarrow \infty\), and:

\[\begin{aligned} \hat{\beta}_{1} &\sim N(\beta_{1}, \text{Var}(\hat{\beta}_{1}))\\ \text{Var}(\hat{\beta}_{1}) &= \frac{\sigma_{IV}^{2}\sum_{i}(z_{i} - \bar{z})^{2}}{\Big(\sum_{i}(z_{i} - \bar{z})(x_{i} - \bar{x})\Big)^{2}}\\ \hat{\sigma}_{IV}^{2} &= \frac{\sum_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i})^{2}}{N-2} \end{aligned}\]

With some algebra we can show that:

\[\begin{aligned} \text{Var}(\hat{\beta}_{1}) &= \frac{\hat{\sigma}_{IV}^{2}}{r_{zx}^{2}\sum_{i}(x_{i} - \bar{x})^{2}} \end{aligned}\]

In other words, the stronger the correlation between \(z\) and \(x\), the smaller the variance of \(\hat{\beta}_{1}\). In general, the variance of the IV estimator will be higher than that of the OLS estimator, but with an endogenous \(x_{i}\) the IV estimator remains consistent while OLS does not.
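The following sketch (with an assumed data generating process) contrasts OLS and IV on an endogenous regressor: OLS is inconsistent, while the IV estimator recovers the true slope.

```python
import numpy as np

# Sketch: x is endogenous (shares an unobserved factor with the error), z is a
# valid instrument. OLS is biased upward; IV is close to the true slope of 2.
# The DGP and all values below are assumptions for illustration.
rng = np.random.default_rng(6)
N = 5000
beta0, beta1 = 1.0, 2.0

z = rng.normal(0, 1, N)
common = rng.normal(0, 1, N)                 # unobserved factor causing endogeneity
x = 0.7 * z + common + rng.normal(0, 1, N)   # x correlated with the error via `common`
e = common + rng.normal(0, 1, N)
y = beta0 + beta1 * x + e

ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
iv = np.sum((z - z.mean()) * (y - y.mean())) / np.sum((z - z.mean()) * (x - x.mean()))
print(ols, iv)
```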

Two-Stage Least Squares (2SLS)

The 2SLS method uses two least squares regressions to calculate the IV estimates. The first stage regresses \(x\) on the instrument \(z\); this is the relationship we expect to be strong:

\[\begin{aligned} x &= \theta_{0} + \theta_{1}z + \nu \end{aligned}\]

In the second stage, we replace the endogenous variable \(x\) with the OLS fitted values from stage one:

\[\begin{aligned} \hat{x} &= \hat{\theta}_{0} + \hat{\theta}_{1}z\\ y &= \beta_{0} + \beta_{1}\hat{x} + e \end{aligned}\]

The OLS estimates from stage II will be the same as the IV estimates.

Note that when calculating the variance, we need to use the original \(x\) and not \(\hat{x}\):

\[\begin{aligned} \hat{\sigma}_{IV}^{2} &= \frac{\sum_{i}(y_{i}- \hat{\beta}_{0} - \hat{\beta}_{1}x_{i})^{2}}{N-2}\\ \hat{\sigma}_{Wrong}^{2} &= \frac{\sum_{i}(y_{i}- \hat{\beta}_{0} - \hat{\beta}_{1}\hat{x}_{i})^{2}}{N-2}\\ \end{aligned}\]

Note that if the two stages are run manually as separate OLS regressions, the software will report the wrong variance for the second stage, and you would need to correct it.

Finally, the sample variance of \(\hat{\beta}_{1}\):

\[\begin{aligned} \text{Var}(\hat{\beta}_{1}) &= \frac{\hat{\sigma}_{IV}^{2}}{\sum_{i}(\hat{x}_{i} - \bar{x})^{2}} \end{aligned}\]
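A sketch of 2SLS done "by hand" on simulated data (reusing the same kind of assumed DGP as above): stage one regresses \(x\) on \(z\), stage two regresses \(y\) on \(\hat{x}\); the slope matches the IV estimate, and the variance is computed with residuals based on the original \(x\).

```python
import numpy as np

# Sketch: manual 2SLS. Stage 1 fits x on z, stage 2 fits y on x_hat.
# The correct error variance uses residuals built from the ORIGINAL x.
rng = np.random.default_rng(7)
N = 2000
z = rng.normal(0, 1, N)
common = rng.normal(0, 1, N)
x = 0.7 * z + common + rng.normal(0, 1, N)
y = 1.0 + 2.0 * x + common + rng.normal(0, 1, N)

# Stage 1: x on z
t1, t0 = np.polyfit(z, x, 1)
x_hat = t0 + t1 * z

# Stage 2: y on x_hat
b1, b0 = np.polyfit(x_hat, y, 1)

resid_correct = y - b0 - b1 * x        # residuals with the original x
sigma2_iv = np.sum(resid_correct ** 2) / (N - 2)
var_b1 = sigma2_iv / np.sum((x_hat - x_hat.mean()) ** 2)
print(b1, np.sqrt(var_b1))
```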

More than One Instrumental Variable

Suppose we have two good instruments, \(z_{1}, z_{2}\). Using the original sample moment conditions:

\[\begin{aligned} \frac{1}{N}\sum_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0\\ \frac{1}{N}\sum_{i}z_{i1}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0\\ \frac{1}{N}\sum_{i}z_{i2}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0 \end{aligned}\]

There are 3 equations with only 2 unknowns (\(\hat{\beta}_{0}, \hat{\beta}_{1}\)), so in general no solution satisfies all 3 equations exactly. However, we are able to use 2SLS to resolve this:

\[\begin{aligned} x &= \theta_{0} + \theta_{1}z_{1} + \theta_{2}z_{2} + \nu \end{aligned}\]

Estimating the first-stage equation by OLS:

\[\begin{aligned} \hat{x} &= \hat{\theta}_{0} + \hat{\theta}_{1}z_{1} + \hat{\theta}_{2}z_{2} \end{aligned}\]

Now using \(\hat{x}\) as an instrument leads to two sample-moment conditions:

\[\begin{aligned} \frac{1}{N}\sum_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0\\ \frac{1}{N}\sum_{i}\hat{x}_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) &= 0 \end{aligned}\]

Solving the equations:

\[\begin{aligned} \hat{\beta}_{1} &= \frac{\sum_{i}(\hat{x}_{i} - \bar{\hat{x}})(y_{i} - \bar{y})}{\sum_{i}(\hat{x}_{i} - \bar{\hat{x}})(x_{i} - \bar{x})}\\[5pt] \hat{\beta}_{0} &= \bar{y} - \hat{\beta}_{1}\bar{x} \end{aligned}\]

Multiple Regression with 1 IV

To generalize to multiple variables, assume the first \(K - 1\) variables are exogenous, \(x_{K}\) is an endogenous variable, and there are \(L\) instrumental variables. Using 2SLS:

\[\begin{aligned} x_{K} &= \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{K-1}x_{K-1} + \theta_{1}z_{1} + \cdots + \theta_{L}z_{L} + \nu_{K} \end{aligned}\]

Estimating the first-stage regression:

\[\begin{aligned} \hat{x}_{K} &= \hat{\beta}_{0} + \hat{\beta}_{1}x_{1} + \cdots + \hat{\beta}_{K-1}x_{K-1} + \hat{\theta}_{1}z_{1} + \cdots + \hat{\theta}_{L}z_{L} \end{aligned}\]

The second-stage regression:

\[\begin{aligned} y &= \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{K}\hat{x}_{K} + e \end{aligned}\]

Similarly the correct general sample error variance:

\[\begin{aligned} \hat{\sigma}_{IV}^{2} &= \frac{\sum_{i}(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i1} - \cdots - \hat{\beta}_{K}x_{iK})^{2}}{N-K} \end{aligned}\]

And the coefficient variance estimate:

\[\begin{aligned} \text{Var}(\hat{\beta}_{K}) &= \frac{\hat{\sigma}_{IV}^{2}}{SSE_{\hat{x}_{K}}} \end{aligned}\]

Where \(SSE_{\hat{x}_{K}}\) is the sum of squared residuals from regressing \(\hat{x}_{K}\) on the other exogenous variables (consistent with the Frisch-Waugh-Lovell expression below).

To assess instrument strength in the simple linear regression, we just had to look at the correlation. But for multiple regression, we have to account for the other exogenous variables. The coefficient \(\theta_{1}\) in the first-stage regression measures the effect of \(z_{1}\) on \(x_{K}\) after accounting for the effects of the other variables. To measure instrument strength, we use the F-test statistic on the instrument's coefficient. The effect of \(z_{1}\) has to be very strong: the \(F > 10\) rule of thumb was proposed by the econometric researchers Stock and Yogo.

Using the Frisch-Waugh-Lovell approach, we can derive the following alternative expression for the sample variance:

\[\begin{aligned} \text{Var}(\hat{\beta}_{K}) &= \frac{\hat{\sigma}_{IV}^{2}}{SSE_{\hat{x}_{K}}}\\ &= \frac{\hat{\sigma}_{IV}^{2}}{\hat{\theta}_{1}^{2}\sum_{i}\tilde{z}_{i1}^{2}} \end{aligned}\]

Where \(\tilde{z}_{i1}\) are the residuals from partialling out \(\mathbf{x}_{exog}\) from the instrument \(z_{1}\). As we can see, the larger the estimate \(\hat{\theta}_{1}\), the stronger the instrument. The same goes for the amount of variation in \(z_{1}\) not explained by \(\mathbf{x}_{exog}\): we want \(z_{1}\) to be uncorrelated with \(\mathbf{x}_{exog}\) and to exhibit large variation.
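A sketch of how the first-stage F statistic for instrument strength can be computed by comparing restricted and unrestricted first-stage regressions; the data, coefficients, and variable names are assumptions for illustration.

```python
import numpy as np

# Sketch: first-stage F statistic for instrument strength, from restricted vs
# unrestricted first-stage regressions. All values below are assumptions.
rng = np.random.default_rng(8)
N = 500
x_exog = rng.normal(0, 1, N)                 # an included exogenous regressor
z1 = rng.normal(0, 1, N)                     # the instrument
x_endog = 0.5 * x_exog + 0.6 * z1 + rng.normal(0, 1, N)

def sse(X, y):
    """Sum of squared residuals from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(N)
sse_u = sse(np.column_stack([ones, x_exog, z1]), x_endog)   # unrestricted first stage
sse_r = sse(np.column_stack([ones, x_exog]), x_endog)       # instrument excluded

J, k_u = 1, 3                                                # 1 restriction, 3 params
F = ((sse_r - sse_u) / J) / (sse_u / (N - k_u))
print(F)   # values well above 10 suggest a strong instrument (Stock-Yogo rule of thumb)
```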

General Model

Suppose now that the first \(G\) variables are exogenous and the second group of \(B = K - G\) variables is endogenous:

\[\begin{aligned} y &= \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{G}x_{G} + \beta_{G + 1}x_{G+1} + \cdots + \beta_{K}x_{K} + e \end{aligned}\]

It is necessary that \(L \geq B\).

The first stage (\(B\) OLS regressions):

\[\begin{aligned} x_{G+j} &= \beta_{0j} + \beta_{1j}x_{1} + \cdots + \beta_{Gj}x_{G} + \theta_{1j}z_{1} + \cdots + \theta_{Lj}z_{L} + \nu_{j}\\[5pt] \hat{x}_{G+j} &= \hat{\beta}_{0j} + \hat{\beta}_{1j}x_{1} + \cdots + \hat{\beta}_{Gj}x_{G} + \hat{\theta}_{1j}z_{1} + \cdots + \hat{\theta}_{Lj}z_{L}\\[5pt] j &= 1, \cdots, B \end{aligned}\]

The Second-Stage:

\[\begin{aligned} y &= \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{G}x_{G} + \beta_{G+1}\hat{x}_{G+1} + \cdots + \beta_{K}\hat{x}_{K} + e \end{aligned}\]

The F-test is not valid for models having more than one endogenous variable as the test is for all variables rather than for each individual endogenous variable. A more general “partial correlation” test is needed.


References

Carter R., Griffiths W., Lim G. (2018) Principles of Econometrics
