In this post, we will cover Shumway and Stoffer (2017), Time Series Analysis and Its Applications, focusing on classical regression.
Let us look at the datasets we will be using in this post. All the datasets can be found in the astsa package.
Quarterly earnings per share of Johnson & Johnson from 1960 to 1980, for a total of 84 quarters (21 years).
Global mean land-ocean temperature index from 1880 to 2015, with base period 1951-1980. The index is computed as deviation from the 1951-1980 average.
A 0.1 second, 1000-point sample of recorded speech for the phrase “aahhhh”.
Daily returns of the DJIA index from April 20, 2006 to April 20, 2016.
Monthly values of the Southern Oscillation Index (SOI) and associated Recruitment (number of new fish) index ranging over the years 1950-1987. SOI measures changes in air pressure.
Seismic recording of an earthquake and explosion.
Data is collected from various locations in the brain via functional Magnetic Resonance Imaging (fMRI). A stimulus was applied for 32 seconds and then stopped for 32 seconds, giving a signal period of 64 seconds. One observation is taken every 2 seconds for 256 seconds (n = 128).
The autocovariance function is defined as:
\[\begin{aligned} \gamma_{x}(s, t) &= \mathrm{cov}(x_{s}, x_{t})\\ &= E[(x_{s} - \mu_{s})(x_{t} - \mu_{t})] \end{aligned}\]The autocovariance function measures the linear dependence between two points in time. Smooth series have large autocovariance values even when \(s, t\) are far apart, whereas choppy series tend to have autocovariance values near 0.
If \(X\) and \(Y\) are linear combinations of the random variables \(X_{j}\) and \(Y_{k}\) (taken to have zero means for brevity):
\[\begin{aligned} X &= \sum_{j=1}^{m}a_{j}X_{j}\\ Y &= \sum_{k = 1}^{r}b_{k}Y_{k}\\ \mathrm{cov}(X, Y) &= E\Big[\sum_{j=1}^{m}a_{j}X_{j}\sum_{k = 1}^{r}b_{k}Y_{k}\Big]\\ &= \sum_{j=1}^{m}\sum_{k = 1}^{r}a_{j}b_{k}E[X_{j}Y_{k}]\\ &= \sum_{j = 1}^{m}\sum_{k = 1}^{r}a_{j}b_{k}\mathrm{cov}(X_{j}, Y_{k}) \end{aligned}\]If \(x_{s}, x_{t}\) are bivariate normal, \(\gamma_{x}(s, t) = 0\) implies independence as well.
And the variance:
\[\begin{aligned} \gamma_{x}(t, t) &= E[(x_{t} - \mu_{t})^{2}]\\ &= \mathrm{var}(x_{t}) \end{aligned}\]The variance of a linear combination of \(x_{t}\) can be represented as sum of autocovariances:
\[\begin{aligned} \mathrm{var}(a_{1}x_{1} + \cdots + a_{n}x_{n}) &= \mathrm{cov}(a_{1}x_{1} + \cdots + a_{n}x_{n}, a_{1}x_{1} + \cdots + a_{n}x_{n})\\ &= \sum_{j= 1}^{n}\sum_{k = 1}^{n}a_{j}a_{k}\mathrm{cov}(x_{j}, x_{k})\\ &= \sum_{j= 1}^{n}\sum_{k = 1}^{n}a_{j}a_{k}\gamma(j - k) \end{aligned}\]Note that:
\[\begin{aligned} \gamma(h) &= \gamma(-h) \end{aligned}\]The autocorrelation function (ACF) is defined as:
\[\begin{aligned} \rho(s, t) &= \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}} \end{aligned}\]The cross-covariance function between two time series, \(x_{t}\) and \(y_{t}\):
\[\begin{aligned} \gamma_{xy}(s, t) &= \mathrm{cov}(x_{s}, y_{t})\\ &= E[(x_{s} - \mu_{xs})(y_{t} - \mu_{yt})] \end{aligned}\]And the cross-correlation function (CCF):
\[\begin{aligned} \rho_{xy}(s, t) &= \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_{x}(s, s)\gamma_{y}(t, t)}} \end{aligned}\]In most situations, we have only a single realization (sample) with which to estimate the population means and covariance functions.
To estimate the mean, we use the sample mean:
\[\begin{aligned} \bar{x} &= \frac{1}{n}\sum_{t=1}^{n}x_{t} \end{aligned}\]The variance of the mean:
\[\begin{aligned} \mathrm{var}(\bar{x}) &= \mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n}x_{t}\Big)\\ &= \frac{1}{n^{2}}\mathrm{cov}\Big(\sum_{t=1}^{n}x_{t},\sum_{t=1}^{n}x_{t} \Big)\\ &= \frac{1}{n^{2}}\sum_{j=1}^{n}\sum_{k=1}^{n}\mathrm{cov}(x_{j}, x_{k})\\ &= \frac{1}{n^{2}}\Big(n \gamma_{x}(0) + (n-1)\gamma_{x}(1) + (n-2)\gamma_{x}(2) + \cdots + \gamma_{x}(n-1) + \\ &\ \ \ \ \ \ (n-1)\gamma_{x}(-1) + (n-2)\gamma_{x}(-2) + \cdots + \gamma_{x}(1 - n)\Big)\\ &= \frac{1}{n^{2}}\sum_{h=-n}^{n}n\Big(1 - \frac{|h|}{n}\Big)\gamma_{x}(h)\\ &= \frac{1}{n}\sum_{h=-n}^{n}\Big(1 - \frac{|h|}{n}\Big)\gamma_{x}(h) \end{aligned}\]If the process is white noise, the variance reduces to:
\[\begin{aligned} \mathrm{var}(\bar{x}) &= \frac{1}{n}\gamma_{x}(0)\\ &= \frac{\sigma_{x}^{2}}{n} \end{aligned}\]The sample autocovariance function is defined as:
\[\begin{aligned} \hat{\gamma}(h) &= \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_{t} - \bar{x}) \end{aligned}\]Note that the estimator above divides by \(n\) instead of \(n-h\). This guarantees that \(\hat{\gamma}\) is a non-negative definite function. Because a variance is never negative, the estimate of the variance of any linear combination should also be non-negative:
\[\begin{aligned} \hat{\mathrm{var}}(a_{1}x_{1} + \cdots + a_{n}x_{n}) &= \sum_{j=1}^{n}\sum_{k = 1}^{n}a_{j}a_{k}\hat{\gamma}(j-k) \geq 0 \end{aligned}\]But note that neither dividing by \(n\) nor by \(n-h\) results in an unbiased estimator of \(\gamma(h)\).
Similarly, the sample ACF is defined as:
\[\begin{aligned} \hat{\rho}(h) &= \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)} \end{aligned}\]Using R to calculate ACF:
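A minimal sketch of computing a sample ACF in R, assuming the SOI series from the astsa package as the example (the series and lag choices are assumptions):

```r
library(astsa)
# first six sample autocorrelations of the SOI series
round(acf(soi, 6, plot = FALSE)$acf[-1], 3)
# plot of the sample ACF out to lag 48 (4 years of monthly data)
acf(soi, 48)
```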
If \(x_{t}\) is white noise, then for large \(n\), the sample ACF is approximately normally distributed with mean 0 and standard deviation:
\[\begin{aligned} \sigma_{\hat{\rho}_{x}(h)} &= \frac{1}{\sqrt{n}} \end{aligned}\]The larger the sample size \(n\), the smaller the standard deviation.
An example of the ACF of the speech dataset:
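A sketch of the corresponding R call, assuming the speech series from astsa:

```r
library(astsa)
acf(speech, 250)   # sample ACF of the speech series out to lag 250
```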
And the sample CCF:
\[\begin{aligned} \hat{\rho}_{xy}(h) &= \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_{x}(0)\hat{\gamma}_{y}(0)}} \end{aligned}\]If at least one of the processes is independent white noise:
\[\begin{aligned} \sigma_{\hat{\rho}_{xy}} &= \frac{1}{\sqrt{n}} \end{aligned}\]To use this result to test for linear cross-dependence, at least one of the series must be white noise.
As an example, we generate two series:
\[\begin{aligned} x_{t} &= 2\mathrm{cos}\Big(2\pi \frac{t}{12}\Big) + w_{t1}\\ y_{t} &= 2\mathrm{cos}\Big(2\pi \frac{t+5}{12}\Big) + w_{t2}\\ \end{aligned}\]If you look at the CCF:
The sample CCF appears to show significant cross-correlation even though we know that the two series are independent.
By prewhitening \(y_{t}\), we mean that the signal has been removed from the data by running a regression of \(y_{t}\) on \(\mathrm{cos}(2\pi t/12)\) and \(\mathrm{sin}(2\pi t/12)\) and then taking \(\tilde{y}_{t} = y_{t} - \hat{y}_{t}\). Doing this in R:
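A minimal sketch of the simulation and the prewhitening regression (the seed, series length of 120, and unit noise variances are assumptions):

```r
set.seed(1492)
num <- 120
t <- 1:num

# two independent series, each a period-12 cosine plus white noise
x <- 2 * cos(2 * pi * t / 12) + rnorm(num)
y <- 2 * cos(2 * pi * (t + 5) / 12) + rnorm(num)

# prewhiten y: regress out the periodic signal and keep the residuals
y_w <- resid(lm(y ~ cos(2 * pi * t / 12) + sin(2 * pi * t / 12)))

par(mfrow = c(2, 1))
ccf(x, y, 24)     # apparently significant cross-correlation
ccf(x, y_w, 24)   # largely insignificant after prewhitening
```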
We can see that the cross-correlation is now insignificant.
A strictly stationary time series has the same probability distribution when shifted by \(h\):
\[\begin{aligned} P(x_{t1} \leq c_{1}, \cdots, x_{tk} \leq c_{k}) &= P(x_{t1 + h} \leq c_{1}, \cdots, x_{tk + h} \leq c_{k})\\ k &= 1, 2, \cdots \end{aligned}\]Usually the strictly stationary condition is too strong. A weaker version, which imposes conditions only on the first two moments of the series, is known as a “weakly stationary” time series, such that:
\[\begin{aligned} E[x_{t}] &= \mu_{t}\\ &= \mu\\ \gamma(t + h, t) &= \mathrm{cov}(x_{t + h}, x_{t})\\ &= \mathrm{cov}(x_{h}, x_{0})\\ &= \gamma(h, 0)\\ &= \gamma(h) \end{aligned}\]In other words, the autocovariance function can be written as:
\[\begin{aligned} \gamma(h) &= \mathrm{cov}(x_{t+h}, x_{t})\\ &= E[(x_{t+h} - \mu)(x_{t} - \mu)] \end{aligned}\]And the ACF:
\[\begin{aligned} \rho(h) &= \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}}\\[5pt] &= \frac{\gamma(h)}{\gamma(0)} \end{aligned}\]Two stationary time series \(x_{t}\) and \(y_{t}\) are said to be jointly stationary if the cross-covariance function is a function only of lag h:
\[\begin{aligned} \gamma_{xy}(h) &= \mathrm{cov}(x_{t+h}, y_{t})\\ &= E[(x_{t+h} - \mu_{x})(y_{t} - \mu_{y})] \end{aligned}\]The CCF:
\[\begin{aligned} \rho_{xy}(h) &= \frac{\gamma_{xy}(h)}{\sqrt{\gamma_{x}(0)\gamma_{y}(0)}} \end{aligned}\]In general,
\[\begin{aligned} \rho_{xy}(h) &\neq \rho_{xy}(-h)\\ \rho_{xy}(h) &= \rho_{yx}(-h) \end{aligned}\]For example, if:
\[\begin{aligned} x_{t} &= w_{t} + w_{t-1}\\ y_{t} &= w_{t} - w_{t-1}\\ x_{t+1} &= w_{t+1} + w_{t}\\ y_{t+1} &= w_{t+1} - w_{t} \end{aligned}\] \[\begin{aligned} \gamma_{x}(0) &= 2\sigma_{w}^{2}\\ \gamma_{y}(0) &= 2\sigma_{w}^{2}\\ \gamma_{x}(1) &= \sigma_{w}^{2}\\ \gamma_{y}(1) &= E[y_{t+1}y_{t}]\\ &= E[(w_{t+1} - w_{t})(w_{t} - w_{t-1})]\\ &= 0 - 0 - E[w_{t}w_{t}] + 0\\ &= -\sigma_{w}^{2} \end{aligned}\]We can show that the cross-covariance function depends only on the lag \(h\), and hence the series are jointly stationary:
\[\begin{aligned} \gamma_{xy}(1) &= \mathrm{cov}(x_{t+1}, y_{t})\\ &= \mathrm{cov}(w_{t+1} + w_{t}, w_{t} - w_{t-1})\\ &= E[(w_{t+1} + w_{t})(w_{t} - w_{t-1})]\\ &= 0 + 0 + E[w_{t}w_{t}] - 0\\ &= \sigma_{w}^{2} \end{aligned}\]CCF:
\[\begin{aligned} \sqrt{\gamma_{x}(0)\gamma_{y}(0)} &= \sqrt{2\sigma_{w}^{2}\times 2\sigma_{w}^{2}}\\ &= 2\sigma_{w}^{2} \end{aligned}\] \[\begin{aligned} \rho_{xy}(h) &= \begin{cases} 0 & h = 0\\ \frac{1}{2} & h = 1\\ -\frac{1}{2} & h = -1\\ 0 & |h| \geq 2 \end{cases} \end{aligned}\]Consider another example of cross-correlation, in which one series leads or lags the other. For the following model:
\[\begin{aligned} y_{t} &= Ax_{t-l} + w_{t} \end{aligned}\]If \(l > 0\), \(x_{t}\) is said to lead \(y_{t}\) and if \(l < 0\), \(x_{t}\) is said to lag \(y_{t}\).
Assuming \(w_{t}\) is uncorrelated with the \(x_{t}\) series, the cross-covariance function:
\[\begin{aligned} \gamma_{yx}(h) &= \mathrm{cov}(y_{t+h}, x_{t})\\ &= \mathrm{cov}(A x_{t+h - l} + w_{t+h}, x_{t})\\ &= \mathrm{cov}(A x_{t+h - l}, x_{t})\\ &= A\gamma_{x}(h - l) \end{aligned}\]\(\gamma_{x}(h - l)\) will peak when \(h - l = 0\):
\[\begin{aligned} \gamma_{x}(h - l) &= \gamma_{x}(0) \end{aligned}\]If \(x_{t}\) leads \(y_{t}\), the peak will be on the positive side and vice versa.
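A quick simulated check of this lead/lag behaviour (the lag \(l = 5\), \(A = 1\), the series length, and the index bookkeeping are choices of this sketch):

```r
set.seed(2)
l <- 5
x_full <- rnorm(105)
y <- x_full[1:100] + rnorm(100)   # y_t is built from earlier values of x
x <- x_full[(l + 1):105]          # observed x series, so y_t = x_{t-5} + w_t (x leads y)

ccf(y, x, 20)   # the sample CCF of (y, x) peaks near lag +5
```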
A linear process, \(x_{t}\), is defined to be a linear combination of white noise variates \(w_{t}\):
\[\begin{aligned} x_{t} &= \mu + \sum_{j = -\infty}^{\infty}\psi_{j}w_{t-j} \end{aligned}\]With coefficients satisfying: \[\begin{aligned} \sum_{j=-\infty}^{\infty}|\psi_{j}| &< \infty \end{aligned}\]The autocovariance can be shown to be:
\[\begin{aligned} \gamma_{x}(h) &= \sigma_{w}^{2}\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_{j} \end{aligned}\]For a causal linear process, the process does not depend on the future, and we have:
\[\begin{aligned} \psi_{j} &= 0\\ j &< 0 \end{aligned}\]A process, \(\{x_{t}\}\), is said to be a Gaussian process if the n-dimensional \(\mathbf{x} = (x_{t1}, x_{t2}, \cdots, x_{tn})\) have a multivariate normal distribution.
Defining the mean and variance to be:
\[\begin{aligned} E[\mathbf{x}] &\equiv \boldsymbol{\mu}\\ &= \begin{bmatrix} \mu_{t1}\\ \mu_{t2}\\ \vdots\\ \mu_{tn} \end{bmatrix}\\[5pt] \mathrm{var}(\mathbf{x}) &\equiv \boldsymbol{\Gamma}\\ &= \{\gamma(t_{i}, t_{j}); i, j = 1, \cdots, n\} \end{aligned}\]The multivariate normal density function can then be written as:
\[\begin{aligned} f(\mathbf{x}) &= (2\pi)^{-\frac{n}{2}}|\boldsymbol{\Gamma}|^{-\frac{1}{2}}e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Gamma}^{-1}(\mathbf{x}- \boldsymbol{\mu})} \end{aligned}\]If the Gaussian time series, \(x_{t}\), is weakly stationary, then \(\mu_{t}\) is constant and:
\[\begin{aligned} \gamma(t_{i}, t_{j}) = \gamma(|t_{i} - t_{j}|) \end{aligned}\]So that the vector \(\boldsymbol{\mu}\) and matrix \(\boldsymbol{\Gamma}\) are independent of time and consist of constant values.
Note that it is not enough for the marginal distributions to be Gaussian in order for the joint distribution to be Gaussian.
The Wold Decomposition states that any stationary time series can be written as the sum of a causal linear process and a deterministic component \(v_{t}\):
\[\begin{aligned} x_{t} &= \mu + \sum_{j = 0}^{\infty}\psi_{j}w_{t-j} + v_{t}\\ \sum_{j=0}^{\infty}|\psi_{j}| &< \infty\\ \psi_{0} &= 1 \end{aligned}\]If the linear process is Gaussian, then:
\[\begin{aligned} w_{t} &\sim \text{iid } N(0, \sigma_{w}^{2})\\ \text{cov}(w_{s}, v_{t}) &= 0 \end{aligned}\] \[\begin{aligned} \forall s, t \end{aligned}\]White noise is a collection of uncorrelated random variables with mean 0 and finite variance \(\sigma_{w}^{2}\). We sometimes additionally require the random variables to be iid. If the variables are further Gaussian, then:
\[\begin{aligned} w_{t} \sim N(0, \sigma_{w}^{2}) \end{aligned}\]Autocovariance function:
\[\begin{aligned} \gamma_{w}(s, t) &= \mathrm{cov}(w_{s}, w_{t})\\ &= \begin{cases} \sigma_{w}^{2} & s = t\\ 0 & s \neq t \end{cases} \end{aligned}\]Because white noise is weakly stationary, we can represent it in terms of lag \(h\):
\[\begin{aligned} \gamma_{w}(h) &= \mathrm{cov}(w_{t + h}, w_{t})\\ &= \begin{cases} \sigma_{w}^{2} & h = 0\\ 0 & h \neq 0 \end{cases} \end{aligned}\]The explosion series bears a slight resemblance to white noise:
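A minimal sketch of generating and plotting Gaussian white noise (the seed and series length are assumptions):

```r
set.seed(1)
w <- rnorm(500)   # 500 independent N(0, 1) variates
plot.ts(w, main = "white noise")
```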
To introduce serial correlation into the time series, we can replace the white noise series by a moving average that smooths out the series:
\[\begin{aligned} \nu_{t} = \frac{1}{3}(w_{t-1} + w_{t} + w_{t+1}) \end{aligned}\]And the mean:
\[\begin{aligned} \mu_{\nu t} &= E[\nu_{t}]\\ &= \frac{1}{3}\Big(E[w_{t-1} + w_{t} + w_{t+1}]\Big)\\ &= \frac{1}{3}\Big(E[w_{t-1}] + E[w_{t}] + E[w_{t+1}]\Big)\\ &= 0 \end{aligned}\]Autocovariance function:
\[\begin{aligned} \gamma_{\nu}(s, t) = &\ \mathrm{cov}(\nu_{s}, \nu_{t})\\ = &\ \frac{1}{9}\mathrm{cov}(w_{s-1} + w_{s} + w_{s+1}, w_{t-1} + w_{t} + w_{t+1})\\ = &\ \frac{1}{9}\Big( \mathrm{cov}(w_{s-1}, w_{t-1}) + \mathrm{cov}(w_{s-1}, w_{t}) + \mathrm{cov}(w_{s-1}, w_{t+1}) +\\ &\ \ \ \mathrm{cov}(w_{s}, w_{t-1}) + \mathrm{cov}(w_{s}, w_{t}) + \mathrm{cov}(w_{s}, w_{t+1}) +\\ &\ \ \ \mathrm{cov}(w_{s+1}, w_{t-1}) + \mathrm{cov}(w_{s+1}, w_{t}) + \mathrm{cov}(w_{s+1}, w_{t+1}) \Big) \end{aligned}\] \[\begin{aligned} \gamma_{\nu}(t, t) &= \frac{1}{9}\Big( \mathrm{cov}(w_{t-1}, w_{t-1}) + \mathrm{cov}(w_{t}, w_{t}) + \mathrm{cov}(w_{t+1}, w_{t+1})\Big)\\ &= \frac{3}{9}\sigma_{w}^{2} \end{aligned}\] \[\begin{aligned} \gamma_{\nu}(t + 1, t) &= \frac{1}{9}\Big( \mathrm{cov}(w_{t}, w_{t}) + \mathrm{cov}(w_{t+1}, w_{t+1})\Big)\\ &= \frac{2}{9}\sigma_{w}^{2} \end{aligned}\]In summary:
\[\begin{aligned} \gamma_{\nu}(s, t) &= \begin{cases} \frac{3}{9}\sigma_{w}^{2} & s = t\\ \frac{2}{9}\sigma_{w}^{2} & |s-t| = 1\\ \frac{1}{9}\sigma_{w}^{2} & |s-t| = 2\\ 0 & |s - t| > 2 \end{cases} \end{aligned}\]And in terms of lags:
\[\begin{aligned} \gamma_{\nu}(h) &= \begin{cases} \frac{3}{9}\sigma_{w}^{2} & h = 0\\ \frac{2}{9}\sigma_{w}^{2} & h = \pm 1\\ \frac{1}{9}\sigma_{w}^{2} & h = \pm 2\\ 0 & |h| > 2 \end{cases}\\ \rho_{\nu}(h) &= \begin{cases} 1 & h = 0\\ \frac{2}{3} & h = \pm 1\\ \frac{1}{3} & h = \pm 2\\ 0 & |h| > 2 \end{cases} \end{aligned}\]Note that the autocovariance function depends only on the lag, decreases as the lag increases, and is exactly zero for lags greater than 2.
Note the absence of any periodicity in the series. We can see the resemblance to the SOI series:
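A minimal sketch of the three-point moving average applied to the simulated white noise above:

```r
v <- stats::filter(w, sides = 2, filter = rep(1/3, 3))   # symmetric 3-point average
plot.ts(v, main = "moving average")
```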
To generate somewhat periodic behavior, we can turn to an autoregressive model. For example:
\[\begin{aligned} x_{t} &= x_{t-1} - 0.9x_{t-2} + w_{t} \end{aligned}\]We can see the resemblance to the speech data:
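A minimal sketch of simulating this autoregression with R's recursive filter (the seed and the extra 50 start-up values, discarded to reduce the effect of the initial conditions, are assumptions):

```r
set.seed(1)
w <- rnorm(550)
# x_t = x_{t-1} - 0.9 x_{t-2} + w_t, with the first 50 values dropped
x <- stats::filter(w, filter = c(1, -0.9), method = "recursive")[-(1:50)]
plot.ts(x, main = "autoregression")
```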
A random walk with drift is defined as:
\[\begin{aligned} x_{t} &= \delta + x_{t-1} + w_{t} \end{aligned}\]With initial condition \(x_{0} = 0\), this is the same as a cumulative sum of white noise variates:
\[\begin{aligned} x_{t} &= \delta t + \sum_{j=1}^{t}w_{j} \end{aligned}\]And the mean:
\[\begin{aligned} \mu_{xt} &= E[x_{t}]\\ &= \delta t + \sum_{j=1}^{t}E[w_{j}]\\ &= \delta t \end{aligned}\]Autocovariance function:
\[\begin{aligned} \gamma_{x}(s, t) &= \mathrm{cov}(x_{s}, x_{t})\\ &= \mathrm{cov}\Big( \sum_{j=1}^{s}w_{j}, \sum_{k = 1}^{t}w_{k} \Big)\\ &= \mathrm{min}\{s, t\}\sigma_{w}^{2} \end{aligned}\]Because the \(w_{t}\) are uncorrelated, only the variances of the overlapping terms contribute. Note that the autocovariance function depends on the particular values of \(s\) and \(t\), and thus the process is not stationary.
For the variance:
\[\begin{aligned} \mathrm{var}(x_{t}) &= \gamma_{x}(t, t)\\ &= t\sigma_{w}^{2} \end{aligned}\]Note that the variance increases without bound as time \(t\) increases.
And note the resemblance to Global Temperature Deviations:
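A minimal sketch of simulating random walks with and without drift (the drift \(\delta = 0.2\), seed, and length are assumptions):

```r
set.seed(154)
w  <- rnorm(200)
x  <- cumsum(w)      # random walk, delta = 0
wd <- w + 0.2        # white noise plus drift
xd <- cumsum(wd)     # random walk with drift delta = 0.2

plot.ts(xd, ylim = c(-5, 55), main = "random walk")
lines(x, col = 4)
abline(a = 0, b = 0.2, lty = 2)   # deterministic drift line
abline(h = 0, col = 4, lty = 2)
```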
Consider the model:
\[\begin{aligned} x_{t} &= \alpha + \beta t + y_{t} \end{aligned}\]Where \(y_{t}\) is stationary. Then the mean:
\[\begin{aligned} \mu_{x, t} &= E[x_{t}]\\ &= \alpha + \beta t + \mu_{y} \end{aligned}\]which is not independent of time. However, the autocovariance function:
\[\begin{aligned} \gamma_{x}(h) &= \mathrm{cov}(x_{t+h}, x_{t})\\ &= E[(x_{t+h} - \mu_{x, t+h})(x_{t} - \mu_{x,t})]\\ &= E[(y_{t+h} - \mu_{y})(y_{t} - \mu_{y})]\\ &= \gamma_{y}(h) \end{aligned}\]is independent of time. In other words, after removing the trend the model is stationary; we say the model is stationary around a linear trend.
Consider the model:
\[\begin{aligned} x_{t} &= 2 \mathrm{cos}\Big(2\pi \frac{t + 15}{50}\Big) + w_{t} \end{aligned}\]And the mean:
\[\begin{aligned} \mu_{xt} &= E[x_{t}]\\ &= E\Big[2\mathrm{cos}\Big(2\pi \frac{t+15}{50}\Big) + w_{t}\Big]\\ &= 2\mathrm{cos}\Big(2\pi \frac{t+15}{50}\Big) + E[w_{t}]\\ &= 2\mathrm{cos}\Big(2\pi \frac{t+15}{50}\Big) \end{aligned}\]Note that any sinusoidal wave can be written as:
\[\begin{aligned} A \mathrm{cos}(2\pi \omega t + \phi) \end{aligned}\]In the example:
\[\begin{aligned} A &= 2\\ \omega &= \frac{1}{50}\\ \phi &= \frac{2\pi 15}{50}\\ &= 0.6 \pi \end{aligned}\]Adding white noise to the signal:
Note the similarity with the fMRI series:
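A minimal sketch of the signal-plus-noise simulation (the series length of 500 and unit noise variance are assumptions):

```r
cs <- 2 * cos(2 * pi * (1:500 + 15) / 50)   # the deterministic cosine signal
w  <- rnorm(500)

par(mfrow = c(2, 1))
plot.ts(cs, main = "signal")
plot.ts(cs + w, main = "signal + noise")
```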
In a classical linear regression model:
\[\begin{aligned} x_{t} &= \beta_{0} + \beta_{1}z_{t1} + \cdots + \beta_{q}z_{tq} + w_{t} \end{aligned}\]For time series regression, it is rarely the case that the residuals are white noise.
For example, consider the monthly price (per pound) of chicken over 180 months:
Linear regression model:
\[\begin{aligned} x_{t} &= \beta_{0} + \beta_{1}z_{t} + w_{t} \end{aligned}\]We would like to minimize the SSE:
\[\begin{aligned} \text{SSE} &= \sum_{t = 1}^{n}\Big(x_{t} - (\beta_{0} + \beta_{1}z_{t})\Big)^{2} \end{aligned}\]Differentiating with respect to each parameter and setting the derivatives to 0, we get:
\[\begin{aligned} \hat{\beta}_{1} &= \frac{\sum_{t=1}^{n}(x_{t} - \bar{x})(z_{t} - \bar{z})}{\sum_{t=1}^{n}(z_{t} - \bar{z})^{2}}\\[5pt] \hat{\beta}_{0} &= \bar{x} - \hat{\beta}_{1}\bar{z} \end{aligned}\]Using R to fit a linear model:
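A minimal sketch of the straight-line fit, assuming the chicken series from astsa; note that time() is indexed in calendar years here, so the coefficients may differ from the values quoted later in the post:

```r
library(astsa)
# regress price on time; na.action = NULL keeps the time series structure intact
summary(fit <- lm(chicken ~ time(chicken), na.action = NULL))

plot(chicken, ylab = "cents per pound")
abline(fit)   # add the fitted trend line
```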
Let’s take a look at another example where we have more than one explanatory variable. We look at the possible effects of temperature and pollution on weekly mortality in Los Angeles County.
Taking a look at the time plots, we can see strong seasonal components in all 3 series:
From the scatterplot matrix, we can see strong linear relations between mortality and the pollutant and between mortality and temperature; the mortality vs temperature panel suggests a possible quadratic shape:
We will attempt to fit the following models:
\[\begin{aligned} M_{t} &= \beta_{0} + \beta_{1}t + w_{t}\\ M_{t} &= \beta_{0} + \beta_{1}t + \beta_{2}(T_{t} - T_{\mu}) + w_{t}\\ M_{t} &= \beta_{0} + \beta_{1}t + \beta_{2}(T_{t} - T_{\mu}) + \beta_{3}(T_{t} - T_{\mu})^{2} + w_{t}\\ M_{t} &= \beta_{0} + \beta_{1}t + \beta_{2}(T_{t} - T_{\mu}) + \beta_{3}(T_{t} - T_{\mu})^{2} + \beta_{4}P_{t} + w_{t} \end{aligned}\]Fitting the models in R:
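A minimal sketch of fitting the largest model (cmort, tempr, and part are the mortality, temperature, and particulate series in astsa; the smaller models are nested versions of this fit):

```r
library(astsa)
temp  <- tempr - mean(tempr)   # centered temperature, T_t - T_mu
temp2 <- temp^2                # quadratic temperature term
trend <- time(cmort)           # time, used as the trend regressor

fit <- lm(cmort ~ trend + temp + temp2 + part, na.action = NULL)
summary(fit)   # coefficient estimates
anova(fit)     # sequential ANOVA table, useful for comparing the nested models
```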
We can see that each subsequent model does better in terms of the residual sum of squares (SSE).
We saw that the Southern Oscillation Index (SOI) at a lag of 6 months is highly correlated with the Recruitment series. We therefore consider the following model:
\[\begin{aligned} R_{t} &= \beta_{0} + \beta_{1}S_{t-6} + w_{t} \end{aligned}\]Fitting the model in R:
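A minimal sketch of the lagged regression; ts.intersect() aligns Recruitment with SOI lagged by six months (series from astsa):

```r
library(astsa)
fish <- ts.intersect(rec, soiL6 = stats::lag(soi, -6), dframe = TRUE)
summary(fit1 <- lm(rec ~ soiL6, data = fish, na.action = NULL))
```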
The fitted model is:
\[\begin{aligned} \hat{R}_{t} &= 65.79 - 44.28S_{t-6} \end{aligned}\]The easiest form of nonstationarity to work with is the trend stationary model:
\[\begin{aligned} x_{t} &= \mu_{t} + e_{t} \end{aligned}\]Where \(\mu_{t}\) is the trend and \(e_{t}\) is a stationary process. We will work with the detrended residual:
\[\begin{aligned} \hat{y}_{t} &= x_{t} - \hat{\mu}_{t} \end{aligned}\]Recall the model for the chicken prices:
\[\begin{aligned} \hat{\mu}_{t} &= 59 + 0.30 t \end{aligned}\]To obtain the detrended series:
\[\begin{aligned} \hat{y}_{t} &= x_{t} - 59 - 0.30 t \end{aligned}\]The plot:
The ACF plot:
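A minimal sketch of the detrended series and its ACF, assuming the linear fit of chicken on time from above:

```r
library(astsa)
fit <- lm(chicken ~ time(chicken), na.action = NULL)
detrended <- resid(fit)   # the detrended series

par(mfrow = c(2, 1))
plot(detrended, type = "l", main = "detrended chicken prices")
acf(detrended, 48, main = "ACF of the detrended series")
```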
It is also possible to model the trend \(\mu_{t}\) as a random walk with drift:
\[\begin{aligned} \mu_{t} &= \delta + \mu_{t-1} + w_{t} \end{aligned}\]If the true source of the model is:
\[\begin{aligned} x_{t} &= \mu_{t} + e_{t} \end{aligned}\]Differencing the data will yield a stationary process:
\[\begin{aligned} x_{t} - x_{t-1} &= (\mu_{t} + e_{t}) - (\mu_{t-1} + e_{t-1})\\ &= (\mu_{t} - \mu_{t-1}) + (e_{t} - e_{t-1})\\ &= \delta + w_{t} + e_{t} - e_{t-1} \end{aligned}\]It is easy to show that \(e_{t}- e_{t-1}\) is stationary.
If an estimate of \(e_{t}\) is essential, then detrending may be more appropriate. If the goal is to coerce the data to stationarity, then differencing may be more appropriate. Differencing is also a viable tool if the trend is fixed. For example:
\[\begin{aligned} \mu_{t} &= \beta_{0} + \beta_{1}t\\ x_{t}- x_{t-1} &= (\mu_{t} + e_{t}) - (\mu_{t-1} + e_{t-1})\\ &= \beta_{0} + \beta_{1}t + e_{t} - (\beta_{0} + \beta_{1}(t-1) + e_{t-1})\\ &= \beta_{1}t - \beta_{1}t + \beta_{1} + e_{t} - e_{t-1}\\ &= \beta_{1} + e_{t} - e_{t-1} \end{aligned}\]One disadvantage of differencing is it does not yield an estimate for the stationary process \(e_{t}\).
As we have seen, the first difference eliminates a linear trend. A second difference can eliminate a quadratic trend, and so on.
We define the backshift operator as:
\[\begin{aligned} Bx_{t} &= x_{t-1}\\ B^{2}x_{t} &= B(Bx_{t})\\ &= B(x_{t-1})\\ &= x_{t-2}\\ B^{k}x_{t} &= x_{t-k} \end{aligned}\]We also require that:
\[\begin{aligned} B^{-1}B &= 1 \end{aligned}\] \[\begin{aligned} x_{t} &= B^{-1}Bx_{t}\\ &= B^{-1}x_{t-1} \end{aligned}\]\(B^{-1}\) is known as the forward-shift operator.
We denote the first difference as:
\[\begin{aligned} \nabla x_{t} &= x_{t} - x_{t-1}\\ &= (1 - B)x_{t} \end{aligned}\]And second order difference:
\[\begin{aligned} \nabla^{2}x_{t} &= (1 - B)^{2}x_{t}\\ &= (1 - 2B + B^{2})x_{t}\\ &= x_{t} - 2x_{t-1} + x_{t-2}\\[5pt] \nabla(\nabla x_{t}) &= \nabla(x_{t} - x_{t-1})\\ &= (x_{t} - x_{t-1}) - (x_{t-1} - x_{t-2})\\ &= x_{t} - 2x_{t-1} + x_{t-2} \end{aligned}\]In general, differences of order d are defined as:
\[\begin{aligned} \nabla^{d} &= (1 - B)^{d} \end{aligned}\]Sometimes transformations may be useful to equalize the variability of a time series. One example of such transformation is the log transform:
\[\begin{aligned} y_{t} &= \mathrm{log}(x_{t}) \end{aligned}\]Power transformations in the Box-Cox family are of the form:
\[\begin{aligned} y_{t} &= \begin{cases} \frac{x_{t}^{\lambda} - 1}{\lambda} & \lambda \neq 0\\ \mathrm{log}(x_{t}) & \lambda = 0 \end{cases} \end{aligned}\]Transformations are normally done to improve linearity and normality.
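A small sketch of these transformations in R (the Johnson & Johnson earnings series jj from astsa is a typical example; the Box-Cox \(\lambda\) is an arbitrary illustrative value):

```r
library(astsa)

# log transform to stabilize the growing variability of the earnings series
par(mfrow = c(2, 1))
plot(jj, main = "quarterly earnings")
plot(log(jj), main = "log of quarterly earnings")

# a Box-Cox power transform with lambda chosen by hand
lambda <- 0.25
jj_bc <- (jj^lambda - 1) / lambda
```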
The ACF measures the linear correlation at each lag. However, the relationship might be nonlinear. The example in this section displays lagged scatterplots of the SOI series. Lowess (locally weighted scatterplot smoothing) fits are plotted together with the scatterplots to help reveal any nonlinear relationship.
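A minimal sketch of the lagged scatterplots using astsa's plotting helpers (lag1.plot() for a series against its own lags, lag2.plot() for one series against lags of another):

```r
library(astsa)
lag1.plot(soi, 12)       # SOI against its own lags 1-12, with lowess fits
lag2.plot(soi, rec, 8)   # Recruitment against SOI at lags 0-8, with lowess fits
```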
We can see that the correlation is highest at \(S_{t-6}\). However, the relationship is nonlinear. In this case, we can add a dummy variable to implement a segmented (hockey stick) regression:
\[\begin{aligned} R_{t} &= \beta_{0} + \beta_{1}S_{t-6} + \beta_{2}D_{t-6} + \beta_{3}D_{t-6}S_{t-6} + w_{t}\\[5pt] R_{t} &= \begin{cases} \beta_{0} + \beta_{1}S_{t-6} + w_{t} & S_{t-6} < 0\\ (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})S_{t-6} + w_{t} & S_{t-6} \geq 0 \end{cases} \end{aligned}\]where \(D_{t}\) is a dummy variable that is 0 if \(S_{t} < 0\) and 1 otherwise. Plotting the lowess and fitted curve:
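A minimal sketch of the dummy-variable regression, followed by the plot of the lowess curve and fitted values (series from astsa):

```r
library(astsa)
dummy <- ifelse(soi < 0, 0, 1)
fish  <- ts.intersect(rec, soiL6 = stats::lag(soi, -6),
                      dL6 = stats::lag(dummy, -6), dframe = TRUE)
summary(fit <- lm(rec ~ soiL6 * dL6, data = fish, na.action = NULL))

plot(fish$soiL6, fish$rec, xlab = "SOI (lag 6)", ylab = "Recruitment")
lines(lowess(fish$soiL6, fish$rec), col = 4, lwd = 2)   # lowess fit
points(fish$soiL6, fitted(fit), pch = 3, col = 2)       # fitted hockey stick
```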
If \(x_{t}\) represents the observations, then a symmetric moving average smoother is:
\[\begin{aligned} m_{t} &= \sum_{j= -k}^{k}a_{j}x_{t-j}\\ a_j &= a_{-j} \geq 0 \end{aligned}\] \[\begin{aligned} \sum_{j=-k}^{k}a_{j} &= 1 \end{aligned}\]For example, using the monthly SOI series for \(k = 6\) using weights:
\[\begin{aligned} a_{0} &= a_{\pm 1} = \cdots = a_{\pm 5} = \frac{1}{12} \end{aligned}\] \[\begin{aligned} a_{\pm 6} &= \frac{1}{24} \end{aligned}\]Although the moving average smoothing highlights the El Niño effect, the result still looks too choppy. We can obtain a smoother fit by using a normal distribution for the weights (kernel smoothing) rather than the boxcar-type weights above, where \(b\) is the bandwidth and \(K(\cdot)\) is the kernel function:
\[\begin{aligned} m_{t} &= \sum_{i=1}^{n}w_{i}(t)x_{i}\\ w_{i}(t) &= \frac{K\Big(\frac{t-i}{b}\Big)}{\sum_{j=1}^{n}K\Big(\frac{t-j}{b}\Big)} \end{aligned}\] \[\begin{aligned} K(z) &= \frac{1}{\sqrt{2\pi}}e^{\frac{-z^{2}}{2}} \end{aligned}\]To implement this in R, we use the ksmooth() function. The wider the bandwidth, the smoother the result. In the SOI example, we use a value of \(b=12\), which corresponds to smoothing over approximately a year.
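A minimal sketch of both smoothers applied to the SOI series (using a monthly index and \(b = 12\) for ksmooth, to match the text; the plotting details are assumptions):

```r
library(astsa)

# moving average with the boxcar-type weights above (1/24 at the ends, 1/12 elsewhere)
wgts <- c(0.5, rep(1, 11), 0.5) / 12
soif <- stats::filter(soi, sides = 2, filter = wgts)
plot(soi)
lines(soif, lwd = 2, col = 4)

# kernel (normal) smoothing over a monthly index with bandwidth b = 12
sm <- ksmooth(1:length(soi), as.numeric(soi), kernel = "normal", bandwidth = 12)
plot(soi)
lines(time(soi), sm$y, lwd = 2, col = 4)
```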
Lowess is a method of smoothing that employs nearest-neighbor local regression. The larger the fraction of nearest neighbors included, the smoother the fit. We fit two lowess curves, one smoother than the other:
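A minimal sketch of the two lowess fits to the SOI series (the span f = 0.05 for the rougher fit is an assumption; the default span gives the smoother trend):

```r
library(astsa)
plot(soi)
lines(lowess(soi, f = 0.05), lwd = 2, col = 4)   # rough fit, tracks the El Niño cycle
lines(lowess(soi), lty = 2, lwd = 2, col = 2)    # smooth fit (default f = 2/3), trend
```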
Given a cubic polynomial:
\[\begin{aligned} m_{t} &= \beta_{0} + \beta_{1}t + \beta_{2}t^{2} + \beta_{3}t^{3} \end{aligned}\]Cubic splines divide time into \(k\) intervals. The values \(t_{0}, t_{1}, \cdots, t_{k}\) are called knots. In each interval, a cubic polynomial is fit, with the condition that the pieces join smoothly at the knots.
For smoothing splines, the following objective function is minimized, given a regularization parameter \(\lambda > 0\):
\[\begin{aligned} \sum_{t=1}^{n}(x_{t} - m_{t})^{2} + \lambda \int (m_{t}^{\prime\prime})^{2}\ dt \end{aligned}\]It is also possible to smooth one series over another. Using the Temperature and Mortality series data:
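A minimal sketch of a smoothing spline fit to SOI, and of smoothing Mortality over Temperature with lowess (the spar value is an assumption):

```r
library(astsa)

# smoothing spline on the SOI series
plot(soi)
lines(smooth.spline(time(soi), soi, spar = 0.5), lwd = 2, col = 4)

# smoothing one series over another: Mortality as a function of Temperature
plot(tempr, cmort, xlab = "Temperature", ylab = "Mortality")
lines(lowess(tempr, cmort), col = 4, lwd = 2)
```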