Lesson 6.7: Finding Long-Run Relationships: Cointegration

We now explore one of the most profound ideas in time series analysis. Cointegration allows us to find stable, long-run equilibrium relationships between non-stationary variables. The concept was introduced by Clive Granger and formalized in his work with Robert Engle, and it earned Granger a share of the 2003 Nobel Memorial Prize in Economic Sciences. It moves beyond short-term dynamics to uncover the fundamental anchors that tie economic and financial systems together.

Part 1: The Problem - The 'Spurious Regression'

A major pitfall when working with non-stationary time series (like stock prices or GDP) is the problem of **spurious regression**. If you regress one random walk on another, unrelated random walk, you will very often find a statistically significant relationship ( $t$ -stats are high, $R^2$ is high) even when no true economic relationship exists.

Each series carries its own stochastic trend, and over a finite sample those unrelated trends can line up by chance, creating a statistical illusion. In the last lesson, we solved this by differencing the data to make it stationary before putting it into a VAR model. However, differencing is a blunt instrument. It throws away all information about the long-run co-movement of the variables in their levels.
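The spurious-regression effect described above can be reproduced with a short simulation. The sketch below is illustrative only: the sample size, number of simulations, seed, and the helper names `slope_t_stat` and `rejection_rate` are assumptions, not part of the lesson. It regresses one independent random walk on another many times and counts how often the slope looks "significant" at the nominal 5% level, first in levels and then in first differences.

```python
import numpy as np

def slope_t_stat(y, x):
    """t-statistic of the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # classical OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])

def rejection_rate(n_sims=500, n_obs=200, difference=False, seed=0):
    """Share of simulations in which |t| > 1.96 for two unrelated random walks."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        y = np.cumsum(rng.standard_normal(n_obs))  # random walk 1
        x = np.cumsum(rng.standard_normal(n_obs))  # independent random walk 2
        if difference:
            y, x = np.diff(y), np.diff(x)          # work with stationary changes
        hits += abs(slope_t_stat(y, x)) > 1.96
    return hits / n_sims

levels_rate = rejection_rate(difference=False)
diff_rate = rejection_rate(difference=True)
print(f"'Significant' slopes in levels:      {levels_rate:.1%}")
print(f"'Significant' slopes in differences: {diff_rate:.1%}")
```

In levels, the share of "significant" slopes should be far above the nominal 5%, while in differences it should sit close to 5%. That is the fix from the last lesson, and also the step that discards the level information.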

The Core Analogy: The Drunk and Her Dog

Imagine a drunk walking randomly through a park. Her path is a classic random walk ($I(1)$) and is unpredictable. Now imagine she is walking her dog on a leash. The dog also wanders randomly, so its path is also a random walk ($I(1)$).

If you look at the drunk's path in isolation, it's non-stationary. The same is true for the dog's path.
However, their paths are not independent. The leash ensures they can't wander arbitrarily far apart from each other. The **distance between them** is constrained.
While both paths are non-stationary, the spread (or the linear combination: $\text{Drunk's Position} - \text{Dog's Position}$ ) is **stationary**. It hovers around a mean of zero.

This is the essence of cointegration. The drunk and her dog are **cointegrated**. They are a system with a long-run equilibrium relationship (the leash) that pulls them back together whenever they stray too far apart.
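The analogy can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the leash is modeled as the dog closing a fixed fraction (`pull`) of the previous gap at each step, and the step count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps = 5000
pull = 0.2  # assumed leash strength: share of the gap the dog closes each step

# The drunk follows a pure random walk.
drunk = np.cumsum(rng.standard_normal(n_steps))

# The dog wanders randomly too, but is tugged toward the drunk's last position.
dog = np.zeros(n_steps)
dog_noise = rng.standard_normal(n_steps)
for t in range(1, n_steps):
    gap_prev = drunk[t - 1] - dog[t - 1]
    dog[t] = dog[t - 1] + pull * gap_prev + dog_noise[t]

gap = drunk - dog  # the "spread" between the two paths

print(f"Largest distance of the drunk from the start: {np.abs(drunk).max():.1f}")
print(f"Largest distance of the dog from the start:   {np.abs(dog).max():.1f}")
print(f"Gap mean: {gap.mean():.2f}, gap std: {gap.std():.2f}")
```

Both paths drift far from the starting point, while the gap stays in a narrow band around zero. With this setup the gap follows an AR(1) process with coefficient `1 - pull`, which is stationary for any `pull` between 0 and 1.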

Part 2: The Formal Definition of Cointegration

Cointegration gives a formal mathematical structure to this "leash" concept.

Definition of Cointegration

Let $Y_t$ and $X_t$ be two non-stationary time series that are both integrated of order 1, i.e., $Y_t, X_t \sim I(1)$ .

They are said to be **cointegrated** if there exists a linear combination of them that is stationary ( $I(0)$ ). That is, if there exists a coefficient $\beta$ such that the series $Z_t$ :

$Z_t = Y_t - \beta X_t$

is a stationary ( $I(0)$ ) process.

The vector $[1, -\beta]$ is called the **cointegrating vector**.
The stationary series $Z_t$ represents the **equilibrium error** or the "spread." When $Z_t$ is far from its mean, we expect a correction to occur in $Y_t$ and/or $X_t$ to restore the long-run balance.
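A small worked example may make the definition concrete. The data-generating process below is an assumption chosen for illustration: $X_t$ is a pure random walk and $Y_t$ tracks $\beta X_t$ (with $\beta \neq 0$) up to white noise $u_t$.

```latex
\begin{aligned}
X_t &= X_{t-1} + \varepsilon_t, & \varepsilon_t &\sim \text{i.i.d.}(0, \sigma_\varepsilon^2) \\
Y_t &= \beta X_t + u_t,         & u_t &\sim \text{i.i.d.}(0, \sigma_u^2) \\[4pt]
Z_t = Y_t - \beta X_t &= u_t    & &\Rightarrow\; Z_t \sim I(0) \\
Y_t - b X_t &= (\beta - b) X_t + u_t & &\Rightarrow\; I(1) \text{ for any } b \neq \beta
\end{aligned}
```

Only the coefficient $b = \beta$ removes the random-walk component, so the cointegrating vector $[1, -\beta]$ is unique up to scale: any multiple $c\,[1, -\beta]$ with $c \neq 0$ also yields a stationary combination.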

Part 3: The Engle-Granger Two-Step Test for Cointegration

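How do we test if two series are cointegrated? The most intuitive method is the two-step procedure developed by Engle and Granger: estimate the long-run relationship by OLS, then test the residuals for a unit root. In practice, a preliminary check of each series' order of integration comes first, which gives the three steps below.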

The Engle-Granger Testing Procedure

Step 1: Test for Unit Roots in Individual Series. First, confirm that both series, $Y_t$ and $X_t$, are non-stationary and have the same order of integration (usually $I(1)$). Use the Augmented Dickey-Fuller (ADF) test on each series: for an $I(1)$ series, the test should fail to reject a unit root in the levels and reject it in the first differences. If one series is $I(1)$ and the other is $I(0)$, they cannot be cointegrated.
Step 2: Run the 'Cointegrating Regression'. If both series are $I(1)$, run a simple OLS regression of one on the other to estimate the long-run relationship:
$Y_t = \alpha + \beta X_t + Z_t$
Save the residuals from this regression: $\hat{Z}_t = Y_t - \hat{\alpha} - \hat{\beta} X_t$. This residual series, $\hat{Z}_t$, is our estimated "spread" or "equilibrium error."
Step 3: Test the Residuals for a Unit Root. The final, critical step is to test whether this residual series $\hat{Z}_t$ is stationary. We perform an ADF test on the residuals.
- $H_0$ : The residuals have a unit root (i.e., they are non-stationary). This means the series are **NOT** cointegrated.
- $H_1$ : The residuals are stationary. This means the series **ARE** cointegrated.
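The procedure above can be sketched by hand on simulated data. This is a simplified illustration, not a replacement for a library routine: it uses a plain Dickey-Fuller regression with no lagged differences (rather than the augmented version), the helper names `ols` and `df_t_stat` are hypothetical, the simulated pair is $I(1)$ by construction so Step 1 is skipped, and the 5% critical value of about -3.34 is the approximate large-sample Engle-Granger value for two series with a constant in the cointegrating regression.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients and residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

def df_t_stat(z):
    """Dickey-Fuller t-statistic: regress diff(z) on lagged z (no constant, no lags)."""
    dz, z_lag = np.diff(z), z[:-1]
    rho = (z_lag @ dz) / (z_lag @ z_lag)
    resid = dz - rho * z_lag
    se = np.sqrt(resid @ resid / (len(dz) - 1) / (z_lag @ z_lag))
    return rho / se

# Simulated cointegrated pair (assumed for illustration): y = 2 + 0.5 x + u,
# where x is a random walk and u is a stationary AR(1) equilibrium error.
rng = np.random.default_rng(7)
n_obs = 1000
x = np.cumsum(rng.standard_normal(n_obs))
shocks = rng.standard_normal(n_obs)
u = np.zeros(n_obs)
for t in range(1, n_obs):
    u[t] = 0.5 * u[t - 1] + shocks[t]
y = 2.0 + 0.5 * x + u

# Step 2: cointegrating regression of y on a constant and x.
X = np.column_stack([np.ones(n_obs), x])
(alpha_hat, beta_hat), z_hat = ols(y, X)

# Step 3: unit-root test on the residuals, judged against the Engle-Granger value.
stat = df_t_stat(z_hat)
EG_CRIT_5PCT = -3.34  # approximate asymptotic value, two series, constant included

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"Residual DF statistic = {stat:.2f} (5% critical value ~ {EG_CRIT_5PCT})")
print("Reject H0: cointegrated" if stat < EG_CRIT_5PCT else "Cannot reject H0")
```

Because the pair is cointegrated by construction, the statistic should fall well below the critical value and `beta_hat` should land close to the true 0.5.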

Crucial Detail:

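When performing the ADF test on residuals from a cointegrating regression, we cannot use the standard critical values. OLS chooses $\hat{\beta}$ to make the residuals look as stationary as possible, so the test statistic is biased toward rejecting the null. We must use a special set of critical values, often called Engle-Granger or MacKinnon critical values, which are more stringent. Dedicated cointegration routines (such as the `coint` function in statsmodels used in Part 5) apply them automatically; a plain ADF test run on saved residuals does not.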
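The need for stricter critical values can be checked by simulation. The sketch below is an illustration under assumed settings (2,000 simulations, 200 observations, no lagged differences in the test regression, hypothetical helper names). It generates pairs of unrelated random walks, so the null of no cointegration is true, and compares the 5% quantile of the Dickey-Fuller statistic computed on a raw random walk with the same statistic computed on OLS residuals.

```python
import numpy as np

def df_t_stat(z):
    """Dickey-Fuller t-statistic: regress diff(z) on lagged z (no constant, no lags)."""
    dz, z_lag = np.diff(z), z[:-1]
    rho = (z_lag @ dz) / (z_lag @ z_lag)
    resid = dz - rho * z_lag
    se = np.sqrt(resid @ resid / (len(dz) - 1) / (z_lag @ z_lag))
    return rho / se

def eg_residuals(y, x):
    """Residuals from regressing y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(123)
n_sims, n_obs = 2000, 200
raw_stats = np.empty(n_sims)
resid_stats = np.empty(n_sims)
for i in range(n_sims):
    y = np.cumsum(rng.standard_normal(n_obs))   # random walk
    x = np.cumsum(rng.standard_normal(n_obs))   # unrelated random walk
    raw_stats[i] = df_t_stat(y)                     # ordinary unit-root test
    resid_stats[i] = df_t_stat(eg_residuals(y, x))  # residual-based test

raw_q05 = np.quantile(raw_stats, 0.05)
resid_q05 = np.quantile(resid_stats, 0.05)
print(f"5% quantile, raw series:      {raw_q05:.2f}")
print(f"5% quantile, OLS residuals:   {resid_q05:.2f}")
```

The residual-based quantile should come out noticeably more negative than the raw one. Judging residuals against the ordinary Dickey-Fuller critical value would therefore reject far too often and "find" cointegration between unrelated series.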

Part 4: The Quant's Payoff - Pairs Trading

Cointegration is not just an abstract statistical property; it is the theoretical foundation for one of the most famous quantitative trading strategies: **pairs trading**.

The Pairs Trading Strategy

The strategy works as follows:

Find a Cointegrated Pair: Identify two assets (e.g., two similar stocks like Coca-Cola and Pepsi, or two related ETFs) that are cointegrated. This means their prices are bound by a long-run equilibrium relationship.
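Model the Spread: Run the cointegrating regression to get the equilibrium relationship ($\hat{\alpha}$ and $\hat{\beta}$) and calculate the historical spread series, $\hat{Z}_t = Y_t - \hat{\alpha} - \hat{\beta} X_t$. Because these are regression residuals, the spread has a mean of zero over the estimation sample.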
Generate Trading Signals: Since the spread $\hat{Z}_t$ is stationary (mean-reverting), we can define trading thresholds. For example, we might calculate the standard deviation of the spread, $\sigma_Z$ , and set entry thresholds at $\pm 2\sigma_Z$ .
Execute Trades:
- When the spread $\hat{Z}_t$ rises above $+2\sigma_Z$, it means $Y_t$ is "too expensive" relative to $X_t$. The strategy would be to **short** the spread: for each unit of $Y_t$ sold, buy $\hat{\beta}$ units of $X_t$.
- When the spread falls below $-2\sigma_Z$, it means $Y_t$ is "too cheap." The strategy would be to **go long** the spread: for each unit of $Y_t$ bought, sell $\hat{\beta}$ units of $X_t$.
Exit the Trade: The position is closed when the spread reverts to its mean (zero, when the spread is measured as the regression residual).

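This strategy is fundamentally a bet on the stability of the cointegrating relationship. It is usually described as market-neutral because every position is long one asset and short the other, which offsets much of the exposure to the overall direction of the market. The hedge is only as good as the estimated $\hat{\beta}$, and the strategy loses money if the relationship breaks down.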
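The signal logic in the steps above can be sketched as a small state machine. This is a minimal illustration, not a tradable implementation: the function name `pairs_positions` and its parameters are hypothetical, the mean and standard deviation are passed in explicitly (estimating them on the same data you trade on introduces look-ahead bias), and transaction costs are ignored.

```python
import numpy as np

def pairs_positions(spread, mean, sigma, entry=2.0):
    """Position in the spread at each step: -1 short, +1 long, 0 flat.

    Enter when the z-score moves beyond +/- entry; exit when it crosses the mean.
    """
    z_scores = (np.asarray(spread, dtype=float) - mean) / sigma
    positions = np.zeros(len(z_scores), dtype=int)
    current = 0
    for i, z in enumerate(z_scores):
        if current == 0:
            if z > entry:
                current = -1      # spread too high: short Y, long beta units of X
            elif z < -entry:
                current = 1       # spread too low: long Y, short beta units of X
        elif current == -1 and z <= 0:
            current = 0           # short closed once the spread is back at its mean
        elif current == 1 and z >= 0:
            current = 0           # long closed once the spread is back at its mean
        positions[i] = current
    return positions

# Toy spread, already measured in units of sigma around a zero mean.
toy_spread = [0.0, 2.5, 1.0, -0.1, -2.5, -1.0, 0.2, 0.0]
positions = pairs_positions(toy_spread, mean=0.0, sigma=1.0)
print(positions.tolist())  # [0, -1, -1, 0, 1, 1, 0, 0]
```

The short position opened at 2.5 is held through 1.0 and closed only when the spread crosses its mean, and the long position that follows mirrors it. In practice, `mean` and `sigma` would come from an estimation window that ends before the trading period starts.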

Part 5: Python Implementation - Finding a Cointegrated Pair

Testing for Cointegration Between Two ETFs

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, coint

# --- 1. Get and Prepare Data ---
# Let's test two ETFs that track emerging markets: EEM and VWO
# They should theoretically be cointegrated.
# auto_adjust=True puts split- and dividend-adjusted prices in the 'Close' column;
# newer yfinance releases no longer return an 'Adj Close' column by default.
prices = yf.download(['EEM', 'VWO'], start='2010-01-01', end='2023-12-31',
                     auto_adjust=True)['Close']
data = prices[['EEM', 'VWO']].dropna()

# --- 2. Step 1: Test for Unit Roots in Individual Series ---
adf_eem = adfuller(data['EEM'])
adf_vwo = adfuller(data['VWO'])

print(f"ADF p-value for EEM: {adf_eem[1]:.4f}")
print(f"ADF p-value for VWO: {adf_vwo[1]:.4f}")
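# High p-values (> 0.05) mean we cannot reject a unit root in the levels.
# To confirm I(1), the first differences should test as stationary (low p-values).
adf_eem_diff = adfuller(data['EEM'].diff().dropna())
adf_vwo_diff = adfuller(data['VWO'].diff().dropna())
print(f"ADF p-value for EEM first differences: {adf_eem_diff[1]:.4f}")
print(f"ADF p-value for VWO first differences: {adf_vwo_diff[1]:.4f}")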

# --- 3. Step 2 & 3: Run Engle-Granger Cointegration Test ---
# The coint() function in statsmodels does the two steps for us:
# 1. Runs the OLS regression: EEM = beta * VWO + const
# 2. Runs the ADF test on the residuals with the correct critical values.

coint_result = coint(data['EEM'], data['VWO'])
coint_t_statistic = coint_result[0]
coint_p_value = coint_result[1]
critical_values = coint_result[2]

print(f"\nCointegration Test t-statistic: {coint_t_statistic:.4f}")
print(f"Cointegration Test p-value: {coint_p_value:.4f}")
print("Critical Values (1%, 5%, 10%):", critical_values)

# --- 4. Interpret the Results ---
# H₀: The series are NOT cointegrated.
# If the p-value is low (< 0.05), we REJECT H₀ and conclude they are cointegrated.
if coint_p_value < 0.05:
    print("\nResult: The series appear to be cointegrated.")
else:
    print("\nResult: The series do not appear to be cointegrated.")

# --- 5. Visualize the Spread (Equilibrium Error) ---
# To get the spread, we need to run the OLS regression ourselves.
import statsmodels.api as sm
X = sm.add_constant(data['VWO'])
model = sm.OLS(data['EEM'], X).fit()
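alpha = model.params['const']
beta = model.params['VWO']
# Equilibrium error Z_t = EEM - alpha - beta * VWO (the OLS residuals, mean zero)
spread = data['EEM'] - alpha - beta * data['VWO']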

spread.plot(figsize=(12,6), title='EEM-VWO Cointegrated Spread')
plt.axhline(spread.mean(), color='red', linestyle='--')
plt.ylabel('Spread')
plt.show()
# If the test above supports cointegration, the plot should show a series
# that keeps reverting to its mean rather than drifting away.

What's Next? Reconnecting the Short Run and the Long Run

We have a puzzle. In the last lesson, we learned that to model non-stationary systems with a VAR, we must difference the data. But in this lesson, we learned that differencing throws away crucial information about the long-run cointegrating relationship.

How can we build a model that does both? How can we capture the short-run dynamics of the differenced data while also respecting the long-run equilibrium of the levels data?

The solution is a sophisticated model that combines the VAR framework with the concept of cointegration. In our final modeling lesson, we will introduce the **Vector Error Correction Model (VECM)**.

Up Next: Putting It Together: The Vector Error Correction Model (VECM)