Lesson 6.7: Finding Long-Run Relationships: Cointegration
We now explore one of the most profound ideas in time series analysis. Cointegration allows us to find stable, long-run equilibrium relationships between non-stationary variables. This Nobel Prize-winning concept, developed by Clive Granger and Robert Engle, moves beyond short-term dynamics to uncover the fundamental anchors that tie economic and financial systems together.
Part 1: The Problem - The 'Spurious Regression'
A major pitfall when working with non-stationary time series (like stock prices or GDP) is the problem of **spurious regression**. If you regress one random walk on another, unrelated random walk, you will very often find a statistically significant relationship ($t$-statistics are high, $R^2$ is high) even when no true economic relationship exists.
The shared trend in the data creates a statistical illusion. In the last lesson, we solved this by differencing the data to make it stationary before putting it into a VAR model. However, differencing is a blunt instrument. It throws away all information about the long-run co-movement of the variables in their levels.
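The illusion is easy to reproduce. The following is a minimal simulation sketch, assuming Gaussian random walks and a nominal 5% two-sided test (cutoff 1.96); the helper names `slope_t_stat` and `spurious_rejection_rate` are hypothetical and introduced only for illustration. It regresses pairs of independent random walks on each other and counts how often the slope looks "significant", first in levels and then after differencing.

```python
import numpy as np

def slope_t_stat(y, x):
    """t-statistic of the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

def spurious_rejection_rate(n_sims=500, n_obs=200, difference=False, seed=0):
    """Share of simulations in which two UNRELATED random walks give |t| > 1.96."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Two independent random walks: cumulative sums of white noise.
        x = rng.standard_normal(n_obs).cumsum()
        y = rng.standard_normal(n_obs).cumsum()
        if difference:
            # First differences turn each random walk back into white noise.
            x, y = np.diff(x), np.diff(y)
        if abs(slope_t_stat(y, x)) > 1.96:
            rejections += 1
    return rejections / n_sims

rate_levels = spurious_rejection_rate(difference=False)
rate_diffs = spurious_rejection_rate(difference=True)
print(f"'Significant' slopes in levels:      {rate_levels:.0%}")
print(f"'Significant' slopes in differences: {rate_diffs:.0%}")
```

In levels the rejection rate should come out far above the nominal 5%, while in differences it should sit close to it. The differenced regression is honest, but it no longer says anything about how the levels move together.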
The Core Analogy: The Drunk and Her Dog
Imagine a drunk person walking randomly through a park. Their path is a classic random walk ($x_t$) and is unpredictable. Now, imagine they are walking their dog on a leash. The dog is also free to wander randomly, so its path is also a random walk ($y_t$).
- If you look at the drunk's path in isolation, it's non-stationary. The same is true for the dog's path.
- However, their paths are not independent. The leash ensures they can't wander arbitrarily far apart from each other. The **distance between them** is constrained.
- While both paths are non-stationary, the spread (or the linear combination: $y_t - x_t$) is **stationary**. It hovers around a mean of zero.
This is the essence of cointegration. The drunk and her dog are **cointegrated**. They are a system with a long-run equilibrium relationship (the leash) that pulls them back together whenever they stray too far apart.
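The analogy can be simulated directly. This is a rough sketch under stated assumptions: the drunk follows a Gaussian random walk, and the leash is modelled as a stationary AR(1) gap with a hypothetical persistence of 0.9 (the name `leash_pull` and its value are illustrative choices, not part of the analogy itself).

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps = 5000

# The drunk: a pure random walk.
drunk = rng.standard_normal(n_steps).cumsum()

# The leash: a stationary AR(1) gap that is continually pulled back toward zero.
leash_pull = 0.9  # assumed persistence of the gap; must be below 1 in absolute value
shocks = rng.standard_normal(n_steps)
gap = np.zeros(n_steps)
for t in range(1, n_steps):
    gap[t] = leash_pull * gap[t - 1] + shocks[t]

# The dog: wherever the drunk is, plus the leash-constrained gap.
dog = drunk + gap
spread = dog - drunk

# Compare how far each series roams (max minus min over the whole walk).
drunk_range = np.ptp(drunk)
dog_range = np.ptp(dog)
spread_range = np.ptp(spread)
print(f"Range of the drunk's path: {drunk_range:.1f}")
print(f"Range of the dog's path:   {dog_range:.1f}")
print(f"Range of the spread:       {spread_range:.1f}")
print(f"Mean of the spread:        {spread.mean():.2f}")
```

Each path wanders over a wide range that keeps growing with the length of the walk, while the spread stays within a narrow band around zero.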
Part 2: The Formal Definition of Cointegration
Cointegration gives a formal mathematical structure to this "leash" concept.
Definition of Cointegration
Let $x_t$ and $y_t$ be two non-stationary time series that are both integrated of order 1, i.e., $x_t \sim I(1)$ and $y_t \sim I(1)$.
They are said to be **cointegrated** if there exists a linear combination of them that is stationary ($I(0)$). That is, if there exists a coefficient $\beta$ such that the series $z_t$:
$$z_t = y_t - \beta x_t$$
is a stationary ($I(0)$) process.
- The vector $(1, -\beta)$ is called the **cointegrating vector**.
- The stationary series $z_t$ represents the **equilibrium error** or the "spread." When $z_t$ is far from its mean, we expect a correction to occur in $x_t$ and/or $y_t$ to restore the long-run balance.
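A short worked example may help make the definition concrete. It assumes a specific, deliberately simple data-generating process (the shocks $\varepsilon_t$, $\eta_t$, the error $u_t$, and the persistence $\rho$ are illustrative symbols introduced here, not part of the definition):

```latex
\begin{align*}
x_t &= x_{t-1} + \varepsilon_t
      && \text{(random walk, so } x_t \sim I(1)\text{)} \\
y_t &= \beta x_t + u_t, \qquad u_t = \rho\, u_{t-1} + \eta_t, \quad |\rho| < 1
      && \text{(} u_t \text{ is stationary)} \\
\Delta y_t &= \beta\, \varepsilon_t + \Delta u_t
      && \text{(stationary, so } y_t \sim I(1) \text{ when } \beta \neq 0\text{)} \\
z_t &= y_t - \beta x_t = u_t
      && \text{(stationary, so } z_t \sim I(0)\text{)} \\
y_t - b\, x_t &= (\beta - b)\, x_t + u_t
      && \text{(still } I(1) \text{ for any } b \neq \beta\text{)}
\end{align*}
```

Under these assumptions only the coefficient $\beta$ removes the random-walk component, which is why the cointegrating vector $(1, -\beta)$ is unique up to scale in the two-variable case.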
Part 3: The Engle-Granger Two-Step Test for Cointegration
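How do we test if two series are cointegrated? The most intuitive method is the two-step procedure developed by Engle and Granger: estimate the long-run relationship by OLS, then test its residuals for a unit root. In practice it is preceded by a unit-root check on each series, which gives the three steps below.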
- Step 1: Test for Unit Roots in Individual Series. First, confirm that both series, $x_t$ and $y_t$, are non-stationary and have the same order of integration (usually $I(1)$). Use the Augmented Dickey-Fuller (ADF) test on each series. If one is $I(1)$ and the other is $I(0)$, they cannot be cointegrated.
- Step 2: Run the 'Cointegrating Regression'. If both series are $I(1)$, run a simple OLS regression of one on the other to estimate the long-run relationship: $y_t = \alpha + \beta x_t + u_t$. Save the residuals from this regression: $\hat{u}_t = y_t - \hat{\alpha} - \hat{\beta} x_t$. This residual series, $\hat{u}_t$, is our estimated "spread" or "equilibrium error" (the spread $z_t$ measured from its estimated mean).
- Step 3: Test the Residuals for a Unit Root. The final, critical step is to test whether this residual series is stationary. We perform an ADF test on the residuals.
- $H_0$: The residuals have a unit root (i.e., they are non-stationary). This means the series are **NOT** cointegrated.
- $H_1$: The residuals are stationary. This means the series **ARE** cointegrated.
Crucial Detail:
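When performing the ADF test on residuals from a cointegrating regression, we cannot use the standard critical values. OLS chooses the coefficients to make the residuals look as stationary as possible, so the standard values would reject the null too often. We must use a special set of critical values, often called Engle-Granger or MacKinnon critical values, which are more stringent (more negative). Dedicated cointegration routines, such as coint() in statsmodels, apply them automatically; a plain ADF routine applied to saved residuals does not.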
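The procedure can be sketched by hand on simulated data. This is a simplified illustration rather than a production test: it assumes a known cointegrated pair (true slope 0.8, AR(1) equilibrium error), uses a plain Dickey-Fuller regression with no lagged differences instead of a full ADF test, and compares the statistic against -3.34, an approximate asymptotic 5% Engle-Granger/MacKinnon value for two variables with a constant. The helper names `ols` and `df_t_stat` are hypothetical.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients and residuals for y = X @ coef + resid."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

def df_t_stat(series):
    """Dickey-Fuller t-statistic (no constant, no lagged differences).

    Regresses the change in the series on its lagged level. A strongly
    negative t-statistic is evidence against a unit root.
    """
    lagged = series[:-1]
    change = np.diff(series)
    gamma = (lagged @ change) / (lagged @ lagged)
    resid = change - gamma * lagged
    sigma2 = resid @ resid / (len(change) - 1)
    return gamma / np.sqrt(sigma2 / (lagged @ lagged))

# Simulate a cointegrated pair (assumed data-generating process).
rng = np.random.default_rng(7)
n_obs = 500
x = rng.standard_normal(n_obs).cumsum()          # I(1) random walk
eta = rng.standard_normal(n_obs)
u = np.zeros(n_obs)
for t in range(1, n_obs):
    u[t] = 0.5 * u[t - 1] + eta[t]               # stationary equilibrium error
y = 2.0 + 0.8 * x + u                            # tied to x in the long run

# Step 2: cointegrating regression of y on a constant and x.
X = np.column_stack([np.ones(n_obs), x])
(alpha_hat, beta_hat), resid = ols(y, X)

# Step 3: unit-root test on the residuals, judged against the
# Engle-Granger critical value rather than the standard ADF one.
EG_CRIT_5PCT = -3.34  # approximate asymptotic value, two variables, constant
stat = df_t_stat(resid)

print(f"Estimated beta: {beta_hat:.3f}")
print(f"Residual DF t-statistic: {stat:.2f} (5% critical value about {EG_CRIT_5PCT})")
print("Reject 'no cointegration':", stat < EG_CRIT_5PCT)
```

Note that the cutoff is more negative than the roughly -2.86 used for a standard ADF test with a constant, which is the extra stringency described above.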
Part 4: The Quant's Payoff - Pairs Trading
Cointegration is not just an abstract statistical property; it is the theoretical foundation for one of the most famous quantitative trading strategies: **pairs trading**.
The Pairs Trading Strategy
The strategy works as follows:
- Find a Cointegrated Pair: Identify two assets (e.g., two similar stocks like Coca-Cola and Pepsi, or two related ETFs) that are cointegrated. This means their prices are bound by a long-run equilibrium relationship.
- Model the Spread: Run the cointegrating regression to get the equilibrium relationship ($\hat{\beta}$) and calculate the historical spread series, $\hat{z}_t = y_t - \hat{\beta} x_t$.
- Generate Trading Signals: Since the spread is stationary (mean-reverting), we can define trading thresholds. For example, we might calculate the standard deviation of the spread, $\sigma_z$, and set entry thresholds at $\pm k\sigma_z$ around the spread's mean, for a chosen multiple $k$.
- Execute Trades:
- When the spread rises above $+k\sigma_z$, it means $y$ is "too expensive" relative to $x$. The strategy would be to **short** the spread: sell $y$ and buy $\hat{\beta}$ units of $x$.
- When the spread falls below $-k\sigma_z$, it means $y$ is "too cheap." The strategy would be to **go long** the spread: buy $y$ and sell $\hat{\beta}$ units of $x$.
- Exit the Trade: The position is closed when the spread reverts back to its mean (i.e., the demeaned spread crosses zero).
This strategy is fundamentally a bet on the stability of the cointegrating relationship. It is designed to be market-neutral: because you are always long one asset and short another, you are largely insulated from the overall direction of the market.
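The entry and exit rules above can be sketched as a small state machine. This is an illustrative sketch only: the function name `pairs_positions`, its parameters, and the default multiple `entry_k=2.0` are assumptions made for the example, and the spread's mean and standard deviation are taken as given (for instance, estimated on an earlier historical window). It ignores transaction costs, position sizing, and the risk that the relationship breaks down.

```python
import numpy as np

def pairs_positions(spread, mean, std, entry_k=2.0):
    """Position in the spread at each step: -1 short, +1 long, 0 flat.

    Short the spread when its z-score exceeds +entry_k, go long when it
    falls below -entry_k, and close the position once the z-score crosses
    back through zero (the spread has reverted to its mean).
    """
    z = (np.asarray(spread, dtype=float) - mean) / std
    positions = np.zeros(len(z), dtype=int)
    current = 0
    for t, z_t in enumerate(z):
        if current == 0:
            if z_t > entry_k:
                current = -1      # spread too high: sell y, buy beta units of x
            elif z_t < -entry_k:
                current = 1       # spread too low: buy y, sell beta units of x
        elif current == -1 and z_t <= 0:
            current = 0           # short position exits at the mean
        elif current == 1 and z_t >= 0:
            current = 0           # long position exits at the mean
        positions[t] = current
    return positions

# Toy spread, already expressed with mean 0 and standard deviation 1.
toy_spread = [0.0, 1.0, 2.5, 1.2, -0.3, -2.4, -1.0, 0.4, 0.0]
positions = pairs_positions(toy_spread, mean=0.0, std=1.0, entry_k=2.0)
print(positions.tolist())  # [0, 0, -1, -1, 0, 1, 1, 0, 0]
```

Passing the mean and standard deviation in explicitly, rather than computing them from the same series being traded, keeps the signal free of look-ahead when the function is applied to new data.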
Part 5: Python Implementation - Finding a Cointegrated Pair
Testing for Cointegration Between Two ETFs
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, coint
# --- 1. Get and Prepare Data ---
# Let's test two ETFs that track emerging markets: EEM and VWO
# They should theoretically be cointegrated.
# auto_adjust=True puts split- and dividend-adjusted prices in 'Close';
# newer yfinance releases may not return an 'Adj Close' column by default.
prices = yf.download(['EEM', 'VWO'], start='2010-01-01', end='2023-12-31', auto_adjust=True)['Close']
data = prices[['EEM', 'VWO']].dropna()
# --- 2. Step 1: Test for Unit Roots in Individual Series ---
adf_eem = adfuller(data['EEM'])
adf_vwo = adfuller(data['VWO'])
print(f"ADF p-value for EEM: {adf_eem[1]:.4f}")
print(f"ADF p-value for VWO: {adf_vwo[1]:.4f}")
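# We expect both p-values to be high (> 0.05): we fail to reject a unit root in levels.
# Strictly, confirming I(1) also requires that the first differences are stationary.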
# --- 3. Step 2 & 3: Run Engle-Granger Cointegration Test ---
# The coint() function in statsmodels does the two steps for us:
# 1. Runs the OLS regression: EEM = beta * VWO + const
# 2. Runs the ADF test on the residuals with the correct critical values.
coint_result = coint(data['EEM'], data['VWO'])
coint_t_statistic = coint_result[0]
coint_p_value = coint_result[1]
critical_values = coint_result[2]
print(f"\nCointegration Test t-statistic: {coint_t_statistic:.4f}")
print(f"Cointegration Test p-value: {coint_p_value:.4f}")
print("Critical Values (1%, 5%, 10%):", critical_values)
# --- 4. Interpret the Results ---
# H₀: The series are NOT cointegrated.
# If the p-value is low (< 0.05), we REJECT H₀ and conclude they are cointegrated.
if coint_p_value < 0.05:
    print("\nResult: The series appear to be cointegrated.")
else:
    print("\nResult: The series do not appear to be cointegrated.")
# --- 5. Visualize the Spread (Equilibrium Error) ---
# To get the spread, we need to run the OLS regression ourselves.
import statsmodels.api as sm
X = sm.add_constant(data['VWO'])
model = sm.OLS(data['EEM'], X).fit()
beta = model.params['VWO']
spread = data['EEM'] - beta * data['VWO']
spread.plot(figsize=(12,6), title='EEM-VWO Cointegrated Spread')
plt.axhline(spread.mean(), color='red', linestyle='--')
plt.ylabel('Spread')
plt.show()
# If the pair is cointegrated, the plot should show a stationary, mean-reverting series.
What's Next? Reconnecting the Short Run and the Long Run
We have a puzzle. In the last lesson, we learned that to model non-stationary systems with a VAR, we must difference the data. But in this lesson, we learned that differencing throws away crucial information about the long-run cointegrating relationship.
How can we build a model that does both? How can we capture the short-run dynamics of the differenced data while also respecting the long-run equilibrium of the levels data?
The solution is a sophisticated model that combines the VAR framework with the concept of cointegration. In our final modeling lesson, we will introduce the **Vector Error Correction Model (VECM)**.
Up Next: Putting It Together: The Vector Error Correction Model (VECM)