Lesson 6.7: Finding Long-Run Relationships: Cointegration

We now explore one of the most profound ideas in time series analysis. Cointegration allows us to find stable, long-run equilibrium relationships between non-stationary variables. This Nobel Prize-winning concept, developed by Clive Granger and Robert Engle, moves beyond short-term dynamics to uncover the fundamental anchors that tie economic and financial systems together.

Part 1: The Problem - The 'Spurious Regression'

A major pitfall when working with non-stationary time series (like stock prices or GDP) is the problem of **spurious regression**. If you regress one random walk on another, unrelated random walk, you will very often find a statistically significant relationship ($t$-statistics are high, $R^2$ is high) even when no true economic relationship exists.

Each series carries its own stochastic trend, and over a finite sample two independent trends can easily appear to move together. That apparent common trend creates a statistical illusion. In the last lesson, we solved this by differencing the data to make it stationary before putting it into a VAR model. However, differencing is a blunt instrument. It throws away all information about the long-run co-movement of the variables in their levels.
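
The spurious-regression claim can be checked by simulation. The sketch below is illustrative only: the sample sizes, the seed, and the helper names `slope_t_stat` and `rejection_rate` are assumptions, not part of the lesson. It regresses one independent random walk on another many times and counts how often the slope looks "significant" at the nominal 5% level, first in levels and then in first differences.

```python
import numpy as np

def slope_t_stat(y, x):
    """OLS t-statistic on the slope of y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

def rejection_rate(n_sims=500, n_obs=500, difference=False, seed=0):
    """Share of simulations where |t| > 1.96 for two unrelated random walks."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # Two independent random walks: no true relationship exists
        y = np.cumsum(rng.standard_normal(n_obs))
        x = np.cumsum(rng.standard_normal(n_obs))
        if difference:
            y, x = np.diff(y), np.diff(x)
        if abs(slope_t_stat(y, x)) > 1.96:
            hits += 1
    return hits / n_sims

levels_rate = rejection_rate(difference=False)
diffs_rate = rejection_rate(difference=True)
print(f"'Significant' slope in levels:      {levels_rate:.1%}")
print(f"'Significant' slope in differences: {diffs_rate:.1%}")
```

In levels the rejection rate should come out far above the nominal 5%, while in differences it should sit close to it. This is the illusion that differencing removes, at the cost of discarding the levels information.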

The Core Analogy: The Drunk and Her Dog

Imagine a drunk person walking randomly through a park. Their path is a classic random walk ($I(1)$) and is unpredictable. Now, imagine they are walking their dog on a leash. The dog is also free to wander randomly, so its path is also a random walk ($I(1)$).

  • If you look at the drunk's path in isolation, it's non-stationary. The same is true for the dog's path.
  • However, their paths are not independent. The leash ensures they can't wander arbitrarily far apart from each other. The **distance between them** is constrained.
  • While both paths are non-stationary, the spread (or the linear combination: $\text{Drunk's Position} - \text{Dog's Position}$) is **stationary**. It hovers around a mean of zero.

This is the essence of cointegration. The drunk and her dog are **cointegrated**. They are a system with a long-run equilibrium relationship (the leash) that pulls them back together whenever they stray too far apart.
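
The analogy can be made concrete with a small simulation. This is a sketch under stated assumptions: the drunk follows a Gaussian random walk, and the "leash" is modeled as a stationary AR(1) gap with coefficient 0.8. Both choices are illustrative, not taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# The drunk: a pure random walk
drunk = np.cumsum(rng.standard_normal(n))

# The leash: a stationary AR(1) gap between dog and drunk (assumed form)
gap = np.zeros(n)
shocks = rng.standard_normal(n)
for t in range(1, n):
    gap[t] = 0.8 * gap[t - 1] + shocks[t]

# The dog wanders, but only as far as the leash allows
dog = drunk + gap
spread = drunk - dog

print(f"Range of drunk's path: {np.ptp(drunk):.1f}")
print(f"Range of dog's path:   {np.ptp(dog):.1f}")
print(f"Range of the spread:   {np.ptp(spread):.1f}")
```

Each path covers a wide range that keeps growing with the sample length, while the spread stays inside a narrow band around zero.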

Part 2: The Formal Definition of Cointegration

Cointegration gives a formal mathematical structure to this "leash" concept.

Definition of Cointegration

Let $Y_t$ and $X_t$ be two non-stationary time series that are both integrated of order 1, i.e., $Y_t, X_t \sim I(1)$.

They are said to be **cointegrated** if there exists a linear combination of them that is stationary ($I(0)$). That is, if there exists a coefficient $\beta$ such that the series $Z_t$:

$$Z_t = Y_t - \beta X_t$$

is a stationary ($I(0)$) process.

  • The vector $[1, -\beta]$ is called the **cointegrating vector**.
  • The stationary series $Z_t$ represents the **equilibrium error** or the "spread." When $Z_t$ is far from its mean, we expect a correction to occur in $Y_t$ and/or $X_t$ to restore the long-run balance.
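
A constructed example may help fix the definition. The data-generating process below is an assumption chosen for simplicity, not something the lesson specifies: $X_t$ is a random walk and $Y_t$ tracks it up to white noise.

```latex
\begin{aligned}
X_t &= X_{t-1} + \varepsilon_t, & \varepsilon_t &\sim \text{i.i.d.}(0, \sigma_\varepsilon^2) \\
Y_t &= \beta X_t + u_t,         & u_t &\sim \text{i.i.d.}(0, \sigma_u^2) \\
Z_t &= Y_t - \beta X_t = u_t    & &\Rightarrow\; Z_t \sim I(0)
\end{aligned}
```

Here $X_t \sim I(1)$ by construction, and $Y_t \sim I(1)$ because it inherits the stochastic trend of $X_t$. The combination with weights $[1, -\beta]$ cancels that common trend exactly and leaves only the stationary noise $u_t$.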

Part 3: The Engle-Granger Two-Step Test for Cointegration

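How do we test if two series are cointegrated? The most intuitive method is the two-step procedure developed by Engle and Granger: estimate the long-run regression, then test its residuals for a unit root. A unit-root pre-test on each individual series comes first as a preliminary check, which is why the procedure below lists three steps.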

The Engle-Granger Testing Procedure
  1. Step 1: Test for Unit Roots in Individual Series. First, confirm that both series, $Y_t$ and $X_t$, are non-stationary and have the same order of integration (usually $I(1)$). Use the Augmented Dickey-Fuller (ADF) test on each series. If one is $I(1)$ and the other is $I(0)$, they cannot be cointegrated.
  2. Step 2: Run the 'Cointegrating Regression'. If both series are $I(1)$, run a simple OLS regression of one on the other to estimate the long-run relationship:
    $$Y_t = \alpha + \beta X_t + Z_t$$
    Save the residuals from this regression: $\hat{Z}_t = Y_t - \hat{\alpha} - \hat{\beta} X_t$. This residual series, $\hat{Z}_t$, is our estimated "spread" or "equilibrium error."
  3. Step 3: Test the Residuals for a Unit Root. The final, critical step is to test whether this residual series $\hat{Z}_t$ is stationary. We perform an ADF test on the residuals.
    • $H_0$: The residuals have a unit root (i.e., they are non-stationary). This means the series are **NOT** cointegrated.
    • $H_1$: The residuals are stationary. This means the series **ARE** cointegrated.
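
Steps 2 and 3 can be sketched by hand on simulated data. The code below is a simplified illustration, not a production test. It uses a plain Dickey-Fuller regression without lag augmentation (so it is DF rather than ADF), the helper names `ols` and `df_t_stat` are hypothetical, the data are cointegrated by construction, and the 5% critical value of about -3.34 is the approximate asymptotic MacKinnon value for two variables with a constant.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients and residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

def df_t_stat(z):
    """Dickey-Fuller t-statistic from dz_t = rho * z_{t-1} + e_t (no constant, no lags)."""
    dz, z_lag = np.diff(z), z[:-1]
    rho = (z_lag @ dz) / (z_lag @ z_lag)
    e = dz - rho * z_lag
    se = np.sqrt((e @ e) / (len(dz) - 1) / (z_lag @ z_lag))
    return rho / se

# Simulated data (assumption): x is I(1), y is tied to x by construction
rng = np.random.default_rng(7)
n = 1000
x = np.cumsum(rng.standard_normal(n))
y = 2.0 + 0.5 * x + rng.standard_normal(n)

# Step 2: cointegrating regression y_t = alpha + beta * x_t + z_t
(alpha_hat, beta_hat), z_hat = ols(y, np.column_stack([np.ones(n), x]))

# Step 3: unit-root test on the residuals
t_stat = df_t_stat(z_hat)
EG_CRIT_5PCT = -3.34  # approximate Engle-Granger/MacKinnon 5% value, 2 series, constant

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"Residual DF t-statistic = {t_stat:.2f}")
print("Reject H0 (cointegrated)" if t_stat < EG_CRIT_5PCT else "Fail to reject H0")
```

With real data the residuals are usually autocorrelated, so lagged differences would be added to the test regression. That lag selection is what a library routine handles for you.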

Crucial Detail:

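When performing the ADF test on residuals from a cointegrating regression, we cannot use the standard critical values. OLS chooses $\hat{\beta}$ to make the residuals look as stationary as possible, so the test statistic is biased toward rejecting the unit root. We must use a special set of critical values, often called Engle-Granger or MacKinnon critical values, which are more stringent. Dedicated cointegration routines (such as `coint` in statsmodels) apply them automatically; a plain ADF test run on saved residuals does not.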
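
A small Monte Carlo experiment suggests why the stricter critical values matter. This is a sketch under assumptions: the series are independent Gaussian random walks (so $H_0$ is true in every replication), the residual test is a plain Dickey-Fuller regression without a constant or lags, and the two cutoffs are approximate 5% values. About -1.95 is the ordinary Dickey-Fuller value for that regression, and about -3.34 is the Engle-Granger/MacKinnon value for two series with a constant.

```python
import numpy as np

def residual_df_t_stat(y, x):
    """DF t-statistic on residuals of y = a + b*x (no constant, no lags in the DF step)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    z = y - X @ coef
    dz, z_lag = np.diff(z), z[:-1]
    rho = (z_lag @ dz) / (z_lag @ z_lag)
    e = dz - rho * z_lag
    se = np.sqrt((e @ e) / (len(dz) - 1) / (z_lag @ z_lag))
    return rho / se

rng = np.random.default_rng(1)
n_sims, n_obs = 1000, 250

# Every pair is two unrelated random walks, so any rejection is a false positive
stats = np.array([
    residual_df_t_stat(np.cumsum(rng.standard_normal(n_obs)),
                       np.cumsum(rng.standard_normal(n_obs)))
    for _ in range(n_sims)
])

naive_rate = np.mean(stats < -1.95)  # ordinary DF critical value (approximate)
eg_rate = np.mean(stats < -3.34)     # Engle-Granger critical value (approximate)

print(f"False positives with ordinary DF cutoff: {naive_rate:.1%}")
print(f"False positives with Engle-Granger cutoff: {eg_rate:.1%}")
```

The ordinary cutoff should flag far more than 5% of unrelated pairs as cointegrated, while the Engle-Granger cutoff should keep the false-positive rate near its nominal level.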

Part 4: The Quant's Payoff - Pairs Trading

Cointegration is not just an abstract statistical property; it is the theoretical foundation for one of the most famous quantitative trading strategies: **pairs trading**.

The Pairs Trading Strategy

The strategy works as follows:

  1. Find a Cointegrated Pair: Identify two assets (e.g., two similar stocks like Coca-Cola and Pepsi, or two related ETFs) that are cointegrated. This means their prices are bound by a long-run equilibrium relationship.
  2. Model the Spread: Run the cointegrating regression to get the equilibrium relationship ($\hat{\alpha}$, $\hat{\beta}$) and calculate the historical spread series, $\hat{Z}_t = Y_t - \hat{\alpha} - \hat{\beta} X_t$. Subtracting the intercept centers the spread on zero.
  3. Generate Trading Signals: Since the spread $\hat{Z}_t$ is stationary (mean-reverting), we can define trading thresholds. For example, we might calculate the standard deviation of the spread, $\sigma_Z$, and set entry thresholds at $\pm 2\sigma_Z$.
  4. Execute Trades:
    • When the spread $\hat{Z}_t$ rises above $+2\sigma_Z$, it means $Y_t$ is "too expensive" relative to $X_t$. The strategy would be to **short** the spread: sell one unit of $Y$ and buy $\hat{\beta}$ units of $X$.
    • When the spread falls below $-2\sigma_Z$, it means $Y_t$ is "too cheap." The strategy would be to **go long** the spread: buy one unit of $Y$ and sell $\hat{\beta}$ units of $X$.
  5. Exit the Trade: The position is closed when the spread reverts back to its mean (i.e., crosses zero).
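
The entry and exit rules above can be sketched as a small state machine. The function name `pairs_positions` and its parameters are hypothetical. The sketch assumes the spread mean and standard deviation were estimated beforehand on historical data, and it ignores transaction costs, position sizing, and stop-losses.

```python
import numpy as np

def pairs_positions(spread, mean, sigma, entry_z=2.0):
    """Position in the spread at each step: -1 short, +1 long, 0 flat.

    -1 means: short one unit of Y, long beta units of X.
    +1 means: long one unit of Y, short beta units of X.
    """
    z = (np.asarray(spread, dtype=float) - mean) / sigma
    positions = np.zeros(len(z), dtype=int)
    current = 0
    for t, z_t in enumerate(z):
        if current == 0:
            if z_t > entry_z:        # Y looks too expensive: short the spread
                current = -1
            elif z_t < -entry_z:     # Y looks too cheap: go long the spread
                current = 1
        elif current == -1 and z_t <= 0:
            current = 0              # spread reverted to its mean: close the short
        elif current == 1 and z_t >= 0:
            current = 0              # spread reverted to its mean: close the long
        positions[t] = current
    return positions

# Toy spread with mean 0 and sigma 1 (made-up numbers for illustration)
toy_spread = [0.0, 1.0, 2.5, 1.0, -0.1, -2.5, -1.0, 0.2, 0.5]
positions = pairs_positions(toy_spread, mean=0.0, sigma=1.0)
print(positions.tolist())
```

Passing `mean` and `sigma` in explicitly, rather than computing them from the same series being traded, is a deliberate choice here: it keeps the thresholds from using information that would not have been available at the time of each trade.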

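This strategy is fundamentally a bet on the stability of the cointegrating relationship: if the relationship breaks down, the spread may never revert and losses can grow. It is often described as market-neutral because you are always long one asset and short another, which largely (though not perfectly) insulates you from the overall direction of the market.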

Part 5: Python Implementation - Finding a Cointegrated Pair

Testing for Cointegration Between Two ETFs

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, coint

# --- 1. Get and Prepare Data ---
# Let's test two ETFs that track emerging markets: EEM and VWO
# Both follow similar baskets, so cointegration is plausible; the test decides.
# auto_adjust=True puts split- and dividend-adjusted prices in the 'Close' column;
# recent yfinance versions no longer return an 'Adj Close' column by default.
prices = yf.download(['EEM', 'VWO'], start='2010-01-01', end='2023-12-31',
                     auto_adjust=True)['Close']
data = prices[['EEM', 'VWO']].dropna()

# --- 2. Step 1: Test for Unit Roots in Individual Series ---
adf_eem = adfuller(data['EEM'])
adf_vwo = adfuller(data['VWO'])

print(f"ADF p-value for EEM: {adf_eem[1]:.4f}")
print(f"ADF p-value for VWO: {adf_vwo[1]:.4f}")
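# We expect both p-values to be high (> 0.05): we fail to reject a unit root in levels.
# To support I(1) rather than a higher order, the first differences should test as stationary.
adf_eem_diff = adfuller(data['EEM'].diff().dropna())
adf_vwo_diff = adfuller(data['VWO'].diff().dropna())
print(f"ADF p-value for differenced EEM: {adf_eem_diff[1]:.4f}")
print(f"ADF p-value for differenced VWO: {adf_vwo_diff[1]:.4f}")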

# --- 3. Step 2 & 3: Run Engle-Granger Cointegration Test ---
# The coint() function in statsmodels does the two steps for us:
# 1. Runs the OLS regression: EEM = beta * VWO + const
# 2. Runs the ADF test on the residuals with the correct critical values.

coint_result = coint(data['EEM'], data['VWO'])
coint_t_statistic = coint_result[0]
coint_p_value = coint_result[1]
critical_values = coint_result[2]

print(f"\nCointegration Test t-statistic: {coint_t_statistic:.4f}")
print(f"Cointegration Test p-value: {coint_p_value:.4f}")
print("Critical Values (1%, 5%, 10%):", critical_values)

# --- 4. Interpret the Results ---
# H₀: The series are NOT cointegrated.
# If the p-value is low (< 0.05), we REJECT H₀ and conclude they are cointegrated.
if coint_p_value < 0.05:
    print("\nResult: The series appear to be cointegrated.")
else:
    print("\nResult: The series do not appear to be cointegrated.")

# --- 5. Visualize the Spread (Equilibrium Error) ---
# coint() does not return the residuals, so we run the OLS regression ourselves.
import statsmodels.api as sm
X = sm.add_constant(data['VWO'])
model = sm.OLS(data['EEM'], X).fit()
alpha, beta = model.params['const'], model.params['VWO']
# Equilibrium error: subtract the intercept as well, so the spread is centered on zero
spread = data['EEM'] - alpha - beta * data['VWO']

spread.plot(figsize=(12,6), title='EEM-VWO Spread (Equilibrium Error)')
plt.axhline(spread.mean(), color='red', linestyle='--')
plt.ylabel('Spread')
plt.show()
# If the pair is cointegrated, the plot should show a stationary, mean-reverting series.

What's Next? Reconnecting the Short Run and the Long Run

We have a puzzle. In the last lesson, we learned that to model non-stationary systems with a VAR, we must difference the data. But in this lesson, we learned that differencing throws away crucial information about the long-run cointegrating relationship.

How can we build a model that does both? How can we capture the short-run dynamics of the differenced data while also respecting the long-run equilibrium of the levels data?

The solution is a sophisticated model that combines the VAR framework with the concept of cointegration. In our final modeling lesson, we will introduce the **Vector Error Correction Model (VECM)**.

Up Next: Putting It Together: The Vector Error Correction Model (VECM)