Lesson 5.10: Capstone: Building a GARCH Model to Forecast Stock Market Volatility

This capstone lesson synthesizes our entire univariate time series module into a single, practical project. We will follow the complete, end-to-end workflow of a professional quant to model and forecast the volatility of the S&P 500. We will combine the Box-Jenkins methodology for the mean with the GARCH framework for the variance to build a robust, real-world risk model.

Part 1: The Objective - Forecasting a Key Financial Indicator

In quantitative finance, forecasting the direction of the market (the conditional mean) is incredibly difficult due to market efficiency. However, forecasting the *magnitude* of its future movements (the conditional variance, or volatility) is not only possible but is a cornerstone of modern risk management, options pricing, and portfolio construction.

Our objective in this capstone is to build a complete model for the daily returns of the S&P 500 index that can provide a statistically sound forecast for its volatility over the next month. We will treat this as a formal project, following a rigorous, multi-step process.

The Professional Quant's Workflow

We will follow a structured, four-phase approach:

Data Preparation and Exploration: Acquire the data, calculate returns, and identify the key "stylized facts" we need to model.
Mean Model Specification (ARMA): Use the Box-Jenkins methodology to find the best-fitting ARMA model for the conditional mean of the returns. Because returns are already stationary, no differencing is needed.
Volatility Model Specification (GARCH): Analyze the residuals from our mean model to identify volatility clustering, and then fit an appropriate GARCH model to the conditional variance.
Forecasting and Interpretation: Use the combined model to generate a multi-step-ahead volatility forecast and interpret its practical meaning.

Part 2: Phase 1 - Data Preparation and Exploration

Our first step is to get the data for the S&P 500 index (^GSPC) and transform it into a series of daily returns. We will use log returns, as they are time-additive and are the standard in academic and professional practice.
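
To make the time-additivity claim concrete, here is a minimal sketch using a handful of made-up prices (the numbers are purely illustrative, not S&P 500 data). It shows that daily log returns sum to the log return over the whole period, while simple returns do not.

```python
import numpy as np

# Hypothetical closing prices for five consecutive trading days
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0])

# Daily log returns: r_t = ln(P_t / P_{t-1})
log_returns = np.diff(np.log(prices))

# Daily simple returns: R_t = P_t / P_{t-1} - 1
simple_returns = prices[1:] / prices[:-1] - 1

# Time-additivity: daily log returns sum to the log return over the whole period
total_log_return = np.log(prices[-1] / prices[0])
print(np.isclose(log_returns.sum(), total_log_return))        # True

# Simple returns compound multiplicatively, so their plain sum does not match
total_simple_return = prices[-1] / prices[0] - 1
print(np.isclose(simple_returns.sum(), total_simple_return))  # False
```

This additivity is what lets us aggregate daily log returns (and, later, daily variances) over longer horizons by simple summation.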

Visualizing the Stylized Facts

After calculating the returns, we plot them. We are looking for two key features:

Stationarity: The returns should fluctuate around a constant mean (very close to zero). They should not exhibit a clear trend. A formal ADF test will be used to confirm this.
Volatility Clustering: We should see clear periods of high turmoil (large price swings) and periods of relative calm (small price swings). This is the visual evidence that a GARCH model is necessary.
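
Volatility clustering can also be checked numerically rather than by eye. The sketch below is an illustration on simulated data, not on the S&P 500: the helpers `sample_acf` and `simulate_garch11` and all parameter values are assumptions made up for this example. It compares the autocorrelation of returns with the autocorrelation of squared returns.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations of x at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / denom for k in range(1, max_lag + 1)])

def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n zero-mean returns whose variance follows a GARCH(1,1) recursion."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    eps = np.empty(n)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the long-run variance
    for t in range(n):
        eps[t] = np.sqrt(sigma2) * z[t]
        sigma2 = omega + alpha * eps[t] ** 2 + beta * sigma2
    return eps

# Hypothetical parameters chosen only to produce visible clustering
sim_returns = simulate_garch11(n=10_000, omega=0.05, alpha=0.10, beta=0.85)

band = 1.96 / np.sqrt(len(sim_returns))  # approximate 95% band under white noise
acf_returns = sample_acf(sim_returns, max_lag=5)
acf_squared = sample_acf(sim_returns**2, max_lag=5)

print("95% band: +/-", round(band, 4))
print("ACF of returns:        ", np.round(acf_returns, 3))
print("ACF of squared returns:", np.round(acf_squared, 3))
```

Returns that are nearly uncorrelated in levels but clearly correlated in squares are the numerical signature of the calm and turbulent regimes we look for in the plot.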

Part 3: Phase 2 - Mean Model Specification (ARMA)

Although financial returns are very difficult to predict, they often exhibit small but statistically significant amounts of autocorrelation. We must model this "predictable" part first so that the residuals we pass to our GARCH model are as clean as possible.

The Box-Jenkins Process for Returns

We apply the Box-Jenkins methodology to our stationary returns series:

Identify (p,q): Plot the ACF and PACF of the daily returns. For financial returns, most lags typically sit inside the confidence bands, so we are looking for any significant spikes at the first few lags. Often, both the ACF and PACF will show a few small but significant spikes, suggesting that a mixed ARMA model is appropriate.
Estimate & Select: Based on the ACF/PACF plots, we will fit a few candidate ARMA(p,q) models (e.g., ARMA(1,1), ARMA(2,1), ARMA(1,2)). We will then use the AIC or BIC to select the model that best balances goodness of fit against the number of parameters. For financial returns, a simple ARMA(1,1) is often sufficient.
Diagnose: We examine the residuals of our chosen ARMA model. At this stage, we want to confirm that the ACF of the residuals shows no significant spikes. This confirms we have successfully modeled the conditional mean.
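
The selection step can be sketched as follows. To keep the example dependency-free, it uses pure AR(p) candidates fit by least squares as stand-ins for the ARMA candidates, and it runs on a simulated AR(1) series. Both are simplifying assumptions, and the helpers `fit_ar_ols` and `information_criteria` are hypothetical names. For true ARMA candidates you would typically rely on a library estimator such as the ARIMA class in statsmodels.

```python
import numpy as np

def fit_ar_ols(y, p, max_p):
    """Fit AR(p) with intercept by least squares on a common sample.

    The first max_p observations are dropped for every candidate so that
    all models are compared on exactly the same data points.
    """
    y = np.asarray(y, dtype=float)
    target = y[max_p:]
    cols = [np.ones_like(target)]
    cols += [y[max_p - k : len(y) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coefs
    return coefs, float(resid @ resid), len(target)

def information_criteria(rss, n_obs, n_params):
    """Gaussian AIC and BIC, up to a constant shared by all candidates."""
    fit_term = n_obs * np.log(rss / n_obs)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * np.log(n_obs)
    return aic, bic

# Simulated AR(1) series with a hypothetical coefficient of 0.5
rng = np.random.default_rng(42)
n, phi = 2000, 0.5
shocks = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + shocks[t]

max_p = 3
results = {}
for p in range(max_p + 1):
    _, rss, n_obs = fit_ar_ols(y, p, max_p)
    results[p] = information_criteria(rss, n_obs, n_params=p + 1)
    print(f"AR({p}): AIC={results[p][0]:.1f}  BIC={results[p][1]:.1f}")

best_p_bic = min(results, key=lambda p: results[p][1])
print("Order selected by BIC:", best_p_bic)
```

BIC penalizes extra parameters more heavily than AIC, so when the two disagree, BIC will point to the smaller model.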

Part 4: Phase 3 - Volatility Model Specification (GARCH)

Now we turn our attention to the residuals from our ARMA model. While their mean should be unpredictable, their variance is not. This is where we test for and model the ARCH effects.
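
As a rough illustration of what a test for ARCH effects computes, the sketch below implements Engle's LM statistic by hand: regress the squared residuals on their own lags and take T times the R-squared. The function name `arch_lm_statistic` and the simulated inputs are assumptions for this example. In practice the `het_arch` function from statsmodels performs this test and also reports a p-value.

```python
import numpy as np

def arch_lm_statistic(resid, n_lags):
    """Engle's LM statistic: T * R^2 from regressing e_t^2 on n_lags of its own lags."""
    e2 = np.asarray(resid, dtype=float) ** 2
    target = e2[n_lags:]
    cols = [np.ones_like(target)]
    cols += [e2[n_lags - k : len(e2) - k] for k in range(1, n_lags + 1)]
    X = np.column_stack(cols)
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    ss_res = np.sum((target - X @ coefs) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return len(target) * r_squared

rng = np.random.default_rng(1)
n = 5000

# Case 1: i.i.d. residuals with constant variance (no ARCH effects)
iid_resid = rng.standard_normal(n)

# Case 2: hypothetical ARCH(1) residuals whose variance depends on the last squared shock
arch_resid = np.empty(n)
prev = 0.0
for t in range(n):
    sigma2 = 0.5 + 0.5 * prev**2
    arch_resid[t] = np.sqrt(sigma2) * rng.standard_normal()
    prev = arch_resid[t]

# Under the null of no ARCH effects the statistic is approximately chi-squared
# with n_lags degrees of freedom; the 5% critical value for 5 lags is about 11.07.
critical_5pct = 11.07
lm_iid = arch_lm_statistic(iid_resid, n_lags=5)
lm_arch = arch_lm_statistic(arch_resid, n_lags=5)
print(f"LM statistic, i.i.d. residuals: {lm_iid:.2f}")
print(f"LM statistic, ARCH residuals:   {lm_arch:.2f} (reject if > {critical_5pct})")
```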

The GARCH Modeling Steps

Test for ARCH Effects: Before fitting a GARCH model, we should verify that it is needed. We take the residuals from our ARMA model and square them. We then plot the ACF of these squared residuals. Significant autocorrelation in the squared residuals is strong evidence of volatility clustering. We will confirm this formally with Engle's ARCH-LM test.
Specify the GARCH Model: Over decades of empirical research, the GARCH(1,1) specification has proven to be a robust and effective choice for most financial time series. We will specify a combined model: an ARMA(p,q) for the mean and a GARCH(1,1) for the variance, written here for p = q = 1:
$\text{Mean Equation: } R_t = c + \phi_1 R_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t$
$\text{Variance Equation: } \sigma_t^2 = \alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2$
Estimate the Combined Model: We fit the ARMA(p,q)-GARCH(1,1) model to the returns data. The software estimates all parameters simultaneously using Maximum Likelihood Estimation.
Diagnose the GARCH Model: We examine the **standardized residuals** from the full model ( $\hat{\epsilon}_t / \hat{\sigma}_t$ ). If the model is well-specified, these standardized residuals should show no remaining autocorrelation in either their levels or their squares. We check the ACF of the squared standardized residuals: the significant autocorrelation we saw before fitting should now be gone.
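
To see what the variance equation and the likelihood look like in code, here is a minimal sketch. It assumes a zero-mean shock series, Gaussian errors, and a recursion started at the unconditional variance. The function names are hypothetical, and `omega` plays the role of $\alpha_0$ in the variance equation. It is an illustration of the quantities involved, not the estimator used by any particular library.

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta):
    """Conditional variances: sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}."""
    eps = np.asarray(eps, dtype=float)
    sigma2 = np.empty_like(eps)
    sigma2[0] = omega / (1.0 - alpha - beta)  # assumption: start at the unconditional variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

def gaussian_neg_loglik(eps, omega, alpha, beta):
    """Negative Gaussian log-likelihood, the objective that MLE minimizes."""
    eps = np.asarray(eps, dtype=float)
    sigma2 = garch11_variance(eps, omega, alpha, beta)
    return 0.5 * np.sum(np.log(2.0 * np.pi) + np.log(sigma2) + eps**2 / sigma2)

# Tiny hypothetical shock series and parameters, chosen so the arithmetic is easy to follow
eps = np.array([1.0, -2.0, 0.5, 0.1])
sigma2 = garch11_variance(eps, omega=0.1, alpha=0.1, beta=0.8)
std_resid = eps / np.sqrt(sigma2)  # standardized residuals used in the diagnostics

print("Conditional variances: ", np.round(sigma2, 4))
print("Standardized residuals:", np.round(std_resid, 4))
print("Negative log-likelihood:", round(gaussian_neg_loglik(eps, 0.1, 0.1, 0.8), 4))
```

Note how the large shock of -2.0 raises the next period's variance from 1.0 to 1.3, after which the variance starts decaying back toward its long-run level.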

Part 5: Phase 4 - Forecasting and Interpretation

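Generating the Volatility Forecast

The following script runs all four phases end to end, from downloading the data to producing and plotting the volatility forecast. For clarity it uses a constant mean rather than a full ARMA mean equation.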

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import het_arch
from arch import arch_model

# --- Phase 1: Data Prep & Exploration ---
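# Recent yfinance releases adjust prices by default and no longer return an
# 'Adj Close' column, so request adjusted prices explicitly and use 'Close'.
sp500 = yf.download('^GSPC', start='2000-01-01', end='2023-12-31', auto_adjust=True)
# squeeze() yields a Series even when yfinance returns MultiIndex columns
prices = sp500['Close'].squeeze()
# Daily log returns in percent (scaling by 100 helps the optimizer converge)
returns = 100 * np.log(prices).diff().dropna()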

# Confirm stationarity
adf_test = adfuller(returns)
print(f"ADF p-value: {adf_test[1]:.4f}") # Expect a very low p-value

# --- Phase 2: Mean Model (ARMA) ---
# Visual inspection of ACF/PACF
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(returns, ax=ax[0], lags=20, title='ACF of S&P 500 Returns')
plot_pacf(returns, ax=ax[1], lags=20, title='PACF of S&P 500 Returns')
plt.tight_layout()
plt.show()
# Let's assume plots suggest an ARMA(1,1) is a reasonable starting point.

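# --- Phase 3: Volatility Model (GARCH) ---
# Engle's LM test for ARCH effects on the demeaned returns
lm_stat, lm_pvalue, _, _ = het_arch(returns - returns.mean())
print(f"ARCH-LM p-value: {lm_pvalue:.4f}") # A low p-value indicates ARCH effects

# The 'arch' library's mean models (e.g. 'Constant', 'Zero', 'AR', 'ARX') do not
# include MA terms, so an ARMA(1,1) mean cannot be specified directly in arch_model.
# Alternatives are an AR(1) mean (mean='AR', lags=1), or fitting the ARMA model
# separately and passing its residuals to a zero-mean GARCH (a two-step approximation).
# Because the ARMA effects in daily index returns are very small, we fit a
# constant-mean GARCH(1,1) directly on the returns here for clarity.
# vol='Garch' with p=1, q=1 specifies the GARCH(1,1) variance equation.
model_spec = arch_model(returns, mean='Constant', vol='Garch', p=1, q=1, dist='Normal')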

# Fit the model
model_fit = model_spec.fit(update_freq=5)
print(model_fit.summary())

# Check standardized residuals
std_resid = model_fit.resid / model_fit.conditional_volatility
std_resid_sq = std_resid**2
plot_acf(std_resid_sq, lags=20, title='ACF of Squared Standardized Residuals')
plt.show()
# This ACF plot should have no significant spikes.

# --- Phase 4: Forecasting ---
# Forecast volatility for the next 30 trading days
forecast_horizon = 30
forecast = model_fit.forecast(horizon=forecast_horizon)

# forecast.variance holds one column per horizon step; the last row contains
# the forecasts made from the final observation in the sample
future_variance = forecast.variance.iloc[-1]
# The output is variance, so take the square root for daily volatility (%)
future_volatility = np.sqrt(future_variance)

# Annualize over the whole horizon: average the daily variance forecasts
# (variances, not volatilities, add across days), scale by 252, then take the root
annualized_forecast = np.sqrt(future_variance.mean() * 252)

# Build business-day dates for the forecast steps (exchange holidays are ignored)
future_dates = pd.bdate_range(start=returns.index[-1] + pd.offsets.BDay(1), periods=forecast_horizon)

# --- Visualize the Forecast ---
plt.figure(figsize=(12, 6))
plt.plot(returns.index[-100:], model_fit.conditional_volatility.iloc[-100:], label='Fitted Volatility')
plt.plot(future_dates, future_volatility.values, label='Forecasted Volatility', color='red')
plt.title('GARCH(1,1) Volatility Forecast for S&P 500')
plt.ylabel('Daily Volatility (%)')
plt.legend()
plt.show()

print("\n--- Volatility Forecast ---")
print("The forecasted annualized volatility over the next 30 trading days is:")
print(f"{annualized_forecast:.2f}%")
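
It can help to see what a multi-step GARCH(1,1) forecast does under the hood. Beyond the first step, each expected variance is `omega + (alpha + beta)` times the previous one, so the path decays geometrically toward the long-run variance. The sketch below uses made-up parameter values (not estimates from the S&P 500 fit) and a hypothetical helper name.

```python
import math

def garch11_variance_path(omega, alpha, beta, next_variance, horizon):
    """Variance forecasts for steps 1..horizon.

    Step 1 is the one-step-ahead variance; afterwards
    E[sigma2_{t+h}] = omega + (alpha + beta) * E[sigma2_{t+h-1}].
    """
    path = [next_variance]
    for _ in range(horizon - 1):
        path.append(omega + (alpha + beta) * path[-1])
    return path

# Hypothetical parameters for daily percent returns
omega, alpha, beta = 0.02, 0.10, 0.88
long_run_variance = omega / (1 - alpha - beta)

# Start from a turbulent day: next-day variance of 4.0 (daily volatility of 2%)
path = garch11_variance_path(omega, alpha, beta, next_variance=4.0, horizon=30)

# Horizon volatility: average the daily variances, annualize, then take the root
avg_daily_variance = sum(path) / len(path)
annualized_vol = math.sqrt(252 * avg_daily_variance)

print(f"Long-run daily variance: {long_run_variance:.2f}")
print(f"Day 1 variance: {path[0]:.2f}, day 30 variance: {path[-1]:.2f}")
print(f"Annualized volatility over the horizon: {annualized_vol:.2f}%")
```

The closer `alpha + beta` is to 1, the more slowly the forecast reverts to the long-run level, which is why volatility shocks in equity indices tend to persist for weeks.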

Part 6: Practical Interpretation and Use Cases

What does our forecast, say an annualized volatility of 18%, actually mean for a practitioner?

For a Risk Manager: It is a direct input into their Value-at-Risk (VaR) models. A higher forecasted volatility means a wider potential distribution of returns, which translates to a larger VaR and may require the firm to reduce its overall risk exposure.
For an Options Trader: Volatility is the only input to an options pricing model like Black-Scholes that cannot be observed directly. This GARCH forecast provides a data-driven estimate of the volatility expected over the next month, which the trader can compare with the implied volatility backed out of market option prices for the same expiry. If the market's implied volatility is significantly higher than this forecast, the trader might conclude that options are overpriced and look to sell volatility.
For a Portfolio Manager: The forecast is used in portfolio optimization. The classic mean-variance optimization requires an estimate of the future covariance matrix of all assets. The GARCH forecast for the market index supplies a forward-looking variance estimate that can feed into this matrix (for example, as the market-factor variance in an index model) and so inform the asset allocation.
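
As a worked example of the risk manager's use case, the sketch below converts an 18% annualized volatility into a one-day parametric VaR. It assumes normally distributed returns with zero mean, 252 trading days per year, and a hypothetical $1,000,000 position. The function names are illustrative only.

```python
import math
from statistics import NormalDist

def daily_vol_from_annual(annual_vol, trading_days=252):
    """De-annualize volatility using the square-root-of-time rule."""
    return annual_vol / math.sqrt(trading_days)

def parametric_var(position_value, daily_vol, confidence=0.99, horizon_days=1):
    """Normal (variance-covariance) VaR, assuming zero mean return."""
    z = NormalDist().inv_cdf(confidence)
    return position_value * z * daily_vol * math.sqrt(horizon_days)

daily_vol = daily_vol_from_annual(0.18)
var_1d = parametric_var(1_000_000, daily_vol)

print(f"Daily volatility: {daily_vol:.4%}")
print(f"1-day 99% VaR on $1,000,000: ${var_1d:,.0f}")
```

Because VaR scales linearly with volatility in this setup, a forecast that rises from 18% to 27% would raise the VaR by half, which is the mechanism behind the risk-reduction decision described above.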

Conclusion of Module 5 and The Path to Multivariate Analysis

You have successfully completed the capstone project for univariate time series analysis. You have taken a raw financial data series and applied a complete, professional workflow to build a sophisticated two-equation model for its mean and variance, culminating in a practical and interpretable forecast.

You now possess the complete ARIMA-GARCH toolkit, which is the foundation upon which much of modern quantitative finance is built. You have moved beyond simple regression to understand the complex dynamics of memory, shocks, and time-varying risk.

The logical next step in our journey is to expand our perspective. Markets are not univariate; they are vast, interconnected systems. In the next module, we will begin our exploration of **Advanced Quant Modeling**, starting with the tools designed to analyze multiple time series simultaneously.

Up Next: Let's Start Module 6: Random Walks and the Efficient Market Hypothesis