Table of Contents
- Components of time series forecasting
- Additive Model:
- Multiplicative Model:
- Moving Averages
- ARIMA Models
- Model Evaluation Metrics: MAE and MSE
- Challenges in Time Series Forecasting
- Time Series Forecasting of NASDAQ Stock Prices Using Python
- Part 1: Loading and Visualizing NASDAQ's Daily Closing Prices
- Installing and Importing Libraries
- Loading NASDAQ Stock Data
- Inspecting the Data
- Plotting Daily Closing Prices
- Part 2: Data Prep and Decomposing Time Series into Components
- Selecting the Original Columns
- Handling Missing Values
- Converting Index to Datetime
- Decomposing the Time Series
- Visualizing the Components
- Part 3: Moving Averages for Stock Data Analysis
- Libraries and Utilities
- Hyperparameter Tuning for SMA
- Visualizing the Best SMA
- Exponential Weighted Moving Average (EWMA)
- Hyperparameter Tuning for WMA
- Visualizing the Best WMA
- Part 4: Time Series Analysis - Stationarity, ACF, and Differencing
- Plotting Autocorrelation Function (ACF)
- Testing for Stationarity
- Differencing the Series
- Testing Stationarity of the Differenced Series
- Part 5: ARIMA Modeling and Diagnostic Test
- Autocorrelation Function (ACF):
- Interpretation of ACF:
- Partial Autocorrelation Function (PACF):
- Differences Between ACF and PACF:
- Autocorrelation Function of Residuals:
- Autocorrelation Function of Squared Residuals:
- Ljung-Box Test:
- Kolmogorov-Smirnov Test:
- Goldfeld-Quandt Test:
- Conclusion
Components of time series forecasting
a. Trend:
The trend component represents the consistent, long-term upward or downward movement in a time series. In the financial context, a trend can be seen in the stock market where an upward trend in a company's share price might reflect steady growth, while a downward trend could indicate underlying problems.
b. Seasonality:
Seasonality refers to the regular fluctuations that occur at consistent intervals, such as daily, weekly, monthly, or even annually. For example, in finance, retail companies might see a recurring increase in sales during the holiday season every year, which then dips after the holidays. Analyzing these patterns can be vital for business planning.
c. Cyclical Patterns:
Cyclical patterns describe long-term oscillations or waves that are not of a fixed and regular period. These often align with wider economic cycles and may last for several years. The cyclical nature of the housing market, with its periods of boom and bust, often corresponds with broader economic expansion and recession, affecting prices accordingly.
d. Irregular (or Random) Component:
The irregular component encompasses the unpredictability or noise in a time series that is neither systematic nor predictable. Financially, this might manifest as sudden market shocks caused by unexpected events such as geopolitical changes or natural disasters. For example, an unforeseen alteration in trade policy might lead to erratic fluctuations in currency exchange rates.
Time series data can often be described using either a multiplicative model or an additive model. These models provide different ways of combining the primary components of time series data: trend, seasonality, cyclical patterns, and the irregular component.
Additive Model:
In an additive model, the different components are simply added together. The model assumes that the effects of the individual components are independent and combine linearly. The general form is:
Yt = Tt + St + Ct + It
Here, Yt is the observed value at time t, Tt is the trend, St is the seasonal component, Ct is the cyclical component, and It is the irregular component.
Example: If a company's quarterly sales show a steady increase (trend) plus a consistent pattern of seasonal fluctuations, and there is no multiplicative growth or exponential trend, the additive model may be appropriate.
Multiplicative Model:
In a multiplicative model, the components are multiplied together. This model assumes that the effects of individual factors interact multiplicatively, meaning that the impact of one component scales with the level of the others. The general form is:
Yt = Tt × St × Ct × It
Example: If a company's sales are growing exponentially (e.g., doubling every year) and also have a seasonal component that grows proportionally with the trend, a multiplicative model may be suitable.
The choice between additive and multiplicative models depends on the nature of the time series data. The additive model is best suited for series where the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the series. The multiplicative model is suitable when the magnitude of the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the series, as illustrated in the sketch below.
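To make the distinction concrete, the following sketch decomposes a synthetic monthly series under both models using statsmodels. The data and the period of 12 months are illustrative assumptions, not part of the NASDAQ analysis later in this article.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: a rising trend with a seasonal swing that grows with the level
idx = pd.date_range('2015-01-01', periods=96, freq='MS')
trend = np.linspace(100, 300, 96)
seasonal = 1 + 0.1 * np.sin(2 * np.pi * np.arange(96) / 12)
sales = pd.Series(trend * seasonal, index=idx)

# Additive: components are summed; assumes seasonal swings of roughly constant size
additive = seasonal_decompose(sales, model='additive', period=12)
# Multiplicative: components are multiplied; assumes swings proportional to the level
multiplicative = seasonal_decompose(sales, model='multiplicative', period=12)

additive.plot()
multiplicative.plot()
plt.show()
Comparing the residual panels of the two figures is one practical way to judge which form fits a given series better.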
Moving Averages
Simple Moving Average (SMA)
The SMA calculates the average of the last n observations. By smoothing out noise and short-term fluctuations, the SMA can provide a clearer view of the underlying trend.
Pros: Easy to understand and implement.
Cons: Equal weighting may over-smooth the series, losing valuable information.
Weighted Moving Average (WMA)
The WMA assigns different weights to different observations, typically giving more weight to recent observations.
Pros: More responsive to changes.
Cons: Choosing appropriate weights can be challenging.
Exponential Smoothing (ES)
ES applies exponential weights to past observations. The closer the observation is to the present, the higher its weight.
Parameter: Smoothing factor α (0 < α < 1) controls the decay of weights.
Pros: Effective for time series without trend or seasonality.
Cons: Cannot handle trends or seasonality.
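As a quick illustration of the smoothing factor α described above (a minimal sketch using a made-up price series, not data from this tutorial), exponential smoothing is available directly through pandas:
import pandas as pd

prices = pd.Series([100.0, 101.5, 99.8, 102.3, 103.1, 104.0])

# alpha close to 1: weights decay quickly, so the smoothed series tracks recent prices closely
fast = prices.ewm(alpha=0.8).mean()
# alpha close to 0: weights decay slowly, giving heavier smoothing
slow = prices.ewm(alpha=0.2).mean()
print(fast.iloc[-1], slow.iloc[-1])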
ARIMA Models
AR (Autoregressive) Component
The AR component models the relationship between an observation and its previous observations.
Pros: Captures direct dependencies in the time series.
I (Integrated) Component
The I component represents the differencing needed to make the series stationary.
Action: Differencing the series one or more times until stationarity is achieved.
Pros: Helps in stabilizing the mean of the time series.
MA (Moving Average) Component
The MA component models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
Pros: Captures error patterns and noise structure.
ARIMA Model
ARIMA models combine AR, I, and MA components to capture a wide range of time series patterns.
Notation: ARIMA(p, d, q), where p is the AR order, d is the differencing order, and q is the MA order (a short example follows below).
Pros: Versatility in modeling various time series structures.
Cons: May require extensive tuning and validation.
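To make the ARIMA(p, d, q) notation concrete, here is a minimal sketch (on a simulated random-walk series, not the NASDAQ data used later) showing how the order maps onto the statsmodels API:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A simulated random walk: differencing once (d=1) makes it stationary
series = pd.Series(np.random.default_rng(0).normal(size=200)).cumsum()

# ARIMA(1, 1, 1): one AR lag (p), one difference (d), one MA lag (q)
model_fit = ARIMA(series, order=(1, 1, 1)).fit()
print(model_fit.summary())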
Model Evaluation Metrics: MAE and MSE
In the realm of time series forecasting, the metrics chosen to evaluate the accuracy and precision of models are vital. Among the plethora of available metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE) stand out for their widespread application and interpretability.
Mean Absolute Error (MAE) calculates the average of the absolute differences between the predicted values and the actual observed values. The formula for MAE is given by:
MAE = (1/n) Σ |yi − ŷi|
where yi is the actual value, ŷi is the corresponding prediction, and n is the number of observations.
MAE is in the same unit as the data, offering direct interpretation. It gives equal weight to all errors, regardless of their size, and is commonly used when all errors are considered equally significant.
On the other hand, Mean Squared Error (MSE) computes the average of the squared differences between the predicted values and the actual observed values. The formula for MSE is expressed as:
MSE = (1/n) Σ (yi − ŷi)²
MSE is sensitive to larger errors by emphasizing them through squaring the differences. The unit of MSE is the square of the data's unit, so it's often used in conjunction with its square root (Root Mean Square Error) to make it directly comparable to the data.
Both MAE and MSE are valuable depending on the specific needs of the forecasting task. While MAE's strength lies in its simplicity and equal treatment of all errors, MSE's focus on penalizing larger discrepancies makes it suitable when significant deviations from actual values have substantial consequences. These metrics are not mutually exclusive and can be used together, providing a multifaceted view of a model's predictive performance, aligning with the goals and constraints of the forecasting task at hand. By using these metrics judiciously, forecasters can gain a robust understanding of how well a model is performing and where improvements may be made.
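As a brief illustration (a minimal sketch using made-up numbers rather than results from the NASDAQ analysis), both metrics, along with RMSE, can be computed with scikit-learn:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([99.0, 103.5, 100.0, 106.0])

mae = mean_absolute_error(actual, predicted)  # average absolute error, in the data's units
mse = mean_squared_error(actual, predicted)   # squared errors, penalizing large misses
rmse = np.sqrt(mse)                           # square root brings MSE back to the data's units
print(mae, mse, rmse)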
Challenges in Time Series Forecasting
Time series forecasting is a complex and nuanced field, essential across various industries, from finance to healthcare. The aim is to predict future values of a sequence of observations recorded at regular time intervals. However, several challenges can hinder the effectiveness and precision of time series forecasting models:
Non-Stationarity
A stationary time series is one whose statistical properties, such as mean and variance, remain constant over time. Most statistical methods assume stationarity, but in practice many time series are non-stationary. Non-stationarity can arise due to trends, seasonality, or structural breaks in the series. Dealing with non-stationarity often requires differencing the series or transforming it to stabilize the mean and variance. Ignoring non-stationarity can lead to spurious relationships and inaccurate forecasts.
Missing Values
In real-world scenarios, it's not uncommon for time series data to have missing values. Missing values can occur randomly or systematically, and they present a significant challenge in forecasting. Handling missing values requires careful consideration of the nature and pattern of the missingness. Simple methods like mean imputation might not be sufficient, as they can distort the time dependencies in the series. More advanced techniques, such as interpolation or state-space models, might be needed to preserve the time series structure.
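For illustration (a small sketch on a made-up daily series, not the NASDAQ data), pandas provides interpolation that respects the time index and preserves the series' structure better than mean imputation:
import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-01', periods=6, freq='D')
series = pd.Series([10.0, np.nan, 12.0, np.nan, np.nan, 15.0], index=idx)

# Time-aware linear interpolation uses the datetime index to fill the gaps
filled = series.interpolate(method='time')
print(filled)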
Outliers
Outliers are observations that deviate significantly from the other values in the data. In time series forecasting, outliers can be due to genuine extreme values or errors in data recording. They can dramatically affect the estimation of the model parameters and lead to misleading forecasts. Outlier detection and treatment are vital in time series analysis. Sophisticated methods, such as robust statistical techniques, can minimize the influence of outliers, allowing the model to capture the true underlying patterns in the data.
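One simple approach, sketched below on synthetic data (the 20-day window and 3-standard-deviation threshold are illustrative assumptions), is to flag observations that deviate strongly from a rolling mean:
import numpy as np
import pandas as pd

prices = pd.Series(np.random.default_rng(0).normal(100, 2, 200))
prices.iloc[50] = 130  # inject an artificial outlier

rolling_mean = prices.rolling(window=20).mean()
rolling_std = prices.rolling(window=20).std()
z_score = (prices - rolling_mean) / rolling_std

# Points more than 3 rolling standard deviations from the rolling mean are flagged
outliers = prices[z_score.abs() > 3]
print(outliers)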
Multicollinearity
Multicollinearity refers to the high correlation among predictor variables in a model. In time series forecasting, multicollinearity can arise when using lagged values of the time series or other correlated time series as predictors. The presence of multicollinearity can make it difficult to identify the individual effect of each predictor on the response variable. It can lead to unstable estimates of model parameters and reduce the interpretability of the model. Addressing multicollinearity might involve variable selection techniques, regularized regression methods, or dimensionality reduction techniques like Principal Component Analysis (PCA).
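One common diagnostic, sketched below with hypothetical lagged predictors built from a simulated series (not code from the NASDAQ analysis), is the variance inflation factor (VIF) from statsmodels:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=300)).cumsum()

# Lagged copies of the same series are naturally highly correlated with each other
X = pd.concat({'lag1': y.shift(1), 'lag2': y.shift(2), 'lag3': y.shift(3)}, axis=1).dropna()

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values well above ~10 are commonly read as a sign of multicollinearity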
Time Series Forecasting of NASDAQ Stock Prices Using Python
The NASDAQ Composite Index, representing over 3,000 companies listed on the NASDAQ stock exchange, serves as a significant indicator of the technology and internet sectors. This diverse blend of companies from various industries makes the NASDAQ Composite an intriguing subject for time series forecasting. By tracking the movements of the NASDAQ Composite over time, investors and analysts can gain valuable insights into broader market trends, sector performances, and economic indicators. Leveraging Python's extensive data analysis and modeling capabilities, we can apply an array of time series forecasting techniques to predict the index's future movements. This predictive analysis can guide investment strategies, portfolio management, and economic forecasting, empowering stakeholders to navigate the complex financial landscape with data-driven insights.
Part 1: Loading and Visualizing NASDAQ's Daily Closing Prices
In this first part, we'll walk through the process of loading and visualizing the daily closing prices for the NASDAQ index. This guide is tailored to provide beginners with clear, step-by-step instructions, so let's dive in.
1.1 Installing and Importing Libraries
Before working with the data, we must install and import the required libraries:
Execute !pip install openbb to install the openbb package, a specialized library for handling stock data.
!pip install openbb
Importing Libraries: Import the necessary modules with:
from openbb_terminal.sdk import openbb
import matplotlib.pyplot as plt
1.2 Loading NASDAQ Stock Data
We'll fetch the NASDAQ daily stock data using openbb's load function:
df_daily = openbb.stocks.load(symbol='ndaq')
1.3 Inspecting the Data
Quickly inspecting the data ensures that it's loaded correctly:
First Five Rows: df_daily.head() provides a snapshot of the beginning of the dataset.
Last Five Rows: df_daily.tail() offers a view of the end of the dataset.
df_daily.head()
df_daily.tail()
1.4 Plotting Daily Closing Prices
Next, we'll create a visualization of the daily closing prices:
df_daily['Close'].plot(figsize=(15,6), title='NASDAQ Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.show()
This code performs several tasks:
Plotting the 'Close' Column: Generates a line plot of the 'Close' column, with specific figure size and title.
Setting the Axis Labels: Adds descriptive labels to the x and y axes.
Displaying the Plot: Renders the plot, showcasing NASDAQ's daily closing prices.
Part 2: Data Prep and Decomposing Time Series into Components
Building on Part 1 of our series, where we loaded and visualized NASDAQ's daily closing prices, Part 2 will guide you through the data prep and decomposing the time series into its constituent components. Let's explore the details:
2.1 Selecting the Original Columns
The dataset may contain additional columns that we don't need. By defining the original columns of interest:
original_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Dividends', 'Stock Splits']
and selecting only those from the DataFrame:
df_daily = df_daily[original_columns]
we ensure that we're working with the necessary data.
2.2 Handling Missing Values
Analyzing the data requires clean, complete information. We'll look for missing values with:
missing_values = df_daily.isnull().sum()
print('Missing values:', missing_values)
This part of the code identifies and prints the count of missing values for each column, enabling further handling if necessary. Here, the output shows no missing values in the dataset.
2.3 Converting Index to Datetime
Time series analysis requires that the index be in datetime format. We'll achieve this with:
import pandas as pd
df_daily.index = pd.to_datetime(df_daily.index)
2.4 Decomposing the Time Series
Time series data can be broken down into observed, trend, seasonal, and residual components. The code assumes the series has already been decomposed into an object named seasonal_result (a sketch of how this can be produced follows the extraction below), and we extract each component as:
observed = seasonal_result.observed
trend = seasonal_result.trend
seasonal = seasonal_result.seasonal
residual = seasonal_result.resid
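For completeness, seasonal_result could be produced with statsmodels' seasonal_decompose. This is a minimal sketch; the additive model and the period of roughly 252 trading days per year are illustrative assumptions, not choices stated in the original code:
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed decomposition step producing the seasonal_result used above
seasonal_result = seasonal_decompose(df_daily['Close'], model='additive', period=252)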
2.5 Visualizing the Components
To gain insights into these components, we'll create subplots and visualize each of them:
fig, axes = plt.subplots(4, 1, figsize=(15, 12))
fig.subplots_adjust(hspace=0.4)
axes[0].plot(observed)
axes[0].set_ylabel('Observed')
axes[0].set_title('Observed Component')
axes[1].plot(trend)
axes[1].set_ylabel('Trend')
axes[1].set_title('Trend Component')
axes[2].plot(seasonal)
axes[2].set_ylabel('Seasonal')
axes[2].set_title('Seasonal Component')
axes[3].plot(residual)
axes[3].set_ylabel('Residual')
axes[3].set_title('Residual Component')
plt.show()
These plots allow for a detailed examination of the time series' underlying structure. In this case, the output suggests the data has no clear seasonality and that the series is not stationary.
Part 3: Moving Averages for Stock Data Analysis
In this third segment of our series, we delve into one of the foundational techniques of time series analysis for stock prices - moving averages. We'll explore both Simple Moving Averages (SMA) and Weighted Moving Averages (WMA) using exponential weights, alongside hyperparameter tuning to select the best window sizes.
3.1 Libraries and Utilities
We have our crucial libraries and functions imported:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import pandas as pd
3.2 Hyperparameter Tuning for SMA
A Simple Moving Average (SMA) is the mean of a stock's prices over a certain number of days. The idea is to smooth out price fluctuations and highlight long-term trends. We iterate over different window sizes to find the best SMA for our data:
windows = [5, 10, 15, 20, 25, 30]
mse_sma_list, mae_sma_list = [], []
for window in windows:
    df_daily[f'SMA_{window}'] = df_daily['Close'].rolling(window=window).mean()
    mse_sma = mean_squared_error(df_daily['Close'][window-1:], df_daily[f'SMA_{window}'][window-1:])
    mae_sma = mean_absolute_error(df_daily['Close'][window-1:], df_daily[f'SMA_{window}'][window-1:])
    mse_sma_list.append(mse_sma)
    mae_sma_list.append(mae_sma)
print("SMA Hyperparameter Tuning Results:")
for win, mse, mae in zip(windows, mse_sma_list, mae_sma_list):
    print(f"Window Size: {win} -> MSE: {mse}, MAE: {mae}")
After running the above code, we get the Mean Squared Error (MSE) and Mean Absolute Error (MAE) for each window size. The output shows that a window size of 5 has the lowest MSE.
3.3 Visualizing the Best SMA
Using the results from the above step, we can plot the actual closing prices alongside the 5-day SMA:
plt.figure(figsize=(15,6))
plt.plot(df_daily['Close'], label='Actual Close Price', color='yellow')
plt.plot(df_daily['SMA_5'], label='5-day SMA', color='red')
plt.title('NASDAQ Daily Closing Prices with 5-day SMA')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.show()
3.4 Exponential Weighted Moving Average (EWMA)
EWMA provides more weight to recent prices, which can make it more reactive to price changes than the SMA:
def ewma(weights):
    def calc(x):
        return (weights / weights.sum() * x).sum()
    return calc
The function above returns a callable that computes the weighted average of a rolling window using the given exponential weights.
3.5 Hyperparameter Tuning for WMA
Similar to SMA, we conduct hyperparameter tuning for WMA, iterating through our window sizes and using exponential weights:
mse_wma_list, mae_wma_list = [], []
for window in windows:
    weights = np.exp(np.linspace(-1, 0, window))
    df_daily[f'EWMA_{window}'] = df_daily['Close'].rolling(window=window).apply(ewma(weights))
    mse_wma = mean_squared_error(df_daily['Close'][window-1:], df_daily[f'EWMA_{window}'][window-1:])
    mae_wma = mean_absolute_error(df_daily['Close'][window-1:], df_daily[f'EWMA_{window}'][window-1:])
    mse_wma_list.append(mse_wma)
    mae_wma_list.append(mae_wma)
print("WMA (Exponential Weights) Hyperparameter Tuning Results:")
for win, mse, mae in zip(windows, mse_wma_list, mae_wma_list):
    print(f"Window Size: {win} -> MSE: {mse}, MAE: {mae}")
After running the above code, we get the Mean Squared Error (MSE) and Mean Absolute Error (MAE) for each window size. The output shows that a window size of 5 again has the lowest MSE, so we will plot the actual prices against the forecast for that window size.
3.6 Visualizing the Best WMA
After determining the best WMA with the lowest error, we can plot it against the actual stock prices:
plt.figure(figsize=(15,6))
plt.plot(df_daily['Close'], label='Actual Close Price', color='yellow')
plt.plot(df_daily['EWMA_5'], label='5-day WMA (Exponential Weights)', color='red')
plt.title('NASDAQ Daily Closing Prices with 5-day WMA')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.show()
Part 4: Time Series Analysis - Stationarity, ACF, and Differencing
This fourth segment of the series dives into time series analysis, focusing on the concepts of stationarity, autocorrelation function (ACF), and differencing. We're dealing with the daily closing prices of NASDAQ stocks, and we will use various statistical tools and visualizations to understand and transform the data.
4.1 Plotting Autocorrelation Function (ACF)
ACF helps us understand the linear relationship of a time series with its lagged values. We'll plot the ACF for both the original series and its first difference:
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
# Assuming df_daily['Close'] is your time series data
series = df_daily['Close']
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(series, lags=40, title='Correlogram at Level', ax=ax)
plt.show()
# Compute 1st difference of the series
first_difference = series.diff().dropna()
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(first_difference, lags=40, title='Correlogram at 1st Difference', ax=ax)
plt.show()
The first plot is the ACF of the original series, while the second plot is the ACF of the first difference of the series. Differencing reduces the autocorrelation, making the data more suitable for models like ARIMA. The plots indicate that the series at level is not stationary, while the first-differenced series is stationary.
4.2 Testing for Stationarity
A common assumption in many time series models is that the data is stationary. The Augmented Dickey-Fuller (ADF) test is employed to test the null hypothesis that the time series has a unit root, meaning it is non-stationary:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
test_stationarity(df_daily['Close'])
In the given example, the p-value is greater than 0.05, which means we fail to reject the null hypothesis, and the series is not stationary.
4.3 Differencing the Series
Differencing is a transformation applied to make a time series stationary. It's done by subtracting the previous observation from the current observation:
df_daily['Close_diff'] = df_daily['Close'].diff()
df_daily = df_daily.dropna()
# Plotting the differenced series
df_daily['Close_diff'].plot(figsize=(15,6), title='NASDAQ Daily Closing Prices (Differenced)')
plt.xlabel('Date')
plt.ylabel('Differenced Close Price')
This plot shows the differenced series, which helps to remove the trend and make the series more stationary.
4.4 Testing Stationarity of the Differenced Series
We test the differenced series for stationarity using the ADF test again:
from statsmodels.tsa.stattools import adfuller
result = adfuller(df_daily['Close_diff'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
The p-value is now effectively zero, indicating that the differenced series is stationary and confirming that the transformation was successful.
Part 5: ARIMA Modeling and Diagnostic Test
5.1 Finding the Best ARIMA Hyperparameters
The ARIMA model has three hyperparameters (p, d, q) that need to be fine-tuned for the best performance. To find the best parameters, we can use a grid search over different values of these hyperparameters. The code snippet below fits an ARIMA model for various values of p, d, and q, and finds the combination with the lowest mean squared error (MSE):
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
def evaluate_arima_model(df, arima_order):
    # Hold out the last 20% of the series as a test set
    split_idx = int(len(df) * 0.8)
    train, test = df[0:split_idx], df[split_idx:]
    history = [x for x in train]
    predictions = []
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit()
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        history.append(test.iloc[t])
    mse = mean_squared_error(test, predictions)
    return mse
p_values = range(0, 3)
d_values = range(0, 3)
q_values = range(0, 3)
best_mse = float('inf')
best_order = None
for p in p_values:
    for d in d_values:
        for q in q_values:
            order = (p, d, q)
            try:
                mse = evaluate_arima_model(df_daily['Close'], order)
                if mse < best_mse:
                    best_mse = mse
                    best_order = order
                print(f'ARIMA Order {order} - MSE: {mse}')
            except:
                continue
print(f'Best ARIMA Order: {best_order} - MSE: {best_mse}')
The best ARIMA order found is (2, 1, 2) with an MSE of 0.8671641605620587.
5.2 Forecasting with the Best ARIMA Model
Once we have the best hyperparameters, we can forecast the test data using the ARIMA(2,1,2) model and plot the actual vs. predicted closing prices:
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
split_idx = int(len(df_daily['Close']) * 0.8)
train, test = df_daily['Close'][0:split_idx], df_daily['Close'][split_idx:]
history = [x for x in train]
predictions = []
for t in range(len(test)):
    model = ARIMA(history, order=(2, 1, 2))
    model_fit = model.fit()
    yhat = model_fit.forecast()[0]
    predictions.append(yhat)
    history.append(test.iloc[t])
predictions_df = pd.Series(predictions, index=test.index)
plt.figure(figsize=(12,6))
plt.plot(test, label='Actual')
plt.plot(predictions_df, label='Predicted', linestyle='dashed')
plt.title('ARIMA(2, 1, 2) Forecast vs Actual')
plt.xlabel('Time')
plt.ylabel('Close Price')
plt.legend()
plt.show()
This graph shows the actual test data alongside the forecast line for the ARIMA(2, 1, 2) model.
5.3 Diagnosing the Model
Diagnosing the model helps us ensure that the assumptions underlying the ARIMA model hold. We use various techniques:
Autocorrelation Function (ACF):
The ACF measures the linear relationship between an observation at time t and the observations at previous times. It provides the correlations between the series and its lags, i.e., the correlations between Yt and Yt−k for k=1,2,….
Interpretation of ACF:
Random Pattern: If the ACF shows that all the autocorrelations for lag k>0 are close to zero, then the series is likely to be white noise.
Gradual Decline: If the ACF shows a gradual decline, it may suggest a trend in the data or an Autoregressive (AR) model might be suitable.
Sharp Drop-off: If there is a sharp drop after lag k, it may suggest a Moving Average (MA) process of order k.
Seasonality: Regular spikes at specific intervals may indicate seasonality.
Partial Autocorrelation Function (PACF):
While ACF considers correlations with all previous lags, PACF measures the correlation between observations at two points in time while removing the effects of other time lags. In other words, PACF tells you the correlation between Yt and Yt−k that is not accounted for by lags 1,2,…,k−1.
Interpretation of PACF:
Sharp Drop-off: If the PACF has a sharp drop after lag k, leaving others close to zero, it suggests an Autoregressive model of order k.
Gradual Decline: A gradual decline may suggest a Moving Average process.
Differences Between ACF and PACF:
ACF: Considers the correlation between two points in time, including the indirect effects of the intervening data points.
PACF: Considers the direct relationship between two points in time, excluding the influence of other data points.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# 'residuals' is assumed to be the residual series of the fitted ARIMA(2, 1, 2) model above, e.g.:
residuals = pd.Series(model_fit.resid)

fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
plot_acf(residuals, lags=range(1, 40), ax=ax1)
ax1.set_title('Autocorrelation Function')
ax2 = fig.add_subplot(212)
plot_pacf(residuals, lags=range(1, 40), ax=ax2)
ax2.set_title('Partial Autocorrelation Function')
plt.tight_layout()
plt.show()
Since all the spikes are close to zero, the residuals show no significant autocorrelation or partial autocorrelation. This is one of the key diagnostic checks for a time series model: the residuals should behave like white noise, with no remaining autocorrelation or partial autocorrelation.
Autocorrelation Function of Residuals:
Identifying Model Mis-Specification: If the residuals from a model exhibit significant autocorrelation, it indicates that the model has not captured all of the systematic information in the data. This might suggest that the model is mis-specified and needs further tuning.
Randomness Check: If most of the spikes in the ACF of residuals are around zero, it suggests that the residuals are behaving like white noise, meaning that they are random and uncorrelated. This is generally a good sign as it shows that the model has captured the underlying pattern in the data.
Detecting Seasonality: Non-zero spikes at regular intervals might indicate seasonality that has not been captured by the model.
Autocorrelation Function of Squared Residuals:
Analyzing the autocorrelation of squared residuals helps in detecting nonlinearity and heteroscedasticity (i.e., changing variance over time).
Identifying Nonlinearity: If the ACF of squared residuals shows significant spikes, it might indicate nonlinear patterns in the data that the linear model has failed to capture.
Detecting Heteroscedasticity: Significant autocorrelation in squared residuals may also reveal the presence of heteroscedasticity. Heteroscedasticity refers to the situation where the variance of the residuals changes over time. This is a violation of one of the assumptions of linear time series models, and if detected, it would suggest that a model accommodating changing variance (such as GARCH) might be more appropriate.
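A quick way to run this check (a short sketch reusing the residuals series assumed earlier; it is not part of the original diagnostics code) is to plot the ACF of the squared residuals:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Significant spikes here would point to heteroscedasticity (e.g., ARCH/GARCH effects)
plot_acf(residuals ** 2, lags=40, title='ACF of Squared Residuals')
plt.show()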
Ljung-Box Test:
The Ljung-Box test is a statistical test often used to check whether the residuals of a time series model exhibit autocorrelation. Compared with inspecting the plots above, it is an objective approach that gives a clear, quantitative answer. It's a way to ensure that the model is capturing the underlying patterns in the data and that the residuals are behaving like white noise, which is what you would expect if the model fits well.
from statsmodels.stats.diagnostic import acorr_ljungbox
lb_test = acorr_ljungbox(residuals, lags=[10])
print("Ljung-Box test:", lb_test)
Here's an interpretation of the result:
lb_stat: The test statistic value is 1.307434. The test statistic is computed based on the sum of squared autocorrelations of the residuals. The closer this value is to zero, the more evidence there is of no autocorrelation.
lb_pvalue: The p-value is 0.99942. A p-value close to 1 indicates that the null hypothesis of no autocorrelations among the residuals cannot be rejected at conventional significance levels (such as 0.05). In other words, the p-value is providing strong evidence that the residuals are not autocorrelated.
Kolmogorov-Smirnov Test:
The Kolmogorov-Smirnov (K-S) test is a non-parametric statistical test that is used to compare a sample distribution with a reference probability distribution, usually the standard normal distribution. It can also be used to compare two empirical distributions.
from scipy.stats import kstest
ks_statistic, p_value = kstest(residuals, 'norm')
print("KS Statistic:", ks_statistic)
print("P-value:", p_value)
Here's an interpretation of the result:
KS Statistic: The K-S statistic value is 0.0910091860270345. This value represents the maximum difference between the empirical cumulative distribution function (CDF) of the sample and the CDF of the reference distribution. In the context of testing normality, a larger value of the K-S statistic would typically indicate a greater divergence from the standard normal distribution.
P-value: The p-value is 1.051782801518041 × 10^-5. This extremely low p-value is indicative of a rejection of the null hypothesis at common significance levels (such as 0.01, 0.05, or 0.10). The null hypothesis for the K-S test is that the sample is drawn from the standard normal distribution.
Goldfeld-Quandt Test:
The Goldfeld-Quandt test is used to check the homoscedasticity of residuals in a regression model, i.e., whether the variance of the errors is constant across levels of the explanatory variables.
from statsmodels.stats.diagnostic import het_goldfeldquandt

# model_fit is assumed to be the fitted ARIMA(2, 1, 2) model from the forecasting loop above
gq_test = het_goldfeldquandt(residuals, model_fit.model.endog)
print("Goldfeld-Quandt test:", gq_test)
Here's an interpretation of the result:
Test Statistics: 0.19378200468162868. This value is used to assess the evidence against the null hypothesis of homoscedasticity (constant variance).
P-value: 0.9999999999999999. A very high p-value indicates that there is not enough statistical evidence to reject the null hypothesis that the variance of the residuals is constant.
Conclusion
In evaluating the ARIMA(2,1,2) model, a series of diagnostic tests were conducted to ensure the robustness and validity of the model. The Ljung-Box test confirmed that there was no evidence of autocorrelation in the residuals, indicating that the model has captured the underlying temporal structure well. The Kolmogorov-Smirnov test indicated a deviation from normality in the residuals, while the Goldfeld-Quandt test confirmed homoscedasticity, showing that the variance of the errors is constant across the series. Together, these diagnostics paint a mostly favorable picture of the ARIMA(2,1,2) model but also hint at possible areas for further refinement, such as exploring transformations to achieve normality. Overall, the model seems well specified for the data at hand, providing a useful tool for forecasting and insight into the underlying dynamics of the time series.
For a deeper understanding of advanced, state-of-the-art machine learning algorithms for forecasting, you may refer to this article on stock price forecasting using machine learning. Predicting stock prices is a complex task that often requires a multifaceted approach. Additionally, you might find value in exploring sentiment analysis techniques specifically tailored for stock prediction, as detailed in this guide on performing sentiment analysis in Python. Modern machine learning algorithms and sentiment analysis techniques have profoundly transformed financial forecasting, adding depth and accuracy to predictions and enabling more agile, informed decision-making in fluctuating markets. The resources and insights shared in this blog serve as valuable starting points for those keen to explore and harness the power of modern forecasting tools.
The full working code for the examples above can be found in the PyFi GitHub Repo.
Written by Numan Yaqoob, PHD candidate