Pearson’s correlation coefficient, often denoted as ‘r’, is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. In finance, these variables are typically the price series of two different assets (e.g., the closing prices of Stock A and Stock B, or a stock and a market index) over a specified N-period window.
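As a minimal sketch of the definition, the snippet below computes r as the covariance of two series divided by the product of their standard deviations, then checks the result against NumPy's np.corrcoef. The function name pearson_r and the sample prices are illustrative assumptions, not real market data or part of any library.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up closing prices for two hypothetical assets (illustration only)
prices_a = [100.0, 101.5, 103.0, 102.0, 104.5, 106.0]
prices_b = [50.0, 50.4, 51.1, 50.9, 51.8, 52.5]

r_manual = pearson_r(prices_a, prices_b)
r_numpy = float(np.corrcoef(prices_a, prices_b)[0, 1])
print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

Both values should agree to within floating-point precision, since np.corrcoef implements the same definition.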
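The value of Pearson’s ‘r’ ranges from -1 to +1: a value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.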
It’s crucial to remember that correlation measures linear association. Two variables could have a strong non-linear relationship but still have a correlation coefficient close to zero. Also, correlation does not imply causation.
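The point about non-linear relationships can be illustrated with a small synthetic example. In the sketch below, y is fully determined by x through y = x^2, yet the correlation is close to zero because the relationship is not linear. A linear relationship on the same x values is shown for comparison. All values are synthetic and chosen only for illustration.

```python
import numpy as np

# Symmetric x values around zero; y depends on x exactly, but not linearly
x = np.linspace(-1.0, 1.0, 101)
y_quadratic = x ** 2
y_linear = 3.0 * x + 2.0

r_quadratic = float(np.corrcoef(x, y_quadratic)[0, 1])
r_linear = float(np.corrcoef(x, y_linear)[0, 1])

print(f"r for y = x^2   : {r_quadratic:.6f}")
print(f"r for y = 3x + 2: {r_linear:.6f}")
```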
Usage: Understanding the correlation between different assets is fundamental in finance for several reasons, including diversification, pairs trading, and overall risk management.
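One reason is diversification. As a rough illustration, the sketch below applies the standard two-asset portfolio variance formula, variance = (w1*s1)^2 + (w2*s2)^2 + 2*w1*w2*rho*s1*s2, to show how portfolio volatility falls as the correlation rho between the assets drops. The function name two_asset_volatility, the equal weights, and the 20% volatilities are assumptions chosen for illustration.

```python
import numpy as np

def two_asset_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and (1 - w1)."""
    w2 = 1.0 - w1
    variance = (w1 * sigma1) ** 2 + (w2 * sigma2) ** 2 + 2.0 * w1 * w2 * rho * sigma1 * sigma2
    return float(np.sqrt(variance))

# Hypothetical inputs: equal weights, both assets with 20% volatility
for rho in (1.0, 0.5, 0.0, -0.5):
    vol = two_asset_volatility(0.5, 0.20, 0.20, rho)
    print(f"rho = {rho:+.1f} -> portfolio volatility = {vol:.4f}")
```

With perfectly correlated assets the portfolio is as volatile as each asset on its own. Lower correlations reduce the combined volatility.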
To calculate a rolling correlation, two input price arrays of the same length are required, together with the length of the calculation window.
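To show what "rolling" means here, the sketch below computes a windowed correlation on two synthetic series using pandas' Series.rolling(...).corr(...). This is a stand-in for illustration and not the TA-Lib function. The series, seed, and window length are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable
n_days, window = 60, 20

# Two synthetic "price" series built from random daily changes (not real market data)
shared = rng.normal(0.0, 1.0, n_days)
series_a = pd.Series(100.0 + np.cumsum(shared + rng.normal(0.0, 0.5, n_days)))
series_b = pd.Series(50.0 + np.cumsum(shared + rng.normal(0.0, 0.5, n_days)))

# Correlation over each trailing window of `window` observations
rolling_r = series_a.rolling(window).corr(series_b)

# The first window - 1 entries are NaN because no full window is available yet
print(f"leading NaNs: {int(rolling_r.isna().sum())}")
print(rolling_r.dropna().tail(3).round(3))
```

Each output value is the Pearson correlation of the most recent `window` observations, so the result has the same length as the inputs with NaN placeholders at the start.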
TA-Lib Function: The Technical Analysis Library (TA-Lib) provides a function to calculate the rolling Pearson’s correlation coefficient:
talib.CORREL(prices1, prices2, timeperiod=N)
prices1: The first array or series of prices.
prices2: The second array or series of prices. It must be of the same length as prices1.
timeperiod: The lookback period over which to calculate the correlation.
Code Example (Calculation & Plot with yfinance Data):
The following Python code demonstrates how to fetch data for two assets using yfinance, align their price series, calculate their rolling correlation, and then plot one asset’s price along with the correlation coefficient.
import yfinance as yf
import talib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# --- 1. Data Fetching and Alignment ---
asset1_symbol = "AAPL"  # Example: Apple Inc.
asset2_symbol = "MSFT"  # Example: Microsoft Corp. (or use a market index like "SPY")

data_start_date = "2023-01-01"
data_end_date = "2024-05-01"  # End date for the yfinance download

try:
    # auto_adjust=False keeps the unadjusted 'Close' column
    data_asset1 = yf.download(asset1_symbol, start=data_start_date, end=data_end_date, auto_adjust=False, progress=False)
    data_asset2 = yf.download(asset2_symbol, start=data_start_date, end=data_end_date, auto_adjust=False, progress=False)

    if data_asset1.empty or data_asset2.empty:
        raise ValueError("Data download failed for one or both assets.")

    # If yfinance returns MultiIndex columns, drop the second level to get flat column names
    if isinstance(data_asset1.columns, pd.MultiIndex) and data_asset1.columns.nlevels > 1:
        data_asset1.columns = data_asset1.columns.droplevel(level=1)
    if isinstance(data_asset2.columns, pd.MultiIndex) and data_asset2.columns.nlevels > 1:
        data_asset2.columns = data_asset2.columns.droplevel(level=1)

    # Align data: Use 'Close' prices and join on date index, then drop NaNs from merged data
    close_prices_asset1 = data_asset1[['Close']].rename(columns={'Close': f'Close_{asset1_symbol}'})
    close_prices_asset2 = data_asset2[['Close']].rename(columns={'Close': f'Close_{asset2_symbol}'})

    aligned_data = close_prices_asset1.join(close_prices_asset2, how='inner').dropna()

    if aligned_data.empty or len(aligned_data) < 2:  # Need at least 2 points for correlation
        raise ValueError("Not enough overlapping data after alignment.")

    prices1 = aligned_data[f'Close_{asset1_symbol}']
    prices2 = aligned_data[f'Close_{asset2_symbol}']
    aligned_date_index = aligned_data.index

except Exception as e:
    print(f"Error in data fetching or alignment: {e}")
    # Fall back to an empty frame so the calculation and plotting steps are skipped
    aligned_data = pd.DataFrame()

if not aligned_data.empty:
    # --- 2. Pearson's Correlation Coefficient (CORREL) Calculation ---
    time_period_correl = 30  # Example: 30-day rolling correlation

    if len(prices1) >= time_period_correl:
        correl_values = talib.CORREL(
            prices1,
            prices2,
            timeperiod=time_period_correl
        )
        indicator_name = f"CORREL({time_period_correl})"
        print(f"\n--- {indicator_name} - Pearson's Correlation ({asset1_symbol} vs {asset2_symbol}) ---")

        # TA-Lib's CORREL returns NaNs for the first (timeperiod-1) values
        valid_correl_values = correl_values[~np.isnan(correl_values)]
        if len(valid_correl_values) >= 5:
            print(f"Output {indicator_name} (last 5 valid): {valid_correl_values[-5:].round(3)}")
        elif len(valid_correl_values) > 0:
            print(f"Output {indicator_name} (all valid): {valid_correl_values.round(3)}")
        else:
            print(f"Output {indicator_name}: No valid values calculated (all NaNs).")

        # --- 3. Plotting ---
        # The correl_values array will have NaNs at the beginning. Matplotlib will plot available data.
        # The aligned_date_index corresponds to the full length of prices1, prices2, and correl_values.
        fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True,
                                 gridspec_kw={'height_ratios': [2, 1]})  # Price chart taller

        # Plot Price of Asset 1 (for context)
        axes[0].plot(aligned_date_index, prices1, label=f'{asset1_symbol} Close Price', color='blue')
        axes[0].set_title(f'{asset1_symbol} Price and Rolling Correlation with {asset2_symbol}')
        axes[0].set_ylabel(f'{asset1_symbol} Price')
        axes[0].legend(loc='upper left')
        axes[0].grid(True, linestyle=':', alpha=0.6)

        # Plot Correlation Coefficient
        axes[1].plot(aligned_date_index, correl_values, label=indicator_name, color='purple')
        axes[1].axhline(1.0, color='red', linestyle='--', linewidth=0.8, label='Perfect Positive (+1)')
        axes[1].axhline(0, color='gray', linestyle=':', linewidth=0.8, label='No Correlation (0)')
        axes[1].axhline(-1.0, color='green', linestyle='--', linewidth=0.8, label='Perfect Negative (-1)')
        axes[1].set_ylim(-1.1, 1.1)  # Correlation ranges from -1 to 1
        axes[1].set_ylabel('Correlation Coefficient')
        axes[1].set_xlabel('Date')
        axes[1].legend(loc='lower left')
        axes[1].grid(True, linestyle=':', alpha=0.6)

        plt.tight_layout()
        plt.show()
    else:
        print(f"\nSkipping CORREL plot: Insufficient aligned data (need >= {time_period_correl} points).")
        print(f"Available aligned data points: {len(prices1)}.")
else:
    print("\nSkipping CORREL plot: Data preparation failed.")
Explanation of the Code:
Imports: The script imports yfinance, talib, numpy, matplotlib.pyplot, and pandas.
Data fetching: Price data for the two assets (asset1_symbol, asset2_symbol) is downloaded using yf.download() with auto_adjust=False. If the returned columns are a MultiIndex, the extra level is removed with droplevel.
Alignment: The 'Close' columns are merged using pd.DataFrame.join(..., how='inner'). This ensures that only dates where both assets have price data are kept. dropna() is called on the merged data to remove any remaining NaNs. This step is vital for a correct correlation calculation.
Calculation: time_period_correl (e.g., 30) is set for the rolling window. The check if len(prices1) >= time_period_correl: ensures enough aligned data points. talib.CORREL(prices1, prices2, timeperiod=time_period_correl) calculates the rolling correlation. The output will have NaN values for the first timeperiod - 1 entries.
Plotting: Two vertically stacked subplots are created using plt.subplots(). The top panel plots the first asset’s closing price (prices1) to provide price context, and the bottom panel plots the rolling correlation with reference lines at +1, 0, and -1. plt.tight_layout() adjusts subplot spacing. plt.show() displays the chart.
By calculating and visualizing the rolling correlation, traders and investors can gain valuable insights into how different assets interact, aiding in diversification strategies, pairs trading idea generation, and overall risk management.
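The alignment step described above can be tried in isolation without downloading anything. The sketch below builds two small synthetic price frames that cover different date ranges, with one missing quote, and shows how join(how='inner') followed by dropna() leaves only dates on which both series have a value. The column names Close_A and Close_B, the dates, and the prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical assets whose data cover different, overlapping date ranges
dates_a = pd.date_range("2024-01-01", periods=8, freq="D")
dates_b = pd.date_range("2024-01-03", periods=8, freq="D")

close_a = pd.DataFrame({"Close_A": np.linspace(100.0, 107.0, 8)}, index=dates_a)
close_b = pd.DataFrame({"Close_B": np.linspace(50.0, 57.0, 8)}, index=dates_b)
close_b.iloc[2, 0] = np.nan  # simulate a missing quote on 2024-01-05

# Inner join keeps only dates present in both frames; dropna removes the missing quote
aligned = close_a.join(close_b, how="inner").dropna()

print(aligned)
print(f"rows kept: {len(aligned)}")
```

Without this step the two arrays passed to a correlation function could differ in length or pair up prices from different dates, which would make the result meaningless.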