Diagnosing regression with interactive QQ plots

QQ plots, despite their importance, are not well-supported in Python, as of the moment of writing this note. The package statsmodels offers plots of rather poor quality and non-interactive, and even Plotly doesn’t come with its own recipe. In this note, I am sharing how to build an interactive QQ plot and apply it to improve linear regression.

I will use the state-of-the art Vega-Lite language dedicated to interactive visualization, available in Python through the package Altair API.

Let’s use the famous “Prestige” dataset​1​, and look how prestige is explained by income, education, and job type:

\( prestige \sim income + education:type \)

The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that the normality of residuals is broken by an anomaly visible on QQ plots. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money 😉

After removing the political anomaly, the model behaves properly 😉

OLS Regression Results
Dep. Variable: prestige R-squared: 0.922
Model: OLS Adj. R-squared: 0.914
Method: Least Squares F-statistic: 115.8
Date: Sat, 06 Apr 2024 Prob (F-statistic): 4.31e-21
Time: 18:03:24 Log-Likelihood: -157.21
No. Observations: 44 AIC: 324.4
Df Residuals: 39 BIC: 333.3
Df Model: 4
Covariance Type: nonrobust
Omnibus: 3.775 Durbin-Watson: 1.960
Prob(Omnibus): 0.151 Jarque-Bera (JB): 2.734
Skew: 0.583 Prob(JB): 0.255
Kurtosis: 3.365 Cond. No. 261.

Diagnosing and fixing regression by identifying anomalies with QQ-is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:

import numpy as np
from scipy import stats
import pandas as pd
import altair as alt

import statsmodels.api as sm
from statsmodels.formula.api import ols

# fit a model 

data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
model = ols("prestige ~ income + education:C(type)", data=data).fit() 

# plot residuals against normal quantiles

empirical_quantiles = model.resid.sort_values()
dist = stats.norm

# for ploting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentiles
nobs = len(empirical_quantiles)
pos = np.arange(1.,nobs+1) /(nobs + 1) 
theoretical_quantiles = dist.ppf(pos)

df = pd.DataFrame(data={'res':empirical_quantiles.values,dist.name:theoretical_quantiles,'label':empirical_quantiles.index})

qqplot = alt.Chart(df).mark_point().encode(
    y='res',
    x=dist.name,
    tooltip='label'
).interactive()

# the line is fit robustly based on 1st and 3rd quartile
q1,q3 = stats.mstats.mquantiles(empirical_quantiles, [0.25,0.75])
tq1,tq3 = dist.ppf([0.25,0.75])
linex = theoretical_quantiles
liney = (q3-q1) / (tq3-tq1) * (linex-tq1) + q1

qqline = alt.Chart(pd.DataFrame({'res':liney,dist.name:linex})).mark_line().encode(
    y='res',
    x=dist.name,
)

# combine two plots
qqplot + qqline

# fit a fixed model 

data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
mask = ~data.index.isin(["minister"])
model = ols("prestige ~ income + education:C(type)", data=data[mask]).fit() 
model.summary()

References

  1. 1.
    Duncan OD. A Socioeconomic Index for All Occupations. Occupations and Social Status. 1961;(7).

Published by mskorski

Scientist, Consultant, Learning Enthusiast

Leave a comment

Your email address will not be published. Required fields are marked *