Diagnosing regression with interactive QQ plots

QQ plots, despite their importance, are not well-supported in Python, as of the moment of writing this note. The package statsmodels offers plots of rather poor quality and non-interactive, and even Plotly doesn’t come with its own recipe. In this note, I am sharing how to build an interactive QQ plot and apply it to improve linear regression.

I will use the state-of-the art Vega-Lite language dedicated to interactive visualization, available in Python through the package Altair API.

Let’s use the famous “Prestige” dataset¹, and look how prestige is explained by income, education, and job type:

\( prestige \sim income + education:type \)

The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that the normality of residuals is broken by an anomaly visible on QQ plots. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money 😉

After removing the political anomaly, the model behaves properly 😉

OLS Regression Results
Dep. Variable:	prestige	R-squared:	0.922
Model:	OLS	Adj. R-squared:	0.914
Method:	Least Squares	F-statistic:	115.8
Date:	Sat, 06 Apr 2024	Prob (F-statistic):	4.31e-21
Time:	18:03:24	Log-Likelihood:	-157.21
No. Observations:	44	AIC:	324.4
Df Residuals:	39	BIC:	333.3
Df Model:	4
Covariance Type:	nonrobust

Omnibus:	3.775	Durbin-Watson:	1.960
Prob(Omnibus):	0.151	Jarque-Bera (JB):	2.734
Skew:	0.583	Prob(JB):	0.255
Kurtosis:	3.365	Cond. No.	261.

Diagnosing and fixing regression by identifying anomalies with QQ-is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:

import numpy as np
from scipy import stats
import pandas as pd
import altair as alt

import statsmodels.api as sm
from statsmodels.formula.api import ols

# fit a model 

data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
model = ols("prestige ~ income + education:C(type)", data=data).fit() 

# plot residuals against normal quantiles

empirical_quantiles = model.resid.sort_values()
dist = stats.norm

# for ploting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentiles
nobs = len(empirical_quantiles)
pos = np.arange(1.,nobs+1) /(nobs + 1) 
theoretical_quantiles = dist.ppf(pos)

df = pd.DataFrame(data={'res':empirical_quantiles.values,dist.name:theoretical_quantiles,'label':empirical_quantiles.index})

qqplot = alt.Chart(df).mark_point().encode(
    y='res',
    x=dist.name,
    tooltip='label'
).interactive()

# the line is fit robustly based on 1st and 3rd quartile
q1,q3 = stats.mstats.mquantiles(empirical_quantiles, [0.25,0.75])
tq1,tq3 = dist.ppf([0.25,0.75])
linex = theoretical_quantiles
liney = (q3-q1) / (tq3-tq1) * (linex-tq1) + q1

qqline = alt.Chart(pd.DataFrame({'res':liney,dist.name:linex})).mark_line().encode(
    y='res',
    x=dist.name,
)

# combine two plots
qqplot + qqline

# fit a fixed model 

data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
mask = ~data.index.isin(["minister"])
model = ols("prestige ~ income + education:C(type)", data=data[mask]).fit() 
model.summary()

References

1.
Duncan OD. A Socioeconomic Index for All Occupations. Occupations and Social Status. 1961;(7).

Post Views: 1,135

Published by mskorski

Scientist, Consultant, Learning Enthusiast View more posts

Diagnosing regression with interactive QQ plots

References

Published by mskorski

Leave a comment

Cancel reply