QQ plots, despite their importance, are not well-supported in Python, as of the moment of writing this note. The package statsmodels offers plots of rather poor quality and non-interactive, and even Plotly doesn’t come with its own recipe. In this note, I am sharing how to build an interactive QQ plot and apply it to improve linear regression.
I will use the state-of-the art Vega-Lite language dedicated to interactive visualization, available in Python through the package Altair API.
Let’s use the famous “Prestige” dataset1, and look how prestige is explained by income, education, and job type:
\( prestige \sim income + education:type \)The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that the normality of residuals is broken by an anomaly visible on QQ plots. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money 😉
After removing the political anomaly, the model behaves properly 😉
| Dep. Variable: | prestige | R-squared: | 0.922 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.914 |
| Method: | Least Squares | F-statistic: | 115.8 |
| Date: | Sat, 06 Apr 2024 | Prob (F-statistic): | 4.31e-21 |
| Time: | 18:03:24 | Log-Likelihood: | -157.21 |
| No. Observations: | 44 | AIC: | 324.4 |
| Df Residuals: | 39 | BIC: | 333.3 |
| Df Model: | 4 | ||
| Covariance Type: | nonrobust |
| Omnibus: | 3.775 | Durbin-Watson: | 1.960 |
|---|---|---|---|
| Prob(Omnibus): | 0.151 | Jarque-Bera (JB): | 2.734 |
| Skew: | 0.583 | Prob(JB): | 0.255 |
| Kurtosis: | 3.365 | Cond. No. | 261. |
Diagnosing and fixing regression by identifying anomalies with QQ-is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:
import numpy as np
from scipy import stats
import pandas as pd
import altair as alt
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit a model
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
model = ols("prestige ~ income + education:C(type)", data=data).fit()
# plot residuals against normal quantiles
empirical_quantiles = model.resid.sort_values()
dist = stats.norm
# for ploting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentiles
nobs = len(empirical_quantiles)
pos = np.arange(1.,nobs+1) /(nobs + 1)
theoretical_quantiles = dist.ppf(pos)
df = pd.DataFrame(data={'res':empirical_quantiles.values,dist.name:theoretical_quantiles,'label':empirical_quantiles.index})
qqplot = alt.Chart(df).mark_point().encode(
y='res',
x=dist.name,
tooltip='label'
).interactive()
# the line is fit robustly based on 1st and 3rd quartile
q1,q3 = stats.mstats.mquantiles(empirical_quantiles, [0.25,0.75])
tq1,tq3 = dist.ppf([0.25,0.75])
linex = theoretical_quantiles
liney = (q3-q1) / (tq3-tq1) * (linex-tq1) + q1
qqline = alt.Chart(pd.DataFrame({'res':liney,dist.name:linex})).mark_line().encode(
y='res',
x=dist.name,
)
# combine two plots
qqplot + qqline
# fit a fixed model
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
mask = ~data.index.isin(["minister"])
model = ols("prestige ~ income + education:C(type)", data=data[mask]).fit()
model.summary()References
- 1.Duncan OD. A Socioeconomic Index for All Occupations. Occupations and Social Status. 1961;(7).
