QQ plots, despite their importance, are not well-supported in Python, as of the moment of writing this note. The package statsmodels offers plots of rather poor quality and non-interactive, and even Plotly doesn’t come with its own recipe. In this note, I am sharing how to **build an interactive QQ plot and apply it to improve linear regression**.

I will use the state-of-the art **Vega-Lite language dedicated to interactive visualization**, available in Python through the package Altair API.

Let’s use the famous “Prestige” dataset^{1}, and look **how prestige is explained by income, education, and job type**:

The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that **the normality of residuals is broken by an anomaly visible on QQ plots**. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money 😉

After removing the political anomaly, the model behaves properly 😉

Dep. Variable: | prestige | R-squared: | 0.922 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.914 |

Method: | Least Squares | F-statistic: | 115.8 |

Date: | Sat, 06 Apr 2024 | Prob (F-statistic): | 4.31e-21 |

Time: | 18:03:24 | Log-Likelihood: | -157.21 |

No. Observations: | 44 | AIC: | 324.4 |

Df Residuals: | 39 | BIC: | 333.3 |

Df Model: | 4 | ||

Covariance Type: | nonrobust |

Omnibus: | 3.775 | Durbin-Watson: | 1.960 |
---|---|---|---|

Prob(Omnibus): | 0.151 | Jarque-Bera (JB): | 2.734 |

Skew: | 0.583 | Prob(JB): | 0.255 |

Kurtosis: | 3.365 | Cond. No. | 261. |

Diagnosing and fixing regression by identifying anomalies with QQ-is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:

```
import numpy as np
from scipy import stats
import pandas as pd
import altair as alt
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit a model
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
model = ols("prestige ~ income + education:C(type)", data=data).fit()
# plot residuals against normal quantiles
empirical_quantiles = model.resid.sort_values()
dist = stats.norm
# for ploting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentiles
nobs = len(empirical_quantiles)
pos = np.arange(1.,nobs+1) /(nobs + 1)
theoretical_quantiles = dist.ppf(pos)
df = pd.DataFrame(data={'res':empirical_quantiles.values,dist.name:theoretical_quantiles,'label':empirical_quantiles.index})
qqplot = alt.Chart(df).mark_point().encode(
y='res',
x=dist.name,
tooltip='label'
).interactive()
# the line is fit robustly based on 1st and 3rd quartile
q1,q3 = stats.mstats.mquantiles(empirical_quantiles, [0.25,0.75])
tq1,tq3 = dist.ppf([0.25,0.75])
linex = theoretical_quantiles
liney = (q3-q1) / (tq3-tq1) * (linex-tq1) + q1
qqline = alt.Chart(pd.DataFrame({'res':liney,dist.name:linex})).mark_line().encode(
y='res',
x=dist.name,
)
# combine two plots
qqplot + qqline
# fit a fixed model
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
mask = ~data.index.isin(["minister"])
model = ols("prestige ~ income + education:C(type)", data=data[mask]).fit()
model.summary()
```

### References

- 1.Duncan OD. A Socioeconomic Index for All Occupations.
*Occupations and Social Status*. 1961;(7).