For a good introduction to state-of-the-art Fourier techniques for Boolean functions, see . Recall that every \(f : \{0,1\}^n \to \mathbb{R}\) has a (Fourier) expansion as a combination of parity functions:

$$

f(x) = \sum_{S \subseteq [n]} \widehat{f}(S) \chi_S(x),

$$ with coefficients given by

$$

\widehat{f}(S) = 2^{-n}\sum_x f(x) \chi_S(x).

$$

Parseval's theorem then captures the isometric property of the Fourier transform:

\begin{equation}

2^{-n} \sum_{x\in\{0,1\}^n} f(x)^2 = \sum_{S \subseteq [n]} \widehat{f}(S)^2.

\end{equation}

Fourier analysis is often applied to probability distributions \(\mu\) on bit strings. In this case, the Parseval theorem can be restated nicely using the notion of bit bias:

$$

\left\| \mu - 2^{-n}\right\|_2 = \sqrt{2^{n} \sum_{\emptyset\not=S \subseteq [n]} \widehat{\mu}(S)^2}

=\sqrt{2^{-n} \sum_{\emptyset\not=S \subseteq [n]} \operatorname{bias}\left( \oplus_{i\in S} X_i \right)^2 },

$$ and in this modern form it is known (not surprisingly) as the XOR lemma . Recall that the bit bias is \(\mathrm{bias}(X) = \mathbf{P}\{X=1\}-\mathbf{P}\{X=0\} \).

A linear code is a subspace of \(\mathbb{F}_2^n\), generated by the rows of a matrix \(G \in \{0,1\}^{m\times n}\). We apply the code matrix to a source \(X_1,\ldots,X_n\), obtaining

$$

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix} = G \cdot \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}.

$$ To generate useful output, the code needs more structure, for instance relatively large Hamming weights. This means that every non-empty subset of rows of the matrix should sum (in \(\mathbb{F}_2\), that is modulo 2) to a vector with not-too-small Hamming weight, as this ensures small biases in the XOR lemma.
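As a toy illustration, applying the code matrix is just a matrix-vector product reduced modulo 2; the small matrix and input below are made-up examples.

```python
# Minimal sketch of Y = G X over F_2; the 2x3 matrix G is a made-up example.
import numpy as np

G = np.array([[1, 0, 1],
              [0, 1, 1]])    # m = 2 rows, n = 3 columns
X = np.array([1, 1, 0])      # source bits X_1, ..., X_n
Y = G.dot(X) % 2             # output bits, reduced modulo 2
print(Y)                     # [1 1]
```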

Say that the \(X_i\) are IID with bias \(\delta\). Then we can express **the collision entropy** of the output in terms of the code's Hamming weights, using Fourier analysis. Indeed, applying the XOR lemma:

$$

\|\mathbf{P}_Y\|_2^2- 2^{-m} = \|\mathbf{P}_Y - 2^{-m}\|_2^2= 2^{-m}\sum_{\mathbf{0}\not=u \in \operatorname{rowspan}(G) } \operatorname{bias}( \oplus_i u_i X_i)^2 = 2^{-m} \sum_{k \geqslant 1} \mathbf{W}(k) \delta^{2k } ,

$$ where \(\mathbf{W}(k)\) is the number of codewords with Hamming weight \(k\) (recall that the squared second norm is the collision probability, which determines the collision entropy).
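To make the formula concrete, here is a small sketch that enumerates the row span of a generator matrix, computes the weight enumerator \(\mathbf{W}(k)\), and evaluates the excess collision probability; the [7,4] Hamming generator below is an assumption chosen purely for illustration.

```python
# Sketch: weight enumerator and collision bound for a small linear code
# applied to IID bits with bias delta. G below is a generator of the
# [7,4] Hamming code (one common systematic form) - an illustrative choice.
from itertools import product
from collections import Counter

G = [[1,0,0,0,0,1,1],
     [0,1,0,0,1,0,1],
     [0,0,1,0,1,1,0],
     [0,0,0,1,1,1,1]]

def weight_enumerator(G):
    """Count codewords of each Hamming weight by enumerating the row span."""
    W = Counter()
    for coeffs in product([0, 1], repeat=len(G)):
        word = [sum(c * g for c, g in zip(coeffs, col)) % 2 for col in zip(*G)]
        W[sum(word)] += 1
    return W

def collision_excess(G, delta):
    """2^{-m} sum_{k>=1} W(k) delta^{2k}: the excess collision probability."""
    W = weight_enumerator(G)
    return sum(W[k] * delta**(2 * k) for k in W if k > 0) / 2**len(G)

W = weight_enumerator(G)   # for Hamming [7,4]: W(3) = W(4) = 7, W(7) = 1
print(collision_excess(G, 0.1))
```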

Using standard norm inequalities, we obtain the following useful bounds

\begin{align}

\begin{aligned}

\|\mathbf{P}_Y\|_{\infty} - 2^{-m} & \leq \left( 2^{-m} \sum_{k \geqslant 1} \mathbf{W}(k) \delta^{2k }\right)^{1/2} \\

\|\mathbf{P}_Y - 2^{-m}\|_{1} & \leq \left( \sum_{k \geqslant 1} \mathbf{W}(k) \delta^{2k } \right)^{1/2}.

\end{aligned}

\end{align}

for the **min-entropy and smooth min-entropy** of the output \(Y\), respectively.

When the code has distance \(d\), i.e. \(\mathbf{W}(k)=0\) for \(0<k<d\), then

$$

\begin{aligned}

\max_{y} \mathbf{P}\{Y=y\} & \leq 2^{-m} + \delta^{d} \\

\|\mathbf{P}_Y - 2^{-m}\|_{1} & \leq 2^{m/2}\delta^d

\end{aligned}

$$

The first bound is due to . The second bound is stronger than Corollary 3 in . The general formula for the second norm appears to be somewhat novel. It can be used to improve upon results of the *from high unpredictability to near uniformity* type, as high min-entropy alone is not sufficient for all cryptographic applications.

While the derived formula is explicit, computing the weight distribution of a code is in general a hard problem. It would be interesting to propose approximation schemes for this purpose.

The Poisson Binomial distribution \(Y\sim \mathrm{PB}(p) \) is defined for \(p=(p_1,\ldots, p_n)\) as

$$

Y = \sum_{i=1}^n X_i,\quad X_i \sim \mathrm{Bern}(p_i).

$$

We are interested in the concentration properties of \(Y\) and its dependency on \( p=(p_1,\ldots,p_n)\). To (partly) answer this question, show that for any convex and non-decreasing function \(f\), the \(f\)-expectation of the PB distribution is Schur-concave with respect to the parameters:

$$

\mathbf{E} f(\mathrm{PB}(p) ) \leq \mathbf{E} f(\mathrm{PB}(p') ),\quad p' \prec p,

$$

where the \(X_i\) are independent and \(p' \prec p\) denotes vector majorization (\(p'\) is majorized by \(p\)). In particular, this means that

$$

\mathbf{E} f(\mathrm{PB}(p) ) \leq \mathbf{E}f(\mathrm{Binom}(n,\bar{p})),\quad \bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i,

$$

which means that the Binomial case is the "most dispersed" one. For related work, see the recent survey , in particular the Schur-concavity of the tails. The result above is more general: by focusing on functional expectations, it yields various concentration bounds and applies to setups with limited dependence, such as \(k\)-wise independence. Indeed, using \(f(x) = \max\{x,0\}^d\) and assuming that every \(d\) of the \(X_i\) are independent, we can improve the previous concentration bounds . Numerically optimal results may be computed with recent formulas for binomial moments .
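As a quick sanity check of the majorization claim, the sketch below compares \(\mathbf{E} f(\mathrm{PB}(p))\) with \(\mathbf{E} f(\mathrm{Binom}(n,\bar{p}))\) by brute-force enumeration; the parameter vector and the convex non-decreasing \(f\) are arbitrary illustrative choices.

```python
# Illustration: E f(PB(p)) <= E f(Binom(n, p_bar)) for convex nondecreasing f;
# both expectations are computed by direct enumeration (small n only).
from itertools import product
import numpy as np

def pb_expect(f, p):
    """E f(sum_i Bernoulli(p_i)) for independent summands."""
    p = np.asarray(p, dtype=float)
    total = 0.0
    for bits in product([0, 1], repeat=len(p)):
        bits = np.asarray(bits)
        prob = np.prod(np.where(bits == 1, p, 1 - p))
        total += prob * f(bits.sum())
    return total

f = lambda y: max(y - 2, 0) ** 2     # convex and non-decreasing
p = [0.1, 0.3, 0.8, 0.6]             # made-up parameters
p_bar = float(np.mean(p))            # Binomial with the same mean
print(pb_expect(f, p), pb_expect(f, [p_bar] * len(p)))
```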

To prove the claim, we will check the Schur-Ostrowski criterion, namely that

$$

\Delta_{i,j}\triangleq \left(p_i - p_j\right) \left(\frac{\partial \mathbf{E} f(\mathrm{PB}(p) ) }{ \partial p_i} - \frac{\partial \mathbf{E} f(\mathrm{PB}(p) ) }{ \partial p_j}\right) \leq 0.

$$

Due to the independence of the summands

$$\mathbf{E}f(Y) = \sum_{y}\mathbf{P}\{Y-X_i-X_j= y\}\, \mathbf{E}_{X_i,X_j}[f(X_i + X_j + y)],$$

it suffices to test the criterion on any pair of variables, while treating the sum of the remaining variables \(Y-X_i-X_j\) as fixed. Since the convexity and monotonicity of \(f\) are invariant under translations, we can just work with \(\mathbf{E} f(X_i+ X_j)\). Without loss of generality we take \(i=1, j=2\), as the labeling of the variables is arbitrary. Now, let's implement this in Python!

```
import sympy as sm
from sympy.stats import E, Bernoulli
# define the objective E f(X1 + X2)
p1, p2 = sm.symbols('p1 p2', positive=True)
X1 = Bernoulli('b_1', p1)
X2 = Bernoulli('b_2', p2)
f = sm.Function('f')
S = f(X1 + X2)
# evaluate the Schur-Ostrowski expression
schur_diff = (p1 - p2) * (sm.diff(E(S), p1) - sm.diff(E(S), p2))
schur_diff = schur_diff.expand().collect([f(2), f(1), f(0)])
schur_diff
```

We obtain

$$
\Delta_{i,j}=\left(- p_{1}^{2} + 2 p_{1} p_{2} - p_{2}^{2}\right) f{\left(2 \right)} + \left(2 p_{1}^{2} - 4 p_{1} p_{2} + 2 p_{2}^{2}\right) f{\left(1 \right)} + \left(- p_{1}^{2} + 2 p_{1} p_{2} - p_{2}^{2}\right) f{\left(0 \right)} = -\left(p_{1}-p_{2}\right)^{2}\left( f(2) - 2 f(1) + f(0)\right).
$$

To make use of the convexity assumption, let us introduce \(\delta_{convex} = \frac{f(2)+f(0)}{2}-f(1)\), which is non-negative, so that \(f(2) = 2 f(1) - f(0) + 2\delta_{convex}\). Further calculations in SymPy

```
delta_cvx = sm.symbols('delta_convex', nonnegative=True)
schur_diff = schur_diff.subs({f(2): 2*f(1) - f(0) + 2*delta_cvx})
sm.factor(schur_diff.expand())
```

show that

$$
\Delta_{i,j} = - 2 \delta_{convex} \left(p_{1} - p_{2}\right)^{2},
$$

which is clearly non-positive. Interestingly, the monotone increment \(\delta_{monotone} = f(1)-f(0)\) drops out entirely, so only convexity is needed in this exchange step.

Skorski, M. (2024) ‘Handy formulas for binomial moments’, *Modern Stochastics: Theory and Applications*, pp. 1–15. Available at: https://doi.org/10.15559/24-VMSTA260.

Tang, W. and Tang, F. (2023) ‘The Poisson Binomial Distribution— Old & New’, *Statistical Science*, 38(1). Available at: https://doi.org/10.1214/22-STS852.

Skorski, M. (2022) ‘Tight Chernoff-Like Bounds Under Limited Independence’, *LIPIcs, Volume 245, APPROX/RANDOM 2022*, 245, p. 15:1-15:14. Available at: https://doi.org/10.4230/LIPICS.APPROX/RANDOM.2022.15.

Consider an ordered iid sample \(x^{(1)} < x^{(2)} < \ldots < x^{(n)}\) from a distribution \(X\), and let \(q\) be its \(p\)-th quantile, that is \(\mathbf{P}\{X\leqslant q\}=p\). Using the samples, we want to build the shortest possible confidence interval with the required coverage, that is to solve the optimization problem

\begin{align}

\begin{aligned}

\mathrm{minimize} && r-\ell \\

\mathrm{subject\ to} && \mathbf{P}\{ x^{(\ell)} \leqslant q \leqslant x^{(r)} \} \geqslant 1-\alpha

\end{aligned}

\end{align}

We start with establishing explicit formulas for the coverage probability. First,

$$

\mathbf{P}\{ x^{(k)} \leqslant q \} = \mathbf{P}\{ \mathsf{Binom}(n,p) \geqslant k\},

$$

because \(x^{(k)} \leqslant q\) means that at least \(k\) samples are smaller than or equal to \(q\). Accordingly,

$$

\mathbf{P}\{ x^{(k)} < q \} = \mathbf{P}\{ \mathsf{Binom}(n,p^{-}) \geqslant k\},

$$

with \(p^{-} = \mathbf{P}\{X < q\} \). Therefore, we obtain

$$

\mathbf{P}\{ x^{(\ell)} \leqslant q \leqslant x^{(r)} \} = \mathbf{P}\{ \mathsf{Binom}(n,p) \geqslant \ell\} - \mathbf{P}\{ \mathsf{Binom}(n,p^{-}) \geqslant r\},

$$

which reduces to \(\mathbf{P}\{ x^{(\ell)} \leqslant q \leqslant x^{(r)} \} = \mathbf{P}\{ \ell\leqslant \mathsf{Binom}(n,p) \leqslant r-1\} \) if \(q\) is a continuity point, so that \(p^{-}=p\) (this has been known for strictly increasing continuous CDFs ^{1}). Since

$$

1-\mathbf{P}\{ x^{(\ell)} \leqslant q \leqslant x^{(r)} \} = \mathbf{P}\{ \mathsf{Binom}(n,p) \leqslant \ell-1\} + \mathbf{P}\{ \mathsf{Binom}(n,p^{-}) \geqslant r\},

$$

the smallest interval with coverage at least \(1-\alpha\) is determined by the optimization task

\begin{align}

\begin{aligned}

\min_{\ell\leqslant r}&& r-\ell \\

&& \mathbf{P}\{ \mathsf{Binom}(n,p) \leqslant \ell-1\} + \mathbf{P}\{ \mathsf{Binom}(n,p^{-}) \geqslant r\} \leqslant \alpha

\end{aligned} && (*)

\end{align}

which is feasible (taking \(\ell=1\), \(r=n\)) precisely when

\begin{align}

(1-p)^{n} + (p^{-})^n \leqslant \alpha.

\end{align}
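The coverage identity above is easy to sanity-check by simulation; the sketch below uses a standard normal sample (a continuous distribution, so \(p^{-}=p\)), with arbitrary choices of \(n\), \(p\), \(\ell\), \(r\).

```python
# Monte Carlo check of the coverage identity
# P{x_(l) <= q <= x_(r)} = P{Binom(n,p) >= l} - P{Binom(n,p) >= r}
# for a continuous distribution (p^- = p); n, p, l, r are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, l, r = 16, 0.75, 10, 16
q = stats.norm.ppf(p)                            # the p-th quantile of N(0,1)
samples = np.sort(rng.standard_normal((100_000, n)), axis=1)
coverage = np.mean((samples[:, l-1] <= q) & (q <= samples[:, r-1]))
binom = stats.binom(n, p)
formula = binom.sf(l-1) - binom.sf(r-1)          # P{B >= l} - P{B >= r}
print(coverage, formula)
```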

Introducing \(r' := n - r + 1\), we obtain the more convenient, symmetric formulation

\begin{align}

\begin{aligned}

\max_{\ell+r'\leqslant n+1}&& \ell + r' \\

&& \mathbf{P}\{ \mathsf{Binom}(n,p) \leqslant \ell-1\} + \mathbf{P}\{ \mathsf{Binom}(n,1-p^{-}) \leqslant r'-1\} \leqslant \alpha

\end{aligned}

\end{align}

and use this formulation to develop an algorithm in Python

```
import numpy as np
from scipy import stats

def quantile_ci(x, p, alpha, p_minus=None):
    """Shortest (l, r) with P{ x_(l) <= q <= x_(r) } >= 1 - alpha."""
    # solve the related binomial covering problem
    n = len(x)
    if p_minus is None:
        p_minus = p
    ks = np.arange(n + 1)
    probs_l = stats.binom(n, p).pmf(ks)
    probs_r = stats.binom(n, 1 - p_minus).pmf(ks)
    i_l = 0
    i_r = 0
    total = probs_l[i_l] + probs_r[i_r]
    while (i_l + i_r < n) and (total <= alpha):
        # greedily extend the side with the smaller next probability mass
        p_l = probs_l[i_l + 1]
        p_r = probs_r[i_r + 1]
        if p_l <= p_r:
            total += p_l
            if total <= alpha:
                i_l += 1
        else:
            total += p_r
            if total <= alpha:
                i_r += 1
    # convert to original indices
    i_l = i_l + 1
    i_r = (n + 1) - (i_r + 1)
    return (i_l, i_r)
```

As an example, consider times to failure in hours for a set of pentode radio valves ^{2}:

```
import numpy as np
x = [46.9, 47.2, 49.1, 56.5, 56.8, 59.2, 59.9, 63.2,
     63.3, 63.4, 63.7, 64.1, 67.1, 67.7, 73.3, 78.5]
p = 0.75
alpha = 0.1
i_l, i_r = quantile_ci(x, p, alpha)
np.sort(x)[[i_l-1, i_r-1]]  # [63.4, 78.5]
```

- 1. Zieliński R, Zieliński W. Best exact nonparametric confidence intervals for quantiles. *Statistics*. 2005:67-71. doi:10.1080/02331880412331329854
- 2. Conover WJ. *Practical Nonparametric Statistics*. John Wiley & Sons; 1999.


The zero-truncated Poisson distribution \(Y\) is defined as

\begin{align}

\mathbf{P}\{Y=x\} = \mathbf{P}\{ X = x | X > 0\}, \quad X\sim \mathrm{Poiss}(\lambda) .

\end{align}
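The definition translates directly into a pmf, since \(\mathbf{P}\{X=x \mid X>0\} = \mathbf{P}\{X=x\}/(1-\mathbf{P}\{X=0\})\) for \(x\geq 1\); a minimal sketch:

```python
# Sketch: pmf of the zero-truncated Poisson obtained by conditioning on X > 0.
from scipy import stats

def ztp_pmf(x, lam):
    """P{X = x | X > 0} for X ~ Poisson(lam), defined for x >= 1."""
    return stats.poisson(lam).pmf(x) / (1 - stats.poisson(lam).pmf(0))

probs = [ztp_pmf(x, 5) for x in range(1, 100)]
print(sum(probs))   # the truncated pmf sums to 1
```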

Conditioning away the zero counts makes it a natural fit for count data that starts from 1. For example, consider the following historical UK data from a demographic research study ^{2}

The source data can be read into Python and visualized as follows

```
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
N = 98
counts = np.array([5.7,14.2,16.5,15.8,14.7,11.8,8.0,5.4,3.1,1.9,1.1,0.7,1.1]) / 100 * N
bins = pd.IntervalIndex.from_breaks(list(np.arange(1,13+1)) + [np.inf],closed="left",name="household size")
counts = pd.Series(data=counts,name="count",index=bins)
fig,ax = plt.subplots(figsize=(12,4))
sns.barplot(counts.reset_index(),x="household size",y="count",ax=ax)
ax.tick_params(axis='x', labelrotation=45)
```

Now, we hypothesize a zero-truncated Poisson distribution with \(\lambda = 5\), and test the fit quality by adapting the chi-squared method. This technique boils down to binning the reference distribution into finitely many categories, consistently with the empirical data, and is well explained in R packages^{3}. We obtain a high p-value, which confirms a good fit:

```
from scipy import stats
pmf = pd.Series({i:stats.poisson(5).pmf(i) for i in range(0,100)})
pmf = pmf.groupby(pd.cut(pmf.index, counts.index)).sum()
pmf = pmf / pmf.sum()
counts_expected = pmf * N
stats.chisquare(counts,counts_expected, ddof=1)
# statistic=12.157245195139515, pvalue=0.35194047158664454
```

- 1. Jennings V, Lloyd-Smith B, Ironmonger D. Household size and the poisson distribution. *Journal of Population Research*. 1999:65-84. doi:10.1007/bf03029455
- 2. Laslett P. Size and structure of the household in England over three centuries. *Population Studies*. 1969:199-223. doi:10.1080/00324728.1969.10405278
- 3. Millard SP. *EnvStats*. Springer New York; 2013. doi:10.1007/978-1-4614-8456-1

I will use the state-of-the-art **Vega-Lite language dedicated to interactive visualization**, available in Python through the Altair package.

Let’s use the famous “Prestige” dataset^{1}, and look **how prestige is explained by income, education, and job type**:

The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that **the normality of residuals is broken by an anomaly visible on QQ plots**. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money.

After removing the political anomaly, the model behaves properly

| Dep. Variable: | prestige | R-squared: | 0.922 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.914 |
| Method: | Least Squares | F-statistic: | 115.8 |
| Date: | Sat, 06 Apr 2024 | Prob (F-statistic): | 4.31e-21 |
| Time: | 18:03:24 | Log-Likelihood: | -157.21 |
| No. Observations: | 44 | AIC: | 324.4 |
| Df Residuals: | 39 | BIC: | 333.3 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| Omnibus: | 3.775 | Durbin-Watson: | 1.960 |
|---|---|---|---|
| Prob(Omnibus): | 0.151 | Jarque-Bera (JB): | 2.734 |
| Skew: | 0.583 | Prob(JB): | 0.255 |
| Kurtosis: | 3.365 | Cond. No. | 261. |

Diagnosing and fixing a regression by identifying anomalies on QQ plots is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:

```
import numpy as np
from scipy import stats
import pandas as pd
import altair as alt
import statsmodels.api as sm
from statsmodels.formula.api import ols

# fit a model
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
model = ols("prestige ~ income + education:C(type)", data=data).fit()

# plot residuals against normal quantiles
empirical_quantiles = model.resid.sort_values()
dist = stats.norm
# for plotting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentiles
nobs = len(empirical_quantiles)
pos = np.arange(1., nobs+1) / (nobs + 1)
theoretical_quantiles = dist.ppf(pos)
df = pd.DataFrame(data={'res': empirical_quantiles.values, dist.name: theoretical_quantiles, 'label': empirical_quantiles.index})
qqplot = alt.Chart(df).mark_point().encode(
    y='res',
    x=dist.name,
    tooltip='label'
).interactive()

# the line is fit robustly based on the 1st and 3rd quartiles
q1, q3 = stats.mstats.mquantiles(empirical_quantiles, [0.25, 0.75])
tq1, tq3 = dist.ppf([0.25, 0.75])
linex = theoretical_quantiles
liney = (q3-q1) / (tq3-tq1) * (linex-tq1) + q1
qqline = alt.Chart(pd.DataFrame({'res': liney, dist.name: linex})).mark_line().encode(
    y='res',
    x=dist.name,
)

# combine the two plots
qqplot + qqline

# fit a fixed model, excluding the anomaly
data = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
mask = ~data.index.isin(["minister"])
model = ols("prestige ~ income + education:C(type)", data=data[mask]).fit()
model.summary()
```

- 1. Duncan OD. A Socioeconomic Index for All Occupations. *Occupations and Social Status*. 1961;(7).

More precisely, by looking into the classical heart disease dataset ^{3} we will find that **the cholesterol level is log-normally distributed**, as suggested in the medical literature^{4}, and using this we **will prove that cholesterol level is associated with heart disease**.

Here is what the data looks like:

We see that under the log-transform statistical tests no longer reject normality, and moreover that the difference between groups becomes significant!

| Transform | KS Test p-val (Healthy) | KS Test p-val (Sick) | T Test p-val |
|---|---|---|---|
| none | 0.003235 | 0.990000 | 0.05 |
| log | 0.246064 | 0.453206 | 0.03 |

For completeness, here is the code snippet to reproduce:

```
#%pip install openml
import matplotlib.pyplot as plt
import seaborn as sns
import openml
import numpy as np
from statsmodels.stats import diagnostic, weightstats
import pandas as pd
from IPython.display import display
## get the data
dataset = openml.datasets.get_dataset("Heart-Disease-Prediction")
X, y, _, _ = dataset.get_data(dataset_format="dataframe")
X["Heart_Disease"] = X["Heart_Disease"].apply({"Absence": False, "Presence": True}.get)
sns.histplot(data=X, x="Cholesterol", hue="Heart_Disease")
x0 = X['Cholesterol'][X["Heart_Disease"] == False]
x1 = X['Cholesterol'][X["Heart_Disease"] == True]
## investigate normality, with and without the log-transform
outs = []
for fn_name, fn in zip(["none", "log"], [lambda x: x, lambda x: np.log(x)]):
    p_norm0 = diagnostic.kstest_normal(fn(x0), dist='norm', pvalmethod='table')[1]
    p_norm1 = diagnostic.kstest_normal(fn(x1), dist='norm', pvalmethod='table')[1]
    p_ttest = weightstats.ttest_ind(fn(x0), fn(x1), usevar='unequal')[1].round(2)
    outs.append((fn_name, p_norm0, p_norm1, p_ttest))
outs = pd.DataFrame(data=outs, columns=[("Transform",), ('KS Test p-val', "Healthy"), ('KS Test p-val', "Sick"), ('T Test p-val',)])
outs.columns = pd.MultiIndex.from_tuples(outs.columns)
display(outs)
```

- 1. Daimon T. Box–Cox Transformation. *International Encyclopedia of Statistical Science*. 2011:176-178. doi:10.1007/978-3-642-04898-2_152
- 2. Box GEP, Cox DR. An Analysis of Transformations. *Journal of the Royal Statistical Society: Series B (Methodological)*. 1964:211-243. doi:10.1111/j.2517-6161.1964.tb00553.x
- 3. Janosi A, Steinbrunn W, et al. Heart Disease. 1989. doi:10.24432/C52P4X
- 4. Tharu B, Tsokos C. A Statistical Study of Serum Cholesterol Level by Gender and Race. *J Res Health Sci*. 2017;17(3):e00386. https://www.ncbi.nlm.nih.gov/pubmed/28878106

Below are the code details:

```
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
likelihood_cols = ["Rare","Unlikely","Moderate","Likely","Almost certain"]
impact_cols = ["Insignificant","Minor","Significant","Major","Severe"]
score_vals = np.arange(len(impact_cols)).reshape(-1,1) + np.arange(len(likelihood_cols)).reshape(1,-1)
cmap = mpl.colormaps["RdYlGn"].reversed().resampled(9)
norm = mpl.colors.BoundaryNorm(range(10), 10)
fig, ax = plt.subplots(figsize=(8,8))
ax.imshow(score_vals, cmap=cmap, norm=norm)
annotations = {
    (2,3): ["competition"],
    (3,2): ["too ambitious"],
    (3,3): ["hardware expertise"],
    (2,2): ["time crunch"],
    (2,1): ["stretched resources"],
}
annots_style = {'fontweight':'bold'}
for (i,j),text in annotations.items():
    text = '\n'.join(text)
    _ = ax.text(i, j, text, ha="center", va="center", color="b", **annots_style)
ax.set_xlabel('Impact', fontweight='bold')
ax.set_xticks(range(len(impact_cols)), labels=impact_cols)
ax.set_ylabel('Likelihood', fontweight='bold')
ax.set_yticks(range(len(likelihood_cols)), labels=likelihood_cols)
ax.invert_yaxis()
plt.tight_layout()
plt.show()
fig.savefig("risk_heatmap.svg")
```


The inverse Gaussian distribution \(\mathsf{IG}(\mu,\lambda)\) has the probability density function $$f(x;\mu,\lambda)=\sqrt{\frac{\lambda}{2 \pi x^{3}}} e^{- \frac{\lambda \left(- \mu + x\right)^{2}}{2 \mu^{2} x}}.$$

This distribution arises as passage times of random walks ^{1}. For a survey of its basic properties, see ^{2}. In this note we explore in detail its *approximate normality*, which holds when \( \lambda/\mu \to \infty\).

Consider the random variable \(X\sim \mathsf{IG}(\mu,\lambda)\) and its standardised version $$Y = \frac{X-\mathbf{E}[X]}{\sqrt{\mathbf{Var}[X]}}.$$

Below we establish normal approximation in various metrics.

Since the characteristic function of \(X\) is \(\phi_X(t) = e^{\frac{\lambda \left(1 - \sqrt{1 - \frac{2 i \mu^{2} t}{\lambda}}\right)}{\mu}}\), and \(\mathbf{E}[X] = \mu, \mathbf{Var}[X]=\mu^3/\lambda\), the characteristic function of the standardized distribution equals

$$ \phi_Y(t) = e^{k^{2} \cdot \left(1 - \sqrt{1 - \frac{2 i t}{k}}\right)} e^{- i k t}, $$

where \( k \triangleq \sqrt{\lambda/\mu}\). The limit under \(k\to\infty\) equals \(e^{-\frac{t^2}{2}}\), the characteristic function of the standard normal distribution, which proves **convergence in distribution**:

$$Y \overset{d}{\rightarrow} \mathsf{N}(0,1),\quad \lambda / \mu \to \infty$$
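This limit is easy to check numerically; the grids of \(t\) and \(k\) values below are arbitrary illustrative choices.

```python
# Numerical check: the standardized characteristic function approaches
# exp(-t^2/2) pointwise as k = sqrt(lambda/mu) grows.
import numpy as np

ts = np.linspace(-3, 3, 101)
errors = {}
for k in [10.0, 100.0, 1000.0]:
    phi = np.exp(k**2 * (1 - np.sqrt(1 - 2j * ts / k))) * np.exp(-1j * k * ts)
    errors[k] = np.abs(phi - np.exp(-ts**2 / 2)).max()
print(errors)   # the maximal deviation shrinks as k grows
```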

We can also look at the pdf of the normalized distribution, with \(k\) as before,

$$ f_Y(y) = \frac{\sqrt{2} k^{\frac{3}{2}} e^{- \frac{k y^{2}}{2 \left(k + y\right)}}}{2 \sqrt{\pi} \left(k + y\right)^{\frac{3}{2}}},\quad y>-k$$

that can be calculated with SymPy

```
import sympy as sm
x, y, t, lmbd, mu, k = sm.symbols('x y t lambda mu k', positive=True)
pdf = sm.Mul( sm.sqrt(lmbd/(2*sm.pi*x**3), evaluate=False), sm.exp(-lmbd*(x-mu)**2/(2*mu**2*x)), evaluate=False)
mean = mu
var = mu**3/lmbd
norm_pdf = sm.sqrt(var) * pdf.subs({x:y*sm.sqrt(var)+mu})
norm_pdf.subs({lmbd:k**2*mu}).simplify()
```

and compare it to that of the standard normal distribution. Namely, we have

$$ f_Y(y) \rightarrow \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}, \quad k=\sqrt{\lambda / \mu} \to \infty, $$

that is the **convergence of probability densities** to the standard normal distribution. By Scheffe’s Theorem ^{3} this proves the **convergence in total variation** (stronger than convergence in distribution)

$$Y\overset{TV}{\rightarrow} \mathsf{N}(0,1), \quad \lambda/\mu \to \infty.$$

We now attempt to derive concrete numerical estimates. The ratio of the characteristic function \(\phi_k\) of \(Y\) and \(\phi_{\infty}\) of the standard normal equals

$$\phi_k(t)/\phi_{\infty}(t) = e^{ k^2 h(t /k ) }, \quad h(x) \triangleq 1-\sqrt{1-2 i x}- ix + \frac{x^2}{2}.$$

Observe that \(h(x)\) expands into a convergent power series around \(x=0\) when \(|x|<\frac{1}{2}\). Thus, under the condition that \(|t| \leq \sqrt{k}\) and \(k\geq 4\), we obtain that \(|k^2 h(t /k )|\leq O(1)\), and by the bound \(|e^{z}-1| = \Theta(|z|)\) for \(|z|\leq O(1)\), that

$$ \left| e^{ k^2 h(t /k )} -1 \right| = \Theta(1)\cdot \frac{|t|^{3}}{k}. $$

This gives the following integral estimate

$$ \int_{|t|\leq \sqrt{k}} \left| \frac{\phi_k(t)-\phi_{\infty}(t)}{t} \right| \mbox{d} t = \Theta(1/k). $$

We now handle the regime \(\sqrt{k} < |t| \leq k/2\). Observe that \(\Re h(x) =\Theta(1)\, x^4\) by the power series expansion, hence \(k^2 \Re h(t/k) = \Theta(t^4/k^2) \leq t^2/4\) in this regime, and therefore \(|\phi_k(t)| \leq e^{-t^2/4}\). It follows that

$$\int_{\sqrt{k} < |t| \leq k/2} \left| \frac{\phi_k(t)-\phi_{\infty}(t)}{t} \right| \mbox{d} t \leq O(1) \int_{|t| > \sqrt{k}} \frac{e^{-t^{2}/4}}{|t|} \mbox{d}t = o(1/k). $$

The integral Berry–Esseen inequality ^{4} now implies the **convergence rate in the Kolmogorov–Smirnov distance**

$$d_{KS}(Y,\mathsf{N}(0,1))=O(1/k) = O(\sqrt{\mu/\lambda}).$$

Observe that the ratio of the densities equals

$$\frac{f_Y(y)}{f_{\infty}(y)} = \frac{k^{\frac{3}{2}} e^{\frac{y^{3}}{2 \left(k + y\right)}}}{\left(k + y\right)^{\frac{3}{2}}},\quad y>-k.$$

By power series expansions we have, uniformly in \(k\)

$$\frac{f_Y(y)}{f_{\infty}(y)} = 1+\Theta\left(\frac{y^{3}-3y}{k}\right),\quad |y|\leq k^{1/3},$$

and because the expression increases in \(y\) and decreases in \(k\) when \(|y| > k^{1/3}, k>8\) (we have \(\frac{\partial}{\partial y} \frac{f_Y(y)}{f_{\infty}(y)} = \frac{k^{\frac{3}{2}} \left(- 3 k + y^{2} \cdot \left(3 k + 2 y\right) - 3 y\right) e^{\frac{y^{3}}{2 \left(k + y\right)}}}{2 \left(k + y\right)^{\frac{7}{2}}}\) and \(\frac{\partial}{\partial k} \frac{f_Y(y)}{f_{\infty}(y)} = - \frac{\sqrt{k} y \left(k y^{2} - 3 k - 3 y\right) e^{\frac{y^{3}}{2 k + 2 y}}}{2 \left(k + y\right)^{\frac{7}{2}}}\)) also

$$ \frac{f_Y(y)}{f_{\infty}(y)} > 1,\quad y \geq k^{1/3},k\geq 8$$ and

$$ \frac{f_Y(y)}{f_{\infty}(y)} < 1,\quad y \leq -k^{1/3},k\geq 8.$$ Therefore, we obtain

$$ \int \left| f_Y(y)-f_{\infty}(y) \right| \mbox{d}y \leq O(1/k)\int_{|y|<k^{1/3}} (|y|^{3}+|y|) f_{\infty}(y)\,\mbox{d} y + \int_{y<-k^{1/3}} 2 f_{\infty}(y)\,\mbox{d}y = O(1/k). $$

This proves that the convergence rate in total variation is \(O(1/k)\):

$$d_{TV}(Y;\mathsf{N}(0,1)) \leq O(1/k) = O(\sqrt{\mu/\lambda})$$ (in fact, here we obtain the matching lower bound too).

Numerical results match the theoretical analysis, as shown below:

For completeness, here is the script:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss, norm

fig, axs = plt.subplots(1, 2, figsize=(12,4))

# Part 1: evaluate total variation
mus = np.linspace(0.5, 0.00001, 100)
out = []
for mu in mus:
    # construct Inverse Gaussian
    Z = invgauss(mu=mu)
    mean = Z.mean()
    std = Z.std()
    k = (mean/std)**2
    # standardise and compare with normal
    Z_std = invgauss(mu=mu, loc=-mean/std, scale=1/std)
    Norm = norm(0, 1)
    xs = np.linspace(-10, 10, 1000)
    metric = np.abs(Z_std.pdf(xs) - Norm.pdf(xs)).mean() * 20 / 2  # TV = half of the L1 distance
    out.append((k, metric))
ax = axs[1]
ax.plot(*zip(*out))
ax.set_xscale('log')
ax.set_xlabel(r'$\lambda / \mu$')
ax.set_yscale('log')
ax.set_ylabel('total-variation to normal')

# Part 2: plot pdf
ax = axs[0]
for mu in [0.2, 0.1, 0.05, 0.01, 0.0001]:
    # construct Inverse Gaussian
    Z = invgauss(mu=mu)
    mean = Z.mean()
    std = Z.std()
    k = (mean/std)**2
    # standardise and plot
    Z_std = invgauss(mu=mu, loc=-mean/std, scale=1/std)
    xs = np.linspace(-5, 5, 10000)
    ax.plot(xs, Z_std.pdf(xs), label=rf'$\lambda/\mu$={k:.2f}')
ax.legend(title="standardised distribution")
ax.set_xlabel('$x$')
ax.set_ylabel('pdf')
plt.tight_layout()
plt.show()
```

- 1. Hartzell FZ. Sequential Analysis. *Agronomy Journal*. 1947:640-641. doi:10.2134/agronj1947.00021962003900070011x
- 2. Folks JL, Chhikara RS. The Inverse Gaussian Distribution and its Statistical Application—A Review. *Journal of the Royal Statistical Society: Series B (Methodological)*. 1978:263-275. doi:10.1111/j.2517-6161.1978.tb01039.x
- 3. Scheffe H. A Useful Convergence Theorem for Probability Distributions. *Ann Math Statist*. 1947:434-438. doi:10.1214/aoms/1177730390
- 4. Bobkov SG. Proximity of probability distributions in terms of Fourier-Stieltjes transforms. *Russ Math Surv*. 2016:1021-1079. doi:10.1070/rm9749

The task was to containerize a card game web application, written by students as a group project. The application was created with React.js and used to be run in development mode (which supports tracing, debugging and so on) via `npm start`. The page can then be optimized for deployment via `npm run build`. The job of a Docker container is to serve this optimized version with a simple HTTP server.

Of course, the easiest way would be to do both the page build and the deployment within the Create React App framework. But the optimal way is to **stay minimalistic** and leverage multi-stage Docker builds with task-tailored sub-images: **one to build and another to serve**. The image size indeed **gets reduced from about 500 MB to about 20 MB!** To build and to serve we used, respectively, a Node.js image and a Static Web Server image. The build product of the first stage gets copied to the second image, which provides a lightweight server, leaving unessential stuff (cached files, development tools) behind. The implementation is shown below:

```
# docker build -t brydz:latest .
# docker run -it -p 3000:3000 brydz:latest sh
# stage 1: install dependencies and build the app
FROM node:18-alpine AS pre_build
WORKDIR /brydz
COPY client/package.json ./client/package.json
COPY client/package-lock.json ./client/package-lock.json
RUN npm install --prefix client
COPY client/ ./client/
ENV PUBLIC_URL="."
RUN npm run build --prefix client
# stage 2: move to a clean prod environment
FROM joseluisq/static-web-server:2-alpine AS final
WORKDIR /brydz
COPY --from=pre_build /brydz/client/build ./build
EXPOSE 3000
CMD ["static-web-server", "--port=3000", "--root=build", "--log-level=trace"]
```

The following code snippet demonstrates an efficient implementation for two-player zero-sum finite games in PyTorch, and tests it on the game **Rock-Paper-Scissors with Crash**. In this variant, when scissors crashes against rock, the loss is doubled. This shifts the equilibrium in favour of "paper", to the exact proportions of 25%/50%/25%. Remarkably, regrets can be used to estimate the approximation accuracy, which approaches the perfect solution under mild conditions ^{3}. This is also demonstrated in the snippet.

```
import torch

## Game: Rock-Paper-Scissors with Crash, the equilibrium achieved with 25%/50%/25%
N_ACTIONS = 3  # 0: rock, 1: paper, 2: scissors
# gain of the row player / loss of the column player
PAYOFF = torch.tensor([[0,-1,2], [1,0,-1], [-2,1,0]], dtype=torch.float).cuda()

## Utils
def getStrategy(cfr):
    """Return a strategy corresponding to observed regrets

    Args:
        cfr (array): counter-factual regret, shape (N_ACTIONS,)

    Returns:
        weights (array): strategy, shape (N_ACTIONS,)
    """
    weights = torch.clip(cfr, 0, torch.inf)
    weights = weights / torch.sum(weights)
    N_ACTIONS = weights.shape[0]
    weights = torch.nan_to_num(weights, nan=1/N_ACTIONS)
    return weights

#@torch.jit.script
@torch.compile(mode='max-autotune')
def getEquilibrium(PAYOFF, N_ITER: int = 500, stochastic_advantage: bool = False):
    # auxiliary variables
    N_ACTIONS = PAYOFF.shape[0]
    cumCFR1 = torch.zeros(N_ACTIONS,).cuda()
    cumCFR2 = torch.zeros(N_ACTIONS,).cuda()
    cumStrategy1 = torch.zeros(N_ACTIONS,).cuda()
    cumStrategy2 = torch.zeros(N_ACTIONS,).cuda()
    strategy1 = torch.ones(N_ACTIONS,).cuda() / N_ACTIONS
    strategy2 = torch.ones(N_ACTIONS,).cuda() / N_ACTIONS
    # training loop
    for _ in range(N_ITER):
        # sample actions and observe regrets
        if stochastic_advantage:
            # a) stochastic variant, often implemented in tutorials
            action1 = torch.multinomial(strategy1, num_samples=1).squeeze()
            action2 = torch.multinomial(strategy2, num_samples=1).squeeze()
            cfr1 = PAYOFF[:, action2] - PAYOFF[action1, action2]
            cfr2 = -(PAYOFF[action1, :] - PAYOFF[action1, action2])
        else:
            # b) averaged variant
            PAYOFF_avg = strategy1.view(1,-1).mm(PAYOFF).mm(strategy2.view(-1,1))
            cfr1 = (PAYOFF.mm(strategy2.view(-1,1)) - PAYOFF_avg).squeeze()
            cfr2 = (strategy1.view(1,-1).mm(PAYOFF) - PAYOFF_avg).squeeze() * (-1)
        # update strategies proportionally to regrets
        strategy1 = getStrategy(cumCFR1)
        strategy2 = getStrategy(cumCFR2)
        # track cumulated regrets and strategies
        cumCFR1 += cfr1
        cumCFR2 += cfr2
        cumStrategy1 += strategy1
        cumStrategy2 += strategy2
    # averaged strategies converge to Nash Equilibrium
    avgStrategy1 = cumStrategy1 / cumStrategy1.sum()
    avgStrategy2 = cumStrategy2 / cumStrategy2.sum()
    # estimate approximation error (upper bound)
    eps = 2 * torch.max(cumCFR1.max(), cumCFR2.max()) / N_ITER
    torch.cuda.synchronize()
    return (avgStrategy1, avgStrategy2, eps)

getEquilibrium(PAYOFF)  # eps < 0.03
```

Regrets are expensive to compute in large games, but they can be approximated using modern machine-learning techniques. This approach has recently found many applications, including solvers for Poker and even larger card games ^{4,5}.

- 1. Hart S, Mas-Colell A. A Simple Adaptive Procedure Leading to Correlated Equilibrium. *Econometrica*. 2000:1127-1150. doi:10.1111/1468-0262.00153
- 2. Neller TW, Lanctot M. An Introduction to Counterfactual Regret Minimization. Teaching materials. 2013. Accessed 2023. https://www.ma.imperial.ac.uk/~dturaev/neller-lanctot.pdf
- 3. Waugh K. Abstraction in Large Extensive Games. *University of Alberta Libraries*. 2009. doi:10.7939/R3CM74
- 4. Brown N, Lerer A, Gross S, Sandholm T. Deep Counterfactual Regret Minimization. 2019.
- 5. Adams D. The Feasibility of Deep Counterfactual Regret Minimisation for Trading Card Games. *AI 2022: Advances in Artificial Intelligence*. 2022:145-160. doi:10.1007/978-3-031-22695-3_11