On Subgaussian Concentration of Missing Mass

The problem of missing mass goes back to the cryptographic work of Good and Turing during WWII, but has been also studied in the context of linguistic, ecology, and by probability theoreticians. The missing mass is defined as the weight of elements not observed in a sample, due to pure chance:

True and empirical frequency based on 100 samples, for Poiss(10).
The value of 12 is missing, and the corresponding bin is empty.

Such an event and the total mass of all unseen elements are exponentially small in the sample size.

Quite curiously, proving it rigorously has been quite hard. The first proof appeared at COLT’00, but has since then been reworked many times, in attempts to simplify and numerically improve. The arguments were based on a thermodynamic framework, logarithmic Sobolev inequalities and information theory. Formally, for some constant \(K>0\) we want

$$\Pr[\pm(M-\boldsymbol{E}M)>\epsilon]\leq \mathrm{e}^{-K\cdot n \epsilon^2}$$

where \(n\) is the sample size and \(M\) is the missing mass for the i.i.d. sample of size \(n\).

The problem of proving it “standard” concentration inequalities has been open so far. In reponse to this challenge, in my recent note I have proved it with Bernstein’s inequality, century old. So, no complicated approaches and refinements were necessary!

Published by mskorski

Scientist, Consultant, Learning Enthusiast

Leave a comment

Your email address will not be published. Required fields are marked *