On the Importance of Log-Normality in Hypothesis Testing

Normality of data is important in statistics, so it is no surprise that transforming data to look closer to a normal distribution can be beneficial in many ways. There are systematic ways of finding the right transform, such as the popular Box-Cox method [1,2]. In this post, I will share an elegant example of how a logarithmic transform can improve normality and consequently turn a borderline statistical test into a significant one, effectively avoiding a missed finding (a false negative)!
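
To make the idea of "finding the right transform" concrete, here is a minimal sketch of the Box-Cox approach: pick the exponent that maximizes the profile log-likelihood under a normal model, here by a plain grid search. The data is a synthetic log-normal sample (not the heart disease data), and the helper names `boxcox_transform`, `boxcox_loglik` and `best_lambda` are made up for this illustration. For log-normal data the selected exponent should land near 0, which is exactly the log transform.

```python
import math
import random
import statistics


def boxcox_transform(xs, lam):
    """Box-Cox transform: log(x) for lam == 0, else (x**lam - 1) / lam."""
    if abs(lam) < 1e-12:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1.0) / lam for x in xs]


def boxcox_loglik(xs, lam):
    """Profile log-likelihood of lam, assuming the transformed data is normal."""
    ys = boxcox_transform(xs, lam)
    n = len(xs)
    # normal log-likelihood at the fitted variance, plus the Jacobian of the transform
    return -0.5 * n * math.log(statistics.pvariance(ys)) + (lam - 1.0) * sum(math.log(x) for x in xs)


def best_lambda(xs, grid):
    """Grid point with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: boxcox_loglik(xs, lam))


# synthetic, strictly positive, log-normal sample (assumed parameters)
rng = random.Random(0)
sample = [rng.lognormvariate(5.4, 0.5) for _ in range(3000)]

grid = [k / 20 for k in range(-20, 21)]  # candidate exponents from -1 to 1 in steps of 0.05
lam_hat = best_lambda(sample, grid)
print("selected Box-Cox exponent:", lam_hat)
```

In practice one would rather use a library routine such as `scipy.stats.boxcox`; the grid search is only meant to show what is being optimized.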

More precisely, by looking into the classical heart disease dataset [3], we will find that the cholesterol level is consistent with a log-normal distribution, as suggested in the medical literature [4], and using this we will show that cholesterol level is significantly associated with heart disease.

Here is what the data looks like:

We see that after the log transform the normality test no longer rejects normality in either group, and moreover that the difference between the groups becomes significant: the t-test p-value drops from 0.05 to 0.03!

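| Transform | KS Test p-val (Healthy) | KS Test p-val (Sick) | T Test p-val |
|-----------|-------------------------|----------------------|--------------|
| none      | 0.003235                | 0.990000             | 0.05         |
| log       | 0.246064                | 0.453206             | 0.03         |

Normality and mean-difference tests for both original and transformed data.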
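
To give some intuition for what the normality test measures, here is a minimal sketch of the Kolmogorov-Smirnov distance between the empirical distribution of a sample and a normal distribution fitted to it. It runs on a synthetic log-normal sample (assumed parameters, not the cholesterol data), and `ks_distance_to_normal` is a name made up for this illustration. It computes only the distance: because the mean and variance are estimated from the same sample, turning it into a p-value requires the Lilliefors correction, which is what the statsmodels routine takes care of.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_distance_to_normal(xs):
    """Largest gap between the empirical CDF of xs and a normal CDF fitted to xs."""
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(sorted(xs)):
        cdf = fitted.cdf(x)
        # the empirical CDF jumps from i/n to (i+1)/n at x, so check both sides
        gap = max(gap, cdf - i / n, (i + 1) / n - cdf)
    return gap


# synthetic log-normal sample (assumed parameters)
rng = random.Random(1)
raw = [rng.lognormvariate(5.4, 0.5) for _ in range(2000)]
logged = [math.log(x) for x in raw]

d_raw = ks_distance_to_normal(raw)
d_log = ks_distance_to_normal(logged)
print("KS distance, raw:", round(d_raw, 4))
print("KS distance, log:", round(d_log, 4))
```

A smaller distance after the transform means the sample is closer to normal, which is what a higher p-value in the table reflects.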

For completeness, here is the code snippet to reproduce the results:

#%pip install openml
import matplotlib.pyplot as plt
import seaborn as sns
import openml
import numpy as np
from statsmodels.stats import diagnostic, weightstats
import pandas as pd
from IPython.display import display

## get the data

dataset = openml.datasets.get_dataset("Heart-Disease-Prediction")
X, y, _, _ = dataset.get_data(dataset_format="dataframe")
X["Heart_Disease"] = X["Heart_Disease"].apply({"Absence":False,"Presence":True}.get)

sns.histplot(data=X,x="Cholesterol",hue="Heart_Disease" )

x0 = X['Cholesterol'][X["Heart_Disease"]==False]
x1 = X['Cholesterol'][X["Heart_Disease"]==True]

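## investigate normality and the difference in means, with and without the log transform
outs = []
for fn_name, fn in zip(["none","log"],[lambda x:x, lambda x:np.log(x)]):
  # kstest_normal is the Lilliefors variant of the KS test (mean and variance estimated from the data)
  p_norm0 = diagnostic.kstest_normal(fn(x0), dist='norm', pvalmethod='table')[1]
  p_norm1 = diagnostic.kstest_normal(fn(x1), dist='norm', pvalmethod='table')[1]
  # usevar='unequal' gives Welch's t-test, which does not assume equal group variances
  p_ttest = weightstats.ttest_ind( fn(x0), fn(x1), usevar='unequal')[1].round(2)
  outs.append( (fn_name, p_norm0, p_norm1, p_ttest ) )

# every column key is a 2-tuple so that the keys form a proper two-level header
columns = [("Transform",""),("KS Test p-val","Healthy"),("KS Test p-val","Sick"),("T Test p-val","")]
outs = pd.DataFrame(data=outs,columns=pd.MultiIndex.from_tuples(columns))
display(outs)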
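
The `usevar='unequal'` option above selects Welch's t-test. As a rough sketch of what that test computes, the following stdlib-only function returns the Welch statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the toy numbers are made up for this illustration; the p-value itself would additionally need the CDF of the t distribution, which is left to a statistics library.

```python
import math
from statistics import fmean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # squared standard errors of the two group means (sample variances, not pooled)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# toy data, chosen only so the arithmetic is easy to follow
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print("t statistic:", round(t_stat, 3))
print("degrees of freedom:", round(dof, 2))
```

To mirror the comparison in the post, one would call the function once on the raw values and once on their logarithms.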

References

  1. Daimon T. Box–Cox Transformation. International Encyclopedia of Statistical Science. Published online 2011:176-178. doi:10.1007/978-3-642-04898-2_152
  2. Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society: Series B (Methodological). Published online July 1964:211-243. doi:10.1111/j.2517-6161.1964.tb00553.x
  3. Andras Janosi WS. Heart Disease. Published online 1989. doi:10.24432/C52P4X
  4. Tharu B, Tsokos C. A Statistical Study of Serum Cholesterol Level by Gender and Race. J Res Health Sci. 2017;17(3):e00386. https://www.ncbi.nlm.nih.gov/pubmed/28878106

Published by mskorski

Scientist, Consultant, Learning Enthusiast
