Normality of data matters in statistics, since many standard tests assume it, so it is no surprise that transforming data to look closer to a normal distribution can be beneficial in many ways. There are systematic ways of finding the right transform, such as the popular Box-Cox method [1, 2]. In this post, I will share an elegant example of how a logarithmic transform can improve normality and, as a consequence, turn a borderline statistical test into a significant one, avoiding a missed finding (a false negative)!
More precisely, by looking into the classical heart disease dataset [3], we will find that the cholesterol level is approximately log-normally distributed, as suggested in the medical literature [4], and use this to show that cholesterol level is significantly associated with heart disease.
Here is what the data looks like:
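We see that after the log transform the KS test no longer rejects normality in either group (for the healthy group the p-value rises from about 0.003 to about 0.25), and moreover that the difference between the groups becomes more clearly significant (the t-test p-value drops from 0.05 to 0.03)!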
Transform | KS Test p-val (Healthy) | KS Test p-val (Sick) | T Test p-val
---|---|---|---
none | 0.003235 | 0.990000 | 0.05
log | 0.246064 | 0.453206 | 0.03
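To build some intuition for why the logarithm helps here, below is a minimal sketch on synthetic data, not on the heart disease dataset. It draws a log-normal sample with arbitrarily chosen parameters and compares the sample skewness before and after taking logs. The `skewness` helper and the parameter values are assumptions made purely for illustration.

```python
import math
import random


def skewness(values):
    """Sample skewness: the third standardized moment (near 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Synthetic data only: mu=5.5 and sigma=0.25 are arbitrary illustrative choices,
# not values fitted to the heart disease dataset.
rng = random.Random(0)
raw = [rng.lognormvariate(5.5, 0.25) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

The raw sample has a long right tail (positive skewness), while its logarithm is roughly symmetric, which is what a normality test is sensitive to.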
For completeness, here is the code snippet to reproduce the results:
```python
# %pip install openml
import matplotlib.pyplot as plt
import seaborn as sns
import openml
import numpy as np
from statsmodels.stats import diagnostic, weightstats
import pandas as pd
from IPython.display import display

## get the data
dataset = openml.datasets.get_dataset("Heart-Disease-Prediction")
X, y, _, _ = dataset.get_data(dataset_format="dataframe")
# encode the target as a boolean: True means heart disease is present
X["Heart_Disease"] = X["Heart_Disease"].apply({"Absence": False, "Presence": True}.get)
sns.histplot(data=X, x="Cholesterol", hue="Heart_Disease")
# split cholesterol levels into the healthy (x0) and sick (x1) groups
x0 = X["Cholesterol"][X["Heart_Disease"] == False]
x1 = X["Cholesterol"][X["Heart_Disease"] == True]

## investigate normality
outs = []
for fn_name, fn in zip(["none", "log"], [lambda x: x, lambda x: np.log(x)]):
    # KS normality test within each group
    p_norm0 = diagnostic.kstest_normal(fn(x0), dist="norm", pvalmethod="table")[1]
    p_norm1 = diagnostic.kstest_normal(fn(x1), dist="norm", pvalmethod="table")[1]
    # two-sample t-test with unequal variances between the groups
    p_ttest = weightstats.ttest_ind(fn(x0), fn(x1), usevar="unequal")[1].round(2)
    outs.append((fn_name, p_norm0, p_norm1, p_ttest))

# two-level column header: test name on top, group name below
columns = pd.MultiIndex.from_tuples([
    ("Transform", ""),
    ("KS Test p-val", "Healthy"),
    ("KS Test p-val", "Sick"),
    ("T Test p-val", ""),
])
outs = pd.DataFrame(data=outs, columns=columns)
display(outs)
```
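The Box-Cox method mentioned in the introduction offers a way to check whether a logarithm is a sensible choice: it estimates a power parameter, usually called lambda, and an estimate close to 0 corresponds to the log transform. The following is a minimal hand-rolled sketch of that idea, assuming a plain grid search over the profile log-likelihood and a synthetic log-normal sample with arbitrary parameters. The helper names `boxcox_transform` and `boxcox_profile_loglik` are hypothetical; in practice a library routine such as `scipy.stats.boxcox` would likely be the more convenient option.

```python
import math
import random


def boxcox_transform(values, lam):
    """Box-Cox transform: log(x) for lam = 0, (x**lam - 1) / lam otherwise."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def boxcox_profile_loglik(values, lam):
    """Profile log-likelihood of lam, assuming the transformed data are normal."""
    n = len(values)
    y = boxcox_transform(values, lam)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    # the second term is the log-Jacobian of the transform
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


# Synthetic log-normal sample; mu=5.5 and sigma=0.5 are arbitrary illustrative
# choices, not values fitted to the heart disease dataset.
rng = random.Random(0)
sample = [rng.lognormvariate(5.5, 0.5) for _ in range(2000)]

# Grid search over lam in [-2, 2] with step 0.05.
grid = [round(-2.0 + 0.05 * i, 2) for i in range(81)]
best_lambda = max(grid, key=lambda lam: boxcox_profile_loglik(sample, lam))
print(f"estimated Box-Cox lambda: {best_lambda:.2f}")
```

For log-normal data the estimate should land near 0, while an estimate near 1 would suggest leaving the data untransformed.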
References
- 1. Daimon T. Box–Cox Transformation. International Encyclopedia of Statistical Science. Published online 2011:176-178. doi:10.1007/978-3-642-04898-2_152
- 2. Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society: Series B (Methodological). Published online July 1964:211-243. doi:10.1111/j.2517-6161.1964.tb00553.x
- 3. Andras Janosi WS. Heart Disease. Published online 1989. doi:10.24432/C52P4X
- 4. Tharu B, Tsokos C. A Statistical Study of Serum Cholesterol Level by Gender and Race. J Res Health Sci. 2017;17(3):e00386. https://www.ncbi.nlm.nih.gov/pubmed/28878106