Normality of data is important in statistics, so it is no surprise that transforming data to look closer to a normal distribution can be beneficial in many ways. There are ways of finding the right transform, such as the popular Box-Cox method ^{1,2}. In this post, I will share an elegant example of how **logarithmic transforms can improve normality** and consequently **make statistical testing significant**, effectively avoiding a missed outcome (a false negative)!
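As a quick illustration of the Box-Cox idea, here is a minimal sketch on synthetic log-normally distributed data (the parameters are illustrative, not taken from the real dataset): the method estimates a power parameter λ, and a value near zero indicates that a plain log transform is appropriate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic positive, right-skewed (log-normal) sample
x = rng.lognormal(mean=5.5, sigma=0.5, size=300)

# Box-Cox estimates the power parameter lambda by maximum likelihood;
# lambda = 0 corresponds exactly to the log transform
transformed, lam = stats.boxcox(x)
print(f"estimated lambda: {lam:.3f}")
```

For truly log-normal data like this, the estimated λ lands close to zero, which is why the log transform used below is the natural choice.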

More precisely, by examining the classical heart disease dataset ^{3}, we will find that **cholesterol level is log-normally distributed**, as suggested in the medical literature ^{4}, and use this to **show that cholesterol level is significantly associated with heart disease**.

Here is what the data looks like (a histogram of cholesterol levels, split by heart disease status):

We see that under the log transform the statistical tests no longer reject normality, and moreover the difference between groups becomes significant:

| Transform | KS Test p-val (Healthy) | KS Test p-val (Sick) | T Test p-val |
|---|---|---|---|
| none | 0.003235 | 0.990000 | 0.05 |
| log | 0.246064 | 0.453206 | 0.03 |

For completeness, here is the code snippet to reproduce:

```python
# %pip install openml
import matplotlib.pyplot as plt
import seaborn as sns
import openml
import numpy as np
from statsmodels.stats import diagnostic, weightstats
import pandas as pd
from IPython.display import display

## get the data
dataset = openml.datasets.get_dataset("Heart-Disease-Prediction")
X, y, _, _ = dataset.get_data(dataset_format="dataframe")
X["Heart_Disease"] = X["Heart_Disease"].apply({"Absence": False, "Presence": True}.get)
sns.histplot(data=X, x="Cholesterol", hue="Heart_Disease")
x0 = X["Cholesterol"][X["Heart_Disease"] == False]
x1 = X["Cholesterol"][X["Heart_Disease"] == True]

## investigate normality under each transform
outs = []
for fn_name, fn in zip(["none", "log"], [lambda x: x, np.log]):
    # Lilliefors (Kolmogorov-Smirnov) normality test for each group
    p_norm0 = diagnostic.kstest_normal(fn(x0), dist="norm", pvalmethod="table")[1]
    p_norm1 = diagnostic.kstest_normal(fn(x1), dist="norm", pvalmethod="table")[1]
    # Welch's t-test (unequal variances) between the groups
    p_ttest = weightstats.ttest_ind(fn(x0), fn(x1), usevar="unequal")[1].round(2)
    outs.append((fn_name, p_norm0, p_norm1, p_ttest))

outs = pd.DataFrame(
    data=outs,
    columns=[("Transform", ""), ("KS Test p-val", "Healthy"),
             ("KS Test p-val", "Sick"), ("T Test p-val", "")],
)
outs.columns = pd.MultiIndex.from_tuples(outs.columns)
display(outs)
```
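Since the snippet above requires downloading the dataset from OpenML, here is a self-contained check of the same logic on synthetic log-normal data (the distribution parameters are illustrative, not fitted to the real cholesterol values): the normality test rejects the raw sample but not its logarithm.

```python
import numpy as np
from statsmodels.stats import diagnostic

rng = np.random.default_rng(1)
# synthetic right-skewed sample, standing in for cholesterol levels
chol = rng.lognormal(mean=5.5, sigma=0.4, size=200)

# Lilliefors normality test before and after the log transform
p_raw = diagnostic.kstest_normal(chol, dist="norm", pvalmethod="table")[1]
p_log = diagnostic.kstest_normal(np.log(chol), dist="norm", pvalmethod="table")[1]
print(f"raw: p={p_raw:.4f}, log: p={p_log:.4f}")
```

The raw sample's p-value falls below 0.05 (normality rejected) while the log-transformed sample's does not, mirroring the "Healthy" column of the table above.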

### References

1. Daimon T. Box–Cox Transformation. *International Encyclopedia of Statistical Science*. Published online 2011:176-178. doi:10.1007/978-3-642-04898-2_152
2. Box GEP, Cox DR. An Analysis of Transformations. *Journal of the Royal Statistical Society: Series B (Methodological)*. Published online July 1964:211-243. doi:10.1111/j.2517-6161.1964.tb00553.x
3. Janosi A, Steinbrunn W, Pfisterer M, Detrano R. Heart Disease. UCI Machine Learning Repository. Published online 1989. doi:10.24432/C52P4X
4. Tharu B, Tsokos C. A Statistical Study of Serum Cholesterol Level by Gender and Race. *J Res Health Sci*. 2017;17(3):e00386. https://www.ncbi.nlm.nih.gov/pubmed/28878106