Addition of statistical testing if the data follows normal distribution or not

Statistical testing can be used to check whether a given dataset follows a normal distribution. The normal distribution is a common pattern observed in many natural phenomena, where data is distributed symmetrically around the mean and about 68% of it falls within one standard deviation of the mean.

There are several reasons why it is important to check whether data follows a normal distribution or not. For instance:

Some statistical methods, such as t-tests, ANOVA, and linear regression, assume normality (for ANOVA and regression, strictly of the residuals rather than of the raw data). Using these methods when that assumption is badly violated can lead to incorrect results.

Normality testing can help identify outliers and extreme values in the data, which may need to be removed or handled in a different way.

Normality testing can also help identify if the data is skewed, which can affect the interpretation of statistical results.

Normality is an assumption of several machine learning algorithms, such as linear discriminant analysis and Gaussian Naive Bayes, so testing for it helps decide whether those algorithms are appropriate.

There are several statistical tests that can be used to check whether the data follows a normal distribution, such as the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov test. These tests assess whether the data is consistent with a normal distribution. A rejection indicates a departure from normality, for example a skewed or bimodal distribution, although the test alone does not say which.
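As a rough illustration of what such a check could look like, here is a minimal sketch using SciPy's `shapiro` and `kstest`. The function name `normality_report`, the `alpha` threshold, and the dictionary layout are assumptions made for this example and are not part of ydata-profiling.

```python
import numpy as np
from scipy import stats


def normality_report(values, alpha=0.05):
    """Hypothetical helper: run two normality tests on a 1-D numeric sample."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values before testing

    # Shapiro-Wilk: W close to 1 is consistent with normality.
    shapiro_stat, shapiro_p = stats.shapiro(x)

    # Kolmogorov-Smirnov against a normal whose mean and standard deviation
    # are estimated from the same sample. Estimating the parameters makes the
    # p-value too lenient, so treat it as a rough indication only.
    ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

    return {
        "shapiro": {
            "statistic": float(shapiro_stat),
            "p_value": float(shapiro_p),
            "reject_normality": bool(shapiro_p < alpha),
        },
        "kolmogorov_smirnov": {
            "statistic": float(ks_stat),
            "p_value": float(ks_p),
            "reject_normality": bool(ks_p < alpha),
        },
    }


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=500)
skewed_sample = rng.exponential(scale=2.0, size=500)

report_normal = normality_report(normal_sample)
report_skewed = normality_report(skewed_sample)
print(report_normal)
print(report_skewed)
```

With large samples these tests tend to reject normality for even small deviations, so the statistic is usually best read together with a plot of the distribution.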

Proposed feature

When analyzing a continuous variable in a dataset, it is often useful to perform a normality test to determine whether the distribution of the data follows a normal distribution or not. This can be important because many statistical tests and models assume a normal distribution of the data, and violating this assumption can lead to incorrect results or interpretations.

A normality test can produce several outputs that can help in understanding the distribution of the data. Some of the common outputs of a normality test include:

Stat Value: The statistic calculated by the normality test, which is used to determine whether the data follows a normal distribution or not. Common normality tests include the Shapiro-Wilk test, Anderson-Darling test, and Kolmogorov-Smirnov test, among others.
Graph of distribution: A plot of the data distribution, often as a histogram or density plot, can be helpful for visualizing the shape of the distribution and identifying any deviations from normality.
Transformation results: If the data is not normally distributed, basic transformations such as logarithmic or square root transformations can be applied to try to bring it closer to normal. The normality test can then be run again on the transformed data to see whether the transformation helped.
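The transformation output described above could be sketched as follows. This is only an illustration under stated assumptions: `shapiro_after_transforms` is a hypothetical name, the transformations are applied to a copy so the original input is left unchanged, and the log and square root are only attempted where they are defined.

```python
import numpy as np
from scipy import stats


def shapiro_after_transforms(values):
    """Hypothetical helper: Shapiro-Wilk on the raw data and on simple transforms."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # work on a filtered copy; the input is not modified

    candidates = {"original": x}
    if np.all(x > 0):
        candidates["log"] = np.log(x)  # log needs strictly positive values
    if np.all(x >= 0):
        candidates["sqrt"] = np.sqrt(x)  # sqrt needs non-negative values

    results = {}
    for name, transformed in candidates.items():
        statistic, p_value = stats.shapiro(transformed)
        results[name] = {"statistic": float(statistic), "p_value": float(p_value)}
    return results


# Synthetic right-skewed data: its logarithm is normal by construction.
rng = np.random.default_rng(42)
lognormal_sample = rng.lognormal(mean=0.0, sigma=1.0, size=300)

transform_results = shapiro_after_transforms(lognormal_sample)
for name, result in transform_results.items():
    print(name, result)
```

Reporting the test result for each candidate transformation, rather than returning transformed data, keeps the output descriptive.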

In summary, performing a normality test on a continuous variable can provide valuable insights into the distribution of the data and help guide any subsequent analysis or modeling steps. The outputs of the normality test, such as the statistic, distribution plot, and transformation results, can all be used to better understand the properties of the data and make more informed decisions.

Alternatives considered

No response

Additional context

In addition to normality testing, there may be other useful properties to consider when analyzing a continuous feature in a dataset. Some examples of additional context that could be relevant include:

Skewness and kurtosis: These are measures of the shape of the data distribution that can provide additional information beyond normality testing. Skewness measures the degree of asymmetry in the distribution, while kurtosis measures how heavy its tails are relative to a normal distribution (it is often loosely described as "peakedness").
Outliers: Outliers are data points that are significantly different from the rest of the data and can have a strong influence on statistical analysis. Identifying and handling outliers appropriately can improve the accuracy and robustness of statistical models.
Scaling: In some cases, it may be necessary to scale the feature to a specific range or standardize it to have a mean of 0 and a standard deviation of 1. This can be important for some machine learning algorithms that are sensitive to differences in scale between features.
Correlations (Already Available): Examining correlations between the continuous feature and other variables in the dataset can provide insights into potential relationships and dependencies between variables. This could be updated to take normality into account: use Pearson's correlation if the data is normal, and Spearman's correlation if it is not.
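Two of the items above can be sketched briefly: the shape statistics, and the choice between Pearson's and Spearman's correlation. The helper names `shape_summary` and `choose_correlation`, and the use of Shapiro-Wilk with `alpha=0.05` as the normality criterion, are assumptions made for this example and do not describe how ydata-profiling computes correlations.

```python
import numpy as np
from scipy import stats


def shape_summary(values):
    """Hypothetical helper: sample skewness and excess kurtosis."""
    x = np.asarray(values, dtype=float)
    return {
        "skewness": float(stats.skew(x)),
        # SciPy's default is excess kurtosis: 0 for a normal distribution.
        "excess_kurtosis": float(stats.kurtosis(x)),
    }


def choose_correlation(x, y, alpha=0.05):
    """Hypothetical helper: Pearson if both samples look normal, else Spearman."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    _, p_x = stats.shapiro(x)
    _, p_y = stats.shapiro(y)

    if p_x >= alpha and p_y >= alpha:
        method = "pearson"
        coefficient, p_value = stats.pearsonr(x, y)
    else:
        method = "spearman"
        coefficient, p_value = stats.spearmanr(x, y)
    return method, float(coefficient), float(p_value)


# Synthetic data: x is right-skewed and y is a monotonic function of x.
rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=200)
y = np.exp(x)

shape = shape_summary(x)
method, coefficient, p_value = choose_correlation(x, y)
print(shape)
print(method, coefficient, p_value)
```

Because `y` is a strictly increasing function of `x`, the rank-based Spearman coefficient captures the relationship fully, while Pearson's coefficient would understate it.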

Overall, understanding the additional context for a continuous feature beyond normality testing can help guide data exploration and modelling decisions and lead to more accurate and meaningful results.

fabclmnt commented 1 year ago

Hi @danishbansal808 ,

thank you for the detailed request :) This is great.

We will integrate some of the suggestions. Nevertheless, as far as the transformations go, we won't be including them in the development. The objective of the tool is to profile the data without changing the original input, and applying a normal transformation would change the design and purpose of the project.

prayas7102 commented 1 month ago

Hi @fabclmnt @lpeti69,

Is this issue still open? As a regular user of pandas-profiling, I find the feature suggested by @danishbansal808 highly valuable. In addition to the tests already highlighted, we could also incorporate a QQ plot and the .skew() method to assess the skewness of the distribution, providing more comprehensive insights.

Would it be possible to assign this issue to me? I'd be happy to contribute. Thanks!

ydataai / ydata-profiling

Feature Request: Addition of statistical testing if the data follows normal distribution or not #1291
