ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.55k stars 1.69k forks

Feature Request: Addition of statistical testing if the data follows normal distribution or not #1291

Open danishbansal808 opened 1 year ago

danishbansal808 commented 1 year ago

Missing functionality

Addition of statistical testing if the data follows normal distribution or not

Statistical testing can be used to check whether a given dataset follows a normal distribution. The normal distribution is a common pattern observed in many natural phenomena: data is distributed symmetrically around the mean, and about 68% of it falls within one standard deviation of the mean.
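As a quick check of the "within one standard deviation" figure, here is a minimal sketch using `scipy.stats.norm`. The helper name `share_within` is made up for illustration and is not part of ydata-profiling.

```python
from scipy import stats


def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return float(stats.norm.cdf(k) - stats.norm.cdf(-k))


print(round(share_within(1), 4))  # 0.6827
print(round(share_within(2), 4))  # 0.9545
```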

There are several reasons why it is important to check whether data follows a normal distribution or not. For instance:

Some statistical methods, such as t-tests, ANOVA, and linear regression, assume that the data (or, for regression, the residuals) is normally distributed. Using these methods on clearly non-normal data can lead to incorrect results.

Normality testing can help identify outliers and extreme values in the data, which may need to be removed or handled in a different way.

Normality testing can also help identify if the data is skewed, which can affect the interpretation of statistical results.

Normality is an important assumption of some machine learning algorithms, such as linear discriminant analysis and Gaussian Naive Bayes.

There are several statistical tests that can be used to check whether the data follows a normal distribution or not, such as the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov test. These tests indicate whether the data is consistent with a normal distribution. A rejection suggests that the data follows some other distribution, such as a skewed or bimodal one, although the test alone does not say which.
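For illustration, two of the tests named above could be run with SciPy roughly as follows. This is only a sketch: `normality_report`, its `alpha` parameter, and the shape of the returned dictionary are assumptions made for this example, not an existing ydata-profiling API. The Anderson-Darling test is left out because it reports critical values rather than a single p-value.

```python
import numpy as np
from scipy import stats


def normality_report(values, alpha=0.05):
    """Run Shapiro-Wilk and Kolmogorov-Smirnov normality tests on a 1-D sample.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values before testing

    # Shapiro-Wilk: null hypothesis is that the sample is normally distributed
    sw_stat, sw_p = stats.shapiro(x)

    # Kolmogorov-Smirnov against a normal with the sample's own mean and std.
    # Estimating the parameters from the same data makes this p-value
    # conservative, so treat it as a rough indication only.
    ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

    return {
        "shapiro": {
            "statistic": float(sw_stat),
            "p_value": float(sw_p),
            "reject_normality": bool(sw_p < alpha),
        },
        "kolmogorov_smirnov": {
            "statistic": float(ks_stat),
            "p_value": float(ks_p),
            "reject_normality": bool(ks_p < alpha),
        },
    }


# Example: a clearly skewed sample should be flagged as non-normal
skewed = np.random.default_rng(0).exponential(size=500)
print(normality_report(skewed)["shapiro"]["reject_normality"])
```

A small p-value only says the sample is unlikely under normality. With large samples even minor deviations get flagged, so a profiling report would probably want to show the statistic alongside the p-value.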

Proposed feature

When analyzing a continuous variable in a dataset, it is often useful to perform a normality test to determine whether the distribution of the data follows a normal distribution or not. This can be important because many statistical tests and models assume a normal distribution of the data, and violating this assumption can lead to incorrect results or interpretations.

A normality test can produce several outputs that can help in understanding the distribution of the data. Some of the common outputs of a normality test include:

In summary, performing a normality test on a continuous variable can provide valuable insights into the distribution of the data and help guide any subsequent analysis or modeling steps. The outputs of the normality test, such as the statistic, distribution plot, and transformation results, can all be used to better understand the properties of the data and make more informed decisions.

Alternatives considered

No response

Additional context

In addition to normality testing, there may be other useful properties to consider when analyzing a continuous feature in a dataset. Some examples of additional context that could be relevant include:

Overall, understanding the additional context for a continuous feature beyond normality testing can help guide data exploration and modelling decisions and lead to more accurate and meaningful results.

fabclmnt commented 1 year ago

Hi @danishbansal808 ,

thank you for the detailed request :) This is great.

We will integrate some of the suggestions. Nevertheless, as far as the transformations go, we won't be including them in development. The objective of the tool is to profile the data without changing the original input; applying a normalizing transformation would change the design and purpose of the project.

prayas7102 commented 1 month ago

Hi @fabclmnt @lpeti69,

Is this issue still open? As a regular user of pandas-profiling, I find the feature suggested by @danishbansal808 highly valuable. In addition to the tests already highlighted, we could also incorporate a QQ plot and the .skew() method to assess the skewness of the distribution, providing more comprehensive insights.

Would it be possible to assign this issue to me? I'd be happy to contribute. Thanks!
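A rough sketch of the QQ plot and skewness idea from the comment above, assuming SciPy's `probplot` and pandas' `Series.skew()`. The function name `qq_and_skew` and the returned keys are hypothetical and do not reflect how ydata-profiling would implement this. The input is only read, never modified, which is in line with the maintainers' point about not transforming the data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def qq_and_skew(series):
    """Return QQ-plot coordinates and sample skewness for a numeric Series.

    Hypothetical helper for illustration only.
    """
    clean = series.dropna().astype(float)

    # probplot returns the theoretical normal quantiles, the sorted sample
    # values, and a least-squares fit (slope, intercept, r) through them
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        clean, dist="norm"
    )

    return {
        "theoretical_quantiles": theoretical_q,
        "ordered_values": ordered_values,
        "qq_correlation": float(r),  # closer to 1 means closer to a straight line
        "skewness": float(clean.skew()),  # about 0 for symmetric data
    }


# Example: an exponential sample is right-skewed
sample = pd.Series(np.random.default_rng(0).exponential(size=500))
result = qq_and_skew(sample)
print(result["skewness"] > 0)
```

Plotting `theoretical_quantiles` against `ordered_values` gives the QQ plot; points that bend away from a straight line indicate a departure from normality.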