ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Pandas profiling becoming too slow: unusable #743

Open arita37 opened 3 years ago

arita37 commented 3 years ago

With 10k rows and 30 columns, it takes more than 2 minutes to generate a report...

Pandas profiling keeps getting slower and slower...

Can you run benchmark tests?

sbrugman commented 3 years ago

Could you provide a dataset to reproduce? Did you test against prior versions of this package?

arita37 commented 3 years ago

The dataset is simple:

40k rows, 37 columns of floats.

You could add an sklearn random dataset to your regression tests.


sbrugman commented 3 years ago

What settings are you using? There is always a trade-off between performance and which statistics are generated. For instance, on a dataset similar to the one you mention, with the minimal=True setting, the full report, including HTML rendering, takes 2.5 seconds.
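For reference, the minimal mode mentioned here is enabled through the ProfileReport constructor; a short sketch, where the CSV path and report filename are placeholders:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # package later renamed to ydata_profiling

df = pd.read_csv("data.csv")  # placeholder dataset

# minimal=True turns off the expensive computations (correlations,
# interactions, ...) and keeps only the per-column statistics
profile = ProfileReport(df, title="Profiling Report", minimal=True)
profile.to_file("report.html")
```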

arita37 commented 3 years ago

OK, how much time with minimal=False?

My suggestions:

1) Run a speed benchmark on each release with nrows=10,000 and ncolumns=50 (i.e. a small dataset).

2) Find tricks to skip, or make optional, the heavy computations --> mostly the pairwise ones.

For instance: ratio = number of unique values within the (5%, 95%) percentile range / len(df); if the ratio is low, the computation might be skipped (see the sketch below).
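A minimal sketch of one possible reading of this heuristic, written only for illustration; the percentile bounds and the threshold are assumptions, not part of pandas-profiling:

```python
import numpy as np
import pandas as pd

def is_low_cardinality(series: pd.Series, lower=0.05, upper=0.95, threshold=0.01) -> bool:
    """Return True when the column has few unique values inside its central
    (5%, 95%) percentile range relative to the row count, suggesting its
    expensive pairwise statistics could be skipped."""
    lo, hi = series.quantile([lower, upper])
    central = series[(series >= lo) & (series <= hi)]
    ratio = central.nunique() / len(series)
    return ratio < threshold

df = pd.DataFrame({"x": np.random.rand(10_000), "y": np.zeros(10_000)})
print({col: is_low_cardinality(df[col]) for col in df.columns})  # {'x': False, 'y': True}
```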


arita37 commented 3 years ago

I suggest using pyinstrument to benchmark.

You'll see which functionalities to deactivate by default.


sbrugman commented 3 years ago

@arita37 Sounds good. Would you be interested in contributing a PR and working out the sketched solution?

arita37 commented 3 years ago

Sorry, I am too busy... Check my profile and count the number of commits I have made....

It takes 5 lines of code:

Generate a random numpy dataset of 10,000 rows x 50 columns, run pandas-profiling on it under pyinstrument, and identify the bottlenecks (see the sketch below).

Make optional any computation that takes more than 40 seconds.

Keep the total compute under 60 seconds.

The more bloated pandas-profiling gets, the less usable it becomes....

Don't spend time on Jupyter widgets and the like.

Make sure the code runs fast --> people will keep using it because it is faster than exploring the data in Jupyter by hand....

The HTML report replaces Jupyter itself.
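A minimal sketch of the benchmark being proposed, assuming pyinstrument is installed; the dataset shape, seed, and filenames are illustrative:

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # package later renamed to ydata_profiling
from pyinstrument import Profiler

# Synthetic benchmark dataset: 10,000 rows x 50 float columns
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((10_000, 50)), columns=[f"col_{i}" for i in range(50)])

# Run report generation under pyinstrument to locate the hot spots
profiler = Profiler()
profiler.start()
ProfileReport(df, title="Benchmark").to_file("benchmark_report.html")
profiler.stop()

print(profiler.output_text(unicode=True, color=False))
```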


dpnem commented 3 years ago

I ran a 49 million record table with 86 variables. Yeah, it took 30 minutes to run, but I'm good with that because the data insights I got from the report are an incredible help.

I turned off correlations, interactions, and all of the missing-value diagrams.
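For anyone wanting to do the same, a sketch of one way to disable those sections through the ProfileReport configuration; the exact option names below follow the pandas-profiling 3.x config and may differ in other versions, so treat them as assumptions and check the configuration docs:

```python
from pandas_profiling import ProfileReport  # package later renamed to ydata_profiling

# Disable the most expensive report sections for a large DataFrame `df`
profile = ProfileReport(
    df,
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
    interactions={"continuous": False},
    missing_diagrams={"bar": False, "matrix": False, "heatmap": False},
)
profile.to_file("large_report.html")
```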

akshayreddykotha commented 3 years ago

Are correlations, interactions, and missing-value diagrams the most computationally expensive parts? Could you share additional insight into which tasks are the most expensive during report generation?

enesMesut commented 3 years ago

@dpnem Would you share the hardware specs of the computer you ran the profiling on?

zcfrank1st commented 1 year ago

too slow...