arita37 opened this issue 3 years ago
Could you provide a dataset to reproduce? Did you test against prior versions of this package?
The dataset is simple:
40k rows and 37 columns of floats.
You could add an sklearn random dataset to your regression tests.
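For example, a synthetic dataset of roughly that shape could be generated with scikit-learn (just a sketch; make_regression and the column names here are one possible choice):

```python
# Sketch: build a synthetic regression dataset of roughly the shape
# described above (40k rows, 37 float columns) for use as a regression test.
import pandas as pd
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=40_000, n_features=37, random_state=0)
df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(X.shape[1])])
```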
What settings are you using? There is always a trade-off between performance and which statistics are generated. For instance, on a dataset similar to the one you mention, with the minimal=True setting, the full report, including HTML rendering, takes 2.5 seconds.
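For reference, the minimal mode is a single keyword argument (a sketch; exactly which statistics minimal=True disables depends on the package version):

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Sketch: a dataset of the size mentioned above, profiled with minimal=True,
# which skips the expensive computations (correlations, interactions, ...).
df = pd.DataFrame(np.random.rand(40_000, 37), columns=[f"c{i}" for i in range(37)])
ProfileReport(df, minimal=True).to_file("report.html")
```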
OK, how much time does it take with minimal=False?
My suggestions:
1) Run a speed benchmark at release time with nrows=10,000 and ncolumns=50 (i.e. a small dataset).
2) Find tricks / options to remove heavy compute -> mostly the pairwise calculations.
Mostly: ratio = number of unique values within the (5%, 95%) range / len(df); if the ratio is low, the pairwise compute might be skipped (sketched below).
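A rough sketch of that heuristic (the threshold and the helper name are made up; pandas-profiling does not currently ship this):

```python
import pandas as pd

def pairwise_candidates(df: pd.DataFrame, threshold: float = 0.01):
    """Hypothetical helper: keep only columns whose unique-value ratio,
    restricted to the 5%-95% quantile range, exceeds the threshold;
    pairwise computations could be skipped for the rest."""
    keep = []
    for col in df.select_dtypes("number").columns:
        s = df[col].dropna()
        lo, hi = s.quantile([0.05, 0.95])
        ratio = s[(s >= lo) & (s <= hi)].nunique() / max(len(df), 1)
        if ratio >= threshold:
            keep.append(col)
    return keep
```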
I suggest using pyinstrument to benchmark.
You'll see which functionality to deactivate by default.
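Something along these lines (a sketch; the 10,000 x 50 random dataset mirrors the benchmark size suggested above):

```python
import numpy as np
import pandas as pd
from pyinstrument import Profiler
from pandas_profiling import ProfileReport

# Sketch: profile full (non-minimal) report generation with pyinstrument
# to see where the time goes.
df = pd.DataFrame(np.random.rand(10_000, 50), columns=[f"c{i}" for i in range(50)])

profiler = Profiler()
profiler.start()
ProfileReport(df).to_html()
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
```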
@arita37 Sounds good. Would you be interested in contributing a PR and working out the sketched solution?
Sorry, I am too busy... Check my profile and count the number of commits I've done....
It takes 5 lines of code:
generate a random numpy dataset (10,000 rows x 50 columns), run pandas-profiling with pyinstrument, and identify the bottleneck.
Make everything that costs > 40 sec of compute optional.
Make the total compute < 60 sec (a benchmark sketch follows below).
The more bloated pandas-profiling becomes, the less usable it is....
Don't spend time on useless jupyter widgets, etc.
Make sure the code runs fast -> people will keep using it because it is faster than working in jupyter....
The HTML report replaces jupyter itself.
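A sketch of what such a release-time speed check could look like (the 60-second budget is the one suggested above; the test name is made up):

```python
import time
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

def test_full_report_speed_budget():
    # Sketch: 10,000 rows x 50 columns of random floats, full report,
    # fail if generation exceeds the 60-second budget.
    df = pd.DataFrame(np.random.rand(10_000, 50),
                      columns=[f"c{i}" for i in range(50)])
    start = time.perf_counter()
    ProfileReport(df).to_html()
    elapsed = time.perf_counter() - start
    assert elapsed < 60, f"full report took {elapsed:.1f}s (> 60s budget)"
```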
I ran a 49 million record table with 86 variables. Yeah, it took 30 minutes to run, but I'm good with that because the data insights I got from the report are an incredible help.
I turned off correlations, interactions, and all the missing-value diagrams.
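For reference, that roughly corresponds to overrides like these (a sketch; the configuration keys vary between pandas-profiling versions, so check the docs for the version you run):

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(1_000, 10))  # stand-in for the real table

# Sketch: disable the heavy parts mentioned above (correlations,
# interactions, missing-value diagrams); keys are version-dependent.
profile = ProfileReport(
    df,
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
    interactions={"continuous": False},
    missing_diagrams={"bar": False, "matrix": False, "heatmap": False, "dendrogram": False},
)
```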
Are correlations, interactions, and missing diagrams the most computationally expensive parts? Could you share additional insight into which tasks are computationally expensive during report generation?
@dpnem Would you share the hardware specs of the computer you ran the profiling on?
Too slow...
With 10k rows and 30 columns, it takes more than 2 minutes to generate a report...
Pandas profiling is becoming slower and slower...
Can you run benchmark tests?