eromoe opened this issue 1 year ago
Hi @eromoe,
can you please provide more details on your environment? Python version, ydata-profiling version, etc.?
Based on the error you have provided, it does seem to be a different issue from #1308.
Python 3.9.16
pandas 1.5.3
numpy 1.23.5
ydata-profiling develop (2023-05-18)
Hi @eromoe, may I suggest that you use a version that is not under development? Development versions do not guarantee that everything is 100% functional. We are currently working on a new release that updates major dependencies, which can impact the experience of the package.
Based on the details provided, everything points to lack of memory in your machine.
Have you tried one of the following strategies?
- Convert your float variables to float32 or float16, depending on the precision you require (similarly for integers)
- If you have timestamps, double-check the precision you require

If you can share the size of your data, that would be appreciated as well.
@fabclmnt
So I opened this issue; that PR doesn't fix the memory leak.
Have you tried one of the following strategies:
- Convert your float variables to float32 or float16, depending on the precision you require (similarly for integers)
- If you have timestamps, double-check the precision you require
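For illustration, the downcasting suggestion could be sketched in pandas like this (the column names are made up; `pd.to_numeric` with `downcast` picks the smallest dtype that can hold the values):

```python
import numpy as np
import pandas as pd

# Sketch of the downcasting strategy above (hypothetical column names).
df = pd.DataFrame({
    "price": np.random.rand(1_000),                # float64 by default
    "volume": np.arange(1_000, dtype="int64"),
})
before = df.memory_usage(deep=True).sum()

df["price"] = pd.to_numeric(df["price"], downcast="float")      # becomes float32
df["volume"] = pd.to_numeric(df["volume"], downcast="integer")  # becomes int16

after = df.memory_usage(deep=True).sum()
print(f"memory: {before} -> {after} bytes")
```

Note that float16 requires an explicit `astype("float16")`, since `downcast="float"` stops at float32.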
I will give it a try next week.
The issue seems to be fixed by the PR you've linked; we have validated it.
Hence why I'm also asking for the size of your data.
I tried converting the data to np.float16, but I still get the same error.
Here is the test dataset: https://mega.nz/file/xZ1xyIBL#0BM_WghcbQTJO6E1N4wpEeoESwxTr696UBlc85SnmlA
@eromoe, please confirm your system's computational resources, and please provide the code you are using to compute the profiling.
Nothing complicated; I just load the file and use ProfileReport:

from ydata_profiling import ProfileReport

r = ProfileReport(df, title='fina_price')
r.to_file('fina_price.html')
As you can see from the error message, it requests over 1 TB of memory, regardless of system computational power. I have a 12-core AMD CPU and 64 GB of RAM.
Hi,
It is not a memory leak. It fails when requesting 1 TB of memory, and the most likely explanation is that you have categorical features with extremely high cardinality. Chi-square computes a contingency table of dimension n x n, where n is the number of distinct categories.
Try removing the first column, which is a unique index and brings no value. With that column included, the chi-square step will request a matrix of size 200k x 200k, which indeed might require around 1 TB of memory.
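To make the numbers concrete, here is a back-of-the-envelope estimate (an illustrative sketch, not ydata-profiling's actual code): a single dense 200k x 200k table of 8-byte values already needs roughly 0.29 TiB, and the chi-square computation materializes more than one array of that shape (observed counts, expected frequencies), which lands in the ~1 TB range.

```python
# Back-of-the-envelope estimate for a dense n x n contingency table
# of 8-byte values; chi-square also builds expected frequencies, so
# several arrays of this size get materialized.
n = 200_000                 # distinct values in the unique-index column
table_bytes = n * n * 8     # one dense float64/int64 array
print(f"{table_bytes / 1024**4:.2f} TiB per array")
```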
Sorry, I want to make sure: did you test the file I sent? There are no categorical features in my dataset.
The first column is the index of the original dataframe; you can drop it. No column has high cardinality (all columns are numerical, so cardinality does not apply).
I used datatile to reconfirm this:
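That sanity check can also be sketched in plain pandas (the datatile output itself is not shown here; the column names below are made up):

```python
import numpy as np
import pandas as pd

# Sketch: verify every column is numeric and inspect unique-value counts.
df = pd.DataFrame({
    "open": np.random.rand(100),
    "close": np.random.rand(100),
})
print(df.dtypes)     # all numeric dtypes
print(df.nunique())  # per-column distinct-value counts

# No non-numeric (i.e. potentially categorical) columns remain.
assert df.select_dtypes(exclude="number").empty
```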
Current Behaviour
I saw this PR has been merged: https://github.com/ydataai/ydata-profiling/pull/1308
But I still get a memory error:
Expected Behaviour
No error.
Data Description
not available
Code that reproduces the bug
No response
pandas-profiling version
latest develop
Dependencies
OS
win10
Checklist