Open jkleint opened 1 year ago
Hi @jkleint,
Can you share a bit more context please? Based on our understanding, you want to compare 2 datasets with the same variable names, but with different data types identified. Is that correct?
Yes.
It's not that I want to, but dirty data happens, and a good tool should do something reasonable besides crash. The base level would be reporting that they have different types and not trying to compare. Ideal would be trying to coerce compatible types and do the comparison (or report when that's not possible).
Thanks!
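For what it's worth, the "base level" behavior I'm describing could be sketched with plain pandas (hypothetical column names; `dtype_mismatches` is an illustrative helper, not part of any library):

```python
import pandas as pd

def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report shared columns whose dtypes differ, instead of crashing."""
    shared = left.columns.intersection(right.columns)
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    }

left = pd.DataFrame({"when": pd.to_datetime(["2023-01-01"]), "x": [1]})
right = pd.DataFrame({"when": ["2023-01-01"], "x": [2]})
print(dtype_mismatches(left, right))  # e.g. {'when': ('datetime64[ns]', 'object')}
```

A comparison tool could surface a report like this and then skip (or coerce) the mismatched columns, rather than raising.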
@jkleint I asked in order to better understand the use case you're trying to solve.
If this is a feature request, it will be considered as such; if it's something I can help you with via a workaround, I'm more than happy to provide one.
Regarding your request: yes, dirty data is to be expected, and we took that into consideration. Nevertheless, there is a valid reason this is not supported yet: metrics computed for two different data types are not comparable, so the resulting comparison would not make sense.
To overcome the error, you can always define the schema of the data prior to running the report. This allows you to avoid the errors that are raised.
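A minimal sketch of that workaround, assuming a hypothetical column `when` that is datetime in one dataset and string in the other; the key step is coercing both columns to a common dtype before building the reports:

```python
import pandas as pd

# Hypothetical dirty data: same column name, different dtypes.
df_a = pd.DataFrame({"when": pd.to_datetime(["2023-01-01", "2023-01-02"])})
df_b = pd.DataFrame({"when": ["2023-01-01", "not a date"]})

# Coerce to a common dtype up front; unparseable values become NaT
# instead of raising, so the two columns end up comparable.
df_b["when"] = pd.to_datetime(df_b["when"], errors="coerce")

assert df_a["when"].dtype == df_b["when"].dtype
```

With the dtypes aligned, both profiling reports see the same variable type, so the comparison no longer hits the mismatch in either direction.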
I'm building a generic "compare datasets" tool for many data scientists to use. I want to say "point this tool at your data to see what's different." I want it to be super simple, one line of code, no knowledge required to use. I do not have control over their data; I don't even see it. Often it's very dirty; sometimes that means columns with the same names and different types. If the tool crashes in this case, it's not helpful, and people aren't going to use it.
I know you build your software to high standards, and you'd agree that ideally your software would not outright crash in any case, but handle errors gracefully. I've shared what looks to me like a bug: `report1.compare(report2)` works, and there seems to be some basic logic for handling differing types, but `report2.compare(report1)` crashes. The order of comparison shouldn't matter, and at the very least it shouldn't crash in one direction and work in the other. The graceful fix, I feel, is to recognize when columns have incomparable types, note that, and proceed with the comparison; the report would then simply say that the types were X and Y and so couldn't be compared. I hope you agree that's reasonable. Thanks!
I second this. Recently upgraded to ydata-profiling from pandas-profiling and I can't seem to run reports on old data because of this.
It would be nice if it failed gracefully since I think many people use this library in the early stages of data analysis when you have very little idea what your data looks like. Defining a schema isn't a trivial task (in my case I can have hundreds of columns).
At a minimum, it'd be nice if it threw a more helpful error.
Current Behaviour
When comparing simple datasets where one has a column with type datetime and the other's corresponding column has type string, `compare()` crashes. Interestingly, calling compare on the string-data report works, but calling compare on the datetime-data report crashes (i.e., `datetime_data_report.compare(string_data_report)` crashes, while the other way around does not). Code to reproduce is below; here is the output.
Expected Behaviour
Should show a comparison and not crash.
Data Description
See below.
Code that reproduces the bug
pandas-profiling version
v4.3.1
Dependencies
OS
macOS 13.4
Checklist