ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Comparing datetime and str columns crashes with TypeError #1378

Open jkleint opened 1 year ago

jkleint commented 1 year ago

Current Behaviour

When comparing simple datasets where a column has type datetime in one dataset and type string in the other, compare() crashes. Interestingly, calling compare() on the string-data report works, but calling it on the datetime-data report crashes (i.e., datetime_data_report.compare(string_data_report) crashes; the other way around does not).

The code to reproduce is below; here is the output.


/Users/jk/progs/profile-test/venv/bin/python /Users/jk/progs/profile-test/ydata_bugs.py 
Python 3.11.4 (main, Jun 29 2023, 21:37:20) [Clang 12.0.0 (clang-1200.0.32.29)]
Pandas 2.0.3
ydata_profiling v4.3.1 

Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 42.25it/s, Completed]
Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 67.19it/s, Completed]
-----
Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 67.49it/s, Completed]
Summarize dataset:   0%|          | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 68, in inner
    return fn(config, series, summary)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 85, in inner
    return fn(config, series, summary)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/describe_date_pandas.py", line 34, in pandas_describe_date_1d
    "min": pd.Timestamp.to_pydatetime(series.min()),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: descriptor 'to_pydatetime' for 'pandas._libs.tslibs.timestamps._Timestamp' objects doesn't apply to a 'str' object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 57, in pandas_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summarizer.py", line 42, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 62, in handle
    return op(*args)
           ^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
           ^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
           ^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
           ^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 17, in func2
    res = g(*x)
          ^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object inner at 0x134307960, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 62>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 99, in pandas_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 79, in multiprocess_1d
    return column, describe_1d(config, series, summarizer, typeset)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object pandas_describe_1d at 0x7fa1d47960d0, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 19>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jk/progs/profile-test/ydata_bugs.py", line 26, in <module>
    compare_ydata(df2, df1)
  File "/Users/jk/progs/profile-test/ydata_bugs.py", line 16, in compare_ydata
    report1.compare(report2)
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 543, in compare
    return compare([self, other], config if config is not None else self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 339, in compare
    labels, descriptions = _compare_profile_report_preprocess(reports, _config)  # type: ignore
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 146, in _compare_profile_report_preprocess
    descriptions = [report.get_description() for report in reports]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 146, in <listcomp>
    descriptions = [report.get_description() for report in reports]
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 320, in get_description
    return self.description_set
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 251, in description_set
    self._description_set = describe_df(
                            ^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/describe.py", line 72, in describe
    series_description = get_series_descriptions(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object pandas_get_series_descriptions at 0x7fa1d47977d0, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 60>

Process finished with exit code 1

Expected Behaviour

Should show a comparison and not crash.

Data Description

See below.

Code that reproduces the bug

import sys

import pandas as pd
import ydata_profiling
from ydata_profiling import ProfileReport

print('Python', sys.version)
print('Pandas', pd.__version__)
print('ydata_profiling', ydata_profiling.__version__, '\n')

def compare_ydata(df1, df2):
    report1 = ProfileReport(df1, title='df1', interactions=None, correlations=None)
    report2 = ProfileReport(df2, title='df2', interactions=None, correlations=None)
    report1.compare(report2)

df1 = pd.DataFrame({'a': ['2023-01-17 05:12:40', '2023-01-17 05:02:38']})
df2 = pd.DataFrame({'a': pd.to_datetime(['2023-06-08 07:02:00', '2023-01-05 08:07:00'])})

# Works
compare_ydata(df1, df2)
print('-----')
# Crashes
compare_ydata(df2, df1)
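
The innermost frame of the traceback above can be isolated without ydata-profiling. This is a minimal sketch that assumes only pandas; it shows that the unbound call `pd.Timestamp.to_pydatetime(...)` accepts the minimum of a datetime series but rejects the minimum of a string series. It does not show why the string column reaches the datetime code path in the first place.

```python
import pandas as pd

# Same shapes as df2 (datetime) and df1 (string) in the script above.
ts_series = pd.Series(pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"]))
str_series = pd.Series(["2023-01-17 05:12:40", "2023-01-17 05:02:38"])

# Datetime series: min() is a pd.Timestamp, so the unbound call succeeds.
converted = pd.Timestamp.to_pydatetime(ts_series.min())
print(type(converted).__name__, converted)

# String series: min() is a plain str, so the same call raises TypeError.
raised = False
try:
    pd.Timestamp.to_pydatetime(str_series.min())
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```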

pandas-profiling version

v4.3.1

Dependencies

pandas==2.0.3

OS

MacOS 13.4


fabclmnt commented 1 year ago

Hi @jkleint,

Can you share a bit more context please? Based on our understanding, you want to compare 2 datasets with the same variable names, but with different data types identified. Is that correct?

ccmilne commented 1 year ago

@jkleint I was able to resolve this by updating Conda, updating the Pillow library, and then manually deleting the DroidSansMono.ttf file in this folder: C:\ProgramData\Anaconda3\Lib\site-packages\wordcloud\

jkleint commented 1 year ago

> Hi @jkleint,
>
> Can you share a bit more context please? Based on our understanding, you want to compare 2 datasets with the same variable names, but with different data types identified. Is that correct?

Yes.
It's not that I want to, but dirty data happens, and a good tool should do something reasonable instead of crashing. At a minimum, it could report that the columns have different types and skip comparing them. Ideally, it would try to coerce compatible types and then do the comparison (or report when that's not possible).

Thanks!

fabclmnt commented 1 year ago

@jkleint I asked in order to better understand the use case that you are trying to solve.

If this is a feature request, it will be considered as such; if it is something I can help you with through a workaround, I'm more than happy to provide one.

Regarding your request: yes, dirty data is to be expected, and we took that into consideration. Nevertheless, there is a very valid reason why this is not supported yet: metrics of 2 different data types are not comparable, so the resulting comparison would not make sense.

To overcome the error, you can always define the schema of the data prior to running the report. This allows you to avoid the errors that are raised.
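
One way to read "define the schema" is to cast the shared columns to a common type in pandas before building the reports. The following is a minimal sketch under that assumption; `align_datetime_columns` is a hypothetical helper, not part of ydata-profiling, and it only handles the datetime-versus-other case from this issue.

```python
import pandas as pd


def align_datetime_columns(df_a: pd.DataFrame, df_b: pd.DataFrame):
    """Return copies in which a shared column that is datetime in one frame
    and non-datetime in the other is parsed with pd.to_datetime.

    Hypothetical helper for illustration; values that cannot be parsed
    become NaT because of errors="coerce".
    """
    out_a, out_b = df_a.copy(), df_b.copy()
    for col in df_a.columns.intersection(df_b.columns):
        a_is_dt = pd.api.types.is_datetime64_any_dtype(out_a[col])
        b_is_dt = pd.api.types.is_datetime64_any_dtype(out_b[col])
        if a_is_dt and not b_is_dt:
            out_b[col] = pd.to_datetime(out_b[col], errors="coerce")
        elif b_is_dt and not a_is_dt:
            out_a[col] = pd.to_datetime(out_a[col], errors="coerce")
    return out_a, out_b


# The two frames from the reproduction script in this issue.
df1 = pd.DataFrame({"a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"]})
df2 = pd.DataFrame({"a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"])})

df1_aligned, df2_aligned = align_datetime_columns(df1, df2)
print(df1_aligned["a"].dtype, df2_aligned["a"].dtype)
```

The aligned frames can then be passed to ProfileReport and compared as in the reproduction script. Whether that avoids the crash in a given version has not been verified here, and silently coercing unparseable values to NaT may hide exactly the dirty data a profile is meant to surface.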

jkleint commented 1 year ago

I'm building a generic "compare datasets" tool for many data scientists to use. I want to say "point this tool at your data to see what's different." I want it to be super simple, one line of code, no knowledge required to use. I do not have control over their data; I don't even see it. Often it's very dirty; sometimes that means columns with the same names and different types. If the tool crashes in this case, it's not helpful, and people aren't going to use it.

I know you build your software to high standards, and you'd agree that ideally your software would not outright crash in any case, but handle errors gracefully. I've shared what looks to me like a bug; I say this because report1.compare(report2) works, and it seems there is some basic logic to handle differing types; but report2.compare(report1) crashes. It seems like the order of comparison shouldn't matter, and at the very least shouldn't crash in one case and work in the other. I feel the graceful fix is to recognize when columns have incomparable types, note that, and proceed with the comparison. Then just say in the report that the types were X and Y and so couldn't be compared. I hope you agree that's reasonable. Thanks!
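
A rough sketch of the behaviour described above, written as a pre-check in plain pandas rather than as a change to ydata-profiling. `find_dtype_mismatches` is a hypothetical name, and the extra column `b` is added only to show a comparable column next to an incomparable one.

```python
import pandas as pd


def find_dtype_mismatches(df_a: pd.DataFrame, df_b: pd.DataFrame) -> dict:
    """Map each shared column whose dtypes differ to (dtype_a, dtype_b).

    Dtypes are compared by their string names to keep the check simple.
    """
    mismatches = {}
    for col in df_a.columns.intersection(df_b.columns):
        name_a, name_b = str(df_a[col].dtype), str(df_b[col].dtype)
        if name_a != name_b:
            mismatches[col] = (name_a, name_b)
    return mismatches


df1 = pd.DataFrame({
    "a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"],
    "b": [1, 2],
})
df2 = pd.DataFrame({
    "a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"]),
    "b": [3, 4],
})

mismatches = find_dtype_mismatches(df1, df2)
comparable = [c for c in df1.columns.intersection(df2.columns) if c not in mismatches]

# Note the incomparable columns, then proceed with the rest.
for col, (type_a, type_b) in mismatches.items():
    print(f"column {col!r}: types {type_a} and {type_b} differ, not compared")
print("comparable columns:", comparable)
```

Because the check depends only on the pair of dtypes, it gives the same answer regardless of which dataset comes first.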

Filimoa commented 11 months ago

I second this. Recently upgraded to ydata-profiling from pandas-profiling and I can't seem to run reports on old data because of this.

It would be nice if it failed gracefully since I think many people use this library in the early stages of data analysis when you have very little idea what your data looks like. Defining a schema isn't a trivial task (in my case I can have hundreds of columns).

At the minimum it'd be nice if it threw a more helpful error.
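
Until then, a caller can approximate a more helpful error by wrapping the comparison. This is only a sketch: `compare_with_dtype_hint` is a hypothetical wrapper, and `failing_compare` is a stand-in for whatever comparison call crashes, not a ydata-profiling function.

```python
import pandas as pd


def compare_with_dtype_hint(compare_fn, df_a, df_b):
    """Call compare_fn(df_a, df_b); on failure, re-raise with a dtype hint.

    compare_fn is any callable supplied by the caller (hypothetical here).
    """
    try:
        return compare_fn(df_a, df_b)
    except Exception as exc:
        differing = {
            col: (str(df_a[col].dtype), str(df_b[col].dtype))
            for col in df_a.columns.intersection(df_b.columns)
            if str(df_a[col].dtype) != str(df_b[col].dtype)
        }
        if not differing:
            raise  # unrelated failure: leave the original error untouched
        details = ", ".join(f"{c!r}: {a} vs {b}" for c, (a, b) in differing.items())
        raise TypeError(
            f"Comparison failed; columns with differing dtypes: {details}"
        ) from exc


def failing_compare(df_a, df_b):
    # Stand-in for a comparison that crashes on mixed dtypes.
    raise RuntimeError("internal dispatch error")


df1 = pd.DataFrame({"a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"]})
df2 = pd.DataFrame({"a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"])})

caught = None
try:
    compare_with_dtype_hint(failing_compare, df2, df1)
except TypeError as exc:
    caught = exc
    print(exc)
```

The original exception stays attached as `__cause__`, so the full traceback is still available for debugging.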