xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: read_parquet is unable to read ZipExtFile #623

Open BlackArbsCEO opened 1 year ago

BlackArbsCEO commented 1 year ago

Describe the bug

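Reading a Parquet file from inside a zip archive fails in Xorbits: passing the ZipExtFile object returned by zipfile.ZipFile.open() to read_parquet raises a TypeError, whereas pandas reads the same object without issues.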

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version
    Python version       : 3.10.10
    IPython version      : 8.13.2
  2. The version of Xorbits you use
    xorbits   : 0.4.4
  3. Versions of crucial packages, such as numpy, scipy and pandas
    numpy     : 1.23.5
    pandas    : 1.5.2
  4. Full stack of the error.
    Traceback (most recent call last):
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-2-66890dd066a4>", line 1, in <module>
        runfile('D:\\PERSONAL\\CODE_PROJECTS\\blackarbs_algo_strategy_dev-master\\scripts\\data_exploration.py', wdir='D:\\PERSONAL\\CODE_PROJECTS\\blackarbs_algo_strategy_dev-master\\scripts\\)
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1.2\plugins\python-ce\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
        pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.1.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
        exec(compile(contents+"\n", file, 'exec'), glob, loc)
      File "D:\PERSONAL\CODE_PROJECTS\blackarbs_algo_strategy_dev-master\scripts\data_exploration.py", line 124, in <module>
        df = get_symbol_dataframe_from_zip(zip_file_path, symbol)
      File "D:\PERSONAL\CODE_PROJECTS\blackarbs_algo_strategy_dev-master\scripts\data_exploration.py", line 94, in get_symbol_dataframe_from_zip
        out = read_parquet_files_from_zip(zip_file, filenames, symbol)
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\blk_utils\utils.py", line 37, in wrap_func
        result = func(*args, **kwargs)
      File "D:\PERSONAL\CODE_PROJECTS\blackarbs_algo_strategy_dev-master\scripts\data_exploration.py", line 76, in read_parquet_files_from_zip
        df = pd.read_parquet(parquetfile)
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\xorbits\core\adapter.py", line 472, in wrapped
        return from_mars(c(*to_mars(args), **to_mars(kwargs)))
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\xorbits\_mars\dataframe\datasource\read_parquet.py", line 752, in read_parquet
        fs = get_fs(single_path, storage_options)
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\xorbits\_mars\lib\filesystem\core.py", line 53, in get_fs
        scheme = get_scheme(path)
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\site-packages\xorbits\_mars\lib\filesystem\core.py", line 40, in get_scheme
        if os.path.exists(path) or glob_.glob(path):
      File "C:\Users\kngka\Anaconda3\envs\algodev\lib\genericpath.py", line 19, in exists
        os.stat(path)
    TypeError: stat: path should be string, bytes, os.PathLike or integer, not ZipExtFile
  5. Minimized code to reproduce the error.
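    import zipfile
    import xorbits.pandas as pd
    # zip_file: path to the archive; parquet_files: member names inside it
    with zipfile.ZipFile(zip_file) as zf:
        for parquet_file in parquet_files:
            # zf.open() returns a ZipExtFile (a file object, not a path)
            with zf.open(parquet_file, "r") as parquetfile:
                df = pd.read_parquet(parquetfile)  # raises the TypeError above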
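
Until file objects are supported, one possible workaround is to extract the archive member to disk and pass its path instead of the ZipExtFile. The sketch below uses only the standard library; `extract_member_to_path` is a hypothetical helper name, and it assumes that read_parquet accepts a local file path, which the `os.path.exists(path)` check in the traceback suggests but which has not been verified here.

```python
import zipfile


def extract_member_to_path(zip_path, member, dest_dir):
    """Extract one archive member into dest_dir and return its path on disk.

    Hypothetical helper for illustration; not part of Xorbits or pandas.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # ZipFile.extract() writes the member under dest_dir and returns
        # the normalized path of the file it created.
        return zf.extract(member, path=dest_dir)


# Intended usage (assumption: pd is xorbits.pandas and accepts local paths):
#
#     import tempfile
#     with tempfile.TemporaryDirectory() as tmp_dir:
#         path = extract_member_to_path(zip_file, parquet_file, tmp_dir)
#         df = pd.read_parquet(path)
```

This trades extra disk I/O for a plain string path, so the path-based scheme detection never sees a file object.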


aresnow1 commented 1 year ago

Thanks for your report. We've reproduced this issue and will fix it ASAP.
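
For context, the traceback shows the failure arising when `os.path.exists` is called on the ZipExtFile inside `get_scheme`. A minimal sketch of the kind of guard that avoids this is shown below. It is not the actual Xorbits fix; `is_path_like` and `local_path_exists` are hypothetical names introduced here for illustration only.

```python
import os


def is_path_like(obj):
    """Return True if obj is something the os.path functions accept as a path."""
    return isinstance(obj, (str, bytes, os.PathLike))


def local_path_exists(obj):
    """Like os.path.exists, but return False for file objects instead of raising."""
    # Short-circuit: os.path.exists is only reached for str, bytes or PathLike.
    return is_path_like(obj) and os.path.exists(obj)
```

With a check like this, a file object such as ZipExtFile can be routed to a file-like code path rather than reaching `os.stat`.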