Use magic bytes to identify file formats

scottyhq commented 3 weeks ago

[x] Closes #142
[x] Tests added
[x] Changes are documented in docs/releases.rst
[ ] New functions/methods are listed in api.rst

rabernat commented 3 weeks ago

I would suggest not trusting file extensions at all and using only the magic bytes to identify file types.

scottyhq commented 3 weeks ago

I took a stab at using only magic bytes in the last commit. Note that I changed the 'netcdf4' mapping to 'hdf5', since that is the underlying format according to the magic bytes. Tests seem to pass locally. I haven't added new tests, but I did try the following 'test URLs', which are publicly available without authentication:

import fsspec
# https://en.wikipedia.org/wiki/List_of_file_signatures
examples = {
    'grib':'https://github.com/pydata/xarray-data/raw/master/era5-2mt-2019-03-uk.grib',
    'netcdf3':'https://github.com/pydata/xarray-data/raw/master/air_temperature.nc', 
    'netcdf4':'https://github.com/pydata/xarray-data/raw/master/ROMS_example.nc',
    'hdf4':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf',
     # https://nisar.jpl.nasa.gov/data/sample-data/
    'hdf5':'https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/Soil_Moisture/ALOS-2/NISAR_L3_PR_SME2_001_008_D_070_4000_QPNA_A_20190829T180759_20190829T180809_P01101_M_P_J_001.h5',
    'tif':'https://github.com/corteva/rioxarray/raw/master/test/test_data/input/cog.tif',
     # https://github.com/astropy/astropy/blob/4d034aa7e27e31cb0241cc01bbe76eab47406a91/astropy/io/fits/tests/test_fsspec.py#L73
    'fits':'https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:HST/product/ibxl50020_jif.fits',
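    # JPEG matches none of the checks below, so this entry falls through to the NotImplementedError branch
    'jpg': 'https://github.com/rasterio/rasterio/raw/main/tests/data/389225main_sw_1965_1024.jpg',
}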

for file_type,url in examples.items():
    with fsspec.open(url) as f:
        magic_bytes = f.read(8)
        print(file_type, magic_bytes)

    if magic_bytes.startswith(b"CDF"):
        print('netCDF3!')
    elif magic_bytes.startswith(b"\x89HDF"):
        print('HDF5 / netCDF4!')
    elif magic_bytes.startswith(b"\x0e\x03\x13\x01"):
        print('HDF4!')
    elif magic_bytes.startswith(b'GRIB'):
        print('GRIB!')
    elif magic_bytes.startswith(b'II*'):
        print('TIFF!')
    elif magic_bytes.startswith(b'SIMPLE'):
        print('FITS!')
    else:
        raise NotImplementedError(f"Unrecognised file based on header bytes: {magic_bytes}")
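
The same checks can also be expressed as a small lookup that works on bytes alone, which makes them testable without network access. This is only a sketch: the names `identify_format` and `_SIGNATURES` are hypothetical and not taken from the PR, and the big-endian TIFF signature is an addition that the snippet above does not check for.

```python
from typing import Optional

# (signature, format name) pairs, mirroring the checks in the snippet above.
# These names are illustrative, not the mapping used in the PR itself.
_SIGNATURES = [
    (b"CDF", "netcdf3"),
    (b"\x89HDF", "hdf5"),            # also matches netCDF4, which is HDF5 on disk
    (b"\x0e\x03\x13\x01", "hdf4"),
    (b"GRIB", "grib"),
    (b"II*\x00", "tiff"),            # little-endian TIFF
    (b"MM\x00*", "tiff"),            # big-endian TIFF (addition, see lead-in)
    (b"SIMPLE", "fits"),
]


def identify_format(header: bytes) -> Optional[str]:
    """Return a format name for the leading bytes of a file, or None if unknown."""
    for signature, name in _SIGNATURES:
        if header.startswith(signature):
            return name
    return None
```

Passing the first 8 bytes read from a file, as in the loop above, is enough for every signature listed here. Returning `None` instead of raising leaves it to the caller to decide how to handle unrecognised files such as the JPEG example.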

TomNicholas commented 2 weeks ago

This is neat, thanks @scottyhq !

I agree the magics are much more robust.

That list of examples is cool - is there any point adding tests of opening those files to the test suite?

scottyhq commented 2 weeks ago

> That list of examples is cool - is there any point adding tests of opening those files to the test suite?

I think it's useful. I could add a @network marker as Xarray does (https://github.com/pydata/xarray/blob/be8e17e4dc5da67d7cbb09db87d80c1bbc71a64e/conftest.py#L10)? I should have some time to do that later this week.

TomNicholas commented 2 weeks ago

> I could add a @network marker as Xarray does?

Yes nice. This test of reading from s3 could also use that marker.
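
For reference, a marker of that kind could look roughly like the `conftest.py` sketch below, modelled on the approach in the linked Xarray `conftest.py`. The option name `--run-network-tests`, the marker description, and the skip message are assumptions for illustration and may not match what this repository ends up using.

```python
import pytest


def pytest_addoption(parser):
    # Opt-in flag: network tests are skipped unless this is passed.
    parser.addoption(
        "--run-network-tests",
        action="store_true",
        help="run tests that require a network connection",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "network: test requires a network connection"
    )


def pytest_runtest_setup(item):
    # Skip any test marked with @pytest.mark.network unless the flag is set.
    if "network" in item.keywords and not item.config.getoption(
        "--run-network-tests"
    ):
        pytest.skip("set --run-network-tests to run tests requiring a network connection")
```

A test that opens one of the remote example files would then be decorated with `@pytest.mark.network` and only run when invoked as `pytest --run-network-tests`.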

scottyhq commented 1 week ago

Worked on this a bit more @TomNicholas and I think it's ready to go. Added tests for all formats and linked to example files for all kerchunk-supported formats... But it seems like some additional work is needed to fully support TIFF, FITS, and HDF5.

TomNicholas commented 1 week ago

Thanks so much @scottyhq !

zarr-developers / VirtualiZarr

Use magic bytes to identify file formats #143