zooniverse / planet-four

Identify and measure features on the surface of Mars
https://www.planetfour.org/
Apache License 2.0

Data files on results page are "dirty" zip files #185

Closed. michaelaye closed this issue 5 years ago

michaelaye commented 5 years ago

Hi! I'm trying to establish a remote-loadable data catalog using the result files served on planetfour.org/results, but am facing the issue that the zip files have more than one file in them, all the superfluous entries being macOS additions. For example, for "https://data.zooniverse.org/planet_four/P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip" I get:

ValueError: ('Multiple files found in compressed zip file %s', "['P4_catalog_v1.1_L1C_cut_0.5_fan.csv', '__MACOSX/', '__MACOSX/._P4_catalog_v1.1_L1C_cut_0.5_fan.csv']")

when using pandas.read_csv(), which is able to read remotely stored CSV files. Could the served files be cleaned up to contain exactly one file, please? If it's easier for you, these could also be .gz files instead of .zip.
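
For reference, this is roughly the failing call (a minimal sketch; only pandas' standard URL handling and zip-compression inference is assumed):

import pandas as pd

# Sketch: pandas downloads the .zip and infers zip compression from the URL,
# but refuses archives with more than one member (here, the __MACOSX entries).
url = ("https://data.zooniverse.org/planet_four/"
       "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip")
try:
    fans = pd.read_csv(url)
except ValueError as err:
    print(err)  # "Multiple files found in compressed zip file ..."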

Thanks!

michaelaye commented 5 years ago

Or I could create them for you and you put them up? Maybe it was even me who created these in the first place, I don't remember.

michaelaye commented 5 years ago

Here are new clean gzip files: [EDIT: Had to replace with gzip for remote load-ability]

  1. Fans: https://www.dropbox.com/s/o70a1x8xzrxfyvg/P4_catalog_v1.1_L1C_cut_0.5_fan.csv.gz?dl=1

  2. Blotches: https://www.dropbox.com/s/hx0qr1ut5lk2fva/P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.gz?dl=1

  3. HiRISE observation catalog: https://www.dropbox.com/s/k0bfd11mnwqf9h8/P4_catalog_v1.1_metadata.csv.gz?dl=1

  4. Tile catalog: https://www.dropbox.com/s/rupw3rz0pw4bzm8/P4_catalog_v1.1_tile_coords_final.csv.gz?dl=1

  5. Raw data: https://www.dropbox.com/s/zczeko8najbn627/P4_catalog_v1.0_raw_classifications.hdf.gz?dl=1

The last item is a folder structure anyway, so it wouldn't work as a single-file read in any case.
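
For reference, one way to produce such clean single-file gzips is via Python's gzip module (a sketch only; the local file names are assumed to match the catalogs linked above, and any equivalent command-line gzip tool does the same job):

import gzip
import shutil

# A gzip stream wraps exactly one file, so no __MACOSX resource-fork
# entries can sneak in. File names below are assumptions for illustration.
for name in [
    "P4_catalog_v1.1_L1C_cut_0.5_fan.csv",
    "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv",
    "P4_catalog_v1.1_metadata.csv",
    "P4_catalog_v1.1_tile_coords_final.csv",
]:
    with open(name, "rb") as src, gzip.open(name + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)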

Thanks!

camallen commented 5 years ago

I've re-uploaded those zip files with just the src data files, no cruft.

Using the link from the initial report above, I see the following:

$ zipinfo Downloads/P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip 
Archive:  Downloads/P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip
Zip file size: 9936053 bytes, number of entries: 1
-rw-r--r--  3.0 unx 33212637 tx defX 19-Jun-04 10:47 P4_catalog_v1.1_L1C_cut_0.5_fan.csv
1 file, 33212637 bytes uncompressed, 9935833 bytes compressed:  70.1%

FWIW, some simple Python zip file filtering can help avoid the issue for other archives you may not have control over:

import zipfile
from io import BytesIO
from urllib.request import urlopen

import pandas as pd

zip_file_url = 'https://data.zooniverse.org/planet_four/P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip'

# download the archive and hold it in an in-memory buffer for zipfile
remote_zip_file = urlopen(zip_file_url)
zipinmemory = BytesIO(remote_zip_file.read())
zip_file = zipfile.ZipFile(zipinmemory)

# the zipfile namelist can be filtered for smarter file loading
# in this case, only load the first entry (the CSV) and skip the __MACOSX cruft
data = pd.read_csv(zip_file.open(zip_file.namelist()[0]))

print(data.head())

michaelaye commented 5 years ago

Actually, simply using the command-line zip instead of Finder.app's right-click compress does the job as well. Alas, you seem to have missed my edit above: I would rather the files be gzip instead of zip, because Dask, the scheduling library used in my analysis pipelines, does not support the zip format, and Dask is the default remote file reader for the intake library, which is what I'm currently using to create a documented and versioned data set. If you could just take the gzips linked above, which are command-line created and clean, and upload them to the site, that would be much appreciated!
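
For context, this is roughly the loading path in question (a sketch only: it assumes the Dropbox fan-catalog link above, Dask's CSV reader with HTTP/fsspec support installed, and blocksize=None because gzip streams aren't splittable):

import dask.dataframe as dd

# Sketch: Dask's CSV reader can handle gzip (reading each file whole, since
# gzip isn't splittable), but it has no zip support, hence the .gz request.
fan_url = ("https://www.dropbox.com/s/o70a1x8xzrxfyvg/"
           "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.gz?dl=1")
fans = dd.read_csv(fan_url, compression="gzip", blocksize=None)
print(fans.head())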

camallen commented 5 years ago

Alas, you seem to have missed my edit above. I rather prefer the files to be gzip instead of zip files

I kept the zip files as they are more portable across operating systems, rather than catering to just the one use case listed here. Also, the existing data page links didn't require changing.

because the Dask autoscheduler library for analysis pipelines does not support zip format and Dask is used as the default remote file reader for the intake library, which is what I'm currently using to create a documented and versioned data set.

I'm not convinced that we need to host specific files for your specific scheduling platform. What if you change from Dask to another scheduling system and then request that I change from .gz files back to .zip or another format? I'd prefer to provide the general solution and let specific use cases be solved another way.

Instead, why not use your prepared Dropbox .gz links in your Dask-specific code? That way you have complete control over the input files and don't rely on an external party to host or fix them for you.
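
Something along these lines, perhaps (a sketch; the compression is given explicitly because the ?dl=1 suffix hides the .gz extension from pandas' inference):

import pandas as pd

# Sketch: pin the self-hosted Dropbox link (dl=1 gives a direct download) in
# your own pipeline code instead of depending on the Zooniverse-hosted zips.
FAN_CATALOG = ("https://www.dropbox.com/s/o70a1x8xzrxfyvg/"
               "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.gz?dl=1")
fans = pd.read_csv(FAN_CATALOG, compression="gzip")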