zooniverse / planet-four

Identify and measure features on the surface of Mars
https://www.planetfour.org/
Apache License 2.0

Header line of data dump corrupted #96

Closed: michaelaye closed this issue 10 years ago

michaelaye commented 10 years ago

The header line of last Sunday's data file is corrupted. It looks like this:

$ head -n 1 2014-06-01_planet_four_classifications.csv
2014-06-01_planet_four_classifications.csv0000644000000000000003112071610612342625137017203 0ustar  rootroot"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"

whereas before it looked clean, like this:

$ head -n 1 2014-02-02_planet_four_classifications.csv 
"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"

michaelaye commented 10 years ago

Additionally, there is a lot of binary data at the end of the file, which standard CSV parsers interpret as an empty line.
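
A quick way to see this from Python (a minimal sketch, using the filename from above):

import os

# The file ends in a long run of NUL bytes (tar padding) rather than a CSV row.
with open("2014-06-01_planet_four_classifications.csv", "rb") as f:
    f.seek(-1024, os.SEEK_END)   # look at the last 1 KiB
    tail = f.read()
print(tail.count(b"\x00"))       # almost all of it is NUL bytes, not CSV text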

chrissnyder commented 10 years ago

Looking into this.

michaelaye commented 10 years ago

FYI: I confirmed the exact same problems in yesterday's dump file, so it's not a one-off hiccup.

chrissnyder commented 10 years ago

I think this has to do with a problem we ran into last week during data processing.

If this problem persists this week, I'll look into it again.

michaelaye commented 10 years ago

But you saw what I wrote above: it's in both the June 2nd and June 9th data dumps? So the problem must have existed for more than a week?

michaelaye commented 10 years ago

Same story:

2014-06-15_planet_four_classifications.csv0000644000000000000003122202146112347321372017203 0ustar  rootroot"classification_id","created_at","image_id","image_name","image_url","user_name","marking","x_tile","y_tile","acquisition_date","local_mars_time","x","y","image_x","image_y","radius_1","radius_2","distance","angle","spread"
mschwamb commented 10 years ago

Weirdly, I'm not having this issue, so I took a deeper look. When I start uncompressing the .gz I get 2014-06-15_planet_four_classifications.csv, which has the weirdness Michael reported. But my Archive Utility says it is still going, expanding 2014-06-15_planet_four_classifications.csv. Interestingly, after the expansion finishes (which takes some time), another file, '2014-06-15_planet_four_classifications.csv 2', appears without that problem, and 2014-06-15_planet_four_classifications.csv disappears. Is this just weird uncompressing behavior between Archive Utility and gzip because the file is so big? Michael, what are you using to unzip it?

I've put my uncompressed version of '2014-06-15_planet_four_classifications.csv 2' on Dropbox. It's a ~3.3 GB file: https://dl.dropboxusercontent.com/u/56971802/2014-06-15_planet_four_classifications%202.csv

mschwamb commented 10 years ago

Also, give it about 3 hours from now for the file to finish uploading to Dropbox before trying to grab it.

michaelaye commented 10 years ago

I'm using gunzip on Linux, and I did not look at anything before gunzip had finished.

And thanks, but you didn't need to do the Dropbox upload. I coded work-arounds that replace the header line with what I know it should be, and I scan the whole file for empty lines, which gets rid of the empty line at the end (of course I could just remove the last line, but that's too hackish for my taste). These work-arounds don't remove the need to solve this, though, IMHO: one day we might add other columns to the database dump, and the reduction routines should simply be able to read the header line from the first row.
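
Roughly along these lines (a simplified sketch, not my actual code; EXPECTED_HEADER and repaired_lines are made-up names):

import csv

# Substitute the known-good header and drop empty/NUL-padding lines on the fly.
EXPECTED_HEADER = ('"classification_id","created_at","image_id","image_name",'
                   '"image_url","user_name","marking","x_tile","y_tile",'
                   '"acquisition_date","local_mars_time","x","y","image_x",'
                   '"image_y","radius_1","radius_2","distance","angle","spread"')

def repaired_lines(path):
    with open(path, newline="") as f:
        next(f)                          # discard the mangled first line
        yield EXPECTED_HEADER + "\n"     # replace it with what it should be
        for line in f:
            if line.strip("\x00 \r\n"):  # skip empty lines and binary padding
                yield line

rows = list(csv.reader(repaired_lines("2014-06-15_planet_four_classifications.csv")))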

mschwamb commented 10 years ago

The file might need to be uncompressed again with gunzip once it's a .csv; you might see if that works. The Mac Archive Utility seems to work. Anyway, I'll let the Dropbox upload continue in case you want to compare for any other differences, or the development team wants to look at it.

michaelaye commented 10 years ago

That's it:

$ file 2014-06-15_planet_four_classifications.csv
2014-06-15_planet_four_classifications.csv: POSIX tar archive (GNU)

So the solution is to provide the files with the correct and standard extension, .tar.gz. Once I used tar zxvf instead of only gunzip, all is fine.
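
In Python terms (a sketch; this assumes the download kept its original .gz name), tarfile can undo both layers at once:

import tarfile

# mode "r:gz" handles the gzip layer and the tar layer in one go,
# regardless of what the file extension claims.
with tarfile.open("2014-06-15_planet_four_classifications.csv.gz", mode="r:gz") as tf:
    tf.extractall()   # writes the clean CSV, without tar header/trailer bytes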

michaelaye commented 10 years ago

This also solved the last empty line, by the way.

adammcmaster commented 10 years ago

I think this is due to something getting switched when I was updating the backup scripts recently. Today’s backup is already in progress, but I’ll fix it for tomorrow’s.

adammcmaster commented 10 years ago

Oh, wait, ignore me :) I thought this was referring to the raw database dumps. It’s not related to what I was working on.

chrissnyder commented 10 years ago

Got me then. @parrish ?

camallen commented 10 years ago

Seems to be linked to here. I think the file extension should be '.tar.gz', as the compress method uses tar.
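
In other words (an illustrative Python sketch with hypothetical paths, not the actual backup code), the output of a tar-then-gzip step should carry both extensions:

import tarfile

# Writing the dump as tar + gzip: the name should advertise both layers.
with tarfile.open("2014-06-15_planet_four_classifications.tar.gz", "w:gz") as tf:
    tf.add("2014-06-15_planet_four_classifications.csv")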

parrish commented 10 years ago

I can't really understand why you're ending up with a mangled header. The format is tar/gzip, which hasn't changed. Archive Utility always appends a "2" to the filename for some random reason, but still produces a valid file. tar xzvf produces valid output too. I'll try changing the extension to be more explicit.

michaelaye commented 10 years ago

It was only a 'mangled' header when expecting text. I was looking at the tar itself, which in this single-file case is basically the text file with the tar header and trailer attached. The 'corruption' was the tar header. All this because I only used gunzip to unpack, since the extension indicated plain gzip rather than tar/gzip.
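
This is easy to confirm (a minimal sketch): opened as an uncompressed tar, the gunzip output contains the real CSV as its single member.

import tarfile

# The gunzipped ".csv" is really a plain tar; its one member is the actual CSV.
with tarfile.open("2014-06-15_planet_four_classifications.csv", mode="r:") as tf:
    print(tf.getnames())   # -> ['2014-06-15_planet_four_classifications.csv']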