weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org

Downloading fails for files with no Content-Disposition #1659

Open henrykironde opened 2 years ago

henrykironde commented 2 years ago

Example packages:
1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py Sample URL: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

2: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py Sample URL: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip
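
For reference, one way a downloader could choose a local file name when the response carries no Content-Disposition header is to fall back to the URL itself. This is only a rough sketch, not how retriever currently does it: `guess_filename` is a hypothetical helper, and the idea that a `file` query parameter holds the name is an assumption based on the Wiley URL above.

```python
from email.message import Message
from os.path import basename
from urllib.parse import parse_qs, urlsplit


def guess_filename(url, content_disposition=None, default="download"):
    """Pick a local file name for a download (hypothetical helper)."""
    if content_disposition:
        # Let the standard library parse the header, including quoted names.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            # Drop any directory part the server may have included.
            return basename(name)

    parts = urlsplit(url)
    # Assumption: some servers pass the name in a "file" query parameter.
    query = parse_qs(parts.query)
    if query.get("file"):
        return basename(query["file"][0])

    # Otherwise use the last path segment, or the default if it is empty.
    return basename(parts.path) or default
```

With the two sample URLs above this would give `ecy1792-sup-0001-DataS1.zip` for the Wiley link, but only `csvdownload` for the USDA link, so a recipe would probably still need a way to name the file explicitly.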

ethanwhite commented 2 years ago

The second one is fixed by spoofing a browser user agent; in other words, Wiley (the publisher) is trying to block automated downloads. I tested this with wget, but we should be able to do the same thing in Python.
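
A minimal sketch of what the Python side might look like, using only the standard library. The names `build_request` and `download` and the specific user agent string are illustrative assumptions, not retriever's actual code, and whether this particular string gets past the publisher's check has not been verified here.

```python
import shutil
import urllib.request

# Assumption: any common browser string; the exact one used with wget is not known.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a request that identifies itself with a browser-like user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, user_agent=BROWSER_USER_AGENT, timeout=60):
    """Stream the response body for url into the file at dest."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

If the download code goes through a different HTTP library, the same idea should apply: set the `User-Agent` header on the request before sending it.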

As you mentioned earlier, the first one is a mess. Not only does the URL render as HTML, but the data itself isn't in the HTML; it's rendered by JavaScript, so I think you'd basically have to cut and paste the text out of the browser. I don't have any good ideas for this one other than emailing the data providers and asking them to provide a better option. We might be able to scrape it out somehow, but I don't think it's worth it for one dataset.