wfondrie / ppx

A Python interface to proteomics data repositories
https://ppx.readthedocs.io
MIT License

File List from MassIVE Accelerated API #8

Closed by mwang87 3 years ago

mwang87 commented 3 years ago

Hi Will,

Awesome work. I was just chatting with Wout, and he let me know that you're using the FTP server to get all the files from MassIVE. If a dataset has a lot of files, traversing the FTP tree can be incredibly slow. I ran into this exact problem for so many projects! So I created a dataset files cache (which also precomputes some other things). If you're interested, it might be a better way to get all the files for a MassIVE dataset:

https://gnps-datasetcache.ucsd.edu/datasette/database/filename

There are web APIs for all of it, since behind the scenes it's a SQLite database with Datasette on top of it.
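Since Datasette exposes each table over HTTP, a dataset's file list can be fetched with a plain URL query. Here is a minimal sketch of building such a query URL in Python; the table name and filter parameters mirror the endpoint linked above, and `cache_query_url` is a hypothetical helper, not part of ppx:

```python
import urllib.parse

# The "filename" table of the GNPS dataset cache, queried via
# Datasette's JSON output format (assumption: standard Datasette URLs).
CACHE_BASE = "https://gnps-datasetcache.ucsd.edu/datasette/database/filename.json"

def cache_query_url(accession: str) -> str:
    """Build a Datasette query URL for one MassIVE dataset accession.

    ``dataset__exact`` filters rows to the given accession, ``_sort``
    orders by file path, and ``_size=max`` asks for as many rows per
    page as the server allows.
    """
    params = {
        "_sort": "filepath",
        "dataset__exact": accession,
        "_size": "max",
    }
    return CACHE_BASE + "?" + urllib.parse.urlencode(params)
```

One could then fetch that URL with any HTTP client and read the rows out of the JSON response.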

Let me know what you think!

Best,

Ming

wfondrie commented 3 years ago

Thank you Ming - this would be great! A better way to find the files on MassIVE would be an excellent change. I have a couple of questions:

mwang87 commented 3 years ago

Hey Will,

Yes, we keep these up to date every 24 hours and additionally every two weeks I download every single open format mass spec file and generate a summary for it (i.e. MS1, MS2 counts, and some metadata).

As for the API, I'm fixing a usability bug I noticed, but here is the web API endpoint that I personally use a lot:

https://gnps-datasetcache.ucsd.edu/datasette/database/filename.csv?_stream=on&_sort=filepath&dataset__exact={DATASETACCESSION}&_size=max

I've gotta work to document it better, but it's a pretty standard Datasette API.
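The `.csv?_stream=on` endpoint above returns the file list as plain CSV, so parsing it only takes the standard library. A small sketch, where the `dataset` and `filepath` column names are assumptions inferred from the query parameters in the URL:

```python
import csv
import io

def parse_file_list(csv_text: str) -> list[str]:
    """Extract file paths from the cache's CSV response.

    Assumes the response has a header row containing a ``filepath``
    column, matching the ``_sort=filepath`` parameter in the endpoint.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["filepath"] for row in reader]

# Illustrative response shape (not real data):
sample = (
    "dataset,filepath\n"
    "MSV000000001,peak/file1.mzML\n"
    "MSV000000001,peak/file2.mzML\n"
)
# parse_file_list(sample) -> ["peak/file1.mzML", "peak/file2.mzML"]
```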

wfondrie commented 3 years ago

Awesome - this sounds perfect then. I'll work on getting ppx switched over. Thanks!

mwang87 commented 3 years ago

Well, one thing I've been doing is using it as a cache, so I think: keep your current implementation and try using this as the first line of getting the data. If it errors out or no files are returned, then fall back on the FTP.
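That fallback pattern is easy to express. A minimal sketch, where `query_cache` and `list_ftp` are hypothetical callables standing in for whatever lookups ppx actually uses:

```python
from typing import Callable

def remote_files(
    accession: str,
    query_cache: Callable[[str], list[str]],
    list_ftp: Callable[[str], list[str]],
) -> list[str]:
    """List a dataset's files, trying the cache first.

    Falls back on the FTP traversal if the cache request raises
    or returns no files, per the suggestion above.
    """
    try:
        files = query_cache(accession)
    except Exception:
        files = []  # treat any cache failure as a miss
    if not files:
        files = list_ftp(accession)
    return files
```

This keeps the slow-but-authoritative FTP listing as the safety net while the cache serves the common case.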