wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

Support streaming from cloud storage #270

Open SHuang-Broad opened 2 years ago

SHuang-Broad commented 2 years ago

Hi,

We routinely use NanoPlot in our cloud-native pipelines and would love to see it support streaming from cloud storage.

Based on a quick look at the code, it seems this would require at least one dependency, namely pysam, to support streaming as well. Are there any other "patches" necessary to support streaming?

Thanks, Steve

wdecoster commented 2 years ago

Hi Steve,

Interesting suggestion! I have to admit I don't immediately know how to adapt the code for this. Since you ask about pysam, are you mainly interested in BAM/CRAM files as input, which you would then specify using a URL?

Cheers, Wouter

SHuang-Broad commented 2 years ago

Our current pipeline uses Google Cloud Storage (gs://...), but I could see users benefiting from support for all major cloud service providers, e.g. AWS and Azure.

If NanoPlot only accesses the BAM through pysam, then pysam is probably the dependency that needs to support streaming, and the change will be minimal.

This is definitely an optimization, so it's not an urgent need.

SHuang-Broad commented 2 years ago

Regarding support for gs://... paths, I think the following link might be useful: https://github.com/pysam-developers/pysam/issues/592
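
To make the pysam route discussed above a bit more concrete, here is a minimal sketch; it is not NanoPlot's actual code. It assumes the installed pysam wraps an htslib build with libcurl and the GCS/S3 plugins enabled, in which case `pysam.AlignmentFile` can be given a URL in place of a local path. The helper names `is_remote` and `iter_read_lengths` and the `REMOTE_SCHEMES` set are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse

# Assumption: schemes the local htslib build can open. gs:// and s3://
# only work when htslib was compiled with libcurl and the matching plugins.
REMOTE_SCHEMES = {"gs", "s3", "http", "https", "ftp"}


def is_remote(path):
    """Return True if `path` looks like a URL htslib might stream from."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def iter_read_lengths(path, limit=None):
    """Yield query lengths of reads in a BAM given by local path or URL.

    Hypothetical helper: pysam is imported lazily so that `is_remote`
    stays usable on systems where pysam is not installed.
    """
    import pysam

    # check_sq=False lets unaligned BAMs (no @SQ header lines) be opened.
    with pysam.AlignmentFile(path, "rb", check_sq=False) as bam:
        # until_eof=True reads sequentially, so no index is required.
        for i, read in enumerate(bam.fetch(until_eof=True)):
            if limit is not None and i >= limit:
                break
            yield read.query_length
```

With such a build, something like `list(iter_read_lengths("gs://<bucket>/<sample>.bam", limit=1000))` should read only the start of the file rather than downloading it in full, provided htslib can find credentials (for GCS it is documented to read an access token from the `GCS_OAUTH_TOKEN` environment variable). If pysam already handles the URL, the calling code would not need to distinguish local from remote input at all; `is_remote` is only useful for skipping local file-existence checks.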

wdecoster commented 2 years ago

Do you have such a (public?) gs://... path for me to test things on? All our data is processed locally.

SHuang-Broad commented 2 years ago

We don't have any public data to share, mainly because downloading data from cloud storage incurs costs for the owner of the data unless something like Requester Pays is enabled, so public access could easily be abused by malicious actors.

I think these files from the DeepVariant team might work, but they may require you to set up a Google Cloud account: https://console.cloud.google.com/storage/browser/deepvariant/pacbio-case-study-testdata?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false

I'm sorry if this is too much trouble. Thanks for getting on top of this! Steve