wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes
http://wwood.github.io/singlem/
GNU General Public License v3.0

SingleM stuck running on specific SRAs #147

Open AnneliektH opened 1 year ago

AnneliektH commented 1 year ago

Hi there,

I've been running SingleM on a set of SRA files. For most of them it is fast and takes < 10 minutes. For some SRAs, however, it seemingly runs forever and gets stuck. One example of such a file is ERR2205747. Using top, I can see that the system is using CPU, so I think it is doing something? I run SingleM with the following command (only looking for a specific protein):

singlem pipe --sra-files sra/ERR2205747 --otu-table ERR2205747.csv \
  --singlem-packages path/to/payloaddirectory/S3.40.ribosomal_protein_L11_rplK.spkg \
  --no-assign-taxonomy --threads 8

wwood commented 1 year ago

Hi,

Thanks for the report. I'm wondering whether there is something fishy about that sample as stored in the SRA. When you say the system uses CPU, is it singlem or kingfisher that is using the CPU?
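One way to check which process is busy is to list the running processes with their CPU usage and keep only the singlem and kingfisher lines. This is a minimal sketch: `filter_tools` is a hypothetical helper name, and the pattern matches anywhere on the command line, so a process that merely has one of those words in a file path would show up too.

```shell
# Keep only ps lines that mention singlem or kingfisher.
# The [s]/[k] brackets stop grep from matching its own command line.
filter_tools() {
  grep -E '[s]inglem|[k]ingfisher'
}

# Show pid, %CPU, elapsed time and full command line for every process.
ps -A -o pid,pcpu,etime,args | filter_tools \
  || echo "no singlem or kingfisher process found"
```

Running it a few times while the job appears stuck shows whether the %CPU is attributed to singlem itself or to a kingfisher child process.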

A likely workaround for this is to download the data from ENA rather than SRA (you could use kingfisher directly for this) - the fastq format may help.
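The workaround could look roughly like the sketch below. It rests on assumptions: that the run is paired-end, that kingfisher writes `ERR2205747_1.fastq.gz` and `ERR2205747_2.fastq.gz` into the current directory, and that `ena-ftp` is a suitable download method on your network. `run_cmd` and `DRY_RUN` are hypothetical helpers that print the commands by default so they can be checked before anything is downloaded.

```shell
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to really run them (needs kingfisher and singlem on PATH).
DRY_RUN="${DRY_RUN:-1}"
run_cmd() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run="ERR2205747"
spkg="path/to/payloaddirectory/S3.40.ribosomal_protein_L11_rplK.spkg"

# Fetch the reads as FASTQ from ENA rather than as an .sra file.
run_cmd kingfisher get -r "$run" -m ena-ftp

# Run singlem on the FASTQ files (assumed paired-end file names).
run_cmd singlem pipe -1 "${run}_1.fastq.gz" -2 "${run}_2.fastq.gz" \
  --otu-table "${run}.csv" \
  --singlem-packages "$spkg" \
  --no-assign-taxonomy --threads 8
```

If the run turns out to be single-end, drop the `-2` argument and point `-1` at whatever single FASTQ file kingfisher produced.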

That specific sample is also missing from the sandpiper database, even though it seems like it should be there (published 2017, metagenomic), but I no longer have logs showing what went awry.

Let me know how you go. Thanks.