wejlab / MetaScope

An R-based approach for preprocessing and aligning 16S, metagenomic, and metatranscriptomic data (PathoScope version 3.0)
GNU General Public License v3.0

metascope_id running into ID length issues #22

Closed susheelbhanu closed 1 year ago

susheelbhanu commented 1 year ago

Hey @aubreyodom,

I'm running the metascope_id step on some samples and running into the issue below, regarding the length of the requested IDs.

==> ../metascope_mock_SILVA_results/logs/taxonomy/BT24_taxonomy.log <==
Warning message:
In dir.create(out_dir) :
  '/hdd0/susbus/nf_core/data/hebe_16S/metascope_mock_SILVA_results/BT24' already exists
Reading .bam file: /hdd0/susbus/nf_core/data/hebe_16S/metascope_mock_SILVA_results/BT24/BT24.bam
        Found 215437 reads aligned to 145158 NCBI accessions
Obtaining taxonomy and genome names
Accession list broken into 1452 chunks
Error : Request-URI Too Long (HTTP 414)
Attempt #2 Chunk #1
Error : Request-URI Too Long (HTTP 414)

Do you have a workaround for this, especially when one is working with a complex microbiome? Sorry if I missed something in the documentation.

Thanks!

aubreyodom commented 1 year ago

Hi @susheelbhanu ,

I was aware that this was an issue (according to a former developer) but I couldn't find an example with which to replicate the error. Would you be willing to send me the BAM file or .csv.gz file that metascope_id is accessing so that I can try to find a workaround? Happy to send you a Google Shared Drive link so storage space isn't a roadblock.

Thanks, Aubrey

susheelbhanu commented 1 year ago

Thanks @aubreyodom for being willing to tackle this. I've uploaded two of the BAM files to Zenodo here: https://doi.org/10.5281/zenodo.8327966

Caution: each file is at least 6 GB, so let me know if you can't access the files.

P.S. I used the same raw reads to test the full RefSeq database and the SILVA138 indices that you had published in the PathoScope2.0 paper. I had no such issues with metascope_id using RefSeq; it's SILVA138 that is causing this ID length issue. Not sure if this information is relevant, but FYI just in case.

Thank you!

aubreyodom commented 1 year ago

@susheelbhanu The problem makes more sense now that you mention using the SILVA138 files. The SILVA analysis in our paper was run a few years ago on PathoScope by another researcher, so I haven't tried to run it on MetaScope. metascope_id is currently formatted to grab NCBI accession numbers for RefSeq sequence names, so it's not going to work out of the box with SILVA. I'll look into it, but it may not be a quick fix for that reason.
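
For readers hitting the same HTTP 414: one generic way to keep request URLs short is to size chunks by the encoded URL length rather than by a fixed number of IDs. The sketch below is in Python purely for illustration and is not MetaScope code; the function name `chunk_ids_by_url_length`, the base URL, and the 2000-character budget are all assumptions. It would not by itself make SILVA sequence names resolvable as NCBI accessions.

```python
from urllib.parse import quote


def chunk_ids_by_url_length(ids, base_url, max_url_len=2000):
    """Greedily pack IDs into chunks whose request URL fits in max_url_len.

    Hypothetical helper: base_url and max_url_len are illustrative
    assumptions, not settings taken from MetaScope or any server.
    """
    chunks, current = [], []
    current_len = len(base_url)
    for raw_id in ids:
        encoded_len = len(quote(raw_id, safe=""))
        # Every ID after the first is preceded by "," encoded as "%2C" (3 chars).
        extra = encoded_len + (3 if current else 0)
        if current and current_len + extra > max_url_len:
            # Adding this ID would exceed the budget: close the chunk.
            chunks.append(current)
            current, current_len = [], len(base_url)
            extra = encoded_len
        # An ID that alone exceeds the budget still gets its own chunk.
        current.append(raw_id)
        current_len += extra
    if current:
        chunks.append(current)
    return chunks


# Usage with placeholder IDs and a placeholder endpoint.
example_ids = ["PLACEHOLDER.1.1500", "PLACEHOLDER.2.1480", "PLACEHOLDER.3.1510"]
for chunk in chunk_ids_by_url_length(example_ids, "https://example.org/lookup?id=", 60):
    print(len(chunk), chunk)
```

Long sequence names fill the budget faster than short accessions, which is why a fixed per-chunk count that works for one database can overflow with another.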

susheelbhanu commented 1 year ago

Thanks @aubreyodom. I suppose that makes sense. In that case, and maybe related to the other open issue: how does one use *metascope_id* with a different database?

Alternatively, is there a custom function to pull the names from the "different" database? I suppose these are all feature requests for now. Either way, thank you for looking into this.

susheelbhanu commented 1 year ago

This might help too, @aubreyodom : https://github.com/pirovc/multitax

aubreyodom commented 1 year ago

Ok, sorry for the delay! Since this is a SILVA issue, I'm going to close this one and leave the other issue open, since it is more relevant. Here's the update I posted in the other issue @susheelbhanu:

Just an update for folks wanting to use another reference database - we are actively working on this issue and should have an update for metascope_id in the coming months (if not sooner). I'm particularly interested in trying out Greengenes2 and SILVA myself. Stay tuned.