0 SNPs reported for longer reads (undocumented maximum read length?)

bsmith89 commented 1 year ago

In a pinch, I decided to apply GT-Pro for an "off-label" use, where I fed it whole genome sequence instead of reads.

This worked surprisingly well, but I had two issues I had to work around:

First, I needed to convert a FASTA formatted genome sequence into FASTQ. Not a problem, but I did want to confirm that GT-Pro isn't sensitive to quality scores (since I had to provide a "dummy" score for each base). Quality is ignored, right?

Updating GT-Pro to accept FASTA formatted sequence might be nice for some applications.

Second, and much more importantly, there seems to be a maximum read length, after which GT-Pro starts reporting no SNPs at all. For instance, if I write a ~3 Mbp reference genome as a single read in a FASTQ file, it reports 0 SNPs. I got around this by converting the full-length sequence into tiles (e.g. of 200 bp), at which point it started reporting the thousands of SNP sites correctly. (Note: to make sure I wasn't losing kmers at "read" boundaries, I overlapped my tiles by 32 bp.)

This isn't too surprising to me (since GT-Pro is intended for short-reads), but I didn't notice any mention of a length limit in the documentation. (Admittedly, I didn't look all that hard.) I also figure that some users may want to apply GT-Pro to long reads, and I expect this to give secretly incorrect results.

I played around with the tile lengths and I start getting 0-SNPs somewhere around 500 bases.

Am I doing something silly?

nick-youngblut commented 1 year ago

@bsmith89 do you know if @zjshi is still maintaining GT-Pro (and CallM)?

The last commit was in Aug 22, 2022, and @zjshi does not seem to be responding to reported issues, including this one from January. I'm asking you @bsmith89, since you are in the Pollard lab, which collaborated on the GT-Pro paper.

If GT-Pro is not longer maintained, then how should one go about using StrainFacts? I'm guessing that MIDAS(2) does not scale nearly as well as GT-Pro, and the StrainFacts documentation currently recommends using GT-Pro:

GT-Pro is the preferred metagenotyper for StrainFacts.

zjshi commented 1 year ago

Hi Byron, my apology again for my late response. It turns out all alerts of this repo have been unfortunately redirected to a group account in Biohub instead of my email account. I can now receive alerts after I was dropped from the group account. For your question, GT-Pro indeed does not support reads that are longer than 500bp, which was hard coded at line 460 in /src/gt_pro.cpp. It may be changed plus recompiled to enable processing of longer reads. However, you might might observe performance drops due to such a change.

zjshi commented 1 year ago

Hi Nick, I've transitioned out of the lab since last year, but I will continue to address most pressing issues related to GT-Pro. Since I have a full time job competing for my time, I hope you will understand it could take longer than usual to respond. I may most likely make my responses during the weekends. Request for new features could be challenging to meet, but will be considered case by case.

nick-youngblut commented 1 year ago

Hi Nick, I've transitioned out of the lab since last year, but I will continue to address most pressing issues related to GT-Pro.

@zjshi I know how that is. Sadly, one either has to devote their free time to maintaining code from past projects (and finishing publishing manuscript from their last job) while working a new full-time job... or just abandon the project. I've run into that most recently for ResMiCo, with many nights & weekends spent on the project.

I completely understand not having time to keep working on old projects (I'm behind with some github issues now), but I just wanted to know whether or not I would have to fork your repo and fix any bugs if I ran into them (or just questions about software usage). I'm glad to hear that you are still able to devote a bit of time to gt-pro. I'd be happy to help, if possible.

bsmith89 commented 1 year ago

GT-Pro indeed does not support reads that are longer than 500bp, which was hard coded at line 460 in /src/gt_pro.cpp. It may be changed plus recompiled to enable processing of longer reads. However, you might might observe performance drops due to such a change.

@zjshi, thanks! Good to know the exact cut-off. I can submit a PR to update the documentation. I might play with increasing that limit if it seems important. While I'm not super familiar with C++, I imagine a compile-time flag could be added to set that constant...? Might try doing that myself at some point.

Very thankful that you're still maintaining GT-Pro!

zjshi commented 1 year ago

@nick-youngblut Thanks for you understanding and sharing your experience! Again please feel free to let me know any further questions or issues you will run into in the future :)

zjshi commented 1 year ago

@bsmith89 Compile-time flag might do the job for a specific max read length, but I feel it might be annoying to compile it every time for each individual dataset. One way is to compile the program with supplying a large constant that is likely always longer than any single long read, e.g. something like 5M. The second way is to chop a input long read into multiple short reads that are shorter than 500bp. To make sure every nucleotide base will be processed, every two adjacent short reads should have a small overlap (l=30). Another way is to make the constant an variable in the program and the add a flag that can be used by the user to supply read length cap. This idea came up when we wrote the program, however, we eventually did not go with it due to a performance preference. Yes, I believe I should be able to find a good and sustainable way to maintain GT-Pro and other repos for the long run.

bsmith89 commented 1 year ago

These are all good ideas! In the meantime, I'd suggest that GT-Pro docs should give an informative warning when reads are too long and this should be clearly stated in the documentation. :)

zjshi / gt-pro

0 SNPs reported for longer reads (undocumented maximum read length?) #49