feat: improve lib type relationship inference for SRA samples

balajtimate commented 5 months ago

Is your feature request related to a problem? Please describe. With #157, the library type relationship is determined by aligning the reads separately and comparing the ratio of concordant pairs to the number of aligned reads. This works great for SRA samples where the seq_ids don't confirm to the Illumina seq_id formats. The problem is with samples where the seq_ids are in the appropriate format, but they don't contain any information regarding their mate, so HTSinfer treats them as single. In this case, it would be beneficial to double-check the relationship by alignment, trying to minimize the number of samples deemed "single", when two files were given as input.

Also, when the the relationship is deemed split_mates based on the ratio of concordant reads, the type of library for the samples still remain either single (in the case of above example) or null (for SRA samples where the seq_id is not in the correct format). In this case, I would say it's safe to assume that input 1 is the first_mate and input 2 is second_mate, but since it's not confirmed from the seq_ids, so additional categories could be added (first_mate_assumed, second_mate_assumed) as suggested, and the lib type updated after the alignment.

Describe the solution you'd like Check the relationship by forcing the alignment when both inputs are deemed "single"; update lib type for individual samples based on alignment results.

balajtimate commented 5 months ago

One more thing would be the optimization of the cutoff value for the ratio of concordant read pairs / aligned reads. I ran the testing with the 515 PE samples, filtering out the ones where no lib source was inferred (therefore no alignment should be done), 372 samples were left. Out of that, 255 had no seq_ids matching the Illumina format, so the reads were alignment to decide the relationship, the first panel shows the distribution of the ratio. The rest (117 samples) had matching seq_ids, but no mate information, so were deemed single. I reran the test by forcing the alignment of these samples, the second panel shows the result.

So most of the samples fall in the 0.85-1 range. Currently, the cutoff for the ratio for the samples to be deemed split_mates is 0.95. The mean for the first panel (without the matching seq_ids) is 0.894, the median 0.939. For the second panel, the mean is 0.887, and the median 0.936.

The more interesting thing is checking how the inferred library type relationships change.

So first I checked the relationship types on all (372) samples (where 255 samples were aligned and 117 weren't - the ones that weren't were immediately deemed single and the relationship not_mates). Then by aligning (adding) the 117 samples with correct format seq_ids but no mate info, and then by lowering the cutoff value to 0.9 and 0.85. At this point most PE samples are deemed split_mates, which I would say seems logical in the case of PE samples.

I know the overall sample size is not too big, but I think based on this we could lower the default value for this parameter to 0.85. There are still 65 not_mates samples (and null_relationship is when the number of aligned reads or concordant were 0).

uniqueg commented 5 months ago

Thanks a lot @balajtimate - nice analysis. I agree with lowering the cutoff accordingly.

And of course I agree with addressing the main issue as you suggest (as we had already discussed previously). Wasn't another conclusion that we can never confidently say whether a library is single? Because they'll always be indistinguishable from the first or second mate of an SRA library that retained the read IDs but cut off the mate info (and where the mate is missing)?

Perhaps the whole category of reporting type results individually for each sample doesn't make too much sense. If there's only one sample, then you can always go ahead and analyze that sample as if it were a single-ended library, even if it's just one of two mate files. And if two samples are given, all that really matters is the mate relationship. It made more sense in a world where the mate info was nicely given in the seq ID :)

Anyway, I'm not proposing to change all of this now. Let's stick to what you suggest. It will give us what we need - and we can still think about refactoring/simplyfing the type inference some other time (or not).

balajtimate commented 5 months ago

Wasn't another conclusion that we can never confidently say whether a library is single? Because they'll always be indistinguishable from the first or second mate of an SRA library that retained the read IDs but cut off the mate info (and where the mate is missing)?

Yes, that's correct. :confused: As a last resort, we could also check the file name, as PE samples are named _1.fastq and _2.fastq when downloaded through fastqdump, and if either is missing, we could assume it's PE but missing pair, and single otherwise.

zavolanlab / htsinfer

feat: improve lib type relationship inference for SRA samples #160