zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0

Analyze organism/source #108

Open · uniqueg opened this issue 1 year ago

uniqueg commented 1 year ago
  1. Compare predictions with reported source
  2. Quantify accuracy: correct, false, undecided

One goal is to minimize false predictions. Here we can try playing with `--library-source-min-match-percentage` and `--library-source-min-frequency-ratio`.

However, another goal is also to make sure that we get reliable predictions at least for the most common model organisms (it's not useful if we never call a mouse because, based on our rRNA gene transcripts, we cannot sufficiently distinguish between mouse and shrew mouse). On that front, there are basically two things we can do:
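For reference, here is a minimal sketch of the decision rule these two parameters control, as I understand it (function name and defaults are illustrative, not HTSinfer's actual code):

```python
def decide_library_source(
    pct_by_org: dict[str, float],
    min_match_pct: float = 2.0,    # assumed default of --library-source-min-match-percentage
    min_freq_ratio: float = 2.0,   # default of --library-source-min-frequency-ratio (per this thread)
) -> str | None:
    """Return the inferred organism, or None (Undecided) if evidence is too weak.

    `pct_by_org` maps each organism to the percentage of the signal assigned
    to it, e.g. {"mmusculus": 30.0, "omeridionalis": 17.0}.
    """
    if not pct_by_org:
        return None
    ranked = sorted(pct_by_org.items(), key=lambda kv: kv[1], reverse=True)
    (top_org, top_pct), *rest = ranked
    # Check 1: the top organism must attract enough of the signal at all.
    if top_pct < min_match_pct:
        return None
    # Check 2: the top organism must beat the runner-up by a sufficient ratio.
    if rest and rest[0][1] > 0 and top_pct / rest[0][1] < min_freq_ratio:
        return None
    return top_org
```

For instance, the mouse sample discussed below (30% mmusculus vs. 17% omeridionalis) gives a ratio of 30/17 ≈ 1.76 < 2, so it comes out Undecided under the default.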

balajtimate commented 1 year ago

I'm currently running the tests on the 770 samples with different `--library-source-min-match-percentage` values, to try to optimize the parameter for correctly identified organisms. I just wanted to come back to this issue, continuing from this comment: https://github.com/zavolanlab/htsinfer/issues/107#issuecomment-1521811412

To summarize: of our 77 in-lab samples, which consisted mostly of model organisms, only 5 had the organism falsely inferred. One was a mouse sample, where the issue was that the ratio between the 1st and 2nd inferred organisms was less than 2 (30% mmusculus vs. 17% omeridionalis, which is odd, since that's a butterfly; for the other mouse samples the second most frequent organism is usually another mouse, like mspicilegus). The other 4 were E. coli samples; I checked, and E. coli is not in our list of transcripts. I suppose some RP sequences could be added?

balajtimate commented 1 year ago

So unfortunately, as it turns out, there's no point lowering `--library-source-min-match-percentage`: there is no difference between the results of the different parameter settings.

[Figure: barplot of predicted organisms across parameter settings]

However, of the 226 Undecided samples (29% of the 770), 77 had the organism correctly matched except that the frequency ratio between the 1st and 2nd most frequent source was <2 (the default), and 94 had an incorrect organism inferred. We could try setting this to 1, as none of the samples had a frequency ratio lower than that (except for the 55 where no organism was identified at all). We would get an additional ~10% increase in correctly identified organisms, but at the same time a ~12% increase in false predictions. What do you think?

Here are the results from the rerun with `--library-source-min-frequency-ratio` set to 1.

[Figure: barplot of predicted organisms, min-frequency-ratio = 1]

There are 55 samples (7%) for which no organism could be determined. It's also worth noting that of the 229 (29%) falsely identified samples, 52 had the correct organism in 2nd place; the first one simply had more counts. I'm not sure what could be done in this case, given that most of the time the 1st and 2nd most frequent organisms are not even closely related (or the problem is with the samples themselves, e.g. incorrect metadata or contamination from other species).

Attached: mined_test_data_result_libfreq_1_0806.csv
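To make the trade-off explicit, the calls could be re-derived from the attached per-sample results at several cut-offs; a rough sketch (the column names follow the R1/R2 table further down in this thread and are an assumption about the CSV layout):

```python
import pandas as pd

# Per-sample results attached above; column names are assumed from the
# table posted later in this thread.
df = pd.read_csv("mined_test_data_result_libfreq_1_0806.csv")

def call(row: pd.Series, min_ratio: float) -> str:
    """Re-derive the R1-based organism call at a given frequency-ratio cut-off."""
    if row["1_tpm_2"] == 0:                       # no runner-up signal: accept top hit
        return row["1_org_1"]
    if row["1_tpm_1"] / row["1_tpm_2"] < min_ratio:
        return "Undecided"
    return row["1_org_1"]

for min_ratio in (1.0, 1.5, 2.0):
    calls = df.apply(call, axis=1, min_ratio=min_ratio)
    correct = int((calls == df["org"]).sum())
    undecided = int((calls == "Undecided").sum())
    print(f"ratio >= {min_ratio}: correct={correct}, "
          f"false={len(df) - correct - undecided}, undecided={undecided}")
```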

uniqueg commented 1 year ago

Interesting results. I think it's really not too bad for a start. A couple of points:

uniqueg commented 1 year ago

One other thing you could do: create separate stats for (1) the 10 most common organisms in SRA, or however many organisms we need to account for more than 95% of the RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good); (2) the rest of the organisms; and (3) all organisms (what you have now).
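A minimal sketch of such a stratified summary, assuming a results table with annotated (`org`) and inferred (`inferred`) columns and a placeholder list of top SRA organisms:

```python
import pandas as pd

# Hypothetical inputs: per-sample results with the annotated and the inferred
# organism, plus a placeholder set of the most common organisms in SRA.
df = pd.read_csv("results.csv")                        # assumed columns: org, inferred
TOP_SRA_ORGS = {"hsapiens", "mmusculus", "athaliana"}  # placeholder; fill from SRA counts

def summarize(sub: pd.DataFrame) -> pd.Series:
    undecided = int((sub["inferred"] == "Undecided").sum())
    correct = int((sub["inferred"] == sub["org"]).sum())
    return pd.Series({"n": len(sub), "correct": correct,
                      "false": len(sub) - correct - undecided,
                      "undecided": undecided})

in_top = df["org"].isin(TOP_SRA_ORGS)
print(pd.DataFrame({"top_sra": summarize(df[in_top]),
                    "rest": summarize(df[~in_top]),
                    "all": summarize(df)}))
```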

uniqueg commented 1 year ago

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

balajtimate commented 1 year ago
  • I'm not quite sure what you mean when you say that none of the samples had a frequency ratio below 1. It's the ratio of the counts of the most frequent and the second most frequent orgs, so by definition it must be >= 1, or NaN (if there are no counts for any org, or for all but one org). Thus, setting that parameter to 1 basically deactivates this check.

Yeah, sorry, I got it mixed up with the min-match-percentage 😅 It makes sense that we wouldn't want this lowered.

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

Yes, I checked these, I think they were specifically chosen because they're in the transcript list.

So for those samples where we have false positives, we should map these samples against the complete annotations of the top 2 or 3 results. That should give us the real result.

One other thing you could do: Create separate stats for (1) the 10 most common organisms in SRA or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).

It's a good idea, I'll do the mapping and create the stats. Tbh I think it would be best to focus on the most common organisms (like you mentioned); currently there are 440 orgs supported by HTSinfer, which is far too many, and there's a large number of closely related ones. I'll look up the most common ones in SRA, remove the others from the list, and check if that lowers the number of false positives.

uniqueg commented 1 year ago

Yes, but please don't remove orgs too drastically. Once you have a list of orgs by sample count in SRA, we can take the top 100. If that's still not enough, we can reduce further. But being able to support many organisms is still a cool feature, so I wouldn't want to go down to 20 or fewer if we can avoid it.

balajtimate commented 1 year ago

Well, it seems reducing the number of organisms is probably not the way to go 😓 Out of the 3.4 million Illumina RNA-seq samples in SRA, sorted by organism frequency, I selected the ones we already have in our list and took the 100 most common; all in all, these account for ~92% of the SRA samples. This actually lowered the undecided rate but increased the false positive rate:

[Figure: barplot of predicted organisms, 100-organism transcript list]

Even looking at just the 10 most common organisms (hsapiens, mmusculus, athaliana, dmelanogaster, drerio, rnorvegicus, zmays, mmulatta, scerevisiae, osativa) in both cases (I selected only the samples whose metadata organism was one of these: 208 samples), the 100-organism prediction is only slightly better than the one with all organism transcripts, and the false positive rate also increased:

[Table: org_test_results]
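For what it's worth, the selection step can be sketched like this (file and column names are assumptions about the exports):

```python
import pandas as pd

# Hypothetical exports: one row per SRA run with its annotated organism, and
# the organisms currently in HTSinfer's transcript list.
sra = pd.read_csv("sra_illumina_rnaseq_runs.csv")       # assumed columns: run, organism
supported = set(pd.read_csv("transcript_orgs.csv")["organism"])

counts = sra["organism"].value_counts()
top100 = counts[counts.index.isin(supported)].head(100)
print(f"SRA samples covered by the top 100: {top100.sum() / len(sra):.1%}")  # ~92% per the numbers above
top100.index.to_series().to_csv("top100_orgs.txt", index=False, header=False)
```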

Also, do you know how, for PE samples, the predicted organism from the second read is taken into account? I found quite a lot of samples where the organism could have been correctly predicted from R2, with a large ratio between the first and second most frequent organisms for that read, but since R1 had a ratio <2, the end result is Undecided. This is defined in `get_library_source.py`, and what I gather is that both reads get quantified with kallisto, but the resulting source (org) gets determined only from the first read? Examples (1_tpm_1 is the first hit and 1_tpm_2 the second hit for R1; similarly for R2: 2_tpm_1, 2_tpm_2, etc.):

| sample | org | 1_tpm_1 | 1_org_1 | 1_tpm_2 | 1_org_2 | 1_percent | 2_tpm_1 | 2_org_1 | 2_tpm_2 | 2_org_2 | 2_percent | match_org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR17499120 | bmori | 54.893 | lchalumnae | 35.848 | omeridionalis | 1.531 | 64.649 | bmori | 7.038 | omeridionalis | 9.186 | Undecided |
| SRR17499123 | bmori | 55.264 | lchalumnae | 36.041 | omeridionalis | 1.533 | 64.964 | bmori | 6.969 | omeridionalis | 9.322 | Undecided |
| SRR22952772 | btaurus | 38.895 | btaurus | 20.333 | bbbison | 1.913 | 55.846 | btaurus | 15.559 | bgrunniens | 3.589 | Undecided |
| SRR22952770 | btaurus | 36.444 | btaurus | 22.088 | bbbison | 1.65 | 51.714 | btaurus | 16.228 | bgrunniens | 3.187 | Undecided |
uniqueg commented 1 year ago

Oh, that's a bit unexpected!

I mean, for the check across all 770 samples, okay, we are then "forcing" a lot of false positives, because the true organisms aren't in the reduced list. But for the second part, where you only consider the most common orgs, I would have expected a better result: a greater reduction of undecided and fewer false positives. It's hard to imagine why we have more, although... I guess we are still pushing reads that don't map well to the target organism onto one of fewer other choices, and those could then mess with the numbers, which are probably rather low in general.

Two things I can think of:

As for the PE samples: that's weird indeed. What I would actually expect is that for PE samples both mates are mapped together, resulting in a single set of alignments, rather than mapping the two libraries separately, deciding separately, and then somehow combining or concatenating the results. This should actually be much more stringent: the best RP gene compatible with both mates would be picked, and in cases where there are, say, two reasonably good options for one mate, it is a lot less likely that the wrong organism of the two would also be among the top choices for the other mate.
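For illustration, a sketch of what joint PE quantification could look like with kallisto: both mates go into a single paired-end run against the combined index, and the per-organism abundances are aggregated once (this is not what `get_library_source.py` currently does; the target-ID scheme and paths are assumptions):

```python
import subprocess
from collections import defaultdict
from pathlib import Path

def paired_end_source(index: Path, r1: Path, r2: Path, out: Path) -> dict[str, float]:
    """Map both mates together in a single paired-end kallisto run against a
    combined transcript index, then aggregate abundance per organism.

    Assumes the organism is encoded in each target ID (e.g. 'gene|organism');
    the real ID scheme in HTSinfer's transcript set may differ.
    """
    subprocess.run(
        ["kallisto", "quant", "-i", str(index), "-o", str(out), str(r1), str(r2)],
        check=True,
    )
    tpm_by_org: dict[str, float] = defaultdict(float)
    with open(out / "abundance.tsv") as fh:  # kallisto's standard output table
        next(fh)  # header: target_id length eff_length est_counts tpm
        for line in fh:
            target_id, _, _, _, tpm = line.rstrip("\n").split("\t")
            org = target_id.split("|")[-1]   # assumed ID scheme
            tpm_by_org[org] += float(tpm)
    return dict(tpm_by_org)
```

A single ratio check on the returned dictionary would then replace the two per-mate decisions.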

Anyway, thanks for the good work!

uniqueg commented 6 months ago

It is important that we also get an idea of (1) why some of our organism annotations fail and (2) whether perhaps there are also mistakes in the SRA metadata.

@balajtimate: To do that, please map all libraries for which the inferred organism did not match the annotation, and at least a few dozen of the libraries for which no organism could be inferred, against the top 3 organisms, and check the mapping rates.
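A sketch of that check, looping each library over a few candidate organisms and reading the pseudoalignment rate from kallisto's run_info.json (the per-organism index paths and the example candidates are assumptions; a genome aligner would work the same way in principle):

```python
import json
import subprocess
from pathlib import Path

def mapping_rate(index: Path, reads: list[str], out: Path) -> float:
    """Pseudoalign one library against one organism's transcriptome index
    and return the fraction of reads that mapped."""
    cmd = ["kallisto", "quant", "-i", str(index), "-o", str(out)]
    if len(reads) == 1:  # single-end runs need explicit fragment-length stats
        cmd += ["--single", "-l", "200", "-s", "20"]
    subprocess.run(cmd + reads, check=True)
    info = json.loads((out / "run_info.json").read_text())
    return info["p_pseudoaligned"] / 100.0  # reported as a percentage

# Hypothetical usage: candidates taken from the PE examples earlier in the thread.
for sample, candidates in {"SRR17499120": ["bmori", "lchalumnae", "omeridionalis"]}.items():
    for org in candidates:
        rate = mapping_rate(Path(f"indices/{org}.idx"), [f"{sample}.fastq.gz"],
                            Path(f"out/{sample}_{org}"))
        print(sample, org, f"{rate:.1%}")
```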

balajtimate commented 5 months ago

So here are some final notes on this issue. For one thing, the metadata quality of the 780-sample SRA-mined dataset is... not great. I know I really should have done this at the beginning, but I just cross-checked the lib source for each sample in our dataset against the SRA metadata, and there are some differences: out of 780, 135 were considered lib-source mismatches. Some examples from our original testing data:

  1. 15 samples, like SRR23906030, were marked as bvulgaris (as in Beta vulgaris) because they had "beta" in the study description, when in fact they are C. elegans samples. HTSinfer correctly identified the lib source for all 15.
  2. All 30 ppaniscus and ptroglodytes samples were in fact Homo sapiens; I don't know where that information came from.
  3. Others, like SRR17509963, were described as mmmarmota when in fact it was a meta-transcriptome analysis.
  4. And there were samples like SRR23906946, which we had in our list because the Strategy was RNA-seq, but which in fact were scRNA-seq samples, so I took those out.

So I cleaned up the mined dataset (corrected the lib source so that it matches the SRA metadata, and removed 45 problematic samples, e.g. single-cell ones and those where the actual lib-source organism is not in our transcript db) and then checked again whether the lib source inferred by HTSinfer matched the metadata:

There are still 37 mismatches, but I think 5% false positives is not bad. There are also some weird cases:

  1. SRX6058708, where the organism is indicated as Plasmodium chabaudi chabaudi and HTSinfer determined Mus musculus; from the abstract, however, it's clear that they used mouse cells in the experimental design.
  2. Then there is also SRR21614910, which has Mesocricetus auratus as the organism in the metadata, but the title is simply "RNA-seq of mus", the description "transcriptomic data in mouse", and the inferred lib source is Mus musculus. I guess one could call a hamster a mouse, but there could be a clearer description in the SRA entry.

I would say that in these cases HTSinfer actually provided more precise information about the lib source than the SRA metadata.

Lastly, I mapped the remaining libraries with mismatches against both the metadata organism and the inferred organism, and, to no surprise, the samples had a significantly higher mapping rate to the metadata organism than to the inferred one. This was also true for the couple of Undecided lib sources that I mapped. In most cases, the decision came down to the ratio of mapped reads between the first and second most common lib source, which has to be at least 2. Lowering it to 1.5 wouldn't make sense either, as we would then get an almost equal number of matches and mismatches. It's also worth noting that the most common incorrectly inferred organisms are Ficedula albicollis and Oryza meridionalis; I have no idea what the meaning of that could be.

So while there is still room for improvement, I think with the testing data that we had this is probably the best result. I will summarize the findings from here and write it in #56.

uniqueg commented 5 months ago

This is awesome work, thanks a lot @balajtimate! And fantastic results :heart_eyes:

The only thing I would still like to see, though, is the full mappings of at least the 37 mismatches (ideally of all samples) against the top 3 inferred organisms/sources, and a comparison with what SRA reports. It would be a really powerful statement for the paper (and for the usefulness of HTSinfer) if we are able to quantify (to a limited extent) how often samples are misannotated or heavily contaminated.