Improve library source inference

rohank63 commented 3 years ago

Consistently update the set of reference transcripts for library source inference and include more genes for better accuracy.
Include organisms/sources from all clades in Ensembl: Bacteria, Plants, Fungi, Metazoa etc.

uniqueg commented 2 years ago

Handle with or after #72.

uniqueg commented 6 months ago

@balajtimate: In this issue, please create a short list of all the strategies we discussed to improve the library source inference.

balajtimate commented 5 months ago

As both the library type and the orientation inference rely on the inferred library source, it's extremely important to improve the inference. The key points from #108 and other discussions:

1. Add more genes from the current organisms other than ribosomal protein genes. This should include genes that are highly conserved intra-species, but show enough variability inter-species to be used for identification. One approach would be the use of DNA barcoding genes, like cytochrome c oxidase I (COI), cytochrome b (CYTB), histone 3 (H3) for mammals, matK and rbcL for plants. One source for this could be the BOLD database.
2. This should focus on the most common organisms in SRA: hsapiens, mmusculus, athaliana, drerio, rnorvegicus, zmays, mmulatta, scerevisiae, osativa, btaurus, sscrofa, celegans, ggallus.
3. Currently, HTSinfer doesn't support bacteria, but the next most common organism is ecoli, so add the RP genes from Ensembl Bacteria.
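
To make the "conserved intra-species, variable inter-species" criterion a bit more concrete, here is a rough sketch of how candidate marker genes could be ranked. It is not part of HTSinfer: the function names, the scoring formula (mean within-species identity minus mean between-species identity) and the toy sequences are all assumptions for illustration, and it expects pre-aligned sequences of equal length, at least two species, and at least one species with two or more sequences.

```python
from itertools import combinations, product
from statistics import mean


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def marker_score(seqs_by_species: dict[str, list[str]]) -> float:
    """Hypothetical score: mean intra-species minus mean inter-species identity.

    Higher values mean the gene is stable within a species but differs
    between species, which is what a good identification marker needs.
    """
    # All pairs of sequences drawn from the same species.
    intra = [
        identity(a, b)
        for seqs in seqs_by_species.values()
        for a, b in combinations(seqs, 2)
    ]
    # All pairs of sequences drawn from two different species.
    inter = [
        identity(a, b)
        for s1, s2 in combinations(seqs_by_species, 2)
        for a, b in product(seqs_by_species[s1], seqs_by_species[s2])
    ]
    return mean(intra) - mean(inter)


# Made-up toy alignments, not real gene sequences.
good_marker = {
    "sp_a": ["ACGTACGT", "ACGTACGT"],
    "sp_b": ["TTGTACCA", "TTGTACCA"],
}
poor_marker = {
    "sp_a": ["ACGTACGT", "ACGTACGT"],
    "sp_b": ["ACGTACGT", "ACGTACGA"],
}

good_score = marker_score(good_marker)
poor_score = marker_score(poor_marker)
print(f"good marker: {good_score:.3f}, poor marker: {poor_score:.3f}")
```

Candidate genes (for example barcoding genes) could then be sorted by such a score and only the top ones added to the reference set. A real implementation would likely need proper alignments and a distance measure that handles gaps.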

uniqueg commented 5 months ago

Thanks! To clarify: What exactly do you mean by "This should focus" in 2? What is "This", and how would we make "this" focus on just the listed organisms?

balajtimate commented 5 months ago

I meant that adding more genes (other than the RP genes) should focus on the 15 most common organisms, to get greater precision in the library source inference for (at least) those organisms.

uniqueg commented 5 months ago

Thanks. Any concrete ideas what such a strategy could look like? I mean, how do we find genes that are broadly conserved while at the same time maximizing the difference between the most common organisms? I don't really see how to start with such an exercise. Or were you suggesting to not care about the conservation beyond the most common organisms at all? And then maybe have a 2-stage process: look first at the broadly conserved (current) genes and then, based on the results for that, pick another subset of genes for better resolution?
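
To illustrate what I mean by a 2-stage process, here is a minimal sketch. Everything in it is hypothetical: the function name, the `min_ratio` threshold, and the read counts are made up, and it assumes we already have per-organism read counts against the broad (current) gene set and against a second, higher-resolution gene set.

```python
from typing import Optional


def infer_source(
    broad_counts: dict[str, int],
    refined_counts: dict[str, int],
    min_ratio: float = 2.0,  # hypothetical threshold, not an HTSinfer option
) -> tuple[Optional[str], int]:
    """Return (organism, stage that made the call); stage 0 means no call."""
    # Stage 1: rank organisms by reads hitting the broadly conserved genes.
    ranked = sorted(broad_counts, key=broad_counts.get, reverse=True)
    if not ranked or broad_counts[ranked[0]] == 0:
        return None, 0
    best = ranked[0]
    if len(ranked) == 1:
        return best, 1
    runner_up = ranked[1]
    # Accept the stage 1 result only if it clearly beats the runner-up.
    if broad_counts[best] >= min_ratio * broad_counts[runner_up]:
        return best, 1
    # Stage 2: the top two are too close, so compare only those two
    # using counts against the higher-resolution gene subset.
    refined = {org: refined_counts.get(org, 0) for org in (best, runner_up)}
    return max(refined, key=refined.get), 2


# Made-up counts for illustration only.
clear_case = infer_source({"hsapiens": 900, "mmusculus": 100}, {})
ambiguous_case = infer_source(
    {"hsapiens": 520, "mmulatta": 480, "mmusculus": 40},
    {"hsapiens": 35, "mmulatta": 210},
)
print(clear_case)      # ('hsapiens', 1)
print(ambiguous_case)  # ('mmulatta', 2)
```

The open question remains how to pick the second gene subset; the sketch only shows where it would plug in.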

zavolanlab / htsinfer

Improve library source inference #56