zstephens / exogene

A workflow for identifying viral integrations in both short and long read data
GNU General Public License v3.0

non-human species #1

Open xiaoyunguo opened 3 years ago

xiaoyunguo commented 3 years ago

Is there a way to adapt the pipeline for nonhuman organisms?

zstephens commented 3 years ago

An interesting idea. Looking over the source I see a number of areas where there are hardcoded assumptions that the input data is human, but not too many. The biggest offender is the final combine_reports.py script, which takes all the reads spanning viral breakpoints and boils them down into a consolidated list of sites. At a glance, I see:

i) a HUMAN_CHR whitelist
ii) hardcoded TELOMERE_HG38 and CENTROMERE_HG38 coordinates used for filtering
iii) a get_nearest_transcript() function which is hardcoded to reference hg38 gene annotations

In the main Exogene-SR.sh script, there are a few human-specific steps:

i) # discard reads which align very well to transcriptome reference
ii) # discard reads which align very well to decoy reference

So overall I think the workflow could be forked to support nonhuman reference sequences. It would involve replacing a few hardcoded variables with contig lists read in (most likely) from a .fa.fai file. It would also likely involve removing some of the filters described above, so I would expect the pipeline to yield more false positives than it does on human data.
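
For illustration only, here is a minimal sketch of how a contig list could be read from a .fa.fai index rather than a hardcoded whitelist. This is not code from the Exogene repository: the function name read_contigs_from_fai and the variable ALLOWED_CONTIGS are hypothetical, and the only thing assumed about the input is the standard tab-separated samtools faidx layout (name, length, offset, line bases, line width).

```python
import io


def read_contigs_from_fai(handle):
    """Return {contig_name: length} parsed from a samtools faidx index.

    Each .fai line is tab-separated: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    Only the first two columns are used here.
    """
    contigs = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        contigs[fields[0]] = int(fields[1])
    return contigs


# In-memory stand-in for a real index; with a file on disk one would use
# open("reference.fa.fai") instead (that path is a placeholder).
example_fai = io.StringIO("chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n")

# Hypothetical replacement for a hardcoded HUMAN_CHR-style whitelist.
ALLOWED_CONTIGS = read_contigs_from_fai(example_fai)
```

A dictionary keyed by contig name keeps the lengths available as well, which could be useful if any coordinate-based filters are kept for the new reference.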

PavitaKae commented 2 months ago

Based on your response, I can already run the pipeline on non-human species, but I need the nearest-gene function to filter some positions. How do I change the annotation from human to another species? Which file or line of code must I change?

Thank you

zstephens commented 1 month ago

Greetings! Since I'm already in the process of cleaning up combine_reports.py, I'll try to make the gene bed file an input argument so that you can provide your own annotations.
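
As a rough sketch of the idea (not the actual combine_reports.py code), a user-supplied gene BED file could be wired in along these lines. Everything named here is an assumption made for illustration: the --gene-bed option, the load_gene_bed and nearest_gene helpers, and the file name my_species_genes.bed. The BED input is assumed to be tab-separated with 0-based, half-open coordinates and the gene name in the fourth column.

```python
import argparse
import io


def load_gene_bed(handle):
    """Parse BED lines into {chrom: [(start, end, name), ...]}, sorted by start."""
    genes = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        genes.setdefault(chrom, []).append((start, end, name))
    for intervals in genes.values():
        intervals.sort()
    return genes


def nearest_gene(genes, chrom, pos):
    """Return (name, distance) of the closest gene on chrom, or None.

    pos is 0-based. Distance is 0 when pos lies inside a gene.
    Linear scan for clarity; not tuned for large annotation sets.
    """
    best = None
    for start, end, name in genes.get(chrom, []):
        if start <= pos < end:
            dist = 0
        elif pos < start:
            dist = start - pos
        else:
            dist = pos - end + 1  # BED end is exclusive
        if best is None or dist < best[1]:
            best = (name, dist)
    return best


def build_parser():
    # Hypothetical option; the real script may name or handle this differently.
    parser = argparse.ArgumentParser(description="illustrative options only")
    parser.add_argument("--gene-bed", default=None,
                        help="BED file of gene annotations for the target species")
    return parser


args = build_parser().parse_args(["--gene-bed", "my_species_genes.bed"])

# In-memory stand-in for the file named by args.gene_bed.
example_bed = io.StringIO("chr1\t100\t200\tgeneA\nchr1\t500\t900\tgeneB\n")
genes = load_gene_bed(example_bed)
```

With annotations loaded this way, a call such as nearest_gene(genes, "chr1", 250) gives the closest gene and its distance for an integration site, and returns None for contigs that have no annotated genes.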