Open xiaoyunguo opened 3 years ago
An interesting idea. Looking over the source, I see a number of areas where there are hardcoded assumptions that the input data is human, but not too many. The biggest offender is the final combine_reports.py script, which takes all the reads spanning viral breakpoints and boils them down into a consolidated list of sites. At a glance, I see:
i) a HUMAN_CHR whitelist
ii) hardcoded TELOMERE_HG38 and CENTROMERE_HG38 coordinates used for filtering
iii) a get_nearest_transcript() function which is hardcoded to reference hg38 gene annotations
In the main Exogene-SR.sh script, there are a few human-specific steps:
i) # discard reads which align very well to transcriptome reference
ii) # discard reads which align very well to decoy reference
So overall, I think the workflow could be forked to support nonhuman reference sequences. It would involve replacing a few hardcoded variables with contig lists read in (most likely) from a .fa.fai file. It would also likely involve removing some of the filters described above, so I would expect the pipeline to yield more false positives than it does on human data.
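As a rough illustration of the contig-list idea, here is a minimal sketch of reading contig names and lengths from a samtools-style .fai index (tab-separated, with the name in column 1 and the length in column 2). The function name read_contig_lengths is hypothetical and is not part of combine_reports.py; how the result would replace the HUMAN_CHR whitelist is an assumption.

```python
def read_contig_lengths(fai_path):
    """Return {contig_name: length} parsed from a FASTA index (.fai) file.

    Hypothetical helper, not taken from the pipeline. Only the first two
    tab-separated columns (NAME, LENGTH) of each line are used.
    """
    contig_lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            contig_lengths[fields[0]] = int(fields[1])
    return contig_lengths
```

The keys of the returned dictionary could then stand in for a hardcoded chromosome whitelist, for example by checking `contig in contig_lengths` wherever the human list is consulted.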
From your response, I understand that I can already run the pipeline on non-human species, but I need the nearest-gene function to filter some positions. How do I change the annotation from human to another species? Which file or line of code do I need to change?
Thank you
Greetings! Since I'm already in the process of cleaning up combine_reports.py, I'll try to make the gene bed file an input argument so that you can provide your own annotations.
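For illustration only, here is a sketch of what a user-supplied gene BED file might look like on the Python side. Everything named here is an assumption: the --gene-bed flag, load_gene_bed, and nearest_gene are hypothetical and do not reflect the actual interface or the existing get_nearest_transcript() logic. Positions are treated as 0-based, and BED intervals as half-open.

```python
import argparse
from collections import defaultdict


def parse_args(argv=None):
    # Hypothetical flag name; the real script may choose a different one.
    parser = argparse.ArgumentParser(description="Nearest-gene lookup sketch")
    parser.add_argument("--gene-bed", required=True,
                        help="BED file of gene annotations for the host genome")
    return parser.parse_args(argv)


def load_gene_bed(bed_path):
    """Load a BED file into {chrom: [(start, end, name), ...]}."""
    genes = defaultdict(list)
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines and BED header lines
            fields = line.rstrip("\n").split("\t")
            name = fields[3] if len(fields) > 3 else "."
            genes[fields[0]].append((int(fields[1]), int(fields[2]), name))
    return genes


def nearest_gene(genes, chrom, pos):
    """Return (name, distance) of the closest gene, or (None, None)."""
    best = (None, None)
    for start, end, name in genes.get(chrom, []):
        if start <= pos < end:
            dist = 0                  # position falls inside the gene
        elif pos < start:
            dist = start - pos        # gene lies downstream of the position
        else:
            dist = pos - end + 1      # gene lies upstream (end is exclusive)
        if best[1] is None or dist < best[1]:
            best = (name, dist)
    return best
```

A linear scan per lookup is simple but slow for large annotation sets; sorting the intervals and using a binary search or an interval tree would be a natural refinement.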
Is there a way to adapt the pipeline for nonhuman organisms?