zstephens / telogator

A method for measuring chromosome-specific telomere length from long reads
GNU General Public License v3.0
20 stars 1 forks source link

help me understand filtering double anchored telomeres #5

Closed santiago-es closed 1 year ago

santiago-es commented 2 years ago

Hello again,

Just a conceptual question this time: what is the purpose of filtering out double anchored telomeres, and what are they exactly? My intuition from the paper is that these are telomeres that anchored to subtelomere sequences on two different subtelomeres? But I'm not sure.

zstephens commented 2 years ago

Greetings, the double-anchored telomeres that get filtered out are reads with alignments: subtelomere --> telomere --> subtelomere. The idea was to remove reads spanning interstitial telomere repeat regions that originate from elsewhere in the genome but were identified as having telomere-like regions based on their sequence composition. This is something we've been revisiting recently, as reads spanning telomere fusions will be expected to have this form as well, so for certain use cases it may be desirable to keep them (or set them aside for further analysis).

santiago-es commented 1 year ago

Thanks for this. Identifying potential fusions was precisely my interest. Is there a way to stop double anchor filtering entirely? I tried commenting out the double anchor block but then lost all the previously measured telomeres in the output.

Also, could you clarify the last section of the output of merge_jobs.py?

I see a lot of the following:

Chr1_p Chr1p 13

etc and then in the measured telomere I might not see any from Chr1 so I was wondering if this was the double anchored telos it was printing out? But then would it be double anchored to the same chromosome? that seems like odd behavior.

Thanks again!

zstephens commented 1 year ago

Greetings! I've been looking into this recently and it turns out it's a bit more complicated than simply disabling a downstream filter of some sort. Due to the way telomere regions are selected a majority of double-anchored reads aren't making it past the initial "do we have any telomere at all?" filters. So the bad news is that I don't have a straightforward suggestion for how to recover those reads right now, but the good news is that telomere fusions are very much on our radar and we plan to rework some parts of how telomere regions are identified such that in an upcoming version we'll be able to identify such reads and either output them directly or use them to call fusion events, as part of a new feature.

As for the numbers that get printed in merge_jobs, admittedly that script is a little scrappy and has a few things that get printed to console that were just for debugging purposes. e.g. I believe those printouts were simply the number of reads at each subtelomere/telomere boundary (post filtering, I believe).

My ultimate goal for the next version is to combine telogator.py and merge_jobs.py into a single script that can be run in one command line (i.e. using multiprocessing instead of making the user chunk the jobs themselves) and overall to make it a bit more polished.

santiago-es commented 1 year ago

Thanks for the response! I look forward to the update.