-m leads to 0 reads mapped in output

zd1 / telseq

A software for calculating telomere length

GNU General Public License v3.0


-m leads to 0 reads mapped in output #25

Closed by kgaonkar6 5 years ago

kgaonkar6 commented 6 years ago

Hello,

I ran my tumor and normal BAM files through telseq in per-read-group mode and in -m (merged) mode to compare the outputs, and it looks like I have an issue in each mode:

1) In -m mode I get only 0 or 1 reads in the Mapped column, which seems odd since I do see mapped reads in the per-read-group analysis.

2) In per-read-group mode I get UNKNOWN in the LENGTH_ESTIMATE column. From previous issues, it looks like this happens because that read group had no reads for TEL5 and higher. Doesn't that seem odd, since there should be random reads aligned with TEL repeats in interstitial telomeric sequences that show up here?

Please find the corresponding output files with this issue:

Mode-read-group_telseq.txt Mode-merged_telseq.txt

zd1 commented 6 years ago

Hi there,

Thanks for your interest in telseq and sending through some example data.

This looks like a bug. I would recommend running telseq without "-m". This way all the results are written out and merging can be done afterwards.
Yes, when there are no telomeric reads in a library, the estimate will be UNKNOWN. It does look odd to me that there are no reads with 5 telomeric repeats in libraries that have tens of millions of reads. It would be worth checking whether any preprocessing was applied to those libraries. The variation in duplication rate looks comparable within each sample, with a few exceptions. I guess you could take a weighted average of the length estimates using only those read groups that have a similar duplication rate.

(image: duplication_rate)

Zhihao

kgaonkar6 commented 6 years ago

Thank you for the detailed reply.

I was wondering whether the duplication issue with the few read groups is unique to our samples or is generally observed in telomere length estimation. Could you also elaborate on how you would define a comparable duplication rate here?

Thanks, Krutika

zd1 commented 6 years ago

Sure. Those libraries with a duplication rate of 0 or above 25% look like outliers to me. They could be fine for other analyses but are problematic for the telseq approach. I think you could take libraries with a duplication rate within the interquartile range across all libraries, and compute a weighted average using just those.

Zhihao
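
For anyone following the suggestion above, here is a minimal sketch of the interquartile-range filter followed by a weighted average. It is only an illustration under some assumptions that the thread does not confirm: the duplication rate is taken as Duplicates / Total, the weights are the total read counts, and the function name `iqr_weighted_length` and the dictionary keys are hypothetical. Read groups whose estimate is UNKNOWN are represented as `None` and skipped.

```python
import statistics


def iqr_weighted_length(read_groups):
    """Weighted mean length over read groups whose duplication rate is within the IQR.

    read_groups: list of dicts with hypothetical keys
      "total" (int), "duplicates" (int), "length" (float, or None for UNKNOWN).
    Returns None when nothing usable remains.
    """
    # Ignore empty read groups so the rate is always defined.
    groups = [rg for rg in read_groups if rg["total"] > 0]
    if len(groups) < 2:
        return None  # quartiles need at least two data points

    # Assumption: duplication rate = duplicates / total reads.
    rates = [rg["duplicates"] / rg["total"] for rg in groups]

    # First and third quartiles across all read groups.
    q1, _median, q3 = statistics.quantiles(rates, n=4, method="inclusive")

    # Keep read groups inside [Q1, Q3] that have a numeric length estimate.
    kept = [
        rg
        for rg, rate in zip(groups, rates)
        if q1 <= rate <= q3 and rg["length"] is not None
    ]

    # Assumption: weight each estimate by the read group's total read count.
    weight = sum(rg["total"] for rg in kept)
    if weight == 0:
        return None
    return sum(rg["length"] * rg["total"] for rg in kept) / weight


# Made-up numbers, for illustration only.
example = [
    {"total": 100, "duplicates": 0, "length": 10.0},   # rate 0.00, outlier
    {"total": 100, "duplicates": 10, "length": 5.0},   # rate 0.10
    {"total": 300, "duplicates": 36, "length": 7.0},   # rate 0.12
    {"total": 100, "duplicates": 11, "length": None},  # rate 0.11, UNKNOWN
    {"total": 100, "duplicates": 30, "length": 2.0},   # rate 0.30, outlier
]
result = iqr_weighted_length(example)
```

In the made-up example, the read groups with rates 0.00 and 0.30 fall outside the interquartile range and the UNKNOWN one is skipped, so only two read groups contribute to the average. Whether to weight by total or by mapped reads is a judgment call; swap the key if mapped reads fit your data better.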