zd1 / telseq

A software for calculating telomere length
GNU General Public License v3.0

LENGTH_ESTIMATE units #8

Closed JenniferShelton closed 6 years ago

JenniferShelton commented 8 years ago

What units (bp or kb) is the length reported in?

Thanks!

zd1 commented 8 years ago

Sure, it's kb.

JenniferShelton commented 8 years ago

Thanks,

I'm wondering how you calculate the estimate? From the paper I read...

t_k is the abundance of telomeric reads at threshold k

k=7 and I see 4861 in the output. That plus all higher values is 4861 + 4849 + 4424 + 4113 + 3780 + 3485 + 2903 + 2521 + 2192 + 1850 = 34978

c is a constant for genome length divided by number of telomere ends 46 (23 × 2)

... I don't follow this (did you mean read length?). If not this would be 3095693981 / 46 = 67297695.23913044

we define s as a fraction of all reads with gas chromatography (GC) composition between 48 and 52%

For GC content (as opposed to the AT content) I see GC4 48-50 = 32419846, so... 32419846 / 702469493 = 0.04615125115476011

which would mean... l = 34978 * 67297695.23913044 * 0.04615125115476011 = 108637220026.74385

If I assume I should use read length rather than genome length, then l = 34978 * 2.1739130434782608 * 0.04615125115476011 = 3509.3010062852154

This is closer to your reported 4.0732 kb but I am not sure how or why it varies.

From the following output how did you come to the value 4.0732? Thanks!:

Total Mapped Duplicates LENGTH_ESTIMATE TEL0 TEL1 TEL2 TEL3 TEL4 TEL5 TEL6 TEL7 TEL8 TEL9 TEL10 TEL11 TEL12 TEL13 TEL14 TEL15 TEL16 GC0 GC1 GC2 GC3 GC4 GC5 GC6 GC7 GC8 GC9
702469493 695800772 114647356 4.0732 667408896 34305564 660169 32653 7809 6650 5342 4861 4849 4424 4113 3780 3485 2903 2521 2192 1850 53121065 45319338 39182395 35076778 32419846 29692931 25113409 19025081 13607107 9597819
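For anyone re-checking the sums quoted in this comment, here is a minimal Python sketch. It assumes the columns run Total, Mapped, Duplicates, LENGTH_ESTIMATE, TEL0 through TEL16, then GC0 through GC9, in that order; the variable names are made up for illustration and are not part of telseq.

```python
# Column order is an assumption based on the header shown in this comment.
header = (["Total", "Mapped", "Duplicates", "LENGTH_ESTIMATE"]
          + [f"TEL{i}" for i in range(17)]
          + [f"GC{i}" for i in range(10)])

values = """702469493 695800772 114647356 4.0732 667408896 34305564 660169
32653 7809 6650 5342 4861 4849 4424 4113 3780 3485 2903 2521 2192 1850
53121065 45319338 39182395 35076778 32419846 29692931 25113409 19025081
13607107 9597819""".split()

# Pair each column name with its value.
row = dict(zip(header, values))

# Reads with 7 or more telomeric repeats: TEL7 + TEL8 + ... + TEL16.
tel_k7 = sum(int(row[f"TEL{i}"]) for i in range(7, 17))

# Fraction of all reads that fall in the GC4 bin, as computed above.
gc4_fraction = int(row["GC4"]) / int(row["Total"])

print(tel_k7)                  # 34978
print(round(gc4_fraction, 6))  # 0.046151
```

This only reproduces the intermediate numbers in the question; it does not by itself explain the reported 4.0732.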

fpbarthel commented 8 years ago

Try this:

image

c = a constant that represents average chromosome size in kbp, always 3.3e8 bp (GC-adjusted genome size) x 46 chromosomes x 1000 bp (to get TL in Kbp)

nk = number of reads at k=7 (34978)

s = GC adjusted coverage (32419846+29692931)

332,720,800 / 46,000 × (34978 / (32419846 + 29692931)) ≈ 4.07
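The same arithmetic as a short Python sketch, in case it helps to reproduce it. All numbers are the ones quoted in this thread; the constant and variable names are illustrative assumptions and are not taken from the telseq source.

```python
# Values as quoted in this thread; names are illustrative only.
GENOME_LENGTH_BP = 332_720_800  # the "GC-adjusted genome size" used above
TELOMERE_ENDS = 46              # 23 chromosomes x 2 ends
BP_PER_KB = 1000

# TEL7 .. TEL16 from the posted output line.
tel_counts = [4861, 4849, 4424, 4113, 3780, 3485, 2903, 2521, 2192, 1850]
gc4, gc5 = 32419846, 29692931   # the two GC bins summed above

n_k = sum(tel_counts)                                # 34978
s = gc4 + gc5                                        # 62112777
c = GENOME_LENGTH_BP / (TELOMERE_ENDS * BP_PER_KB)   # kb per chromosome end

length_kb = c * n_k / s
print(round(length_kb, 4))  # 4.0732
```

Rounded to four decimals this matches the LENGTH_ESTIMATE column in the posted output.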

mengshux commented 6 years ago

"c is a constant for genome length divided by number of telomere ends 46 (23 × 2)" Except the number of telomere ends is 23 x 2 x 2 = 92, since each chromosome has 2 telomeres. The "per telomere" length prediction should be 1/2 of this: about 2.04 Kb

zd1 commented 6 years ago

Hi there, thanks for your interest in telseq. Please note that the genome length, which is approximately 3 Gb and was used for estimating the normalisation constant, corresponds to the haploid genome. The telseq approach can't distinguish between the two homologous chromosomes at a telomere end. So I think this number should be 46.

Best, Zhihao

mengshux commented 6 years ago

Thanks Zhihao. I think the confusion for me came from fpbarthel's post: "c = a constant that represents average chromosome size in kbp, always 3.3e8 bp (GC-adjusted genome size) x 46 chromosomes x 1000 bp (to get TL in Kbp)" Here "46 chromosomes" should be "46 chromosome ENDS"
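To make the 46 vs 92 question concrete, here is a small Python sketch of how the estimate scales with the assumed number of chromosome ends. The function name and its defaults are hypothetical and only mirror the arithmetic posted in this thread; per Zhihao's reply, 46 is the value that goes with a haploid genome length.

```python
def telomere_length_kb(n_k, s, genome_bp=332_720_800, ends=46):
    """Length estimate in kb, following the arithmetic in this thread.

    Illustrative helper only; not telseq's actual implementation.
    """
    return genome_bp / (ends * 1000) * n_k / s

n_k = 34978                # reads with 7 or more telomeric repeats
s = 32419846 + 29692931    # reads in the GC4 and GC5 bins

per_46_ends = telomere_length_kb(n_k, s, ends=46)
per_92_ends = telomere_length_kb(n_k, s, ends=92)

print(f"{per_46_ends:.4f} kb with 46 ends")  # 4.0732 kb with 46 ends
print(f"{per_92_ends:.4f} kb with 92 ends")  # 2.0366 kb with 92 ends
```

Doubling the number of ends simply halves the estimate, so the choice has to match the genome length used in the numerator.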