xiaoming-liu / stairway-plot-v2

The stairway plot is a method for inferring detailed population demographic history using the site frequency spectrum (SFS) from DNA sequence data.

Clarification on L input parameter #3

Open hrivera28 opened 3 years ago

hrivera28 commented 3 years ago

I'm rather confused as to what L (number of observed sites) is supposed to be. In your readme you define it as the sum of the SFS (if unfolded). However, in your sample blueprint file L is listed as 10,000,000, which is not the sum of your SFS (24,198).

I happened across this as I was trying to troubleshoot my own data, where my graphs all looked empty and my final summary just showed a constant (and very high ~250K) value of Ne. If I edit your blueprint file to designate L as the sum of your SFS then I get the same result as with my data.

So my question is what is L supposed to be? Would I need to calculate the total number of sequenced basepairs for a particular dataset?

Thank you for your help!

THccaa commented 3 years ago

In the old manual, L is described as: "L: length of sequence or more specifically the total number of observed nucleic sites (after filtering), including polymorphic and monomorphic. The number of polymorphic sites will be further separated by mutation size and described in SFS. For example, if you sequenced 100,000 loci each with 100 bp for 50 diploid samples and after filtering low-quality and missing data, 80,000 loci were retained each with 90 bp sequenced for 40 out of the 50 samples, then L=80000*90=7200000 and nseq=2*40=80."
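The arithmetic in that example can be checked with a short Python sketch. The function and parameter names here are illustrative only; they are not part of Stairway Plot itself.

```python
def observed_sites(loci_retained: int, bp_per_locus: int) -> int:
    """Total observed sites L: every retained site, monomorphic or polymorphic."""
    return loci_retained * bp_per_locus


def n_sequences(samples_retained: int, ploidy: int = 2) -> int:
    """Number of sampled sequences (nseq): one per chromosome copy."""
    return samples_retained * ploidy


# Post-filtering values from the old manual's example.
L = observed_sites(loci_retained=80_000, bp_per_locus=90)
nseq = n_sequences(samples_retained=40)
print(L, nseq)  # 7200000 80
```

Note that the pre-filtering numbers (100,000 loci, 100 bp, 50 samples) do not enter the calculation at all.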

That explains the L: 10000000 in the example blueprint, but it is different than the sum of the SFS as described in the current manual. The question is, which one is correct?
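If the old manual's definition is the intended one, the gap between the two numbers is simply the count of monomorphic sites. A quick check with the figures quoted in this thread (variable names are illustrative):

```python
# Figures quoted above for the example blueprint file.
L_blueprint = 10_000_000   # L as listed in the blueprint
sfs_sum = 24_198           # sum of the example SFS (polymorphic sites)

# Under the old manual's definition, L covers monomorphic + polymorphic sites,
# so whatever is not in the SFS must be monomorphic.
monomorphic = L_blueprint - sfs_sum
polymorphic_fraction = sfs_sum / L_blueprint

print(monomorphic)                      # 9975802
print(f"{polymorphic_fraction:.4%}")    # 0.2420%
```

Setting L to the sum of the SFS would instead imply that every observed site is polymorphic.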

sallycylau commented 3 years ago

Hi! Not sure if this is still useful, as this thread is a few months old, but I emailed Prof. Liu last year about this. I was analysing RADseq data, and for me, L should be (number of loci kept after SNP filtering) x (length of locus).

So I would say L is the length of the genome explored in the dataset (monomorphic + polymorphic sites) after filtering.

hrivera28 commented 3 years ago

Hi Sally,

Thanks for following up. Just to clarify, say I did 150 bp single end reads and ended up with 12,000 SNPs after filtering (and each 150 bp read had only 1 SNP). My L would be 12,000 * 150 = 1,800,000?

Thanks! Hanny

sallycylau commented 3 years ago

Hi Hanny

If I understood correctly, you have 12,000 loci with 1 SNP per locus, and each locus is 150 bp long? If this is right, then yes, I think L would be 12,000 * 150 = 1,800,000.

In case this is also helpful (I don't know whether you are thinning your data for the SFS): in my original email to Prof. Liu, my dataset included only 1 SNP per locus for my SFS. He recommended that I keep all SNPs per locus when building the SFS.
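One way to see how these two points fit together: under the loci x locus-length reading, L depends only on what was sequenced and retained, not on how many SNPs per locus go into the SFS. A small Python sketch, assuming that reading; the per-locus SNP counts are made-up toy values and the function name is hypothetical.

```python
def radseq_L(n_loci: int, locus_length_bp: int) -> int:
    """L as (loci kept after filtering) x (locus length), per this thread."""
    return n_loci * locus_length_bp


# Hypothetical toy data: number of SNPs found at each of four 150 bp loci.
snps_per_locus = [1, 3, 2, 1]

L = radseq_L(len(snps_per_locus), 150)
snps_all = sum(snps_per_locus)                          # keep every SNP
snps_thinned = sum(1 for n in snps_per_locus if n > 0)  # one SNP per locus

# Thinning changes how many sites reach the SFS, but L stays the same.
print(L, snps_all, snps_thinned)  # 600 7 4
```

With the numbers discussed above, `radseq_L(12_000, 150)` gives 1,800,000.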

Hope this is helpful.

Sally

hrivera28 commented 3 years ago

Hi Sally,

This is useful, thank you!

Best, Hanny

zhangzb554 commented 2 years ago

Hi Hanny, did you resolve the question about L? I think L should be (number of nucleotide sites in the VCF) x (number of individuals) x (ploidy), but I'm not sure about this. Could you give me some advice? Thanks a lot!