zstephens / neat-genreads

NEAT read simulation tools

parameter '-to' not working as expected #64

Open mrpk123 opened 5 years ago

mrpk123 commented 5 years ago

Hello,

I am trying to use the '-to' parameter to limit the number of variants in off-target regions, as described in your documentation. However, several trials with different data sets indicate that using '-to' yields more off-target variants than running without it. For example, see the following results that I got for a single-chromosome simulation.

Reference: chr17.fa
Read length (-R): 75
Paired end fragment size and deviation (-Pe): 200 20
Coverage (-C): 100
Number of chromosomes in target bed file: 1
Number of intervals in target bed file: 24

[image: results table]

According to the description on the GitHub page, -to = 0 should result in 0 variants outside the targeted region, or very few if any are generated at all. From the 3rd row, it is clear that with -to = 0 the number of variants generated has actually increased. If we subtract the in-target variants from the total number of variants (last column in the table), we see that with -to = 0 the number of off-target variants is larger than the number obtained without the '-to' parameter, which contradicts what the GitHub page says.

I saw a similar issue reported here in September 2017. Although it says the issue is fixed, my results indicate otherwise. Am I missing something, or is the issue still to be fixed? I am using the latest repository.

Regards

zstephens commented 5 years ago

Greetings!

To confirm, the variants you're observing are those in the golden.vcf produced by the simulation, correct? The reads (and golden.bam, if you're outputting one) are still properly restricted to the targeted regions as expected?

NEAT proceeds along the entire reference sequence window-by-window, and if it finds a window that includes a targeted region as specified in the input BED, it will introduce variants into that window and then sample reads. The read sampling is heavily biased to occur primarily in coordinates from the BED file, but the randomly generated variants are not restricted in this fashion. Would you be able to confirm that the variants outside the targeted regions are occurring near the targeted regions?
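One way to check this is to measure how far each golden.vcf variant lies from the nearest BED interval. The following is only a rough sketch and is not part of NEAT: the function names and the `near` threshold of 100 bases are assumptions made for illustration, and each variant is judged by its POS field alone.

```python
def parse_bed(lines):
    """Return {chrom: [(start, end), ...]} from BED lines (0-based, half-open)."""
    targets = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        targets.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return targets


def distance_to_nearest_target(targets, chrom, pos):
    """Distance in bases from a 1-based VCF position to the nearest target.

    Returns 0 if the position is inside a target, or None if the
    chromosome has no targets at all.
    """
    intervals = targets.get(chrom)
    if not intervals:
        return None
    p = pos - 1  # convert the 1-based VCF position to 0-based BED coordinates
    best = None
    for start, end in intervals:
        if start <= p < end:
            return 0
        d = start - p if p < start else p - (end - 1)
        if best is None or d < best:
            best = d
    return best


def classify_variants(targets, vcf_lines, near=100):
    """Count variants inside targets, within `near` bases of one, or farther away."""
    counts = {"in_target": 0, "near": 0, "far": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip VCF header lines
        chrom, pos = line.split("\t")[:2]
        d = distance_to_nearest_target(targets, chrom, int(pos))
        if d == 0:
            counts["in_target"] += 1
        elif d is not None and d <= near:
            counts["near"] += 1
        else:
            counts["far"] += 1
    return counts
```

Passing the open BED and golden.vcf file handles to `parse_bed` and `classify_variants` gives the three counts for a real run. The linear scan over intervals is adequate for a small BED file; a sorted list with binary search would be the usual choice for larger ones.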

If this is in fact the case, I'm open to adjusting the behavior such that the output VCF only contains variants within the target regions (perhaps configurable via another option).
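Until such an option exists, a similar result could be approximated by post-processing the golden.vcf. The sketch below is not NEAT code and does not reflect how any future option would be implemented: `load_targets`, `in_target`, and `filter_vcf` are hypothetical helpers, and a variant is kept or dropped based on its POS field only, so an indel that starts just outside a target would be removed even if it extends into it.

```python
def load_targets(bed_lines):
    """Return {chrom: [(start, end), ...]} from BED lines (0-based, half-open)."""
    targets = {}
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        targets.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return targets


def in_target(targets, chrom, pos):
    """True if the 1-based VCF position falls inside any BED interval."""
    p = pos - 1  # BED coordinates are 0-based
    return any(start <= p < end for start, end in targets.get(chrom, []))


def filter_vcf(vcf_lines, targets):
    """Yield header lines unchanged and only those records that lie in a target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        if in_target(targets, fields[0], int(fields[1])):
            yield line
```

Writing the lines yielded by `filter_vcf` to a new file leaves the original golden.vcf untouched, so the full set of simulated variants is still available for comparison.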

Thanks!

mrpk123 commented 5 years ago

Yes, I am checking variants from the golden.vcf. Out of 100 off-target variants, 5 are near target regions (around 10 to 60 bases away from either the start or the stop). For those variants, a few reads (3-7) are seen. For the remaining 95 variants, no reads are observed in the pileup. So we can say that the reads are properly restricted to the targeted regions.