zstephens / neat-genreads

NEAT read simulation tools
Other
92 stars 27 forks source link

Bug in golden.bam for InDels #56

Closed ju-mu closed 5 years ago

ju-mu commented 5 years ago

Hi,

thank you very much for providing this fantastic tool. One of the non crucial features of neat is the direct output of the golden.bam which is great for visual inspection and might be great for direct variant caller evaluation. Unfortunately I noticed that the golden.bam just works for SNVs, not for InDels. Is this a known issue?

I created an IGV snapshot of the golden.bam (upper panel) and the fastq output after mapping by bwa mem (lower panel) for an 8Bp deletion (there is an additional SNV 2Bp before the deletion):

igv_snapshot_neat_deletion

... and an 8Bp insertion: igv_snapshot_neat_insertion

The mapped fastq is in line with the golden.vcf, however the golden.bam seems completely off, introducing many mismatches which would generate FP in an aligner evaluation. In addition, the location of the actual InDel seems to be partially in the middle and partially at the end of the variant, rather than the start.

I assume there must be an error during the CIGAR calculation, however I couldn't spot anything obvious in your code. It would be fantastic if you could have a look.

Many thanks!

zstephens commented 5 years ago

Greetings, thanks for pointing this out!

I am able to replicate this bug by using an input vcf with insertions and deletions in close proximity. I have some suspicions about what may be causing the problem and will follow up here as I poke at it.

zstephens commented 5 years ago

Allright, I believe I have fixed this! Feel free to reopen this issue if you still have this problem, or if the changes I made seemed to have broken anything else.

Here's the test case I was using: 5bp ins + 10bp del + 7bp ins + 4bp del

Screen Shot 2019-03-12 at 10 41 58 PM

ju-mu commented 5 years ago

Hi,

Unfortunately me again ;-) I have used the following settings using the old and new version: python genReads.py -r test.fa -M 0.002 -R 101 -c 200 -o testdir/testprefix --vcf -t test.bed -e mytest_errorModel.p -m models/MutModel_CLLE-ES_ICGC.p --pe-model mytest_fraglen.p --gc-model mytest_gcBiasModel.p --bam -v mytest_test.vcf --rng 999 --job 1 1

I am only looking at a small region with 2x8Bp Ins and 1x8Bp Del. The upper panel is the NEAT version before your changes, the middle panel after and the lower panel after bwa mem mapping.

The first Insertion looks fine: igv_snapshot_new_old_ins2

The second Insertion is off by one base pair and introduces one FP SNV at the beginning of the Ins: igv_snapshot_new_old_ins

The deletion is also off by one base pair, but also introduces quite a few mismatches down stream: igv_snapshot_new_old_del

The mismatches are caused by all reads that have their left hand side starting point within the deleted region. There are no reads that have their right hand side end point within the deleted region (coincidence?):

igv_snapshot_new_old_del_z

Many thanks!

zstephens commented 5 years ago

I just pushed another attempt at a fix for this, here's hoping!

ju-mu commented 5 years ago

Yes, this seems to have fixed it. At least all three regions I tested on 13th of March appear as expected with the new version. Thanks!!

zstephens commented 5 years ago

Excellent! The cigar string code was needlessly convoluted, so this was a good excuse to revisit and simplify it. Thanks again for reporting this!

ju-mu commented 5 years ago

Hi,

unfortunately it seems as if your fix has broken something. I have run a simulation as described above, but this time for a whole WES bed file on all major chromosomes and split into 30 processes:

python genReads.py -r test.fa -M 0.002 -R 101 -c 200 -o testdir/testprefix --vcf -t test.bed -e mytest_errorModel.p -m models/MutModel_CLLE-ES_ICGC.p --pe-model mytest_fraglen.p --gc-model mytest_gcBiasModel.p --bam -v mytest_test.vcf --rng 999 --job 1 30

Three processes exited prematurely with the following error messages: ... 82% 83% Index error when attempting to find cigar string. 22318 22318 (22226, 22318) 0 193 182 ... 5% 6% Index error when attempting to find cigar string. 22309 22309 (22172, 22309) 1 238 226 ... 22% 23% Index error when attempting to find cigar string. 22315 22315 (22223, 22315) 0 193 182

I then ran the exact same job again, but this time using the old version before your recent fix and it completed successfully.

Could you have a look?

Many thanks!

zstephens commented 5 years ago

I'm not 100% sure where, but there's an off-by-one error somewhere. It might be related to the particular fragment length distribution. I pushed a rather hacky quick fix that might solve it.

ju-mu commented 5 years ago

Yes, your fix seems to have solved it. At least the repeat of this particular simulation was successfull using the same seed value and InDels are looking fine. Thanks!