zstephens / neat-genreads

NEAT read simulation tools

VCF problem: IndexError: string index out of range_ #68

Closed gonzalpk closed 4 years ago

gonzalpk commented 4 years ago

Hello, I am trying to create normal-tumor paired DNAseq samples. My approach is to set the rng for both the normal and tumor samples to the same number to establish germline mutations, and then to use randomly sampled mutations from a COSMIC VCF file for the tumor sample. I am targeting genomic regions and have used bedtools to extract the COSMIC mutations from those regions. When I run the script, the normal sample runs beautifully, but the tumor sample fails with the error message below. I'm assuming the issue is with my VCF file, but I am not sure what needs to be fixed in it. Any thoughts on the error would be greatly appreciated. The script is also pasted below.

Thank you, Patrick

reading input VCF...
Warning: Found variants without a GT field, assuming heterozygous...
Traceback (most recent call last):
  File "/projects/gonzalpk/neat-genreads/genReads.py", line 743, in <module>
    main()
  File "/projects/gonzalpk/neat-genreads/genReads.py", line 277, in main
    (sampNames, inputVariants) = parseVCF(INPUTVCF,ploidy=PLOIDS)
  File "/projects/gonzalpk/neat-genreads/py/vcfFunc.py", line 176, in parseVCF
    while len(varsOut[r][i][1]) > 1 and all([n[-1] == varsOut[r][i][1][-1] for n in varsOut[r][i][2]]):
IndexError: string index out of range

python /projects/gonzalpk/neat-genreads/genReads.py \
    -r /scratch/summit/gonzalpk/ensembl/UCSC/hg38.fa \
    -R 150 \
    -E 0.01 \
    --bam \
    -c 500 \
    -v 0.vcf \
    --vcf \
    --rng 0 \
    --gz \
    --pe 300 30 \
    -t targeted_panel_locations.bed \
    -to 0 \
    -o 0_target_simulated_data_tumor &

python /projects/gonzalpk/neat-genreads/genReads.py \
    -r /scratch/summit/gonzalpk/ensembl/UCSC/hg38.fa \
    -R 150 \
    -E 0.01 \
    --bam \
    -c 500 \
    --vcf \
    --rng 0 \
    --gz \
    --pe 300 30 \
    -t targeted_panel_locations.bed \
    -to 0 \
    -o 0_target_simulated_data_normal &

zstephens commented 4 years ago

Greetings, are you able to share the VCF file, or a subset of it?

The error is occurring in a part of the VCF parser that removes redundant bases from the REF and ALT alleles. E.g. code that turns (ACAA --> AGAA) into (AC --> AG). My first guess would be these fields of the input VCF might be non-standard or in some format that I didn't anticipate.
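For readers following along, here is a minimal sketch of that kind of trimming, and of one way the condition quoted in the traceback could raise an IndexError. The function names are hypothetical and this is not the actual NEAT code or the fix that was later pushed. It assumes the loop strips the last base from REF and from every ALT allele on each pass. Under that assumption, a record whose ALT allele is a single base matching the end of a longer REF (for example REF=CAA, ALT=A) is trimmed to an empty string, and the next `n[-1]` lookup fails.

```python
def trim_unguarded(ref, alts):
    """Hypothetical re-creation of the quoted loop: only REF's length is checked."""
    while len(ref) > 1 and all([n[-1] == ref[-1] for n in alts]):
        ref = ref[:-1]                  # drop the shared trailing base from REF
        alts = [n[:-1] for n in alts]   # and from every ALT allele (assumed behavior)
    return ref, alts


def trim_guarded(ref, alts):
    """Same idea, but stop before any allele would become empty."""
    while len(ref) > 1 and all(len(n) > 1 and n[-1] == ref[-1] for n in alts):
        ref = ref[:-1]
        alts = [n[:-1] for n in alts]
    return ref, alts


# The example from the comment above: (ACAA --> AGAA) becomes (AC --> AG).
print(trim_guarded("ACAA", ["AGAA"]))

# A deletion-style record leaves an empty ALT after one pass of the unguarded loop.
try:
    trim_unguarded("CAA", ["A"])
except IndexError as err:
    print("unguarded version failed:", err)

# The guarded version leaves the same record untouched.
print(trim_guarded("CAA", ["A"]))
```

Whether the COSMIC records in question actually look like this is a guess; checking the input VCF for rows where an ALT allele is shorter than REF and shares its trailing base would be one way to confirm.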

gonzalpk commented 4 years ago

Absolutely, thanks for getting back to me so quickly. The VCF file is attached.

Patrick Gonzales
Link lab, Department of Integrative Physiology
University of Colorado, Boulder



zstephens commented 4 years ago

I'm not sure the attachment came through correctly (I can't see it on GitHub or in the email response). Feel free to send it directly to me at zstephe2@illinois.edu.

Thanks!

zstephens commented 4 years ago

Greetings! I pushed an update to the repository that should fix this. It was indeed a bug in the input variant simplification code.

gonzalpk commented 4 years ago

Excellent! The program works beautifully now. Thank you.