zstephens / neat-genreads

NEAT read simulation tools

question about sequencing error rate #69

Closed kkamii closed 4 years ago

kkamii commented 4 years ago

Hello :) I want to generate simulated data with high sequencing quality.

But when I used the simulated data as input to the bbduk program (Q30 trimming), almost 50% of the bases were removed from the original simulated FASTQ file.

This is my command:

python genReads.py -r hg19.fa -R 151 -o output --bam --vcf -c 800 -t target.bed --rng 100 --gz -M 0 -v variant.vcf --pe-model fraglen.p -to 0 -E 0 -p 1

How can I change my command to generate data with high sequencing quality?

Sorry for my bad English.

zstephens commented 4 years ago

Greetings!

The "-E 0" input option should indeed cause the reads to be high quality (no sequencing errors), but the quality score strings are still sampled from the underlying model. You have a couple options:

1) Since the reads have no errors, is running the trimming tool still necessary?

2) Download the latest version of the repository (I just pushed an update), and use the following options: "-E 0 -e models/errorModel_pacbio_toy.p"

Option 2 will use a uniform sequencing error model, in which every base in a read has the same quality score. In conjunction with -E 0, you should get reads with the maximum quality score (Q41 by default).
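To see why a Q30 trim can discard error-free reads, it may help to look at the quality strings directly. The following is a minimal sketch, not part of NEAT or bbduk: it assumes the FASTQ files use the common Phred+33 encoding, and the helper names (`phred_scores`, `fraction_below`, `error_probability`) and the example strings are hypothetical, chosen only for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]


def fraction_below(quality_string, threshold=30, offset=33):
    """Return the fraction of bases whose Phred score is below the threshold."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        return 0.0
    return sum(1 for q in scores if q < threshold) / len(scores)


def error_probability(q):
    """Per-base error probability implied by Phred score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical 151-base quality strings (not taken from real NEAT output).
uniform_q41 = "J" * 151            # 'J' decodes to Q41 under Phred+33
mixed = "J" * 100 + "5" * 51       # '5' decodes to Q20 under Phred+33

print(fraction_below(uniform_q41))  # no base is below Q30
print(fraction_below(mixed))        # 51 of 151 bases are below Q30
print(error_probability(30))        # roughly 1 error in 1000 bases
```

A quality-based trimmer only sees the quality characters, not whether the bases are actually correct, so reads generated with -E 0 can still lose bases if their sampled quality scores fall below the threshold.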

kkamii commented 4 years ago

wow.. thank you for your quick answer! I chose the first option. I didn't think of it :)