zstephens / neat-genreads

NEAT read simulation tools

Dealing with generic bases (N's) in the reference genome #77

Closed: reinator closed this issue 3 years ago

reinator commented 3 years ago

Hi! First of all, great tool!

Second, I noticed that when I use a large reference genome that has many N's (such as some plant genomes), the coverage is much lower than what I specified with -c. Here are some statistics:

For large genomes with at most 200 N's, the coverage was close to what I specified. For large genomes with up to 7Mb of N's, the coverage that was supposed to be 30X was only 1.4X.

Is there some way I could overcome this problem using NEAT? My interest in NEAT is due to the ploidy functionality, which will help me get closer to a "plant genome reality".

Thanks!

reinator commented 3 years ago

My reference genome is 633Mb, and I specified a long-read simulation with 30X coverage and read length = 15Kb. It turns out that when I split my reference genome into contigs with "seqtk cutN", only a few contigs were larger than 15Kb (62Mb in total).

I guess that's why I obtained a smaller coverage than expected. I will close this issue.
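
For anyone hitting the same symptom, here is a minimal Python sketch of the check described above: it splits each sequence at runs of N's and counts how many bases sit in segments at least as long as the read length. The function names and the `min_n_run` parameter are hypothetical, made up for this example; this is not part of NEAT or seqtk, and its splitting rule may not match what "seqtk cutN" does with its own settings.

```python
import re

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def usable_bases(lines, read_length, min_n_run=1):
    """Return (total_bases, bases_in_segments_long_enough_for_a_read).

    Sequences are cut at every run of at least `min_n_run` N's
    (an assumption of this sketch, not NEAT's or seqtk's exact rule).
    """
    splitter = re.compile(r"[Nn]{%d,}" % min_n_run)
    total = 0
    usable = 0
    for _, seq in read_fasta(lines):
        total += len(seq)
        for segment in splitter.split(seq):
            if len(segment) >= read_length:
                usable += len(segment)
    return total, usable

if __name__ == "__main__":
    # Tiny in-memory example; for a real genome, pass an open file handle.
    demo = [">chr1", "ACGTNNNNACGTACGT", ">chr2", "NNACNN"]
    total, usable = usable_bases(demo, read_length=8)
    print(f"{usable} of {total} bases are in segments >= 8 bp")
```

If the usable fraction is small (with the numbers in this thread, 62Mb out of 633Mb is roughly 10%), a much lower coverage than requested is plausible, and a shorter read length or a less fragmented reference may be worth trying.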