zstephens / neat-genreads

NEAT read simulation tools
Other
95 stars 27 forks source link

GRCH38 problems #86

Open popicka opened 3 years ago

popicka commented 3 years ago

Hi,

We have been using Neat to successfully generate reads with hg19 reference. However, when using GRCh38 with alternative sequences, plus decoys and HLA, we stumble upon different problems.

  1. In order to properly use computeGC.py we have removed the strings after the first space from the original fasta as suggested here. ComputeGC was able to generate the model after this.
  2. We used neat-genreads This is the command line that was used: python genReads.py --gc-model gc_model.p -t target.bed --pe-model fraglen.p -r grch38.fasta -o wes_normal_150x -v variants.vcf -R 101 -c 150 -M 0.1 --rng 10 --vcf --bam --job 8 8

and got this error /usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice. out=out, **kwargs) /usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)

Do you maybe know what could have caused this error? When running neat-genreads should we use the original fasta or the fasta file used for computeGC? Are there any known bugs related to GRCH38 reference?

Thank you, Ana

zstephens commented 3 years ago

Greetings, was there more to that error? I don't see a line number to where in genReads.py it was going wrong.

I wonder if you would be able to share your target bed file? I think the code might expect it to very organized (sorted, no overlapping regions) and things can go wrong if it's not.

popicka commented 3 years ago

Hi, sorry for the delayed response. Here are all of the BED files that we used. With all of these BED files we get the same error. S07604514_Regions_v6_hg38_Targets.txt S31285117_Regions_v7_hg38_Targets.txt S07604514_Regions_v6_hg38_Targets_removed_additional_info.txt S04380110_Regions_v5_hg38_Targets.txt S04380110_Regions_v5_hg38_Targets_removed_additional_info.txt S04380110_Regions_v5_hg38_Targets_removed_additional_info.sorted.txt

Interestingly, we also have a problem with GRCH38 WGS simulations, where NEAT breaks without reporting an error. Thank you!

zstephens commented 3 years ago

I was able to successfully simulate small test datasets using those bed files (minus the header lines) and a subset of chr1. So something must be going wrong later down the line. Do you have a log file of where the simulation was failing? E.g. which chromosome and position did it make it to before dying?