zstephens / neat-genreads

NEAT read simulation tools

gen_reads.py produces empty output files #92

Closed joys8998 closed 3 years ago

joys8998 commented 3 years ago

Hi there, I just downloaded the latest release (compatible with Python 3) and, after installing all the requirements, tried to run gen_reads.py. I don't get any error, only a warning about the BED file (Warning: Reference contains sequences not found in targeted regions BED file.), which I understand is fairly normal. I know the reference matches the BED file because I used them together in the germline pipeline. However, the output files are still empty after 7-8 hours.

This is my env

python==3.8.8 (I also tried Python 3.8.5 and 3.6, but it doesn't work with those either)
numpy==1.19.5
biopython==1.78
matplotlib==3.3.4
matplotlib_venn==0.11.6
pandas==1.2.1
pysam==0.16.0.1

This is my command using hg38 as ref

python gen_reads.py -r ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_1 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg38.exome.pad20nts.ncbiRefSeq.bed -c 50

This is my log

Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chr1...

I'm running it on Ubuntu 20.04

When I use hg19 instead, I get this error:

Traceback (most recent call last):
  File "gen_reads.py", line 902, in <module>
    main()
  File "gen_reads.py", line 549, in main
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, '
UnboundLocalError: local variable 'buffer_added' referenced before assignment

thanks :)

joshfactorial commented 3 years ago

Hello. First off, thank you for using NEAT. We have a new repo now, and I invite you there, since that's where active development will take place: https://github.com/ncsa/NEAT. That said, NEAT takes a very long time to run, and the run time grows with the size of the input. That's one thing we'll be working on in the new repository. I would not recommend running it on the full hg38 unless you have a workstation where you can let it run for several days. What you might try instead is splitting your reference by chromosome, running NEAT on each piece, and then combining the resulting FASTQs and VCFs into one at the end.
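The per-chromosome approach suggested above could be sketched roughly as follows. This is a minimal illustration and not part of NEAT: the helper names split_fasta_by_sequence and concatenate_files are hypothetical, and the code assumes an uncompressed FASTA in which every header line starts with ">" followed by a sequence ID.

```python
from pathlib import Path


def split_fasta_by_sequence(fasta_path, out_dir):
    """Write each sequence of a multi-FASTA file to its own file.

    Hypothetical helper. Returns the files written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out_handle = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A header starts a new sequence: close the previous
                    # output file and open one named after the sequence ID.
                    if out_handle is not None:
                        out_handle.close()
                    seq_id = line[1:].split()[0]
                    out_path = out_dir / f"{seq_id}.fa"
                    out_handle = open(out_path, "w")
                    written.append(out_path)
                if out_handle is not None:
                    out_handle.write(line)
    finally:
        if out_handle is not None:
            out_handle.close()
    return written


def concatenate_files(parts, combined_path):
    """Concatenate plain-text files (for example FASTQs) in the given order."""
    with open(combined_path, "w") as combined:
        for part in parts:
            with open(part) as handle:
                for line in handle:
                    combined.write(line)
```

Each file returned by split_fasta_by_sequence can then be passed to gen_reads.py with -r in a separate run. Plain concatenation is only suitable for uncompressed FASTQ text. Every VCF carries its own header, so merging VCFs needs a header-aware tool such as bcftools concat.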

That is an interesting bug when using hg19. I had not thought to test that, but I will make a bug report on the repo above regarding that buffer_added variable and check into that.
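For readers unfamiliar with this class of error, here is a minimal sketch of how an UnboundLocalError like the one in the traceback can arise. It is not the actual gen_reads.py code: the functions describe_window and describe_window_fixed, the add_buffer flag, and the value 50 are all hypothetical, and the real cause in NEAT may differ.

```python
def describe_window(start, end, add_buffer):
    # 'buffer_added' is assigned on only one branch (hypothetical pattern).
    # If add_buffer is False, the name is never bound before it is read.
    if add_buffer:
        buffer_added = 50
    return f"PROCESSING WINDOW: {(start, end), [buffer_added]}"


def describe_window_fixed(start, end, add_buffer):
    # Binding a default before the branch guarantees the name always exists.
    buffer_added = 0
    if add_buffer:
        buffer_added = 50
    return f"PROCESSING WINDOW: {(start, end), [buffer_added]}"
```

Calling describe_window(0, 100, add_buffer=False) raises UnboundLocalError, while describe_window_fixed returns normally for either value of the flag.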

joshfactorial commented 3 years ago

I think the reason the output files are empty is that NEAT opens the files, then writes to them while processing. Opening a file creates it on your OS, but since NEAT was still processing at the end of your timeframe, it hadn't written anything out yet. However, I will double check your command input and confirm that this is what's happening.
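The general behaviour described here, where a file exists but stays empty until something is actually written, can be reproduced with plain Python. This is only an illustration of how open() behaves and makes no claim about how NEAT schedules its writes. The file name example_read1.fq is hypothetical.

```python
import os
import tempfile

# Hypothetical output path in a throwaway directory.
out_path = os.path.join(tempfile.mkdtemp(), "example_read1.fq")

# Opening in write mode creates the file immediately, with zero bytes.
handle = open(out_path, "w")
size_after_open = os.path.getsize(out_path)

# Written text sits in a buffer until it is flushed or the file is closed.
handle.write("@read1\nACGT\n+\nIIII\n")
handle.flush()
size_after_flush = os.path.getsize(out_path)

handle.close()

print("size right after open:", size_after_open)
print("size after write and flush:", size_after_flush)
```

A zero-byte output file therefore only shows that the program got as far as opening it, not that anything has gone wrong.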