zstephens / neat-genreads

NEAT read simulation tools

Threads exit when given VCF #14

Open muppetjones opened 7 years ago

muppetjones commented 7 years ago

For some reason, when I generate reads using a VCF, most of the threads exit silently a few minutes after starting. This seems to happen before the reads are assigned parts of the genome. It is not always the same thread, nor is it always the same number of threads (8-10 still remain).

Command:

REGIONS="/data/regions.bed"
REFERENCE="/refs/ensembl/ensembl_build_75.fa"
VARIANTS="/scripts/cosmic_fix.vcf"
OUT_DIR=/data
VAF=15

TUMOR_PRCNT=$(( ${VAF} * 2 ))
G_COV=$(echo "scale=2; (100 - ${TUMOR_PRCNT}) * 4" | bc -l | xargs printf "%.0f")
S_COV=$(echo "scale=2; ${TUMOR_PRCNT} * 4" | bc -l | xargs printf "%.0f")

G_PREFIX=${OUT_DIR}/tumor_g
S_PREFIX=${OUT_DIR}/tumor_s

for i in $(seq 1 "${MAX_JOB}")
do
  set -x
  python /neat-genreads/genReads.py \
    -r "${REFERENCE}" \
    -R 126 \
    -o "${S_PREFIX}" \
    --pe 300 30 \
    -t "${REGIONS}" \
    -c "${S_COV}" \
    -M 0 \
    -v "${VARIANTS}" \
    --vcf --bam \
    --job "$i" "${MAX_JOB}" 2>&1 | tee "${S_PREFIX}_${i}.log" &
  set +x
done
wait

for i in $(seq 1 "${MAX_JOB}")
do
  set -x
  python /neat-genreads/genReads.py \
    -r "${REFERENCE}" \
    -R 126 \
    -o "${G_PREFIX}" \
    --pe 300 30 \
    -t "${REGIONS}" \
    -c "${G_COV}" \
    -M 0 \
    --vcf --bam \
    --job "$i" "${MAX_JOB}" 2>&1 | tee "${G_PREFIX}_${i}.log" &
  set +x
done
wait
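Aside: with `cmd | tee log &` followed by a bare `wait`, a failed job is invisible twice over: the bare `wait` discards every job's exit status, and even a checked pipeline reports `tee`'s status rather than genReads.py's unless bash's `pipefail` option is set. A minimal sketch of collecting and checking each background job explicitly (toy subshells stand in for the genReads.py pipelines):

```shell
# Launch background jobs, remembering each PID; jobs 1 and 3 fail here
# as stand-ins for genReads.py workers that die part way through.
pids=""
for i in 1 2 3; do
  ( exit $(( i % 2 )) ) &
  pids="$pids $!"
done

# wait on each PID individually so a silent exit is actually counted.
failed=0
for pid in $pids; do
  wait "$pid" || failed=$(( failed + 1 ))
done
echo "failed=$failed"
```

In the real script, `set -o pipefail` (bash) would additionally make each `python … | tee …` pipeline report the Python side's failure instead of `tee`'s success.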

Output (before fail)

Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (126), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /refs/ensembl/ensembl_build_75.fa.fai
--------------------------------
reading input VCF...

Edit: This may be due to memory requirements. parseVCF contains the unclosed open() mentioned in a different issue; however, the failure still occurs with that fixed. The VCF file I'm using is 800 MB.
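If loading the whole 800 MB VCF is what exhausts memory, one possible workaround (a sketch, not part of genReads.py) is to pre-filter the VCF down to the chromosomes named in the BED before passing it in. The demo files below are tiny illustrative stand-ins for `regions.bed` and `cosmic_fix.vcf`:

```shell
# Tiny demo inputs so the sketch runs as-is; substitute real files in practice.
printf 'chr1\t100\t200\n' > regions.bed
printf '##fileformat=VCFv4.1\nchr1\t150\t.\tA\tT\t.\t.\t.\nchr9\t150\t.\tG\tC\t.\t.\t.\n' > cosmic_fix.vcf

# Pass 1 collects the BED chromosomes; pass 2 keeps VCF header lines plus
# only the records whose CHROM appears in the BED.
awk 'NR==FNR { chroms[$1]; next }
     /^#/    { print; next }
     $1 in chroms' regions.bed cosmic_fix.vcf > cosmic_subset.vcf

cat cosmic_subset.vcf
```

A smaller per-job VCF should shrink whatever parseVCF holds in memory, though it does not fix the underlying silent exit.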

muppetjones commented 7 years ago

I've run this a few times, and the only way to guarantee that no threads exit seems to be MAX_JOB=8 when running with an input VCF. I'm 90% certain it's a memory issue, as the threads start to drop when memory maxes out, but for some reason each job quits quietly (versus running without a VCF and too many jobs, which dies with a MemoryError).
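If it's the kernel OOM killer reaping the workers, that would explain the silence: the process gets SIGKILL and Python never gets the chance to raise MemoryError. On Linux the kill is logged, so something like this (may require root to read the kernel log) should confirm or rule it out:

```shell
# Look for OOM-killer entries in the kernel log (Linux-specific sketch).
dmesg 2>/dev/null | grep -iE 'out of memory|oom[- ]kill' | tail
```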

However, splitting the job into more parts and running only 8 at a time still works, but then you only get the regions covered by those 8 jobs. I'm assuming the region split is deterministic (otherwise it wouldn't work in the first place), so you could potentially split as finely as you like and run 8 jobs at a time; unfortunately, that negates much of the benefit of running regions in parallel.

I don't know if this was an issue before, and I just missed it, or if this is something new.

Joseph-Vineland commented 2 years ago

I had the same problem when calling gen_reads.py in parallel. It looks like everything is working, and then the jobs abort part way through without any message. It is easy to miss because you think everything ran to completion without error.

COMMAND:

find L*.vcf | sed 's/.vcf$//' | parallel 'python3.6 neat-genreads/gen_reads.py -M 0 -R 150 -o {} -v {}.vcf -c 1 --vcf -r ref.fa'
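Assuming memory pressure is the cause here too, GNU parallel can cap concurrency with `-j` and record every job's exit status with `--joblog`, which makes the silent deaths visible after the fact. A sketch based on the command above (the guard lets it degrade gracefully where parallel isn't installed):

```shell
# Requires GNU parallel; skip gracefully where it is not available.
command -v parallel >/dev/null 2>&1 || { echo "GNU parallel not found"; exit 0; }

find L*.vcf 2>/dev/null | sed 's/.vcf$//' | \
  parallel -j 8 --joblog genreads.joblog \
    'python3.6 neat-genreads/gen_reads.py -M 0 -R 150 -o {} -v {}.vcf -c 1 --vcf -r ref.fa'

# In the joblog, Exitval is column 7 and Signal is column 8; any row with a
# non-zero value in either died part way. Signal 9 (SIGKILL) is the OOM
# killer's signature.
awk 'NR > 1 && ($7 != 0 || $8 != 0)' genreads.joblog
```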