zstephens / neat-genreads

NEAT read simulation tools
Other
92 stars 27 forks source link

No output in Golden VCF #62

Closed senzhaocode closed 5 years ago

senzhaocode commented 5 years ago

Hei Zstephens,

I am using neat-genreads to create some synthetic WGS dataset. Here, I apply the parameter -v to generate user-defined mutations via a vcf file, see my command below: _python genReads.py -r /usit/abel/u1/senz/CHM-eval.kit/human_g1k_v37_decoy.fasta -R 150 -c 41 -M 0 -v /usit/abel/u1/senz/neat-genreads/tmp.vcf -o /usit/abel/u1/senz/CHM1_CMH13/ERR1341793simulate --bam --vcf --no-fastq --pe 354 73 --job 1 20

Very strange! No output in the Gold_VCF file, only BAM file. There is no error report in log file. I attach my input VCF file, not sure the format is OK? Can you help me to figure out.

_##fileformat=VCFv4.2

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

FORMAT=

FILTER=

FILTER=

FILTER=

FILTER=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT syndip

1 10927 . A G 30 GAP13 . GT 1/. 1 10931 . C T 30 GAP13 . GT 1/. 1 579576 . A AG 30 HET1;GAP13 . GT ./. 1 579732 . CA C 30 HET1;GAP13 . GT ./. 1 579846 . GA G 30 HET1;GAP13 . GT ./. 1 580069 . GA G 30 HET1;GAP13 . GT ./. 1 1254057 . CA C 30 . . GT 0/1 1 1254136 . G A 30 . . GT 1/1 1 1254436 . A G 30 . . GT 1/1 1 1254443 . G A 30 . . GT 1/1 1 1254841 . C G 30 . . GT 1/1 1 1255028 . G A 30 . . GT 1/0 1 1255287 . C T 30 . . GT 1/1 1 1255403 . TG T 30 . . GT 1/1 1 1256608 . G A 30 . . GT 0/1 1 1257593 . A G 30 . . GT 1/1 1 1258246 . A C 30 . . GT 1/1 1 1259429 . TG T 30 . . GT 0/1 1 1259529 . G A 30 . . GT 1/1 1 1259537 . G A 30 . . GT 1/1 1 1259546 . G A 30 . . GT 1/1 1 1259551 . A G 30 . . GT 1/1 1 1259552 . G GGAGCTGGGGGGCTGGGGGGCTGAGGGGCTCGGGGGCTGGGGGGCTGCTGGGCTGAGAGGCTGGGAGACTGGA 30 . . GT 1/1 1 1259559 . G T 30 . . GT 1/1 1 1259561 . A G 30 . . GT 1/1 1 1259569 . A G 30 . . GT 1/1 1 1260168 . C A 30 . . GT 1/1 1 1260733 . A C 30 . . GT 1/1 1 1261282 . AC A 30 . . GT 0/1 1 1261684 . TC T 30 . . GT 0/1 1 1261824 . G C 30 . . GT 1/1_

Thank you

zstephens commented 5 years ago

Greetings,

I ran your command line and successfully got the input variants in the output vcf:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1254057 .   CA  C   .   PASS    WP=0/1
1   1254136 .   G   A   .   PASS    WP=1/1
1   1254436 .   A   G   .   PASS    WP=1/1
1   1254443 .   G   A   .   PASS    WP=1/1
1   1254841 .   C   G   .   PASS    WP=1/1
1   1255028 .   G   A   .   PASS    WP=1/0
1   1255287 .   C   T   .   PASS    WP=1/1
1   1255403 .   TG  T   .   PASS    WP=1/1
1   1256608 .   G   A   .   PASS    WP=0/1
1   1257593 .   A   G   .   PASS    WP=1/1
1   1258246 .   A   C   .   PASS    WP=1/1
1   1259429 .   TG  T   .   PASS    WP=0/1
1   1259529 .   G   A   .   PASS    WP=1/1
1   1259537 .   G   A   .   PASS    WP=1/1
1   1259546 .   G   A   .   PASS    WP=1/1
1   1259551 .   A   G   .   PASS    WP=1/1
1   1259552 .   G   GGAGCTGGGGGGCTGGGGGGCTGAGGGGCTCGGGGGCTGGGGGGCTGCTGGGCTGAGAGGCTGGGAGACTGGA   .   PASS    WP=1/1
1   1260168 .   C   A   .   PASS    WP=1/1
1   1260733 .   A   C   .   PASS    WP=1/1
1   1261282 .   AC  A   .   PASS    WP=0/1
1   1261684 .   TC  T   .   PASS    WP=0/1
1   1261824 .   G   C   .   PASS    WP=1/1

The first 6 got skipped, because genReads doesn't like dots in the GT field. (I'm open to changing this behavior such that '.' gets treated at '0', though).

edit: I ran it without the --job parameter. I'll run it again to make sure that works too.

senzhaocode commented 5 years ago

Hi Zstephens:

Thank you for the reply. I figured out why there was no output in Gold_VCF file, because chromosome names in reference fasta file and these in -v vcf files do not match each other. Now I used the input VCF file (with around 67000 variant records) for generating simulation reads, unfortunately there is an error report which causes the running process crashed:

**-------------------------------- reading input VCF... found 67298 valid variants in input vcf.

It seems something related to SequenceContainer.py. I attached the reference fasta and input vcf file. Can you could try and see whether to recover the same problem. http://folk.uio.no/senz/chr22.fasta http://folk.uio.no/senz/chr22.vcf The command used for simulating reads: python genReads.py -r chr22.fasta -R 150 -c 10 -M 0 -v chr22.vcf -o ERR1341793_chr22 --bam --vcf --pe 354 73

senzhaocode commented 5 years ago

Hi Zstephens:

I found the reason why genReads.py was crashed during reads simulation. Because the user-defined VCF input file was assembled from PicBio Technology. It contains some VERY long indel variants (>150 bp), which does not make sense for generating illumina paired-end reads. After I removed these long indel variants from original vcf file, the simulation process went well. Zstephens, I suggest that you add some codes in the script vcfFunc.py and make control of potential presence of long indel variants. That will avoid some filtering work done by users.

Again, neat-genreads is very nice simulator even if it has such a small shortage.