zyndagj / BSMAPz

Updated and optimized fork of BSMAP

Segmentation Fault when Loading Reference #27

Closed madeluis closed 4 years ago

madeluis commented 4 years ago

Hello,

I am trying to run BSMAPz with some watermelon data. The reference genome that I am using can be found here:

ftp://cucurbitgenomics.org/pub/cucurbit/genome/watermelon/WCG/v2/

My strong preference is to use the chromosome FASTA file as my reference (WCG_genome_v2.fa, from the link above). However, using this file leads to a segmentation fault. If I use the scaffold reference (WCG_scaffold_v2.fa), though, my command runs just fine. My first guess was that this was caused by a memory issue (i.e., in the first case, the program tries to load the entire chromosome at once and cannot allocate that much memory, which is not a problem with the scaffolds due to their smaller size). However, I find this hard to believe since I am using an instance with 768 GB of memory, and the problem persists even if I run it with one core.

Any insights into this would be greatly appreciated (command and error below).

Thank you very much! Angels

Command: bsmapz -a ./data/SRR6328781_1.fastq -b ./data/SRR6328781_2.fastq -d ./data/WCG_genome_v2.fa -o SRR6328781.bam -p 8 -A AGATCGGAAGAGC -w 100 -r 0 -q 10

Error: loading reference file: ./data/WCG_genome_v2.fa (format: FASTA) Segmentation fault (core dumped)

zhangaicen commented 4 years ago

Hi, I am running into the same problem. Have you resolved it?

zyndagj commented 4 years ago

Hello,

I was able to reproduce your problem

$ bsmapz -a SRR6328781_1.fastq -b SRR6328781_2.fastq -d WCG_genome_v2.fa -o SRR6328781.bam -p 64 -A AGATCGGAAGAGC -w 100 -r 0 -q 10
[bsmapz] @Wed Aug  5 15:11:59 2020      loading reference file: WCG_genome_v2.fa        (format: FASTA)
Segmentation fault

and solved it by splitting the reference sequence into lines of 70 characters with fasta_formatter (from the FASTX-Toolkit).

$ fasta_formatter -i WCG_genome_v2.fa -o WCG_genome_v2_70w.fa  -w 70
$ ./bsmapz -a SRR6328781_1.fastq -b SRR6328781_2.fastq -d WCG_genome_v2_70w.fa -o SRR6328781.bam -p 64 -A AGATCGGAAGAGC -w 100 -r 0 -q 10
[bsmapz] @Wed Aug  5 15:14:40 2020      loading reference file: WCG_genome_v2_70w.fa    (format: FASTA)
[bsmapz] @Wed Aug  5 15:14:57 2020      12 reference seqs loaded, total size 404611775 bp. 17 secs passed
[bsmapz] @Wed Aug  5 15:15:11 2020      create seed table. 31 secs passed
[bsmapz] @Wed Aug  5 15:15:11 2020      Pair-end alignment(64 threads),
        Input read file #1: SRR6328781_1.fastq  (format: FASTQ)
        Input read file #2: SRR6328781_2.fastq  (format: FASTQ)
        Output file: SRR6328781.bam      (format: SAM, automatically convert to BAM)

You can use a width other than 70; BSMAPz just can't handle a whole chromosome on a single line.
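
If fasta_formatter is not installed, the same re-wrapping can be done with a few lines of Python. This is only a minimal sketch, not part of BSMAPz: the names `wrap_fasta` and `rewrap_file` are made up for illustration, and it assumes a plain-text (uncompressed) FASTA file whose records each fit in memory.

```python
from typing import Iterable, Iterator


def wrap_fasta(lines: Iterable[str], width: int = 70) -> Iterator[str]:
    """Yield FASTA lines with sequence data re-wrapped to `width` columns.

    Header lines (starting with '>') are passed through unchanged.
    Hypothetical helper written for this example.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    carry = ""  # sequence characters not yet emitted for the current record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if carry:
                # Flush the short last line of the previous record.
                yield carry
                carry = ""
            yield line
        else:
            carry += line
            # Emit every complete `width`-sized chunk, keep the remainder.
            full = len(carry) - len(carry) % width
            for i in range(0, full, width):
                yield carry[i:i + width]
            carry = carry[full:]
    if carry:
        yield carry


def rewrap_file(src: str, dst: str, width: int = 70) -> None:
    """Read the FASTA file `src` and write a re-wrapped copy to `dst`."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in wrap_fasta(fin, width):
            fout.write(line + "\n")
```

For the files in this thread, the call would be `rewrap_file("WCG_genome_v2.fa", "WCG_genome_v2_70w.fa", width=70)`, after which the re-wrapped file is passed to `-d` as in the command above.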

zhangaicen commented 4 years ago

Good, I've got it. Thanks a lot!