wtsi-hpag / Scaff10X

Pipeline for scaffolding and breaking a genome assembly using 10x genomics linked-reads
MIT License

No scaffolding being done on the Scaff10xV4 #8

Open harish0201 opened 5 years ago

harish0201 commented 5 years ago

Hi,

I have been trying to scaffold PacBio contigs with scaff10x, using the latest release (version 4, uploaded last month).

I can see that the align* series of files are empty, but the tool neither throws an error nor does any scaffolding. Please see the attached screenshot.
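For anyone hitting the same silent failure, here is a minimal sketch for spotting zero-byte intermediate files in the working directory. The function name `find_empty_outputs` and the default `align*` pattern are illustrative choices based on the file names mentioned above. They are not part of Scaff10X.

```python
from pathlib import Path


def find_empty_outputs(workdir, pattern="align*"):
    """Return the sorted names of zero-byte files in workdir matching pattern.

    The 'align*' default mirrors the intermediate files named in this thread;
    it is an assumption, not something Scaff10X exposes.
    """
    return sorted(
        p.name
        for p in Path(workdir).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    import sys

    workdir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in find_empty_outputs(workdir):
        print(f"empty intermediate file: {name}")
```

Running it against the scaffolding working directory right after the alignment step should show whether the pipeline stopped producing data before the scaffolding stage.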

It seems that during the update the tool names in the scaff-bin folder were changed, but the main script still calls them by their older names, which causes the failure. (I observed this for bwa.)

This is the command I have been using:

~/Tools/Scaff10X/src/scaff10x -nodes 48 -longread 1 -gap 100 -matrix 2000 -reads 10 -score 20 -edge 50000 -link 8 -block 50000 final_MP_scaff.fa genome-BC2_1.fastq.gz genome-BC2_2.fastq.gz out.scaffolds.fa &

I'm also unsure whether this is because the genome is fragmented. Please see the stats below:

file      final_MP_scaff.fa
num_seqs  20896
sum_len   896879817
min_len   2002
avg_len   42921.1
max_len   1276294
Q1        14596
Q2        21642
Q3        37521
sum_gap   1782762
N50       86627

Edit: On closer look, it seems that for some reason the bwa indices are not being created.

EdHarry commented 5 years ago

I've just pushed an update which should have better error handling. Could you try out the latest version?

BTW, unless you want to keep the barcode processed reads, it will be more efficient to list your 10x fq files in an input.fofn text file and pass that in with -data input.fofn.

harish0201 commented 5 years ago

Thank you for the update!

I did try listing the files in the fofn and passing that in. I then broke the run down into its individual steps to see where the errors occurred!

I have manually generated the bwa indices and I'm currently running the steps by hand to troubleshoot further where the issues are.
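As a rough sketch of that manual step: `bwa index` writes five files next to the reference (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), so a wrapper can check for them and rebuild the index only when something is missing. The helper names below are hypothetical, and the sketch assumes a `bwa` executable is on the PATH (or that the path to the bundled copy in scaff-bin is passed in).

```python
from pathlib import Path
import subprocess

# File suffixes written by `bwa index <reference>`.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(reference):
    """Return the names of index files that are absent or empty."""
    ref = Path(reference)
    missing = []
    for suffix in BWA_INDEX_SUFFIXES:
        index_file = Path(str(ref) + suffix)
        if not index_file.is_file() or index_file.stat().st_size == 0:
            missing.append(index_file.name)
    return missing


def ensure_bwa_index(reference, bwa="bwa"):
    """Build the index if incomplete and fail loudly if it is still missing.

    `bwa` is the executable to call; the default assumes it is on the PATH.
    """
    if missing_bwa_index_files(reference):
        subprocess.run([bwa, "index", str(reference)], check=True)
    still_missing = missing_bwa_index_files(reference)
    if still_missing:
        raise RuntimeError(f"bwa index incomplete, missing: {still_missing}")
```

Calling `ensure_bwa_index("tarseq.fastq")` before the `bwa mem` step would turn a silently missing index into an explicit error. The file name here is taken from the log further down the thread.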

I'll keep you updated.

mwhj commented 5 years ago

I have the same problem! I've been pulling my hair out thinking I've been doing something wrong. It would be really brilliant to get this fixed.

harish0201 commented 5 years ago

Alright, I finally have a 10X dataset that I need to scaffold. These are the errors on the latest commit:

[main] CMD: /home/harish/Tools/Scaff10X/src/scaff-bin/bwa mem -t 30 tarseq.fastq /data/analysis/dr.n.v.singh/genome/10X/scaff/genome-BC2_1.fastq.gz /data/analysis/dr.n.v.singh/genome/10X/scaff/genome-BC2_2.fastq.gz
[main] Real time: 43975.365 sec; CPU: 1319572.223 sec
sh: -c: line 0: syntax error near unexpected token('
sh: -c: line 0: mv genome.fasta /data/analysis/plant/genome/10X/scaff/(null)'
Error running command: mv genome.fasta /data/analysis/plant/genome/10X/scaff/(null)
Input target assembly file2: /data/analysis/plant/genome/10X/scaff/polished.fa
Input read1 file: /data/analysis/plant/genome/10X/scaff/genome-BC2_1.fastq.gz
Input read2 file: /data/analysis/plant/genome/10X/scaff/genome-BC2_2.fastq.gz
/home/harish/Tools/Scaff10X/src/scaff-bin/bwa mem -t 30 tarseq.fastq /data/analysis/plant/genome/10X/scaff/genome-BC2_1.fastq.gz /data/analysis/plant/genome/10X/scaff/genome-BC2_2.fastq.gz | egrep tarseq_ | awk '($2<100)&&($5>=0){print $1,$2,$3,$4,$5}' | egrep -v '^@' > align.dat
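To make it easier to check by hand what should end up in align.dat, here is a small Python sketch of what the `egrep | awk | egrep -v` part of that pipeline does to well-formed SAM records: keep lines mentioning `tarseq_`, drop header lines, require FLAG (column 2) below 100 and MAPQ (column 5) of at least 0, and print the first five columns. The function name and parameters are illustrative only. This is not how Scaff10X implements the step internally.

```python
def filter_alignments(sam_lines, target_tag="tarseq_", max_flag=100, min_mapq=0):
    """Mimic: egrep tarseq_ | awk '($2<100)&&($5>=0){print $1,$2,$3,$4,$5}' | egrep -v '^@'.

    Returns space-separated 'QNAME FLAG RNAME POS MAPQ' strings.
    Lines whose FLAG or MAPQ are not integers are skipped (an assumption
    made here for simplicity; awk would fall back to string comparison).
    """
    kept = []
    for line in sam_lines:
        line = line.rstrip("\n")
        # Skip SAM header lines and records that never mention the target.
        if line.startswith("@") or target_tag not in line:
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            flag, mapq = int(fields[1]), int(fields[4])
        except ValueError:
            continue
        if flag < max_flag and mapq >= min_mapq:
            kept.append(" ".join(fields[:5]))
    return kept
```

If `bwa mem` runs but this filter returns nothing on its output, the reads are probably not aligning to the `tarseq_` targets at all, which would be consistent with an empty align.dat.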

It still generates a genome.fasta, which seems to be the post-scaffolding output, but there is very little reduction, which is kind of curious. I'll have to check against the older version.

Regarding retaining the debarcoded reads: I plan to use them for polishing as well :) Thanks for the heads-up.