wtsi-hpag / Scaff10X

Pipeline for scaffolding and breaking a genome assembly using 10x genomics linked-reads
MIT License
20 stars 3 forks source link

Floating point exception in break10x #15

Open claumer opened 4 years ago

claumer commented 4 years ago

Hi there,

I'm running break10x on my cluster as follows:

/nfs/research1/marioni/claumer/Scaff10X/src/break10x -nodes 40 -gap 100 -reads 5 -score 20 -cover 50 -ratio 15 /nfs/research1/marioni/claumer/DalyG_HiFi/Scaff10x/DalyG_HiFi_q10_redbean_all.ctg.fa /nfs/research1/marioni/claumer/DalyG_HiFi/fwdrev_all/Scaff10x/DalyG_10x_R1.fastq /nfs/research1/marioni/claumer/DalyG_HiFi/fwdrev_all/Scaff10x/DalyG_10x_R2.fastq DalyG_wtdbg2_break10x DalyG_wtdbg2_break10x_breakpoints

I've installed using a conda environment built to use gcc 7.1.0.

The program finishes, gives a "successful complete" to LSF, but when I check stderr I see these last couple of lines:

[main] Version: 0.7.17-r1198-dirty [main] CMD: /nfs/research1/marioni/claumer/Scaff10X/src/scaff-bin/bwa mem -t 40 tarseq.fastq /nfs/research1/marioni/claumer/DalyG_HiFi/fwdrev_all/Scaff10x/DalyG_10x_R1.fastq /nfs/research1/marioni/claumer/DalyG_HiFi/fwdrev_all/Scaff10x/DalyG_10x_R2.fastq [main] Real time: 15635.842 sec; CPU: 211509.263 sec sh: line 1: 23754 Floating point exception/nfs/research1/marioni/claumer/Scaff10X/src/scaff-bin/scaff_barcode-cover -score 20 -cover 50 -ratio 15 align.length-sort break.dat cover.dat > break.out

And it looks like the breakpoints file is completely empty and the assembly that's output is the same as the input. So I assume this means break10x has failed. Can you advise on the cause and remedy?

Regards, Chris L

zning-sanger commented 4 years ago

Hi Chris,

Thanks for reporting errors on break10x. Please go to the working directory and send me the information of "ls -lrt"?

Best regards,

Zemin

claumer commented 4 years ago

Hi Zemin,

The tmp directory has:

-rw-r--r-- 1 claumer marioni 475390678 Feb 16 12:55 tarseq.fastq -rw-r--r-- 1 claumer marioni 16530 Feb 16 12:55 tarseq.tag -rw-r--r-- 1 claumer marioni 237691388 Feb 16 12:58 tarseq.fastq.bwt -rw-r--r-- 1 claumer marioni 59422826 Feb 16 12:58 tarseq.fastq.pac -rw-r--r-- 1 claumer marioni 16 Feb 16 12:58 tarseq.fastq.amb -rw-r--r-- 1 claumer marioni 19684 Feb 16 12:58 tarseq.fastq.ann -rw-r--r-- 1 claumer marioni 118845704 Feb 16 12:59 tarseq.fastq.sa -rw-r--r-- 1 claumer marioni 5717262408 Feb 16 18:41 align.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 align.sort4 -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 target-bcl-5.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 align.length-5 -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 target-bcl-5.freq -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 target-bcn-5.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 target-bcn-5.freq -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 break.out -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 align.length-sort -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 break.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 cover.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 tarseq.agp -rw-r--r-- 1 claumer marioni 60 Feb 16 18:45 break.out2 -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 break-all.dat -rw-r--r-- 1 claumer marioni 0 Feb 16 18:45 break-all.clean -rw-r--r-- 1 claumer marioni 475395688 Feb 16 18:45 break.fastq -rw-r--r-- 1 claumer marioni 54 Feb 16 18:45 try.out

One thought: I started with reads that had been already run through 10X's longranger basic program (which trims reads, error corrects barcodes, etc). Could it be that this broke break10x? Is it expecting raw reads?

zning-sanger commented 4 years ago

Thanks for the information.

Looks that there are files which are missing and also there are some files which shouldn't be there. Yes, you can use a bam file from longranger. Try this

~zn1/src/Scaff10X/src/break10x -nodes 2 -gap 100 -reads 5 -score 20 -cover 50 -ratio 15 -bam /lustre/scratch117/sciops/team117/hpag/zn1/project/bird/hummingbird/QC/refdata-cns_p_ctg/fasta/possorted_bam.bam genome.fa genome-break10x.fasta > try.out

Note: the bam file needs a full path.

Let me know how it goes. Cheers.

Zemin

claumer commented 4 years ago

Hi Zemin,

To clarify, I meant the unaligned fastqs that you get after running the "longranger basic" command. But I did actually try on the raw reads as well after I had this thought, and got the same issue as I originally wrote about above.

I'm trying to use Scaff10x on the EBI cluster, not the Sanger farm, so I had to scp the files you mentioned over. But I did this, and ran the command exactly as you suggested, and it looks like it finished without any major errors (although the genome-break10x.fasta is the same as the input).

try.out contains:

Input target assembly file: /nfs/research1/marioni/claumer/Scaff10X_debug/genome.fa /nfs/research1/marioni/claumer/Scaff10X/src/scaff-bin/samtools view /nfs/research1/marioni/claumer/Scaff10X_debug/possorted_bam.bam | awk '($4!=0)&&($2<100)&&($5>=0){print $1,$12,$2,$3,$4,$5}' | egrep -v '^@' > align0.dat /nfs/research1/marioni/claumer/Scaff10X/src/scaff-bin/scaff_bwa-barcode tarseq.tag align0.dat align.dat > try.out /nfs/research1/marioni/claumer/Scaff10X/src/scaff-bin/scaff_bwa -edge 1800000000 tarseq.tag align.dat align2.dat > try.out

And stderr contains:

sh: -c: line 0: syntax error near unexpected token (' sh: -c: line 0:mv break-all.name /nfs/research1/marioni/claumer/Scaff10X_debug/(null)'

Is this expected behavior? Would be keen to hear how you think we should progress from here in solving this issue.

-Chris L

zning-sanger commented 4 years ago

Hi Chris,

Sorry I have been away for two days and didn't reply your email in time.

With the run using possorted_bam.bam, could you check the sizes of the files in the temporary: align.sort4 align.length-5 align.length-sort

If the sizes for the above files are not zero, the size of break.dat is zero, this means there is no breakpoint found in the assembly.

My pipeline break10x has an issue with zero breakpoints and does not report results correctly. I will try to fix this when I have time. At the moment, if you do not see the file scaffolds-break.name and the input fasta and output fasta files are the same, this means no breakpoints.

Sorry again for the late reply.

Zemin

zning-sanger commented 4 years ago

If you only run "longranger basic", you won't get very much from the pipeline. I would suggest you do "longranger align" or use the fastq files.

claumer commented 4 years ago

Hi Zemin,

Sorry also on my part, I've been doing some intensive labwork the last couple days and let this fall to the wayside momentarily.

So the tmp directory contains:

(base) [claumer@noah-login-01 tmp_rununik_64971]$ ls -lt total 39047312 -rw-r--r-- 1 claumer marioni 56 Feb 17 19:16 try.out -rw-r--r-- 1 claumer marioni 0 Feb 17 19:15 break-all.name -rw-r--r-- 1 claumer marioni 2098433176 Feb 17 19:15 break.fastq -rw-r--r-- 1 claumer marioni 0 Feb 17 19:15 break-all.clean -rw-r--r-- 1 claumer marioni 62 Feb 17 19:15 break.out2 -rw-r--r-- 1 claumer marioni 0 Feb 17 19:15 break-all.dat -rw-r--r-- 1 claumer marioni 0 Feb 17 19:15 tarseq.agp -rw-r--r-- 1 claumer marioni 32 Feb 17 19:14 break.out -rw-r--r-- 1 claumer marioni 32170 Feb 17 19:14 cover.dat -rw-r--r-- 1 claumer marioni 0 Feb 17 19:12 break.dat -rw-r--r-- 1 claumer marioni 577798602 Feb 17 19:12 align.length-sort -rw-r--r-- 1 claumer marioni 9365 Feb 17 19:08 target-bcn-5.freq -rw-r--r-- 1 claumer marioni 152952247 Feb 17 19:08 target-bcn-5.dat -rw-r--r-- 1 claumer marioni 180294 Feb 17 19:07 target-bcl-5.freq -rw-r--r-- 1 claumer marioni 178996980 Feb 17 19:07 target-bcl-5.dat -rw-r--r-- 1 claumer marioni 577798602 Feb 17 19:06 align.length-5 -rw-r--r-- 1 claumer marioni 6793333019 Feb 17 19:04 align.sort4 -rw-r--r-- 1 claumer marioni 12966931205 Feb 17 18:38 align.dat -rw-r--r-- 1 claumer marioni 14382615454 Feb 17 18:34 align0.dat -rw-r--r-- 1 claumer marioni 2098422846 Feb 17 17:58 tarseq.fastq -rw-r--r-- 1 claumer marioni 39842 Feb 17 17:58 tarseq.tag

Which suggests from your comment that break10x worked properly on this dataset but did not find any linked-read supported breakpoints.

That implies that this install works, but for some reason my dataset is breaking break10x, right? I'm not sure quite what to do next - maybe I can share my reads and assemblies with you and you can examine the problem for yourself? I do know that it is a decent linked read dataset, with deep coverage and good numbers of links per molecule, although the molecule length is a bit low. I have tried it on both raw fastqs and the trimmed, barcode-in-read-name fastq format output by "longranger basic", and it has failed with both.

Thanks, Chris L

zning-sanger commented 4 years ago

Hi Chris,

Happy to have a look at your data if you could copy your files read fastqs and assembly to a Sanger location.

Best regards,

Zemin

claumer commented 4 years ago

OK, you can grab the files in question at:

/lustre/scratch117/cellgen/team220/cl16/DalyG_GyrF_10X

The fastqs starting with DalyG are the raw linked reads for the species in this assembly (ignore the GyrF files). The two assemblies I have been trying to break and then scaffold with Scaff10x are called primary_polished.fasta (a FALCON assembly) and DalyG_HiFi_q10_redbean_all.ctg.fa (a wtdbg2 assembly).

Regards, Chris L

zning-sanger commented 4 years ago

Hi Chris,

I have run scaff10x/break10x on your dataset and I have put all the results at

/lustre/scratch117/sciops/team117/hpag/zn1/project/DalyG

A few points:

  1. The accuracy of wtdbg2 assembly is notably higher than that produced by Falcon No breakpoints were found and also at basepair level, the BWA CPU time for the falcon assembly is more than twice for the wtdbg2 assembly. I would suggest that you use the wtdbg2 one.

  2. The genome size estimate by genoemscope is 257Mb and yours is not far away. The good news is that the hetz rate is low for this genome, see DalyG-genomescope.png.

  3. Your 10X reads are not in good quality, I mean very short barcode length, see barcode_length-wtdbg.png, barcode_length-falcon.png, wtdbg2-scaff.fasta.cov, falcon-scaff.fasta.cov. If I were you, I will add HiC reads for a chromosome scale assembly.

  4. Which assembly to use There are two wtdbg2 assemblies: wtdbg2-crossgenome-scaff.fasta - combined with the falcon assembly wtdbg2-scaff.fasta - just scaff10x on the wtdbg2 assembly.

  5. I also suggest that you run purge_dups to get rid of some duplicated contigs - a step is now adapted for most people:

https://github.com/dfguan/purge_dups

Let me know if you need further information.

Zemin

claumer commented 4 years ago

Hi Zemin,

That's really interesting, and I thank you very much for your work on these data. Yes, I'm aware of the length issue with the linked reads - it appears that, because the HMW DNA was stored at 4 C in low concentration for several weeks while a batch was being assembled (the library prep was not done by me, unfortunately), the DNA had a chance to degrade a bit. Still, definitely some utility in these linked reads.

I'll follow up separately more directly with you by email about these particular steps - and in a sense my issue is solved, Scaff10x works on my data at least on the Sanger cluster - but it would be good to know why the installation I made at EBI is not working on these data. What would you recommend trying next - is it possible that I need to use exactly the same GCC version you are, rather than gcc 7.1.0? I'd love to be able to reproduce what you've done on a different cluster.

Regards, Chris L