wtsi-hpag / Scaff10X

Pipeline for scaffolding and breaking a genome assembly using 10x genomics linked-reads
MIT License

run fails at scaff_matrix #13

Closed dcopetti closed 5 years ago

dcopetti commented 5 years ago

Hello, I was able to run the alignment step, then the run died at the scaff_matrix step. This is the stdout:

[...]
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 569628 reads in 2053.558 CPU sec, 114.390 real sec
[main] Version: 0.7.17-r1198-dirty
[main] CMD: /data/dario/bin/Scaff10X/src/scaff-bin/bwa mem -p -t 18 tarseq.fastq -
[main] Real time: 321077.765 sec; CPU: 5706537.235 sec
sh: line 1: 14303 Segmentation fault      (core dumped) /data/dario/bin/Scaff10X/src/scaff-bin/scaff_matrix -file 1 -matrix 2000 -link 10 -uplink 50 -longread 0 barcodes.clean tarseq.tag contig.dat > scaff.out
Error running command: /data/dario/bin/Scaff10X/src/scaff-bin/scaff_matrix -file 1 -matrix 2000 -link 10 -uplink 50 -longread 0 barcodes.clean tarseq.tag contig.dat > scaff.out

and this is the content of the tmp folder:

total 59G
-rw-rw----+ 1 copettid mpb  64M Jul 18 17:16 tarseq.tag
-rw-rw----+ 1 copettid mpb 9.5G Jul 18 17:16 tarseq.fastq
-rw-rw----+ 1 copettid mpb 4.8G Jul 18 18:12 tarseq.fastq.bwt
-rw-rw----+ 1 copettid mpb 1.2G Jul 18 18:14 tarseq.fastq.pac
-rw-rw----+ 1 copettid mpb  76M Jul 18 18:14 tarseq.fastq.ann
-rw-rw----+ 1 copettid mpb 2.4M Jul 18 18:14 tarseq.fastq.amb
-rw-rw----+ 1 copettid mpb 2.4G Jul 18 18:41 tarseq.fastq.sa
-rw-rw----+ 1 copettid mpb  31G Jul 22 11:52 align.dat
-rw-rw----+ 1 copettid mpb 3.3G Jul 22 11:55 align2.dat
-rw-rw----+ 1 copettid mpb 3.3G Jul 22 11:57 align.sort
-rw-rw----+ 1 copettid mpb 3.0G Jul 22 11:57 align.sort2
-rw-rw----+ 1 copettid mpb  73M Jul 22 11:58 barcodes.clust
-rw-rw----+ 1 copettid mpb   32 Jul 22 11:58 try.out
-rw-rw----+ 1 copettid mpb  54M Jul 22 11:58 barcodes.clean
-rw-rw----+ 1 copettid mpb  649 Jul 22 11:58 scaff.out

Is the script maybe looking for the contig.dat file and can't find it? If we find the issue, is it possible to restart from after the alignment step? It took a long time and I think we have the right files. Thanks, Dario

zning-sanger commented 5 years ago

Hi,

Can you send me a few lines (10) of align.dat?

Thanks,

Zemin

dcopetti commented 5 years ago

Hi, Here they are:

ST-E00273:278:HGCTHALXX:7:1101:5518:1520_NAGACCCGTCATCCCT 83 tarseq_5065 55396 60
ST-E00273:278:HGCTHALXX:7:1101:5923:1520_NTGGGTAAGGCCAGAT 99 tarseq_1249 255777 58
ST-E00273:278:HGCTHALXX:7:1101:5944:1520_NAGGTGCAGATTCACC 99 tarseq_9573 47884 16
ST-E00273:278:HGCTHALXX:7:1101:6614:1520_NCGTGTGCAGGGCTTC 83 tarseq_6694 28673 60
ST-E00273:278:HGCTHALXX:7:1101:6898:1520_NGTAAGCGTTCGATAC 99 tarseq_2865 237819 22
ST-E00273:278:HGCTHALXX:7:1101:7080:1520_NAGATCCGTGAATATG 83 tarseq_618 503027 60
ST-E00273:278:HGCTHALXX:7:1101:7101:1520_NAGATCCGTGAATATG 83 tarseq_618 503027 59
ST-E00273:278:HGCTHALXX:7:1101:7182:1520_NTAAGTGAGAGAGATG 83 tarseq_128454 1227 34
ST-E00273:278:HGCTHALXX:7:1101:7344:1520_NTAATGCCAGAGGCAT 83 tarseq_289234 311 1
ST-E00273:278:HGCTHALXX:7:1101:8055:1520_NGCCAAGAGACGCAAC 83 tarseq_14550 8793 60
ST-E00273:278:HGCTHALXX:7:1101:8176:1520_NCAATGAGTCTGTCCT 99 tarseq_656765 266 5
ST-E00273:278:HGCTHALXX:7:1101:8603:1520_NTCCATTGTCTCGGCA 83 tarseq_1114 42231 60
ST-E00273:278:HGCTHALXX:7:1101:9049:1520_NCGCAACGTAGCAAAT 99 tarseq_2370 17744 59
ST-E00273:278:HGCTHALXX:7:1101:9211:1520_NGCAACGAGACCTGTT 99 tarseq_1649 338444 40
ST-E00273:278:HGCTHALXX:7:1101:9678:1520_NCAACAAAGGACGTAC 99 tarseq_10062 70002 48
ST-E00273:278:HGCTHALXX:7:1101:10368:1520_NTCACACTCGGATCCG 99 tarseq_40764 623 45
ST-E00273:278:HGCTHALXX:7:1101:10835:1520_NTTGCTACACTTGTTT 99 tarseq_5559 178790 60
ST-E00273:278:HGCTHALXX:7:1101:10937:1520_NGGGTGTGTGCGTGCT 83 tarseq_9790 115193 60

zning-sanger commented 5 years ago

The file looks fine to me.

How much RAM do you have in your computer?

dcopetti commented 5 years ago

126 GB: too little? I could run it on a larger machine, if you can estimate how much memory I will need it would be great. My assembly is about 5 Gb, the raw data is this big:

-rw-r--r--+ 1 copettid mpb 37G Jun 22 2018 4606-KCB-0001_S6_L006_R2_001.fastq.gz
-rw-r--r--+ 1 copettid mpb 41G Jun 22 2018 4606-KCB-0001_S6_L007_R2_001.fastq.gz
-rw-r--r--+ 1 copettid mpb 33G Jun 22 2018 4606-KCB-0001_S6_L006_R1_001.fastq.gz
-rw-r--r--+ 1 copettid mpb 36G Jun 22 2018 4606-KCB-0001_S6_L007_R1_001.fastq.gz

zning-sanger commented 5 years ago

For a 3 Gb genome with 30X coverage, it needs about 50-60 GB of RAM. Given your genome size and read coverage, this could be due to memory. We have done some 4 Gb genomes, but my colleagues ran the pipeline, so I don't know how much RAM was used. I would suggest that you move to a larger computer with at least 512 GB. I would be grateful if you could record the memory usage and let me know. I might need to spend some time reducing memory usage. Thanks, Zemin
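
One way to record the memory usage asked for here is to launch the step from a small wrapper and read the peak resident set size afterwards. The following is only a sketch, not part of Scaff10X: the function name `run_and_report_peak_rss` is made up for illustration, and it relies on Python's standard `resource` module, so it works on Unix-like systems only.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd to completion and return (exit code, peak RSS in bytes).

    The value is the largest peak RSS among all child processes this
    Python process has waited for so far, not a sum over children.
    """
    completed = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return completed.returncode, usage.ru_maxrss * scale
```

The command would be passed as a list of arguments, for example the `scaff_matrix` call from the log above split into its words. On Linux, GNU `time -v` reports the same figure as "Maximum resident set size" without any scripting.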

zning-sanger commented 5 years ago

The number of contigs is a major factor in determining memory usage, as I use a matrix to track barcodes. If the assembly is not from long reads, it would have problems.
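
To get a feel for why the contig count matters, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, a dense square contig-by-contig matrix with 4-byte cells; the actual layout inside `scaff_matrix` is not described in this thread and may well differ. The figure 656,765 is simply the highest contig index visible in the align.dat excerpt above (`tarseq_656765`), and treating it as a lower bound on the number of contigs is also an assumption.

```python
def dense_matrix_bytes(n_contigs, bytes_per_cell=4):
    """Bytes needed for a dense n_contigs x n_contigs matrix (assumed layout)."""
    return n_contigs * n_contigs * bytes_per_cell


# Memory grows with the square of the contig count under this assumption.
for n in (1_000, 100_000, 656_765):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>9,} contigs -> {gigabytes:,.1f} GB")
```

Under these assumptions, going from a long-read assembly with a few thousand contigs to one with hundreds of thousands of contigs changes the requirement by several orders of magnitude, which fits the remark about non-long-read assemblies.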

dcopetti commented 5 years ago

The number of contigs may be the issue then: I have an N50 of 200 kb, but many, many small contigs. OK, I give up on this tool then. Thanks for the support!
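
A good N50 and a very large contig count can coexist, because N50 only describes the contigs that hold half of the assembled bases. The sketch below uses made-up toy lengths (not data from this assembly) to show the effect; `n50` is a plain implementation of the usual definition.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: 10 long contigs plus 1,000 short ones.
toy_lengths = [200_000] * 10 + [500] * 1_000
print("contigs:", len(toy_lengths), "N50:", n50(toy_lengths))
```

Here the short contigs make up more than 99% of the sequences but do not move the N50, so the N50 alone says little about how many contigs a tool has to track.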