xiaochuanle / MECAT

MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads
107 stars 26 forks source link

We have released a new version MECAT2. Please go and download that new version. This version will not be updated any more.

Contents

Introdction

MECAT is an ultra-fast Mapping, Error Correction and de novo Assembly Tools for single molecula sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than the state of art of aligners and error correction tools. MECAT can be used for effectively de novo assemblying large genomes. For example, on a 32-thread computer with 2.0 GHz CPU , MECAT takes 9.5 days to assemble a human genome based on 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. We also use MECAT to assemble a diploid human genome based on 102x SMRT data only in 25 days. The latter assembly leads a great improvement of quality to the previous genome assembled from the 54x haploid SMRT data. MECAT performance were compared with PBcR-Mhap pipeline, FALCON and Canu(v1.3) in five real datasets. The quality of assembled contigs produced by MECAT is the same or better than that of the PBcR-Mhap pipeline and FALCON. Here are some comparisons on the 32-thread computer with 2.0 GHz CPU and 512 GB RAM memory:

Genome Pipeline Thread number Total running time (h) NG50 of genome
E.coli FALCON 16 1.21 4,635,129
PBcR-MHAP 16 1.29 4,652,272
Canu 16 0.71 4,648,002
MECAT 16 0.24 4,649,626
Yeast FALCON 16 2.16 587,169
PBcR-MHAP 16 4.2 818,229
Canu 16 5.11 739,902
MECAT 16 0.91 929,350
A.thaliana FALCON 16 223.84 7,583,032
PBcR-MHAP 16 188.7 9,610,192
Canu 16 118.57 8,315,338
MECAT 16 10.73 12600961
D.melanogaster FALCON 16 140.72 15,664,372
PBcR-MHAP 16 101.22 13,627,256
Canu 16 69.34 14,179,324
MECAT 16 9.58 18,111,159
Human(54X) PBcR-MHAH(f) 32 5750 1,857,788
PBcR-MHAH(s) 32 13000 4,320,471
MECAT 32 230.54 4,878,957

MECAT consists of four modules:

MEAP is written in C, C++, and perl. It is open source and distributed under the GPLv3 license.

Installation

The current directory is /public/users/chenying/smrt_asm.

Quick Start

Using MECAT to assemble a genome involves 4 steps. Here we take assemblying the genome of Ecoli as an example, to go through each step in order. Options of each command will be given in next section.

Assemblying Pacbio Data

We download the reads ecoli_filtered.fastq.gz from the MHAP website. By decompressing it we obtain ecoli_filtered.fastq.


mecat2pw -j 0 -d ecoli_filtered.fastq -o ecoli_filtered.fastq.pm.can -w wrk_dir -t 16

mecat2cns -i 0 -t 16 ecoli_filtered.fastq.pm.can ecoli_filtered.fastq corrected_ecoli_filtered

extract_sequences corrected_ecoli_filtered.fasta corrected_ecoli_25x.fasta 4800000 25

mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.02 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -pacbio-corrected corrected_ecoli_25x.fasta

Assemblying Nanopore Data

Download MAP006-PCR-1_2D_pass.fasta.


mecat2pw -j 0 -d MAP006-PCR-1_2D_pass.fasta -o candidatex.txt -w wrk_dir -t 16 -x 1

mecat2cns -i 0 -t 16 -x 1 candidates.txt MAP006-PCR-1_2D_pass.fasta corrected_ecoli.fasta

extract_sequences corrected_ecoli.fasta corrected_ecoli_25x.fasta 4800000 25

mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.06 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -nanopore-corrected corrected_ecoli_25x.fasta

After step 4, the assembled genome is given in file ecoli/ecoli.contigs.fasta. Details of the contigs can be found in file ecoli/ecoli.layout.tigInfo.

Input Format

MECAT is capable of processing FASTA, FASTQ, and H5 format files. However, the H5 files must first be transfered to FASTA format by running DEXTRACTOR/dextract before running MECAT. For example:

find pathto/raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
while read line; do   dextract -v $line >> reads.fasta ; done <  reads.fofn

the extracted result should be the reads.fasta file for mecat's input file.

Program Descriptions

We describe in detail each module of MECAT, including their options and output formats.

mecat2pw

options

The command for running mecat2pw is


mecat2pw -j [task] -d [fasta/fastq] -w [working folder] -t [# of threads] -o [output] -n [# of candidates] -a [overlap size] -k [# of kmers] -g [0/1] -x [0/1]

The options are:

output format

If the job is detecting overlapping candidates, the results are output in can format, each result of which occupies one line and 9 fields:


[A ID] [B ID] [A strand] [B strand] [A gapped start] [B gapped start] [voting score] [A length] [B length]

mecat2pw outputs overlapping results in M4 format, of which one result is given in one line. The fileds of M4 format is given in order below:


[A ID] [B ID] [% identity] [voting score] [A strand] [A start] [A end] [A length] [B strand] [B start] [B end] [B length]

If the -g option is set to 1, two more fields indicating the extension starting points are given:


[A ID] [B ID] [% identity] [voting score] [A strand] [A start] [A end] [A length] [B strand] [B start] [B end] [B length] [A ext start] [B ext start]

In the strand field, 0 stands for the forward strand and 1 stands for the reverse strand. All the positions are zero-based and are based on the forward strand, whatever which strand the sequence is mapped. Here are some examples:


44 500 83.6617 30 0 349 8616 24525 0 1 10081 21813

353 500 83.2585 28 0 10273 18410 22390 1 0 10025 21813

271 500 80.4192 13 0 14308 19585 22770 1 4547 10281 21813

327 501 89.8652 117 0 10002 22529 22529 1 9403 21810 21811

328 501 90.8777 93 0 0 10945 22521 1 0 10902 21811

In the examples above, read 500 overlaps with reads 44, 353, 271, 327 and 328.

memory consumption

Before overlapping is conducted, the reads will be split into several chunks. Each chunk is about 2GB in size so that the overlapping can be run on a 8GB RAM computer.

mecat2ref

options

mecat2ref is used for mapping SMRT reads to the reference genomes. The command is


mecat2ref -d [reads] -r [reference] -w [folder] -t [# of threads] -o [output] -b [# of results] -m [output format] -x [0/1]

The meanings of each option are as follows:

output format

mecat2ref outputs results in one of the three formats: the ref format, the M4 format, and the SAM format.

For the ref format, each result occupies three lines in the form:


[read name] [ref name] [ref strand] [voting score] [read start] [read end] [read length] [ref start] [ref end]

mapped read subsequence

mapped reference subsequence

The strands of the reads are always forward. In the [ref strand] field, F indicates forward strand while R indicates reverse strand. All the positions are zero-based and relative to the forward strand. Here is an example:


1   gi|556503834|ref|NC_000913.3|   F   10  2   58  1988134 1988197

AAT-AGCGCCTGCCAGGCG-TCTTTT--CCGGCCATTGT-CGCAG--CACTGTAACGCGTAAAA

AATTAGCGCCTGCCAGGCGGTCTTTTTTCCGGCCATTGTTCGCAGGG-ACTGTAACGCGTAAAA

In this example, read 1 is mapped to the reference gi|556503834|ref|NC_000913.3|.

memory consumption

mecat2cns

mecat2cns is an adaptive error correction tool for high-noise single-molecula sequencing reads. It is as accurate as pbdagcon and as fast as FalconSense. Inputs to mecat2cns can be either can format or M4 format. The command for running mecat2cns is


mecat2cns [options] overlaps-file reads output

The options are

If x is 0, then the default values for the other options are:

-i 1 -t 1 -p 100000 -r 0.9 -a 2000 -c 6 -l 5000

If x is 1, then the default values for the other options are:

-i 1 -t 1 -p 100000 -r 0.4 -a 400 -c 6 -l 2000

If the inputs are M4 format, the overlap results in [overlaps-file] must contain the gapped extension start point, which means the option -g in mecat2pw must be set to 1, otherwise mecat2cns will fail to run. Also note that the memory requirement of mecat2cns is about 1/4 of the total size of the reads. For example, if the reads are of total size 1GB, then mecat2cns will occupy about 250MB memory.

output format

The corrected sequences are given in FASTA format. The header of each corrected sequence consists of three components seperated by underlines:


>A_B_C_D

where

by effective position we mean the position in the original sequence that is covered by at least c (the argument to the option -c) reads.

extract_sequences

extract_sequences was applied into extract 25X 0r 40X longest sequences from the corrected data. The command is


extract_sequences [the input fasta file from mecat2cns] [the output filename] [genome size]  [coverage]

mecat2canu

mecat2canu is a modified and more efficient version of the Canu pipeline. mecat2canu accelerates canu by replacing its overlapper mhap by mecat2asmpw, which is a customized version of mecat2pw. The options of mecat2canu are the same as canu except its overlapper is replaced by mecat2asmpw. The command for assemblying corrected Pacbio reads is


mecat2canu -d [working-folder] -p [file-prefix] -trim-assemble errorRate=[fraction error] \

    -overlapper=mecat2asmpw genomeSize=[genome size] \

    maxMemory=[host memory size] maxThreads=[# of CPU threads] usedGrid=0 \

    -pacbio-corrected reads-name

The command for assemblying corrected Nanopore reads is


mecat2canu -d [working-folder] -p [file-prefix] -trim-assemble errorRate=[fraction error] \

    -overlapper=mecat2asmpw genomeSize=[genome size] \

    maxMemory=[host memory size] maxThreads=[# of CPU threads] usedGrid=0 \

    -nanopore-corrected reads-name

After assembling, the results are given in the folder working-folder. The assembled genome is given in the file working-folder/file-prefix.contigs.fasta and the details of the contigs are given in the file working-folder/file-prefix.layout.tigInfo.

Citation

Chuan-Le Xiao, Ying Chen, Shang-Qian Xie, Kai-Ning Chen, Yan Wang, Yue Han, Feng Luo, Zhi Xie. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nature Methods, 2017, 14: 1072-1074

Contact

Update Information

Updates in MECAT V1.3 (2017.12.18):

Updates in MECAT V1.2 (2017.5.22):

MECAT v1.1 replaced the old MECAT,some debug were resolved and some new fuctions were added: