yangao07 / TideHunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
MIT License
27 stars 4 forks source link
long-reads multiple-sequence-alignment partial-order-alignment seed-and-chain tandem-repeats

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain

Latest Release Github All Releases BioConda Install Published in Bioinformatics GitHub Issues Build Status License

Updates (v1.5.5)

Getting started

Download the latest release:

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz && cd TideHunter-v1.5.5

Make from source and run with test data:

make; ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa

Or, install via conda and run with test data:

conda install -c bioconda tidehunter
TideHunter ./test_data/test_50x4.fa > cons.fa

Table of Contents

Introduction

TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool which is designed for tandemly repeated long-read sequence (INC-seq, R2C2, NanoAmpli-Seq).

It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and does not have any limitation of the maximal repeat pattern size.

Installation

Installing TideHunter via conda

On Linux/Unix and Mac OS, TideHunter can be installed via

conda install -c bioconda tidehunter

Building TideHunter from source files

You can also build TideHunter from source files. Make sure you have gcc (>=6.4.0) and zlib installed before compiling. It is recommended to download the latest release of TideHunter from the release page.

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5.tar.gz
tar -zxvf TideHunter-v1.5.5.tar.gz
cd TideHunter-v1.5.5; make

Or, you can use git clone command to download the source code. Don't forget to include the --recursive to download the codes of abPOA. This gives you the latest version of TideHunter, which might be still under development.

git clone --recursive https://github.com/yangao07/TideHunter.git
cd TideHunter; make

Pre-built binary executable file for Linux/Unix

If you meet any compiling issue, please try the pre-built binary file:

wget https://github.com/yangao07/TideHunter/releases/download/v1.5.5/TideHunter-v1.5.5_x64-linux.tar.gz
tar -zxvf TideHunter-v1.5.5_x64-linux.tar.gz

Getting started with toy example in test_data

TideHunter ./test_data/test_1000x10.fa > cons.fa

Usage

To generate consensus sequences in FASTA format

TideHunter ./test_data/test_1000x10.fa > cons.fa

To generate consensus sequences in tabular format

TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out

To generate consensus sequences in FASTQ format

TideHunter -f 3 ./test_data/test_1000x10.fa > cons.fq

To generate full-length consensus sequences

TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa

To generate unit sequences in FASTA format

TideHunter -u ./test_data/test_1000x10.fa > unit.fa

To generate unit sequences in tabular format

TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out

Commands and options

Usage:   TideHunter [options] in.fa/fq > cons.fa

Options:
  Seeding:
    -k --kmer-length INT    k-mer length (no larger than 16) [8]
    -w --window-size INT    window size, set as >1 to enable minimizer seeding [1]
    -H --HPC-kmer           use homopolymer-compressed k-mer [False]
  Tandem repeat criteria:
    -c --min-copy    INT    minimum copy number of tandem repeat (>=2) [2]
    -e --max-diverg  INT    maximum allowed divergence rate between two consecutive repeats [0.25]
    -p --min-period  INT    minimum period size of tandem repeat (>=2) [30]
    -P --max-period  INT    maximum period size of tandem repeat (<=4294967295) [10K]
  Scoring parameters for partial order alignment:
    -M --match    INT       match score [2]
    -X --mismatch INT       mismatch penalty [4]
    -O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]
    -E --gap-ext  INT(,INT) gap extension penalty (E1,E2) [2,1]
                            TideHunter provides three gap penalty modes, cost of a g-long gap:
                            - convex (default): min{O1+g*E1, O2+g*E2}
                            - affine (set O2 as 0): O1+g*E1
                            - linear (set O1 as 0): g*E1
  Adapter sequence:
    -5 --five-prime  STR    5' adapter sequence (sense strand) [NULL]
    -3 --three-prime STR    3' adapter sequence (anti-sense strand) [NULL]
    -a --ada-mat-rat FLT    minimum match ratio of adapter sequence [0.80]
  Output:
    -o --output      STR    output file [stdout]
    -m --min-len     INT    only output consensus sequence with min. length of [30]
    -r --min-cov  FLOAT|INT only output consensus sequence with at least R supporting units for all bases: [0.00]
                            if r is fraction: R = r * total copy number
                            if r is integer: R = r
    -u --unit-seq           only output unit sequences of each tandem repeat, no consensus sequence [False]
    -l --longest            only output consensus sequence of tandem repeat that covers the longest read sequence [False]
    -F --full-len           only output full-length consensus sequence. [False]
                            full-length: consensus sequence contains both 5' and 3' adapter sequence
                            *Note* only effective when -5 and -3 are provided.
    -s --single-copy        output additional single-copy full-length consensus sequence. [False]
                            *Note* only effective when -F is set and -5 and -3 are provided.
    -f --out-fmt     INT    output format [1]
                            - 1: FASTA
                            - 2: Tabular
                            - 3: FASTQ
                            - 4: Tabular with quality score
                              for [3] and [4], qualiy score of each base represents the ratio of the consensus coverage to the # total copies.
  Computing resource:
    -t --thread      INT    number of threads to use [4]

  General options:
    -h --help               print this help usage information
    -v --version            show version number

Input

TideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.

Adapter sequence

Additional adapter sequence files can be provided to TideHunter with -5 and -3 options.

TideHunter uses adapter information to search for the full-length sequence from the generated consensus.

Once two adapters are found, TideHunter trims and reorients the consensus sequence.

Output

TideHunter can output consensus sequence in FASTA format by default, it can also provide output in tabular format.

Tabular format

For tabular format, 9 columns will be generated for each consensus sequence:

No. Column name Explanation
1 readName the original read name
2 repN N is the ID number of the tandem repeat, within each read, starts from 0
3 copyNum copy number of the tandem repeat
4 readLen length of the original long read
5 start start coordinate of the tandem repeat, 1-based
6 end end coordinate of the tandem repeat, 1-based
7 consLen length of the consensus sequence
8 aveMatch average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length)
9 fullLen 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length
10 subPos start coordinates of all the tandem repeat unit sequence, followed by the end coordinate of the last tandem repeat unit sequence, separated by ,, all coordinates are 1-based, see examples below
11 consSeq consensus sequence

For example, here are the output for a non-full-length consensus sequence generated from test_data/test_50x4.fa and the adiagram that illustrates all the coordiantes in the output:

test_50x4 rep0  4.0 300 51  250 50  100.0 0 59,109,159,208  CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT

In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], [159,208], from the raw read which is 300 bp long. A consensus sequence with 50 bp is generated from the three repeat units. TideHunter further extends the tandem repeat boundary to [51, 250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.

Another example of the output for a full-length consensus sequence generated from test_data/full_length.fa:

8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19  rep0  8.8  5231  31  5215  203 95.7  1 207,798,1386,1976,2563,3155,3746,4333,4930  ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA

In this example, the consLen (i.e., 203) is the length of the full-length consensus sequence excluding the 5' and 3' adapter sequences and the subPos (i.e., 207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinate information of the identified tandem repeat units.

FASTA format

For FASTA output format, the read name and the comment provide detailed information of the detected tandem repeat, i.e., the above columns 1 \~ 10. The sequence is the consensus sequence.

The read name and comment of each consensus sequence have the following format:

>readName_repN_copyNum readLen_start_end_consLen_aveMatch_fullLen_subPos

FASTQ format

For FASTQ output format, the read name and comment are the same as described in FASTA format. TideHunter calculated a customized Phred score as the base quality score of each consensus base:

Here, is the Sigmoid-smoothed consensus calling error rate for each base:

is the Sigmoid function:

is the coverage of the consensus base and is the number of total copies. For example, if one base of the consensus sequence has 4 supporting copies and the total copy number is 5, is 4 and is 5.

The Phred quality score was then shifted by 33 and converted to characters based on the ASCII value. The quality scores range from 0 to 60 and the corresponding ASCII values range from 33 to 93.

Unit sequences

TideHunter can output the unit sequences without performing the consensus calling step when option -u/--unit-seq is enabled. Then, only the following information will be output for the tabular format:

No. Column name Explanation
1 readName the original read name
2 repN N is the ID number of the tandem repeat, within each read, starts from 0
3 subX X is the ID number of the unit sequence, starts from 0
4 unitSeq unit sequence

And for the FASTA format:

>readName_repN_subX
unitSeq X
>readName_repN_subY
unitSeq Y

Contact

Yan Gao gaoy1@chop.edu

Yi Xing XINGYI@chop.edu

github issues