zstephens / telogator2

A method for measuring allele-specific telomere length (TL) and characterizing telomere variant repeat (TVR) sequences from long reads.
MIT License

Telogator2

If this software has been useful for your work, please cite us at:

Stephens, Z., & Kocher, J. P. (2024). Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation. BMC Bioinformatics, 25(1), 194.

https://link.springer.com/article/10.1186/s12859-024-05807-5

Dependencies:

Telogator2 dependencies can be easily installed via conda:

# create conda environment
conda env create -f conda_env_telogator2.yaml

# activate environment
conda activate telogator2

Running Telogator2:

python telogator2.py -i input.fq \
                     -o results/ \
                     --minimap2 /path/to/minimap2

-i accepts fa, fa.gz, fq, fq.gz, or bam files (multiple inputs can be provided, e.g. -i reads1.fa reads2.fa). For Revio reads sequenced with SMRT Link v13 or later, we advise including both the "hifi" BAM and the "fail" BAM as input to Telogator2.

An aligner executable must be specified via one of --minimap2, --winnowmap, or --pbmm2.
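For scripted runs, the invocation above can be assembled programmatically. The following is a minimal sketch, not part of Telogator2 itself: the helper name `build_telogator2_command` is hypothetical, and it uses only the flags described in this README (-i, -o, and the three aligner options).

```python
import shlex

# Aligner flags named in this README; one of them must be supplied.
ALIGNER_FLAGS = ("--minimap2", "--winnowmap", "--pbmm2")


def build_telogator2_command(inputs, out_dir, aligner_flag, aligner_path,
                             script="telogator2.py"):
    """Return the argument list for a Telogator2 run (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input file is required")
    if aligner_flag not in ALIGNER_FLAGS:
        raise ValueError(f"aligner flag must be one of {ALIGNER_FLAGS}")
    cmd = ["python", script, "-i"]
    cmd += [str(path) for path in inputs]          # -i accepts several files
    cmd += ["-o", str(out_dir), aligner_flag, str(aligner_path)]
    return cmd


if __name__ == "__main__":
    cmd = build_telogator2_command(
        ["reads1.fa", "reads2.fa"], "results/",
        "--minimap2", "/path/to/minimap2",
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when input paths contain spaces.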

Recommended settings

Sequencing platforms differ in their error profiles, so we recommend running Telogator2 with different options depending on the platform used:

PacBio Revio HiFi (30x): -r hifi -n 4
PacBio Sequel II (10x):  -r hifi -n 3
Nanopore R10 (30x):      -r ont -n 4

For Nanopore reads generated using telomere enrichment methods, such as those described by Karimian et al., we recommend using -r ont -n 5 -tt 0.100 --collapse-hom 1000.
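When several datasets are processed in a pipeline, the recommendations above can be kept in one lookup table. This is only a sketch: the dictionary keys (`revio_hifi`, `sequel2_hifi`, `ont_r10`, `ont_telomere_enriched`) and the function name are hypothetical labels chosen for illustration, while the option values are copied from the recommendations in this section.

```python
# Recommended Telogator2 options per platform, as listed in this README.
# The dictionary keys are hypothetical labels, not Telogator2 identifiers.
RECOMMENDED_OPTIONS = {
    "revio_hifi": ["-r", "hifi", "-n", "4"],
    "sequel2_hifi": ["-r", "hifi", "-n", "3"],
    "ont_r10": ["-r", "ont", "-n", "4"],
    "ont_telomere_enriched": [
        "-r", "ont", "-n", "5", "-tt", "0.100", "--collapse-hom", "1000",
    ],
}


def recommended_options(platform):
    """Return a copy of the recommended option list for a platform label."""
    try:
        return list(RECOMMENDED_OPTIONS[platform])
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_OPTIONS))
        raise ValueError(f"unknown platform {platform!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(" ".join(recommended_options("ont_telomere_enriched")))
```

The returned list can be appended to the base command (python telogator2.py -i ... -o ... plus an aligner option).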

Telogator2 may be unable to analyze older Nanopore data, as reads basecalled with Guppy have prohibitively high sequencing error rates in telomere regions.

Test data

Telomere reads for HG002 can be found in the test_data/ directory.

HiFi reads (~70x): hg002-telreads_pacbio.fa.gz
ONT reads  (~25x): hg002-telreads_ont.fa.gz

These are full-sized datasets and may take a while to run. A smaller input dataset (e.g. for checking that Telogator2 runs successfully) is also provided: test_data/test.fa.gz.
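Before starting a long run, it can help to confirm that an input file is readable and to see how many reads it holds. The sketch below counts FASTA records in a plain or gzipped file using only the Python standard library; the function name `count_fasta_records` is hypothetical and the check is independent of Telogator2.

```python
import gzip


def count_fasta_records(path):
    """Count FASTA records (lines starting with '>') in a .fa or .fa.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Path taken from the test data described above.
    print(count_fasta_records("test_data/test.fa.gz"))
```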

Output files

The primary output files are:

The main results are in tlens_by_allele.tsv, which has the following columns:
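Because the results file is tab-separated, its columns can be inspected without hard-coding their names. The sketch below assumes that the first line of the file is a header row and that the file is written directly into the output directory given with -o; both are assumptions to verify against your own output. The function name `read_tsv` is hypothetical.

```python
import csv


def read_tsv(path):
    """Return (header, rows) from a tab-separated file with a header line."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])             # empty list if the file is empty
        rows = [row for row in reader if row]  # skip blank lines
    return header, rows


if __name__ == "__main__":
    # Assumed location: the -o directory from the example run above.
    header, rows = read_tsv("results/tlens_by_allele.tsv")
    print("columns:", header)
    print("alleles reported:", len(rows))
```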

Telogator reference

The reference sequence used for telomere anchoring currently contains the first and last 500 kb of sequence from each chromosome of the following T2T assemblies:

More will be added as they become available.
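To illustrate what "first and last 500 kb" means for a single assembled sequence, the sketch below slices both ends of a string. It is an illustration only and does not claim to reproduce how the Telogator reference was actually built; the function name `terminal_windows` is hypothetical.

```python
def terminal_windows(sequence, window=500_000):
    """Return the first and last `window` bases of a sequence.

    If the sequence is shorter than `window`, both values are the full sequence.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of bases")
    return sequence[:window], sequence[-window:]


if __name__ == "__main__":
    # Toy example with a 3-base window instead of 500 kb.
    start, end = terminal_windows("ACGTACGTAC", window=3)
    print(start, end)
```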