zstephens / telogator2

A method for measuring allele-specific TL and characterizing telomere variant repeat (TVR) sequences from long reads.
MIT License
11 stars 1 forks source link

Alignment error #3

Closed 23Niklas closed 2 months ago

23Niklas commented 2 months ago

Hi, I am trying the telogator2 tutorial with the test data and am encountering an error in the alignment step. The error I am encountering is Error: alignment command returned an error: 1 and results from the error in the subtel_aln.log file >|> 20240618 15:03:19.626 -|- WARN -|- CheckPositionalArgs -|- 0x7eff948f3780|| -|- Input is FASTA. Output BAM file cannot be used for polishing with GenomicConsensus! >|> 20240618 15:03:19.626 -|- FATAL -|- CheckPositionalArgs -|- 0x7eff948f3780|| -|- pbmm2 align ERROR: Could not determine read input type(s). Please do not mix data types, such as BAM+FASTQ. File of files may only contain BAMs or datasets.

The command I'm running is `#!/bin/bash

SBATCH -J telogator2

SBATCH -c 8

SBATCH -t 12:00:00

SBATCH -p short

SBATCH --mem=10G

SBATCH -o telogator2_%j.out

python telogator2.py \ -i test_data/hg002-telreads_pacbio.fa.gz \ -o results/ \ -p 8 \ --muscle /home/miniconda3/envs/telogator2/bin/muscle \ --pbmm2 /home/miniconda3/envs/telogator2/bin/pbmm2 \ -r hifi -tt 0.400 -ts 0.250 -n 4`

And I can also reproduce the same pbmm2 align ERROR form above using pbmm2 align resources/telogator-ref.fa.gz results/temp/unanchored_subtels.fa.gz results/temp/subtel_aln.bam --preset HiFi --sort

I tried version 1.14.99 and 1.13.1 of pbmm2 and below is my full telogator2 env (I generated it from the respective yaml file). Thanks in advance for your help!

# Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge alsa-lib 1.2.12 h4ab18f5_0 conda-forge archspec 0.2.3 pyhd8ed1ab_0 conda-forge attr 2.5.1 h166bdaf_1 conda-forge biopython 1.81 py39hd1e30aa_1 conda-forge boltons 24.0.0 pyhd8ed1ab_0 conda-forge brotli 1.1.0 hd590300_1 conda-forge brotli-bin 1.1.0 hd590300_1 conda-forge brotli-python 1.1.0 py39h3d6467e_1 conda-forge bzip2 1.0.8 hd590300_5 conda-forge c-ares 1.28.1 hd590300_0 conda-forge ca-certificates 2024.6.2 hbcca054_0 conda-forge cairo 1.18.0 h3faef2a_0 conda-forge certifi 2024.6.2 pyhd8ed1ab_0 conda-forge cffi 1.16.0 py39h7a31438_0 conda-forge charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge colorama 0.4.6 pyhd8ed1ab_0 conda-forge conda 24.5.0 py39hf3d152e_0 conda-forge conda-libmamba-solver 24.1.0 pyhd8ed1ab_0 conda-forge conda-package-handling 2.3.0 pyh7900ff3_0 conda-forge conda-package-streaming 0.10.0 pyhd8ed1ab_0 conda-forge contourpy 1.2.1 py39h7633fee_0 conda-forge cycler 0.12.1 pyhd8ed1ab_0 conda-forge dbus 1.13.6 h5008d03_3 conda-forge distro 1.9.0 pyhd8ed1ab_0 conda-forge expat 2.6.2 h59595ed_0 conda-forge fmt 10.2.1 h00ab1b0_0 conda-forge font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge font-ttf-inconsolata 3.000 h77eed37_0 conda-forge font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge font-ttf-ubuntu 0.83 h77eed37_2 conda-forge fontconfig 2.14.2 h14ed4e7_0 conda-forge fonts-conda-ecosystem 1 0 conda-forge fonts-conda-forge 1 0 conda-forge fonttools 4.53.0 py39hd3abc70_0 conda-forge freetype 2.12.1 h267a509_2 conda-forge frozendict 2.4.4 py39hd3abc70_0 conda-forge gettext 0.22.5 h59595ed_2 conda-forge gettext-tools 0.22.5 h59595ed_2 conda-forge glib 2.80.2 hf974151_0 conda-forge glib-tools 2.80.2 hb6ce0ca_0 conda-forge graphite2 1.3.13 h59595ed_1003 conda-forge gst-plugins-base 1.24.4 h9ad1361_0 conda-forge gstreamer 1.24.4 haf2f30d_0 conda-forge harfbuzz 8.5.0 hfac3d4d_0 conda-forge icu 73.2 h59595ed_0 conda-forge idna 3.7 pyhd8ed1ab_0 conda-forge importlib-resources 6.4.0 pyhd8ed1ab_0 conda-forge importlib_resources 6.4.0 pyhd8ed1ab_0 conda-forge jsonpatch 1.33 pyhd8ed1ab_0 conda-forge jsonpointer 2.4 py39hf3d152e_3 conda-forge k8 0.2.5 hdcf5f25_4 bioconda keyutils 1.6.1 h166bdaf_0 conda-forge kiwisolver 1.4.5 py39h7633fee_1 conda-forge krb5 1.21.2 h659d440_0 conda-forge lame 3.100 h166bdaf_1003 conda-forge lcms2 2.16 hb7c19ff_0 conda-forge ld_impl_linux-64 2.40 hf3520f5_3 conda-forge lerc 4.0.0 h27087fc_0 conda-forge libarchive 3.7.4 hfca40fe_0 conda-forge libasprintf 0.22.5 h661eb56_2 conda-forge libasprintf-devel 0.22.5 h661eb56_2 conda-forge libblas 3.9.0 22_linux64_openblas conda-forge libbrotlicommon 1.1.0 hd590300_1 conda-forge libbrotlidec 1.1.0 hd590300_1 conda-forge libbrotlienc 1.1.0 hd590300_1 conda-forge libcap 2.69 h0f662aa_0 conda-forge libcblas 3.9.0 22_linux64_openblas conda-forge libclang-cpp15 15.0.7 default_h127d8a8_5 conda-forge libclang13 18.1.7 default_h087397f_0 conda-forge libcups 2.3.3 h4637d8d_4 conda-forge libcurl 8.8.0 hca28451_0 conda-forge libdeflate 1.20 hd590300_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 hd590300_2 conda-forge libevent 2.1.12 hf998b51_1 conda-forge libexpat 2.6.2 h59595ed_0 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libflac 1.4.3 h59595ed_0 conda-forge libgcc-ng 13.2.0 h77fa898_7 conda-forge libgcrypt 1.10.3 hd590300_0 conda-forge libgettextpo 0.22.5 h59595ed_2 conda-forge libgettextpo-devel 0.22.5 h59595ed_2 conda-forge libgfortran-ng 13.2.0 h69a702a_7 conda-forge libgfortran5 13.2.0 hca663fb_7 conda-forge libglib 2.80.2 hf974151_0 conda-forge libgomp 13.2.0 h77fa898_7 conda-forge libgpg-error 1.49 h4f305b6_0 conda-forge libiconv 1.17 hd590300_2 conda-forge libjpeg-turbo 3.0.0 hd590300_1 conda-forge liblapack 3.9.0 22_linux64_openblas conda-forge libllvm15 15.0.7 hb3ce162_4 conda-forge libllvm18 18.1.7 hb77312f_0 conda-forge libmamba 1.5.8 had39da4_0 conda-forge libmambapy 1.5.8 py39h10defb6_0 conda-forge libnghttp2 1.58.0 h47da74e_1 conda-forge libnsl 2.0.1 hd590300_0 conda-forge libogg 1.3.4 h7f98852_1 conda-forge libopenblas 0.3.27 pthreads_h413a1c8_0 conda-forge libopus 1.3.1 h7f98852_1 conda-forge libpng 1.6.43 h2797004_0 conda-forge libpq 16.3 ha72fbe1_0 conda-forge libsndfile 1.2.2 hc60ed4a_1 conda-forge libsolv 0.7.29 ha6fb4c9_0 conda-forge libsqlite 3.46.0 hde9e2c9_0 conda-forge libssh2 1.11.0 h0841786_0 conda-forge libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge libsystemd0 255 h3516f8a_1 conda-forge libtiff 4.6.0 h1dd3fc0_3 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libvorbis 1.3.7 h9c3ff4c_0 conda-forge libwebp-base 1.4.0 hd590300_0 conda-forge libxcb 1.15 h0b41bf4_0 conda-forge libxcrypt 4.4.36 hd590300_1 conda-forge libxkbcommon 1.7.0 h662e7e4_0 conda-forge libxml2 2.12.7 hc051c1a_1 conda-forge libzlib 1.2.13 h4ab18f5_6 conda-forge lz4-c 1.9.4 hcb278e6_0 conda-forge lzo 2.10 hd590300_1001 conda-forge mamba 1.5.8 py39hc5d2bb1_0 conda-forge matplotlib 3.8.4 py39hf3d152e_2 conda-forge matplotlib-base 3.8.4 py39h10d1fc8_2 conda-forge menuinst 2.1.1 py39hf3d152e_0 conda-forge minimap2 2.28 he4a0461_1 bioconda mpg123 1.32.6 h59595ed_0 conda-forge munkres 1.0.7 py_1 bioconda muscle 3.8.1551 h7d875b9_6 bioconda mysql-common 8.3.0 hf1915f5_4 conda-forge mysql-libs 8.3.0 hca2cd23_4 conda-forge ncurses 6.5 h59595ed_0 conda-forge nspr 4.35 h27087fc_0 conda-forge nss 3.100 hca3bf56_0 conda-forge numpy 1.26.4 py39h474f0d3_0 conda-forge openjpeg 2.5.2 h488ebb8_0 conda-forge openssl 3.3.1 h4ab18f5_0 conda-forge packaging 24.1 pyhd8ed1ab_0 conda-forge pbmm2 1.14.99 h9ee0642_0 bioconda pcre2 10.43 hcad00b1_0 conda-forge pillow 10.3.0 py39h90c7501_0 conda-forge pip 24.0 pyhd8ed1ab_0 conda-forge pixman 0.43.2 h59595ed_0 conda-forge platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge pluggy 1.5.0 pyhd8ed1ab_0 conda-forge ply 3.11 pyhd8ed1ab_2 conda-forge pthread-stubs 0.4 h36c2ea0_1001 conda-forge pulseaudio-client 17.0 hb77b528_0 conda-forge pybind11-abi 4 hd8ed1ab_3 conda-forge pycosat 0.6.6 py39hd1e30aa_0 conda-forge pycparser 2.22 pyhd8ed1ab_0 conda-forge pyparsing 3.1.2 pyhd8ed1ab_0 conda-forge pyqt 5.15.9 py39h52134e7_5 conda-forge pyqt5-sip 12.12.2 py39h3d6467e_5 conda-forge pysam 0.22.1 py39h61809e1_1 bioconda pysocks 1.7.1 pyha2e5f31_6 conda-forge python 3.9.19 h0755675_0_cpython conda-forge python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge python_abi 3.9 4_cp39 conda-forge pywavelets 1.4.1 py39h44dd56e_1 conda-forge qt-main 5.15.8 hc9dc06e_21 conda-forge readline 8.2 h8228510_1 conda-forge regex 2024.5.15 py39hd3abc70_0 conda-forge reproc 14.2.4.post0 hd590300_1 conda-forge reproc-cpp 14.2.4.post0 h59595ed_1 conda-forge requests 2.32.3 pyhd8ed1ab_0 conda-forge ruamel.yaml 0.18.6 py39hd1e30aa_0 conda-forge ruamel.yaml.clib 0.2.8 py39hd1e30aa_0 conda-forge scipy 1.13.1 py39haf93ffa_0 conda-forge setuptools 70.0.0 pyhd8ed1ab_0 conda-forge sip 6.7.12 py39h3d6467e_0 conda-forge six 1.16.0 pyh6c4a22f_0 conda-forge tk 8.6.13 noxft_h4845f30_101 conda-forge toml 0.10.2 pyhd8ed1ab_0 conda-forge tomli 2.0.1 pyhd8ed1ab_0 conda-forge tornado 6.4.1 py39hd3abc70_0 conda-forge tqdm 4.66.4 pyhd8ed1ab_0 conda-forge tzdata 2024a h0c530f3_0 conda-forge unicodedata2 15.1.0 py39hd1e30aa_0 conda-forge urllib3 2.2.1 pyhd8ed1ab_0 conda-forge wheel 0.43.0 pyhd8ed1ab_1 conda-forge xcb-util 0.4.0 hd590300_1 conda-forge xcb-util-image 0.4.0 h8ee46fc_1 conda-forge xcb-util-keysyms 0.4.0 h8ee46fc_1 conda-forge xcb-util-renderutil 0.3.9 hd590300_1 conda-forge xcb-util-wm 0.4.1 h8ee46fc_1 conda-forge xkeyboard-config 2.42 h4ab18f5_0 conda-forge xorg-kbproto 1.0.7 h7f98852_1002 conda-forge xorg-libice 1.1.1 hd590300_0 conda-forge xorg-libsm 1.2.4 h7391055_0 conda-forge xorg-libx11 1.8.9 h8ee46fc_0 conda-forge xorg-libxau 1.0.11 hd590300_0 conda-forge xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge xorg-libxext 1.3.4 h0b41bf4_2 conda-forge xorg-libxrender 0.9.11 hd590300_0 conda-forge xorg-renderproto 0.11.1 h7f98852_1002 conda-forge xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge xorg-xf86vidmodeproto 2.3.1 h7f98852_1002 conda-forge xorg-xproto 7.0.31 h7f98852_1007 conda-forge xz 5.2.6 h166bdaf_0 conda-forge yaml-cpp 0.8.0 h59595ed_0 conda-forge zipp 3.19.2 pyhd8ed1ab_0 conda-forge zlib 1.2.13 h4ab18f5_6 conda-forge zstandard 0.19.0 py39hb9d737c_0 conda-forge zstd 1.5.6 ha6fb4c9_0 conda-forge

zstephens commented 2 months ago

Thanks for the detailed report!

It looks like pbmm2 gets upset if you give it a gzipped reference fasta. I made so that if pbmm2 is the chosen aligner that it now aligns to a decompressed copy of the reference instead. Could you pull the latest commit (2156e1185bae7218fa2683710eb684365afa8a42) and try running it again?

23Niklas commented 2 months ago

Thanks for your quick reply! Indeed just gunzipping the ref file resolved the alignment error. I will re-run telogator and let you know if it completes successfully this time. Speaking of, is it expected that telogator 2 takes ~8 hours for the test files, since they're quit small?

zstephens commented 2 months ago

The test files in the repository may look small because I removed everything aside from the telomere reads, but they're actually full-sized datasets (with the PacBio data actually being combined from 3 separate runs combined). I've thought it might be useful to add a smaller set that users could run just to make check that the software works, but yeah, that kind of runtime is generally expected.

I'm typically running Telogator2 in an HPC environment with 16 processes (-p 16, the default is 4), which for ~30x WGS data takes about 6hrs wall time and ~12GB of memory. There are a few options for speeding runs up that I used during development, such as -d for downsampling reads, --fast-aln for using a faster pairwise alignment algorithm, and --init-filt which can be used to immediately prune interstitial telomere reads earlier on in the workflow (--init-filt 500 200 is a fairly safe starting point that should reduce runtime by ~50%). Though these are all somewhat-experimental features that I haven't throughly tested or documented.

23Niklas commented 2 months ago

This time it worked flawlessly, thank you! One last question, I tried out my own reference file with -t (CHM13_v2) and encountered the error below aligning subtels to reference... (132 sec) getting anchors from alignment...Traceback (most recent call last): File "/n/data1/hms/dbmi/park/niklas/analysis/telomere_mosaicism/config/envs/telogator2/telogator2.py", line 1089, in <module> main() File "/n/data1/hms/dbmi/park/niklas/analysis/telomere_mosaicism/config/envs/telogator2/telogator2.py", line 948, in main my_chr = n.split('_')[1] IndexError: list index out of range Do I need to adjust other reference files in a certain way to work for telogator?

zstephens commented 2 months ago

Did you have a specific need for trying to anchor reads to contigs other than those in the default reference? In most cases you shouldn't need to include the -t input option, as the default telogator-ref.fa contains all the subtelomeres of T2T_v2, for example.

The specific reason for the error is that the code is expecting that contigs are named like:

>referencebuild_chr#arm

e.g. >t2t-chm13_chr1p

23Niklas commented 2 months ago

I see, that explains the error. Thanks a lot for your help!

I was curious to see if this would work with other T2T references as well.