Cayman (Carbohydrate active enzymes profiling of metagenomes) is a command-line profiling tool for profiling CAZyme abundances in metagenomic datasets. It takes as input (preferably) cleaned -- quality-filtered and host-filtered -- metagenomic shotgun reads and produces a matrix of CAZyme
Reads-Per-Kilobase-Million (RPKM) abundances for your sample. Cayman makes heavy use of the functional profiling library gqlib
.
bwa
The following python libraries need to be installed
pyhmmer (for protein set annotation)
You will need a bwa
installation. One way -- if you didn't install cayman
via bioconda or if you're not using a container -- would be to use conda env create -f environment.yml
using the provided environment.yml.
Cayman uses prevalence-filtered reference data sets from the Global Microbial Gene Catalog (GMGC). We annotated these datasets with our dedicated CAZyme annotation method (cf. Ducarmon & Karcher et al.). The filtered GMGC datasets and their CAZyme annotations can be downloaded from Zenodo.
[TABLE]
Prior to your first profiling run, you will have to build a bwa index from the respective GMGC reference dataset.
bwa index -p <index_name> [-b blocksize] /path/to/dataset
If you have enough memory available, setting -b
to a higher value than the default (10,000,000
), e.g. 100,000,000
, may speed up the index generation.
Cayman can most easily be installed via
bwa
installation)docker://ghcr.io/zellerlab/cayman:latest
)git clone https://github.com/zellerlab/cayman && cd cayman && pip install .
(also requires a bwa
installation)Cayman can be run from the command line as follows:
Attention: As of version 0.10.0, cayman profiling is invoked with cayman profile
instead of cayman
.
cayman profile \
<input_options> \
</path/to/db> \
</path/to/bwa_index> \
[--out_prefix <prefix>] \
[--min_identity <float>] \
[--min_seqlen <int>] \
[--cpus_for_alignment <int>]
<input_options>
Read files need to be in fastq format (best with fastq
or fq
file ending) and can be gzip compressed.
The <input_options>
parameters depend on the library layout of your samples:
--reads1 </path/to/reads1> --reads2 </path/to/reads2>
. Each read will be counted as 0.5
.--singles </path/to/reads>
. Each read will be counted as 1
.--orphans </path/to/orphans>
. Each read will be counted as 0.5
.Samples comprising multiple fastq files (e.g. from multiple lanes) can be provided as space-separated lists. In the case of paired-end reads, ensure that the order of the files matches (e.g. --reads1 sampleX_lane1_R1.fq sampleX_lane2_R1.fq --reads2 sampleX_lane1_R2.fq sampleX_lane2_R2.fq
)!
The choice of assigning an unpaired read set to be "true" single-end reads or orphan reads influences the read count distribution.
2 x 0.5 = 1
(as both reads of a pair are derived from the same sequenced nucleic acid fragment.)1 x 0.5 = 0.5
.1
.--annotation_db
is the path to a 4-column text file containing the reference domain annotation. (using the bed4 format: contig,start,end,domain-type). This contains all the CAZy domain annotations for all ORFs in our gene catalog.
--bwa_index
refers to the path of the gene catalog bwa index.
--out_prefix
is a string that will be prepended to the output files (default: "cayman"
). If you want to store the output in a specific folder, then provide a path such as "/path/to/folder/some_prefix"
. Without "some_prefix"
, the output files will be hidden as they start with a .
.
--min_identity
is the minimum sequence identity level (default: 0.97) for an alignment of your read to a CAZyme domain to be included.
--min_seqlen
is the minimum alignment length (actually aligned bases without soft/hard-clipping) to be included (default: 45[bp]).
--cpus_for_alignment
the number of cpus to use for alignment (default: 1).
--db_separator
allows you to specify your own separator/delimiter in case you want to use e.g. a csv-formatted database. The bed4 restrictions such as 0-based start and 1-based end coordinates still unless you use --db_coordinates hmmer
.
--db_coordinates
one of bed
(default) or hmmer
. This allows you to provide 1-based, closed interval coordinates (hmmer
) or 0,1-based, half-open interval coordinates (bed
) in your database file.
<out_prefix>.cazy.txt
contains the CAZy profile of the samplefeature uniq_raw uniq_rpkm combined_raw combined_rpkm
total_reads 2498819.00000 2498819.00000 2498819.00000 2498819.00000
filtered_reads 2241860.00000 2241860.00000 2241860.00000 2241860.00000
AA1 7.00000 16.09944 8.00000 18.91073
AA10 0.50000 3.32879 0.50000 3.32879
AA6 7.50000 30.03446 8.50000 33.29036
The first line is the header, followed by the counts of the total aligned reads and filtered reads.
The following lines contain the counts for each CAZy family present in the sample: family name (feature
), unique counts, unique counts rpkm-normalised, unique counts + ambiguous counts, unique counts + ambiguous counts rpkm-normalised.
<out_prefix>.gene_counts.txt
contains the gene profiles of the sample. The format is identical to the CAZy profiles, featuring are the detected genes from the respective gene catalogue.
<out_prefix>.aln_stats.txt
contains statistics on the alignments in the sample.
The default hmm_database
can be obtained from Zenodo
cayman annotate_proteome \
</path/to/cayman/hmm_database> \
</path/to/input/proteins> \
[ -o/--output_file </path/to/output_file>, default: cayman_annotation.csv ] \
[ -t/--threads <int> ] \
[ --cutoffs <path/to/cutoff_values>, default: </path/to/cayman/hmm_database/cutoffs.csv>]