This directory contains scripts related to the manuscript "Expanded catalogue of microbial genes and metagenome-assembled genomes from the pig gut microbiome".
Before running the scripts, make sure that all required software and databases are installed.
Create two directories, "bin" and "Database", in your home directory.
Install each tool as described in its own manual. The name, version and availability of the software are as follows:
Note: Make every required command available in the "~/bin" directory or through the system environment variables (for example, PATH). The versions listed are those used in the paper; other versions may also work. Some tools are bundled with other software and do not need to be installed separately. For example, bwa, bowtie2, Samtools and MEGAHIT are included in metaWRAP.
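One way to confirm that the commands are reachable before starting is a small check like the sketch below. It is not part of the published pipeline. The tool list is a hypothetical example built from the executables named in the note above, and `find_missing_tools` is an illustrative helper name; extend the list to match your own installation.

```python
import os
import shutil

# Hypothetical example list; extend it to cover the software you installed.
REQUIRED_TOOLS = ["bwa", "bowtie2", "samtools", "megahit"]


def find_missing_tools(tools, extra_dir="~/bin"):
    """Return the tools found neither in extra_dir nor on the system PATH."""
    search_path = os.pathsep.join(
        [os.path.expanduser(extra_dir), os.environ.get("PATH", "")]
    )
    return [t for t in tools if shutil.which(t, path=search_path) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

Any name printed as missing needs to be installed, or linked into "~/bin", before the pipeline scripts are run.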
All databases are stored in the "~/Database" directory.
The name, description and availability of each database are as follows:
Database | Version/release date | Description | Availability |
---|---|---|---|
Pig (Sscrofa11.1) | Sscrofa11.1 | Pig reference genome | http://asia.ensembl.org/Sus_scrofa/Info/Index |
UniProt TrEMBL | version 2020_03 | Protein database | https://www.uniprot.org/downloads |
KEGG | 2019/12/20 | KEGG annotation | http://kobas.cbi.pku.edu.cn/kobas3/download/ |
dbCAN | HMMdb-V8 | CAZymes annotation | http://bcb.unl.edu/dbCAN2/download/ |
EggNOG | EggNOG5.0 | EggNOG annotation | http://eggnog5.embl.de/#/app/downloads |
CARD | v3.1.0 | Antibiotic resistance gene annotation | https://github.com/arpcard/rgi#install-dependencies |
VFDB | Fri Sep 4 10:06:01 2020 | Virulence factors annotation | http://www.mgc.ac.cn/VFs/download.htm |
GTDB-Tk | release89 | Taxonomic assignment of MAGs | https://gtdb.ecogenomic.org/downloads |
PhyloPhlAn | 2013 | Phylogenetic analysis of MAGs | https://github.com/biobakery/phylophlan/wiki |
Note: The versions listed are those used in the paper; most of the databases are updated continuously.
The scripts for metagenomic analysis are in the "Pipeline" directory. The pipeline has two main modules: construction of the gene catalog and reconstruction of metagenome-assembled genomes. The steps before assembly are the same for both modules.
Metagenomic data pre-processing: read trimming and host (pig) read removal, generating high-quality sequences.
Metagenomic assembly: Assemble short reads into long contigs.
This module contains four scripts, covering gene prediction, taxonomy annotation, function annotation and abundance estimation.
This part covers gene prediction, filtering of incomplete genes, integration of the gene catalog and gene dereplication.
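The filtering of incomplete genes can be illustrated with the sketch below. It is not the pipeline's own script. It assumes the predicted genes carry Prodigal-style FASTA headers, in which `partial=00` marks a gene with both ends intact; the gene predictor is not named here, so adapt `is_complete` to the header format of the tool you use. The function names are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_complete(header):
    """Assumption: Prodigal-style header, where partial=00 means a complete gene."""
    return "partial=00" in header


def keep_complete_genes(lines):
    """Return only the (header, sequence) records flagged as complete."""
    return [(h, s) for h, s in read_fasta(lines) if is_complete(h)]
```

Passing an open FASTA file handle to `keep_complete_genes` returns the complete genes, which can then be merged across samples and dereplicated.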
The protein sequences of the genes were aligned to UniProt TrEMBL, and the taxonomic classification was determined with the lowest common ancestor (LCA) algorithm.
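The idea behind the lowest common ancestor assignment can be sketched as follows. This is a simplified illustration, not the script used in the pipeline. It assumes each alignment hit has already been converted to a lineage listed from the highest rank to the lowest, and it ignores hit filtering and score thresholds.

```python
def lowest_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages."""
    if not lineages:
        return []
    lca = []
    # zip stops at the shortest lineage, so unresolved lower ranks are dropped.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca.append(ranks[0])
    return lca


# Two hits that agree down to the phylum level only.
hits = [
    ["Bacteria", "Firmicutes", "Clostridia"],
    ["Bacteria", "Firmicutes", "Bacilli"],
]
print(lowest_common_ancestor(hits))  # ['Bacteria', 'Firmicutes']
```

A gene whose hits disagree at the highest rank gets an empty assignment, which corresponds to an unclassified gene.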
KEGG Orthology and pathway, CAZyme family, EggNOG Orthology, antibiotic resistance gene and virulence factor annotations were obtained by aligning the protein sequences to the KEGG, dbCAN, EggNOG, CARD and VFDB databases.
Gene abundance was calculated by aligning the clean reads of each sample to the gene catalog to obtain the number of mapped reads per gene, which was then normalized to fragments per kilobase per million mapped reads (FPKM). The abundance of each functional item was calculated with R scripts by summing the abundances of all genes assigned to that item.
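The normalization and the per-function summation can be sketched as below. The pipeline performs the summation with R scripts; Python is used here only for illustration. The sketch assumes the common definition

$$\mathrm{FPKM}_g = \frac{c_g \times 10^9}{L_g \times N}$$

where $c_g$ is the number of reads mapped to gene $g$, $L_g$ is the gene length in base pairs, and $N$ is the total number of mapped reads in the sample. Taking $N$ as the sum of the per-gene counts is an assumption of this sketch, and the function names are illustrative.

```python
from collections import defaultdict


def fpkm(counts, lengths):
    """Normalize per-gene read counts to FPKM.

    counts: gene -> mapped read count; lengths: gene -> length in bp.
    Assumption: the sample total is the sum of the per-gene counts.
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {g: c * 1e9 / (lengths[g] * total) for g, c in counts.items()}


def function_abundance(gene_abundance, gene_to_functions):
    """Sum gene abundances over every function each gene is assigned to."""
    totals = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        for func in gene_to_functions.get(gene, ()):
            totals[func] += abundance
    return dict(totals)
```

For example, with counts `{"g1": 10, "g2": 30}` and lengths `{"g1": 1000, "g2": 3000}`, both genes have the same read density and therefore the same FPKM. A gene annotated with several functions contributes its full abundance to each of them.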
This part contains the steps for reconstructing metagenome-assembled genomes (MAGs). Binning, refinement, reassembly, genome annotation and abundance estimation of MAGs were performed with the modules of the metaWRAP pipeline. dRep was used to dereplicate the MAGs. Taxonomic classification and phylogenetic analysis were performed with GTDB-Tk and PhyloPhlAn 3.0, respectively.
Some processing steps in the pipeline, as well as the statistical analysis and visualization, were scripted in R, Shell, Perl or Python. These scripts are in the "Scripts" directory. All input data for statistical analysis and visualization are in the "Pre-processed_Files" directory.