yongzhiyang2012 / WGDdetector

21 stars 10 forks source link

WGDdetector

WGDdetector: a pipeline for whole genome duplication (WGD) detecting with the genome or transcriptome annotations. Our paper has been published at BMC Bioinformatics:

"Yang Y, Li Y, Chen Q, Sun Y, Lu Z: WGDdetector: a pipeline for detecting whole genome duplication events using the genome or transcriptome annotations. BMC Bioinformatics 2019, 20(1):75."

We have updated the pipeline and using some more fast software or method to reduce the time wasting. The new pipeline also including 4 steps:

01: gene family clustering.

02: large gene family spliting and Ks estimating

03: Ks hclustering

04: outputing

Install

Software requirements (all path of the needed software should be list in this file WGDdetector_path/config/software.config)

Pipline install

This pipline is a series perl scripts, you can just replace those scripts' header with the right perl path.

cd WGDdetector_path
perl setup.pl

Install test

cd example
sh 00.run.sh

Results

The dS dsitribution results of different software and datasets descripted in the WGDdetector parper (https://github.com/yongzhiyang2012/WGDdetector_paper_results).

Running

wgddetector (v1.00)

Usage: wgddetector --input_cds <cds_file> --input_pep <pep_file> --output_dir <output_dir> --tmp_dir <tmp_dir> --thread_num <thread_num> --cluster_engine <blastp|mmseqs2>

Options:
        --input_cds      input CDS in fasta format
        --input_pep      input protein in fasta format
        --output_dir     the output dir, which containing the main results
        --tmp_dir        the tmp dir
        --thread_num     the thread number when runnig this script
        --cluster_engine "blastp" or "mmseqs2". when setted this as mmseqs2, the protein aligning and clustering will be faster than blast
        --clean          "yes" or "no". If selected yes, the tmp_dir will removed after finish. Default: yes
        --run_cluster    "yes" or "no". If selected yes, the protein blast and mcl clustering will generate. If selected no, the clustering file "output_dir/01.cluster/all-protein-clustering.result" must be existed

  ** The sequence IDs within the CDS and protein files must be same!