Download code from https://github.com/xa6xa6/metaOthello/
There are two code folders:
Building indexing is very time-consuming (costs about 6 hours to build index for NCBI/refseq bacterial genome database). Therefore, we provide ready-built indexes (NCBI/refseq bacterial genome database) for users to download:
If you want to build an index with your own reference sequences, follow these steps.
Jellyfish is used to prepare k-mer files for each reference sequence. Download jellyfish from: http://www.cbcb.umd.edu/software/jellyfish/
Produce a k-mer count file for your reference seqeuences. Command:
jellyfish count \
–o <path_to_bacterial_rawKmerCountFile> \
-m <Kmer_length> \
-t <threads_num> \
-s <bf_size> \
-C <path_to_bacterial_referenceSeqFastaFile>
Dump k-mers to human-readable format. Command:
jellyfish dump \
–t –c \
–o <path_to_bacterial_readableKmerCountFile> \
<path_to_bacterial_rawKmerCountFile>
Generate taxonomy info file.
Put all readable k-mer count files into the same directory
<path_to_bacterial_reference_seq_Kmer_file_dir>
and rename them as 1.Kmer
, 2.Kmer
, ...
, m.Kmer
,
and generate a taxonomy info file like:
https://drive.google.com/open?id=0BxgO-FKbbXRIZlV3ZzBBdlFpMTQ
There are three columns for each taxonomic rank in the file:
the 1st column is a reissued id from 0
to m-1
,
where m
is the total taxon num in that taxonomic rank.
The 2nd column lists taxon ids, and the 3rd column lists taxon scientific names.
Each row represents a species and its associated taxonomy info.
Run make build
under the directory build
.
Build the index. Command:
./build \
<bacterial_reference_seq_associated_taxonomy_info_file(generated in Step1.3)> \
<path_to_bacterial_reference_seq_Kmer_file_dir> \
<shared kmer file suffixes> \
<Kmer_length> 6 \
<path_to_bacterial_index> \
<path_to_a_temp_dir_for_intermediate_files>
make classifier
in the classifier
directory. Perform taxonomic classification for each metagenomics sequencing reads. Command:
./classifier \
<path_to_bacterial_index> \
<path_to_output_results_dir> \
<Kmer_length> \
<threads_num> \
<fa_or_fq> \
<SE_or_PE> \
<bacterial_speciesId2taxoInfo_file> \
<NCBI_names_file> \
<readFile_singleEnd or readFile_end1> \
(<readFile_end2 if paired-end reads are provided>)
<bacterial_speciesId2taxoInfo_file>
can be downloaded from:
https://drive.google.com/open?id=0BxgO-FKbbXRIc3FkLVFvMlpVVGM
Each row represents a species and its associated taxon ids at each taxonomic rank:
species, genus, family, order, class, and phylum. Assign -1
if the taxon id is not available.
<NCBI_names_file>
can be downloaded from:
https://drive.google.com/open?id=0BxgO-FKbbXRIUFI2dHlBMXZhdTA
NOTE: We will keep the following files updated with the latest NCBI/refseq bacterial genome databases:
Also, we will release tools for generating all the above files (from NCBI/refseq bacterial genome databases) very soon.
Copyright (C) 2016-, University of Kentucky
Please refer to LICENSE.TXT for the detailed 'License'.