Open biofuture opened 6 years ago
Yes. The current index files we supplied are only for bacteria. You may build your own index if you want to include more species.
For PE reads, reads from the same pair are considered as a whole. So all the kmer features from both ends will be considered, but we do not consider their distance for now.
Thanks for the reply.
Will you supply the index for Fungi in the near future?
I have downloaded all the 2700 Fungi genomes from NCBI and want to construct the index for Fungi. However, I found that I not only need to construct the Kmer index file for all the fungi, but also need to generate the species2taxID files for classifying. I have two questions:
Dear Developer
I have build the index for fungi, however after a long time of running for the build process, NO OUTCOME Only a series of binary TMP files were generated and then the program stopped without any useful information supplied.
-rw-rw-r-- 1 lg209ws3 lg209ws3 18G Dec 28 11:25 FgitempdirbuildTMP49 -rw-rw-r-- 1 lg209ws3 lg209ws3 8.5G Dec 27 18:32 FgitempdirbuildTMP5 -rw-rw-r-- 1 lg209ws3 lg209ws3 5.8G Dec 28 11:41 FgitempdirbuildTMP50 -rw-rw-r-- 1 lg209ws3 lg209ws3 6.2G Dec 28 11:57 FgitempdirbuildTMP51 -rw-rw-r-- 1 lg209ws3 lg209ws3 5.9G Dec 28 12:14 FgitempdirbuildTMP52 -rw-rw-r-- 1 lg209ws3 lg209ws3 3.2G Dec 28 12:25 FgitempdirbuildTMP53 -rw-rw-r-- 1 lg209ws3 lg209ws3 3.2G Dec 28 12:35 FgitempdirbuildTMP54
The last two lines of the nohup.out file is this OpenFile to binary read/write Multivalue Kmers file FgitempdirbuildTMP54 145a560 Reading file for keys in group 0/fffff
I have written a script to parse the NCBI taxonomy dump files and could generate the species2taxID for your program and test it already. It's a pity that the building process is still not successful for all the fungi. It was successful while only use two genomes.
Any help from you with your software will be appreciated!
I share the Perl Script to generate the file here. Anybody can use it and distribute it without permission from me.
perl generate_speciesIDtax_for_metaOthello.pl
#!/usr/bin/perl -w
use strict;
#Author Xiao-Tao Jiang
#2017-12-27
##This script is to generate the species id to taxid file from NCBI rankedlineage.dmp file and a list of selected gennomes
die "perl $0 <selected_species_IDS> <rankedlineage.dmp> <output_speciesId2taxoInfo.txt>\n" unless (@ARGV == 3);
my %allspids;
die "$!\n" unless open(I, "$ARGV[0]");
while(<I>){
chomp;
$allspids{$_} =1;
}
close I;
die "$!\n" unless open(RANK, "$ARGV[1]");
my %arname; ##store all the ranks names in order to retrieve the ids then
my %sp2othername; ##store all the upper rank names in order to retrieve all the ids
while(<RANK>){
chomp;
my @m =split(/\t\|\t/, $_);
#print "$m[-1]\n";
pop @m;
if(exists $allspids{$m[0]}){
if($#m > 9){
print "$_\n";
}
##find the species ids
die "Not a species $m[0]\n" if($m[1] eq "");
for(my $i = 3; $i <= $#m; $i++ ){ ##from genus
if($m[$i] ne ""){
$sp2othername{$m[0]}{$m[$i]} = $i;
#print "$m[$i]\n";
unless (exists $arname{$m[$i]}){ $arname{$m[$i]} = 1;}
}else{
# for my $k(@m){
# print "$k\tK\n";
# }
# die "\n";
}
}
}
}
close RANK;
die "REOPEN $ARGV[1]\n" unless open(REO, "$ARGV[1]");
my %name2id; ##store all the taxon ids for each rank name
while(<REO>){
chomp;
my @m =split(/\s+\|\s+/, $_);
if(exists $arname{$m[1]}){
#print "$m[1]\n";
$name2id{$m[1]} = $m[0];
}
}
close REO;
die "$!\n" unless open(OT, ">$ARGV[2]");
for my $sid (keys %sp2othername){
# print OT "$sid";
my @tempids = ();
for my $upn(sort {$sp2othername{$sid}{$a} <=> $sp2othername{$sid}{$b}} keys %{$sp2othername{$sid}} ){
# print OT "\t$upn";
if(exists $name2id{$upn}){
$tempids[$sp2othername{$sid}{$upn}] = $name2id{$upn};
#print OT "\t$name2id{$upn}";
}
#print OT "\t|\t$name2id{$upn}";
}
for (my $i =3; $i <=$#tempids; $i++){
unless($tempids[$i]){
$tempids[$i] = -1;
}
}
shift @tempids; shift @tempids; shift @tempids;
my $o = join("\t", $sid, @tempids);
print OT "$o\n";
}##print out each one
Dear Developer
As metagenomics data contains all microorganisms and also contamination from host. What if I want to use your tool to classify all of them?
The document/index files you supplied seems to be only used for bacteria???
It seems that for pair-end reads it only report one reads, my question is that for the PE reads, do you consider all the kmer features in the paired reads and distance between the pair-end reads???