xa6xa6 / metaOthello

Other
16 stars 7 forks source link

What if I would like to classify reads from Fungi and human from metagenomic data? #11

Open biofuture opened 6 years ago

biofuture commented 6 years ago

Dear Developer

As metagenomics data contains all microorganisms and also contamination from host. What if I want to use your tool to classify all of them?

The document/index files you supplied seems to be only used for bacteria???

It seems that for pair-end reads it only report one reads, my question is that for the PE reads, do you consider all the kmer features in the paired reads and distance between the pair-end reads???

xa6xa6 commented 6 years ago

Yes. The current index files we supplied are only for bacteria. You may build your own index if you want to include more species.

For PE reads, reads from the same pair are considered as a whole. So all the kmer features from both ends will be considered, but we do not consider their distance for now.

biofuture commented 6 years ago

Thanks for the reply.

Will you supply the index for Fungi in the near future?

I have downloaded all the 2700 Fungi genomes from NCBI and want to construct the index for Fungi. However, I found that I not only need to construct the Kmer index file for all the fungi, but also need to generate the species2taxID files for classifying. I have two questions:

  1. Do I need to use all of the genomes or only need to select one strain from each species as there are many strains under one species and there is quite a large portion of redundancy in the genomes and the kmer features.
  2. Could you send me your script for constructing the bacteria bacterial_speciesId2taxoInfo.txt file from NCBI taxonomy dump files? I do not want to rewrite such program as you definitely had already done this kind of job. Please sent it to my email, biofuture.jiang@gmail.com Thank you for your help. The speed improvement in your software is really amazing. You guys did a good job. Hopefully, this tool can really help us to save more time.
biofuture commented 6 years ago

Dear Developer

I have build the index for fungi, however after a long time of running for the build process, NO OUTCOME Only a series of binary TMP files were generated and then the program stopped without any useful information supplied.

-rw-rw-r-- 1 lg209ws3 lg209ws3 18G Dec 28 11:25 FgitempdirbuildTMP49 -rw-rw-r-- 1 lg209ws3 lg209ws3 8.5G Dec 27 18:32 FgitempdirbuildTMP5 -rw-rw-r-- 1 lg209ws3 lg209ws3 5.8G Dec 28 11:41 FgitempdirbuildTMP50 -rw-rw-r-- 1 lg209ws3 lg209ws3 6.2G Dec 28 11:57 FgitempdirbuildTMP51 -rw-rw-r-- 1 lg209ws3 lg209ws3 5.9G Dec 28 12:14 FgitempdirbuildTMP52 -rw-rw-r-- 1 lg209ws3 lg209ws3 3.2G Dec 28 12:25 FgitempdirbuildTMP53 -rw-rw-r-- 1 lg209ws3 lg209ws3 3.2G Dec 28 12:35 FgitempdirbuildTMP54

The last two lines of the nohup.out file is this OpenFile to binary read/write Multivalue Kmers file FgitempdirbuildTMP54 145a560 Reading file for keys in group 0/fffff

biofuture commented 6 years ago

I have written a script to parse the NCBI taxonomy dump files and could generate the species2taxID for your program and test it already. It's a pity that the building process is still not successful for all the fungi. It was successful while only use two genomes.

Any help from you with your software will be appreciated!

I share the Perl Script to generate the file here. Anybody can use it and distribute it without permission from me. perl generate_speciesIDtax_for_metaOthello.pl

#!/usr/bin/perl -w
use strict;
#Author Xiao-Tao Jiang
#2017-12-27

##This script is to generate the species id to taxid file from NCBI rankedlineage.dmp file and a list of selected gennomes
die "perl $0 <selected_species_IDS> <rankedlineage.dmp> <output_speciesId2taxoInfo.txt>\n" unless (@ARGV == 3);

my %allspids;
die "$!\n" unless open(I, "$ARGV[0]");
while(<I>){
        chomp;
        $allspids{$_} =1;
}
close I;

die "$!\n" unless open(RANK, "$ARGV[1]");
my %arname; ##store all the ranks names in order to retrieve the ids then 
my %sp2othername; ##store all the upper rank names in order to retrieve all the ids 
while(<RANK>){
        chomp;
        my @m =split(/\t\|\t/, $_);
        #print "$m[-1]\n"; 
        pop @m;
        if(exists $allspids{$m[0]}){
                if($#m > 9){
                        print "$_\n";
                }
                ##find the species ids 
                die "Not a species $m[0]\n" if($m[1] eq "");
                for(my $i = 3; $i <= $#m; $i++ ){ ##from genus 
                        if($m[$i] ne ""){

                                $sp2othername{$m[0]}{$m[$i]} = $i;
                                #print "$m[$i]\n";
                                unless (exists $arname{$m[$i]}){ $arname{$m[$i]} = 1;}
                        }else{
                        #       for my $k(@m){
                        #               print "$k\tK\n";
                        #       }
                        #       die "\n";
                        }
                }
        }
}
close RANK;

die "REOPEN $ARGV[1]\n" unless open(REO, "$ARGV[1]");
my %name2id; ##store all the taxon ids for each rank name 
while(<REO>){
        chomp;
        my @m =split(/\s+\|\s+/, $_);

        if(exists $arname{$m[1]}){
                #print "$m[1]\n";
                $name2id{$m[1]} = $m[0];
        }
}
close REO;
die "$!\n" unless open(OT, ">$ARGV[2]");
for my $sid (keys %sp2othername){
#       print OT "$sid";
        my @tempids = ();
        for my $upn(sort {$sp2othername{$sid}{$a} <=> $sp2othername{$sid}{$b}} keys  %{$sp2othername{$sid}}  ){
        #       print OT "\t$upn";
                if(exists $name2id{$upn}){

                        $tempids[$sp2othername{$sid}{$upn}] = $name2id{$upn};

                        #print OT "\t$name2id{$upn}";
                }
                #print OT  "\t|\t$name2id{$upn}";
        }

        for (my $i =3; $i <=$#tempids; $i++){
                unless($tempids[$i]){
                        $tempids[$i] = -1;
                }
        }
        shift @tempids; shift @tempids; shift @tempids;
        my $o = join("\t", $sid, @tempids);

        print OT "$o\n";
}##print out each one