zheminzhou / EToKi

all methods related to Enterobase
https://enterobase.warwick.ac.uk
GNU General Public License v3.0
39 stars 18 forks

Need documentation on cgMLST #13

Open lskatz opened 2 years ago

lskatz commented 2 years ago

I am taking some notes on how I ran cgMLST, and I hope you can add documentation for it.

Create database: this took a very long time

# Downloaded the cgMLST scheme from enterobase FTP into Salmonella.cgMLSTv2.enterobase (undocumented)
\ls -f1 Salmonella.cgMLSTv2.enterobase/*.fasta | \
  grep -v cgMLST_v2_ref.fasta `# ignore already-established reference file` | \
  xargs -n 1 seqtk seq -l 0 `# one file per seqtk call, printed in two-line fasta format` | \
  perl -lane '
    # Get the id (with its leading ">") from the defline. The seq is on the next line
    # because the input is in two-line fasta format.
    $id=$F[0];
    $seq=<>;
    chomp($seq);
    # This probably will not matter, but avoid any infinite loop by quitting if we see the same id twice.
    # %seen is deliberately not declared with "my" so that it persists across input lines.
    if($seen{$id}++){print STDERR "Already seen $id. Done."; last;}

    # Avoid deflines that might be problematic
    if($id =~ /[^_>0-9a-zA-Z]/){
      print STDERR "Skipping ".$id;
      next;
    }
    print "$id\n$seq";
  ' > enterobase.filtered.fasta
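
Before building a database from the filtered file, it may be worth confirming that the output really is in strict two-line format, since the Perl filter above relies on that. The following is a minimal sketch, not part of EToKi: `check_two_line_fasta` is a hypothetical helper name, and it assumes a POSIX `awk` is available. It prints the record count and the number of repeated ids, and returns non-zero if deflines and sequence lines do not alternate. Repeated ids are only reported, not treated as an error.

```shell
# Hypothetical helper (not part of EToKi): sanity-check a two-line FASTA file.
check_two_line_fasta() {
  awk '
    # Odd lines must be deflines starting with ">"
    NR % 2 == 1 {
      if ($0 !~ /^>/) {
        printf "line %d: expected a defline\n", NR | "cat 1>&2"
        bad = 1
      }
      # Count ids (first whitespace-delimited field) that were already seen
      if (seen[$1]++) dup++
      next
    }
    # Even lines must be non-empty sequence lines
    {
      if ($0 ~ /^>/ || $0 == "") {
        printf "line %d: expected a sequence\n", NR | "cat 1>&2"
        bad = 1
      }
    }
    END {
      # An odd number of lines means the last defline has no sequence
      if (NR % 2 == 1) bad = 1
      printf "records=%d duplicates=%d\n", int(NR / 2), dup + 0
      exit bad + 0
    }
  ' "$1"
}
```

Usage would look like `check_two_line_fasta enterobase.filtered.fasta`.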
verylili commented 1 year ago

I need this documentation too. I downloaded the cgMLST scheme for E. coli and tried to create the database. The command ran for 4 days, but its machine (CPU) time reached only about 1.2 hours and then almost stopped increasing, so I had to stop the command.
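
A large gap between elapsed time and CPU time, as described above, can be checked directly while the command is still running. This is a minimal sketch, assuming a `ps` that accepts the POSIX `-o` and `-p` options. `show_cpu_vs_wall` is a hypothetical helper name, and the process id must be looked up separately. It only reports the two timers and does not explain why the process is idle.

```shell
# Hypothetical helper: print pid, elapsed wall-clock time (etime),
# cumulative CPU time (time) and command name for one process.
# The trailing "=" on each column suppresses the header line.
show_cpu_vs_wall() {
  ps -o pid= -o etime= -o time= -o comm= -p "$1"
}
```

Calling `show_cpu_vs_wall 12345` (with the real process id in place of the example value 12345) a few minutes apart shows whether the CPU time is still advancing or only the elapsed time is growing.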