Need documentation on cgMLST

I am taking some notes on how I ran cgMLST, and I hope you can add documentation for it.

Create the database (this took a very long time):

# Downloaded the cgMLST scheme from the enterobase FTP into Salmonella.cgMLSTv2.enterobase (undocumented)
\ls -f1 Salmonella.cgMLSTv2.enterobase/*.fasta | \
  grep -v cgMLST_v2_ref.fasta `# ignore the already-established reference file` | \
  xargs -n 1 seqtk seq -l 0 `# one file per call since seqtk seq reads a single input, output is two-line fasta` | \
  perl -lane '
    # The id (with its leading ">") is on this line and the seq is on the next
    # line, since the input is in two-line fasta format.
    # Note: no single quotes inside this script, they would end the shell quoting.
    $id=$F[0];
    $seq=<>;
    chomp($seq);
    # I do not think this will matter, but avoid any infinite loop by quitting
    # if we see the same id twice. %seen is a global (no "my") so that it
    # persists from one input line to the next.
    if($seen{$id}++){print STDERR "Already seen $id. Done."; last;}

    # Avoid deflines that might be problematic
    if($id =~ /[^_>0-9a-zA-Z]/){
      print STDERR "Skipping ".$id;
      next;
    }
    print "$id\n$seq";
  ' > enterobase.filtered.fasta
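
Since the step above takes so long, it may be worth sanity-checking the result before building anything from it. The following is only a sketch: `check_two_line_fasta` is a hypothetical helper name (not part of EToKi or seqtk), and the allele ids in the demo are made up. It checks that a file is strictly two-line FASTA with unique ids and reports the record count.

```shell
# Hypothetical helper: verify that a FASTA file alternates header/sequence
# lines, that header ids are unique, and print the number of records.
# Exits non-zero if any problem is found.
check_two_line_fasta() {
  awk '
    NR % 2 == 1 && !/^>/      { printf "line %d: expected a header\n", NR; bad = 1 }
    NR % 2 == 0 && /^>/       { printf "line %d: expected a sequence\n", NR; bad = 1 }
    NR % 2 == 1 && seen[$1]++ { printf "line %d: duplicate id %s\n", NR, $1; bad = 1 }
    END {
      if (NR % 2 == 1) { print "file ends on a header with no sequence"; bad = 1 }
      printf "%d records\n", int(NR / 2)
      exit bad
    }
  ' "$1"
}

# Demo on a tiny made-up file (ids are illustrative only)
demo=$(mktemp)
printf '>locusA_1\nACGT\n>locusA_2\nACGA\n' > "$demo"
check_two_line_fasta "$demo"   # prints "2 records"
rm -f "$demo"
```

On the real output this would be run as `check_two_line_fasta enterobase.filtered.fasta`.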

zheminzhou / EToKi

Need documentation on cgMLST #13