weizhongli / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

Functionality missing for -sf option in cd-hit-est #116

Open tcr0fts opened 3 years ago

tcr0fts commented 3 years ago

I ran the following cd-hit-est command and got the two expected output files, so the run itself completed successfully:

./cd-hit-est -i /home/tsc7044/cd-hit-v4.8.1-2019-0228/soil_ntc04_processed.fasta -o soil_ntc04_cdhit99 -c 0.99 -n 11 -g 1 -d 0 -T 8 -M 1600 -sc 1 -sf 1

Note -sc 1 and -sf 1: both the clusters file and the representative-sequence FASTA file should be sorted by decreasing cluster size. This holds for the clusters file (largest cluster down to singletons), but the representative FASTA file is not sorted that way. It appears to be ordered by ascending read number instead, which is not correlated with cluster size. Re-running the same command with -sf 0 (FASTA sorting turned off) gave the same output, at least for the first several dozen lines.

Is the missing -sf functionality a known issue, or is something wrong on my end? I want to use the clusters file to find the top 'n' most common reads and then pull those from the representative-sequence FASTA file. I have a workaround, but -sf looks like it should make this much easier if it worked for me.
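For anyone landing here, below is a minimal sketch of one possible workaround (not necessarily the one mentioned above). It assumes the usual .clstr layout, where each cluster starts with a `>Cluster N` line and the representative member's line ends in `*`, and it assumes the run used -d 0 so that names in the .clstr file match the first whitespace-delimited token of the FASTA headers. The function names are made up for illustration and are not part of cd-hit.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Member lines look like: "0<TAB>150nt, >read_1... *" (assumed layout).
_MEMBER_NAME = re.compile(r">(.+?)\.\.\.")


def cluster_sizes(clstr_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (representative_name, cluster_size), largest cluster first."""
    sizes: List[Tuple[str, int]] = []
    rep: Optional[str] = None
    count = 0
    for raw in clstr_lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            if rep is not None:
                sizes.append((rep, count))
            rep, count = None, 0
        elif line:
            count += 1
            if line.endswith("*"):  # '*' marks the representative
                match = _MEMBER_NAME.search(line)
                if match:
                    rep = match.group(1)
    if rep is not None:
        sizes.append((rep, count))
    # sorted() is stable, so equal-sized clusters keep their file order.
    return sorted(sizes, key=lambda item: -item[1])


def read_fasta(fasta_lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence ID (first token of the header) to its full record."""
    records: Dict[str, List[str]] = {}
    key: Optional[str] = None
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            key = tokens[0] if tokens else ""
            records[key] = [line]
        elif key is not None and line:
            records[key].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in records.items()}


def sorted_representatives(
    clstr_lines: Iterable[str],
    fasta_lines: Iterable[str],
    n: Optional[int] = None,
) -> str:
    """Return the top-n representative records ordered by cluster size."""
    records = read_fasta(fasta_lines)
    ranked = cluster_sizes(clstr_lines)[:n]  # n=None keeps every cluster
    return "".join(records[name] for name, _ in ranked if name in records)
```

With the command above, the inputs would be the two output files, soil_ntc04_cdhit99.clstr and soil_ntc04_cdhit99, each opened and passed as an iterable of lines. This holds all representative sequences in memory, which should be fine for modest read sets but may need a streaming approach for very large ones.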

Thanks

genegolts commented 1 year ago

Experiencing the same problem in version 4.8.1. The fasta output is not sorted by cluster size, with or without -sf 1.

shaharbr commented 1 year ago

Experiencing the same problem