I ran the following cd-hit-est command and got the two expected output files, so it appears to have run successfully:
./cd-hit-est -i /home/tsc7044/cd-hit-v4.8.1-2019-0228/soil_ntc04_processed.fasta -o soil_ntc04_cdhit99 -c 0.99 -n 11 -g 1 -d 0 -T 8 -M 1600 -sc 1 -sf 1
Note -sc 1 and -sf 1: as I understand it, both the clusters file and the representative sequences FASTA file should be sorted by decreasing cluster size. This holds for the clusters file (largest cluster first, down to singletons), but the representative sequences FASTA file is not sorted that way. It appears to be ordered by ascending read number instead, which is not correlated with cluster size. Re-running the same command with -sf 0 (FASTA sorting turned off) gave exactly the same output, at least for the first several dozen lines.
Is the lack of -sf functionality a known issue, or is something wrong on my end? I want to use the clusters file to find the top 'n' most common reads and then pull those from the representative sequences FASTA file. I have a workaround, but it looks like -sf would make this much easier if it worked for me.
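A workaround of this kind can be sketched roughly as below. This is only a minimal sketch under stated assumptions, not a definitive script. It assumes the usual .clstr layout (a ">Cluster N" header followed by one line per member, with the representative's line ending in "*"), and that with -d 0 the names in the .clstr file match the FASTA IDs up to the first whitespace. The function names cluster_sizes, read_fasta and top_representatives are hypothetical, chosen just for illustration.

```python
def cluster_sizes(clstr_lines):
    """Return [(representative_id, size), ...], largest cluster first."""
    clusters = []
    rep, size = None, 0
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts: store the previous one, if any.
            if rep is not None:
                clusters.append((rep, size))
            rep, size = None, 0
            continue
        # Member line, e.g. "0\t150nt, >read_17... *" (assumed layout).
        size += 1
        if line.endswith("*") and ">" in line:
            rep = line.split(">", 1)[1].split("...", 1)[0]
    if rep is not None:
        clusters.append((rep, size))
    # sorted() is stable, so clusters of equal size keep their file order.
    return sorted(clusters, key=lambda c: c[1], reverse=True)


def read_fasta(fasta_lines):
    """Map each sequence ID (header text up to the first whitespace) to its sequence."""
    records = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else None
            if name is not None:
                records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def top_representatives(clstr_lines, fasta_lines, n):
    """Return (id, cluster_size, sequence) for the n largest clusters."""
    seqs = read_fasta(fasta_lines)
    ranked = cluster_sizes(clstr_lines)[:n]
    return [(rep, size, seqs[rep]) for rep, size in ranked if rep in seqs]
```

For the run above, the inputs would be the lines of soil_ntc04_cdhit99.clstr and soil_ntc04_cdhit99, e.g. top_representatives(open("soil_ntc04_cdhit99.clstr"), open("soil_ntc04_cdhit99"), 10). Because the ranking comes from the clusters file, the result does not depend on how the FASTA output is ordered.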
Thanks