weizhongli / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

Preferred representative sequences #35

Open sejmodha opened 8 years ago

sejmodha commented 8 years ago

Hi There,

I was wondering if it is possible to define preferred representative sequences for clusters in cd-hit when there are identical sequences in a cluster.

dbolser-ebi commented 8 years ago

Not that I'm aware of... have you tried ordering the identical sequences differently?
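For anyone who wants to try the reordering idea, here is a minimal sketch in Python. It sorts FASTA records longest first and, among records of equal length, puts the ones you prefer first. The `priority` mapping (sequence ID to a score such as expression level) and the function names are hypothetical. Whether cd-hit keeps the input order among equal-length sequences is not confirmed anywhere in this thread, so treat this as an experiment rather than a guaranteed fix.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reorder(records, priority):
    """Sort longest first; among equal lengths, highest priority first.

    `priority` is a hypothetical dict mapping the first word of each
    header (the sequence ID) to a score; IDs not in the dict get 0.0.
    """
    def key(record):
        header, seq = record
        seq_id = header.split()[0]
        return (-len(seq), -priority.get(seq_id, 0.0))

    return sorted(records, key=key)


def write_fasta(records):
    """Return the records as a list of FASTA lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.append(seq)
    return out
```

The reordered lines can be written to a new file and passed to cd-hit as its input, after which the cluster file shows whether the preferred sequence was picked.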



ckeeling commented 7 years ago

Some clusters contain more than one protein of equal longest length, but only one is chosen as the cluster representative. How does cd-hit choose the representative when more than one sequence has the maximum length?

Expression information is of course not given to cd-hit, but when I compare the transcript levels of the clustered translations, the one with the highest expression is sometimes NOT the one chosen. I'm trying to figure out how to correct for this when choosing the best representative for follow-up work, either after running cd-hit or, if possible, before it.

Is the order of the input sequences relevant here, as @dbolser-ebi alludes to? The manual suggests that the first step in the code is sorting by protein length, so I would think not.

I created a script to append the expression level after each ID in the cluster file. Here is an example cluster to illustrate my point.

>Cluster 53
0   252aa, >Gene.116053::k24.R37863736::g.116053::m.116053... at 100.00%    0.230025465276951
1   250aa, >Gene.7533::k32.R26410032::g.7533::m.7533... at 100.00%  0.159230324881883
2   177aa, >Gene.105054::k40.J20255009::g.105054::m.105054... at 98.31% 0.595071256619217
3   322aa, >Gene.105738::k40.J20289056::g.105738::m.105738... at 99.07% 2.66349076959379
4   130aa, >Gene.204115::k64.R11492990::g.204115::m.204115... at 99.23% 3.61629985607615
5   1835aa, >Gene.220450::k64.S11351006::g.220450::m.220450... at 100.00%   0.188932242855423
6   1377aa, >Gene.265102::k80.J8197378::g.265102::m.265102... at 100.00%    0
7   2392aa, >Gene.233290::k80.R8177074::g.233290::m.233290... * 2.24921549465352
8   105aa, >Gene.238070::k80.R8269513::g.238070::m.238070... at 100.00% 6.34069835526098
9   2392aa, >Gene.153629::k96.R5793713::g.153629::m.153629... at 100.00%    6.96925035549896
10  2392aa, >Gene.162410::k96.R5914198::g.162410::m.162410... at 100.00%    9.44989456273482

Three proteins share the maximum length of 2392 aa, but the one chosen (marked with *) has the lowest transcript level of the three; I would like the last one to be selected instead. Thanks for any ideas/suggestions.
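As a possible post-cd-hit workaround, the annotated cluster file can be re-parsed to pick, within each cluster, the longest member with the highest expression. The sketch below assumes the exact layout shown in the example above (protein lengths in `aa`, expression value appended as the last column); that layout comes from a custom script, not from cd-hit itself, and the function name `pick_preferred` is hypothetical.

```python
import re

# One member line of the annotated cluster file, e.g.
# "7   2392aa, >Gene.233290::...::m.233290... * 2.24921549465352"
# The trailing number is the appended expression value (assumed layout).
MEMBER = re.compile(
    r"^(\d+)\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+\S+)\s+(\S+)\s*$"
)


def pick_preferred(lines):
    """Return {cluster name: preferred member} for an annotated .clstr file.

    The preferred member is the one with the highest expression among
    the members that share the maximum length in the cluster.
    """
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            clusters[current].append({
                "id": match.group(3),
                "length": int(match.group(2)),
                "is_cdhit_rep": match.group(4) == "*",
                "expression": float(match.group(5)),
            })

    preferred = {}
    for name, members in clusters.items():
        if not members:
            continue
        longest = max(m["length"] for m in members)
        candidates = [m for m in members if m["length"] == longest]
        preferred[name] = max(candidates, key=lambda m: m["expression"])
    return preferred
```

Called on the lines of the annotated file (for example `pick_preferred(open(path))`, with `path` being wherever that file lives), this returns one dictionary per cluster. Keep in mind that the identity percentages in the cluster file are relative to the representative cd-hit chose, so they do not automatically carry over to a swapped-in representative.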