Open sejmodha opened 8 years ago
Not that I'm aware of... have you tried ordering the identical sequences differently?
On Tue, Jun 14, 2016 at 10:02 AM, sejmodha notifications@github.com wrote:
Hi There,
I was wondering if it is possible to define preferred representative sequences for clusters in cd-hit when there are identical sequences in a cluster.
Some clusters have more than one protein with equal longest length, but only one is chosen as the cluster representative. How does cd-hit choose the representative when more than one sequence has the same maximum length?

Expression information is of course not given to cd-hit, but when I compare the transcript levels of the clustered translations, the one with the highest expression is sometimes one of those NOT chosen. I'm trying to figure out how to correct or adjust for this when choosing the best representative for follow-up work, either after running cd-hit or before it if possible.

Is the order of the input sequences relevant here, as @dbolser-ebi alludes to? The manual suggests that the first step in the code is sorting by protein length, so I would think not.

I created a script to append the expression level after each ID in the cluster file. Here is an example cluster to illustrate my point.
>Cluster 53
0 252aa, >Gene.116053::k24.R37863736::g.116053::m.116053... at 100.00% 0.230025465276951
1 250aa, >Gene.7533::k32.R26410032::g.7533::m.7533... at 100.00% 0.159230324881883
2 177aa, >Gene.105054::k40.J20255009::g.105054::m.105054... at 98.31% 0.595071256619217
3 322aa, >Gene.105738::k40.J20289056::g.105738::m.105738... at 99.07% 2.66349076959379
4 130aa, >Gene.204115::k64.R11492990::g.204115::m.204115... at 99.23% 3.61629985607615
5 1835aa, >Gene.220450::k64.S11351006::g.220450::m.220450... at 100.00% 0.188932242855423
6 1377aa, >Gene.265102::k80.J8197378::g.265102::m.265102... at 100.00% 0
7 2392aa, >Gene.233290::k80.R8177074::g.233290::m.233290... * 2.24921549465352
8 105aa, >Gene.238070::k80.R8269513::g.238070::m.238070... at 100.00% 6.34069835526098
9 2392aa, >Gene.153629::k96.R5793713::g.153629::m.153629... at 100.00% 6.96925035549896
10 2392aa, >Gene.162410::k96.R5914198::g.162410::m.162410... at 100.00% 9.44989456273482
Three proteins share the maximum length of 2392 aa, but the one chosen has the lowest transcript level of the three; I would like the last one to be selected instead. Thanks for any ideas/suggestions.
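One possible post-cd-hit workaround is to re-pick the representative from the annotated cluster listing itself. The sketch below is not part of cd-hit: the function names `parse_clusters` and `preferred_representative` are hypothetical, and it assumes protein clusters in the layout shown above, with the expression value appended as the last whitespace-separated field on each member line. Among the longest members it picks the one with the highest expression.

```python
import re

# One member line of the annotated listing, e.g.
# "10 2392aa, >Gene.162410::...::m.162410... at 100.00% 9.44989456273482"
# The trailing number is the appended expression value (an assumption about
# the custom annotation, not something cd-hit writes itself).
MEMBER = re.compile(
    r"^\s*(\d+)\s+(\d+)aa,\s+>(\S+?)\.\.\.\s+(\*|at\s+\S+%)\s+(\S+)\s*$"
)


def parse_clusters(lines):
    """Group annotated member lines under their '>Cluster N' header."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line[1:].strip()
            clusters[current] = []
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            clusters[current].append({
                "id": match.group(3),
                "length": int(match.group(2)),
                "is_rep": match.group(4) == "*",  # cd-hit's own choice
                "expression": float(match.group(5)),
            })
    return clusters


def preferred_representative(members):
    """Longest member wins; ties on length are broken by expression."""
    return max(members, key=lambda m: (m["length"], m["expression"]))


if __name__ == "__main__":
    example = [
        ">Cluster 53",
        "7 2392aa, >Gene.233290::k80.R8177074::g.233290::m.233290... * 2.24921549465352",
        "9 2392aa, >Gene.153629::k96.R5793713::g.153629::m.153629... at 100.00% 6.96925035549896",
        "10 2392aa, >Gene.162410::k96.R5914198::g.162410::m.162410... at 100.00% 9.44989456273482",
    ]
    for name, members in parse_clusters(example).items():
        print(name, preferred_representative(members)["id"])
```

This only relabels the representative after the fact; the cluster membership and the identity percentages still refer to the sequence cd-hit originally chose, so the corresponding FASTA record would have to be swapped separately.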