xychang / RecursiveHierarchicalClustering

Use iterative feature pruning to identify hierarchical clusters.
http://sandlab.cs.ucsb.edu/clickstream/index.html
GNU General Public License v3.0

Verifying results #4

Open IG16 opened 5 years ago

IG16 commented 5 years ago

Hey, I was using your awesome clickstream algorithm engine when I noticed something interesting.

Here is what I did: I am trying to verify results of the algorithm, so the check I do is the following:

  1. After running the algorithm, open the result.json file.
  2. For every leaf node in result.json, find its list of exclusions, for example:

     ["t", [["l", [48, 167, 201, 283, 434, 468, 672, 883, 916, 970, 1015, 1271], {"exclusionsScore": [1285.0, 336.0208333333333, 0.0, 0.0], "exclusions": ["S2319", "S674", "S3690", "S3361"]}],

  3. For each user in the cluster (for instance userId 48), look up that user's input file (the actual log of actions performed by the user, which is the input to the algorithm) and verify that the user has performed at least one of the sequences ["S2319", "S674", "S3690", "S3361"].
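For reference, the check described above could be sketched roughly as follows. This is only an illustration under assumptions: it guesses from the fragment above that a leaf looks like `["l", user_ids, info]` and an internal node like `["t", children, ...]`, and `user_sequences` is a hypothetical mapping from user ID to the set of sequence IDs found in that user's input log. Neither the layout nor the helper names come from the repository.

```python
def iter_leaves(node):
    """Yield (user_ids, info) for every leaf under `node`.

    Assumed layout (inferred from the result.json fragment, not confirmed):
    leaf = ["l", user_ids, info], internal node = ["t", children, ...].
    """
    kind = node[0]
    if kind == "l":
        yield node[1], node[2]
    elif kind == "t":
        for child in node[1]:
            yield from iter_leaves(child)


def users_without_cluster_features(tree, user_sequences):
    """Map each user who performed none of their leaf's exclusions to that list."""
    missing = {}
    for user_ids, info in iter_leaves(tree):
        exclusions = set(info.get("exclusions", []))
        for uid in user_ids:
            # Empty intersection: the user has none of the cluster's sequences.
            if not exclusions & user_sequences.get(uid, set()):
                missing[uid] = sorted(exclusions)
    return missing


# Toy data (made up for illustration, not real output of the algorithm).
tree = ["t", [
    ["l", [48, 167], {"exclusionsScore": [1285.0, 336.0],
                      "exclusions": ["S2319", "S674"]}],
    ["l", [201], {"exclusionsScore": [10.0], "exclusions": ["S3690"]}],
]]
# Hypothetical lookup: user ID -> sequence IDs present in that user's log.
user_sequences = {48: {"S2319", "S1"}, 167: {"S1"}, 201: {"S3690"}}

missing = users_without_cluster_features(tree, user_sequences)
print(missing)  # {167: ['S2319', 'S674']}
```

In this toy run, user 167 sits in the first leaf without having performed either of that leaf's sequences, which is the situation the check is meant to surface.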

Here are the results: when I run this check, about 20% of users do not have any of their cluster's sequences in the input file, meaning they did not perform any of the action sequences of the cluster they belong to.

Here is what I expected: Does this result make sense? Shouldn't users perform at least one sequence that appears in the cluster they belong to? Thank you very much.

xychang commented 5 years ago

Hi,

Thank you for your question.

Yes, it can happen that some users do not have any of the prominent features of the cluster they are in. These users are drawn into the cluster because they are similar to its existing members on some other features.

To illustrate how this may happen, consider the following scenario: a cluster's prominent feature is spamming, but the spamming accounts also have other features in common (e.g. registration pattern). When a new account exhibits the same registration pattern but hasn't started spamming yet, it can still be drawn into the spamming cluster because of the similar registration pattern.
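A toy sketch of this effect is below. It uses plain cosine similarity over feature counts as a stand-in, which is an assumption for illustration and not necessarily the distance measure this repository uses. The feature names and counts are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature-count dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical accounts; "spam_post" plays the role of the prominent feature.
spammer = {"spam_post": 5, "reg_pattern": 3}
browser = {"view_page": 6, "search": 2}
# New account: same registration pattern as the spammer, but no spam yet.
new_account = {"reg_pattern": 3, "view_page": 1}

sim_to_spammer = cosine_similarity(new_account, spammer)
sim_to_browser = cosine_similarity(new_account, browser)

# The new account is closer to the spammer even though it has no "spam_post".
print(sim_to_spammer > sim_to_browser)
```

Here the new account never performs the prominent feature, yet its shared registration pattern is enough to make it more similar to the spamming account than to the ordinary one.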

Please keep in mind that the prominent features are just the top-ranked features for each cluster and not the full set of features. We use the L-method to make sure that these features are significantly stronger than the rest of the features, but this by no means indicates that the other features have no effect on the cluster formation.

Hope this answers your questions.