Closed xurxodiz closed 12 years ago
Let's finish this thing already, I'm confusing myself like a Pokémon.
Open Weka, compare the log likelihood of FarthestFirst, HierarchicalCluster, sIB, XMeans, SimpleKMeans and EM for two, three, four (and five?) clusters. Make a nice table, a nice XY plot with those balloons, and go with the best option.
Quite useful explanation of how experiments work: [http://www.cs.tufts.edu/~ablumer/weka/README.html#Experiment package](http://www.cs.tufts.edu/~ablumer/weka/README.html#Experiment%20package)
Another nice tutorial: http://www.cs.utexas.edu/~mooney/cs391L/hw2/Experiments.pdf
Also useful (in Spanish): https://addi.ehu.es/bitstream/10810/4627/1/tr10-1.pdf
We can try with k = sqrt(n/2) [Kanti Mardia et al. (1979). Multivariate Analysis].
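As a quick sanity check on that rule of thumb, here is a minimal sketch of the k = sqrt(n/2) formula (the dataset sizes in the example are made up, not ours):

```java
// Rule-of-thumb cluster count k = sqrt(n/2), from Mardia et al. (1979).
public class RuleOfThumbK {
    static long ruleOfThumbK(int n) {
        return Math.round(Math.sqrt(n / 2.0));
    }

    public static void main(String[] args) {
        // Hypothetical sizes: n = 100 suggests k = 7, n = 128 suggests k = 8.
        System.out.println(ruleOfThumbK(100));
        System.out.println(ruleOfThumbK(128));
    }
}
```

So for a dataset somewhere in the low hundreds of instances, the rule lands around 7-8 clusters, which is where the "just to be safe" range below comes from.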
Things learned so far:
Weka reads from a Producer and outputs to a Listener. In our case, the listener is the file `results.arff`.
Weka offers the following ResultProducers:
Now, those producers rely on evaluators to assess how good our clustering is:
But we can turn any clusterer into a density-based one with MakeDensityBasedClusterer.
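The wrapping step above looks roughly like this. A sketch only: it assumes weka.jar is on the classpath, and `data.arff` is a hypothetical path standing in for our dataset:

```java
// Sketch: wrap a non-density clusterer (SimpleKMeans here) in
// MakeDensityBasedClusterer so it can be scored by log likelihood.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DensityWrapDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);

        // The wrapper fits a distribution per cluster, which is what
        // gives us a log likelihood comparable to EM's.
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(kmeans);
        density.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(density);
        eval.evaluateClusterer(data);
        System.out.println("Log likelihood: " + eval.getLogLikelihood());
    }
}
```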
As for the clusterers, we have:
DBSCAN and OPTICS are not correctly implemented in Weka; they can't `clusterInstance` (explained here: http://stackoverflow.com/questions/7453131/how-to-cluster-an-instance-with-wekas-dbscan).
Okay, I understand HierarchicalClusterer now: the number of clusters is the number of "root" nodes. So even though Weka only has the agglomerative (not divisive) variant implemented, we can use it.
RandomSplit is much quicker than CrossValidation, but on small datasets like ours it can be misleading and biased, so let's go with CrossValidation.
The path is now:
(AveragingResultProducer) > CrossValidationResultProducer > DensityBasedClustererSplitEvaluator
And then the clusterers to evaluate will be:
EM; and, wrapped in MakeDensityBasedClusterer: FarthestFirst, HierarchicalClusterer, sIB, SimpleKMeans, XMeans.
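The Experimenter pipeline above can be set up programmatically too. This is a rough configuration sketch, not our exact setup: it assumes weka.jar on the classpath, uses `results.arff` as the listener file as above, and `data.arff` is a hypothetical input path:

```java
// Sketch of the pipeline:
// AveragingResultProducer > CrossValidationResultProducer
//                         > DensityBasedClustererSplitEvaluator.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.EM;
import weka.experiment.AveragingResultProducer;
import weka.experiment.CrossValidationResultProducer;
import weka.experiment.DensityBasedClustererSplitEvaluator;
import weka.experiment.Experiment;
import weka.experiment.InstancesResultListener;

public class ClusterExperiment {
    public static void main(String[] args) throws Exception {
        Experiment exp = new Experiment();

        // The split evaluator scores each fold by log likelihood.
        DensityBasedClustererSplitEvaluator se =
            new DensityBasedClustererSplitEvaluator();
        se.setClusterer(new EM());

        CrossValidationResultProducer cv = new CrossValidationResultProducer();
        cv.setNumFolds(10);
        cv.setSplitEvaluator(se);

        // Average the per-fold results before they reach the listener.
        AveragingResultProducer avg = new AveragingResultProducer();
        avg.setResultProducer(cv);
        exp.setResultProducer(avg);

        // The listener writes everything to results.arff.
        InstancesResultListener listener = new InstancesResultListener();
        listener.setOutputFile(new java.io.File("results.arff"));
        exp.setResultListener(listener);

        javax.swing.DefaultListModel datasets = new javax.swing.DefaultListModel();
        datasets.addElement(new java.io.File("data.arff"));
        exp.setDatasets(datasets);

        exp.initialize();
        exp.runExperiment();
        exp.postProcess();
    }
}
```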
First we make a run of EM with nClusters set to -1 and XMeans with nClusters ranging from 2 to 10 (Kanti Mardia's rule would say 7-8, but just to be safe). I think they'll settle around 3 and 5 respectively.
Then we pick each of those clusters from above and run them in the fixed number of clusters 2, 3, 4, 5, and 6. And compare their log likelihoods and make decisions.
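That fixed-k comparison step could look something like the sketch below (again assuming weka.jar on the classpath and a hypothetical `data.arff`; only EM is shown, the wrapped clusterers would follow the same loop):

```java
// Sketch: fit EM with a fixed number of clusters from 2 to 6 and
// print the log likelihood of each fit, so the k values can be compared.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareK {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        for (int k = 2; k <= 6; k++) {
            EM em = new EM();
            em.setNumClusters(k); // -1 would instead let EM pick k itself
            em.buildClusterer(data);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(em);
            eval.evaluateClusterer(data);
            System.out.println("k=" + k + " logL=" + eval.getLogLikelihood());
        }
    }
}
```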
EM returned an average of 4.40 clusters, and XMeans 2.20. Running the experiment with 2, 3, 4, 5 and 6 clusters.
Results show that EM with 4, 5, or 6 clusters yields the highest log likelihood.
2D Graphs reveal:
3D graphs show... some weird things. Apparently everything increases when everything else increases. Dunno.
Correlation is not really important: this issue has helped us identify the number of clusters (could be 4, 5, or 6, but we picked 4 for simplicity) and learn a lot about using Weka. The data is there and we'll have to roll with it.
Now, to identify the meaning of each cluster... well, that's why we've created #73.
Ok, so if we draw graphs for two clusters, for three, and for four, what can we learn from their correlations? Can we pick a cluster number based on that? And what does EM settle on, if left free to pick the number itself?
Continues from #59.