Closed xurxodiz closed 12 years ago
Let's finish this thing already, I'm confusing myself like a Pokémon.
Open Weka, compare the log likelihood of FarthestFirst, HierarchicalCluster, sIB, XMeans, SimpleKMeans and EM for two, three, four (and five?) clusters. Make a nice table, a nice XY plot with those balloons, and go with the best option.
Quite useful explanation of how experiments work: [http://www.cs.tufts.edu/~ablumer/weka/README.html#Experiment package](http://www.cs.tufts.edu/~ablumer/weka/README.html#Experiment%20package)
Another nice tutorial: http://www.cs.utexas.edu/~mooney/cs391L/hw2/Experiments.pdf
Also useful (in Spanish): https://addi.ehu.es/bitstream/10810/4627/1/tr10-1.pdf
We can try with k = sqrt(n/2) [Kanti Mardia et al. (1979). Multivariate Analysis].
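As a quick sanity check on that rule of thumb, here is a minimal sketch of the k = sqrt(n/2) formula (the dataset sizes in the example are made up, not ours):

```java
// Rule-of-thumb cluster count k = sqrt(n/2), from Mardia et al. (1979).
public class RuleOfThumbK {
    static long ruleOfThumbK(int n) {
        return Math.round(Math.sqrt(n / 2.0));
    }

    public static void main(String[] args) {
        // Hypothetical sizes: n = 100 suggests k = 7, n = 128 suggests k = 8.
        System.out.println(ruleOfThumbK(100));
        System.out.println(ruleOfThumbK(128));
    }
}
```

So for a dataset somewhere in the low hundreds of instances, the rule lands around 7-8 clusters, which is where the "just to be safe" range below comes from.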
Things learned so far:
Weka reads from a Producer and outputs to a Listener. In our case, the listener is the file `results.arff`.
Weka offers the following ResultProducers:
Now, those producers rely on evaluators to assess how good our clustering is:
But we can turn any clusterer into a density-based one with MakeDensityBasedClusterer.
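The wrapping step above looks roughly like this. A sketch only: it assumes weka.jar is on the classpath, and `data.arff` is a hypothetical path standing in for our dataset:

```java
// Sketch: wrap a non-density clusterer (SimpleKMeans here) in
// MakeDensityBasedClusterer so it can be scored by log likelihood.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DensityWrapDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);

        // The wrapper fits a distribution per cluster, which is what
        // gives us a log likelihood comparable to EM's.
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(kmeans);
        density.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(density);
        eval.evaluateClusterer(data);
        System.out.println("Log likelihood: " + eval.getLogLikelihood());
    }
}
```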
As for the clusterers, we have:
DBSCAN and OPTICS are not correctly implemented in Weka; they can't `clusterInstance` (explained here: http://stackoverflow.com/questions/7453131/how-to-cluster-an-instance-with-wekas-dbscan).
Okay, I understand HierarchicalClusterer now: the number of clusters is the number of "root" nodes. So even though Weka only has the agglomerative (not divisive) variant implemented, we can use it.
RandomSplit is much quicker than CrossValidation, but on small datasets like ours it can be misleading and biased, so let's go with CrossValidation.
The path is now:
(AveragingResultProducer) > CrossValidationResultProducer > DensityBasedClustererSplitEvaluator
And then the clusterers to evaluate will be:
EM; and, wrapped in MakeDensityBasedClusterer: FarthestFirst, HierarchicalClusterer, sIB, SimpleKMeans, XMeans.
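The Experimenter pipeline above can be set up programmatically too. This is a rough configuration sketch, not our exact setup: it assumes weka.jar on the classpath, uses `results.arff` as the listener file as above, and `data.arff` is a hypothetical input path:

```java
// Sketch of the pipeline:
// AveragingResultProducer > CrossValidationResultProducer
//                         > DensityBasedClustererSplitEvaluator.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.EM;
import weka.experiment.AveragingResultProducer;
import weka.experiment.CrossValidationResultProducer;
import weka.experiment.DensityBasedClustererSplitEvaluator;
import weka.experiment.Experiment;
import weka.experiment.InstancesResultListener;

public class ClusterExperiment {
    public static void main(String[] args) throws Exception {
        Experiment exp = new Experiment();

        // The split evaluator scores each fold by log likelihood.
        DensityBasedClustererSplitEvaluator se =
            new DensityBasedClustererSplitEvaluator();
        se.setClusterer(new EM());

        CrossValidationResultProducer cv = new CrossValidationResultProducer();
        cv.setNumFolds(10);
        cv.setSplitEvaluator(se);

        // Average the per-fold results before they reach the listener.
        AveragingResultProducer avg = new AveragingResultProducer();
        avg.setResultProducer(cv);
        exp.setResultProducer(avg);

        // The listener writes everything to results.arff.
        InstancesResultListener listener = new InstancesResultListener();
        listener.setOutputFile(new java.io.File("results.arff"));
        exp.setResultListener(listener);

        javax.swing.DefaultListModel datasets = new javax.swing.DefaultListModel();
        datasets.addElement(new java.io.File("data.arff"));
        exp.setDatasets(datasets);

        exp.initialize();
        exp.runExperiment();
        exp.postProcess();
    }
}
```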
First we make a run of EM with nClusters set to -1 and XMeans with nClusters ranging from 2 to 10 (Kanti Mardia's rule would say 7-8, but just to be safe). I think they'll settle around 3 and 5 respectively.
Then we pick each of those clusters from above and run them in the fixed number of clusters 2, 3, 4, 5, and 6. And compare their log likelihoods and make decisions.
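That fixed-k comparison step could look something like the sketch below (again assuming weka.jar on the classpath and a hypothetical `data.arff`; only EM is shown, the wrapped clusterers would follow the same loop):

```java
// Sketch: fit EM with a fixed number of clusters from 2 to 6 and
// print the log likelihood of each fit, so the k values can be compared.
// Assumes weka.jar on the classpath; "data.arff" is a hypothetical path.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareK {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        for (int k = 2; k <= 6; k++) {
            EM em = new EM();
            em.setNumClusters(k); // -1 would instead let EM pick k itself
            em.buildClusterer(data);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(em);
            eval.evaluateClusterer(data);
            System.out.println("k=" + k + " logL=" + eval.getLogLikelihood());
        }
    }
}
```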
EM returned an average of 4.40 clusters, and XMeans 2.20. Running the experiment with 2, 3, 4, 5 and 6 clusters.
Results show that EM with 4, 5, or 6 clusters yields the highest log likelihood.
2D Graphs reveal:
3D graphs show... some weird things. Apparently everything increases when everything else increases. Dunno.
Correlation is not really important: this issue has helped us identify the number of clusters (could be 4, 5, or 6, but we picked 4 for simplicity) and learn a lot about using Weka. The data is there and we'll have to roll with it.
Now, to identify the meaning of each cluster... well, that's why we've created #73.
Ok, so if we draw graphs for two clusters, for three, and for four, what can we learn from their correlations? Can we pick a cluster number based on that? And what does EM settle on, if left free to pick the number itself?
Continues from #59.