xurxodiz / MarioLevels

My entry for the Mario AI Championship 2012, implementing adaptive procedural level generation. Holds 1st place in the all-time points and percentage tallies.
MIT License

Interpret clustering graphs #67

Closed by xurxodiz 12 years ago

xurxodiz commented 12 years ago

OK, so if we draw graphs for two clusters, for three, and for four, what can we learn from their correlations? Can we pick a cluster number based on that? What does EM deduce, if left free to pick a number itself?

Continues from #59.

xurxodiz commented 12 years ago

Let's finish this thing already; I'm confusing myself like a Pokémon.

Open Weka, compare the log likelihood of FarthestFirst, HierarchicalClusterer, sIB, XMeans, SimpleKMeans, and EM for two, three, four (and five?) clusters. Make a nice table and a nice XY plot with those balloons, and go with the best option.
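For a single clusterer, that comparison can be scripted rather than clicked through. A minimal sketch, assuming weka.jar on the classpath; levels.arff is a hypothetical dataset name, and EM stands in for any density-based clusterer:

```java
import java.util.Random;

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareLogLikelihood {
    public static void main(String[] args) throws Exception {
        // hypothetical file name; substitute the real ARFF
        Instances data = DataSource.read("levels.arff");

        for (int k = 2; k <= 5; k++) {
            EM em = new EM();
            em.setNumClusters(k);
            // 10-fold cross-validated log likelihood; higher is better
            double logL = ClusterEvaluation.crossValidateModel(
                    em, data, 10, new Random(1));
            System.out.println("k = " + k + "  logL = " + logL);
        }
    }
}
```

The clusterers that aren't density-based (FarthestFirst, SimpleKMeans, etc.) would need wrapping first; see MakeDensityBasedClusterer below.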

xurxodiz commented 12 years ago

Quite useful explanation of how experiments work: http://www.cs.tufts.edu/~ablumer/weka/README.html (the "Experiment package" section)

Another nice tutorial: http://www.cs.utexas.edu/~mooney/cs391L/hw2/Experiments.pdf

Also useful (in Spanish): https://addi.ehu.es/bitstream/10810/4627/1/tr10-1.pdf

xurxodiz commented 12 years ago

We can try k = sqrt(n/2) [Kanti Mardia et al. (1979). Multivariate Analysis]; e.g., for n ≈ 100 instances that rule gives k ≈ 7.

Things learned so far:

Weka reads from a Producer and outputs to a Listener. In our case, the listener writes to a file, results.arff

Weka offers the following ResultProducers:

Now, those producers rely on evaluators to assess how good our clustering is:

But we can turn any clusterer into a density-based one with MakeDensityBasedClusterer.
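As a sketch of what that wrapping looks like (SimpleKMeans as the wrapped clusterer and the levels.arff file name are just assumptions):

```java
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapClusterer {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("levels.arff"); // hypothetical file name

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);

        // the wrapper fits a distribution per cluster, so the result
        // can report densities (and therefore a log likelihood)
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(kMeans);
        density.buildClusterer(data);
    }
}
```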

As for the clusterers, we have:

xurxodiz commented 12 years ago

DBSCAN and OPTICS are not correctly implemented in Weka; they can't clusterInstance (explained here: http://stackoverflow.com/questions/7453131/how-to-cluster-an-instance-with-wekas-dbscan).

xurxodiz commented 12 years ago

Okay, I understand HierarchicalClusterer now: the number of clusters is the number of "root" nodes left after cutting the dendrogram. So even though Weka only implements the agglomerative (not divisive) variant, we can use it.
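A quick sketch of what that means in practice (levels.arff is a hypothetical dataset name):

```java
import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CutDendrogram {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("levels.arff"); // hypothetical file name

        HierarchicalClusterer hc = new HierarchicalClusterer();
        // stop merging when 3 root subtrees remain, i.e. the requested
        // "number of clusters" is the number of roots that are kept
        hc.setNumClusters(3);
        hc.buildClusterer(data);
    }
}
```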

xurxodiz commented 12 years ago

RandomSplit is a much quicker method than CrossValidation, but on small datasets like ours it can be deceptive and biased, so let's go with CrossValidation.

The path is now:

(AveragingResultProducer) > CrossValidationResultProducer > DensityBasedClustererSplitEvaluator

And the clusterers plugged into that evaluator will be

EM directly; and, through MakeDensityBasedClusterer: FarthestFirst, HierarchicalClusterer, sIB, SimpleKMeans, XMeans.
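Wired up programmatically, that path looks roughly like the sketch below (against the weka.experiment API; the dataset name, fold count, and run count are assumptions):

```java
import java.io.File;
import javax.swing.DefaultListModel;

import weka.clusterers.EM;
import weka.experiment.AveragingResultProducer;
import weka.experiment.CrossValidationResultProducer;
import weka.experiment.DensityBasedClustererSplitEvaluator;
import weka.experiment.Experiment;
import weka.experiment.InstancesResultListener;

public class ClusterExperiment {
    public static void main(String[] args) throws Exception {
        // innermost: judge each fold by log likelihood
        DensityBasedClustererSplitEvaluator evaluator =
                new DensityBasedClustererSplitEvaluator();
        evaluator.setClusterer(new EM());

        // middle: 10-fold cross-validation feeding the evaluator
        CrossValidationResultProducer cv = new CrossValidationResultProducer();
        cv.setNumFolds(10);
        cv.setSplitEvaluator(evaluator);

        // outermost: average the folds into a single result row
        AveragingResultProducer avg = new AveragingResultProducer();
        avg.setResultProducer(cv);

        Experiment exp = new Experiment();
        exp.setResultProducer(avg);
        exp.setRunLower(1);
        exp.setRunUpper(10);

        DefaultListModel datasets = new DefaultListModel();
        datasets.addElement(new File("levels.arff")); // hypothetical file name
        exp.setDatasets(datasets);

        // the Listener: results end up in results.arff
        InstancesResultListener listener = new InstancesResultListener();
        listener.setOutputFile(new File("results.arff"));
        exp.setResultListener(listener);

        exp.initialize();
        exp.runExperiment();
        exp.postProcess();
    }
}
```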

First we make a run of EM with numClusters set to -1 and XMeans with numClusters ranging from 2 to 10 (Kanti Mardia's rule would say 7-8, but just to be safe). I think they'll settle around 3 and 5 respectively.
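In code, that first run maps to something like this (a sketch; the property names come from Weka's EM and XMeans classes):

```java
import weka.clusterers.EM;
import weka.clusterers.XMeans;

public class FirstRunConfig {
    public static void main(String[] args) throws Exception {
        // -1 lets EM pick the number of clusters itself
        // (internally, via cross-validation)
        EM em = new EM();
        em.setNumClusters(-1);

        // XMeans instead searches within an explicit range
        XMeans xMeans = new XMeans();
        xMeans.setMinNumClusters(2);
        xMeans.setMaxNumClusters(10);
    }
}
```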

Then we take each of those clusterers from above and run them with the number of clusters fixed at 2, 3, 4, 5, and 6, compare their log likelihoods, and make decisions.

xurxodiz commented 12 years ago

EM returned an average of 4.40 clusters, and XMeans 2.20. Running the experiment with 2, 3, 4, 5 and 6 clusters.

xurxodiz commented 12 years ago

Results show that EM with 4, 5, and 6 clusters yields the highest log likelihood.

xurxodiz commented 12 years ago

2D Graphs reveal:

3D graphs show... some weird things. Apparently everything increases when everything else does; there isn't much to read into that.

xurxodiz commented 12 years ago

Correlation is not really that important: this issue has helped us identify the number of clusters (it could be 4, 5, or 6, but we picked 4 for simplicity) and learn a lot about using Weka. The data is there and we'll have to roll with it.

Now, to identify the meaning of each cluster... well, that's why we've created #73.