xdvom03 / klaus

Bayesian text classification of websites in a nested class system
Creative Commons Zero v1.0 Universal

The pair explainer contains too much information #109

Open xdvom03 opened 3 years ago

xdvom03 commented 3 years ago

Sites often have ~100 keywords. That may be too many anyway, since it shifts the focus from keyword strength to sheer word count (one group will devolve into random words). Choosing words more carefully (#73, #74) is the way forward. For example, this should end up as food, not physics; maybe we need word pairs here, in which case we must recognize the word pair as a single token.
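A minimal sketch of what treating a word pair as a single token could look like (the function and the example pair are hypothetical, not taken from klaus):

```python
def tokenize_with_pairs(text, known_pairs):
    """Split text into words, then merge adjacent words that form a
    known pair (e.g. "dark matter") into one token "dark_matter"."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in known_pairs:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2  # consume both words of the pair
        else:
            tokens.append(words[i])
            i += 1
    return tokens

# The pair ("dark", "matter") is then counted and scored as one keyword.
print(tokenize_with_pairs("dark matter recipes", {("dark", "matter")}))
# -> ['dark_matter', 'recipes']
```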

However, it's very hard to make any educated guesses here because the interface is too debug-oriented. We don't usually need the word counts, just the scores, and even listing the words might be too lengthy. Maybe a graph would be better, with individual words visible on mouseover or via a toggle. This would show whether the evidence sits at the beginning or at the end of the distribution.
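A rough sketch of the kind of graph meant here, assuming the per-word scores are already available as a dict (the names and the log-odds scale are assumptions, not the project's actual API):

```python
import matplotlib.pyplot as plt

def plot_word_scores(word_scores):
    """Plot per-word evidence sorted by strength, so it is visible at a
    glance whether the evidence is concentrated in a few strong words
    (start of the curve) or spread across many weak ones (long tail)."""
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    words, scores = zip(*ranked)
    plt.bar(range(len(scores)), scores)
    plt.xlabel("word rank")
    plt.ylabel("evidence score (log-odds)")
    plt.title("Per-word evidence, strongest first")
    # Mouseover / toggle labels for individual words would go here
    # in an interactive UI; this static version shows only the shape.
    plt.show()
```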

xdvom03 commented 3 years ago

Semi-related speculation: keyword diversity must be accounted for. Misclassifications typically come from a bunch of closely related terms that aren't really important in a broader context (such as fiction here). Diversity might mean "belonging to multiple subclasses", since that's a decent proxy for the meaning of a word. This is of course predicated on a good class structure and enough subclasses; in particular, it will not work if the subclasses don't share many keywords.
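One way the "belonging to multiple subclasses" idea could be made concrete, sketched under the assumption that each subclass has a known keyword set (both function names are hypothetical):

```python
def keyword_diversity(word, subclass_keywords):
    """Diversity of a keyword = fraction of subclasses whose keyword set
    contains it. A term that only shows up in one narrow subclass
    (e.g. a cluster of fiction-specific words) gets a low score."""
    containing = sum(1 for kws in subclass_keywords.values() if word in kws)
    return containing / len(subclass_keywords)

def diversity_weighted_score(word, raw_score, subclass_keywords):
    """Down-weight the raw score of narrow, low-diversity terms."""
    return raw_score * keyword_diversity(word, subclass_keywords)
```

As noted above, this only behaves sensibly when the class structure is good and the subclasses actually share keywords; with disjoint keyword sets every word would get the minimum diversity.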