Design dilemma:
In the original spam filter, class membership was a simple yes/no question. Here, a single document often contains keywords that point to two or more possible classes. For example, consider this link: it contains keywords for STEM ("circuit", "electrical"), computer science ("user", "design"), and society ("justice", "born"). That is to be expected, since the other topics are mentioned as examples. Yet the decision about which class should prevail comes down mostly to sheer numbers: the class with the most matching keywords wins. Context deserves some weight too: the STEM and computer science words are not explored in any depth, just thrown in as examples. One way to capture context is to check for sequences of two words (bigrams). Presumably this is moot until more data is acquired.
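To make the dilemma concrete, here is a minimal sketch of the "most keywords wins" vote alongside a two-word sequence extractor. The keyword lists reuse only the words quoted above; the class labels, function names, and the simple punctuation-stripping tokenizer are assumptions for illustration, not the project's actual code.

```python
from collections import Counter

# Hypothetical keyword lists, built only from the example words quoted above.
KEYWORDS = {
    "stem": {"circuit", "electrical"},
    "computer_science": {"user", "design"},
    "society": {"justice", "born"},
}

PUNCTUATION = '.,;:!?"()'


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def class_counts(tokens):
    """Count how many tokens match each class's keyword list."""
    counts = Counter()
    for label, words in KEYWORDS.items():
        counts[label] = sum(1 for token in tokens if token in words)
    return counts


def predict(tokens):
    """Pick the class with the most keyword hits (ties go to the first class listed)."""
    counts = class_counts(tokens)
    return max(counts, key=counts.get)


def bigrams(tokens):
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))
```

A document that mentions "circuit" once in passing and "justice" twice is labelled society purely on counts. The pairs from `bigrams` could be counted as features in the same way as single words, but as noted above, they are much sparser and probably need more data to be useful.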
Another option is to measure correlation between words: check whether a keyword's usual accompanying words also appear in the document, and give the keyword less weight if they do not. However, with little data this risks amplifying random noise.
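One possible reading of the correlation idea is sketched below, using plain document-level co-occurrence counts as a stand-in for correlation. All function names, the `top_n` cutoff, and the `penalty` floor are hypothetical choices made for illustration; nothing here is taken from an existing implementation.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(documents):
    """For each word, count how many documents it shares with every other word."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def usual_companions(word, pair_counts, top_n=3):
    """Return the top_n words that most often appear alongside `word`."""
    neighbours = pair_counts.get(word, {})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(neighbours, key=lambda w: (-neighbours[w], w))
    return ranked[:top_n]


def keyword_weight(word, tokens, pair_counts, top_n=3, penalty=0.5):
    """Weight in [penalty, 1.0]: lower when the usual companions are absent."""
    companions = usual_companions(word, pair_counts, top_n)
    if not companions:
        return 1.0  # no co-occurrence evidence, so leave the keyword at full weight
    present_tokens = set(tokens)
    present = sum(1 for c in companions if c in present_tokens)
    support = present / len(companions)
    return penalty + (1.0 - penalty) * support
```

With a small training set, the "usual companions" of a word are themselves drawn from very few documents, so a down-weighting like this can end up rewarding coincidences. That is the noise risk mentioned above.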