xdvom03 / klaus

Bayesian text classification of websites in a nested class system
Creative Commons Zero v1.0 Universal

Read PG original spam articles and identify other mistakes that may have been carried over #24

Closed xdvom03 closed 4 years ago

xdvom03 commented 4 years ago

http://paulgraham.com/spam.html

There are at least these questionable parts:

xdvom03 commented 4 years ago

Unknown words get a completely arbitrary 0.4 (which is basically never applied anyway, because we only look at the most interesting words). I don't do that; unknown words get 0.5 (in pair fights). Obviously, there are no word interdependencies either way.
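To make the difference concrete, here is a minimal sketch of a Graham-style "pair fight": per-word probabilities for the two classes are multiplied together and normalized, and a word missing from the training data gets a neutral 0.5 (so it cancels out) rather than Graham's arbitrary 0.4. The function and variable names are hypothetical, not taken from the klaus codebase.

```python
def pair_fight(words, word_probs):
    """Combine per-word probabilities P(class A | word) for a pair fight.

    word_probs: dict mapping word -> P(class A | word).
    Unknown words default to a neutral 0.5 (NOT Graham's 0.4),
    so they contribute nothing to either side.
    """
    prod_a = 1.0  # product of P(A | word)
    prod_b = 1.0  # product of P(B | word) = 1 - P(A | word)
    for w in words:
        p = word_probs.get(w, 0.5)  # neutral default for unseen words
        prod_a *= p
        prod_b *= 1.0 - p
    # Normalized combined probability of class A
    return prod_a / (prod_a + prod_b)
```

With a neutral 0.5 default, a document containing only unknown words comes out at exactly 0.5, i.e. the fight is a draw, whereas a 0.4 default would systematically drag it toward one class.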

It seems that that's it. Should have done this sooner.

xdvom03 commented 4 years ago

Dividing by document count instead of word count was also a mistake, but it wasn't caught here. :(
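As I read this comment, the bug was in estimating per-class word frequencies: the occurrence count should be divided by the total number of words in the class corpus, not by the number of documents. A minimal sketch of the corrected estimate (names are hypothetical, not from the klaus codebase):

```python
def word_freq(word, docs):
    """Estimate P(word | class) from a class corpus.

    docs: list of documents, each a list of tokens, all from one class.
    Correct denominator: total word count of the corpus.
    The buggy version divided by len(docs), the document count.
    """
    occurrences = sum(doc.count(word) for doc in docs)
    total_words = sum(len(doc) for doc in docs)
    return occurrences / total_words if total_words else 0.0
```

Dividing by document count inflates every frequency by roughly the average document length, which distorts comparisons between classes whose documents differ in length.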