xdvom03 / klaus

Bayesian text classification of websites in a nested class system
Creative Commons Zero v1.0 Universal

Special case for downloading Wikipedia content #28

Open xdvom03 opened 3 years ago

xdvom03 commented 3 years ago

Wikipedia offers easier ways to get article content than scraping the rendered page. We don't particularly want all the boilerplate around the article to influence classing anyway.

Requires solving #11 first, since the link format is different.

https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
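
A minimal sketch of how the special case could hook into URL handling, assuming article links of the form https://en.wikipedia.org/wiki/<title>; both helper names are hypothetical, not project code, and the target format is settled below:

(defun wikipedia-article-p (url)
  "True if URL points at a Wikipedia article page."
  (search "wikipedia.org/wiki/" url))

(defun wikipedia-export-url (url)
  "Rewrite .../wiki/<title> into .../wiki/Special:Export/<title>."
  (let ((pos (+ (search "/wiki/" url) (length "/wiki/"))))
    (concatenate 'string
                 (subseq url 0 pos)
                 "Special:Export/"
                 (subseq url pos))))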

xdvom03 commented 3 years ago

Wikipedia offers full database dumps, but those aren't too useful to us, since we want one page at a time. It can also serve boilerplate-free article HTML, for example: https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering?action=render. And it lets us export an article as XML (https://en.wikipedia.org/wiki/Special:Export/Naive_Bayes_spam_filtering), which we can handle with:

(cxml:parse-octets (drakma:http-request "https://en.wikipedia.org/wiki/Special:Export/Naive_Bayes_spam_filtering"
                                        :force-binary t) ; Drakma would otherwise decode text/xml into a string, but PARSE-OCTETS needs octets
                   (cxml-xmls:make-xmls-builder))
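
The XMLS builder returns nested lists of the form (name attributes . children), where a name is either a plain string or a (local-name . namespace-uri) cons depending on the builder options. A minimal sketch of digging the raw wikitext out of that tree (FIND-WIKITEXT is a hypothetical helper, not project code):

(defun node-local-name (name)
  "Return the local part of an XMLS node name."
  (if (consp name) (car name) name))

(defun find-wikitext (node)
  "Depth-first search for the <text> element; return its string contents."
  (when (consp node)
    (if (equal (node-local-name (first node)) "text")
        ;; the element's children are the wikitext, possibly split into chunks
        (format nil "~{~a~}" (remove-if-not #'stringp (cddr node)))
        (some #'find-wikitext (cddr node)))))

Calling FIND-WIKITEXT on the parse result should hand back the article body as wikitext, still full of markup.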

This will require additional parsing to get the relevant text out in the general case (see the sketch below). Once we solve this, we must be careful not to let Wikipedia overwhelm certain classes, as it still has a house style of its own. But it might also allow for some interesting Wikipedia-only test cases.
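
As a rough sketch of that additional parsing, assuming CL-PPCRE is available: strip the most common wiki markup before feeding the text to the classifier. Real wikitext is nested (templates inside templates, and so on), so this is deliberately crude and would eventually need a proper parser.

(defun strip-wiki-markup (wikitext)
  "Crudely remove the most common wikitext markup."
  (let ((text wikitext))
    ;; [[target|label]] -> label, [[target]] -> target
    (setf text (cl-ppcre:regex-replace-all "\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]" text "\\1"))
    ;; drop non-nested templates such as {{citation needed}}
    (setf text (cl-ppcre:regex-replace-all "\\{\\{[^}]*\\}\\}" text ""))
    ;; drop bold/italic quote runs ('' and ''')
    (setf text (cl-ppcre:regex-replace-all "''+" text ""))
    text))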