Open xdvom03 opened 4 years ago
Wikipedia offers full database dumps, but they aren't very useful to us because we want one page at a time. It can also serve boilerplate-free articles, for example: https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering?action=render. It also allows one to download a single page as XML (https://en.wikipedia.org/wiki/Special:Export/Naive_Bayes_spam_filtering), which we can handle with:
```lisp
(cxml:parse-octets
 (drakma:http-request "https://en.wikipedia.org/wiki/Special:Export/Naive_Bayes_spam_filtering")
 (cxml-xmls:make-xmls-builder))
```
This will require additional parsing to get relevant text out in the general case. Once we solve this, we must be careful not to let Wikipedia overwhelm certain classes - it still has a style. But it might allow for some interesting Wikipedia-only test cases.
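A rough sketch of that extra parsing step, in case it helps. It assumes the builder returns xmls-style lists of the form `(name attributes . children)`, where `name` is either a string or a `(local-name . namespace-uri)` cons, and that the wikitext lives in the export's `text` element. The helper names (`node-local-name`, `find-element`, `element-text`) are made up for illustration and are not part of cxml.

```lisp
;; Assumption: nodes look like (name attributes . children); text children
;; are plain strings; name is a string or a (local-name . namespace-uri) cons.
(defun node-local-name (node)
  "Return the element name of NODE without its namespace."
  (let ((name (first node)))
    (if (consp name) (car name) name)))

(defun find-element (node wanted)
  "Depth-first search for the first element whose local name is WANTED."
  (cond ((not (consp node)) nil)                       ; skip text children
        ((equal (node-local-name node) wanted) node)
        (t (some (lambda (child) (find-element child wanted))
                 (cddr node)))))                       ; cddr skips name and attributes

(defun element-text (node)
  "Concatenate the direct string children of NODE."
  (with-output-to-string (out)
    (dolist (child (cddr node))
      (when (stringp child)
        (write-string child out)))))
```

With the parsed tree bound to a hypothetical variable `tree`, `(element-text (find-element tree "text"))` would give the raw wikitext. That is still markup rather than clean prose, so templates and link syntax would need stripping afterwards.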
Wikipedia allows for easier ways to get article content. We don't particularly want all the boilerplate around the article to influence classing, anyway.
Requires solving #11 first, for the link format is different.
https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler