Open bricehalder opened 2 years ago
Idea, not sure how this would work in practice; we can manually pin certain articles as landmarks of specific topic areas ("World war II" article for world war 2 related topics, "Mammal" for animal related topics, etc...). Then for every article we can find the distance of that article with the set of topic landmarks we have, and sort of categorize them this way
The only concern I have is since wikipedia is so interconnected, topically very different articles could have random extremely short paths between them, which could screw up categorization by looking at distance purely. Solution could be to find the shortest N paths between some article and the landmark, and use the cumulative distance
^ This would only be categorization based on topic rather than type (page about a person vs. place vs. country vs. historical event vs. abstract concept, etc.). If we can also find some sort of pattern/label on the wikipedia API, pairing with ^ we'd be able to catogorize an article by both type and topic
For example, the article for FDR can then be categorized as a page about a US History/WW2 person, etc.
Another idea: find some way to parse the Wikipedia categories: https://en.wikipedia.org/wiki/Help:Categories
Yeah the wikipedia categories definitely seem useful. I think in general, we should dig around wikipedia API a bit more.
As for potentially trying to categorize things ourselves, I have a few comments
Some relevant papers
We want to eventually support categories of prompts (also this has the added benefit of being able to remove categories of prompts e.g. prompts related to Nazi Germany).
Not sure if there is a way to leverage Wikipedia API or if we have to run some modeling on our graph