wikispeedruns / wikipedia-speedruns

Source code for Wikipedia Speedruns!
https://wikispeedruns.com
MIT License

Research how to use page topic/category #125

Open bricehalder opened 2 years ago

bricehalder commented 2 years ago

We eventually want to support categories of prompts. This has the added benefit of letting us remove whole categories of prompts (e.g. prompts related to Nazi Germany).

Not sure if there is a way to leverage the Wikipedia API for this, or if we have to run some modeling on our graph.

mliu59 commented 2 years ago

Idea, not sure how this would work in practice: we can manually pin certain articles as landmarks of specific topic areas (the "World War II" article for World War 2 related topics, "Mammal" for animal-related topics, etc.). Then for every article we can find its distance to each landmark in our set, and categorize articles that way.

The only concern I have is that since Wikipedia is so interconnected, topically very different articles could have extremely short paths between them by chance, which could screw up a categorization based purely on distance. A solution could be to find the shortest N paths between an article and the landmark, and use the cumulative distance.
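To make the landmark idea concrete, here is a minimal sketch of the simple single-distance version (not the shortest-N-paths variant). Everything in it is an assumption for illustration: the toy `LINKS` graph, the `LANDMARKS` mapping, and the function names are made up, links are treated as bidirectional, and ties are broken alphabetically by topic name.

```python
from collections import deque

# Toy link graph (hypothetical data); each link is listed in both directions.
LINKS = {
    "World War II": ["Franklin D. Roosevelt", "Battle of Midway"],
    "Franklin D. Roosevelt": ["World War II"],
    "Battle of Midway": ["World War II"],
    "Mammal": ["Dog", "Whale"],
    "Dog": ["Mammal"],
    "Whale": ["Mammal"],
}

# Hand-pinned landmark article for each topic (hypothetical choices).
LANDMARKS = {"WW2": "World War II", "Animals": "Mammal"}


def bfs_distances(graph, source):
    """Return the number of clicks from `source` to every reachable article."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        article = queue.popleft()
        for neighbor in graph.get(article, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[article] + 1
                queue.append(neighbor)
    return dist


def categorize(graph, landmarks):
    """Label each article with the topic of its nearest landmark.

    Articles that cannot be reached from any landmark are left unlabeled.
    """
    per_topic = {topic: bfs_distances(graph, article)
                 for topic, article in landmarks.items()}
    labels = {}
    for article in graph:
        candidates = [(dist[article], topic)
                      for topic, dist in per_topic.items() if article in dist]
        if candidates:
            # min() compares distance first, then topic name on ties.
            labels[article] = min(candidates)[1]
    return labels


labels = categorize(LINKS, LANDMARKS)
```

On the real graph, one BFS per landmark would be enough to label every article, so the cost should scale with the number of landmarks rather than the number of articles. The shortcut-path concern above would still apply to this version.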

mliu59 commented 2 years ago

^ This would only be categorization by topic rather than by type (page about a person vs. place vs. country vs. historical event vs. abstract concept, etc.). If we can also find some sort of pattern/label in the Wikipedia API, then paired with ^ we'd be able to categorize an article by both type and topic.

For example, the article for FDR could then be categorized as a page about a US History/WW2 person.

mliu59 commented 2 years ago

Another idea: find some way to parse the Wikipedia categories: https://en.wikipedia.org/wiki/Help:Categories
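
As a starting point, the MediaWiki Action API can list a page's categories via `action=query` with `prop=categories`. Below is a rough sketch using only the standard library. The function names, the User-Agent string, and `SAMPLE_RESPONSE` (a hand-written example of the response shape, not real data) are assumptions for illustration. It does not handle continuation (`clcontinue`) for pages with many categories.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_category_query(title):
    """Build the request URL for the non-hidden categories of one article."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",   # skip maintenance/tracking categories
        "cllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_categories(response):
    """Pull category names out of a decoded JSON response."""
    names = []
    for page in response.get("query", {}).get("pages", {}).values():
        for category in page.get("categories", []):
            name = category["title"]
            if name.startswith("Category:"):
                name = name[len("Category:"):]
            names.append(name)
    return names


def fetch_categories(title):
    """Query the live API (needs network access; not called in this sketch)."""
    request = Request(build_category_query(title),
                      headers={"User-Agent": "category-research-sketch/0.1"})
    with urlopen(request) as resp:
        return parse_categories(json.load(resp))


# Hand-written example of the response shape, for offline experimentation.
SAMPLE_RESPONSE = {
    "query": {"pages": {"1": {
        "title": "Example article",
        "categories": [{"ns": 14, "title": "Category:Example category"}],
    }}}
}
sample_names = parse_categories(SAMPLE_RESPONSE)
```

Since we already have a local dump of the link graph, it may turn out to be more practical to load category data from the dumps than to call the API per article, but that would need checking.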

dqian3 commented 2 years ago

Yeah, the Wikipedia categories definitely seem useful. I think in general we should dig around the Wikipedia API a bit more.

As for potentially trying to categorize things ourselves, I have a few comments:

  1. The first use case, and maybe the easiest to define, would be as a heuristic for our shortest path algorithm (maybe using similarity/overlap in categories in some sort of A* search)
  2. For prompt generation, I think we need to define what makes a "good" prompt. For example, we have said that a good target prompt has lots of incoming links. What about the relationship between start and end prompts? What about the
  3. The first thought in my mind was something like word embeddings, which I learned about in NLP. They encode words as vectors, and similar words are closer together in the vector space. In fact, it seems like some pretrained models like this already exist for Wikipedia articles: https://wikipedia2vec.github.io/wikipedia2vec/intro/
  4. Even if we don't like the models, or they don't fit our use cases perfectly, one thing we can do is fine-tune them: use them as a starting point and train on our own data.
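
To illustrate point 1, here is a small sketch of A* over the link graph with a category-overlap heuristic. It is not our actual shortest path code: the toy `links` and `categories` data and the function names are made up for the example. The heuristic is the Jaccard distance between category sets, which stays in [0, 1]; since every click costs 1, it never overestimates the remaining distance, so the path found is still a shortest one. The flip side is that it is a weak heuristic that mostly acts as a tie-breaker.

```python
import heapq


def jaccard_distance(a, b):
    """1 minus (shared categories / all categories); 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def guided_search(links, categories, start, goal):
    """A* search where each click costs 1 and category overlap guides expansion."""
    goal_cats = categories.get(goal, set())

    def h(article):
        return jaccard_distance(categories.get(article, set()), goal_cats)

    frontier = [(h(start), 0, start)]   # entries are (cost + heuristic, cost, article)
    best_cost = {start: 0}
    parent = {start: None}
    while frontier:
        _, cost, article = heapq.heappop(frontier)
        if article == goal:
            path = []
            while article is not None:
                path.append(article)
                article = parent[article]
            return path[::-1]
        if cost > best_cost[article]:
            continue                    # stale queue entry
        for nxt in links.get(article, ()):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = article
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None                         # goal not reachable from start


# Toy data (hypothetical articles and categories).
links = {
    "Dog": ["Mammal", "Pet"],
    "Pet": ["Cat"],
    "Mammal": ["Whale"],
    "Whale": ["Ocean"],
    "Cat": [],
    "Ocean": [],
}
categories = {
    "Dog": {"Animals", "Pets"},
    "Pet": {"Pets"},
    "Cat": {"Animals", "Pets"},
    "Mammal": {"Animals"},
    "Whale": {"Animals", "Marine life"},
    "Ocean": {"Marine life", "Geography"},
}
path = guided_search(links, categories, "Dog", "Ocean")
```

The same `h` slot could take an embedding-based similarity (point 3) instead of category overlap, though a stronger heuristic that can exceed the true remaining distance would give up the shortest-path guarantee.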

dqian3 commented 2 years ago

Some relevant papers