wikimedia / revscoring

A generic, machine learning-based revision scoring system for MediaWiki
https://revscoring.readthedocs.io
MIT License

Allow scoring the diff between two arbitrary revisions #93

Closed he7d3r closed 7 years ago

he7d3r commented 9 years ago

It might happen that a vandal edits the same page many times, and each of the edits has a low probability of being reverted, and still the whole set of edits, if looked at as a single edit (e.g. by enabling the enhanced recent changes preference), would have a high probability of being reverted.

In order to have predictions for these sequential edits, it seems necessary to be able to score a revision by comparing it with an older revision than the previous (parent) revision.

E.g., I want to know the probability of this diff being reverted: https://pt.wikipedia.org/w/index.php?diff=42204427&oldid=42203059 instead of the probabilities of each of the intermediate diffs for that page.

halfak commented 9 years ago

This shouldn't be too hard. I talked this over with Helder, but I forgot to capture the notes. Here is my proposal.

(1) We add parent_revision.id as a Datasource

This will allow us to tell the dependency injection system which revision we would like to be treated as the parent. It will be a little bit awkward for the batching strategy we use now (which assumes the parent_id can be found by looking at revision_metadata.parent_id). I don't suspect this will be too bad, though, since we can solve() the parent_revision.id dependency from a cache that contains revision.metadata and possibly parent_revision.id already.
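As a rough illustration of the idea in (1), here is a toy dependency solver in which a pre-populated cache entry takes precedence over the default lookup. The `Datasource` class and `solve` function below are hypothetical stand-ins written for this example, not revscoring's actual API, and the metadata lookup is faked with a constant.

```python
from typing import Any, Callable, Dict, Sequence


class Datasource:
    """Hypothetical stand-in for a named value with dependencies."""

    def __init__(self, name: str, process: Callable[..., Any],
                 depends_on: Sequence["Datasource"] = ()):
        self.name = name
        self.process = process
        self.depends_on = tuple(depends_on)


def solve(dependent: Datasource, cache: Dict[str, Any]) -> Any:
    """Return the value of `dependent`, preferring values already in `cache`."""
    if dependent.name in cache:
        return cache[dependent.name]
    args = [solve(dep, cache) for dep in dependent.depends_on]
    value = dependent.process(*args)
    cache[dependent.name] = value
    return value


# Faked metadata lookup; a real system would query the MediaWiki API here.
revision_metadata = Datasource("revision.metadata",
                               lambda: {"parent_id": 1233})

# By default the parent ID comes from the revision's metadata.
parent_revision_id = Datasource("parent_revision.id",
                                lambda meta: meta["parent_id"],
                                depends_on=(revision_metadata,))

default_parent = solve(parent_revision_id, {})
overridden_parent = solve(parent_revision_id, {"parent_revision.id": 1100})
```

With an empty cache the parent ID is derived from the metadata. When the caller seeds the cache with `parent_revision.id`, the metadata is never consulted, which is the behavior the proposal relies on.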

(2) We change the format of revids string in ORES to allow for parents to be specified

Right now, we can specify a list of revisions to score by separating them with bars -- e.g., 1234|5678|9876. So, a regex for that would be: [0-9]+(\|[0-9]+)*. I propose that we extend this format to optionally include a colon-separated parent_id -- e.g., 1233:1234|5677:5678|9875:9876. This would change the regex to: ([0-9]+:)?[0-9]+(\|([0-9]+:)?[0-9]+)*. It would also allow someone to mix requests for revisions that specify an older parent_id with requests for revisions that use the default -- e.g., 1234|5677:5678|9876.
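The extended format in (2) could be validated and parsed roughly as follows. This is a minimal sketch, not ORES code: `parse_revids` is a hypothetical helper name, and the pattern uses `[0-9]+` so that multi-digit IDs match.

```python
import re
from typing import List, Optional, Tuple

# Each item is an optional "parent_id:" prefix followed by a rev_id;
# items are separated by bars.
REVIDS_RE = re.compile(r"(?:[0-9]+:)?[0-9]+(?:\|(?:[0-9]+:)?[0-9]+)*")


def parse_revids(revids: str) -> List[Tuple[Optional[int], int]]:
    """Return (parent_id or None, rev_id) pairs for a revids string."""
    if not REVIDS_RE.fullmatch(revids):
        raise ValueError(f"Malformed revids string: {revids!r}")
    pairs = []
    for item in revids.split("|"):
        parent, sep, rev = item.rpartition(":")
        # No colon means the default parent should be used.
        pairs.append((int(parent) if sep else None, int(rev)))
    return pairs


mixed = parse_revids("1234|5677:5678|9876")
```

A `None` parent signals that the scorer should fall back to the revision's recorded parent_id, which keeps the old bar-separated format valid.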

he7d3r commented 9 years ago

Just in case we want to be consistent with how MediaWiki does this: users can create internal links to [[Special:Diff/123/456]] (using "/" as a separator instead of ":") and the link will point to https://en.wikipedia.org/w/index.php?oldid=123&diff=456 (notice the order of the parameters)

adamwight commented 9 years ago

For the use case being described, it should also be possible to filter by user, or otherwise cherry-pick over edits by known users, so that we're only scoring the alleged vandal's edits.

However, intervening reverts would be tricky to skip over, because they would remove part of the signal.

halfak commented 7 years ago

https://ores.wmflabs.org/v2/scores/enwiki/damaging/4010401?datasource.revision.parent.id=1100
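For reference, a URL of this shape can be assembled as in the sketch below. `score_url` is a hypothetical helper, and the base URL and the `datasource.revision.parent.id` parameter are taken from the example above. The code only builds the string and does not issue a request.

```python
from typing import Optional
from urllib.parse import urlencode


def score_url(wiki: str, model: str, rev_id: int,
              parent_id: Optional[int] = None,
              base: str = "https://ores.wmflabs.org/v2/scores") -> str:
    """Build a scoring URL, optionally overriding the parent revision."""
    url = f"{base}/{wiki}/{model}/{rev_id}"
    if parent_id is not None:
        # Inject the parent revision ID as a datasource override.
        url += "?" + urlencode({"datasource.revision.parent.id": parent_id})
    return url


example = score_url("enwiki", "damaging", 4010401, parent_id=1100)
```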

This now works just fine.