Closed he7d3r closed 7 years ago
This shouldn't be too hard. I talked this over with Helder, but I forgot to capture the notes. Here is my proposal.
(1) We add parent_revision.id as a Datasource
This will allow us to tell the dependency injection system which revision we would like to be treated as the parent. It will be a little awkward for the batching strategy we use now (which assumes the parent_id can be found by looking at revision_metadata.parent_id). I don't suspect this will be too bad, though, since we can solve() the parent_revision.id dependency from a cache that contains revision.metadata and possibly parent_revision.id already.
(2) We change the format of revids string in ORES to allow for parents to be specified
Right now, we can specify a list of revisions to score by separating them with bars -- e.g., 1234|5678|9876. So, a regex for that would be: [0-9]+(\|[0-9]+)*
I propose that we extend this format to optionally include a colon-separated parent_id before each revision -- e.g., 1233:1234|5677:5678|9875:9876. So, this would change the regex to: ([0-9]+:)?[0-9]+(\|([0-9]+:)?[0-9]+)*
This would allow someone to mix requests for revisions that have an old parent_id with requests for revisions that would use the default -- e.g., 1234|5677:5678|9876
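A minimal sketch of how the extended revids string could be parsed, assuming the parent_id comes before the colon as in the examples above. The function name parse_revids is hypothetical and not part of ORES; the regex uses "+" quantifiers so that multi-digit IDs match.

```python
import re

# Extended format: optional "parent_id:" prefix on each bar-separated rev_id.
REVIDS_RE = re.compile(r"(?:[0-9]+:)?[0-9]+(?:\|(?:[0-9]+:)?[0-9]+)*")


def parse_revids(revids):
    """Return a list of (parent_id, rev_id) pairs.

    parent_id is None when the item does not specify one, meaning the
    default parent should be used. (Hypothetical helper, for illustration.)
    """
    if not REVIDS_RE.fullmatch(revids):
        raise ValueError("malformed revids string: %r" % revids)
    pairs = []
    for item in revids.split("|"):
        # rpartition yields an empty separator when there is no colon.
        parent, sep, rev = item.rpartition(":")
        pairs.append((int(parent) if sep else None, int(rev)))
    return pairs
```

For the mixed request above, parse_revids("1234|5677:5678|9876") gives the default parent for the first and last revisions and an explicit parent for the middle one.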
Just in case we want to be consistent with how MediaWiki does this: users can create internal links to [[Special:Diff/123/456]] (using "/" as a separator instead of ":") and the link will point to https://en.wikipedia.org/w/index.php?oldid=123&diff=456 (notice the order of the parameters).
For the use case being described, it should also be possible to filter by user, or otherwise cherry-pick edits by known users, so that we're only scoring the alleged vandal's edits.
However, intervening reverts would be tricky to skip over, because they would remove part of the signal.
https://ores.wmflabs.org/v2/scores/enwiki/damaging/4010401?datasource.revision.parent.id=1100
This now works just fine.
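For anyone scripting this, here is a rough sketch of building such a request URL. The helper name score_url is made up for illustration; the base path and the datasource.revision.parent.id parameter are taken from the URL above, and whether other models accept the same parameter is an assumption.

```python
from urllib.parse import urlencode

# Base path copied from the example URL above.
ORES_BASE = "https://ores.wmflabs.org/v2/scores"


def score_url(wiki, model, rev_id, parent_id=None):
    """Build a scoring URL, optionally overriding the parent revision.

    Hypothetical helper: when parent_id is None, no override is added and
    the service is expected to use the revision's recorded parent.
    """
    url = "%s/%s/%s/%d" % (ORES_BASE, wiki, model, rev_id)
    if parent_id is not None:
        url += "?" + urlencode({"datasource.revision.parent.id": parent_id})
    return url
```

Calling score_url("enwiki", "damaging", 4010401, parent_id=1100) reproduces the URL above.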
It might happen that a vandal edits the same page many times, and each of the edits has a low probability of being reverted, and still the whole set of edits, if looked at as a single edit (e.g. by enabling the enhanced recent changes preference), would have a high probability of being reverted.
In order to have predictions for these sequential edits, it seems necessary to be able to score a revision by comparing it with an older revision than the previous (parent) revision.
E.g.: I want to know the probability of this diff being reverted: https://pt.wikipedia.org/w/index.php?diff=42204427&oldid=42203059 instead of the probabilities of each of the intermediary diffs for that page: