wikimedia / mediawiki-title

mediawiki title normalization library
8 stars 12 forks source link

Bad title parse #18

Closed subbuss closed 8 years ago

subbuss commented 8 years ago

Parsoid is parsing [[К:Википедия: статьи без изображений (тип: судно)]] as a link after switching to mediawiki-title instead of a category as it used to before.

Thanks to a recently introducing bug in dumping error output (but fixed in https://github.com/wikimedia/parsoid/commit/c77dbee5fb964eea341642dc1aafa832862f09fa), led to crashers in roundtrip testing (https://parsoid-tests.wikimedia.org/resultFlagNew/45beb6c0abc5a313ad75259053d74466a51f91cd/4a5a7ae97bd2b4c6fa70f62c99d2e959d072c65f/ruwiki/HMS_Actaeon_(1775)). I was able to trace the failure in template wrapping to a bad parse of the above wikitext.

This is because https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/tt/LinkHandler.js#L283 is no longer true. The namespace id is 4 instead of 14. However, siteinfo dump shows

"-2":{"id":-2,"case":"first-letter","canonical":"Media","":"Медиа"},"-1":{"id":-1,"case":"first-letter","canonical":"Special","":"Служебная"}},"namespacealiases":[{"id":2,"":"U"},{"id":2,"":"У"},{"id":2,"":"Участница"},{"id":3,"":"UT"},{"id":3,"":"ОУ"},{"id":3,"":"Обсуждение участницы"},{"id":4,"":"Wikipedia"},{"id":4,"":"ВП"},{"id":6,"":"Image"},{"id":6,"":"Изображение"},{"id":7,"":"Image talk"},{"id":7,"":"Обсуждение изображения"},{"id":10,"":"T"},{"id":10,"":"Ш"},{"id":14,"*":"К"}

which means this should have been parsed as 14 which would then have been recognized as a category link.

subbuss commented 8 years ago

I suspect this is probably a Parsoid-internal issue with how mediawiki-title is being used.

subbuss commented 8 years ago

Well, it is indeed a mediawiki-title bug. The diff below seems to fix it. It is late and this needs a second look, so, can pick this up tomorrow. /cc @arlolra

[subbu@earth mediawiki-title] git diff
diff --git a/node_modules/mediawiki-title/lib/index.js b/node_modules/mediawiki-title/lib/index.js
index 3a185b8..3bf8349 100644
--- a/node_modules/mediawiki-title/lib/index.js
+++ b/node_modules/mediawiki-title/lib/index.js
@@ -213,7 +213,7 @@ function _splitNamespace(title, siteInfo, defaultNs) {
     var match = title.match(prefixRegex);
     if (match) {
         var namespaceText = match[1];
-        var ns = Namespace.fromText(namespaceText, siteInfo);
+        var ns = (defaultNs !== undefined) ? defaultNs : Namespace.fromText(namespaceText, siteInfo);
         if (ns !== undefined) {
             return {
                 title: match[2],
cscott commented 8 years ago

That diff looks reasonable to me. We'll see if arlo agrees.

subbuss commented 8 years ago

Example test case: "Category:Project:Foo"

arlolra commented 8 years ago

That doesn't look right. Why is K being passed in as the default?

subbuss commented 8 years ago

Looking at all the changes, the bug is in interpretation of the 'defaultNs' parameter. If mediawiki-title is treating that arg as a default to be used when one cannot be inferred, then the bug is in https://gerrit.wikimedia.org/r/#/c/299897/2/lib/config/MWParserEnvironment.js ... the 'ns' parameter is not a "default" namespace parameter, but the namespace that needs to be used with the title.

subbuss commented 8 years ago

mediawiki-core has https://github.com/wikimedia/mediawiki/blob/master/includes/Title.php#L249-L252 to support this functionality.

subbuss commented 8 years ago

So, there are multiple fixes possible here:

  1. Fix how Parsoid is using the explicit ns parameter when calling mediawiki-title
  2. Fix Parsoid to not force an explicit namespace (and just rely on mediawiki-title for all namespace parsing)
  3. Add mediawiki-core equivalent functionality to mediawiki-title and use the appropriate makeTitle in Parsoid code.

Of these, (1) or (2) seems simpler than (3).

arlolra commented 8 years ago

the bug is in interpretation of the 'defaultNs' parameter.

That link to core says, "The namespace to use if none is specified by a prefix." Agreed that the bug is in Parsoid, and it looks like I introduced it in, https://gerrit.wikimedia.org/r/#/c/299897/2/lib/config/MWParserEnvironment.js

But, I think (2) is the right thing to do.

subbuss commented 8 years ago

That link also indicates that title construction might need to construct something with an explicit NS for which it provides additional methods .. hence option (3). But, okay, I'll take a shot at (2) then.

arlolra commented 8 years ago

Yeah, that's exactly what you were doing with,

var title = Title.newFromText(urlDecodedText, this.conf.wiki.siteInfo);
if (ns !== undefined && ns !== 0) {
    title = new Title(title.getKey(), ns, title._siteInfo, title._fragment);
}
return title;

which is why I said I introduced in the above change.

arlolra commented 8 years ago

But, okay, I'll take a shot at (2) then.

Or you can revert my change and put a comment about that not just being a default, as I misinterpreted.

subbuss commented 8 years ago

Yes, I considered reverting it (since I verified that it works earlier today). As it turns out, there is only one use of the explicit NS scenario (the one that broke). I can fix that easily. Let me push fix for (2) and we can discuss on that patch what is a better solution.

But, the broader question is whether mediawiki-title, the library, needs functionality described in (3). It may, but probably better to introduce it on demand.

arlolra commented 8 years ago

It may, but probably better to introduce it on demand.

Agreed.

arlolra commented 8 years ago

I do need the default NS behaviour, as implemented here, for L71, https://gerrit.wikimedia.org/r/#/c/264026/21/lib/ext/Gallery/index.js