webplatform / mediawiki-conversion

Convert MediaWiki XML backup into structured raw text file tree
https://github.com/webplatform/docs

Ensure pages with redirects get deleted from the repository and that a Git symbolic link is created #1

Closed renoirb closed 9 years ago

renoirb commented 9 years ago

Let’s handle redirects, using Git!

That way we don’t need to create redirect specific configurations for the web server or the static site generator.

Rationale

Under MediaWiki, we can create a page redirect. Let’s assume we asked MediaWiki to "leave a redirect" at /wiki/a/b/c so that the visitor’s web browser is automatically sent to /wiki/d/e/f.

Page redirects are created for many purposes. They can be created by a user who explicitly adds the appropriate wiki code, or when a user uses the "Move" function.

In the end, MediaWiki creates a new wiki document revision with special wiki code in the following format: #REDIRECT [[d/e/f]].
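As a rough illustration of reading that wiki code, here is a minimal Python sketch that extracts the target from a revision's text. The names `REDIRECT_RE` and `redirect_target` are hypothetical and are not part of the conversion script. The pattern assumes the keyword is matched case-insensitively and that the target ends at the first `]`, `|` or `#`.

```python
import re

# Assumption: the keyword is case-insensitive and may be followed by
# optional whitespace before the [[target]] link.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[d/e/f]]"))  # prints: d/e/f
print(redirect_target("Regular page content"))  # prints: None
```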

Notice that what’s after #REDIRECT is valid MediaWiki-flavored wikitext. When parsing a MediaWiki dumpBackup XML file, we also have another way of knowing whether a wiki page is a redirect. Since our script loops through the XML file, we can get this information from a <page> that contains an element like <redirect title="d/e/f" />.

<mediawiki>
  <page>
    <title>a/b/c</title>
    <ns>0</ns>
    <id>86</id>
    <redirect title="d/e/f" />
    <revision>
      <id>11686</id>
      <parentid>11679</parentid>
      <timestamp>2012-10-08T21:17:27Z</timestamp>
      <contributor>
        <username>Renoirb</username>
        <id>31337</id>
      </contributor>
      <comment>Save comment</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="28">#REDIRECT [[d/e/f]]</text>
      <sha1>...</sha1>
    </revision>
  </page>
</mediawiki>
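To make the XML-based detection concrete, here is a small Python sketch that loops over `<page>` elements and collects those carrying a `<redirect>` child. It is not the project's own implementation. `SAMPLE` and `find_redirects` are hypothetical names, and the sample is a trimmed version of the excerpt above plus an invented second page for contrast. It also assumes no XML namespace, as in the excerpt; a dump that declares a default namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample: the second page is invented for illustration only.
SAMPLE = """<mediawiki>
  <page>
    <title>a/b/c</title>
    <ns>0</ns>
    <redirect title="d/e/f" />
    <revision>
      <text xml:space="preserve">#REDIRECT [[d/e/f]]</text>
    </revision>
  </page>
  <page>
    <title>d/e/f</title>
    <ns>0</ns>
    <revision>
      <text xml:space="preserve">Actual content</text>
    </revision>
  </page>
</mediawiki>"""


def find_redirects(xml_text):
    """Map each redirecting page title to the title it points to."""
    root = ET.fromstring(xml_text)
    redirects = {}
    for page in root.iter("page"):
        redirect = page.find("redirect")
        if redirect is not None:
            redirects[page.findtext("title")] = redirect.get("title")
    return redirects


print(find_redirects(SAMPLE))  # prints: {'a/b/c': 'd/e/f'}
```

For a full dump, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole file at once.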

Since we want to keep only files that actually have content, and since Git supports a similar outcome, we’ll address the use case with what’s available to us within Git.

Expected outcome

When a MediaWiki page is declared as a redirect, we’ll instead:

  1. Delete the file (e.g. git rm -- a/b/c)
  2. Look where the redirect points to (e.g. d/e/f)
  3. Ensure the targeted file exists
    • true: Create a symbolic link and stage it (e.g. ln -s ../../d/e/f a/b/c, then git add a/b/c)
    • false: Do nothing.
  4. Commit with enforced author and action date (e.g. git commit --author="John Doe <jdoe@example.org>" --date="Thu, 13 Sep 2012 20:50:35 +0000" --message="Original commit message")
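The four steps above could be sketched in Python roughly as follows. This is only an illustration of the proposal as written, not the conversion script's code. `relative_link_target` and `apply_redirect` are hypothetical names, the `run` parameter is an assumption added so the Git calls can be swapped out, and the source page is assumed to be already tracked in the repository. The link target is computed relative to the link's own directory, because a symbolic link is resolved from where it lives.

```python
import os
import subprocess


def relative_link_target(source, target):
    """Path of `target` as seen from the directory that contains `source`."""
    return os.path.relpath(target, os.path.dirname(source) or ".")


def apply_redirect(repo, source, target, author, date, message,
                   run=subprocess.check_call):
    """Replace `source` with a symlink to `target` inside `repo`, then commit.

    Returns True if a link was created, False if the target was missing.
    """
    # Step 1: delete the redirecting page (assumed to be tracked by Git).
    run(["git", "rm", "--quiet", "--", source], cwd=repo)

    # Steps 2 and 3: create the link only if the target file exists.
    linked = os.path.exists(os.path.join(repo, target))
    if linked:
        link_path = os.path.join(repo, source)
        # git rm may have removed a now-empty parent directory.
        os.makedirs(os.path.dirname(link_path), exist_ok=True)
        os.symlink(relative_link_target(source, target), link_path)
        run(["git", "add", "--", source], cwd=repo)

    # Step 4: commit with the original author, date and message.
    run(["git", "commit", "--quiet", f"--author={author}",
         f"--date={date}", f"--message={message}"], cwd=repo)
    return linked


print(relative_link_target("a/b/c", "d/e/f"))  # prints: ../../d/e/f (POSIX)
```

Passing a different `run` callable makes it possible to log or dry-run the Git commands instead of executing them.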

Things to keep in mind

  1. A redirect doesn’t mean the page has been deleted, but the most common use case is when a page’s content has been removed and we want to redirect the user elsewhere.

renoirb commented 9 years ago

This should be handled now. To confirm.

renoirb commented 9 years ago

There is no equivalent of symbolic links in Git. Will have to handle this through NGINX redirect maps instead, in #6.