webplatform / mediawiki-conversion

Convert MediaWiki XML backup into structured raw text file tree
https://github.com/webplatform/docs

Ensure there are no filesystem naming collisions for folders #2

Closed renoirb closed 9 years ago

renoirb commented 9 years ago

If an export is run from a UNIX server that has a case-sensitive filesystem, the import process may let through folders that have the same name but different casing.

For example, imagine the following wiki page "titles" are exposed as URLs under https://docs.webplatform.org/wiki/ (e.g. https://docs.webplatform.org/wiki/tutorial/Information Architecture):

  1. tutorial/Information Architecture/Planning out a website
  2. tutorial/Information Architecture/ja (notice the last part of the URL. It denotes a Japanese translation. It's currently the only way WebPlatform handles localization.)
  3. tutorial/information_architecture (notice the lowercase "i")
  4. concepts/IA/planning a website

When we run the export script, we already normalize names for the filesystem, so we would end up with the following folder and file hierarchy:

  1. tutorial/Information_Architecture/Planning_out_a_website/index.html
  2. tutorial/Information_Architecture/ja.html
  3. tutorial/information_architecture/index.html
  4. concepts/IA/planning_a_website/index.html

Notice that the tutorial/ folder will contain the same string twice, differing only by case: "Information_Architecture" and "information_architecture". This may not be a problem on a case-sensitive filesystem, but it would be on a system that isn’t, because both names would resolve to the same folder.
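To illustrate the collision, here is a minimal sketch (not the project's actual code) that flags folder paths differing only by letter case. The function name `find_case_collisions` is hypothetical; the input is the exported hierarchy listed above.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group folder prefixes that differ only by letter case."""
    seen = defaultdict(set)
    for path in paths:
        folders = path.split("/")[:-1]  # drop the file name, keep the folders
        for depth in range(1, len(folders) + 1):
            prefix = "/".join(folders[:depth])
            # Lowercasing mimics how a case-insensitive filesystem compares names.
            seen[prefix.lower()].add(prefix)
    # Keep only the prefixes that were written in more than one way.
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


exported = [
    "tutorial/Information_Architecture/Planning_out_a_website/index.html",
    "tutorial/Information_Architecture/ja.html",
    "tutorial/information_architecture/index.html",
    "concepts/IA/planning_a_website/index.html",
]
collisions = find_case_collisions(exported)
```

With the four example paths, only the two spellings of the Information Architecture folder are reported.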

We have to make sure we store content without creating this problem.

Expected outcome

During import, do the following;

  1. For each wiki page, get the Wiki page "title" (e.g. tutorial/Information_Architecture/Planning_out_a_website)
  2. Normalize the title:
    1. replace any special characters (e.g. ?, !, :, @, (, ), space, etc.) (N.B. Yes, we do have these)
    2. strip anything not in the US-ASCII character set
  3. Create an associative map of paths:
    1. Split the title by /, and assign the resulting array to a paths variable (e.g. ['tutorial', 'Information_Architecture', 'Planning_out_a_website'])
    2. Add each member to an associative map so that everything at index 0 is grouped together, the same for index 1, and so on.
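The steps above could be sketched as follows. This is an illustration under assumptions, not the converter's implementation: the names `normalize_title` and `build_path_map` are hypothetical, and replacing special characters with an underscore is a guess based on the example hierarchy, since the issue does not name the replacement character.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Return a filesystem-friendly version of a wiki page title."""
    # Step 2.2: drop anything outside the US-ASCII character set.
    ascii_only = title.encode("ascii", "ignore").decode("ascii")
    # Step 2.1 (assumption): special characters and spaces become "_".
    # "/" is kept because it separates the path segments.
    return re.sub(r"[^A-Za-z0-9_/.-]", "_", ascii_only)


def build_path_map(titles):
    """Map each path depth (0, 1, ...) to the set of names seen at that depth."""
    by_depth = defaultdict(set)
    for title in titles:
        paths = normalize_title(title).split("/")  # step 3.1
        for index, name in enumerate(paths):       # step 3.2
            by_depth[index].add(name)
    return by_depth


titles = [
    "tutorial/Information Architecture/Planning out a website",
    "tutorial/Information Architecture/ja",
    "tutorial/information_architecture",
    "concepts/IA/planning a website",
]
path_map = build_path_map(titles)
```

Grouping names by depth makes it easy to compare all the names that could end up as siblings, which is where case-only differences matter.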

Note that this part of the problem handles only the file name. We’ll have to set up a configuration file that will take care of serving the right file, even though the name of the file and the URL aren’t exactly the same.

Expected deliverables

renoirb commented 9 years ago

Got around to getting the list of words and the different ways each one is written (below). Gotta ensure any redirect target URL writes each of those consistently.

variants:
 - accept, Accept
 - accessibility_basics, Accessibility_basics
 - accessibility_testing, Accessibility_testing
 - addStream, addstream
 - animatable, Animatable
 - animation, Animation
 - BGCOLOR, bgColor
 - canvas_tutorial, Canvas_tutorial
 - Connection, connection
 - cookie, Cookie
 - css, CSS
 - DataTransfer, dataTransfer
 - date, Date
 - doctype, DOCTYPE
 - Document, document
 - DOMTokenList, DomTokenList
 - Element, element
 - Error, error
 - Event, event
 - file, File
 - filesystem, FileSystem
 - Floats_and_clearing, floats_and_clearing
 - formTarget, formtarget
 - Function, function
 - gamepad, Gamepad
 - geolocation, Geolocation
 - Getting_Your_Content_Online, getting_your_content_online
 - Global, global
 - History, history
 - How_does_the_Internet_Work, How_does_the_Internet_work
 - ID, id
 - Image, image
 - Implementation, implementation
 - indexeddb, indexedDB
 - ISO, iso
 - javascript, JavaScript
 - JavaScript_for_mobile, javascript_for_mobile
 - json, JSON
 - link, Link
 - Location, location
 - math, Math
 - moveEnd, moveend
 - moveStart, movestart
 - Navigator, navigator
 - Node, node
 - number, Number
 - oauth, OAuth
 - object, Object
 - online, onLine
 - option, Option
 - Performance, performance
 - PhotoSettingsOptions, photoSettingsOptions
 - pointerevents, PointerEvents
 - position, Position
 - q, Q
 - Range, range
 - readOnly, readonly
 - Region, region
 - removeStream, removestream
 - selection, Selection
 - selectors, Selectors
 - storage, Storage
 - String, string
 - StyleMedia, styleMedia
 - styleSheet, stylesheet
 - Styling_lists_and_links, styling_lists_and_links
 - Styling_tables, styling_tables
 - text, Text
 - tfoot, tFoot
 - the_basics_of_html, The_basics_of_HTML
 - The_History_of_the_Web, The_history_of_the_Web, the_history_of_the_web
 - thead, tHead
 - timeStamp, timestamp
 - tutorials, Tutorials
 - Unicode, unicode
 - URL, url
 - websocket, WebSocket
 - what_does_a_good_web_page_need, What_does_a_good_web_page_need

... AnimationEffectApis, AudioTracks ?
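A list like the one above could be produced by grouping path segments by their lowercase form. The following is only a sketch of that idea, with a hypothetical `case_variants` helper and a small sample of names taken from the list plus one name with a single spelling; it is not how the list was actually generated.

```python
from collections import defaultdict


def case_variants(names):
    """Return groups of names that differ only by letter case."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    # Keep groups with more than one spelling, ordered alphabetically.
    return sorted(
        (sorted(group) for group in groups.values() if len(group) > 1),
        key=lambda group: group[0].lower(),
    )


names = [
    "css", "CSS", "json", "JSON", "canvas",
    "The_History_of_the_Web", "The_history_of_the_Web", "the_history_of_the_web",
]
variants = case_variants(names)
```

Names with a single spelling, such as "canvas" in the sample, are left out of the result.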

renoirb commented 9 years ago

Solved! At LAST! Commit e59539a9ad40fad80439e080855dad8fb6e9829c should fix the remaining pages that wouldn’t be properly migrated.