Community:SummerOfCode11:MediaWikiParser
MediaWiki is the dominant wiki syntax in the wild, largely due to the influence of Wikipedia, and that mindshare is the main reason it was chosen as the language for SUMO's knowledge base during our migration from TikiWiki. However, there is no good implementation of MediaWiki parsing in Python. The best available, py-wikimarkup, is an almost line-for-line port of the original PHP, to the extent that it still has dollar signs in the comments. As such, it is a multi-stage regex find-and-replace formatter that never creates a parse tree and is thus both (1) practically impossible to extend and (2) tightly coupled to outputting HTML. Both of these hurt us:
- SUMO needed to extend the parser to support constructs such as TikiWiki's "showfor", inclusions, and templates. All of these had to be implemented with some degree of hackery, which makes the code harder to understand and leaves many broken corner cases that regularly plague contributors.
- Right now, it's impossible to translate wiki snippets (forum posts, for example) to plain text for inclusion in emails.
We propose to implement a new MediaWiki parser using proper parsing techniques: generating a parse tree, manipulating it, and then outputting (at least initially) HTML. Erik Rose has already done some research toward this: see the continually developing design document and some initial code.
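To illustrate the parse-tree approach proposed above, here is a minimal sketch in Python. It is not taken from the design document or the initial code: the names `Node`, `parse`, `to_html`, and `to_text` are hypothetical, and the grammar is assumed to cover only bold (`'''`) and italic (`''`) markers. The point it shows is that once a tree exists, HTML and plain text are just two different walks over the same structure.

```python
import re
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Hypothetical parse-tree node; kind is "root", "bold", "italic", or "text"."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


# Assumption: only these two inline markers are handled in this sketch.
MARKERS = {"'''": "bold", "''": "italic"}
TOKEN_RE = re.compile(r"('''|'')")  # try the longer marker first


def parse(source):
    """Build a tree from wiki text; unclosed markers are closed at end of input."""
    root = Node("root")
    stack = [root]
    for token in TOKEN_RE.split(source):
        if not token:
            continue
        kind = MARKERS.get(token)
        if kind is None:
            # Ordinary text becomes a leaf of the innermost open node.
            stack[-1].children.append(Node("text", token))
        elif stack[-1].kind == kind:
            # Same marker as the innermost open node: treat it as the closer.
            stack.pop()
        else:
            # Otherwise open a new nested node.
            node = Node(kind)
            stack[-1].children.append(node)
            stack.append(node)
    return root


HTML_TAGS = {"bold": "b", "italic": "i"}


def to_html(node):
    """Render the tree as HTML, escaping text leaves."""
    if node.kind == "text":
        return escape(node.text)
    inner = "".join(to_html(child) for child in node.children)
    tag = HTML_TAGS.get(node.kind)
    return f"<{tag}>{inner}</{tag}>" if tag else inner


def to_text(node):
    """Render the same tree as plain text, e.g. for inclusion in an email."""
    if node.kind == "text":
        return node.text
    return "".join(to_text(child) for child in node.children)


tree = parse("'''SUMO''' supports ''showfor'' & templates")
print(to_html(tree))  # <b>SUMO</b> supports <i>showfor</i> &amp; templates
print(to_text(tree))  # SUMO supports showfor & templates
```

In a design like this, supporting a new construct would mean adding a node kind and teaching each renderer about it, and a tree-manipulation step (for example, expanding templates) could run between `parse` and either renderer. A real MediaWiki grammar is far larger than this sketch suggests.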