Labs/Ubiquity/Parser 2/Localization Tutorial: Difference between revisions



== Entering your anaphora ("magic words") ==
  en.anaphora = ["this", "that", "it", "selection", "him", "her", "them"];
The final required property is the <code>anaphora</code> property, which takes a list of "magic words". Currently the parser makes no distinction between the different [http://en.wikipedia.org/wiki/Deixis deictic] [http://en.wikipedia.org/wiki/Anaphora_%28linguistics%29 anaphora], even though they might refer to different things.
== Special cases ==
Some special language features can be handled by overriding the default behavior from <code>Parser</code>. Please note that the exact implementation of a number of these features is still in flux.
=== Languages with no spaces ===
If your language does not delimit arguments (or words, more generally) with spaces, you will need to write a custom <code>wordBreaker()</code> method and set <code>usespaces = false</code> and <code>joindelimiter = <nowiki>''</nowiki></code>. For an example, please take a look at the [https://ubiquity.mozilla.com/hg/ubiquity-firefox/raw-file/tip/ubiquity/modules/parser/new/ja.js Japanese] or [https://ubiquity.mozilla.com/hg/ubiquity-firefox/raw-file/tip/ubiquity/modules/parser/new/zh.js Chinese] parser files.
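The following is only a minimal sketch of what such a setup might look like, not the code from the Japanese or Chinese parser files. It assumes that <code>wordBreaker()</code> receives the raw input string and returns it with a break marker inserted around each delimiter, and it assumes a zero-width space as that marker. The <code>lang</code> object and its <code>particles</code> list are hypothetical stand-ins for illustration.

```javascript
// Hypothetical stand-in for a language parser object. In Ubiquity it would be
// created from Parser; a plain object is used so the sketch runs on its own.
const lang = {
  usespaces: false,   // words are not separated by spaces
  joindelimiter: "",  // so arguments are joined back together without a space
  // Hypothetical list of role-marking particles used only by this sketch.
  particles: ["から", "を", "に"],
};

// Assumption: wordBreaker() takes the raw input and returns it with a
// zero-width space (U+200B) on both sides of every particle, so that later
// steps can split the input on that marker.
lang.wordBreaker = function (input) {
  // Longest particles first, so a longer particle wins over a shorter one.
  const sorted = this.particles.slice().sort((a, b) => b.length - a.length);
  const pattern = new RegExp("(" + sorted.join("|") + ")", "g");
  return input.replace(pattern, "\u200b$1\u200b");
};

// Usage: split the marked-up string on the zero-width space.
const pieces = lang.wordBreaker("東京に行く").split("\u200b");
console.log(pieces);
```

A purely pattern-based breaker like this one will also split inside words that happen to contain a particle, so a real implementation would likely need to be more careful.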
=== Case marking languages ===
In general, the plan for Parser 2 is not to attempt to handle strongly case-marked languages, and instead to encourage the use of adpositions (prepositions or postpositions) as role markers. For more information, please read [http://mitcho.com/blog/projects/in-case-of-case/ In Case of Case...].
=== Stripping articles ===
In some languages, certain delimiters combine with articles. For example, in French, the preposition "à" combines with the masculine definite article "le" but not with the feminine "la":
# à + la = à la
# à + le = au
You can add both "à" and "au" as delimiters of the <code>goal</code> role, but then you will get feminine arguments back with the determiner (e.g. "la table"), while masculine arguments will be parsed without a determiner (e.g. "chat").
# "'''à''' la table" = "'''to''' the table"
# "'''au''' chat" = "'''to the''' cat"
These types of ''portmanteau prepositions'' can be handled through a process of argument normalization. Each language's parser can optionally define a <code>normalizeArgument()</code> method which takes an argument and returns a list of normalized alternates. Each normalized argument is returned in the form <code>{prefix: <nowiki>''</nowiki>, newInput: <nowiki>''</nowiki>, suffix: <nowiki>''</nowiki>}</code>. For example, if you feed "la table" to the French <code>normalizeArgument()</code>, it ought to return:
  [{prefix: 'la ', newInput: 'table', suffix: ''}]
If there are no possible normalizations, <code>normalizeArgument()</code> should simply return <code>[]</code>. Each alternative returned by <code>normalizeArgument()</code> is substituted into a copy of the possible parses just before nountype detection. The prefixes and suffixes are stored in the argument (as <code>inactivePrefix</code> and <code>inactiveSuffix</code>) so they can be incorporated into the suggestion display.
Here, for example, is how the inactive prefix "l'" is displayed in the parser playpen (described below). This way the user is told that the "l'" prefix is being ignored, and the nountype detection and verb action can act on the argument "English". (In the future, of course, we could teach this nountype to accept the Catalan "anglès".)
[[Image:Catalan_portmanteau.png]]
The easiest way to produce this output is to use the [https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/String/match <code>String.match()</code>] method. For sample <code>normalizeArgument()</code> implementations, take a look at the [http://ubiquity.mozilla.com/hg/ubiquity-firefox/file/tip/ubiquity/modules/parser/new/ca.js Catalan] and [http://ubiquity.mozilla.com/hg/ubiquity-firefox/file/12f5d9abf011/ubiquity/modules/parser/new/fr.js French] parser files.
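As a rough illustration of the normalization described above, here is a sketch of a French-style <code>normalizeArgument()</code> built on <code>String.match()</code>. It is not the code from the French parser file. The <code>fr</code> object is a hypothetical stand-in, and the sketch assumes that only the definite articles "le", "la", "les" and the elided "l'" need to be stripped.

```javascript
// Hypothetical stand-in for the French parser object, so the sketch is
// self-contained; the real object would be created from Parser.
const fr = {};

// Sketch: if the argument starts with a definite article, return one
// alternate with the article moved into the prefix; otherwise return [].
fr.normalizeArgument = function (input) {
  // Assumption: only "le", "la", "les" (followed by whitespace) and the
  // elided "l'" are treated as strippable articles.
  const matches = input.match(/^(le\s+|la\s+|les\s+|l')(.+)$/i);
  if (matches === null) {
    return []; // no possible normalization
  }
  return [{ prefix: matches[1], newInput: matches[2], suffix: "" }];
};

console.log(fr.normalizeArgument("la table"));
console.log(fr.normalizeArgument("chat"));
```

With this sketch, "la table" yields a single alternate whose prefix is "la " and whose new input is "table", while "chat" yields an empty list, matching the behavior described above.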
== Test your parser ==