User:Waldo/Internationalization API

From MozillaWiki
Revision as of 18:09, 25 April 2013 by Waldo (talk | contribs) (Add more implementation details info)

Introduction

ECMAScript has long had rudimentary localization support. ES5 defines toLocaleString methods (found on various objects like Array.prototype, Number.prototype, and Date.prototype); toLocaleLowerCase and toLocaleUpperCase on String.prototype; and toLocaleDateString and toLocaleTimeString on Date.prototype. Each method acts only with respect to the user's current locale, and none provides any control over output formatting. The spec algorithms are woefully under-defined. As a practical matter, localization support in ES5 is useless.

The ECMAScript Internationalization API (ECMA-402) significantly extends these capabilities, to provide genuinely useful means of localization to ECMAScript. Outputs may be customized in various ways by requesting different components be included in output, formatted in various ways. The locale used for a formatting operation is customizable, and output formatting is intelligently determined in accordance with the locale. It additionally provides comparison functions useful for locale-sensitively sorting data, according to the type of that data (for example, sorting names in phone book order, versus sorting them in dictionary order), considering or ignoring capitalization, accents, and so on.

The Internationalization API introduces one new global property: Intl. This property is an object with various properties corresponding to various sub-APIs: collation (sorting), number formatting, and date/time formatting. (More capabilities will be added in future Internationalization API updates.) The localization APIs from ES5 have been reformulated to use the localization capabilities provided by the Internationalization API. Generally, however, it's preferable to use the Internationalization API directly: this is more efficient, because it permits caching of the structures needed to perform each operation.
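
As a rough illustration of the caching point, the sketch below builds one Intl.NumberFormat and reuses it for several values, then produces the same strings through the reformulated toLocaleString. The en-US locale and the sample values are assumptions chosen for the example, and whether the legacy path is measurably slower depends on the engine.

```javascript
// Construct the formatter once; locale and option resolution happen here.
const formatter = new Intl.NumberFormat("en-US");

// Reuse the same formatter for every value.
const values = [1234.5, 0.25, 1000000];
const viaIntl = values.map((v) => formatter.format(v));

// The reformulated ES5 method yields the same text, but each call has to
// resolve the locale and options again (unless the engine caches internally).
const viaLegacy = values.map((v) => v.toLocaleString("en-US"));

console.log(viaIntl);
console.log(viaLegacy);
```

Holding on to the formatter object is what lets the expensive setup be paid only once.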

Concepts

Most of the concepts used by the Internationalization API are defined in BCP 47: a living aggregation of a set of RFCs (the set may change over time as RFCs in the set are obsoleted and replaced) specifying internationalization mechanics. Full details on concepts should generally be looked up there: ECMA-402 defines most underlying concepts only by reference.

Language tags

Every operation is performed in terms of locales, specified as language tags: en-US, nan-Hant-TW, und, and so on. The main components of a language tag are the language and optionally a script, region, and variants that might exist within these. An extension component follows, permitting inclusion of extra structured data (usually to contextualize a use of the language tag). Finally, an optional private-use component may include arbitrary data (this is for the use of webpages -- not for SpiderMonkey's use). All components are alphanumeric and case-insensitive ASCII. The components are joined into a language tag using hyphens; individual components can be distinguished by their length and internal syntax (prefix, etc.). The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons. See BCP 47 for all the gory details.
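
The component structure can be inspected from script. The sketch below assumes an engine that provides Intl.getCanonicalLocales and Intl.Locale, both of which come from later editions of ECMA-402 than the one this page describes; it only illustrates how a tag breaks down and says nothing about what SpiderMonkey does internally.

```javascript
// Language tags are case-insensitive; canonicalization fixes the casing.
const [canonical] = Intl.getCanonicalLocales("EN-us");
console.log(canonical); // "en-US"

// Split a tag into its language, script, and region components.
const tag = new Intl.Locale("nan-Hant-TW");
const parts = { language: tag.language, script: tag.script, region: tag.region };
console.log(parts); // { language: 'nan', script: 'Hant', region: 'TW' }

// A structurally invalid tag (underscore instead of hyphen) is rejected.
let rejected = false;
try {
  Intl.getCanonicalLocales("en_US");
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```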

One particular subcomponent worth noting specifically is the Unicode extension component, living within the extension component. The Unicode extension component has the basic form "-u(-[a-z0-9]{2,8})+", with precise details in RFC 6067. The Unicode component permits specifying additional details about sort order, numeric system, calendar system, and others.
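
To make the basic form concrete, the following sketch tests a tag against the pattern quoted above and then asks a formatter which numbering system it resolved. The tag en-u-nu-arab is an assumption picked for the example, and the resolved value depends on the locale data the implementation ships.

```javascript
// The basic form of the Unicode extension component, as quoted above.
const unicodeExtension = /-u(-[a-z0-9]{2,8})+/;

// "nu" is the numbering-system key; "arab" requests Arabic-Indic digits.
const tag = "en-u-nu-arab";
const hasExtension = unicodeExtension.test(tag);
console.log(hasExtension); // true

// Ask the formatter what it actually resolved from the tag.
const resolved = new Intl.NumberFormat(tag).resolvedOptions();
console.log(resolved.numberingSystem); // "arab" when the implementation supports it
```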

SpiderMonkey mostly ignores the language, script, region, variant, and private-use components of a language tag. It will pass these components to ICU in language tags provided by the user, but it generally doesn't examine them or do much of interest with them. The one exception is for old-style language tags. XXX add details about the old-style mapping code in Intl.js, and why ICU doesn't perform that mapping itself

SpiderMonkey does, however, sometimes have to (very briefly) care about a Unicode extension component of a language tag -- but only to remove it. ECMA-402 often has better-structured means of specifying the same information, and so its algorithms require the Unicode extension component be removed before processing continues.
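
A minimal sketch of that removal step follows. The function name removeUnicodeExtension is hypothetical and this is not SpiderMonkey's actual code; it assumes the input is already a well-formed language tag and takes care not to touch anything inside the private-use component.

```javascript
// Hypothetical helper: strip the Unicode extension from a well-formed tag.
function removeUnicodeExtension(tag) {
  // Everything from the "x" singleton onward is private use; leave it alone.
  const privateUseIndex = tag.search(/(^|-)x-/i);
  const head = privateUseIndex === -1 ? tag : tag.slice(0, privateUseIndex);
  const tail = privateUseIndex === -1 ? "" : tag.slice(privateUseIndex);
  // Remove "-u" plus its run of 2-8 character subtags from the rest.
  return head.replace(/-u(-[a-z0-9]{2,8})+/i, "") + tail;
}

console.log(removeUnicodeExtension("de-DE-u-co-phonebk"));   // "de-DE"
console.log(removeUnicodeExtension("en-u-nu-arab-x-u-foo")); // "en-x-u-foo"
console.log(removeUnicodeExtension("fr-CA"));                // "fr-CA"
```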

Currency codes

Formatting a number to display as currency depends upon the particular currency used, so currency codes play a role in number formatting. For example, one correct formatting for one hundred dollars USD is "$100.00" (two decimal places), while one hundred Japanese yen would be "¥100" (no decimal places). Additional characteristics determined by currency, besides decimal place count, include the currency symbol and a "long" name ("US dollar" and so on). Currency codes are three letters, traditionally capitalized. The full list is found in ISO 4217 and elsewhere.
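
The decimal-place difference can be seen directly through Intl.NumberFormat. In this sketch the helper name formatCurrency and the choice of the en-US locale for both currencies are assumptions made for illustration.

```javascript
// Hypothetical helper: format an amount in the given ISO 4217 currency.
function formatCurrency(amount, currency) {
  return new Intl.NumberFormat("en-US", { style: "currency", currency }).format(amount);
}

const dollars = formatCurrency(100, "USD");
const yen = formatCurrency(100, "JPY");
console.log(dollars); // "$100.00"
console.log(yen);     // "¥100"

// The fraction-digit count comes from the currency, not from the caller.
const yenDigits = new Intl.NumberFormat("en-US", { style: "currency", currency: "JPY" })
  .resolvedOptions().maximumFractionDigits;
console.log(yenDigits); // 0
```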

Time zone name

Date formatting requires knowledge of the time zone. Time zone names are specified by the IANA Time Zone Database.
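
The effect of the time zone is easy to see near midnight. The sketch below formats one instant in two zones; it assumes an implementation that accepts arbitrary IANA zone names in the timeZone option, which goes beyond what the first edition of ECMA-402 required. The helper name dayOfMonthIn is hypothetical.

```javascript
// One fixed instant: 23:00 UTC on 24 September 2012.
const instant = new Date(Date.UTC(2012, 8, 24, 23, 0, 0));

// Hypothetical helper: the day of the month as seen in the given time zone.
function dayOfMonthIn(timeZone) {
  return new Intl.DateTimeFormat("en-US", { day: "numeric", timeZone }).format(instant);
}

console.log(dayOfMonthIn("UTC"));        // "24"
console.log(dayOfMonthIn("Asia/Tokyo")); // "25" (nine hours ahead of UTC)

// Names that are not in the time zone database are rejected.
let rejectedZone = false;
try {
  new Intl.DateTimeFormat("en-US", { timeZone: "Not/AZone" });
} catch (e) {
  rejectedZone = e instanceof RangeError;
}
console.log(rejectedZone); // true
```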

Operations

ECMA-402 in its first iteration exposes various locale-sensitive operations. Future editions will likely expose more operations.

Collation

Collation is the process of sorting a list of strings, according to particular rules. Different locales sort text differently. Locales may also sort differently in different contexts: dictionary sort order versus phonebook sort order, say. Sorting also may or may not take into account numeric value: [1, 30, 5] order versus [1, 5, 30] order.
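
These differences can be observed through Intl.Collator. The sample strings below are assumptions chosen to show the three effects described above: locale-dependent ordering, numeric ordering, and ignoring case and accents.

```javascript
// Different locales order the same strings differently.
const words = ["z", "ä", "a"];
const german = [...words].sort(new Intl.Collator("de").compare);
const swedish = [...words].sort(new Intl.Collator("sv").compare);
console.log(german);  // [ 'a', 'ä', 'z' ]
console.log(swedish); // [ 'a', 'z', 'ä' ]

// Numeric collation compares digit runs by value instead of character by character.
const items = ["30", "5", "1"];
const asText = [...items].sort(new Intl.Collator("en").compare);
const asNumbers = [...items].sort(new Intl.Collator("en", { numeric: true }).compare);
console.log(asText);    // [ '1', '30', '5' ]
console.log(asNumbers); // [ '1', '5', '30' ]

// "base" sensitivity treats case and accent differences as equal.
const base = new Intl.Collator("en", { sensitivity: "base" });
console.log(base.compare("a", "A")); // 0
console.log(base.compare("a", "á")); // 0
```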

Number formatting

Number formatting is simple if one only wishes for a decimal format. But in many contexts, simple decimal formatting is undesirable. Numbers displayed as currencies, percentages, and plain decimals will display differently in different locales. Currencies pose additional problems, as different currencies format fractional values to different numbers of decimal places. And for sufficiently large numbers, grouping separators (in en-US, at thousands places as a comma; in many European locales, at thousands places as periods; other, more exotic forms exist) may be desirable. ECMA-402 permits customization of all these formatting choices.
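
A short sketch of these choices using Intl.NumberFormat follows. The sample number and the en-US and de-DE locales are assumptions picked to show the grouping-separator difference mentioned above.

```javascript
const n = 1234567.891;

// Plain decimal formatting: separators follow the locale.
const enDecimal = new Intl.NumberFormat("en-US").format(n);
const deDecimal = new Intl.NumberFormat("de-DE").format(n);
console.log(enDecimal); // "1,234,567.891"
console.log(deDecimal); // "1.234.567,891"

// Percent style scales by 100 and, by default, rounds to a whole number.
const percent = new Intl.NumberFormat("en-US", { style: "percent" }).format(0.256);
console.log(percent); // "26%"

// Grouping separators can be switched off entirely.
const ungrouped = new Intl.NumberFormat("en-US", { useGrouping: false }).format(n);
console.log(ungrouped); // "1234567.891"
```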

Date formatting

Locale-sensitive date formatting in ES5 admits only a single implementation-defined format with a fixed set of components. ECMA-402 enhances date formatting to allow the selection of components to be customized -- month, day, and year, for example. The way in which these components will be displayed in the final format string is locale-dependent, as different locales write out dates in different ways. (For example, a date might be formatted as "September 24, 2012" for en-US and as "24 Sept. 2012" for fr-FR.) Moreover, various styles may be chosen for the components included in the format: "narrow", "short", "long", "2-digit", and "numeric". (Exactly what these styles look like is also implementation-dependent.) These styles feed into the final computation of an appropriate pattern to use to generate a final string.
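
Component selection can be sketched with Intl.DateTimeFormat as below. The date, the locales, and the fixed UTC time zone (used so the result does not depend on where the code runs) are assumptions for the example; exact strings for other styles vary with the implementation's locale data.

```javascript
// Midnight UTC on 24 September 2012.
const date = new Date(Date.UTC(2012, 8, 24));

// Request year, month, and day; each component has its own style.
const options = { year: "numeric", month: "long", day: "numeric", timeZone: "UTC" };

const enLong = new Intl.DateTimeFormat("en-US", options).format(date);
const frLong = new Intl.DateTimeFormat("fr-FR", options).format(date);
console.log(enLong); // "September 24, 2012"
console.log(frLong); // "24 septembre 2012"

// Changing component styles changes the pattern the locale picks.
const compact = new Intl.DateTimeFormat("en-US", {
  year: "2-digit", month: "short", day: "numeric", timeZone: "UTC",
}).format(date);
console.log(compact);
```

The caller chooses which components appear and in what style; the order and punctuation are left to the locale.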

Internationalization in SpiderMonkey

SpiderMonkey includes significant support for the Internationalization API. The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU. This is an optional component of a SpiderMonkey build; support may be turned on using the --enable-intl-api configuration option. Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, to prevent bitrot; ICU interfaces of note are stubbed out in this configuration with methods that do nothing but assert. The most important differences in a SpiderMonkey build without Internationalization are that the Intl object isn't added to global objects and the legacy toLocale*String methods are implemented using SpiderMonkey's old JSLocaleCallbacks interface. Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.

The Internationalization API is enabled by default in SpiderMonkey when embedded in Firefox builds.

Code organization

ICU

International Components for Unicode is a library implementing collation, formatting, and other locale-sensitive functions. It provides the underlying functionality used in implementing Internationalization. ICU is imported in intl/icu.

ICU's source code is relatively huge and sprawling: hardly surprising for a 15+ year old project. intl/icu/source/common/unicode/ is probably the most interesting directory, from SpiderMonkey's point of view, as it contains the public headers and interfaces used by SpiderMonkey. Each header and interface within contains copious documentation of the behavior of the function/enum/etc. in question. The documentation isn't always perfectly clear, but quite often it's enough to know how to use the functionality without having to read the implementation.

ICU provides both C and C++ APIs, but the only stable interfaces are C APIs marked as stable. (C++ APIs are considered uniformly unstable. This extends even to interfaces defined entirely in public ICU headers, such as ICU smart pointers.) Given that some people reasonably want to use SpiderMonkey with a system ICU, this means we're generally limited to only the stable C API. (In one case we have to use the C++ API to access functionality; see known issues below.) Unfortunately, this also means we have to hand-roll our own smart pointer for managing ICU resources.

Most of the ICU methods indicate errors through an error code outparam. Also, such APIs check the existing value in that outparam before proceeding. Thus a sequence of ICU calls can occur without error-checking right up until the end, where a single U_FAILURE(status) will suffice to handle all errors that might occur. For example:

// Each call checks status on entry and does nothing if an earlier call failed,
// so one check after the whole sequence catches every error.
ucol_setAttribute(coll, UCOL_STRENGTH, uStrength, &status);
ucol_setAttribute(coll, UCOL_CASE_LEVEL, uCaseLevel, &status);
ucol_setAttribute(coll, UCOL_ALTERNATE_HANDLING, uAlternate, &status);
ucol_setAttribute(coll, UCOL_NUMERIC_COLLATION, uNumeric, &status);
ucol_setAttribute(coll, UCOL_NORMALIZATION_MODE, uNormalization, &status);
ucol_setAttribute(coll, UCOL_CASE_FIRST, uCaseFirst, &status);
if (U_FAILURE(status)) {
    // Release the collator before reporting the failure.
    ucol_close(coll);
    JS_ReportErrorNumber(cx, js_GetErrorMessage, NULL, JSMSG_INTERNAL_INTL_ERROR);
    return NULL;
}

Integration

The Intl object is integrated into the global object through code in js/src/builtin/Intl.cpp and js/src/builtin/Intl.h. js_InitIntlClass performs this operation when it's called during global object bootstrapping, in concert with various other initialization methods in the same file and in js/src/vm/GlobalObject.cpp. There's some particular trickiness here, as the various Intl.* constructors aren't global classes, yet need to participate in the reserved-slot constructor/prototype system used by Object, Function, Array, and so on to implement "using the original value of Object.prototype" and "as if by new Array()" and similar.

Self-hosted code

The majority of the self-hosted code implementing Internationalization is in js/src/builtin/Intl.js. This file defines the functions exposed on the various Intl.* constructor functions and the various Intl.*.prototype objects.

Internationalization in various cases requires keeping around large data tables: to record the set of supported currency codes, to record mappings between language tags (hyphenated strings describing locales and various options), and so on. This data lives in js/src/builtin/IntlData.js and is generated by js/src/builtin/make_intl_data.py. This script downloads original (large) plaintext databases, parses them, and extracts in the proper format the data used by Internationalization. Updating this static data (which should happen any time the underlying databases receive an update) should be as simple as rerunning the script. XXX Link to the mailing lists to track to learn when updates occur!

Intrinsic functions

Self-hosted code calls into various intrinsics to access ICU functionality. The full list of Internationalization intrinsics is (necessarily, at the moment -- this will probably change eventually) in js/src/vm/SelfHosting.cpp, but the intrinsics themselves are implemented in js/src/builtin/Intl.cpp.

Natively-implemented functions

All the constructor functions are implemented in C++ in js/src/builtin/Intl.cpp. These need to call into enough C++ code to create the necessary ICU data structures for it to be worth keeping them in C++.

Tests

Tests live in js/src/tests/test402, an unmodified import of the ECMA-402 test suite. The tests are run during the normal jstests/jsreftest suite. XXX Is this true yet?

Internationalization tests are treated as a third-party import. Contributions to them should go through XXX how to contribute.

As we may not have fully-correctly implemented ECMA-402 at any point, or a bug might be found before a test is committed, we require a mechanism to mark an ECMA-402 test as failing without requiring that marking be upstreamed. XXX what is this mechanism?

In builds with ECMA-402 support disabled, these tests are skipped. XXX how?

Implementation

ECMA-402 currently exposes Intl.Collator, Intl.NumberFormat, and Intl.DateTimeFormat objects. As an added wrinkle, the spec also permits initializing an existing object as one of these. The fundamental ICU data structures providing the relevant functionality are UCollator*, UNumberFormat*, and UDateFormat*, opaque pointers all. Instances are created using ucol_open, unum_open, and udat_open, passing in appropriate arguments. For objects created by the constructor, the pointer is stored in a reserved slot as a private value. For objects merely initialized by the constructor, the ICU data structures must be (inefficiently!) created anew every time.

Every object initialized as an Intl object has an associated set of internal properties. In ECMA-402 these properties are represented using ES5's traditional double-bracket notation: [[calendar]], [[initializedIntlObject]], and so on. The "ideal" means of implementing these properties would probably be ES6 private names, but they're not stable or well-understood enough to be specified yet (let alone implemented). In the meantime we associate ECMA-402 internal properties with objects using a weak map. Any object initialized as an Intl object has an internal [[initializedIntlObject]] property. This is implemented by placing all such objects as keys in a weak map (internalsMap in builtin/Intl.js). The corresponding value is an internals object. The internals object is an object with properties corresponding to the other internal properties on the object, named naturally -- "calendar", "initializedDateTimeFormat", and so on (no brackets).
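
The weak map scheme can be sketched in ordinary JavaScript as follows. Only the name internalsMap comes from the description above; the functions initializeIntlObject and getInternals are hypothetical stand-ins, and the real self-hosted code in Intl.js may be organized quite differently.

```javascript
// Maps each initialized object to its internals object.
const internalsMap = new WeakMap();

// Hypothetical: mark obj as an Intl object and return its fresh internals.
function initializeIntlObject(obj) {
  if (internalsMap.has(obj)) {
    throw new TypeError("object is already initialized as an Intl object");
  }
  const internals = Object.create(null);
  internals.initializedIntlObject = true;
  internalsMap.set(obj, internals);
  return internals;
}

// Hypothetical: look up the internals, rejecting uninitialized objects.
function getInternals(obj) {
  const internals = internalsMap.get(obj);
  if (internals === undefined) {
    throw new TypeError("object is not an Intl object");
  }
  return internals;
}

const fmt = {};
const internals = initializeIntlObject(fmt);
internals.initializedDateTimeFormat = true;
internals.calendar = "gregory";

console.log(getInternals(fmt).calendar); // "gregory"
console.log(Object.keys(fmt));           // [] (nothing is visible on the object itself)
```

Because the map holds its keys weakly, an internals object can be collected once the Intl object it describes becomes unreachable.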

Known issues

ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent. ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale. Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported. See getNumberingSystems in js/src/builtin/Intl.cpp. If ICU ever provides more comprehensive information here, we should probably use it.

The ICU interface that exposes a locale's default numbering system (see the previous item) is C++, which, as noted in the ICU section above, means it's not stable. There's an issue on file to add a C API for this. Until that's implemented and we use it, be careful about ICU upgrades.

The means for representing internal properties may not be cross-global-correct. Technically if I do var obj = {}; Intl.Collator(obj); otherWindow.Intl.Collator(obj); the second initialization should throw, because internal properties adhere to the object. The current structuring of the weak map mechanism, however, uses one weak map per global. So that example likely "succeeds" now, where it actually shouldn't. This is unlikely to be simply stumbled upon, but it's an issue. Recent self-hosting work may allow us to not clone the internals-mapping behavior into every global object that uses Intl stuff, which would solve this issue. I (Jeff) should look into this at some point, and poke Till for review on a patch if it pans out (given he implemented the relevant self-hosting improvements).
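
The cross-global problem can be modeled without real windows. In the sketch below, makeGlobal is a hypothetical stand-in for a global object that carries its own copy of the weak map; it is a simplified model of the situation described above, not a reproduction of the actual bug.

```javascript
// Hypothetical model: each "global" gets its own private weak map.
function makeGlobal() {
  const internalsMap = new WeakMap();
  return {
    initialize(obj) {
      if (internalsMap.has(obj)) {
        throw new TypeError("object is already initialized as an Intl object");
      }
      internalsMap.set(obj, { initializedIntlObject: true });
      return obj;
    },
  };
}

const windowA = makeGlobal();
const windowB = makeGlobal();

const obj = {};
windowA.initialize(obj);

// The second global's map has never seen obj, so nothing stops this.
let secondInitThrew = false;
try {
  windowB.initialize(obj);
} catch (e) {
  secondInitThrew = true;
}
console.log(secondInitThrew); // false, although the spec says this should throw
```

Sharing a single map across globals would make the second call throw, which is the fix the paragraph above anticipates.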

Other random details

....anything?...