Gaia/System/Keyboard/IME/Latin/Dictionary Blob: Difference between revisions

* If you run xml2dict.py on a word list containing an emoji, you get a warning at the line where a character's code point value is serialized into a 16-bit integer, <code>output.write(struct.pack(">H", ord(item[0])))</code>, since emoji code points lie above U+FFFF and do not fit in 16 bits.
** The resulting dictionary blob, of course, does not represent the emoji correctly.
* If you run [https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/settings/word_list_converter.js word_list_converter.js] on a word list containing an emoji, things go differently: JavaScript strings are sequences of 16-bit UTF-16 code units rather than Unicode code points. So those emoji characters get broken into [http://en.wikipedia.org/wiki/UTF-16#U.2B010000_to_U.2B10FFFF surrogate pairs].
** For example, this earth emoji '🌏', with code point value U+1F30F, is broken into the surrogate pair '\uD83C\uDF0F' in UTF-16.
** This will (conveniently) allow us to squeeze an emoji character into our current dictionary blob structure without modification. However, we'll still run into the same issue as in "Combining Characters": the engine needs to know that a surrogate pair, although it takes up the space of two nodes, is one single letter.
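
The points above can be illustrated with a minimal Python 3 sketch. It is not code from xml2dict.py or word_list_converter.js; the helper names <code>to_utf16_units</code>, <code>pack_char</code> and <code>count_letters</code> are hypothetical and exist only for this example. It shows how a code point above U+FFFF can be split into a surrogate pair before being packed as 16-bit integers, and how a consumer could count such a pair as a single letter. Note that in Python 3 packing U+1F30F directly with <code>">H"</code> raises <code>struct.error</code> rather than emitting a warning.

```python
import struct


def to_utf16_units(ch):
    """Return the UTF-16 code units (as ints) for one character.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]  # fits in a single 16-bit unit
    cp -= 0x10000  # 20 bits remain, split 10/10
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def pack_char(ch):
    """Serialize one character as big-endian 16-bit integers."""
    return b"".join(struct.pack(">H", unit) for unit in to_utf16_units(ch))


def count_letters(units):
    """Count letters in a list of code units; a surrogate pair is one letter."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


earth = "\U0001F30F"
print([hex(u) for u in to_utf16_units(earth)])  # ['0xd83c', '0xdf0f']
print(count_letters(to_utf16_units("a") + to_utf16_units(earth)))  # 2
```

The surrogate pair occupies two 16-bit slots (two nodes in the blob), so any code that measures word length or edit distance would need a check like the one in <code>count_letters</code> to treat it as one letter.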