* If you run xml2dict.py on a word list containing an emoji, you get a warning at the line where a character's code point value is serialized into a 16-bit integer, <code>output.write(struct.pack(">H", ord(item[0])))</code>, because code points above U+FFFF (such as U+1F30F) do not fit in 16 bits.
** And the resulting dictionary blob is, of course, not valid.
* If you run [https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/settings/word_list_converter.js word_list_converter.js] on a word list containing an emoji, the outcome is different: JavaScript strings are sequences of 16-bit UTF-16 code units (often loosely described as UCS-2), so those emoji characters get broken into [http://en.wikipedia.org/wiki/UTF-16#U.2B010000_to_U.2B10FFFF surrogate pairs].
** For example, the earth emoji '🌏', with code point value U+1F30F, is broken into the surrogate pair '\uD83C\uDF0F'.
** This (conveniently) allows us to squeeze an emoji character into our current dictionary blob structure without modification. However, we still run into the same issue as in "Combining Characters": the engine needs to know that a surrogate pair, although it takes up the space of two nodes, is one single letter.
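The two behaviours above can be illustrated with a minimal sketch. It assumes Python 3, where an out-of-range value makes <code>struct.pack(">H", ...)</code> raise <code>struct.error</code> rather than emit a warning. The helper names <code>to_utf16_units</code>, <code>pack_char</code> and <code>count_letters</code> are hypothetical and exist only for illustration; they are not part of xml2dict.py or word_list_converter.js.

```python
import struct


def to_utf16_units(ch):
    """Return the UTF-16 code units (one or two 16-bit ints) for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]  # fits in a single 16-bit unit
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def pack_char(ch):
    """Pack each code unit as a big-endian 16-bit integer."""
    return b"".join(struct.pack(">H", unit) for unit in to_utf16_units(ch))


def count_letters(units):
    """Count letters in a list of code units; a surrogate pair is one letter."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


earth = "\U0001F30F"  # the earth emoji, U+1F30F

# Packing the raw code point into 16 bits fails (Python 3 behaviour).
try:
    struct.pack(">H", ord(earth))
    fits_in_16_bits = True
except struct.error:
    fits_in_16_bits = False

earth_units = to_utf16_units(earth)   # the surrogate pair for U+1F30F
earth_bytes = pack_char(earth)        # four bytes, i.e. two 16-bit slots
letters = count_letters(to_utf16_units("a") + earth_units)
```

Here <code>earth_units</code> holds the pair 0xD83C, 0xDF0F, which takes two 16-bit slots in the packed output, while <code>count_letters</code> still reports the emoji as a single letter. That distinction is what the engine would need to make.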