Gaia/System/Keyboard/IME/Latin/Dictionary Blob: Difference between revisions

Revision as of 04:36, 11 February 2015

2,378 bytes added , 11 February 2015

no edit summary

Mnjul


@@ Line 429: / Line 429: @@
 * Consider the three tuples with frequency = 2. In the first run they are ordered ('3', 2), ('$', 2), ('ä', 2); in the second run the order is ('ä', 2), ('$', 2), ('3', 2). The written files therefore differ.
 * Recall from the xxd diff result above that the difference is the interchange of [2400 0000 0200] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Between these addresses is [3300 0000 0200], which encodes ('3', 2). So the only change is the order of characters ''of the same frequency'', which does not affect any functionality, since the character table expects no particular order among characters of the same frequency.
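The effect can be reproduced with a minimal sketch, assuming the character table is built by sorting (character, frequency) tuples on frequency alone. Python's sort is stable, so tuples with equal frequency keep whatever order the input happened to have, and that input order may differ between runs. The function name <code>sort_by_frequency</code> and the sample data (including the made-up entry ('e', 9)) are hypothetical and not taken from xml2dict.py.

```python
def sort_by_frequency(entries):
    """Sort (char, frequency) tuples by descending frequency only."""
    # sorted() is stable, even with reverse=True: ties keep their input order.
    return sorted(entries, key=lambda item: item[1], reverse=True)


# Same entries, but the equal-frequency tuples arrive in a different order,
# as could happen when they come from an unordered container.
first_run = [('e', 9), ('3', 2), ('$', 2), ('ä', 2)]
second_run = [('e', 9), ('ä', 2), ('$', 2), ('3', 2)]

table_a = sort_by_frequency(first_run)
table_b = sort_by_frequency(second_run)

# The tables hold the same entries with the same frequency sequence,
# yet their serialized bytes would differ because the ties are ordered differently.
print(table_a)
print(table_b)
```

If reproducible output were ever wanted, a secondary sort key such as the code point (for example <code>key=lambda item: (-item[1], item[0])</code>) would likely be enough, although, as noted above, nothing depends on the tie order.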
+== Ideas for Future Improvements ==
+=== Combining Characters ===
+* [http://en.wikipedia.org/wiki/Combining_character Combining characters] aren't handled very gracefully right now. These characters take separate code points and occupy separate char positions in a string, but they render as one "letter" and may lexically be one "letter" in some specific language.
+* Example: 'री', in [http://en.wikipedia.org/wiki/Devanagari Devanagari] (the script used for Hindi), is composed of 'र' ('\u0930') and 'ी' ('\u0940').
+** (Alright, I know Hindi is not covered in Latin IME. But still...)
+* Currently our engine regards a base character and its combining character as... two independent characters (no surprise), and they take the space of two nodes.
+* Ideally, however, if the two code points together represent one single letter in a language, our engine should treat them as one character, as though they were one char node.
+* How?
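One possible direction, sketched here under loud assumptions: before building nodes, group each base character with the combining marks that follow it, using the Unicode general category (Mn, Mc, Me) from Python's <code>unicodedata</code> module. The helper name <code>group_letters</code> is hypothetical, this is only a rough approximation of proper grapheme cluster segmentation, and it says nothing about how the blob format or the engine would store such a unit.

```python
import unicodedata


def group_letters(text):
    """Split text into units of one base character plus any following marks."""
    units = []
    for ch in text:
        # General categories Mn, Mc and Me all start with 'M' (combining marks).
        if units and unicodedata.category(ch).startswith('M'):
            units[-1] += ch
        else:
            units.append(ch)
    return units


word = '\u0930\u0940'  # 'री': two code points, one visible letter
print(len(word))
print(group_letters(word))
```

Checking the general category rather than <code>unicodedata.combining()</code> is deliberate in this sketch: spacing marks such as '\u0940' have a canonical combining class of 0, so the combining-class test alone would miss them.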
+=== Emoji and Non-BMP Characters ===
+Unfortunately, not all Unicode characters can fit in the 2-byte code space.
+The [http://en.wikipedia.org/wiki/Emoji#Blocks emoji blocks] in the Unicode [http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane SMP] are a very good example:
+The emoji characters occupy code points U+1F300 to U+1F6FF, so an emoji code point cannot fit within two bytes.
+* If you run xml2dict.py on a word list containing an emoji, then the line that serializes a character's code point value into a 16-bit integer, <code>output.write(struct.pack(">H", ord(item[0])))</code>, produces a warning (or, depending on the Python version, an error), because the value does not fit in 16 bits.
+** And the resulting dictionary blob is, of course, not valid.
+* If you run word_list_converter.js on a word list containing an emoji, then... alright, JavaScript strings are sequences of 16-bit UTF-16 code units (tongue-twisting sentence, I know). So those emoji characters get broken into [http://en.wikipedia.org/wiki/UTF-16#U.2B010000_to_U.2B10FFFF surrogate pairs].
+** For example, the earth emoji '🌏', with code point value U+1F30F, is broken into the surrogate pair '\uD83C\uDF0F' in UTF-16.
+** This will (conveniently) allow us to squeeze an emoji character into our current dictionary blob structure without modification. However, we'll still run into the same issue as in "Combining Characters": the engine needs to know that a surrogate pair, although it takes the space of two nodes, is one single letter.
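Both behaviours can be seen in a short Python 3 sketch. It first shows that packing U+1F30F as a single 16-bit integer fails, then splits the code point into UTF-16 code units and packs each unit separately. The helper names <code>to_utf16_units</code> and <code>pack_char</code> are hypothetical; this is not how xml2dict.py currently works, only an illustration of what the surrogate-pair route would write.

```python
import struct


def to_utf16_units(ch):
    """Return the 16-bit UTF-16 code units for a single character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def pack_char(ch):
    """Pack each code unit as a big-endian 16-bit integer."""
    return b''.join(struct.pack('>H', unit) for unit in to_utf16_units(ch))


earth = '\U0001F30F'  # the earth emoji from the example above

# In Python 3, packing the raw code point into 16 bits raises struct.error.
try:
    struct.pack('>H', ord(earth))
    packed_directly = True
except struct.error:
    packed_directly = False

print(packed_directly)
print([hex(unit) for unit in to_utf16_units(earth)])
print(pack_char(earth).hex())
```

The packed bytes match what <code>earth.encode('utf-16-be')</code> produces, so the two char nodes would hold 0xD83C and 0xDF0F, and the engine would still need to treat that pair as one letter.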