* Let's look at the three tuples with frequency = 2. On the first run they are ordered ('3', 2), ('$', 2), ('ä', 2); on the second run the order is ('ä', 2), ('$', 2), ('3', 2). Thus, the written files are different.
* Let's go back to the xxd diff result above and recall that the difference is the interchange of [2400 0000 0020] and [e400 0000 0200]. These bytes encode ('$', 2) and ('ä', 2). Between these addresses is [3300 0000 0020], encoding ('3', 2). So the only change is in the order of characters ''of the same frequency'', which does not affect any functionality, since the character table expects no particular order among characters of the same frequency.
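If byte-identical output across runs is ever wanted, one option is to break frequency ties explicitly. The following is only a minimal sketch: the function name <code>sort_char_table</code>, the descending-frequency order, and the sample entry ('e', 5) are assumptions for illustration, not what xml2dict.py actually does.

```python
def sort_char_table(char_freqs):
    """Sort (char, freq) tuples by descending frequency.

    Ties are broken by code point, so the result does not depend on
    the order in which the tuples arrive.
    """
    return sorted(char_freqs, key=lambda item: (-item[1], ord(item[0])))


# Hypothetical inputs: the same tuples arriving in two different orders.
run1 = [('e', 5), ('3', 2), ('$', 2), ('ä', 2)]
run2 = [('ä', 2), ('e', 5), ('$', 2), ('3', 2)]

# Both runs now yield the same character table.
assert sort_char_table(run1) == sort_char_table(run2)
print(sort_char_table(run1))
```

With a tie-breaker like this, the three frequency-2 entries would always be written in the same order.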
== Ideas for Future Improvements ==
=== Combining Characters ===
* [http://en.wikipedia.org/wiki/Combining_character Combining characters] aren't handled very gracefully right now. These characters have separate code points and occupy separate char positions in a string, but they appear as one "letter" and may be lexically one "letter" in a specific language.
* Example: 'री', in [http://en.wikipedia.org/wiki/Devanagari Devanagari] (the script used for Hindi), is composed of 'र' ('\u0930') and 'ी' ('\u0940').
** (Alright, I know Hindi is not covered in Latin IME. But still...)
* Currently our engine regards two combining characters as... two independent characters (no surprise), and they take the space of two nodes.
* Ideally, however, if the two characters together represent one single letter in a language, our engine should treat them as one character, as though they were a single char node.
* How?
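One possible starting point, sketched here in Python purely as an illustration, is to group each base character with the mark characters that follow it before building nodes. The helper name <code>group_letters</code> is hypothetical, and checking the Unicode general category is a simplification: it is not full grapheme-cluster segmentation and says nothing about how the engine or the blob format would store the groups.

```python
import unicodedata


def group_letters(text):
    """Group each base character with the mark characters that follow it."""
    groups = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "Mark" categories.
        if groups and unicodedata.category(ch).startswith('M'):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


word = '\u0930\u0940'        # 'री' from the example above
print(len(word))             # two code points
print(group_letters(word))   # one group holding both code points
```

The category check is used instead of <code>unicodedata.combining()</code> because '\u0940' is a spacing mark whose canonical combining class is 0.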
=== Emoji and Non-BMP Characters ===
Unfortunately, not all Unicode characters can fit in the 2-byte code space.
The [http://en.wikipedia.org/wiki/Emoji#Blocks emoji blocks] in the Unicode [http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane Supplementary Multilingual Plane (SMP)] are a very good example:
The emoji characters occupy code points U+1F300 to U+1F6FF, so an emoji character cannot fit within two bytes.
* If you run xml2dict.py through a word list containing an emoji, then at the line where a character code point value is serialized into a 16-bit integer: <code>output.write(struct.pack(">H", ord(item[0])))</code>, you get a warning.
** And the resulting dictionary blob is, of course, incorrect.
* If you run word_list_converter.js through a word list containing an emoji, then, alright, JavaScript strings are sequences of 16-bit UTF-16 code units, not of Unicode code points (tongue twisting sentence, I know). So those emoji characters get broken into [http://en.wikipedia.org/wiki/UTF-16#U.2B010000_to_U.2B10FFFF surrogate pairs].
** For example, this earth emoji '🌏', with code point value U+1F30F, is broken into the surrogate pair '\uD83C\uDF0F' in UTF-16.
** This will (conveniently) allow us to squeeze an emoji character into our current dictionary blob structure without modification. However, we'll still run into the same issue as in "Combining Characters": the engine needs to know that a surrogate pair, albeit taking the space of two nodes, is one single letter.
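To make the surrogate-pair arithmetic concrete, here is a minimal Python 3 sketch of how a non-BMP character could be split into two 16-bit values before packing. The helpers <code>to_utf16_units</code> and <code>pack_char</code> are hypothetical names, not part of xml2dict.py, and the sketch assumes a Python 3 interpreter, where <code>ord()</code> returns the full code point.

```python
import struct


def to_utf16_units(ch):
    """Return the UTF-16 code units (one or two integers) for one character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def pack_char(ch):
    """Pack a character as big-endian 16-bit integers, one per code unit."""
    return b''.join(struct.pack('>H', unit) for unit in to_utf16_units(ch))


earth = '\U0001F30F'  # the earth emoji from the example above
print([hex(unit) for unit in to_utf16_units(earth)])
print(pack_char(earth))
```

Packing each code unit separately keeps every value within the 16-bit range that <code>">H"</code> accepts, at the cost of the emoji occupying two entries, which is the same situation word_list_converter.js ends up in.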