IDN Display Algorithm: Difference between revisions

Revision as of 13:59, 30 January 2012

104 bytes removed , 30 January 2012

no edit summary

Gerv


@@ Line 36: / Line 36: @@
 The plan is to augment our whitelist with a check that all the characters in a label
 come from the same script, or from one of a limited and defined number of allowable combinations. The
-hope is that any intra-script near-homographs will be recognisable to people who understand that script. To
+hope is that any intra-script near-homographs will be recognisable to people who understand that script.
-be clear: if a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
+We will retain the whitelist as well, because a) removing it might break
+some domains which worked previously, and b) if a registry submits a
+good policy, we have the ability to give them more freedom than the default restrictions do.
+So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it
+met the criteria above.
+==Algorithm==
+If a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
 algorithm will apply.
 [http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],
-which is about Unicode and security, defines a "Moderately Restrictive" profile. It says
+defines a "Moderately Restrictive" profile. It says the following (with edits for clarity):
-the following (with edits for clarity):
 <blockquote>
-No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile].
+No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile]
+(defined for us by the IDNA2008 standard,
+[http://tools.ietf.org/html/rfc5892 RFC 5892]).
-All characters in each label must be from a single script, or from
+All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations:
-the combinations:
-* Latin + Han + Hiragana + Katakana; or
+* Common + Inherited + Latin + Han + Hiragana + Katakana; or
-* Latin + Han + Bopomofo; or
+* Common + Inherited + Latin + Han + Bopomofo; or
-* Latin + Han + Hangul; or
+* Common + Inherited + Latin + Han + Hangul; or
-* Latin + any single other script except Cyrillic, Greek, or Cherokee
+* Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee
 </blockquote>
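The script-mixing rule quoted above can be sketched as a set test. This is only an illustration, not the Firefox implementation: the function names and the <code>script_of</code> callback are hypothetical, and the sketch assumes the caller can supply each character's Unicode Script property value (Python's standard library does not expose it). It also assumes that a subset of an allowed combination, such as Han + Hangul without Latin, is acceptable.

```python
from typing import Callable, Iterable

# Script names are assumed to be Unicode Script property values.
ALWAYS_ALLOWED = frozenset({"Common", "Inherited"})
CJK_COMBINATIONS = (
    frozenset({"Latin", "Han", "Hiragana", "Katakana"}),
    frozenset({"Latin", "Han", "Bopomofo"}),
    frozenset({"Latin", "Han", "Hangul"}),
)
# Scripts that may not be mixed with Latin in a single label.
NO_LATIN_MIX = frozenset({"Cyrillic", "Greek", "Cherokee"})


def scripts_allowed(scripts: Iterable[str]) -> bool:
    """Return True if this set of scripts may appear together in one label."""
    remaining = set(scripts) - ALWAYS_ALLOWED
    if len(remaining) <= 1:
        return True  # single script (plus Common/Inherited)
    if any(remaining <= combo for combo in CJK_COMBINATIONS):
        return True  # one of the listed Far East combinations
    if len(remaining) == 2 and "Latin" in remaining:
        (other,) = remaining - {"Latin"}
        return other not in NO_LATIN_MIX  # Latin + one other script
    return False


def label_passes(label: str, script_of: Callable[[str], str]) -> bool:
    """Apply the rule to a label, given a (hypothetical) script lookup."""
    return scripts_allowed(script_of(ch) for ch in label)
```

A label that fails this test would be displayed as Punycode unless its TLD is on the whitelist.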
-This system would allow any single script, and also most scripts + Latin, which
-is a common mixing, plus script mixings common in the Far East where
-they use multiple scripts at once.
-The Identifier Profile is defined for us by the IDNA2008 standard,
-[http://tools.ietf.org/html/rfc5892 RFC 5892]. So when we upgrade to IDNA 2008 (a separate discussion),
-that should hopefully eliminate a large number of non-alphabet characters for us.
 [http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives
@@ Line 69: / Line 70: @@
 "The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions,
 to provide data about characters which are only used in a few (but more than one) script.
-The sample code in #39 should be updated to include that, so handling such cases."
+The sample code in #39 should be updated to include that, so handling such cases." We should
+take this enhancement when the data becomes available; in the meantime, Common and Inherited
+characters are permitted without restriction.
 Additional checks:
-* Display as Punycode labels which use more than one numbering system
+* Display as Punycode labels which use more than one numbering system (we would need a list of numbering systems in Unicode)
-* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (Should be < 16k of data for a simple binary test.)
+* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database (should be < 16k of data for a simple binary test)
-* Display as Punycode labels which have sequences of the same nonspacing mark.
+* Display as Punycode labels which have sequences of the same nonspacing mark (we would need a list of, or the name of a class containing, all such marks)
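Two of the additional checks can be approximated with data already in the Unicode Character Database. The sketch below uses Python's <code>unicodedata</code> module and rests on two assumptions that are not stated in the text above: a "numbering system" is identified by the code point of its zero digit (decimal digits, category Nd, are encoded in contiguous ascending runs), and "sequences of the same nonspacing mark" means two adjacent identical characters of category Mn. The function names are hypothetical, and non-decimal numerals are not covered.

```python
import unicodedata


def numbering_systems(label: str) -> set[int]:
    """Return the zero-digit code point of each decimal digit system used."""
    systems = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Subtracting the digit's value gives the system's zero digit.
            systems.add(ord(ch) - unicodedata.decimal(ch))
    return systems


def mixes_numbering_systems(label: str) -> bool:
    """True if the label uses decimal digits from more than one system."""
    return len(numbering_systems(label)) > 1


def has_repeated_nonspacing_mark(label: str) -> bool:
    """True if the same nonspacing mark (category Mn) appears twice in a row."""
    return any(
        a == b and unicodedata.category(a) == "Mn"
        for a, b in zip(label, label[1:])
    )
```

A label for which either function returns true would be displayed as Punycode. The simplified/traditional Chinese check is not sketched here because it needs the Unihan data.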
-We will retain the whitelist as well, because a) removing it might break
-some domains which worked previously, and b) if a registry submits a
-good policy, we have the ability to give them more freedom than the default restrictions do.
-So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it
-met the criteria above.
-I think that this would make us display a superset of the IDN domains
-that the other browsers display, in a way which was consistent across
-all copies of Firefox (maintaining the certainty which is a benefit of the current system)
-and which was pretty safe from spoofing.
 ===Possible Issues and Open Questions===
@@ Line 111: / Line 103: @@
 If problems arose in the future (e.g. with homographs between a particular
-script and Latin), we would need to quickly issue a robust response
+script and Latin), our response would be that in the end, it is up to registries
-making the point that it is up to registries to make sure that their customers
+to make sure that their customers
 cannot rip each other off. Browsers can put some technical restrictions in place,
 but we are not in a position to do this job for them while still maintaining
-a level playing field for non-Latin scripts on the web.
+a level playing field for non-Latin scripts on the web. The registries are the
+only people in a position to implement the proper checking here. For our part,
+we want to make sure we don't treat non-Latin scripts as second-class citizens.
 ===Transition===