IDN Display Algorithm: Difference between revisions

Revision as of 18:19, 20 January 2012

1,921 bytes added, 20 January 2012

no edit summary

Gerv

@@ Line 1: / Line 1: @@
 {{draft}}
-This page outlines a plan for changing the mechanism by which Firefox decides whether to display a given IDN domain name in its Unicode or Punycode form.
+This page outlines a plan for changing the mechanism by which Firefox decides whether to display a given IDN domain label (a domain name is made up of one or more labels, separated by dots) in its Unicode or Punycode form.
 ==Background==
@@ Line 7: / Line 7: @@
 ===The Problem===
-If we just display any possible IDN domain name, we open ourselves up to [http://en.wikipedia.org/wiki/IDN_homograph_attack IDN homograph attacks], where one identical-looking domain can spoof another. So we have to have some mechanism to decide which ones to display and which ones to not display, which does not involve comparing the domain in question against every other single domain which exists (which is impossible).
+If we just display any possible IDN domain label, we open ourselves up to [http://en.wikipedia.org/wiki/IDN_homograph_attack IDN homograph attacks], where one identical-looking domain can spoof another. So we have to have some mechanism to decide which ones to display and which ones to not display, which does not involve comparing the domain in question against every other single domain which exists (which is impossible).
 ===Current Algorithm===
-Our current algorithm is to display as Unicode all IDNs within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in.
+Our current algorithm is to display as Unicode all IDN labels within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in.
-We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof "/" or ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ?)
+We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ?)
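The current algorithm described above can be sketched roughly as follows. This is only an illustration, not Firefox's implementation: the names `TLD_WHITELIST`, `CHAR_BLACKLIST`, `to_ascii` and `display_domain` are hypothetical, and the entries in both sets are made-up examples rather than the contents of the real lists. The sketch uses Python's raw `punycode` codec and skips the normalisation step that a real IDNA implementation performs.

```python
# Illustrative data only: these are NOT the real whitelist or blacklist.
TLD_WHITELIST = {"de", "org"}
CHAR_BLACKLIST = {
    "\u2044",  # FRACTION SLASH, could be mistaken for "/"
    "\u3002",  # IDEOGRAPHIC FULL STOP, could be mistaken for "."
    "\u200b",  # ZERO WIDTH SPACE, invisible
}


def to_ascii(label: str) -> str:
    """Return the Punycode (xn--) form of a label; ASCII labels pass through."""
    if label.isascii():
        return label
    # Raw Punycode encoding; a real implementation would normalise first.
    return "xn--" + label.encode("punycode").decode("ascii")


def display_domain(domain: str) -> str:
    """Pick the Unicode or Punycode form of each label of a domain name."""
    labels = domain.split(".")
    tld_trusted = labels[-1].lower() in TLD_WHITELIST
    shown = []
    for label in labels:
        blacklisted = any(ch in CHAR_BLACKLIST for ch in label)
        if tld_trusted and not blacklisted:
            shown.append(label)            # Unicode form
        else:
            shown.append(to_ascii(label))  # Punycode form
    return ".".join(shown)
```

For example, `display_domain("bücher.de")` keeps the Unicode form under these sample lists, while `display_domain("bücher.com")` falls back to `xn--bcher-kva.com` because `com` is not in the sample whitelist.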
 ===Need For Change===
@@ Line 45: / Line 45: @@
 <blockquote>
-No characters in the identifier can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile].
+No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile].
-All characters in each identifier must be from a single script, or from
+All characters in each label must be from a single script, or from
 the combinations:
@@ Line 55: / Line 55: @@
 * Latin + any single other script except Cyrillic, Greek, or Cherokee
 </blockquote>
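A minimal sketch of the script rule quoted above, under some loud assumptions. Python's standard library does not expose the Unicode Script property, so the hypothetical helper `rough_scripts` guesses a script from the first word of each letter's character name, which is only an approximation (it mislabels fullwidth Latin letters, for instance). Non-letters are ignored, as a stand-in for the Common and Inherited scripts. Only the "Latin + one other script" combination is checked, because the other permitted combinations are not shown in this excerpt.

```python
import unicodedata

# Scripts that may not be mixed with Latin, per the rule quoted above.
EXCLUDED_WITH_LATIN = {"CYRILLIC", "GREEK", "CHEROKEE"}


def rough_scripts(label: str) -> set:
    """Approximate the set of scripts in a label from character names.

    Assumption: the first word of a letter's Unicode name is its script.
    This is a heuristic, not the real Script property.
    """
    scripts = set()
    for ch in label:
        if not unicodedata.category(ch).startswith("L"):
            continue  # digits, hyphens, marks: treated as Common/Inherited
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def passes_script_rule(label: str) -> bool:
    """True if the label is single-script or Latin plus one allowed script."""
    scripts = rough_scripts(label)
    if len(scripts) <= 1:
        return True
    if len(scripts) == 2 and "LATIN" in scripts:
        (other,) = scripts - {"LATIN"}
        return other not in EXCLUDED_WITH_LATIN
    return False
```

Under this heuristic a label mixing Latin letters with a Cyrillic "а" fails the check and would be shown as Punycode, while a label mixing Latin with CJK ideographs passes.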
-I believe the Identifier Profile is defined for us by the IDNA standard. When we upgrade to IDNA 2008 (a separate discussion),
-that should hopefully eliminate a large number of non-alphabet characters for us.
 This system would allow any single script, and also most scripts + Latin, which
@@ Line 63: / Line 60: @@
 they use multiple scripts at once.
-[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] has some more ideas for refinements.
+The Identifier Profile is defined for us by the IDNA2008 standard,
-We should look at that carefully and decide which ones to adopt.
+[http://tools.ietf.org/html/rfc5892 RFC 5892]. So when we upgrade to IDNA 2008 (a separate discussion),
+that should hopefully eliminate a large number of non-alphabet characters for us.
+[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives
+a definition for how we detect whether a string is "single script".
 We will retain the whitelist as well, because a) removing it might break
@@ Line 79: / Line 80: @@
 ===Possible Issues and Open Questions===
-The key thing is defining "single script" in a way which works for the languages people actually use.
+The key thing is defining "single script" in a way which works well enough for the languages people actually use.
-Unicode has properties we can use for this; we would need to check they
+There is no requirement that every word must be registrable as a domain name (the Irish have had to do without
-were adequate.
+o'malley.ie since the beginnings of the DNS) but we don't want to disallow whole swathes of a language's
+vocabulary.
+We would need to see how well the TR 39 algorithm works in practice. The document itself lists a number of
+"possible enhancements", which we would need to consider:
+* Count Common or Inherited characters that are only used with a limited number of scripts as being in either or each script, instead of ignoring them completely. For example, U+0640 ARABIC TATWEEL is used with the scripts Arabic and Syriac, but not Latin or Hangul. This work would be potentially time-consuming and complicated; we may have to call in domain experts.
+* Check for mixing numbers from different systems, such as U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO with U+06F0 ( ۰ ) EXTENDED ARABIC-INDIC DIGIT ZERO, or U+09EA ( ৪ ) BENGALI DIGIT FOUR with U+0038 ( 8 ) DIGIT EIGHT. Perhaps we could restrict non-Arabic numerals to particular languages, e.g. Bengali numerals to Bengali?
+* Check for strings which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. Does our platform have access to this data? If not, how large is it?
+* Detect sequences of the same nonspacing mark.
+* Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv]
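Two of the enhancements listed above (mixed number systems and repeated nonspacing marks) are simple enough to sketch with the `unicodedata` module. The function names here are hypothetical, and the digit check assumes that each set of decimal digits occupies a contiguous run of ten code points starting at its zero, so it may not match what TR 39 intends in every case.

```python
import unicodedata


def digit_systems(label: str) -> set:
    """Return the code point of the zero digit for each number system used."""
    systems = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Assumes digits 0-9 of a system are contiguous, so subtracting
            # the digit's value gives the code point of that system's zero.
            systems.add(ord(ch) - unicodedata.decimal(ch))
    return systems


def has_mixed_digits(label: str) -> bool:
    """True if the label uses decimal digits from more than one system."""
    return len(digit_systems(label)) > 1


def has_repeated_nonspacing_mark(label: str) -> bool:
    """True if the same nonspacing mark (category Mn) appears twice in a row."""
    return any(
        a == b and unicodedata.category(a) == "Mn"
        for a, b in zip(label, label[1:])
    )
```

With these definitions, both digit pairs from the list above (U+0660 with U+06F0, and U+09EA with U+0038) are reported as mixed.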
+Also:
+* Should we document our character hard-blacklist as part of this exercise? Are any characters in it legal in IDNA2008?
+* Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this...
+* Do we ever want to display errors other than just by using Punycode?
 ===Downsides===