IDN Display Algorithm: Difference between revisions

From MozillaWiki
Line 1: Line 1:
{{draft}}
This page explains the plan for changing the mechanism by which Firefox decides whether to display a given IDN domain label (a domain name is made up of one or more labels, separated by dots) in its Unicode or Punycode form.
 
This page outlines a plan for changing the mechanism by which Firefox decides whether to display a given IDN domain label (a domain name is made up of one or more labels, separated by dots) in its Unicode or Punycode form.


==Background==
==Background==
Line 13: Line 11:
Our current algorithm is to display as Unicode all IDN labels within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in.
Our current algorithm is to display as Unicode all IDN labels within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in.


We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ?)
We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ, for IDNA2008?)


===Need For Change===
===Need For Change===
Line 38: Line 36:
The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label
The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label
all come from the same script, or are from one of a limited and defined number of allowable combinations. The
all come from the same script, or are from one of a limited and defined number of allowable combinations. The
hope is that any intra-script near-homographs will be recognisable to people who understand that script.
hope is that any intra-script near-homographs will be recognisable to people who understand that script. To
be clear: if a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
algorithm will apply.


[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],  
[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],  
which is about Unicode and security, defines a "Moderately Restrictive" profile which we could use. It says  
which is about Unicode and security, defines a "Moderately Restrictive" profile. It says  
the following (with edits for clarity):
the following (with edits for clarity):


Line 65: Line 65:


[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives  
[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives  
a definition for how we detect whether a string is "single script".
a definition for how we detect whether a string is "single script". Some Common or Inherited characters
are only used in a small number (but more than one) script. Mark Davis writes:
"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions,
to provide data about characters which are only used in a few (but more than one) script.
The sample code in #39 should be updated to include that, so handling such cases."
 
Additional checks:
 
* Display as Punycode labels which use more than one numbering system
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (Should be < 16k of data for a simple binary test.)
* Display as Punycode labels which have sequences of the same nonspacing mark.


We will retain the whitelist as well, because a) removing it might break
We will retain the whitelist as well, because a) removing it might break
Line 80: Line 90:
===Possible Issues and Open Questions===
===Possible Issues and Open Questions===


The key thing is defining "single script" in a way which works well enough for the languages people actually use.
The following issues are open, but should not block initial implementation.
There is no requirement that every word must be registrable as a domain name (the Irish have had to do without
o'malley.ie since the beginnings of the DNS) but we don't want to disallow whole swathes of a language's
vocabulary.


We would need to see how well the TR 39 algorithm works in practice. The document itself lists a number of
Suggestion from TR#39:
"possible enhancements", which we would need to consider:


* Count Common or Inherited characters that are only used with a limited number of scripts as being in either or each script, instead of ignoring them completely. For example, U+0640 ARABIC TATWEEL is used with the scripts Arabic and Syriac, but not Latin or Hangul. This work would be potentially time-consuming and complicated; we may have to call in domain experts.
* Check for mixing numbers from different systems, such as U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO with U+06F0 ( ۰ ) EXTENDED ARABIC-INDIC DIGIT ZERO, or U+09EA ( ৪ ) BENGALI DIGIT FOUR with U+0038 ( 8 ) DIGIT EIGHT. Perhaps we could restrict non-Arabic numerals to particular languages, e.g. Bengali numerals to Bengali?
* Check for strings which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. Does our platform have access to this data? If not, how large is it?
* Detect sequences of the same nonspacing mark.
* Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv]
* Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv]


Also:
Also:


* Should we document our character hard-blacklist as part of this exercise? Are any characters in it legal in IDNA2008?
* Should we document our character hard-blacklist as part of this exercise? It's already visible in the prefs. Are any characters in it legal in IDNA2008 anyway?
* Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this...
* Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this...
* Do we ever want to display errors other than just by using Punycode?
* Do we ever want to display errors other than just by using Punycode? I suggest not...
* Should we add Armenian to the list of scripts which cannot mix with Latin?
* Should we add Armenian to the list of scripts which cannot mix with Latin?  


===Downsides===
===Downsides===
Line 117: Line 119:
===Transition===
===Transition===


If we decide to adopt this plan, in between adopting it and shipping a Firefox with
In between adopting this plan and shipping a Firefox with
the restrictions implemented, we would admit into the whitelist any
the restrictions implemented, we will admit into the whitelist any
TLD whose anti-spoofing policies at registration time were at least as strong as  
TLD whose anti-spoofing policies <i>at registration time</i> were at least as strong as  
those outlined above.
those outlined above.

Revision as of 13:17, 30 January 2012

This page explains the plan for changing the mechanism by which Firefox decides whether to display a given IDN domain label (a domain name is made up of one or more labels, separated by dots) in its Unicode or Punycode form.

Background

The Problem

If we just display any possible IDN domain label, we open ourselves up to IDN homograph attacks, where one domain spoofs another that looks identical to it. So we need some mechanism to decide which labels to display as Unicode and which to display as Punycode, and that mechanism cannot involve comparing the domain in question against every other domain which exists (which is impossible).

Current Algorithm

Our current algorithm is to display as Unicode all IDN labels within TLDs on our whitelist, and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in.

We also have a character blacklist: characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ, for IDNA2008?)

Need For Change

This strategy provides pretty good user protection, and it provides consistency - every Firefox everywhere works the same. However, it does mean that IDNs do not work at all in many TLDs, because the registry (for whatever reason) has not applied for inclusion, or because we do not think they have sufficiently strong protections in place. In addition, ICANN is about to open a large number of new TLDs. So either maintaining a whitelist is going to become burdensome, or the list will become wildly out of date and we will not be serving our users.

Other Browsers

The Chromium IDN page has a good summary of the policies of Chrome/Chromium and the other browsers. Unfortunately, no consensus has emerged on how to do this. Those other mechanisms were considered, but many of them depend on the configuration of the user's computer (e.g. installed languages), and this does not give site owners any confidence that their IDN domain name will be correctly displayed for all their visitors (and no way of telling if it's not).

Proposal

The plan is to augment our whitelist with something based on ascertaining whether the characters in a label all come from the same script, or are from one of a limited and defined number of allowable combinations. The hope is that any intra-script near-homographs will be recognisable to people who understand that script. To be clear: if a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following algorithm will apply.

Unicode Technical Report 36, which is about Unicode and security, defines a "Moderately Restrictive" profile. It says the following (with edits for clarity):

No characters in the label can be outside of the Identifier Profile.

All characters in each label must be from a single script, or from the combinations:

  • Latin + Han + Hiragana + Katakana; or
  • Latin + Han + Bopomofo; or
  • Latin + Han + Hangul; or
  • Latin + any single other script except Cyrillic, Greek, or Cherokee

This system would allow any single script, and also most scripts + Latin, which is a common mixing, plus script mixings common in the Far East where they use multiple scripts at once.
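
The combination rule above can be sketched as a check on the set of scripts found in a label. This is only an illustration, not Firefox code: the function name is hypothetical, the mapping from characters to script names (from the Unicode Character Database) is assumed to happen elsewhere, and ignoring Common and Inherited is an assumption about how the script set is built.

```python
# Scripts that are not counted when deciding which scripts a label mixes
# (assumption: Common and Inherited characters are ignored at this stage).
IGNORED = {"Common", "Inherited"}

# The three multi-script combinations listed in the profile.
ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]

# Scripts that may not be mixed with Latin.
NO_LATIN_MIX = {"Cyrillic", "Greek", "Cherokee"}


def scripts_allowed(scripts):
    """Return True if a label using exactly these scripts passes the rule."""
    scripts = set(scripts) - IGNORED
    # A single script (or nothing but Common/Inherited) is always fine.
    if len(scripts) <= 1:
        return True
    # Any subset of one of the listed combinations is fine.
    if any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return True
    # Latin plus exactly one other script, unless that script is excluded.
    if len(scripts) == 2 and "Latin" in scripts:
        (other,) = scripts - {"Latin"}
        return other not in NO_LATIN_MIX
    return False


print(scripts_allowed({"Latin", "Arabic"}))    # Latin + one other script
print(scripts_allowed({"Latin", "Cyrillic"}))  # excluded mixing
```

Treating a subset of a listed combination as acceptable (for example Han + Hiragana without Latin) is a reading of the profile, not something this page states explicitly.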

The Identifier Profile is defined for us by the IDNA2008 standard, RFC 5892. So when we upgrade to IDNA2008 (a separate discussion), that should hopefully eliminate a large number of non-alphabet characters for us.

Unicode Technical Report 39 gives a definition for how we detect whether a string is "single script". Some Common or Inherited characters are only used in a small number of scripts (but more than one). Mark Davis writes: "The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions, to provide data about characters which are only used in a few (but more than one) script. The sample code in #39 should be updated to include that, so handling such cases."
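
One way to picture how Script_Extensions data could feed a "single script" test is to give every character a set of scripts and intersect those sets. The sketch below is an assumption about the approach, not the TR 39 sample code: the function name is hypothetical, an empty set stands for a Common or Inherited character with no extension data, and the script sets passed in are toy values (for instance, treating U+0640 ARABIC TATWEEL as used with Arabic and Syriac).

```python
def is_single_script(char_script_sets):
    """Return True if some script is shared by every character.

    char_script_sets holds one set of script names per character.
    An empty set means "usable with any script" (an assumption made
    here for Common/Inherited characters without extension data).
    """
    shared = None
    for scripts in char_script_sets:
        if not scripts:
            continue  # places no constraint on the label
        shared = set(scripts) if shared is None else shared & scripts
    # None means no character constrained the label at all.
    return shared is None or len(shared) > 0


# Toy data: an Arabic letter followed by a character used with Arabic and Syriac.
print(is_single_script([{"Arabic"}, {"Arabic", "Syriac"}]))
# Toy data: a Latin letter followed by that same character.
print(is_single_script([{"Latin"}, {"Arabic", "Syriac"}]))
```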

Additional checks:

  • Display as Punycode any label which uses more than one numbering system.
  • Display as Punycode any label which contains both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (Should be < 16k of data for a simple binary test.)
  • Display as Punycode any label which has sequences of the same nonspacing mark.
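
Two of these checks can be sketched with Python's standard `unicodedata` module. This is a rough illustration under stated assumptions, not the planned implementation: the function names are hypothetical, a "numbering system" is identified by the code point of its zero digit (which relies on decimal digits being encoded in contiguous runs of ten), and "sequences of the same nonspacing mark" is read as the same mark appearing twice in a row. The simplified/traditional Chinese check is left out because it needs Unihan data that the standard library does not provide.

```python
import unicodedata


def numbering_systems(label):
    """Return the set of zero-digit code points for the digits in label."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Subtracting the digit's value lands on that system's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def mixes_numbering_systems(label):
    """True if the label contains digits from more than one system."""
    return len(numbering_systems(label)) > 1


def has_repeated_nonspacing_mark(label):
    """True if a nonspacing mark (category Mn) occurs twice in a row."""
    previous = None
    for ch in label:
        if unicodedata.category(ch) == "Mn" and ch == previous:
            return True
        previous = ch
    return False


# ARABIC-INDIC DIGIT ZERO next to EXTENDED ARABIC-INDIC DIGIT ZERO.
print(mixes_numbering_systems("\u0660\u06f0"))
# A base letter followed by two identical combining acute accents.
print(has_repeated_nonspacing_mark("a\u0301\u0301"))
```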

We will retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN label would be shown as Unicode if its TLD is on the whitelist or, failing that, if the label meets the criteria above.
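
The resulting display rule might be sketched as follows. Everything here is illustrative: `display_domain` and its parameters are hypothetical names, the whitelist is assumed to be a collection of TLD strings, the per-label script and character checks are passed in as a callback (`label_is_safe`), and the character blacklist is not modelled.

```python
from typing import Callable, Iterable


def display_domain(domain: str,
                   whitelist: Iterable[str],
                   label_is_safe: Callable[[str], bool]) -> str:
    """Return the domain with each label in Unicode or Punycode form."""
    labels = domain.split(".")
    tld_whitelisted = labels[-1] in set(whitelist)
    shown = []
    for label in labels:
        # ASCII labels need no decision; whitelisted TLDs always get
        # Unicode; otherwise the label must pass the checks.
        if label.isascii() or tld_whitelisted or label_is_safe(label):
            shown.append(label)
        else:
            shown.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(shown)


# A label that fails the checks, under a TLD that is not whitelisted.
print(display_domain("b\u00fccher.example", {"de"}, lambda label: False))
```

Passing the checks in as a callback keeps the whitelist decision separate from the script rules, which matches the way the two are described as independent routes to Unicode display.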

I think that this would make us display a superset of the IDN domains that the other browsers display, in a way which was consistent across all copies of Firefox (maintaining the certainty which is a benefit of the current system) and which was pretty safe from spoofing.

Possible Issues and Open Questions

The following issues are open, but should not block initial implementation.

Suggestion from TR#39:

  • Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv]

Also:

  • Should we document our character hard-blacklist as part of this exercise? It's already visible in the prefs. Are any characters in it legal in IDNA2008 anyway?
  • Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this...
  • Do we ever want to display errors other than just by using Punycode? I suggest not...
  • Should we add Armenian to the list of scripts which cannot mix with Latin?

Downsides

This system would permit whole-script confusables (all-Latin "scope.tld" vs all-Cyrillic "ѕсоре.tld"). However, so do the solutions of the other browsers, and it has not proved to be a significant problem so far. If there is a problem, everyone is equally affected.
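
To make the example concrete, the short sketch below shows that the two labels are different strings even though they look alike, and what the Punycode form of the Cyrillic one would look like if it were displayed that way. The helper names are hypothetical; the code only uses Python's standard `unicodedata` module and `punycode` codec. Under the proposal both labels are single-script, so both would be shown as Unicode.

```python
import unicodedata

latin = "scope"
# The all-Cyrillic look-alike, written with escapes so the code points
# are unambiguous: U+0455 U+0441 U+043E U+0440 U+0435.
cyrillic = "\u0455\u0441\u043e\u0440\u0435"


def to_ace(label):
    """Return the ASCII-compatible (Punycode) form of a single label."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def char_names(label):
    """Return the Unicode character name of each character in label."""
    return [unicodedata.name(ch) for ch in label]


print(latin == cyrillic)     # the strings differ despite looking alike
print(char_names(cyrillic))  # every name begins with CYRILLIC
print(to_ace(cyrillic))      # the form a user would see as Punycode
```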

If problems arose in the future (e.g. with homographs between a particular script and Latin), we would need to quickly issue a robust response making the point that it is up to registries to make sure that their customers cannot rip each other off. Browsers can put some technical restrictions in place, but we are not in a position to do this job for them while still maintaining a level playing field for non-Latin scripts on the web.

Transition

In between adopting this plan and shipping a Firefox with the restrictions implemented, we will admit into the whitelist any TLD whose anti-spoofing policies at registration time were at least as strong as those outlined above.