IDN Display Algorithm

From MozillaWiki
Revision as of 17:26, 20 January 2012

THIS PAGE IS A WORKING DRAFT
The page may be difficult to navigate, and some information on its subject might be incomplete and/or evolving rapidly.
If you have any questions or ideas, please add them as a new topic on the discussion page.

This page outlines a plan for changing the mechanism by which Firefox decides whether to display a given IDN domain name in its Unicode or Punycode form.

Background

The Problem

If we simply displayed every IDN domain name in Unicode, we would open ourselves up to IDN homograph attacks, where one domain spoofs another that looks identical. So we need a mechanism for deciding which names to display in Unicode and which not to, and it cannot rely on comparing the domain in question against every other domain that exists (which is impossible).

Current Algorithm

Our current algorithm is to display as Unicode all IDNs within TLDs on our whitelist, and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required a registry to update or implement its policy as a condition of inclusion.

We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof "/" or ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ?)
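
The current decision can be sketched roughly as follows. This is a minimal illustration, not Firefox's implementation: the function name `show_as_unicode` is hypothetical, and the whitelist and blacklist entries are sample values chosen for illustration rather than the real lists.

```python
# Hypothetical sample data, NOT Firefox's real whitelist or blacklist.
TLD_WHITELIST = {"de", "jp"}
CHAR_BLACKLIST = {
    "\u2044",  # FRACTION SLASH, could imitate "/"
    "\u2215",  # DIVISION SLASH, could imitate "/"
    "\u3002",  # IDEOGRAPHIC FULL STOP, could imitate "."
    "\u200b",  # ZERO WIDTH SPACE, invisible
}


def show_as_unicode(host: str) -> bool:
    """Return True to display `host` in Unicode, False to fall back to Punycode."""
    # Blacklisted characters are never displayed, whatever the TLD.
    if any(ch in CHAR_BLACKLIST for ch in host):
        return False
    # Otherwise the TLD alone decides.
    tld = host.rsplit(".", 1)[-1].lower()
    return tld in TLD_WHITELIST
```

With these sample lists, "bücher.de" would be shown in Unicode, while the same label under a TLD that is not listed would be shown as Punycode.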

Need For Change

This strategy provides pretty good user protection, and it provides consistency - every Firefox everywhere works the same. However, it does mean that IDNs do not work at all in many TLDs, because the registry (for whatever reason) has not applied for inclusion, or because we do not think they have sufficiently strong protections in place. In addition, ICANN is about to open a large number of new TLDs. So either maintaining a whitelist is going to become burdensome, or the list will become wildly out of date and we will not be serving our users.

Other Browsers

The Chromium IDN page has a good summary of the policies of Chrome/Chromium and the other browsers.

Proposal

The plan is to augment our whitelist with a check on whether all the characters in a label come from a single script, or from one of a limited, defined set of allowable script combinations.

Unicode Technical Report 36, which is about Unicode and security, defines a "Moderately Restrictive" profile which we could use.

It says the following (with edits for clarity):

No characters in the identifier can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile].

All characters in each identifier must be from a single script, or from the combinations:

  • Latin + Han + Hiragana + Katakana; or
  • Latin + Han + Bopomofo; or
  • Latin + Han + Hangul; or
  • Latin + any single other script except Cyrillic, Greek, or Cherokee
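
The script rule above could be sketched as below. This is an approximation under stated assumptions: the Python standard library does not expose the Unicode Script property, so the sketch guesses a character's script from the first word of its character name, which is only a heuristic. The helper names (`scripts_in`, `is_moderately_restrictive`) are hypothetical, the Identifier Profile check is left out, and subsets of the listed combinations (for example Han + Hiragana without Latin) are assumed to be allowed.

```python
import unicodedata

# Script-neutral categories for this sketch: decimal digits, dashes
# (including the hyphen) and combining marks.
NEUTRAL_CATEGORIES = {"Nd", "Pd", "Mn", "Mc"}

ALLOWED_COMBINATIONS = [
    {"LATIN", "HAN", "HIRAGANA", "KATAKANA"},
    {"LATIN", "HAN", "BOPOMOFO"},
    {"LATIN", "HAN", "HANGUL"},
]
EXCLUDED_WITH_LATIN = {"CYRILLIC", "GREEK", "CHEROKEE"}


def scripts_in(label: str) -> set:
    """Approximate the set of scripts in `label` from character names (heuristic)."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch) in NEUTRAL_CATEGORIES:
            continue
        prefix = unicodedata.name(ch, "UNKNOWN").split()[0]
        # CJK ideographs are named "CJK UNIFIED IDEOGRAPH-..."; call them HAN.
        scripts.add("HAN" if prefix == "CJK" else prefix)
    return scripts


def is_moderately_restrictive(label: str) -> bool:
    """Apply the single-script / allowed-combination rule to one label."""
    scripts = scripts_in(label)
    if "UNKNOWN" in scripts:
        return False  # unnamed character: be conservative
    if len(scripts) <= 1:
        return True  # single script (or only neutral characters)
    if any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return True
    if len(scripts) == 2 and "LATIN" in scripts:
        (other,) = scripts - {"LATIN"}
        return other not in EXCLUDED_WITH_LATIN
    return False
```

A real implementation would use proper Script property data (for example from ICU) rather than character names, which misclassify some characters.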

I believe the character profile is defined for us by the IDNA standard. When we upgrade to IDNA 2008 (a separate discussion), that should hopefully eliminate a large number of non-alphabet characters for us.

This system would allow any single script, as well as most scripts mixed with Latin (a common combination), plus the script combinations common in East Asian languages that routinely use several scripts at once.

Unicode TR39 has some more ideas for refinements. We should look at that carefully and decide which ones to adopt.

We will retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above.
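
The combined rule might look like the sketch below. The function name `show_as_unicode_proposed` and its parameters are hypothetical; the per-label check is passed in as a function so the example stays self-contained, and applying the check to every label of the host (including the TLD) is an assumption, since this page does not specify that level of detail.

```python
from typing import Callable, Set


def show_as_unicode_proposed(
    host: str,
    tld_whitelist: Set[str],
    label_ok: Callable[[str], bool],
) -> bool:
    """Show Unicode if the TLD is whitelisted or, failing that, if every label passes."""
    labels = host.lower().split(".")
    if labels[-1] in tld_whitelist:
        return True  # whitelisted registries keep their current behaviour
    # Assumption: every label must meet the script criteria.
    return all(label_ok(label) for label in labels)


# Placeholder predicate for demonstration only: accepts ASCII-only labels.
# A real caller would pass the script-combination check instead.
demo_check = str.isascii
print(show_as_unicode_proposed("b\u00fccher.de", {"de"}, demo_check))
print(show_as_unicode_proposed("b\u00fccher.example", {"de"}, demo_check))
```

Checking the whitelist first means a domain that displays as Unicode today never regresses to Punycode, which is reason (a) for keeping the list.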

I think that this would make us display a superset of the IDN domains that the other browsers display, in a way which was consistent across all copies of Firefox (maintaining the certainty which is a benefit of the current system) and which was pretty safe from spoofing.

Possible Issues and Open Questions

The key thing is defining "single script" in a way which works for the languages people actually use. Unicode has properties we can use for this; we would need to check they were adequate.

Downsides

This system would permit whole-script confusables (all-Latin "scope.tld" vs. all-Cyrillic "ѕсоре.tld"). However, so do the solutions of the other browsers, and it has not proved to be a problem so far. If a problem does arise, every browser is equally affected.
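
To see why a single-script rule cannot catch this case, the short sketch below compares a Latin label with a Cyrillic lookalike. The helper `name_prefixes` is hypothetical and, as an assumption, uses the first word of each Unicode character name as a rough stand-in for the script.

```python
import unicodedata


def name_prefixes(label: str) -> set:
    """First word of each character's Unicode name (a rough script indicator)."""
    return {unicodedata.name(ch).split()[0] for ch in label}


latin = "scope"
# Cyrillic DZE, ES, O, ER, IE: renders much like Latin "scope" in many fonts.
cyrillic = "\u0455\u0441\u043e\u0440\u0435"

print(name_prefixes(latin))
print(name_prefixes(cyrillic))
print(latin == cyrillic)
```

Each label is single-script on its own, so both pass a script-mixing check even though the strings are different.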