IDN Display Algorithm
Latest revision as of 18:30, 17 April 2017

This page explains how Firefox decides whether to display a given IDN label (a domain name is made up of one or more labels, separated by dots) in its Unicode (i.e. normal) or Punycode (i.e. gobbledigook) form.
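As a quick illustration, Python's standard `idna` codec shows the two forms of a label. (The codec implements the older IDNA 2003 mapping rather than the IDNA2008 rules Firefox uses, but for a simple label like this one the ACE form is the same.)

```python
# Convert a domain name between its Unicode and Punycode ("ACE") forms,
# label by label, using Python's stdlib "idna" codec (IDNA 2003).
unicode_form = "bücher.example"
ace_form = unicode_form.encode("idna").decode("ascii")
print(ace_form)  # xn--bcher-kva.example

# Pure-ASCII labels pass through unchanged.
assert "example.com".encode("idna").decode("ascii") == "example.com"
```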

Implementing this plan was covered by bug 722299.

Background

The Problem

If we just display any possible IDN label, we open ourselves up to IDN homograph attacks, where one identical-looking domain can spoof another. So we have to have some mechanism to decide which ones to display and which ones to not display, which does not involve comparing the domain in question against every other single domain which exists (which is impossible).
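For instance, a hypothetical spoof can be demonstrated with Python's `unicodedata` module: two labels that render identically in many fonts are nevertheless different at the code-point level.

```python
import unicodedata

latin = "apple.com"
spoof = "аpple.com"  # first letter is U+0430 CYRILLIC SMALL LETTER A

# Different strings, despite looking identical in many fonts.
assert latin != spoof
assert unicodedata.name(latin[0]) == "LATIN SMALL LETTER A"
assert unicodedata.name(spoof[0]) == "CYRILLIC SMALL LETTER A"

# The spoof's Punycode form makes the difference visible to the user.
print(spoof.encode("idna").decode("ascii"))
```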

Previous Algorithm

We updated how Firefox dealt with this in 2012. Our previous algorithm was to display as Unicode all IDN labels within TLDs on our whitelist, and display as Punycode otherwise. We checked the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator had to apply directly (they could not be nominated by another person), and on several occasions we required policy updates or implementation as a condition of getting in.

We also had a character blacklist - characters we would never display under any circumstances. This included those which could be used to spoof the separators "/" and ".", as well as invisible characters. This blacklist still exists.

Why We Changed

The old strategy provided pretty good user protection, and it provided consistency - every Firefox everywhere worked the same. However, it meant that IDNs did not work at all in many TLDs, because the registry (for whatever reason) had not applied for inclusion, or because we didn't think they had sufficiently strong protections in place. In addition, ICANN was about to open a large number of new TLDs. So either maintaining a whitelist was going to become burdensome, or the list was going to become wildly out of date and we would not be serving our users.

The New Idea

Instead, we now augment our whitelist with something based on ascertaining whether all the characters in a label come from the same script, or are from one of a limited and defined set of allowable combinations. The hope is that any intra-script near-homographs will be recognisable to people who understand that script.

We retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN is shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above.

Algorithm

If a TLD is in the whitelist, we unconditionally display Unicode. If it is not, the following algorithm applies.
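The two-step decision can be sketched as follows. This is an illustration only: `TLD_WHITELIST` and `passes_restriction_profile` are hypothetical stand-ins, not Firefox's actual names or data structures.

```python
# Hypothetical sketch of the top-level display decision described above.
TLD_WHITELIST = {"de", "jp", "se"}  # example entries only

def passes_restriction_profile(label: str) -> bool:
    # Placeholder for the "Moderately Restrictive" checks detailed below;
    # here we just pass anything pure-ASCII, for demonstration.
    return label.isascii()

def display_form(label: str, tld: str) -> str:
    if tld in TLD_WHITELIST:
        return "unicode"    # whitelisted TLD: always display Unicode
    if passes_restriction_profile(label):
        return "unicode"    # passes the script-mixing checks
    return "punycode"       # otherwise fall back to Punycode

print(display_form("bücher", "de"))   # unicode (whitelisted TLD)
print(display_form("bücher", "com"))  # punycode (fails placeholder check)
```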

Unicode Technical Report 39 defines a "Moderately Restrictive" profile. It says the following (with edits for clarity, after discussion with the author):

No characters in the label can be outside of the Identifier Profile (defined for us by the IDNA2008 standard, RFC 5892).

All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations:

  • Common + Inherited + Latin + Han + Hiragana + Katakana; or
  • Common + Inherited + Latin + Han + Bopomofo; or
  • Common + Inherited + Latin + Han + Hangul; or
  • Common + Inherited + Latin + any single other "Recommended" or "Aspirational" script except Cyrillic or Greek

Unicode Technical Report 39 gives a definition for how we detect whether a string is "single script".
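As a rough illustration of these checks, the sketch below uses a toy `script_of()` lookup covering only a handful of code-point ranges; a real implementation derives each character's Script (and Script_Extensions) property from the Unicode Character Database, and the "Latin + any single other Recommended or Aspirational script" case is omitted here for brevity.

```python
# Illustrative only: real script data comes from the UCD's Scripts.txt.
def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F:  return "Latin"
    if 0x0400 <= cp <= 0x04FF:  return "Cyrillic"
    if 0x3040 <= cp <= 0x309F:  return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:  return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:  return "Han"
    return "Common"             # digits, hyphen, etc. (oversimplified)

ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]

def scripts_allowed(label: str) -> bool:
    # Common and Inherited never restrict mixing, so drop them first.
    scripts = {script_of(ch) for ch in label} - {"Common", "Inherited"}
    if len(scripts) <= 1:
        return True             # single script: allowed
    return any(scripts <= combo for combo in ALLOWED_COMBINATIONS)

print(scripts_allowed("東京tokyo"))  # True: Han + Latin is allowed
```

Note that an all-Cyrillic label such as "ѕсоре" is single-script and therefore passes; that is the whole-script confusable case discussed under Downsides below.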

Some Common or Inherited characters are used in only a small number (but more than one) of scripts. Mark Davis writes: "The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions, to provide data about characters which are only used in a few (but more than one) script. The sample code in #39 should be updated to include that, so handling such cases." This data is now available, but not yet in the Firefox platform (bug 844106). In the meantime, Common and Inherited characters are permitted without restriction.

We also implement additional checks, as suggested by TR #39 sections 5.3 and 5.4:

  • Display as Punycode any label which uses more than one numbering system
  • Display as Punycode any label which contains both simplified-only and traditional-only Chinese characters -- bug 857481
  • Display as Punycode any label which contains a sequence of the same nonspacing mark
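The first and third checks can be sketched with only the Python standard library; the numbering-system test below relies on the fact that each system's decimal digits are encoded as a run of ten code points starting at that system's zero. (The simplified/traditional Chinese check needs data not in the stdlib, so it is omitted.) This is a sketch of the technique, not Firefox's implementation.

```python
import unicodedata

def mixes_numbering_systems(label: str) -> bool:
    # Two different "zero" code points means two numbering systems.
    zeros = set()
    for ch in label:
        d = unicodedata.decimal(ch, None)
        if d is not None:
            zeros.add(ord(ch) - d)
    return len(zeros) > 1

def has_repeated_nonspacing_mark(label: str) -> bool:
    # Flag the same nonspacing mark (category Mn) appearing twice in a row.
    prev = None
    for ch in label:
        if unicodedata.category(ch) == "Mn" and ch == prev:
            return True
        prev = ch
    return False

print(mixes_numbering_systems("42"))        # False: ASCII digits only
print(mixes_numbering_systems("4\u0662"))   # True: ASCII 4 + Arabic-Indic 2
print(has_repeated_nonspacing_mark("a\u0301\u0301"))  # True: doubled acute
```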

Possible Issues and Open Questions

The following issues are still open, but were not considered important enough to block initial implementation.

Further suggestions from TR #39:

  • Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv]

Also:

  • Should we document our character hard-blacklist as part of this exercise? It's already visible in the prefs. Are any characters in it legal in IDNA2008 anyway? We should review it to make sure we aren't disallowing characters now allowed by IDNA 2008.
  • Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this...
  • Do we ever want to display errors other than just by using Punycode? I suggest not...
  • Should we add Armenian to the list of scripts which cannot mix with Latin?

Downsides

This system permits whole-script confusables (all-Latin "scope.tld" vs all-Cyrillic "ѕсоре.tld"). However, so do the solutions of the other browsers, and it has not proved to be a significant problem so far. If there is a problem, every browser is equally affected.

Our response to this issue is that in the end, it is up to registries to make sure that their customers cannot rip each other off. Browsers can put some technical restrictions in place, but we are not in a position to do this job for them while still maintaining a level playing field for non-Latin scripts on the web. The registries are the only people in a position to implement the proper checking here. For our part, we want to make sure we don't treat non-Latin scripts as second-class citizens.