IDN Display Algorithm: Difference between revisions

Line 36: Line 36:
The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label
all come from the same script, or are from one of a limited and defined number of allowable combinations. The
hope is that any intra-script near-homographs will be recognisable to people who understand that script. To
hope is that any intra-script near-homographs will be recognisable to people who understand that script.  
be clear: if a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
 
We will retain the whitelist as well, because a) removing it might break
some domains which worked previously, and b) if a registry submits a
good policy, we have the ability to give them more freedom than the default restrictions do.
So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it
met the criteria above.
 
==Algorithm==
 
If a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
algorithm will apply.
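
The top-level decision described above (whitelisted TLD first, otherwise the per-label checks) could be sketched as follows. This is only an illustration, not the actual Firefox code: the function name <code>display_form</code>, the whitelist being a set of TLD strings, and the <code>label_passes</code> callback are all hypothetical, and the choice to fall back to Punycode label by label is an assumption. Python's built-in <code>idna</code> codec (which implements IDNA 2003) is used here just to produce the Punycode form.

```python
def display_form(domain, tld_whitelist, label_passes):
    """Return the form of `domain` to show to the user.

    tld_whitelist: set of TLD strings (hypothetical representation).
    label_passes:  callable taking one label and returning True if it
                   meets the display criteria (hypothetical hook).
    """
    labels = domain.split(".")
    tld = labels[-1]

    # Whitelisted TLDs are always displayed as Unicode.
    if tld in tld_whitelist:
        return domain

    # Otherwise, assumed behaviour: each label that fails the checks
    # is shown in its Punycode (ASCII) form.
    shown = []
    for label in labels:
        if label_passes(label):
            shown.append(label)
        else:
            shown.append(label.encode("idna").decode("ascii"))
    return ".".join(shown)
```

For example, with a stand-in check that only accepts ASCII labels, <code>display_form("bücher.example", {"de"}, str.isascii)</code> falls back to Punycode for the first label, while <code>display_form("bücher.de", {"de"}, str.isascii)</code> is returned unchanged because the TLD is whitelisted.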


[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],  
which is about Unicode and security, defines a "Moderately Restrictive" profile. It says  
defines a "Moderately Restrictive" profile. It says the following (with edits for clarity):
the following (with edits for clarity):


<blockquote>
<blockquote>
No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile].
No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile]
(defined for us by the IDNA2008 standard,
[http://tools.ietf.org/html/rfc5892 RFC 5892]).


All characters in each label must be from a single script, or from
All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations:
the combinations:


* Latin + Han + Hiragana + Katakana; or
* Common + Inherited + Latin + Han + Hiragana + Katakana; or
* Latin + Han + Bopomofo; or
* Common + Inherited + Latin + Han + Bopomofo; or
* Latin + Han + Hangul; or
* Common + Inherited + Latin + Han + Hangul; or
* Latin + any single other script except Cyrillic, Greek, or Cherokee
* Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee
</blockquote>
</blockquote>
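
The script rules quoted above can be sketched in code. This is a minimal illustration under stated assumptions, not a real implementation: <code>SCRIPT_RANGES</code> is a tiny, approximate, hand-written table invented for this example (a real implementation would take the Script property from the Unicode Character Database), the function names are hypothetical, and the Identifier Profile check is not covered.

```python
# ASSUMPTION: a tiny, approximate code point -> script table for
# illustration only. Real code would use Scripts.txt from the UCD.
SCRIPT_RANGES = [
    (0x002D, 0x002D, "Common"),      # hyphen-minus
    (0x0030, 0x0039, "Common"),      # ASCII digits
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0300, 0x036F, "Inherited"),   # combining diacritics (approximate)
    (0x0370, 0x03FF, "Greek"),       # approximate
    (0x0400, 0x04FF, "Cyrillic"),    # approximate
    (0x0E01, 0x0E3A, "Thai"),
    (0x13A0, 0x13FF, "Cherokee"),
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x3105, 0x312F, "Bopomofo"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]

ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]
LATIN_MIX_EXCLUDED = {"Cyrillic", "Greek", "Cherokee"}


def script_of(ch):
    """Look up the (approximate) script of one character."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Unknown"


def is_moderately_restrictive(label):
    """Apply the script-mixing rules of the quoted profile to one label."""
    scripts = {script_of(ch) for ch in label}
    if "Unknown" in scripts:
        return False  # sketch: reject what the toy table cannot classify
    # Common and Inherited characters may accompany any script.
    scripts -= {"Common", "Inherited"}
    if len(scripts) <= 1:
        return True
    if any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return True
    # Latin + any single other script except Cyrillic, Greek, or Cherokee.
    if "Latin" in scripts and len(scripts) == 2:
        (other,) = scripts - {"Latin"}
        return other not in LATIN_MIX_EXCLUDED
    return False
```

With this table, an all-Cyrillic label passes, but a Latin label containing a single Cyrillic "а" (U+0430) does not, which is the classic homograph case the rules are meant to catch.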
This system would allow any single script, most scripts mixed with Latin (a
common combination), and the script mixtures common in East Asia, where
several scripts are routinely used together.
The Identifier Profile is defined for us by the IDNA2008 standard,
[http://tools.ietf.org/html/rfc5892 RFC 5892]. So when we upgrade to IDNA 2008 (a separate discussion),
that should hopefully eliminate a large number of non-alphabet characters for us.


[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives  
Line 69: Line 70:
"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions,  
to provide data about characters which are only used in a few (but more than one) script.  
The sample code in #39 should be updated to include that, so handling such cases."
The sample code in #39 should be updated to include that, so handling such cases." We should
take this enhancement when the data becomes available; in the meantime, Common and Inherited
characters are permitted without restriction.


Additional checks:


* Display as Punycode labels which use more than one numbering system
* Display as Punycode labels which use more than one numbering system (we would need a list of numbering systems in Unicode)
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (Should be < 16k of data for a simple binary test.)
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database (should be < 16k of data for a simple binary test)
* Display as Punycode labels which have sequences of the same nonspacing mark.
* Display as Punycode labels which have sequences of the same nonspacing mark (we would need a list of, or the name of a class containing, all such marks)
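
Two of the additional checks above can be sketched with Python's standard <code>unicodedata</code> module. This is an illustration under assumptions, not the proposed implementation: the function names are hypothetical, "nonspacing mark" is taken to mean general category <code>Mn</code>, and a "numbering system" is identified by the code point of its zero digit, which only covers decimal digits (category <code>Nd</code>). The simplified/traditional Chinese check is not shown because it needs the Unihan data.

```python
import unicodedata


def has_repeated_nonspacing_mark(label):
    """True if the same nonspacing mark (category Mn) occurs twice in a row."""
    prev = None
    for ch in label:
        if ch == prev and unicodedata.category(ch) == "Mn":
            return True
        prev = ch
    return False


def numbering_systems(label):
    """Return the set of decimal numbering systems used in the label.

    ASSUMPTION: each system is identified by the code point of its zero.
    Decimal digits are encoded as contiguous runs 0..9, so subtracting a
    digit's value from its code point yields the zero of its run.
    """
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def fails_additional_checks(label):
    """True if the label should be displayed as Punycode under these two checks."""
    return has_repeated_nonspacing_mark(label) or len(numbering_systems(label)) > 1
```

For instance, a label mixing ASCII digits with ARABIC-INDIC DIGIT THREE (U+0663) uses two numbering systems and would fail, as would a base letter followed by two identical combining acute accents.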
 
We will retain the whitelist as well, because a) removing it might break
some domains which worked previously, and b) if a registry submits a
good policy, we have the ability to give them more freedom than the default restrictions do.
So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it
met the criteria above.
 
I think that this would make us display a superset of the IDN domains
that the other browsers display, in a way which was consistent across
all copies of Firefox (maintaining the certainty which is a benefit of the current system)
and which was pretty safe from spoofing.


===Possible Issues and Open Questions===
Line 111: Line 103:


If problems arose in the future (e.g. with homographs between a particular
script and Latin), we would need to quickly issue a robust response
script and Latin), our response would be that in the end, it is up to registries  
making the point that it is up to registries to make sure that their customers
to make sure that their customers
cannot rip each other off. Browsers can put some technical restrictions in place,
but we are not in a position to do this job for them while still maintaining
a level playing field for non-Latin scripts on the web.
a level playing field for non-Latin scripts on the web. The registries are the
only people in a position to implement the proper checking here. For our part,
we want to make sure we don't treat non-Latin scripts as second-class citizens.  


===Transition===