Line 36: | Line 36: | ||
The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label | The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label | ||
all come from the same script, or are from one of a limited and defined number of allowable combinations. The | all come from the same script, or are from one of a limited and defined number of allowable combinations. The | ||
hope is that any intra-script near-homographs will be recognisable to people who understand that script. | hope is that any intra-script near-homographs will be recognisable to people who understand that script. | ||
We will retain the whitelist as well, because a) removing it might break | |||
some domains which worked previously, and b) if a registry submits a | |||
good policy, we have the ability to give them more freedom than the default restrictions do. | |||
So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it | |||
met the criteria above. | |||
==Algorithm== | |||
If a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following | |||
algorithm will apply. | algorithm will apply. | ||
[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36], | [http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36], | ||
defines a "Moderately Restrictive" profile. It says the following (with edits for clarity): | |||
<blockquote> | <blockquote> | ||
No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile]. | No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile] | ||
(defined for us by the IDNA2008 standard, | |||
[http://tools.ietf.org/html/rfc5892 RFC 5892]). | |||
All characters in each label must be from a single script, or from | All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations: | ||
the combinations: | |||
* Latin + Han + Hiragana + Katakana; or | * Common + Inherited + Latin + Han + Hiragana + Katakana; or | ||
* Latin + Han + Bopomofo; or | * Common + Inherited + Latin + Han + Bopomofo; or | ||
* Latin + Han + Hangul; or | * Common + Inherited + Latin + Han + Hangul; or | ||
* Latin + any single other script except Cyrillic, Greek, or Cherokee | * Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee | ||
</blockquote> | </blockquote> | ||
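As an illustration only, the script-combination rule quoted above could be sketched roughly as follows. This is not the browser's implementation: the names <code>SCRIPT_RANGES</code>, <code>script_of</code> and <code>label_passes_script_check</code> are hypothetical, and the hand-written range table is a coarse stand-in for the real Script property data in the Unicode Character Database. The sketch follows the revised wording, in which Common and Inherited characters are ignored before the combinations are tested.

```python
# Coarse, hand-written approximation of the Unicode Script property.
# ASSUMPTION: a real implementation would read Scripts.txt from the UCD;
# some code points inside these ranges are really Common or Inherited.
SCRIPT_RANGES = [
    (0x002D, 0x002D, "Common"),     # hyphen-minus
    (0x0030, 0x0039, "Common"),     # ASCII digits
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0300, 0x036F, "Inherited"),  # combining diacritical marks
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x052F, "Cyrillic"),
    (0x1100, 0x11FF, "Hangul"),
    (0x13A0, 0x13FF, "Cherokee"),
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x3105, 0x312F, "Bopomofo"),
    (0x3400, 0x4DBF, "Han"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]

# The three fixed combinations listed in the quoted profile.
ALLOWED_COMBINATIONS = [
    frozenset({"Latin", "Han", "Hiragana", "Katakana"}),
    frozenset({"Latin", "Han", "Bopomofo"}),
    frozenset({"Latin", "Han", "Hangul"}),
]

# Scripts that may not be mixed with Latin under the last rule.
LATIN_EXCLUDED = frozenset({"Cyrillic", "Greek", "Cherokee"})


def script_of(ch):
    """Return the (approximate) script name for one character."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Unknown"


def label_passes_script_check(label):
    """True if the label's scripts form one of the permitted sets."""
    scripts = {script_of(ch) for ch in label} - {"Common", "Inherited"}
    if "Unknown" in scripts:
        return False  # be conservative about anything not in the table
    if len(scripts) <= 1:
        return True   # a single script (plus Common/Inherited)
    if any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return True
    if "Latin" in scripts and len(scripts) == 2:
        (other,) = scripts - {"Latin"}
        return other not in LATIN_EXCLUDED
    return False
```

Under this sketch a label that passes would be shown as Unicode and one that fails would fall back to Punycode. For example, <code>label_passes_script_check("p\u0430ypal")</code> (Latin with a Cyrillic "а") fails, while a label mixing Han, Hiragana and Latin passes.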
[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives | [http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives | ||
Line 69: | Line 70: | ||
"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions, | "The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions, | ||
to provide data about characters which are only used in a few (but more than one) script. | to provide data about characters which are only used in a few (but more than one) script. | ||
The sample code in #39 should be updated to include that, so handling such cases." | The sample code in #39 should be updated to include that, so handling such cases." We should | ||
adopt this enhancement when the data becomes available; in the meantime, Common and Inherited | |||
characters are permitted without restriction. | |||
Additional checks: | Additional checks: | ||
* Display as Punycode labels which use more than one numbering system | * Display as Punycode labels which use more than one numbering system (we would need a list of numbering systems in Unicode) | ||
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database | * Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database (should be < 16k of data for a simple binary test) | ||
* Display as Punycode labels which have sequences of the same nonspacing mark | * Display as Punycode labels which have sequences of the same nonspacing mark (we would need a list of, or the name of a class containing, all such marks) | ||
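Two of the checks above can be approximated with data that already ships in Python's standard <code>unicodedata</code> module; the following is a rough sketch under that assumption, not a statement of how the checks will actually be implemented. The function names are hypothetical. "Nonspacing mark" is taken to mean general category <code>Mn</code>, and a "numbering system" is identified by the code point of its zero digit, which relies on decimal digits (category <code>Nd</code>) being encoded as contiguous runs of 0 to 9. The simplified/traditional Chinese check is omitted because it needs the Unihan data.

```python
import unicodedata


def has_repeated_nonspacing_mark(label):
    """True if the same nonspacing mark (category Mn) occurs twice in a row."""
    prev = None
    for ch in label:
        if ch == prev and unicodedata.category(ch) == "Mn":
            return True
        prev = ch
    return False


def uses_multiple_numbering_systems(label):
    """True if decimal digits from more than one numbering system appear."""
    # Subtracting a digit's value from its code point gives the code point
    # of that system's zero, which serves as an identifier for the system.
    zeros = {
        ord(ch) - unicodedata.decimal(ch)
        for ch in label
        if unicodedata.category(ch) == "Nd"
    }
    return len(zeros) > 1
```

A label for which either function returns true would be displayed as Punycode.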
===Possible Issues and Open Questions=== | ===Possible Issues and Open Questions=== | ||
Line 111: | Line 103: | ||
If problems arose in the future (e.g. with homographs between a particular | If problems arose in the future (e.g. with homographs between a particular | ||
script and Latin), | script and Latin), our response would be that in the end, it is up to registries | ||
to make sure that their customers | |||
cannot rip each other off. Browsers can put some technical restrictions in place, | cannot rip each other off. Browsers can put some technical restrictions in place, | ||
but we are not in a position to do this job for them while still maintaining | but we are not in a position to do this job for them while still maintaining | ||
a level playing field for non-Latin scripts on the web. | a level playing field for non-Latin scripts on the web. The registries are the | ||
only people in a position to implement the proper checking here. For our part, | |||
we want to make sure we don't treat non-Latin scripts as second-class citizens. | |||
===Transition=== | ===Transition=== |