Gerv's IDN Display Algorithm FAQ

From MozillaWiki
Revision as of 10:23, 21 April 2017 by Gerv (talk | contribs) (First version)

This is an unofficial FAQ about Firefox's IDN Display Algorithm, which decides when and if to display an Internationalized Domain Name (IDN) in its proper form, and why it makes the choices it makes. Mozilla is still officially weighing up its options on this issue, but this represents one strand of opinion.

What does Firefox's algorithm do?

It displays single-script IDNs, but forbids most script mixing. So a domain owner who wants their IDN to display properly can use only one script per domain label, or one of a limited set of whitelisted combinations. This solves almost all of the confusables problem. For example, it doesn't allow domains like http://paypaІ.com/ from 2005, which contains just one Cyrillic letter.
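As a rough illustration of the single-script rule, here is a minimal Python sketch. It is not Firefox's implementation: it assumes the first word of a character's Unicode name is a good enough stand-in for its script, it ignores the whitelisted combinations mentioned above, and the function names and the test label (a "paypal" with one Cyrillic letter) are made up for this example.

```python
import unicodedata

def scripts_in_label(label):
    """Return a rough set of scripts used in a domain label.

    Assumption: the first word of a character's Unicode name ("LATIN",
    "CYRILLIC", ...) stands in for its real Unicode Script property.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are treated as script-neutral
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(label):
    """True if every letter in the label appears to come from one script."""
    return len(scripts_in_label(label)) <= 1

print(is_single_script("paypal"))       # all Latin letters
print(is_single_script("p\u0430ypal"))  # second letter is Cyrillic U+0430
```

A label that fails this kind of check would be shown in its xn-- form rather than in its Unicode form.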

Why does Firefox's algorithm not solve the edge case of whole-script spoofing?

Because Firefox doesn't have perfect knowledge. Take the Cyrillic http://аррІе.com/ (https://www.xn--80ak6aa92e.com/) - you the reader, as a human, can see that this is a Cyrillic homograph of a major brand name, but the browser does not have a concept of "major brand name". In general, there's no way to know what's a whole-script homograph of what else without a database of all domain names in a TLD. Registries have such a database; the browser does not.
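To see why a script-mixing rule has nothing to object to here, the xn-- label above can be decoded with Python's standard Punycode codec and each code point inspected. This is only a sketch for inspecting the label; the variable names are illustrative.

```python
import unicodedata

ace_label = "xn--80ak6aa92e"  # the label from the URL above

# Strip the "xn--" prefix and decode the rest with the stdlib Punycode codec.
decoded = ace_label[len("xn--"):].encode("ascii").decode("punycode")

# List each code point with its Unicode name.
for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Every letter is Cyrillic, so the label is single-script.
all_cyrillic = all(unicodedata.name(ch).startswith("CYRILLIC") for ch in decoded)
print(all_cyrillic)
```

Nothing in the label itself says that it imitates a Latin name; that judgement needs outside knowledge of which names exist.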

By forbidding script mixing Firefox's algorithm does, though, reduce the possible Cyrillic variants of the real http://apple.com/ from 32 to 1, and for many domains it reduces it to 0 because they have one or more letters which don't have a Cyrillic lookalike. There is no Cyrillic letter which looks like a "g", for example, so Google's domains don't have a problem.
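One plausible reading of the figure 32 is 2^5: each of the five letters in "apple" can be Latin or its Cyrillic lookalike. The sketch below counts spellings under that reading. The lookalike table is a small hypothetical one chosen for this example, not a complete confusables list, and the script test is the same rough name-prefix approximation.

```python
import itertools
import unicodedata

# Hypothetical Latin -> Cyrillic lookalike table, for illustration only.
LOOKALIKES = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "l": "\u04cf",  # CYRILLIC SMALL LETTER PALOCHKA
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
}

def script_of(ch):
    """Rough script guess: first word of the character's Unicode name."""
    return unicodedata.name(ch).split()[0]

def spellings(word):
    """All spellings where each letter is Latin or its listed lookalike."""
    options = [(ch, LOOKALIKES[ch]) if ch in LOOKALIKES else (ch,) for ch in word]
    return ["".join(combo) for combo in itertools.product(*options)]

def count_spellings(word):
    """Return (all spellings, spellings that use a single script)."""
    all_forms = spellings(word)
    single = [s for s in all_forms if len({script_of(c) for c in s}) == 1]
    return len(all_forms), len(single)

print(count_spellings("apple"))
print(count_spellings("google"))
```

For "apple" the single-script spellings are the genuine all-Latin name and the one all-Cyrillic lookalike. For "google" only the genuine name survives, because "g" has no entry in the table.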

Why doesn't Firefox just display all IDNs in their Punycode (xn--) form (i.e. set network.IDN_show_punycode to True)?

Because we believe all currently-used languages and scripts of the world should be first class citizens on the web. IDN was invented so that everyone could have domain names in their own scripts and languages, not just people who speak languages which use the Latin script. People have been working hard for decades to make this happen. We support that admirable goal, and aren't going to give up on it now.

Punycode (xn-- form) is a hacky implementation detail which was never meant to be displayed to end users. It's almost impossible for a user to detect a Punycode spoof of a domain that is always displayed to the user as Punycode (quick: tell the difference between https://www.xn--80ak6aa92e.com/ and https://www.xn--80ak6ae92e.com/). Showing Punycode on legitimate sites just leads to a different, and worse, set of problems.
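For the two labels in the parenthesis above, a short Python sketch makes the point concrete: the ASCII forms differ in a single character, yet they decode to different names. The variable names are illustrative.

```python
a = "xn--80ak6aa92e"
b = "xn--80ak6ae92e"

# Positions at which the two ASCII labels differ.
diff_positions = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(diff_positions)

# Decode both with the stdlib Punycode codec (after dropping "xn--").
decoded_a = a[4:].encode("ascii").decode("punycode")
decoded_b = b[4:].encode("ascii").decode("punycode")
print(decoded_a == decoded_b)
```

A user who sees only the xn-- forms has to spot that one changed character unaided.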

Why doesn't Firefox just not display Cyrillic IDNs? They seem to be most of the problem.

If the Internet had started as, say, a Russian invention, and Latin-based languages were the ones late to the party, users of English and other Latin-script languages would be pretty annoyed at being treated that way. So it seems unfair to treat Cyrillic that way.

OK. Why doesn't Firefox just show both forms of the name, or have some other UI indicator?

Because most users wouldn't have a clue what was going on, and would ignore the gibberish bit (or the tiny icon, or the underlining, or whatever UI treatment is chosen). It's additional complexity for a tiny edge case. Also, it makes IDN domain names seem worse, scarier or more exotic than Latin ones, which also seems unfair. They should be a normal part of Internet life.

Why doesn't Firefox vary the rules based on TLD?

Because everyone wants their chosen name to work in .com. And this is just another way of making some scripts second-class; it merely restricts their second-class status to the more important parts of the domain name space.

Why doesn't Firefox decide based on the script associated with the language the user speaks?

How does Firefox know what language the user speaks? And users can speak more than one language.

OK. Why doesn't Firefox decide based on the script associated with the browser's UI language?

The main reason is that this means that IDNs would work in some Firefoxes and not in others, and domain owners wouldn't have any way of telling which. This would make using IDN domain names a dodgy proposition for any organization, because they can never know which of their customers will see them correctly and which won't. The result would be to stunt the growth of IDNs and perpetuate the Latin-centric nature of the web. We believe it's important that if an IDN domain name works in one Firefox, it works in all Firefoxes. (And ideally, if it works in one browser, it works in all browsers, but we don't have control over that.)

The second reason is that many people use browsers with a UI language different to the ones they speak, or that is only one of the ones they speak. For example, many speakers of non-English languages nevertheless use US English Firefox. Also, browsers only have one UI language, and real people speak and understand multiple languages. And that's before you've accounted for shared computers and internet cafes, with multiple people of differing capabilities using the same computer. How often should Firefox bother the user to ask them again, to make sure the setting is always correct?

Lastly, this fix wouldn't actually solve the problem for everyone. http://apple.com and http://аррІе.com/ look the same even to people who read Cyrillic.

OK, OK. Why doesn't Firefox decide based on the script associated with the OS's UI language?

For almost exactly the same set of reasons that it doesn't work to use the browser's UI language.

Microsoft does this, which is why it is being reported that they aren't affected by the recent whole-script "discoveries", but it has precisely the negative effects outlined in the answer to the above question.

Why doesn't Firefox change the browser UI to show some sort of warning?

Because that would be an irritating and ugly way to solve the problem. And how would you like it if your language's script was marked as "potentially dangerous"?

Why doesn't Firefox change the browser UI in some non-pejorative way, like colouring letters from different scripts differently?

Because people wouldn't understand what the change meant. And how would you like it if domain names in your language's script always came with a fuchsia background?

Why doesn't Firefox implement Chrome's fix?

Chrome's "fix" is very specific to the issue of Cyrillic/Latin spoofing (whereas there are many other troublesome combinations) and, while they have attempted to reduce the scope, still treats that script as second-class. Their change makes 2,800 legitimately-registered domains in .com alone stop displaying properly. If one of them was the name of your company, you would be justifiably upset.

Is this actually a problem?

Well... not really a big one. We are not aware of anyone who's actually lost anything of value this way. Phishing is, sadly, already easy enough. And we have Safe Browsing to deal with that - if such a whole-script spoof domain actually starts getting used for phishing, it would end up on the Safe Browsing list.

Who should fix this, if not browser makers?

Registries have all the information necessary to fix this problem. But unfortunately it's not in their financial interest to refuse registrations.

Why can't Firefox have a whitelist of good registries, with good anti-spoofing policies?

This was how it worked until 2012, but then ICANN decided the world needed a thousand top-level domains, and so a thousand registries, and maintaining the list didn't scale.

What can I do to help myself?

Well, you could register the Cyrillic whole-script homograph of your own important domain name, if there is one.