1
    2
    3
    4
    5
    6
    7
    8
    9
   10
   11
   12
   13
   14
   15
   16
   17
   18
   19
   20
   21
   22
   23
   24
   25
   26
   27
   28
   29
   30
   31
   32
   33
   34
   35
   36
   37
   38
   39
   40
   41
   42
   43
   44
   45
   46
   47
   48
   49
   50
   51
   52
   53
   54
   55
   56
   57
   58
   59
   60
   61
   62
   63
   64
   65
   66
   67
   68
   69
   70
   71
   72
   73
   74
   75
   76
   77
   78
   79
   80
   81
   82
   83
   84
   85
   86
   87
   88
   89
   90
   91
   92
   93
   94
   95
   96
   97
   98
   99
  100
  101
  102
  103
  104
  105
  106
  107
  108
  109
  110
  111
  112
  113
  114
  115
  116
  117
  118
  119
  120
  121
  122
  123
  124
  125
  126
  127
  128
  129
  130
  131
  132
  133
  134
  135
  136
  137
  138
  139
  140
  141
  142
  143
  144
  145
  146
  147
  148
  149
  150
  151
  152
  153
  154
  155
  156
  157
  158
  159
  160
  161
  162
  163
  164
  165
  166
  167
  168
  169
  170
  171
  172
  173
  174
  175
  176
  177
  178
  179
  180
  181
  182
  183
  184
  185
  186
  187
  188
  189
  190
  191
  192
  193
  194
  195
  196
  197
  198
  199
  200
  201
  202
  203
  204
  205
  206
  207
  208
  209
  210
  211
  212
  213
  214
  215
  216
  217

docs / idn.md [blame]

# Internationalized Domain Names (IDN) in Google Chrome

## Background

Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very
similar. We’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.

In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.

This is a challenging problem space. Chrome has a global user base of billions
of people around the world, many of whom are not viewing URLs with Latin
letters. We want to prevent confusion, while ensuring that users across
languages have a great experience in Chrome. Displaying either punycode or a
visible security warning on too wide of a set of URLs would hurt web usability
for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.

Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain password logins are
for, and won’t automatically fill a password into a domain that is not the
exactly correct one.

## How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation. This representation includes a four-character
prefix (`xn--`) and then the unicode translated to ASCII Compatible Encoding
(ACE). For example, `http://öbb.at` is transformed to `http://xn--bb-eka.at`.

## Google Chrome's IDN policy

Since Chrome 51, Chrome uses an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.

Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:
1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
   processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).

2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
   characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
   [starts with a combining
   mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
   or [violates BiDi
   rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
   show punycode.

3. If there is a character in a label not belonging to [Characters allowed in
   identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
   per [Unicode Technical Standard 39 (UTS
   39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
   punycode.

4. If any character in a label belongs to [the disallowed
   list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
   show punycode.

5. If the component uses characters drawn from multiple scripts, it is subject
to a script mixing check based on ["Highly Restrictive" profile of UTS
39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
additional restriction on Latin. If the component fails the check, show the
component in punycode.
  - Latin, Cyrillic or Greek characters cannot be mixed with each other
  - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
    Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja)
  - Han (CJK Ideographs) can be mixed with Bopomofo
  - Han can be mixed with Hiragana and Katakana
  - Han can be mixed with Korean Hangul

6. If two or more numbering systems (e.g. European digits + Bengali digits) are
mixed, show punycode.

7. If there are any invisible characters (e.g. a sequence of the same combining
mark or a sequence of Kana combining marks), show punycode.

8. If there are any characters used in an unusual way, show punycode. E.g.
[`LATIN MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7)
used outside [ela geminada](https://en.wiktionary.org/wiki/ela_geminada).

9. Test the label for [mixed script confusable per UTS
39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If mixed script
confusable is detected, show punycode.

10. Test the label for [whole script
confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all
the letters in a given label belong to a set of whole-script-confusable letters
in one of the [whole-script-confusable
scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium)
and if the hostname doesn't have a corresponding
[allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds)
for that script, show punycode.
**Example for Cyrillic:**
The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic
letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
**AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs
known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`).
Show it in punycode.

11. If the label contains only [digits and digit
spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike),
show punycode.

12. If the label matches a [dangerous
pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
show punycode.

13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
registrable part of a hostname is identical to one of the top domains after
removing diacritic marks and mapping each character to its spoofing skeleton
(e.g. `www.googlé.com` with `é` in place of `e`), show punycode.

Otherwise, show Unicode.

This is implemented by `IDNToUnicodeOneComponent()` and `IsIDNComponentSafe()`
in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).

## Additional Protections

In addition to the spoof checks above, Chrome also implements a full page
security warning to protect against lookalike URLs. You can find an example of 
this warning at `chrome://interstitials/lookalike`. This warning blocks main
frame navigations that involve lookalike URLs, either as a direct navigation or
as part of a redirect.

The algorithm to show this warning is as follows:

1. If the scheme of the navigation is not `http` or `https`, allow
the navigation.

2. If the navigation is a redirect, check the redirect chain. If the redirect
chain is safe, allow the navigation. (See Defensive Registrations section for
details).

3. If the hostname of the navigation has at least a medium site engagement
score, allow the navigation. Site engagement score is assigned to sites by the
[Site Engagement
Service](https://www.chromium.org/developers/design-documents/site-engagement).

4. If the hostname of the navigation is in
[`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list),
allow the navigation.

5. If the user previously allowed the hostname of the navigation by clicking
"Ignore" in the warning, allow the navigation. Currently, user decisions are
stored per tab, so navigating to the same site in a new tab may show the
warning.

6. If the hostname has the same skeleton as a recently engaged site or a top 500
domain, block the navigation and show the warning.

All of these checks are done locally on the client side.

### Defensive Registrations

Domain owners can sometimes register multiple versions of their domains, such
as the ASCII and IDN versions, to improve user experience and prevent potential
spoofs. We call these supplementary domains defensive registrations.

In some cases, Chrome's lookalike warning may flag and block navigations to
these domains:
 - If one of the sites is in `domains.list` but the other isn't, the latter will
be blocked. 
 - If the user engaged with one of the sites but not the other, the latter will
be blocked.

### Avoiding a lookalike warning on your site

**Domain owners can avoid the "Did you mean" warning by redirecting their
defensive registrations to their canonical domain.**

**Example**: If you own both `example.com` and `éxample.com` and the majority of
your traffic is to `example.com`, you can fix the warning by redirecting
`éxample.com` to `example.com`. The lookalike warning logic considers this a
safe redirect and allows the navigation. If you must also redirect `http`
navigations to `https`, do this in a single redirect such as
`http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302
redirects, the lookalike warning ignores meta redirects.

## Reporting Security Bugs

We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document]( https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.