# Internationalized Domain Names (IDN) in Google Chrome
## Background
Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.
Characters from different languages (or even the same language!) can look very
similar, and we’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks that exploit this. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.
In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.
This is a challenging problem space. Chrome has a global user base of billions
of people around the world, many of whom are not viewing URLs with Latin
letters. We want to prevent confusion, while ensuring that users across
languages have a great experience in Chrome. Displaying either punycode or a
visible security warning on too broad a set of URLs would hurt web usability
for people around the world.
Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.
Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain each saved login
belongs to, and won’t automatically fill a password into a domain that does not
match exactly.
## How IDN works
IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation, which consists of a four-character prefix
(`xn--`) followed by the Unicode label translated to ASCII Compatible Encoding
(ACE). For example, `http://öbb.at` is transformed to `http://xn--bb-eka.at`.
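As a rough illustration of the label-by-label transformation described above, the following Python sketch uses the standard library's `punycode` codec. The helper name `to_ace` is made up for this example, and the sketch deliberately skips the character mapping and validation that a real user agent performs before encoding.

```python
def to_ace(hostname: str) -> str:
    """Encode each non-ASCII label of a hostname as `xn--` plus punycode.

    Simplified sketch: no case folding, normalization, or validity checks.
    """
    ace_labels = []
    for label in hostname.split("."):
        if label.isascii():
            # ASCII-only labels are passed through unchanged.
            ace_labels.append(label)
        else:
            # Non-ASCII labels get the four-character prefix plus punycode.
            ace_labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(ace_labels)


print(to_ace("öbb.at"))  # xn--bb-eka.at
```

Python's built-in `idna` codec performs the reverse direction as well: `b"xn--bb-eka.at".decode("idna")` yields the Unicode hostname again.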
## Google Chrome's IDN policy
Since Chrome 51, Chrome has used an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.
Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:
1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).
2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
[starts with a combining
mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
or [violates BiDi
rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
show punycode.
3. If there is a character in a label not belonging to [Characters allowed in
identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
per [Unicode Technical Standard 39 (UTS
39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
punycode.
4. If any character in a label belongs to [the disallowed
list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
show punycode.
5. If the component uses characters drawn from multiple scripts, it is subject
to a script mixing check based on the ["Highly Restrictive" profile of UTS
39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
additional restriction on Latin. If the component fails the check, show the
component in punycode.
    - Latin, Cyrillic or Greek characters cannot be mixed with each other
    - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
      Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja)
    - Han (CJK Ideographs) can be mixed with Bopomofo
    - Han can be mixed with Hiragana and Katakana
    - Han can be mixed with Korean Hangul
6. If two or more numbering systems (e.g. European digits + Bengali digits) are
mixed, show punycode.
7. If there are any invisible characters (e.g. a sequence of the same combining
mark or a sequence of Kana combining marks), show punycode.
8. If there are any characters used in an unusual way, show punycode. E.g.
[`LATIN MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7)
used outside [ela geminada](https://en.wiktionary.org/wiki/ela_geminada).
9. Test the label for [mixed script confusable per UTS
39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a mixed
script confusable is detected, show punycode.
10. Test the label for [whole script
confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all
the letters in a given label belong to a set of whole-script-confusable letters
in one of the [whole-script-confusable
scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium)
and if the hostname doesn't have a corresponding
[allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds)
for that script, show punycode.
**Example for Cyrillic:**
The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic
letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
**AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs
known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`).
Show it in punycode.
11. If the label contains only [digits and digit
spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike),
show punycode.
12. If the label matches a [dangerous
pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
show punycode.
13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
registrable part of a hostname is identical to one of the top domains after
removing diacritic marks and mapping each character to its spoofing skeleton
(e.g. `www.googlé.com` with `é` in place of `e`), show punycode.
Otherwise, show Unicode.
This is implemented by `IDNToUnicodeOneComponent()` and `IsIDNComponentSafe()`
in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and the `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).
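To make steps 5 and 6 of the algorithm more concrete, here is a small Python sketch of a script mixing check and a mixed-digit check. It is not Chrome's implementation: Chrome relies on ICU, whereas this sketch guesses a character's script from its Unicode name, which is only a rough stand-in for the real Script property. The names `script_of`, `passes_script_mixing`, and `mixes_digit_systems` are hypothetical, and treating any non-ASCII Latin letter in a mixed-script label as a failure is this sketch's reading of the "additional restriction on Latin".

```python
import unicodedata

# Assumption: the leading words of a character's Unicode name approximate
# its script. Real implementations use the Unicode Script property instead.
_NAME_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "BOPOMOFO": "Bopomofo",
}

# Script combinations permitted by the rules listed under step 5.
ALLOWED_MIXES = (
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Hangul"},
)


def script_of(ch: str) -> str:
    """Return an approximate script name for a single character."""
    if not ch.isalpha():
        return "Common"  # digits, hyphens, and other non-letters
    name = unicodedata.name(ch, "")
    for prefix, script in _NAME_PREFIXES.items():
        if name.startswith(prefix):
            return script
    return "Unknown"


def passes_script_mixing(label: str) -> bool:
    """Approximate the step 5 check for one domain label."""
    scripts = set()
    has_non_ascii_latin = False
    for ch in label:
        script = script_of(ch)
        if script == "Common":
            continue
        if script == "Latin" and not ch.isascii():
            has_non_ascii_latin = True
        scripts.add(script)
    if len(scripts) <= 1:
        return True  # single-script labels are not subject to this check
    if has_non_ascii_latin:
        return False  # only ASCII Latin may be mixed with other scripts
    return any(scripts <= allowed for allowed in ALLOWED_MIXES)


def mixes_digit_systems(label: str) -> bool:
    """Approximate step 6: detect digits from more than one numbering system."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits are contiguous, so this locates the system's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) > 1
```

With these helpers, `passes_script_mixing("eb\u0430y")` (Latin letters plus Cyrillic "а") returns `False`, while a label that combines ASCII Latin with Han characters passes. The remaining steps, such as confusable detection and the top-domain skeleton comparison, need data tables that are out of scope for this sketch.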
## Additional Protections
In addition to the spoof checks above, Chrome also implements a full-page
security warning to protect against lookalike URLs. You can find an example of
this warning at `chrome://interstitials/lookalike`. This warning blocks main
frame navigations that involve lookalike URLs, either as a direct navigation or
as part of a redirect.
The algorithm to show this warning is as follows:
1. If the scheme of the navigation is not `http` or `https`, allow
the navigation.
2. If the navigation is a redirect, check the redirect chain. If the redirect
chain is safe, allow the navigation. (See the Defensive Registrations section
for details.)
3. If the hostname of the navigation has at least a medium site engagement
score, allow the navigation. Site engagement score is assigned to sites by the
[Site Engagement
Service](https://www.chromium.org/developers/design-documents/site-engagement).
4. If the hostname of the navigation is in
[`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list),
allow the navigation.
5. If the user previously allowed the hostname of the navigation by clicking
"Ignore" in the warning, allow the navigation. Currently, user decisions are
stored per tab, so navigating to the same site in a new tab may show the
warning.
6. If the hostname has the same skeleton as a recently engaged site or a top 500
domain, block the navigation and show the warning.
All of these checks are done locally on the client side.
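The order of these checks can be sketched in Python as follows. Everything here is illustrative rather than Chrome's actual code: `LookalikeContext`, `should_block`, and `simple_skeleton` are hypothetical names, the redirect-chain verdict is passed in as a flag, a single `top_domains` set stands in for both `domains.list` and the top 500 domains, and the "skeleton" only strips diacritics instead of applying the full UTS 39 confusable mapping.

```python
import unicodedata
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def simple_skeleton(hostname: str) -> str:
    """Very rough skeleton: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", hostname.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


@dataclass
class LookalikeContext:
    top_domains: set                      # stand-in for domains.list / top 500
    engaged_sites: set                    # hosts with at least medium engagement
    allowed_by_user: set = field(default_factory=set)  # "Ignore" was clicked


def should_block(url: str, ctx: LookalikeContext,
                 redirect_chain_is_safe: bool = False) -> bool:
    """Return True if this sketch would show the lookalike warning."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False                      # step 1: other schemes are allowed
    if redirect_chain_is_safe:
        return False                      # step 2: safe redirect chain
    host = parts.hostname or ""
    if host in ctx.engaged_sites:
        return False                      # step 3: engaged site
    if host in ctx.top_domains:
        return False                      # step 4: known top domain
    if host in ctx.allowed_by_user:
        return False                      # step 5: user dismissed the warning
    # Step 6: compare skeletons against engaged sites and top domains.
    skeleton = simple_skeleton(host)
    candidates = ctx.top_domains | ctx.engaged_sites
    return any(simple_skeleton(c) == skeleton for c in candidates)
```

For example, with `top_domains={"google.com"}`, a navigation to `https://googlé.com/` is blocked in this sketch, while `https://google.com/` itself is allowed at step 4.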
### Defensive Registrations
Domain owners can sometimes register multiple versions of their domains, such
as the ASCII and IDN versions, to improve user experience and prevent potential
spoofs. We call these supplementary domains defensive registrations.
In some cases, Chrome's lookalike warning may flag and block navigations to
these domains:
- If one of the sites is in `domains.list` but the other isn't, the latter will
be blocked.
- If the user engaged with one of the sites but not the other, the latter will
be blocked.
### Avoiding a lookalike warning on your site
**Domain owners can avoid the "Did you mean" warning by redirecting their
defensive registrations to their canonical domain.**
**Example**: If you own both `example.com` and `éxample.com` and the majority of
your traffic is to `example.com`, you can fix the warning by redirecting
`éxample.com` to `example.com`. The lookalike warning logic considers this a
safe redirect and allows the navigation. If you must also redirect `http`
navigations to `https`, do this in a single redirect such as
`http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302
redirects; the lookalike warning ignores meta redirects.
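One possible way to serve such a single-hop redirect is sketched below as a minimal WSGI application in Python. The hostnames come from the example above; the names `CANONICAL_ORIGIN`, `DEFENSIVE_HOSTS`, and `app` are assumptions made for illustration, and a production setup would more likely configure this redirect in the web server or CDN.

```python
# Hypothetical configuration: one canonical origin and its defensive
# registrations. Browsers send IDN hostnames in their ASCII (punycode) form,
# so the defensive hostname is converted with the standard "idna" codec.
CANONICAL_ORIGIN = "https://example.com"
DEFENSIVE_HOSTS = {"éxample.com".encode("idna").decode("ascii")}


def app(environ, start_response):
    """Minimal WSGI app: send defensive hosts straight to the canonical origin."""
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    if host in DEFENSIVE_HOSTS:
        location = CANONICAL_ORIGIN + environ.get("PATH_INFO", "/")
        if environ.get("QUERY_STRING"):
            location += "?" + environ["QUERY_STRING"]
        # A single hop that changes both the scheme and the hostname.
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"canonical site\n"]
```

For local experiments the app can be run with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library.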
## Reporting Security Bugs
We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.