We all know that security is hard. But it’s often hard in an advanced-math kind of way: cryptography, encryption, hashing algorithms, cipher suites, elliptic curves, and all the rest are challenging subjects that are not easy to understand.
By contrast, our topic for today – internationalization – is hard in a messy-reality kind of way. Internationalization (often called “i18n” for short because who wants to keep typing those 18 letters between the “i” and the “n”?) involves some exceedingly complicated matters, including:
- thousands of human languages
- more than a hundred scripts (Latin, Cyrillic, Arabic, Han, Runic, etc.) used to write those languages
- well over 100,000 assigned Unicode characters (out of roughly 1.1 million possible code points) that make up those scripts
- computer encodings for Unicode characters, such as UTF-8 and UTF-16
- methods for comparing characters to check if they are “the same” (for several meanings of “same”)
- characters that can be visually mistaken for other, similar characters (these are called “confusable characters”)
- the many, many ways that humans and computers can go wrong when trying to understand and handle languages, scripts, and characters
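To make two of the items above a bit more concrete, here is a minimal Python sketch using only the standard library. It shows that one character takes a different number of bytes in UTF-8 and UTF-16, and that two strings that render identically can compare as different until they are normalized. The sample characters are my own picks for illustration.

```python
import unicodedata

# One character, two encodings, two different byte lengths.
chess_king = "\u265a"  # BLACK CHESS KING
utf8_bytes = chess_king.encode("utf-8")       # 3 bytes: e2 99 9a
utf16_bytes = chess_king.encode("utf-16-be")  # 2 bytes: 26 5a

# Two ways to write the same visible character.
precomposed = "\u00e9"  # e with acute accent as a single code point
decomposed = "e\u0301"  # plain "e" followed by a combining acute accent

# A plain code point comparison treats them as different strings...
same_raw = (precomposed == decomposed)

# ...but they are canonically equivalent, so they match after NFC normalization.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(utf8_bytes.hex(), utf16_bytes.hex(), same_raw, same_nfc)
```

Which meaning of “same” you want depends on the application, which is exactly why this gets messy.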
Internationalization is a vast topic that I can’t even begin to cover in a brief blog post (I keep meaning to write an introductory book about it, but I haven’t found the time – until then, check out some slides I created a few years ago for the team that publishes IETF RFCs).
For our purposes here, the real fun begins when we consider the intersection of security and i18n.
Consider a simple example of typejacking: a website address www.paypa1.com with the digit one instead of the lowercase letter “L” in the company name. Depending on what font face it’s presented in, that address might fool you into thinking you were communicating with PayPal Inc. instead of some attacker.
Even worse are characters that look the same but aren’t. Consider Cyrillic small letter a, which is Unicode character U+0430. Here it is: “а”. Can you tell the difference between that and Latin small letter “a”? I didn’t think so. Now what if an email message points you to www.pаypаl.com using the Cyrillic letter instead of the Latin letter? Can you say: “How’s the phishing today?”
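Your eyes can’t tell the two addresses apart, but code can. Here is a small Python sketch (standard library only) that compares the two strings and lists every non-ASCII character in the spoofed one. The helper name `describe_non_ascii` is something I made up for this example.

```python
import unicodedata

latin = "www.paypal.com"
# Same-looking address with U+0430 (Cyrillic small letter a) replacing each Latin "a".
spoofed = "www.p\u0430yp\u0430l.com"

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

print(latin == spoofed)             # the strings are not equal
print(describe_non_ascii(spoofed))  # the two Cyrillic letters, with their positions
```

A check like this is only a debugging aid: it tells you what is in a string, not whether the string is safe to accept.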
Then there are characters that Unicode can consider to be the same depending on which rules are applied. Unicode has the concept of “compatibility equivalence”, such that characters like modifier letter small a ᵃ (U+1D43) and circled Latin small letter a ⓐ (U+24D0) can be considered the same as our good friend Latin small letter “a”. If your system applies the wrong set of rules (allowing characters with compatibility equivalents), then an attacker who authenticates as ⓙⓤⓛⓘⓔⓣ could hijack an account for Juliet, leaving Romeo very unhappy. (Indeed, a similar vulnerability hit the Spotify music service a few years ago.)
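Here is a rough Python sketch of that collision. Compatibility normalization (NFKC) folds the circled letters down to plain ASCII, while canonical normalization (NFC) leaves them alone. The account store and the `can_register` function are hypothetical, and rejecting such names outright is just one possible policy, not the only reasonable one.

```python
import unicodedata

circled = "\u24d9\u24e4\u24db\u24d8\u24d4\u24e3"  # ⓙⓤⓛⓘⓔⓣ

nfc = unicodedata.normalize("NFC", circled)    # unchanged circled letters
nfkc = unicodedata.normalize("NFKC", circled)  # folded to "juliet"

def has_compat_chars(name):
    """True if compatibility normalization changes the name beyond what NFC does."""
    return unicodedata.normalize("NFKC", name) != unicodedata.normalize("NFC", name)

existing_accounts = {"juliet"}  # hypothetical account store

def can_register(name):
    """Hypothetical policy: refuse compatibility characters instead of folding them."""
    if has_compat_chars(name):
        return False
    return unicodedata.normalize("NFC", name) not in existing_accounts

print(nfkc, can_register(circled), can_register("romeo"))
```

The trouble starts when one part of a system folds names with NFKC and another part doesn’t, so whatever rule you pick needs to be applied consistently everywhere.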
As even this small sample shows, Unicode is extremely expressive (e.g., there are Unicode characters for most of your favorite emojis). However, with that expressiveness comes complexity and the possibility for abuse.
Is the answer to force everyone to use plain old ASCII? No, that would be incredibly naïve. 🙂 We live in a multilingual world, and it’s important to present information and allow communication in the languages and scripts that people understand (remember, ASCII might be just as inscrutable to someone from East Asia as East Asian scripts are to you).
So how can we safely handle internationalized usernames, domain names, and other such identifiers?
Just use UTF-8, right?
Unfortunately, saying “just use UTF-8” is the i18n equivalent of saying “just use SSL/TLS” for security.
Yes, it gets you part of the way there, but there’s much more to the story than that. In fact, UTF-8 won’t solve any of the i18n issues outlined above, just as TLS won’t solve a whole raft of security issues.
Let’s say you need to deploy a system that allows users to register their own usernames. What’s the best way to design it for both safety and internationalization?
Here are some core principles I’d keep in mind:
First, try to restrict the allowable characters to letters and numbers. Sure, it’s fun if people can have the Unicode character for black chess king ♚ as a username, but if you allow characters like that you’re also going to allow usernames that, as we have seen, can be visually mistaken for more legitimate usernames. In the PRECIS framework (RFC 7564), which I co-authored recently with Marc Blanchet, we did this by defining a construct called the IdentifierClass. Among other things, this construct also disallows characters with compatibility equivalents, such as circled Latin small letter “a” ⓐ.
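As a very rough illustration of this first principle, the Python sketch below accepts only characters whose Unicode general category is a letter, a combining mark, or a decimal digit, and rejects anything that compatibility normalization would change. To be clear, this is not an implementation of the PRECIS IdentifierClass, which has more rules than this. The function name and the category set are my own simplifications.

```python
import unicodedata

# Simplified allow-list of general categories: letters, combining marks, decimal digits.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}

def is_plausible_identifier(name):
    """Rough approximation of 'letters and numbers only, no compatibility characters'."""
    if not name:
        return False
    # Reject names that NFKC would rewrite (e.g. circled or superscript letters).
    if unicodedata.normalize("NFKC", name) != unicodedata.normalize("NFC", name):
        return False
    # Reject symbols, punctuation, spaces, and anything else outside the allow-list.
    return all(unicodedata.category(ch) in ALLOWED_CATEGORIES for ch in name)

for candidate in ["juliet42", "\u265a", "\u1d43bc", "romeo montague"]:
    print(repr(candidate), is_plausible_identifier(candidate))
```

Note that the modifier letter in the third candidate is categorized as a letter, so the category test alone would let it through. The normalization check is what catches it.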
Second, think carefully about which scripts you can realistically and safely support. For example, do you understand the intricacies of language scripts that are written right to left (such as Arabic and Hebrew)? Can your full tool chain (databases, command line, GUIs, support systems, etc.) even handle such scripts if you need to debug some code, reset a password, ban a user, or modify user permissions? If you don’t understand and can’t handle such scripts, you will run into operational problems eventually – it’s only a matter of time.
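If you decide you aren’t ready for right-to-left scripts yet, you at least need a way to notice them. A minimal Python sketch, assuming that checking each character’s bidirectional class is good enough for a first pass (the function name `contains_rtl` is mine):

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}

def contains_rtl(text):
    """True if any character in the text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word
arabic = "\u0633\u0644\u0627\u0645"  # an Arabic word

print(contains_rtl(hebrew), contains_rtl(arabic), contains_rtl("juliet"))
```

Detecting such strings is the easy part. Displaying, storing, and debugging them correctly across your whole tool chain is where the real work lies.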
Third, strongly consider disallowing mixed-script strings (i.e., strings with some characters from one script and some characters from another script). This will help to avoid things like the “paypal” string with Cyrillic “a” instead of Latin “a”.
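Python’s standard library doesn’t expose the Unicode Script property, so the sketch below cheats: it guesses each letter’s script from the first word of its Unicode character name (“LATIN”, “CYRILLIC”, and so on). That heuristic is an assumption on my part and is nowhere near production quality, but it is enough to show the idea. Both function names are made up for this example.

```python
import unicodedata

def scripts_used(text):
    """Guess the scripts in a string from the first word of each letter's Unicode name.

    Digits, punctuation, and other non-letters are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

def is_mixed_script(text):
    """True if the letters in the text appear to come from more than one script."""
    return len(scripts_used(text)) > 1

spoofed = "p\u0430yp\u0430l"  # "paypal" with Cyrillic U+0430 in place of each Latin "a"

print(scripts_used("paypal"), is_mixed_script("paypal"))
print(sorted(scripts_used(spoofed)), is_mixed_script(spoofed))
```

Two caveats. A blanket ban on mixing would reject ordinary Japanese, which routinely combines several scripts, and it does nothing against a spoof written entirely in one look-alike script. Unicode Technical Standard #39 describes more careful approaches.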
Finally, recognize that no matter how hard you try, it’s impossible to completely prevent user confusion. Human language and communication often depend on a whole lot of context, negotiation, and back-and-forth. When we’re dealing with things like domain names and usernames, we don’t have the opportunity to discuss intention or work out meaning on the fly – either it’s understandable on immediate inspection or it isn’t (usually with bad consequences). That’s why it’s so important to be careful about what you allow in the first place.
For further reading, check out the i18n slides I mentioned above, the PRECIS framework (RFC 7564), the PRECIS document on usernames and passwords (also recently approved for publication as an RFC), and the Unicode security considerations.
Be safe out there!