
Unicode – the future of passwords? Possibly…

Pedro Venda 03 Jul 2014

Or maybe: Unicode: How to make correcthorsebatterystaple into an amazingly strong password



Years ago, we failed miserably whilst trying to crack a local admin password with a large RainbowTable. It should have worked, as the table included every possible hash of that format for the 104 key keyboard at the password length in question. During an embarrassed call with the client, they mentioned that the password included a £ (Pound Sterling) sign.

Of course – the RainbowTable was computed with a US keyboard codepage, so we were always going to fail. D’oh!

The instant lesson is to include Windows extended characters in your passwords: either characters that exist only on your local language codepage, or others entered by holding Alt and typing the 4 digit decimal code for the character. The £ or Sterling sign is Alt+0163 – very handy to know if you ever lose your codepage!

So you can easily access another 140ish characters, which significantly increases complexity and also defeats hash cracking if the attacker fails to include your special characters in their charset.
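As a rough illustration of why the table missed the password, the Python sketch below checks whether a candidate character set covers every character of a password. The sample password and the helper name `charset_covers` are made up for this example, and the 95 character printable ASCII set merely stands in for the US keyboard charset.

```python
import string

# Printable ASCII: letters, digits, punctuation and the space character.
ASCII_PRINTABLE = string.ascii_letters + string.digits + string.punctuation + " "


def charset_covers(password: str, charset: str) -> bool:
    """Return True if every character of the password is in the charset."""
    return set(password) <= set(charset)


# Hypothetical password containing a Sterling sign.
password = "Secret£99"

print(len(ASCII_PRINTABLE))                             # size of the ASCII charset
print(charset_covers(password, ASCII_PRINTABLE))        # False: the pound sign is missing
print(charset_covers(password, ASCII_PRINTABLE + "£"))  # True once it is added
print("£".encode("cp1252")[0])                          # code page value behind Alt+0163
```

A table or brute force run built from a charset that fails this check can never produce the password, however long it runs.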

Fast forward a few years, and the idea returned after watching a cool presentation by @yiannistox from the Hashcat Project at OWASP Birmingham (@OWASPBrum). Are we using the most complete character set possible when setting passwords? We know about ASCII and extended character tables, but what about Unicode? Can Unicode characters be used to create password hashes, and will this make the hashes any stronger or more difficult to crack?

We set out to research this problem and figure out whether or not Unicode characters could be used in operating systems, browsers and web applications. Password hashes strengthened by using Unicode characters will require a significantly larger character set to ensure successful cracking and thus increase the complexity level of the process by a staggering amount.

Well, that’s what you would hope…

Unicode and current security issues

Unicode has been around for a while, and some frameworks have partial or complete implementations of Unicode handling. However, they are not without issues, and they are often enabled by default, so input handling may not work as expected because the developer did not anticipate anything other than straightforward ASCII characters.

There are numerous examples documented on the Internet of security faults caused by mishandling of Unicode. We chose 4 simple cases that highlight the potential inherent risks of using Unicode.

1. Spotify – April/2012 (?)
Attackers were able to hijack accounts whose usernames they knew. This was due to an issue in the application’s canonicalisation routine. By creating an account whose username contained Unicode characters, attackers were able to request a password reset and end up resetting the password of a different account from the one they had created.
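One way a canonicalisation routine can go wrong is by not being idempotent, meaning that applying it twice gives a different result from applying it once. The Python sketch below is a hypothetical routine written purely for illustration, not Spotify's code; the function name `canonicalise` and the sample username are assumptions.

```python
import unicodedata


def canonicalise(username: str) -> str:
    """Hypothetical (flawed) routine: lower-case first, then NFKC-normalise."""
    return unicodedata.normalize("NFKC", username.lower())


# "BIGBIRD" spelt with Unicode modifier letters (small raised capitals).
fancy = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"

# First pass: the modifier letters have no lower-case form, so lower() leaves
# them alone and NFKC then maps them to plain capitals.
once = canonicalise(fancy)
# Second pass: the now plain capitals are lower-cased.
twice = canonicalise(once)

print(once, twice, once == twice)
```

If one code path (say, account creation) applies such a routine once and another (say, password reset) applies it twice, the same input can resolve to two different accounts.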

2. Tweetdeck – June/2014
A cross site scripting vector was uncovered because the application attempted to replace certain character sequences and some Unicode symbols with image icons. However, that replacement triggered certain JavaScript methods that reverted the escaping that made user input safe to reflect.
Hence straightforward unencoded, undisguised cross site scripting vectors would simply work if followed by a Unicode character.

3. MS Windows – August/2011 (?)
Operating systems are not immune to Unicode related issues. It is possible to disguise file names with Unicode characters so that they look exactly like normal ASCII names. Attackers can take advantage of this by tricking the user into thinking that they are dealing with a familiar file, whilst the file that is actually parsed by the operating system or an application is a different one carrying the lookalike Unicode name. The Right-to-Left Override character (U+202E) has also been used successfully to disguise the extensions of executables, which can be made to appear to be images.
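The extension trick can be sketched in a few lines of Python. The character usually used for it is the Right-to-Left Override, U+202E; the file name and the helper `has_bidi_control` below are hypothetical and only illustrate one possible check.

```python
import unicodedata

# Bidirectional control classes that can reorder how text is displayed.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def has_bidi_control(name: str) -> bool:
    """Return True if the name contains a bidirectional formatting control."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in name)


# Hypothetical file name: a bidi-aware display shows this as "photoexe.jpg",
# but the real extension is still ".exe".
disguised = "photo\u202egpj.exe"

print(disguised.endswith(".exe"))     # True
print(has_bidi_control(disguised))    # True
print(has_bidi_control("photo.jpg"))  # False
```

Rejecting or flagging names that contain such controls is one simple defence; it does nothing against lookalike letters, which need a different check.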

4. Paypal – 2011 (?)
Paypal and other high profile sites were victims of phishing attacks that used Unicode characters to disguise URLs. Since 2010 it has been possible to register domain names containing Unicode characters, so that accented and non-Latin characters can be used in domain names.

Because the handling of domain names in the DNS world is still ASCII, this is deliberately just a method of changing the appearance of domain names, and therefore URLs. Punycode was developed to convert Unicode domain names into ASCII equivalents in the browser.

Attackers registered domain names with Unicode characters so that they look exactly like the domain names of the victim websites ( in this case). Before hitting the browser window, there was no way to visually tell the difference between the Unicode and ASCII domain names, and therefore users could not visually determine whether the URL was fake or not. This is in essence a ‘Homograph attack’.
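To make the homograph idea concrete, here is a small Python sketch using the standard library's `idna` codec (which implements the older IDNA 2003 rules). The lookalike name is an assumption chosen for illustration: it swaps the first Latin "a" for CYRILLIC SMALL LETTER A (U+0430).

```python
# Latin name versus a lookalike whose first "a" is Cyrillic (U+0430).
genuine = "paypal.com"
lookalike = "p\u0430ypal.com"

print(genuine == lookalike)      # False, although both usually render identically
print(genuine.encode("idna"))    # pure ASCII names pass through unchanged
print(lookalike.encode("idna"))  # the Unicode label becomes an ASCII "xn--" label
```

The Punycode form is what actually travels through DNS, which is why displaying the `xn--` form is one way a browser can expose a suspicious name.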

What is Unicode?

Unicode is a standard that defines encoding and representation for consistently handling text in computers, like ASCII. But unlike ASCII, Unicode was created by a consortium with the purpose of handling all text symbols of all the world’s languages and writing systems. Notably 8 bit encoding is far from sufficient to pull this one off!

Speaking of encoding, the Unicode consortium have defined at least 3 encoding methods:
– UTF-8: 8 bit variable length encoding, originally up to 6 bytes per character (4 bytes are enough to cover all of Unicode);
– UTF-16: 16 bit variable length encoding, up to 4 bytes + 16bit BOM;
– UTF-32: 32 bit fixed length (4 bytes).

All these encode the same symbol with different sequences of bits, but the most prevalent method is UTF-8. Whichever encoding is used, the Unicode code space has room for approximately 1,111,998 different characters (not all of which are assigned yet).
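The difference between the three encodings is easy to see by encoding a few characters and counting bytes. The sample characters below are arbitrary choices, and the last lines show one way to arrive at the 1,111,998 figure: the full code space minus surrogates and noncharacters.

```python
samples = {
    "A": "LATIN CAPITAL LETTER A",
    "£": "POUND SIGN",
    "€": "EURO SIGN",
    "\U0001F600": "GRINNING FACE",
}

# The "-le" codec variants are used so that no byte order mark is prepended.
for char, name in samples.items():
    sizes = {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"U+{ord(char):04X} {name}: {sizes}")

# Code points available for characters: the full code space (0x110000)
# minus 2,048 surrogates and 66 noncharacters.
assignable = 0x110000 - 2048 - 66
print(assignable)  # 1111998
```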

Complexity comparison

The most inefficient way to obtain a password is by performing a brute force attack. But this is also the only method that guarantees that the password will be found, whether it is run against an online login form or offline against a password hash.

The guarantee that brute forcing will eventually find the password does have a significant implication – the process must employ a _sufficient_ character set. In other words, if we know a certain password is an 8 digit number, then a brute force attack that only tries 8 digit combinations of the digits 0-4 is not guaranteed to find the password. In this case a sufficient character set would have to include all digits from 0 to 9.

Similarly, to guarantee that a certain hash generated from an ASCII password is cracked, it is necessary to run the brute force attack using the entire ASCII character set (previous knowledge about the password could allow the character set to be reduced).
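The point about a sufficient character set can be shown with a toy brute force in Python. This is a minimal sketch under stated assumptions: the PIN is hypothetical and shortened to 4 digits so the search finishes instantly, and unsalted SHA-256 is used only as a stand-in for whatever hash format is being attacked.

```python
import hashlib
from itertools import product


def brute_force(target_hash: str, charset: str, length: int):
    """Try every string of the given length over charset; return the match or None."""
    for candidate in product(charset, repeat=length):
        guess = "".join(candidate)
        if hashlib.sha256(guess.encode("utf-8")).hexdigest() == target_hash:
            return guess
    return None


# Hypothetical 4 digit PIN.
target = hashlib.sha256("4719".encode("utf-8")).hexdigest()

print(brute_force(target, "01234", 4))       # None: the digits 7 and 9 are never tried
print(brute_force(target, "0123456789", 4))  # 4719
```

The first search exhausts its whole keyspace and still fails, which is exactly what happens when a charset misses a single character of the real password.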

If we consider that ASCII has 95 printable characters, then a 6 character password has a keyspace (the number of possible passwords) of approximately
95^6 ~ 7×10^11 (of the order of 10^11)

If the same 6 character password had been created using the Unicode character set, then the character set increases from 95 to 1,111,998, yielding a keyspace of approximately
1,111,998^6 ~ 2×10^36 (of the order of 10^36)

10^11 vs 10^36 is a massive difference in complexity. Going by these orders of magnitude, the latter takes about 10,000,000,000,000,000,000,000,000 (10^25) times longer to run than the former.

Another illustrative example is as follows: if the entire ASCII brute forcing mentioned above took only 1 second to run, then the Unicode brute forcing would take something around 316,887,907,240 million years (mind you, the estimated age of the universe is around 13,798 million years).
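The arithmetic behind these figures can be reproduced with a short Python sketch. The year length (a mean tropical year of 365.2422 days) is an assumption on our part, and the age of the universe is taken as 13.798 billion years. Note that the text works with rounded orders of magnitude, so the exact ratio comes out somewhat below 10^25.

```python
ASCII_CHARS = 95
UNICODE_CHARS = 1_111_998
LENGTH = 6

ascii_keyspace = ASCII_CHARS ** LENGTH
unicode_keyspace = UNICODE_CHARS ** LENGTH

print(f"ASCII keyspace:   {ascii_keyspace:.2e}")
print(f"Unicode keyspace: {unicode_keyspace:.2e}")
print(f"Exact ratio:      {unicode_keyspace / ascii_keyspace:.2e}")

# Order-of-magnitude version used in the text: 10^36 / 10^11 = 10^25 seconds.
SECONDS_PER_YEAR = 365.2422 * 24 * 3600  # assumed mean tropical year
years = 10**25 / SECONDS_PER_YEAR
AGE_OF_UNIVERSE_YEARS = 13.798e9

print(f"Years for 10^25 seconds:   {years:.3e}")
print(f"Multiples of universe age: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```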

Looking at this backwards, it is possible to determine that, in a brute force attack, each Unicode character adds about as much complexity as 3 (or slightly more) ASCII characters. So the complexity of a 3 character Unicode password is comparable to that of a 9 character ASCII password.
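The "1 Unicode character is worth about 3 ASCII characters" rule of thumb follows from a ratio of logarithms. A quick Python check, using the same 95 and 1,111,998 character set sizes as above:

```python
import math

ASCII_CHARS = 95
UNICODE_CHARS = 1_111_998

# How many ASCII characters give the same keyspace as one Unicode character?
ascii_per_unicode = math.log(UNICODE_CHARS) / math.log(ASCII_CHARS)
print(f"{ascii_per_unicode:.2f}")

# Sanity check for the 3 versus 9 comparison: a 3 character Unicode keyspace
# sits between the 9 and 10 character ASCII keyspaces.
print(ASCII_CHARS ** 9 < UNICODE_CHARS ** 3 < ASCII_CHARS ** 10)  # True
```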

From a defence point of view, opening up the character set to something as wide as Unicode represents a huge gain in defending against plain brute forcing attacks.

We’ll be publishing part 2 later where we’ll look at the downside of using Unicode for your passwords. Setting a Windows login password that includes Unicode characters is *quite* a bad idea…