Saturday, October 25, 2008

Death to Homoglyphs! RIP 0, o, O, 1, i, I, and l

I've had some great feedback from new users of G02.ME over the last couple of days, most of it positive. Even better, some of it constructive.

One thing I had not anticipated was that people would be retyping G02.ME URLs. I had assumed they would all be copied and pasted into messages. Turns out, it's not that uncommon for someone to read a G02.ME URL (off a mobile phone or a printout, for example) and then type it into the browser to visit the link.

Problem is, some characters are hard to recognize outside the context of ordinary (English) words. For example, the last character of is an I (capital letter), but could just as easily look like a 1 (digit) or l (lower case L), depending on the font being used.

Characters that look like other characters are called homoglyphs. In order to make G02.ME URLs more readable, I'd have to get rid of all homoglyphs from the characters used to encode shortened URLs. In version 1 of G02.ME, I used 64 characters for encoding (0-9, A-Z, a-z, -, and _). Seven characters in this set are problematic from a readability point of view: (0 o O 1 i I l). I had hoped to replace those seven characters with seven other usable characters.
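To make the counts concrete, here is a minimal Python sketch of the version 1 character set and the seven homoglyphs. The names (BASE64_ALPHABET, HOMOGLYPHS, SAFE_CHARACTERS) and the ordering of the characters are assumptions for illustration, not the actual G02.ME code.

```python
import string

# Version 1 character set as listed above: 0-9, A-Z, a-z, -, _
# (the ordering here is an assumption).
BASE64_ALPHABET = (
    string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"
)

# The seven characters that are easy to confuse with one another.
HOMOGLYPHS = set("0oO1iIl")

# Characters from the original set that remain unambiguous.
SAFE_CHARACTERS = "".join(c for c in BASE64_ALPHABET if c not in HOMOGLYPHS)

print(len(BASE64_ALPHABET), len(HOMOGLYPHS), len(SAFE_CHARACTERS))
```

Seven replacement characters would be needed to get from the remaining safe characters back up to 64.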

Unfortunately, many of the other allowable URL characters are problematic, especially when people include hyperlinks in emails. Email programs scan the text of an email for URL-looking blocks of text, and URLs that include characters like parentheses, brackets, and periods can confuse these URL parsers. I concluded there were only 3 additional safe characters I could use (* ^ ~).

So, rather than replace the offending characters, I decided to just remove them. G02.ME now uses a base 57 character encoding rather than a base 64 one. That means each character now carries about 5.8 bits of information instead of 6. So, there will be fewer really short URLs available before G02.ME has to add another character. Here's the number of available unique encodings of different sizes:

Characters    Base 57         Base 64
2             3 thousand      4 thousand
3             185 thousand    262 thousand
4             10 million      16 million
5             602 million     1 billion
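
The figures above follow from two small formulas: a character drawn from a set of N symbols carries log2(N) bits, and there are N^k distinct encodings of exactly k characters. A quick Python sketch to check the arithmetic (the function names are made up for illustration):

```python
import math


def bits_per_character(base: int) -> float:
    """Information carried by one character drawn from `base` symbols."""
    return math.log2(base)


def capacity(base: int, length: int) -> int:
    """Number of distinct encodings that are exactly `length` characters long."""
    return base ** length


for base in (57, 64):
    print(f"base {base}: {bits_per_character(base):.2f} bits per character")
    for length in range(2, 6):
        print(f"  {length} characters: {capacity(base, length):,}")
```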

I also had to make sure that I don't generate any base 57 URLs that could be mistaken for the base 64 URLs I've already created on G02.ME. Fewer than 64 * 3 (192) URLs have been generated to date, so every existing URL is either a single character or two characters beginning with a '1' or '2'. Since '1' is now an illegal character, and '2' is the ZERO character in the new encoding (so it never leads a multi-character encoding), I only had to skip the first 57 single-character encodings. New G02.ME URLs start at '32'.
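
Here is a minimal Python sketch of a base 57 encoder that behaves as described. It assumes the alphabet keeps the order 0-9, A-Z, a-z, -, _ with the seven homoglyphs removed, and that short URLs are encodings of sequential integer IDs. ALPHABET, encode, decode, and FIRST_NEW_ID are hypothetical names, not the actual G02.ME code.

```python
import string

# Assumed alphabet: the version 1 character set minus the seven homoglyphs.
# With '0' and '1' gone, '2' becomes the zero digit and '3' the one digit.
ALPHABET = "".join(
    c
    for c in string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"
    if c not in "0oO1iIl"
)
BASE = len(ALPHABET)  # 57


def encode(n: int) -> str:
    """Encode a non-negative integer as a base 57 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(s: str) -> int:
    """Decode a base 57 string back to an integer."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


# Skipping the 57 single-character encodings means starting at ID 57.
FIRST_NEW_ID = BASE

print(encode(FIRST_NEW_ID))  # prints 32
```

Under these assumptions, the first ID after the skipped range encodes to '32', and no encoding can ever contain a '0' or '1', or be two characters long with a leading '2'.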

Ironically, my domain name already has a homoglyph in it; the ZERO in G-ZERO-TWO-DOT-ME. There's not much I can do about that as was already registered and parked by someone else.
