Friday, October 31, 2008

A Diversion from My Diversion

I spent most of yesterday writing code that I have no immediate use for. Roy Leban brought up a problem in the office that was intriguing to me:

What is the best way to generate string keys for records, such that they can be used to create a user-specified ordered list. Users should be able to insert elements anywhere in the list. Each newly generated key needs to be positioned alphabetically between it's two nearest neighbors.

This is also reminiscent of the early days for programming in Basic. Each program statement needs a line number to determine the order of execution. Programmers commonly would start their programs at line "10", and increment by 10 (10, 20, 30, ...). This way, they could squeeze 10 lines of new code between the others without having to renumber their programs.

So how do you "number" the records in a database, so that you rarely have to renumber - an expensive operation? The solution Roy and I both independently hit upon (as I'm sure many others have before us), is to treat the strings as digits in a fractional number. If you think of each record number as a number between 0 and 1, you can assign a value of 0.5 to the first record. When a user creates records after that they can be given the values 0.75, 0.875, etc. Each time, you just assign a record to a value "half-way" between the record before and after it.

Using digits 0-9, each string key can be thought of as the suffix after "0." in the decimal representation of the number. The first key is "5", and following keys are "75", "875", etc.

Since App Engine, strings are limited to 500 bytes, this scheme allows for numbers up to 500 digits long. In the worst case, users will be able to append items to the list 1,660 times (500 * log(10)/log(2)) before using up all the digits of the string. We can improve this by using a base-64 character encoding (say, characters 0-9, A-Z, a-z, and -/_). This gives us a limit of 3,000 (500 * 6) insertions into any one spot in the list.

Still, for some applications the probability of inserting new items at the end of the list is much higher than inserting at the beginning. We can "cheat" the division of our number space by selecting our "midpoint" at a fraction of the total space. So instead of choosing the mid point of (A+B)/2, we can choose it to be closer to A, say only 1/8th the way between A and B (7/8 * A + 1/8 * B). This gives us about 5 times (1/(log(8)-log(7))) as much room, or 15,000 items to append (at the cost of 1/8 the room to prepend on the low side).

It was a fun piece of code to write and think about, even if I don't have an immediate need for it. I still have a nagging feeling that there is a better approach by just adding linear increments to items ... but they all have pathological problems once you start to do arbitrary insertions. I would love to hear if there is a more elegant solution known to this problem.

Saturday, October 25, 2008

Death to Homoglyphs! RIP 0, o ,O ,1 ,i ,I , and l

I've had some great feedback from new users of G02.ME over the last couple of days, most of it positive. Even better, some of it constructive.

One thing I had not anticipated was that people would be retyping G02.ME URL's. I had assumed they would all be copied and pasted into messages. Turns out, it's not that uncommon for someone to read a G02.ME URL (off a mobile phone or a print out, for example) and then type it in to the browser to visit the link.

Problem is, some characters are hard to recognize outside the context of ordinary (English) words. For example, the last character of http://g02.me/2I is an I (capital letter), but could just as easily look like a 1 (digit) or l (lower case L), depending on the font being used.

Characters that look like other characters are called homoglyphs. In order to make G02.ME URL's more readable, I'd have to get rid of all homoglyphs from the characters used to encode shortened URLs. In version 1 of G02.ME, I used 64 characters for encoding (0-9, A-Z, a-z, -, and _). There are seven problematic characters from a readability point of view in this set: (0 o O 1 i I l). I had hoped to substitute those seven characters for 7 other usable characters.

Unfortunately, from the set of allowable URL characters many of the alternatives are problematic, especially when people include hyperlinks in emails. Email programs will scan the text of an email to look for URL-looking blocks of text. URL's that include characters like parentheses, brackets, and periods, can confused these URL parsers. I concluded there were only 3 additional safe characters I could use (* ^ ~).

So, rather than replace the offending characters, I decided to just remove them. G02.ME now uses a base 57 character encoding rather than a base 64 one. That means the information content of each letter is now 5.8 bits instead of 6 bits each. So, there will be fewer available really short URL's before G02.ME has to add additional characters. Here's the number of available unique encodings of different sizes:

CharactersBase 57Base 64
23 thousand4 thousand
3185 thousand262 thousand
410 million16 million
5602 million1 billion

I also had to make sure that I don't generate any base 57 URLs that could look like any of the base 64 URLs I've already created on G02.ME. I just had to skip all single-character, and URL's beginning with a '1' or '2' (there have been fewer than 64 * 3 (192) URLs generated to date). Since '1' is already an illegal character, and '2' is the ZERO character already, I only had to skip the first 57 1-character encodings. New G02.ME URLs are starting with '32'.

Ironically, my domain name already has a homoglyph in it; the ZERO in G-ZERO-TWO-DOT-ME. There's not much I can do about that as go2.me was already registered and parked by someone else.

Thursday, October 23, 2008

Introducing G02.ME

Origins of G02.ME

The G02.ME project was started at a Google Hack-a-Thon in Seattle in August of 2008. I wanted to develop a project I could "implement in a day" - and at least have something basic working. Since I had just purchased the domain "g02.me" (yes, I have a domain acquisition habit), why not implement another TinyURL clone?

[BTW, G02.ME is spelled "G-ZERO-TWO-DOT-ME" - though I pronounce it "GO TO ME"]

TinyURL and the History of URL Shortening Services

For those not familiar with TinyURL, it's a URL shortening service. For a variety of reasons, people often want to send URL's (web links) to friends, but the original is too long. Sites like TinyURL create an alias to the original URL that is much shorter, so they can be sent in email or IM messages much more easily.

The original TinyURL was created by Kevin "Gilby" Gilbertson in 2002. In this interview in 2006, he was asked about his plans for the site. ZDNet reporter, David Berlind, thought there could be a lot of business potential behind sites like his, but Gilby seemed reluctant to piss off his user base by selling out to commercial interests. Gilby had a long list of features he wanted to implement, but he doesn't seem to have progressed on that much over the last 2 years.

Today there are an explosion of TinyURL competitors, each with a different set of features (in a future blog post, I will try to provide a detailed feature comparison of the field). The most popular I've found are:

Goals and Features of G02.ME

The primary purpose of G02.ME was to help me become more familiar with the Google App Engine platform, as well as be an experimental playground for me to explore some ideas I've had about building web services and user interfaces.

The features that differentiate G02.ME from TinyURL are:

  • G02.ME directs users to a page that has an Info Panel above the main site page. The info panel allows you to get information about the page, and interact with other users that are visiting the G02.ME shortened link.
  • G02.ME displays Analytics about the link, including the number of times it's been viewed, and how many times people have shared the link with others.
  • G02.ME allows anyone to add comments in the Info Panel to encourage discussing and interaction around the link you're sharing. For the moment, these are all public comments, which can be contributed anonymously or credited to your chosen user name.
  • G02.ME allows you to associate tags with a given link. You can then visit a tag summary page that displays all the links that have been associated with that tag.
  • G02.ME calculates the current popularity of every link in real-time. By visiting the G02.ME home page you can see which links have been shared, viewed, and commented on the most. I'll describe the popularity algorithm in more detail in a future post.
  • G02.ME has a rich JSON API. Virtually all the data and updates can be scripted via a REST/JSON API to the service.

Product Design Philosophy

The design elements I wanted to explore in the creation of G02.ME are:

  • Keep the user interface spare and clean. Maximize the amount of information and functionality, but keep the use of extraneous visual elements to a minumum (this goal is partially driven by the fact that this is a one-man project, and I have limited design skills).
  • I wanted to see how far I could go by using a text-entry centric design. G02.ME basically has a command line interface. Users have to learn a data entry format for comments that let them create a user name, and add free-form tags to their links. I have a fundamental belief that humans are very adaptable, and that they can be taught to learn any interface. Text-based interfaces are also quite efficient and fast to use. So, while it will appeal to a more geeky audience, G02.ME provides a lot of power within a single text input box.
  • G02.ME is also an open source project. I'm developing G02.ME in real-time and exposing the source code as I develop it. Firstly, I like this form of incremental improvement - shipping features frequently, and (hopefully) getting feedback from users to drive further improvements. Secondly, I hope that other developers can learn from this as a sample project in seeing how I've tackled problems in the App Engine environment.

Future Goals and Features

G02.ME has been a bit of a distraction from a larger project I started last year (PageForest). But there are still a few features to add and bugs to fix before I set it aside and return my focus to PageForest. I hope to get feedback here on this blog or as posted issues on the open source project. I would even invite contributions from other developers if anyone is so inclined.

The following features are on my immediate task list:

  • Site Security - Today, G02.ME has a completely open/anonymous user interface - there is no need to create a registered account. I'll be offering the option to register, and thereby provide some security for your personal data (now your comments can be deleted by other users). I also have more work to do to secure the JSON API from abuse.
  • Tag Cloud - I'll be adding ranking to tags and displaying the most frequently used tags on the home page.
  • More Analytics - I want to show more about where users are coming from when they click on a G02.ME link. I can also display some information about whether sites are trending up or down in popularity.

Thanks for reading this far! I hope you'll at least give G02.ME a try. Please use it in your twitter posts and invite your friends to use it as well!