Friday, October 31, 2008

A Diversion from My Diversion

I spent most of yesterday writing code that I have no immediate use for. Roy Leban brought up a problem in the office that was intriguing to me:

What is the best way to generate string keys for records, such that they can be used to create a user-specified ordered list. Users should be able to insert elements anywhere in the list. Each newly generated key needs to be positioned alphabetically between it's two nearest neighbors.

This is also reminiscent of the early days for programming in Basic. Each program statement needs a line number to determine the order of execution. Programmers commonly would start their programs at line "10", and increment by 10 (10, 20, 30, ...). This way, they could squeeze 10 lines of new code between the others without having to renumber their programs.

So how do you "number" the records in a database, so that you rarely have to renumber - an expensive operation? The solution Roy and I both independently hit upon (as I'm sure many others have before us), is to treat the strings as digits in a fractional number. If you think of each record number as a number between 0 and 1, you can assign a value of 0.5 to the first record. When a user creates records after that they can be given the values 0.75, 0.875, etc. Each time, you just assign a record to a value "half-way" between the record before and after it.

Using digits 0-9, each string key can be thought of as the suffix after "0." in the decimal representation of the number. The first key is "5", and following keys are "75", "875", etc.

Since App Engine, strings are limited to 500 bytes, this scheme allows for numbers up to 500 digits long. In the worst case, users will be able to append items to the list 1,660 times (500 * log(10)/log(2)) before using up all the digits of the string. We can improve this by using a base-64 character encoding (say, characters 0-9, A-Z, a-z, and -/_). This gives us a limit of 3,000 (500 * 6) insertions into any one spot in the list.

Still, for some applications the probability of inserting new items at the end of the list is much higher than inserting at the beginning. We can "cheat" the division of our number space by selecting our "midpoint" at a fraction of the total space. So instead of choosing the mid point of (A+B)/2, we can choose it to be closer to A, say only 1/8th the way between A and B (7/8 * A + 1/8 * B). This gives us about 5 times (1/(log(8)-log(7))) as much room, or 15,000 items to append (at the cost of 1/8 the room to prepend on the low side).

It was a fun piece of code to write and think about, even if I don't have an immediate need for it. I still have a nagging feeling that there is a better approach by just adding linear increments to items ... but they all have pathological problems once you start to do arbitrary insertions. I would love to hear if there is a more elegant solution known to this problem.

No comments: