richardtech (richardtech) wrote,

Representing numbers

Recently, Olly Betts and I have been working on making Xapian able to sort results in numeric order, and restrict values by numeric ranges. One approach to this (which seems to be used by all the other search engine type software out there) is to define a comparison function which decodes a stored representation of two numbers, and then compares them.

But, here at Xapian, we're all about speed; and our thinking was that the comparison function will be called a very large number of times during a typical match, and we didn't fancy the overhead of that. Also, we then have to generalise all our code which does comparisons to use a comparison function, rather than a hardcoded string-compare, potentially adding an extra indirection cost to the existing comparison code. (Actually, sorting is already done using a comparison function, but range restriction is done directly).

So, we've implemented a cunning hack: we've defined a function which converts any number stored as a double to a string representation - such that the resulting string representations sort (according to a lexicographical string sort) in the same order as the numbers.

In addition, we wanted to keep the encoding of small integers (and other "common" numbers) as small as possible, since this is also performance critical when retrieving large amounts of data from a database.

A few weeks ago, with a few hints from simont, I implemented this kind of sorting, and it worked fairly well. Most small numbers were encoded as 3 or 4 bytes, and the sorting and range restriction all works perfectly.

However, Olly being a perfectionist decided that this wasn't nearly good enough: in particular, negative numbers all used 9 bytes, and INFINITY didn't sort in the right position. So, we've come up with a new encoding, which Olly spent much of last night debugging, with some assistance from me before we went out. Now, he's asleep, and I've just removed the final bug and committed it to Xapian: hopefully a nice present for him when he gets up.

And, what an improvement it is! Storing small (positive or negative) integers now takes 1 byte. Most 32 bit integers encode in no more than 1 byte more than the number of significant bytes in their standard representation. And nothing other than +INFINITY takes more than 8 bytes.

It's a bit of a complicated piece of code, though: lots of double negations to flip bits into the correct sort order, etc. However, it works. Hurrah!

(Those interested can see the code on the Xapian ViewVC system. See the float_to_string() method, and the corresponding string_to_float() method.)

Comments for this post were locked by the author