richardtech (richardtech) wrote,

Representing numbers

Recently, Olly Betts and I have been working on making Xapian able to sort results in numeric order, and restrict values by numeric ranges. One approach to this (which seems to be used by all the other search engine type software out there) is to define a comparison function which decodes a stored representation of two numbers, and then compares them.

But, here at Xapian, we're all about speed; and our thinking was that the comparison function will be called a very large number of times during a typical match, and we didn't fancy the overhead of that. Also, we then have to generalise all our code which does comparisons to use a comparison function, rather than a hardcoded string-compare, potentially adding an extra indirection cost to the existing comparison code. (Actually, sorting is already done using a comparison function, but range restriction is done directly).

So, we've implemented a cunning hack: we've defined a function which converts any number stored as a double to a string representation - such that the resulting string representations sort (according to a lexicographical string sort) in the same order as the numbers.
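The core of that trick can be sketched with the classic IEEE-754 bit manipulation: flip all the bits of negative numbers, and just set the sign bit of non-negative ones, so that ordinary byte-wise comparison of the results matches numeric comparison. This is a minimal illustration of the general idea, not Xapian's actual float_to_string() (which additionally shortens the encoding of common values):

```python
import struct

def sortable_key(value: float) -> bytes:
    """Encode a double so that lexicographic comparison of the keys
    matches numeric comparison of the values.  Illustrative sketch
    only; not Xapian's float_to_string()."""
    # Grab the raw IEEE-754 bit pattern as a big-endian 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & 0x8000000000000000:
        # Negative: flip every bit, so more-negative values sort lower.
        bits ^= 0xFFFFFFFFFFFFFFFF
    else:
        # Non-negative: set the sign bit so these sort above negatives.
        bits |= 0x8000000000000000
    return struct.pack(">Q", bits)

values = [float("-inf"), -2.5, 0.0, 1e-300, 3.14, float("inf")]
keys = [sortable_key(v) for v in values]
assert keys == sorted(keys)  # byte order matches numeric order
```

Unlike Xapian's encoding, this sketch always produces a fixed 8 bytes and gives -0.0 and +0.0 distinct (though adjacently sorting) keys.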

In addition, we wanted to keep the encoding of small integers (and other "common" numbers) as small as possible, since this is also performance critical when retrieving large amounts of data from a database.

A few weeks ago, with a few hints from simont, I implemented this kind of sorting, and it worked fairly well. Most small numbers were encoded as 3 or 4 bytes, and sorting and range restriction both worked perfectly.

However, Olly, being a perfectionist, decided that this wasn't nearly good enough: in particular, negative numbers all used 9 bytes, and INFINITY didn't sort in the right position. So, we've come up with a new encoding, which Olly spent much of last night debugging, with some assistance from me before we went out. Now, he's asleep, and I've just removed the final bug and committed it to Xapian: hopefully a nice present for him when he gets up.

And, what an improvement it is! Storing small (positive or negative) integers now takes 1 byte. Most 32 bit integers encode in no more than 1 byte more than the number of significant bytes in their standard representation. And nothing other than +INFINITY takes more than 8 bytes.

It's a bit of a complicated piece of code, though: lots of double negations to flip bits into the correct sort order, etc. However, it works. Hurrah!

(Those interested can see the code on the Xapian ViewVC system. See the float_to_string() method, and the corresponding string_to_float() method.)
Tiny niggle: you normalise to the nine-byte representation of +infinity whenever the exponent is greater than 2055. Surely you ought to be normalising to either that or the empty string representation of -infinity depending on the sign of the input? This will only be relevant in the non-IEEE case, of course, since in the IEEE case your prior test against DBL_MIN is sufficient to rule out that possibility; but if you do try to run this code against a non-IEEE representation which permits exponents greater than 2055, then I think the current implementation will deal incorrectly with extremely large but finite negative numbers.

Also, I'm unconvinced by your claim that nothing other than plus infinity takes more than eight bytes. Surely in the most general case of a number stored with an 11-bit exponent and no trailing zero bytes removable from the mantissa, you're going to take three header bits, 11 exponent bits and 52 mantissa bits for a total of 66, overflowing into nine bytes? Not that this is a bug in your implementation, as far as I can see: any finite number meriting a ninth byte will necessarily have the bottom five or six bits of that byte being zero, so your nine-FF encoding of +infinity will still reliably sort above it. So I think this is solely a bug in this LJ post :-)

Might be worth documenting somewhere that this transformation destroys information about the IEEE sign of zero. That's probably what you want – the two signs of zero are defined by IEEE to compare equal, so a transformation specifically intended to preserve comparison results certainly should indeed squash them into the same output string – but just in case anyone ever tries to do anything with this stuff which would notice changes in the sign of zero, it'd be good to have documented it so they couldn't say they weren't warned.
Aha, yes, proof positive. The closest IEEE representation of 2^32 * e (which is to say the number whose exact mathematical value is 11674931554.5426807403564453125 or 6121026514868073/524288) is converted by your function to the nine-byte string E0 69 6F C2 A2 C5 15 DA 40. (I compiled it on its own to make sure.)
Bah - knew it was too good to be true. On the other hand, I should have known that making an over-confident claim like that would ensure that you would check it. :)

Still, 9 bytes but sorting in the right order is pretty good, so I'm pretty happy.

I've added a documentation note about your -zero case, and fixed the -infinity in the non-IEEE case; thanks!
I should have known that making an over-confident claim like that would ensure that you would check it. :)


I was immediately suspicious of that claim, because eight bytes gives you (a bit over) 2^64 possible strings, which is (close enough to) the number of representable doubles in the first place; so you'd have had to have a very well-packed encoding to squeeze everything in there.
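The arithmetic behind that sanity check is easy to verify (a quick illustrative calculation, not from the original post):

```python
# Count the strings of length 0 through 8 over a 256-symbol alphabet.
strings_up_to_8_bytes = sum(256 ** k for k in range(9))

# A bit over 2**64, which is also the number of 64-bit IEEE double
# bit patterns, so an 8-byte encoding would indeed have to be very
# tightly packed to fit every double in.
assert strings_up_to_8_bytes > 2 ** 64
assert strings_up_to_8_bytes < 1.004 * 2 ** 64
```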

Not unimaginably well-packed, of course; I can easily imagine easy-to-compute encodings that do manage to get everything in under 8 bytes. For example, the most obvious such encoding is the one I think I originally suggested to you: convert a double into a 64-bit integer x containing its IEEE bit pattern, then map positive numbers into 0x8000000000000000+x and negative ones into -x (which naturally squashes the two zeroes together, and also preserves trailing mantissa zeroes in negative numbers as your method does), then encode as an 8-byte binary string and trim trailing zero bytes as you currently do.
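That suggested mapping is short enough to write out in full; here is a hypothetical sketch of it (assuming big-endian keys, so byte-wise comparison of the strings matches integer comparison of the mapped values):

```python
import struct

def simple_key(value: float) -> bytes:
    """Sketch of the encoding described above: take the IEEE-754 bit
    pattern x as a 64-bit integer, map non-negative values to
    0x8000000000000000 + x and negative ones to -x (mod 2**64), then
    trim trailing zero bytes.  Illustrative only, not Xapian's code."""
    (x,) = struct.unpack(">Q", struct.pack(">d", value))
    if x & 0x8000000000000000:        # sign bit set: negative (or -0.0)
        mapped = (-x) & 0xFFFFFFFFFFFFFFFF
    else:
        mapped = 0x8000000000000000 + x
    return struct.pack(">Q", mapped).rstrip(b"\x00")

# Both zeroes squash to the same one-byte key, as described above.
assert simple_key(0.0) == simple_key(-0.0) == b"\x80"
```

Trimming trailing zero bytes is safe here because a string always sorts lexicographically before any extension of itself, just as the untrimmed key ending in zero bytes sorts before any other key sharing its prefix.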

But that approach, pleasantly simple as it is, doesn't have the additional advantages of your encoding: the numbers it encodes as 1-byte strings are not especially interesting or common ones, and it's fundamentally limited to the exact range of IEEE double rather than being able to go a little way beyond it as yours can.

I'm not convinced there's great utility in the latter; IEEE is so close to universal these days that I think if it were me I'd be comfortable with ignoring non-IEEE platforms completely. But the former property (encoding common numbers as 1 byte) is clearly useful, and I suspect that the inherent limitations imposed by the relative sorting order between strings of different lengths (i.e. the fact that there's a fixed number of longer strings that sort between any two one-byte strings) probably constrains your ability to retain it in any encoding with an upper bound of 8 bytes. So I think you've probably done the right thing.
Yes, supporting non-IEEE platforms isn't something we worried about much - however, since we had space to encode larger exponents, we thought we might as well let the code do it... The compaction thing was the main aim.

We also store integers in a similar variable length format, in most of the places we store them, which has had a huge effect on database sizes (reduction by a factor of 5 or 6 in some cases, since we were storing 8 byte values before).
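One common variable-length integer scheme (seven payload bits per byte, with the high bit as a continuation flag; a sketch of the general idea, not necessarily Xapian's actual on-disk format) looks like this:

```python
def pack_uint(n: int) -> bytes:
    """Encode a non-negative integer in as few bytes as possible:
    7 payload bits per byte, high bit set on all but the last byte.
    (Illustrative only; Xapian's real format may differ.)"""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def unpack_uint(data: bytes) -> int:
    """Inverse of pack_uint(): reassemble the 7-bit groups."""
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

# Small numbers take one byte instead of a fixed eight.
assert len(pack_uint(42)) == 1
assert unpack_uint(pack_uint(123456789)) == 123456789
```

With values that are usually small, this is where a factor-of-several reduction over fixed 8-byte storage comes from.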
