But, here at Xapian, we're all about speed; and our thinking was that the comparison function will be called a very large number of times during a typical match, and we didn't fancy the overhead of that. Also, we then have to generalise all our code which does comparisons to use a comparison function, rather than a hardcoded string-compare, potentially adding an extra indirection cost to the existing comparison code. (Actually, sorting is already done using a comparison function, but range restriction is done directly).

So, we've implemented a cunning hack: we've defined a function which converts any number stored as a double to a string representation - such that the resulting string representations sort (according to a lexicographical string sort) in the same order as the numbers.

In addition, we wanted to keep the encoding of small integers (and other "common" numbers) as small as possible, since this is also performance critical when retrieving large amounts of data from a database.

A few weeks ago, with a few hints from

**simont**, I implemented this kind of sorting, and it worked fairly well. Most small numbers were encoded as 3 or 4 bytes, and the sorting and range restriction all works perfectly.

However, Olly being a perfectionist decided that this wasn't nearly good enough: in particular, negative numbers all used 9 bytes, and INFINITY didn't sort in the right position. So, we've come up with a new encoding, which Olly spent much of last night debugging, with some assistance from me before we went out. Now, he's asleep, and I've just removed the final bug and committed it to Xapian: hopefully a nice present for him when he gets up.

And, what an improvement it is! Storing small (positive or negative) integers now takes 1 byte. Most 32 bit integers encode in no more than 1 byte more than the number of significant bytes in their standard representation. And nothing other than +INFINITY takes more than 8 bytes.

It's a bit of a complicated piece of code, though: lots of double negations to flip bits into the correct sort order, etc. However, it works. Hurrah!

(Those interested can see the code on the Xapian ViewVC system. See the float_to_string() method, and the corresponding string_to_float() method.)

simontJuly 4 2007, 10:54:59 UTC 9 years ago

Also, I'm unconvinced by your claim that nothing other than plus infinity takes more than eight bytes. Surely in the most general case of a number stored with an 11-bit exponent and no trailing zero bytes removable from the mantissa, you're going to take three header bits, 11 exponent bits and 52 mantissa bits for a total of 66, overflowing into nine bytes? Not that this is a bug in your implementation, as far as I can see: any finite number meriting a ninth byte will necessarily have the bottom five or six bits of that byte being zero, so your nine-FF encoding of +infinity will still reliably sort above it. So I think this is solely a bug in this LJ post :-)

Might be worth documenting somewhere that this transformation destroys information about the IEEE sign of zero. That's probably what you want – the two signs of zero are defined by IEEE to

compareequal, so a transformation specifically intended to preserve comparison results certainly should indeed squash them into the same output string – but just in case anyone ever tries to do anything with this stuff which would notice changes in the sign of zero, it'd be good to have documented it so they couldn't say they weren't warned.simontJuly 4 2007, 11:03:51 UTC 9 years ago

fluffyrichardJuly 4 2007, 11:40:27 UTC 9 years ago

Still, 9 bytes but sorting in the right order is pretty good, so I'm pretty happy.

I've added a documentation note about your -zero case, and fixed the -infinity in the non-IEEE case; thanks!

simontJuly 4 2007, 12:36:52 UTC 9 years ago

*grin*

I was immediately suspicious of that claim, because eight bytes gives you (a bit over) 2

^{64}possible strings, which is (close enough to) the number of representable doubles in the first place; so you'd have had to have averywell-packed encoding to squeeze everything in there.Not

unimaginablywell-packed, of course; I can easily imagine easy-to-compute encodings that do manage to get everything in under 8 bytes. For example, the most obvious such encoding is the one I think I originally suggested to you: convert a double into a 64-bit integer x containing its IEEE bit pattern, then map positive numbers into 0x8000000000000000+x and negative ones into -x (which naturally squashes the two zeroes together, and also preserves trailing mantissa zeroes in negative numbers as your method does), then encode as an 8-byte binary string and trim trailing zero bytes as you currently do.But that approach, pleasantly simple as it is, doesn't have the additional advantages of your encoding: the numbers it encodes as 1-byte strings are not especially interesting or common ones, and it's fundamentally limited to the exact range of IEEE double rather than being able to go a little way beyond it as yours can.

I'm not convinced there's great utility in the latter; IEEE is so close to universal these days that I think if it were me I'd be comfortable with ignoring non-IEEE platforms completely. But the former property (encoding common numbers as 1 byte) is clearly useful, and I suspect that the inherent limitations imposed by the relative sorting order between strings of different lengths (i.e. the fact that there's a fixed number of longer strings that sort between any two one-byte strings) probably constrains your ability to retain it in any encoding with an upper bound of 8 bytes. So I think you've probably done the right thing.

fluffyrichardJuly 4 2007, 13:01:25 UTC 9 years ago

We also store integers in a similar variable length format, in most the places we store them, which has had a huge effect on database sizes (reduction by a factor of 5 or 6 in some cases, since we were storing 8 byte values before).

Comments for this post were locked by the author