Richard's techie comments|
[Most Recent Entries]
Below are the 6 most recent journal entries recorded in
|Wednesday, July 4th, 2007|
Recently, Olly Betts and I have been working on making Xapian able to sort results in numeric order, and restrict values by numeric ranges. One approach to this (which seems to be used by all the other search engine type software out there) is to define a comparison function which decodes a stored representation of two numbers, and then compares them.
But, here at Xapian, we're all about speed; and our thinking was that the comparison function will be called a very large number of times during a typical match, and we didn't fancy the overhead of that. Also, we then have to generalise all our code which does comparisons to use a comparison function, rather than a hardcoded string-compare, potentially adding an extra indirection cost to the existing comparison code. (Actually, sorting is already done using a comparison function, but range restriction is done directly).
So, we've implemented a cunning hack: we've defined a function which converts any number stored as a double to a string representation - such that the resulting string representations sort (according to a lexicographical string sort) in the same order as the numbers.
In addition, we wanted to keep the encoding of small integers (and other "common" numbers) as small as possible, since this is also performance critical when retrieving large amounts of data from a database.
A few weeks ago, with a few hints from simont
, I implemented this kind of sorting, and it worked fairly well. Most small numbers were encoded as 3 or 4 bytes, and the sorting and range restriction all works perfectly.
However, Olly being a perfectionist decided that this wasn't nearly good enough: in particular, negative numbers all used 9 bytes, and INFINITY didn't sort in the right position. So, we've come up with a new encoding, which Olly spent much of last night debugging, with some assistance from me before we went out. Now, he's asleep, and I've just removed the final bug and committed it to Xapian: hopefully a nice present for him when he gets up.
And, what an improvement it is! Storing small (positive or negative) integers now takes 1 byte. Most 32 bit integers encode in no more than 1 byte more than the number of significant bytes in their standard representation. And nothing other than +INFINITY takes more than 8 bytes.
It's a bit of a complicated piece of code, though: lots of double negations to flip bits into the correct sort order, etc. However, it works. Hurrah!
(Those interested can see the code on the Xapian ViewVC system
. See the float_to_string() method, and the corresponding string_to_float() method.)
|Friday, October 13th, 2006|
Just received my T-shirt from Google for being involved in their Summer of Code this year. (I mentored Franz Pletz integrating Xapian with MoinMoin.)
Here's a photo, in which my unshaven face does not appear.
|Tuesday, October 3rd, 2006|
|Copyright on reproductions of old images
, which I still occasionally read, has drawn my attention to a recent "manifesto"
launched by the British library at the Labour Party conference last week, on the subject of Intellectual Property. This makes a number of positive statements about IP in the context of digital communications and modern technology, and will hopefully be part of a debate resulting in the preservation of our ability to effectively share and create new works, both commercially and non-commercially.
However, what I'd really like to see institutions such as the British Library and public museums changing is their policy on reproductions of out-of-copyright images. Currently, they claim copyright on reproductions of such images (ie, they claim that the act of scanning an old painting is a creative work, and that the photograph is protected by copyright, even if the original painting is long out of copyright). They then use this copyright claim to place extortionate costs on publication of images of their collection.
This is somewhat suspect legally, but who has the lawyers to challenge them? Only creative works are meant to be covered by copyright. Where's the creative act in putting a painting into a reproduction system and pressing a button? The National Gallery even prides itself on how accurate a reproduction the copied images you can purchase from their website are, showing that there's no creative step here.
For example, according to this document on the British Library's website
it would cost me 102 pounds to buy a license to put a copy of a single one of their digital images on a website for 6 months, even if that website were a non-commercial website supplying, say, educational resources and lesson plan ideas for schools. The digital image they've made should not be covered by copyright, and should be able to be used without restriction. (It would be fair enough to charge a small fee to go towards the initial bandwidth cost and scanning cost, but don't forget that this is a public organisation which should be using some of its funding from the tax payers to allow all British citizens, including those who don't happen to live in London, as much access to their collection as possible.) Once you have a copy (digital or otherwise) of an out-of-copyright painting, you should be able to do whatever you like with it.
While I welcome the copyright statement the British Library has made here, it should stop trying to abuse copyright in this manner before we salute it as a guardian of our rights, as the slashdot article seems to.
|Sunday, May 14th, 2006|
|Some notes on reference counting
Firstly, I've learnt a lot about memory barriers and suchlike over the weekend - this thread safe reference counting thing isn't as easy as it sounds... Also my ability to deal with assembly has definitely improved; I managed to replicate the solution used in the debian package to resolve bug #321291
, before realising that I should probably just use the source from the debian tarball of libatomic-ops-dev, instead of the slightly older upstream.
Secondly, the documentation of the meaning of the constraints used by GCC's inline assembly syntax is very poor.
Thirdly, if you want to learn about this sort of thing, you could do worse than following lots of links from Hans Boehm's page at hp.com
One thought that reading that brought up: I've often thought that one of the good things about reference counting (as opposed to other garbage collection schemes) is that you don't get unpredicatable large pauses for memory allocation. However what about recursive data structures? When a thread unreferences a large recursive datastructure such that the reference goes to zero, this can cause a large pause as the datastructure is recursively unreferenced, and deallocated. I suppose at least the pause is relatively predictable in this case - assuming that you know the datastructure was large to begin with, which you might well not.
|Thursday, May 11th, 2006|
|More on atomic operations
Aha! I've found libatomic-ops, (http://packages.debian.org/unstable/libdevel/libatomic-ops-dev
) which will do nicely. It's not perfect (in particular, it doesn't provide "increment" and "decrement" operations, so you have to pass a parameter of 1 or -1 to the "add" functions, which I expect will be slightly less efficient on some platforms. However, that's not a big thing, and it supports lots of platforms nicely.
So, instead, I can focus on writing / finding a cross platform reference counting library, which is what I actually wanted in the first place. :)
|Wednesday, May 10th, 2006|
|Atomic integer operations
A short while ago I was bitten by the lack of support for atomic integer operations in Linux. Whilst the kernel implements exactly the operations I need for all the supported operating systems (in the header files included by asm/atomic.h), including those files from userland code is apparently not allowed. On Debian based systems it works fine and no warning is reported, but on Redhat based systems (and presumably others, too), a warning message is displayed ("Using kernel header in userland program. BAD!"), and my compilation fails (because I use the -Werror flag to ensure clean builds).
Because I was in a hurry, and couldn't find a better solution immediately, I didn't look into the problem deeply, and just wrote a couple of assembly functions to perform atomic integer operations on the platform I was working on at the time, and made a mental note to look for a proper solution soon.
I spent most of yesterday (when I wasn't out in the garden), looking into this.
- On Windows, programmers can use the InterlockedIncrement() and InterlockedDecrement() to perform the atomic integer manipulations I wanted.
- On MacOSX, programmers can use OSAtomicIncrement32Barrier() and OSAtomicDecrement32Barrier() to perform the atomic integer manipulations I wanted.
- On BSD, programmers can use atomic_add() and atomic_cmpset() (though apparently, according to the atomic(9) manpage, "The current set of atomic operations do not necessarily guarantee atomicity across multiple processors.", though it's guaranteed for i386, and had I written the assembly myself I'd probably have missed the issue which makes it not work on ia64.)
- On Linux, since you're not allowed to use the kernel header files, there is no available solution.
After some searching, I found a (rather flamey) 18 month old thread on the linux kernel development mailing list discussing this. To summarise roughly, the kernel headers are currently not in a state in which they can safely be included from userland (eg, see http://www.ussg.iu.edu/hypermail/linux/kernel/0411.3/1127.html) and the kernel developers don't want to be tied to providing any more APIs than they have to, so they refuse to support any inclusion of kernel headers from userland - except for a few specific cases which don't include the atomic modification code. Fair enough so far.
However, things aren't quite that simple: one post mentions that it's not always possible to implement atomic operations simply in userland (http://www.ussg.iu.edu/hypermail/linux/kernel/0411.3/1129.html), and another post (from Linus) mentions that including the headers from an application might require the application to be licensed under the GPL (since the code is in inline functions, and thus including these headers involves actually incorporating code into the application, as opposed to simply linking to a system call). (Unfortunately, I can't find a reference to that post now, so I may have mis-summarised.)
This is really frustrating for application developers on Linux: there is some fairly good code which, with a little work and maintainance, could provide the neccessary atomic operations. If it weren't for the legal questions, and the flamey nature of kernel development, I might try and pull that code into shape and persuade the kernel developers to add the atomic operations to the small set of headers which can be included from userland. Instead, I'm putting together a library which will implement atomic operations (or link to the appropriate system defined ones) with a standard, cross platform, interface. I'll add a post in a while to let you know how I get on with that.