barnesc at engr.orst.edu wrote:So my question is: are there any other *practical* applications of a
B-tree based list/set/dict ? In other words, is this module totally
worth coding, or is it just academic wankery and theoretical flim
flam ? :)
B-trees are specifically designed for disk storage. Seeking to a
page takes much longer than reading a page of several kilobytes.
B-trees read the keys for a many-way comparison in one seek.
For in-memory comparison-based dictionaries, Red-Black trees,
AVL trees, 2-3 trees, or skip-lists, are a better choice.
--
--Bryan
---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------
Here overhead is compared to a C array of > 1 million PyObject *s.
The B-tree is assumed to have a branch factor of ~1024.
A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
using malloc). This assumes that B-tree nodes contain a fixed
size C array for storing values (which will be half-full on
average), and that the child array is only allocated for internal
nodes as needed.
Any balanced binary tree with left and right child pointers has
a 6x overhead (using malloc, measured via a gcc win32 test program),
or with a free list, a 3x overhead (though memory will not
be returned until the data structure's destructor is called).
I haven't tried using PyObject_Malloc(), but it should be in
between these two factors.
A skip list has n values and n next pointers on the bottom level.
In the ideal case (with respect to memory usage), we make the
probability sufficiently small so that the higher levels take up a
negligible amount of memory. Thus we have a 4x overhead (using
malloc, measured via a gcc win32 test program), or with a free list,
an overhead slightly greater than 2x (again, memory will only be
returned when the destructor is called). I didn't test
PyObject_Malloc(), but it should give an overhead between 2x and 4x.
Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.
---------------------------------------------------------------------
Speed
---------------------------------------------------------------------
I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.
To do this, one would need two high performance C implementations.
From this, the Python performance can presumably be deduced (although
caching causes unpredictable effects, the faster C algorithm should be
faster in Python).
You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:
* http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
* http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html
He indicates that the skip list is slower than the B-tree, but doesn't
give any solid numbers. Of course you'd have to check his code and
optimize it to death (Perhaps representing a sorted set of integers
using each respective data structure would be a good test).
Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case. In the skip
list case, block retrievals are easy (though not as fast as B-tree
block retrievals due to pointer indirection and cache problems). I
have no idea how to do a fast block insert or delete on a skip list.
----------------------------------------------------------------------
Well, enough ivory tower silliness and academic blustering.
<on_topic>
Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?
</on_topic>
- Connelly Barnes