
Builtin classes list, set, dict reimplemented via B-trees

barnesc at engr.orst.edu wrote:
So my question is: are there any other *practical* applications of a
B-tree based list/set/dict ? In other words, is this module totally
worth coding, or is it just academic wankery and theoretical flim
flam ? :)
B-trees are specifically designed for disk storage. Seeking to a
page takes much longer than reading a page of several kilobytes.
B-trees read the keys for a many-way comparison in one seek.

For in-memory comparison-based dictionaries, Red-Black trees,
AVL trees, 2-3 trees, or skip lists are a better choice.
--
--Bryan


---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------

Here overhead is compared to a C array of > 1 million PyObject *s.

The B-tree is assumed to have a branch factor of ~1024.

A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
using malloc). This assumes that B-tree nodes contain a fixed-size
C array for storing values (which will be half-full on average),
and that the child array is only allocated for internal nodes as
needed.

Any balanced binary tree with left and right child pointers has a
6x overhead (using malloc, measured via a gcc win32 test program),
or with a free list, a 3x overhead (though memory will not be
returned until the data structure's destructor is called). I
haven't tried PyObject_Malloc(), but it should fall between these
two factors.

A skip list has n values and n next pointers on the bottom level.
In the ideal case (with respect to memory usage), the level-promotion
probability is made small enough that the higher levels take up a
negligible amount of memory. Thus we have a 4x overhead (using
malloc, measured via a gcc win32 test program), or with a free list,
an overhead slightly greater than 2x (again, memory is only returned
when the destructor is called). I didn't test PyObject_Malloc(), but
it should give an overhead between 2x and 4x.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.
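
A quick sanity check of those figures (a sketch of mine, counting only
pointer slots with the ~1024-slot fixed value array per node assumed
above; the PyObject payloads themselves are not counted):

BRANCH = 1024                      # value slots allocated per B-tree node

def overhead(values_per_node):
    # Allocated slots per stored value, relative to a raw C array
    # (which needs exactly one slot per value).
    return float(BRANCH) / values_per_node

print(overhead(BRANCH))            # nodes full:      1.0x (best case)
print(overhead(BRANCH // 2))       # nodes half full: 2.0x (worst case)
print((1.0 + 2.0) / 2)             # midpoint:        1.5x, the average quoted above
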
---------------------------------------------------------------------
Speed
---------------------------------------------------------------------

I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.

To do this, one would need two high-performance C implementations.
From these, the relative Python performance can presumably be deduced
(caching causes unpredictable effects, but the faster C implementation
should also be the faster one in Python).

You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:

* http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
* http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

He indicates that the skip list is slower than the B-tree, but doesn't
give any solid numbers. Of course you'd have to check his code and
optimize it to death (Perhaps representing a sorted set of integers
using each respective data structure would be a good test).
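
As a starting point for such a test, here is a rough harness (my
sketch, not Gurney's code) for the sorted-set-of-integers idea, with
the stdlib bisect module standing in for the C B-tree and skip list;
plugging in either C structure would just mean replacing my
build_sorted_set() and contains() helpers:

import bisect
import random
import time

def build_sorted_set(values):
    # O(n) list.insert makes this a deliberately naive baseline;
    # a real B-tree or skip list would replace this whole function.
    keys = []
    for v in values:
        i = bisect.bisect_left(keys, v)
        if i == len(keys) or keys[i] != v:   # skip duplicates
            keys.insert(i, v)
    return keys

def contains(keys, v):
    i = bisect.bisect_left(keys, v)
    return i < len(keys) and keys[i] == v

random.seed(0)
data = [random.randrange(10 ** 9) for _ in range(50000)]

t0 = time.time()
keys = build_sorted_set(data)
t1 = time.time()
hits = sum(contains(keys, v) for v in data)
t2 = time.time()

print("insert: %.3fs  lookup: %.3fs  (%d keys, %d hits)"
      % (t1 - t0, t2 - t1, len(keys), hits))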

Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case. In the skip
list case, block retrievals are easy (though not as fast as B-tree
block retrievals due to pointer indirection and cache problems). I
have no idea how to do a fast block insert or delete on a skip list.
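
For what it's worth, a block retrieval here just means pulling out a
whole contiguous key range in one operation; on a plain sorted list
(my illustration, not the B-tree module itself) it looks like this:

import bisect

keys = list(range(0, 1000, 3))        # a sorted set of integers
lo = bisect.bisect_left(keys, 100)    # first position with key >= 100
hi = bisect.bisect_right(keys, 200)   # first position with key > 200
block = keys[lo:hi]                   # the whole range in one slice
print("%d keys between 100 and 200" % len(block))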

----------------------------------------------------------------------

Well, enough ivory tower silliness and academic blustering.

<on_topic>

Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?

</on_topic>

- Connelly Barnes
Sep 14 '05 #1
2 Replies


IMO a sorted dict implementation can be useful, e.g. one can get an
interval:
L = D['A' : 'K']
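
A minimal sketch of how that slice could work (my code, not nsz's; it
assumes half-open 'A' <= key < 'K' semantics and ignores open-ended
slices):

import bisect

class SortedDict(dict):
    # A dict plus an ordered key list, just enough for D[lo:hi] slices.
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._keys = sorted(self)

    def __setitem__(self, key, value):
        if key not in self:
            bisect.insort(self._keys, key)
        dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        if isinstance(key, slice):    # D['A':'K'] -> values in key order
            lo = bisect.bisect_left(self._keys, key.start)
            hi = bisect.bisect_left(self._keys, key.stop)
            return [dict.__getitem__(self, k) for k in self._keys[lo:hi]]
        return dict.__getitem__(self, key)

D = SortedDict({'Alice': 1, 'Bob': 2, 'Karl': 3, 'Zoe': 4})
print(D['A':'K'])                     # [1, 2]  (Alice, Bob)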

Other useful data types:
linkedlist
queue, stack (well, deque can do it efficiently in Py 2.4)
prioritydict (for graph algorithms)
multimap, multiset (I've never used them, but they're in the C++ STL)
mutable string (a kind of list/array of chars, but with string functions)

nsz

Sep 14 '05 #2

barnesc at engr.orst.edu wrote:
Here overhead is compared to a C array of > 1 million PyObject *s.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.
That's the overhead of indexing; it doesn't consider the space
already used to store the keys and values. The B-tree can get by
with modestly fewer pointers, because it has fewer internal
nodes that need to be referenced by other internal pointers.
That assumes the B-tree nodes are kept as linear arrays, which
means that either inserting into them or searching them will take
time proportional to the fan-out.

[...] I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.
I'd expect Red-Black trees to be at least as good a contender.
Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?


Yes, absolutely. Efficient ordered sub-ranges; efficient rank
queries; sustainable performance with adversarial inputs.
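
To make the rank-query point concrete (my example, not Bryan's): with
an ordered structure, "how many stored keys are less than x?" is a
single binary search, which a hash-based dict or set can only answer
by scanning everything:

import bisect

keys = sorted([17, 3, 99, 42, 8, 64])   # the ordered structure's keys
x = 40
rank = bisect.bisect_left(keys, x)      # count of keys strictly less than x
print("%d keys below %d" % (rank, x))   # 3 keys below 40  (3, 8, 17)
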
--
--Bryan
Sep 14 '05 #3
