473,386 Members | 1,699 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Builtin classes list, set, dict reimplemented via B-trees

barnesc at engr.orst.edu wrote:
So my question is: are there any other *practical* applications of a
B-tree based list/set/dict ? In other words, is this module totally
worth coding, or is it just academic wankery and theoretical flim
flam ? :)
B-trees are specifically designed for disk storage. Seeking to a
page takes much longer than reading a page of several kilobytes.
B-trees read the keys for a many-way comparison in one seek.

For in-memory comparison-based dictionaries, Red-Black trees,
AVL trees, 2-3 trees, or skip-lists, are a better choice.
--
--Bryan


---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------

Here overhead is compared to a C array of > 1 million PyObject *s.

The B-tree is assumed to have a branch factor of ~1024.

A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
using malloc). This assumes that B-tree nodes contain a fixed
size C array for storing values (which will be half-full on
average), and that the child array is only allocated for internal
nodes as needed.
Any balanced binary tree with left and right child pointers has
a 6x overhead (using malloc, measured via a gcc win32 test program),
or with a free list, a 3x overhead (though memory will not
be returned until the data structure's destructor is called).
I haven't tried using PyObject_Malloc(), but it should be in
between these two factors.
A skip list has n values and n next pointers on the bottom level.
In the ideal case (with respect to memory usage), we make the
probability sufficiently small so that the higher levels take up a
negligible amount of memory. Thus we have a 4x overhead (using
malloc, measured via a gcc win32 test program), or with a free list,
an overhead slightly greater than 2x (again, memory will only be
returned when the destructor is called). I didn't test
PyObject_Malloc(), but it should give an overhead between 2x and 4x.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.
---------------------------------------------------------------------
Speed
---------------------------------------------------------------------

I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.

To do this, one would need two high performance C implementations.
From this, the Python performance can presumably be deduced (although

caching causes unpredictable effects, the faster C algorithm should be
faster in Python).

You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:

* http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
* http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

He indicates that the skip list is slower than the B-tree, but doesn't
give any solid numbers. Of course you'd have to check his code and
optimize it to death (Perhaps representing a sorted set of integers
using each respective data structure would be a good test).

Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case. In the skip
list case, block retrievals are easy (though not as fast as B-tree
block retrievals due to pointer indirection and cache problems). I
have no idea how to do a fast block insert or delete on a skip list.

----------------------------------------------------------------------

Well, enough ivory tower silliness and academic blustering.

<on_topic>

Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?

</on_topic>

- Connelly Barnes
Sep 14 '05 #1
2 2923
IMO sorted dict implementation can be useful, eg. one can get an
interval:
L = D['A' : 'K']

other useful data types:
linkedlist
queue, stack (well deque can do it efficiently in py 2.4)
prioritydict (for graph algorithms)
multimap, multiset (i've never used it but it's in the c++ stl)
mutable string (kind of list/array of chars, but with string functions)

nsz

Sep 14 '05 #2
ba*****@engr.orst.edu wrote:
Here overhead is compared to a C array of > 1 million PyObject *s.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.
That's overhead of indexing; it doesn't consider the space
already used to store the keys and values. The B-tree can get by
with modestly fewer pointers, because it has fewer internal
nodes that need to be referenced by other internal pointers.
That assumes that the B-tree nodes are kept as linear arrays,
which means that either inserting into them will time
proportional to the fan-out, or searching them will.

[...] I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.
I'd expect Red-Black trees to be at least as good a contender.
Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?


Yes, absolutely. Efficient ordered sub-ranges; efficient rank
queries; sustainable performance with adversarial inputs.
--
--Bryan
Sep 14 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Blair Hall | last post by:
Can anyone please tell me how to correctly use a built in function when there is a function of the same name in local scope? Here is an example. Suppose the following is in myApply.py: def...
10
by: Stefan Seefeld | last post by:
hi there, I'm trying to convert a tuple to a list, and get a 'TypeError: list objects are unhashable'. Can anybody enlighten me as to the possible causes for this ? Where does hashing come...
49
by: Christopher J. Bottaro | last post by:
I find myself doing the following very often: class Struct: pass .... blah = Struct() blah.some_field = x blah.other_field = y ....
4
by: Tuure Laurinolli | last post by:
Someone pasted the original version of the following code snippet on #python today. I started investigating why the new-style class didn't work as expected, and found that at least some instances...
7
by: John Reese | last post by:
Why hello there ha ha. I have got in the habit of testing the types of variables with isinstance and the builtin type names instead of using the types module, as was the style back around Python...
6
by: bertgoos | last post by:
Hey, I have the following code that has to send every command it receives to a list of backends. Instead of: class MultiBackend(object): """Renders to multiple backends""" def __init__(self,...
3
by: Daniel Nogradi | last post by:
I used to have the following code to collect all (old style) class names defined in the current module to a list called reg: def meta( reg ): def _meta( name, bases, dictionary ): reg.append(...
23
by: =?ISO-8859-1?Q?Sch=FCle_Daniel?= | last post by:
Hello, lst = list((1,2,3)) lst = t = tupel((1,2,3)) t = (1,2,3) s = set((1,2,3)) s = ...
20
by: Seongsu Lee | last post by:
Hi, I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers. Follow is a simple example with 5 keys. dict = {1: , 2: , 900000: , 900001:...
2
by: Dmitry | last post by:
Hi All, I've trying to develop one Python application, and neet to solve one problem. I need to list all classes defined in one package (not module!). Could anybody please show me more...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.