B-trees - C / C++

rsprawls

I found a disk for a b-tree algorithm that I purchased back in 93 or
so. I'd hoped to find this, but now I'd like to know how worthwhile
are b-trees in today's advancements? This is old C code using pascal
and cdecl declarations, but it would be a minor project to bring up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I set foot in programming, I'm itching to start again, but I'm
wanting a starting point.

Aug 2 '08 #1

Subscribe Reply

2767

=?UTF-8?B?RXJpayBXaWtzdHLDtm0=?=

On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back in 93 or
so. I'd hoped to find this, but now I'd like to know how worthwhile
are b-trees in today's advancements? This is old C code using pascal
and cdecl declarations, but it would be a minor project to bring up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I set foot in programming, I'm itching to start again, but I'm
wanting a starting point.

Not really topical for this group, you might want to ask in a general
programming group like comp.programmin g.

AFAIK B-Trees or one of its derivatives are used in most modern
filesystems, since the time it takes to read in a node from the disk is
so high you want to have as few reads as possible. I would also suspect
that they are commonly used in databases and other places which deals
with on-disk storage. If the datastructur is in RAM other structures are
generally better.

--
Erik WikstrÃ¶m

Aug 3 '08 #2

James Kanze

On Aug 3, 2:43 am, Erik Wikström <Erik-wikst...@telia. comwrote:

On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back
in 93 or so. I'd hoped to find this, but now I'd like to
know how worthwhile are b-trees in today's advancements?
This is old C code using pascal and cdecl declarations, but
it would be a minor project to bring up to date in C, but I
was thinking it might be a good project to convert it over
to C++, especially since I don't know C++ very well and
would like a good sorting algorithm for medium sized
datafiles.

Are b-trees still a valued data structure today? It's been
so long since I set foot in programming, I'm itching to
start again, but I'm wanting a starting point.

Not really topical for this group, you might want to ask in a
general programming group like comp.programmin g.

AFAIK B-Trees or one of its derivatives are used in most
modern filesystems, since the time it takes to read in a node
from the disk is so high you want to have as few reads as
possible.

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

I would also suspect that they are commonly used in databases
and other places which deals with on-disk storage.

In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing " structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:-).)

If the data structure is in RAM other structures are generally
better.

That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientier ter Datenverarbeitu ng
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 3 '08 #3

=?UTF-8?B?RXJpayBXaWtzdHLDtm0=?=

On 2008-08-03 10:08, James Kanze wrote:

On Aug 3, 2:43 am, Erik WikstrÃ¶m <Erik-wikst...@telia. comwrote:
>On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back
in 93 or so. I'd hoped to find this, but now I'd like to
know how worthwhile are b-trees in today's advancements?
This is old C code using pascal and cdecl declarations, but
it would be a minor project to bring up to date in C, but I
was thinking it might be a good project to convert it over
to C++, especially since I don't know C++ very well and
would like a good sorting algorithm for medium sized
datafiles.

Are b-trees still a valued data structure today? It's been
so long since I set foot in programming, I'm itching to
start again, but I'm wanting a starting point.

>Not really topical for this group, you might want to ask in a
general programming group like comp.programmin g.

>AFAIK B-Trees or one of its derivatives are used in most
modern filesystems, since the time it takes to read in a node
from the disk is so high you want to have as few reads as
possible.

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

The nice things about B-Trees is that they keep the depth of the tree
very low, which means that the number of nodes that have to be examined
before you find the data you want is also low. But making each node in
the tree the size of (or a multiple of) a disk-block you can make the
number of disk-reads equal to the number of nodes examined. And since
disk-reads are far more expensive than almost any operation you might do
on the data this is a Good Thing.

An example of how one might organise a filesystem would be to keep four
things on disk, a directory (containing directories and files), a B-Tree
of inodes (where an inode contains the data about a file, includin
information about where the actual file data is stores), file data, and
a B-Tree with free space information.

Each entry in the directory represents a file in the system and contains
the inode number of the file. The directory can probably be made small
enough to keep at least part of it in cache.

To read the contents of a file you first look it up in the directory
which gives you the inode number, which is used to find the inode in the
B-Tree. The inode then contains data about the file and in which disk-
blocks the file data is stored. For large files (or sparse files) you
can once again use a B-Tree for the disk-blocks to speed up random access.

The free space tree is much simpler, it just contains the address of
each free disk-block.

--
Erik WikstrÃ¶m

Aug 3 '08 #4

Jerry Coffin

In article <345093ab-bd86-40a4-bdb6-
d7**********@t5 4g2000hsg.googl egroups.com>, ja*********@gma il.com
says...

[ ... Warning: this post is long, and topicality is marginal at best: ]

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

Windows NTFS does use B-trees for directories. Small directories (up to
something like 4K of total data) are stored entirely in the directory's
MFT entry as essentially a vector. Large directories (anything where the
directory entries won't fit entirely in the MFT) are stored as B-trees.

Most Unix file systems use hashing instead. E.g. ext2 and ext3 are quite
common on Linux. There's a pretty decent description of their hashing
at: http://www.nongnu.org/ext2-doc/ext2....EXED-DIRECTORY

Interestingly, though this uses hashing, it's not "flat" hashing -- the
hash is used as the key in (you've probably guessed by now) a balanced
tree -- pretty much a B-tree.

Though the decision as to when to use this indexing is made a bit
differently than it is in NTFS, the same basic idea applies: only large
directories are indexed in this fashion.

There are certainly a few primitive file systems (e.g. FAT) that don't
support indexing directories at all, but at this point, I'd consider
them more the exception than the rule. Oddly, one of the primary uses of
FAT at this point is in things like cards in digital cameras -- which
typically organize your pictures as a single directory containing _lots_
of files (hundreds or thousands of files is quite common). IOW, one of
the most common uses of FAT also closely resembles its worst case.

[ ... ]

In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing " structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:-).)

If you didn't mind a huge index, you could create an index that pointed
to a particular record with the string starting from every possible
offset in a field. For example, given two records with keys of "abcd"
and "xyz", the index would look something like:

key record number
abcd 0
bcd 0
cd 0
d 0
xyz 1
yz 1
z 1

Given its (almost inevitably large) size, you'd probably want to
organize that index as a b-tree.

Then, matching a search with a leading wildcard would be essentially
similar to matching something without the leading wildcard. Given the
larger index size, you might easily have an extra level or two in the
tree, so access might be somewhat slower, but not the orders of
magnitude slower that we expect with current SQL servers.

Of course, on its own, this design only optimizes access in the case of
a leading wildcard -- but IME, that's a very common scenario. OTOH, with
only a small amount of extra data, this index could also be used for
embedded wildcards: you'd also store the offset into the string for each
entry. In this case, searching for something like 'ab%f' would consist
of searching for 'ab' with an offset of 0 and 'f' with an offset >= 2.

[ ... ]

That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).

I wouldn't be surprised if it did pretty well -- like paging, the
performance of caching also tends to depend heavily upon locality of
reference -- and that's a factor B-trees were designed to maximize to a
large degree.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Aug 3 '08 #5

Jerry Coffin

In article <044d17e0-e39a-4103-9aa6-
be**********@m3 6g2000hse.googl egroups.com>, ja*********@gma il.com
says...

[ ... ]

When you create the first file in a directory, you probably
want to allocate from a relatively large block of free space,
to support creating more files nearby as above, since it's
_fairly_ unusual to create a new directory that'll only ever
contain one file.

(You don't know the Unix culture. I've seen empty directories
created for locking purposes:-).)

That's not a problem at all -- an empty directory (or an empty file)
only involves creating a directory entry. In addition, you're only
really concerned when you have difficulty allocating free space on the
disk because it has all been fragmented -- so something created for
locking purposes isn't likely to last long enough to make any real
difference anyway.

More generally, those both sound like interesting heuristics.
But somehow, given the evolution of systems (if it ain't broke,
don't fix it, etc.), I doubt that there widespread in current
Unix or Windows. (Maybe Linux... where the philosophy seems at
times to be "if it ain't broke, it ain't complicated enough".)

:-)

[ I don't have time to write more now, but I'll probably post another
follow-up to your post when I can...]

--
Later,
Jerry.

The universe is a figment of its own imagination.

Aug 4 '08 #6

James Kanze

On Aug 4, 3:50 pm, Erik Wikström <Erik-wikst...@telia. comwrote:

On 2008-08-04 14:41, James Kanze wrote:
On Aug 3, 7:52 pm, Erik Wikström <Erik-wikst...@telia. comwrote:
On 2008-08-03 16:59, James Kanze wrote:
On Aug 3, 12:25 pm, Erik Wikström <Erik-wikst...@telia. comwrote:
On 2008-08-03 10:08, James Kanze wrote:

An example of how one might organise a filesystem would
be to keep four things on disk, a directory (containing
directories and files), a B-Tree of inodes (where an
inode contains the data about a file, includin
information about where the actual file data is stores),
file data, and a B-Tree with free space information.

I rather suspect that most systems keep the mapping inode
number to where the inode is located on disk in memory,
initializing it once, when the disk is mounted. (The
original Unix kept the inodes in a separate contiguous
sub-partition, with the inode number being used as a
direct index. A very clever scheme to maximize head
movement:-). I'm not sure what is being used today, but
for large volumes, at least, a hash table would seem most
appropriate.

With old FSs it was common to pre-allocate a certain number
of inodes when the disk was formatted, which had two
problems, either you had more inodes than you needed or you
did not have enough. In the first case you wasted space,
in the second you were out of luck. New FSs allocates
inodes on the fly and the trend seems to be to put as much
as possible in the B-Tree (or Hash-tree) nodes and only
keep the directory and file data separate.

New, being at least everything since the Berkley 4.2 kernel,
right:-). Except that I've never heard of inodes being kept
in a B-Tree, and I don't see what that would gain (or even
how it would be possible, since you don't even have any
reasonable ordering criteria).

New as in the last few years. I might have confused you by
using the term inode, perhaps meta-data block or some other
term would be better to avoid associations with UFS. One
ordering criteria would be to use the inode (or meta-data
block) number, another to hash the path.

But why? That's what I'm trying to figure out. I'm familiar
with what B-Trees are classically used for (ISAM), and they seem
very much optimized for that: rapid direct access, but also
sequential access according to an ordering criterion. I don't
see where the second would ever apply to inodes or meta-blocks.

As for the gain, while I'm not an expert on the area there are
at least two FSs that I know of (one in early development)
which uses unified B-Trees to store everything but the actual
file data (and perhaps the free blocks, not too sure about how
that works).

OK. I can sort of see why one might want a unified method, and
if B-Trees have distinct advantages in some cases, why not use
them systematically. Jokingly: if all you have is a hammer,
everything looks like a nail. Realistically, however: you have
finite develoment time, and its probably preferable spending it
to get one structure 100% right, rather than three different
structures only more or less right. Once you've got the B-Tree,
you use it unless there is a distinct disadvantage in doing so.

[...]

Solaris (AFAIK) uses UFS which is hardly a modern FS,

I didn't say it was modern:-).

but it is old and proven which is usually more important than
features and performance in the enterprise market (especially
when both features and performance can be added by in the form
of extra software and RADI arrays).

That's sort of what I hinted at in my reply to Jerry: if it
ain't broke, don't fix it (vs if it ain't broke, it ain't
complicated enough). There's also the fact that there doesn't
seem to be as much money to be made making your system more
effecient as there is to be made selling a new, more powerful
hardware, and that I've never heard of anyone benchmarking this
sort of thing in the procuration procedures. Solaris is by far
the system I'm most familiar with, and it was definitly my
impression that what I'd learned back when I was working on
kernel software (some 20 or more years ago) hadn't really
changed that much.

[...]

XFS used two synced B-Trees to manage free space, one was
sorted by logical address, which is handy when trying to find
free space close to some already allocated file (such as when
appending data). The other tree was sorted by size of the free
space, which is nice when trying to find the best place for a
file (as when creating).

If the file system supports contiguous files, this is
definitely an advantage. The filesystems I'm familiar with
today don't. :-(.

No FS I know of do, but that does not mean that trying to keep
up locality of reference is a bad idea.

It depends. If you're going to read sector by sector anyway, on
a multi-user system, it probably doesn't matter. Someone else
is going to access elsewhere between your accesses anyway.

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientier ter Datenverarbeitu ng
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 4 '08 #7

Similar topics

1639

loading objects from ZODB

by: Achim Domma | last post by:

Hi, I'm playing around with ZODB but I don't understand when objects are loaded from the DB. Assume I have an object containing a list of other objects. If this object is loaded from ZODB, when is the list filled? Only on demand? I think yes, because otherwise reading the top node would result in loading the whole database. But I wanted to be sure! regards, Achim

Python

2805

in Ram database?

by: John D. | last post by:

Does anyone know if there is an "in-ram" dictionary based database for Python. I can see some applications where this can really be powerful in applications requiring 10,000 records or less. Is such a database available? If so, who is working on it, or where is it? I've heard that Python supports a stand-alone database. I looked

Python

1785

Balanced tree type coming in next Python?

by: Kenneth McDonald | last post by:

I can't see anything about this in the notes on the upcoming 2.4 on python.org, but for some reason I thought I remembered seeing that a balanced tree sequence type would be included in Python in the near future. Is this correct? Thanks, Ken

Python

2486

storing large amounts of data in a list/dictionary

by: flamesrock | last post by:

Hi, Basically, what I'm trying to do is store large amounts of data in a list or dictionary and then convert that to a custom formatted xml file. My list looks roughly like this: (d,r,p]]]) My question is, would it be faster to use a dictionary if the elements

Python

2911

ZODB memory problems (was: processing a Very Large file)

by: DJTB | last post by:

zodb-dev@zope.org] Hi, I'm having problems storing large amounts of objects in a ZODB. After committing changes to the database, elements are not cleared from memory. Since the number of objects I'd like to store in the ZODB is too large to fit in RAM, my program gets killed with signal 11 or signal 9... Below a minimal working (or actually: it doesn't work because of memory

Python

5245

Hashtable or Btree?

by: Eloff | last post by:

I've got 100MB of urls organized by domain and then by document. I thought that a hastable of hastables or a btree of btrees would be a good way to lookup a specific url quickly by first finding the domain and then finding the matching document. What do you think would be better? And do you have any implementation you recommend? Thanks, and merry Christmas folks. -Dan

C / C++

353

Builtin classes list, set, dict reimplemented via B-trees

by: barnesc | last post by:

Nifty, Tim Peters responded to my e-mail. I must've said something interesting. Cool, a PyCelebrity! > >> ... >> I've gotten bored and went back to one of my other projects: >> reimplementing the Python builtin classes list(), set(), dict(), >> and frozenset() with balanced trees (specifically, counted B-trees >> stored in memory). >>

Python

2237

int8 primary keys still not using index without manual JDBC driverpatch (7.4RC1)

by: Craig O'Shannessy | last post by:

Hi all, Just thought I'd mention that I really think this problem needs to be fixed. I I'm patching the 7.4RC1 JDBC drivers as we speak due to this optimiser bug, and it's the third time I've had to do this. I would think this bug causes quite a lot of people to evaluate postgres and decide it has awful primary key performance! I love postgres, and hate to think that this could be happening.

PostgreSQL Database

2642

ZODB for inverted index?

by: vd12005 | last post by:

Hello, While playing to write an inverted index (see: http://en.wikipedia.org/wiki/Inverted_index), i run out of memory with a classic dict, (i have thousand of documents and millions of terms, stemming or other filtering are not considered, i wanted to understand how to handle GB of text first). I found ZODB and try to use it a bit, but i think i must be misunderstanding how to use it even after reading...

Python

16870

What's the difference of hash table and binary tree?

by: fdmfdmfdm | last post by:

This might not be the best place to post this topic, but I assume most of the experts in C shall know this. This is an interview question. My answer is: hash table gives you O(1) searching but according to your hash function you might take more memory than binary tree. On the contrary, binary tree gives you O(logN) searching but less memory. Am I right?

C / C++

8379

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

8709

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

8494

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

7309

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

6162

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

5627

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4150

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

1924

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

1597

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General