B-trees - C / C++

rsprawls

I found a disk for a b-tree algorithm that I purchased back in 93 or
so. I'd hoped to find this, but now I'd like to know how worthwhile
are b-trees in today's advancements? This is old C code using pascal
and cdecl declarations, but it would be a minor project to bring up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I set foot in programming, I'm itching to start again, but I'm
wanting a starting point.

Aug 2 '08 #1

Subscribe Post Reply

2751

=?UTF-8?B?RXJpayBXaWtzdHLDtm0=?=

On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back in 93 or
so. I'd hoped to find this, but now I'd like to know how worthwhile
are b-trees in today's advancements? This is old C code using pascal
and cdecl declarations, but it would be a minor project to bring up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I set foot in programming, I'm itching to start again, but I'm
wanting a starting point.

Not really topical for this group, you might want to ask in a general
programming group like comp.programming.

AFAIK B-Trees or one of its derivatives are used in most modern
filesystems, since the time it takes to read in a node from the disk is
so high you want to have as few reads as possible. I would also suspect
that they are commonly used in databases and other places which deals
with on-disk storage. If the datastructur is in RAM other structures are
generally better.

--
Erik WikstrÃ¶m

Aug 3 '08 #2

James Kanze

On Aug 3, 2:43 am, Erik Wikström <Erik-wikst...@telia.comwrote:

On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back
in 93 or so. I'd hoped to find this, but now I'd like to
know how worthwhile are b-trees in today's advancements?
This is old C code using pascal and cdecl declarations, but
it would be a minor project to bring up to date in C, but I
was thinking it might be a good project to convert it over
to C++, especially since I don't know C++ very well and
would like a good sorting algorithm for medium sized
datafiles.

Are b-trees still a valued data structure today? It's been
so long since I set foot in programming, I'm itching to
start again, but I'm wanting a starting point.

Not really topical for this group, you might want to ask in a
general programming group like comp.programming.

AFAIK B-Trees or one of its derivatives are used in most
modern filesystems, since the time it takes to read in a node
from the disk is so high you want to have as few reads as
possible.

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

I would also suspect that they are commonly used in databases
and other places which deals with on-disk storage.

In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing" structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:-).)

If the data structure is in RAM other structures are generally
better.

That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 3 '08 #3

=?UTF-8?B?RXJpayBXaWtzdHLDtm0=?=

On 2008-08-03 10:08, James Kanze wrote:

On Aug 3, 2:43 am, Erik WikstrÃ¶m <Erik-wikst...@telia.comwrote:
>On 2008-08-03 00:02, rsprawls wrote:

I found a disk for a b-tree algorithm that I purchased back
in 93 or so. I'd hoped to find this, but now I'd like to
know how worthwhile are b-trees in today's advancements?
This is old C code using pascal and cdecl declarations, but
it would be a minor project to bring up to date in C, but I
was thinking it might be a good project to convert it over
to C++, especially since I don't know C++ very well and
would like a good sorting algorithm for medium sized
datafiles.

Are b-trees still a valued data structure today? It's been
so long since I set foot in programming, I'm itching to
start again, but I'm wanting a starting point.

>Not really topical for this group, you might want to ask in a
general programming group like comp.programming.

>AFAIK B-Trees or one of its derivatives are used in most
modern filesystems, since the time it takes to read in a node
from the disk is so high you want to have as few reads as
possible.

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

The nice things about B-Trees is that they keep the depth of the tree
very low, which means that the number of nodes that have to be examined
before you find the data you want is also low. But making each node in
the tree the size of (or a multiple of) a disk-block you can make the
number of disk-reads equal to the number of nodes examined. And since
disk-reads are far more expensive than almost any operation you might do
on the data this is a Good Thing.

An example of how one might organise a filesystem would be to keep four
things on disk, a directory (containing directories and files), a B-Tree
of inodes (where an inode contains the data about a file, includin
information about where the actual file data is stores), file data, and
a B-Tree with free space information.

Each entry in the directory represents a file in the system and contains
the inode number of the file. The directory can probably be made small
enough to keep at least part of it in cache.

To read the contents of a file you first look it up in the directory
which gives you the inode number, which is used to find the inode in the
B-Tree. The inode then contains data about the file and in which disk-
blocks the file data is stored. For large files (or sparse files) you
can once again use a B-Tree for the disk-blocks to speed up random access.

The free space tree is much simpler, it just contains the address of
each free disk-block.

--
Erik WikstrÃ¶m

Aug 3 '08 #4

Jerry Coffin

In article <345093ab-bd86-40a4-bdb6-
d7**********@t54g2000hsg.googlegroups.com>, ja*********@gmail.com
says...

[ ... Warning: this post is long, and topicality is marginal at best: ]

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

Windows NTFS does use B-trees for directories. Small directories (up to
something like 4K of total data) are stored entirely in the directory's
MFT entry as essentially a vector. Large directories (anything where the
directory entries won't fit entirely in the MFT) are stored as B-trees.

Most Unix file systems use hashing instead. E.g. ext2 and ext3 are quite
common on Linux. There's a pretty decent description of their hashing
at: http://www.nongnu.org/ext2-doc/ext2....EXED-DIRECTORY

Interestingly, though this uses hashing, it's not "flat" hashing -- the
hash is used as the key in (you've probably guessed by now) a balanced
tree -- pretty much a B-tree.

Though the decision as to when to use this indexing is made a bit
differently than it is in NTFS, the same basic idea applies: only large
directories are indexed in this fashion.

There are certainly a few primitive file systems (e.g. FAT) that don't
support indexing directories at all, but at this point, I'd consider
them more the exception than the rule. Oddly, one of the primary uses of
FAT at this point is in things like cards in digital cameras -- which
typically organize your pictures as a single directory containing _lots_
of files (hundreds or thousands of files is quite common). IOW, one of
the most common uses of FAT also closely resembles its worst case.

[ ... ]

In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing" structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:-).)

If you didn't mind a huge index, you could create an index that pointed
to a particular record with the string starting from every possible
offset in a field. For example, given two records with keys of "abcd"
and "xyz", the index would look something like:

key record number
abcd 0
bcd 0
cd 0
d 0
xyz 1
yz 1
z 1

Given its (almost inevitably large) size, you'd probably want to
organize that index as a b-tree.

Then, matching a search with a leading wildcard would be essentially
similar to matching something without the leading wildcard. Given the
larger index size, you might easily have an extra level or two in the
tree, so access might be somewhat slower, but not the orders of
magnitude slower that we expect with current SQL servers.

Of course, on its own, this design only optimizes access in the case of
a leading wildcard -- but IME, that's a very common scenario. OTOH, with
only a small amount of extra data, this index could also be used for
embedded wildcards: you'd also store the offset into the string for each
entry. In this case, searching for something like 'ab%f' would consist
of searching for 'ab' with an offset of 0 and 'f' with an offset >= 2.

[ ... ]

That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).

I wouldn't be surprised if it did pretty well -- like paging, the
performance of caching also tends to depend heavily upon locality of
reference -- and that's a factor B-trees were designed to maximize to a
large degree.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Aug 3 '08 #5

Jerry Coffin

In article <044d17e0-e39a-4103-9aa6-
be**********@m36g2000hse.googlegroups.com>, ja*********@gmail.com
says...

[ ... ]

When you create the first file in a directory, you probably
want to allocate from a relatively large block of free space,
to support creating more files nearby as above, since it's
_fairly_ unusual to create a new directory that'll only ever
contain one file.

(You don't know the Unix culture. I've seen empty directories
created for locking purposes:-).)

That's not a problem at all -- an empty directory (or an empty file)
only involves creating a directory entry. In addition, you're only
really concerned when you have difficulty allocating free space on the
disk because it has all been fragmented -- so something created for
locking purposes isn't likely to last long enough to make any real
difference anyway.

More generally, those both sound like interesting heuristics.
But somehow, given the evolution of systems (if it ain't broke,
don't fix it, etc.), I doubt that there widespread in current
Unix or Windows. (Maybe Linux... where the philosophy seems at
times to be "if it ain't broke, it ain't complicated enough".)

:-)

[ I don't have time to write more now, but I'll probably post another
follow-up to your post when I can...]

--
Later,
Jerry.

The universe is a figment of its own imagination.

Aug 4 '08 #6

James Kanze

On Aug 4, 3:50 pm, Erik Wikström <Erik-wikst...@telia.comwrote:

On 2008-08-04 14:41, James Kanze wrote:
On Aug 3, 7:52 pm, Erik Wikström <Erik-wikst...@telia.comwrote:
On 2008-08-03 16:59, James Kanze wrote:
On Aug 3, 12:25 pm, Erik Wikström <Erik-wikst...@telia.comwrote:
On 2008-08-03 10:08, James Kanze wrote:

An example of how one might organise a filesystem would
be to keep four things on disk, a directory (containing
directories and files), a B-Tree of inodes (where an
inode contains the data about a file, includin
information about where the actual file data is stores),
file data, and a B-Tree with free space information.

I rather suspect that most systems keep the mapping inode
number to where the inode is located on disk in memory,
initializing it once, when the disk is mounted. (The
original Unix kept the inodes in a separate contiguous
sub-partition, with the inode number being used as a
direct index. A very clever scheme to maximize head
movement:-). I'm not sure what is being used today, but
for large volumes, at least, a hash table would seem most
appropriate.

With old FSs it was common to pre-allocate a certain number
of inodes when the disk was formatted, which had two
problems, either you had more inodes than you needed or you
did not have enough. In the first case you wasted space,
in the second you were out of luck. New FSs allocates
inodes on the fly and the trend seems to be to put as much
as possible in the B-Tree (or Hash-tree) nodes and only
keep the directory and file data separate.

New, being at least everything since the Berkley 4.2 kernel,
right:-). Except that I've never heard of inodes being kept
in a B-Tree, and I don't see what that would gain (or even
how it would be possible, since you don't even have any
reasonable ordering criteria).

New as in the last few years. I might have confused you by
using the term inode, perhaps meta-data block or some other
term would be better to avoid associations with UFS. One
ordering criteria would be to use the inode (or meta-data
block) number, another to hash the path.

But why? That's what I'm trying to figure out. I'm familiar
with what B-Trees are classically used for (ISAM), and they seem
very much optimized for that: rapid direct access, but also
sequential access according to an ordering criterion. I don't
see where the second would ever apply to inodes or meta-blocks.

As for the gain, while I'm not an expert on the area there are
at least two FSs that I know of (one in early development)
which uses unified B-Trees to store everything but the actual
file data (and perhaps the free blocks, not too sure about how
that works).

OK. I can sort of see why one might want a unified method, and
if B-Trees have distinct advantages in some cases, why not use
them systematically. Jokingly: if all you have is a hammer,
everything looks like a nail. Realistically, however: you have
finite develoment time, and its probably preferable spending it
to get one structure 100% right, rather than three different
structures only more or less right. Once you've got the B-Tree,
you use it unless there is a distinct disadvantage in doing so.

[...]

Solaris (AFAIK) uses UFS which is hardly a modern FS,

I didn't say it was modern:-).

but it is old and proven which is usually more important than
features and performance in the enterprise market (especially
when both features and performance can be added by in the form
of extra software and RADI arrays).

That's sort of what I hinted at in my reply to Jerry: if it
ain't broke, don't fix it (vs if it ain't broke, it ain't
complicated enough). There's also the fact that there doesn't
seem to be as much money to be made making your system more
effecient as there is to be made selling a new, more powerful
hardware, and that I've never heard of anyone benchmarking this
sort of thing in the procuration procedures. Solaris is by far
the system I'm most familiar with, and it was definitly my
impression that what I'd learned back when I was working on
kernel software (some 20 or more years ago) hadn't really
changed that much.

[...]

XFS used two synced B-Trees to manage free space, one was
sorted by logical address, which is handy when trying to find
free space close to some already allocated file (such as when
appending data). The other tree was sorted by size of the free
space, which is nice when trying to find the best place for a
file (as when creating).

If the file system supports contiguous files, this is
definitely an advantage. The filesystems I'm familiar with
today don't. :-(.

No FS I know of do, but that does not mean that trying to keep
up locality of reference is a bad idea.

It depends. If you're going to read sector by sector anyway, on
a multi-user system, it probably doesn't matter. Someone else
is going to access elsewhere between your accesses anyway.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 4 '08 #7

Similar topics

loading objects from ZODB

by: Achim Domma | last post by:

Hi, I'm playing around with ZODB but I don't understand when objects are loaded from the DB. Assume I have an object containing a list of other objects. If this object is loaded from ZODB, when...

Python

in Ram database?

by: John D. | last post by:

Does anyone know if there is an "in-ram" dictionary based database for Python. I can see some applications where this can really be powerful in applications requiring 10,000 records or less. ...

Python

Balanced tree type coming in next Python?

by: Kenneth McDonald | last post by:

I can't see anything about this in the notes on the upcoming 2.4 on python.org, but for some reason I thought I remembered seeing that a balanced tree sequence type would be included in Python in...

Python

storing large amounts of data in a list/dictionary

by: flamesrock | last post by:

Hi, Basically, what I'm trying to do is store large amounts of data in a list or dictionary and then convert that to a custom formatted xml file. My list looks roughly like this: (d,r,p]]])...

Python

ZODB memory problems (was: processing a Very Large file)

by: DJTB | last post by:

zodb-dev@zope.org] Hi, I'm having problems storing large amounts of objects in a ZODB. After committing changes to the database, elements are not cleared from memory. Since the number of...

Python

Hashtable or Btree?

by: Eloff | last post by:

I've got 100MB of urls organized by domain and then by document. I thought that a hastable of hastables or a btree of btrees would be a good way to lookup a specific url quickly by first finding...

C / C++

Builtin classes list, set, dict reimplemented via B-trees

by: barnesc | last post by:

Nifty, Tim Peters responded to my e-mail. I must've said something interesting. Cool, a PyCelebrity! > >> ... >> I've gotten bored and went back to one of my other projects: >>...

Python

int8 primary keys still not using index without manual JDBC driverpatch (7.4RC1)

by: Craig O'Shannessy | last post by:

Hi all, Just thought I'd mention that I really think this problem needs to be fixed. I I'm patching the 7.4RC1 JDBC drivers as we speak due to this optimiser bug, and it's the third time...

PostgreSQL Database

ZODB for inverted index?

by: vd12005 | last post by:

Hello, While playing to write an inverted index (see: http://en.wikipedia.org/wiki/Inverted_index), i run out of memory with a classic dict, (i have thousand of documents and millions of terms,...

Python

What's the difference of hash table and binary tree?

by: fdmfdmfdm | last post by:

This might not be the best place to post this topic, but I assume most of the experts in C shall know this. This is an interview question. My answer is: hash table gives you O(1) searching but...

C / C++

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice