C++ hash map container

Hi,

I need to store objects of a FileInfo class containing information about a
specific file. I can uniquely identify these files by their serial number.
I thought it would be a good idea to use a hash map so I can access the
file information in constant time, without having to iterate over a
(possibly very large) list of files.
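
For concreteness, here's roughly what I'm doing now (FileInfo is
simplified, and the key type is just a stand-in for whatever the real
serial number type is):

#include <iostream>
#include <map>
#include <string>

// Simplified stand-in for my real FileInfo class.
struct FileInfo {
    std::string name;
    long size;
};

int main()
{
    std::map<unsigned long, FileInfo> files; // keyed by serial number

    FileInfo info = { "example.txt", 1024 };
    files[123456UL] = info;

    // This is the lookup I'd like to stay fast with many files.
    std::map<unsigned long, FileInfo>::iterator it = files.find(123456UL);
    if (it != files.end())
        std::cout << it->second.name << "\n";
    return 0;
}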

As far as I know, std::map does not hash its entries (I guess it takes
O(n log n) time to find an entry). There's an extension class called
__gnu_cxx::hash_map, but it's an SGI extension and I'm not sure how
portable it is.
I tried browsing boost.org but couldn't find an implementation of a hash map
either.

Any suggestions?

Thanks,
Matthias
Jul 22 '05 #1

"Matthias Käppler" <no****@digitalraid.com> wrote in message
news:co*************@news.t-online.com...
I thought it would be a good idea to use a hash map so I can access the
file information in constant time [ ... ] I tried browsing boost.org but
couldn't find an implementation of a hash map either.


The recently approved library technical report TR1 contains specifications for
four hashed container templates: unordered_set, unordered_map,
unordered_multiset and unordered_multimap.
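
Usage is close to std::map. A minimal sketch (header names and
namespace placement varied among early implementations; I'm assuming
g++'s <tr1/unordered_map> here, with the template in namespace
std::tr1):

#include <tr1/unordered_map> // header location is implementation-specific
#include <iostream>
#include <string>

int main()
{
    // Average-case constant-time lookup, keyed by serial number.
    std::tr1::unordered_map<unsigned long, std::string> files;
    files[123456UL] = "example.txt";

    std::tr1::unordered_map<unsigned long, std::string>::iterator it =
        files.find(123456UL);
    if (it != files.end())
        std::cout << it->second << "\n";
    return 0;
}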

An implementation can be found in the Boost Yahoo Files section, under the
peculiar name jtc1-sc22-wg21-2003-n1456. See

http://groups.yahoo.com/group/boost/...3-n1456.tar.gz

(You need to sign up for the boost developers list to access this material. See
http://www.boost.org/more/mailing_lists.htm#main)

BTW, I'm not sure the above implementation contains the latest changes to the
TR1 specs, but it shouldn't matter for your purposes.

Jonathan
Jul 22 '05 #2
Matthias Käppler <no****@digitalraid.com> wrote in message news:<co*************@news.t-online.com>...

[ ... ]
As far as I know, std::map does not hash its entries (I guess it takes
O(n log n) time to find an entry).


No -- it's O(log N). Given that containers are (intended to be)
contained entirely in addressable memory, the difference between
logarithmic and constant time is small enough that in real life it
will depend as much on the details of the implementation as on the
computational complexity.

Just for example, if we assume a million items in the collection, a
map should require roughly 20 operations for a search (log2 of a
million is about 20). A million items suggests about four characters
to distinguish between items, so we'll average comparing 2 bytes for
most of those comparisons, meaning we look at about 38 bytes plus the
length of the key being searched for.

For hashing, the initial hash requires looking at the entire key being
searched for, and in a typical hash table we can expect to do about
three probes to find the right item, meaning we look at about 6 bytes
plus the length of the key being searched for. If (for example) the
key is a string that averages 20 bytes, we get 58 bytes looked at for
the tree, and 29 bytes looked at for the hash table, implying that the
hash table should be twice as fast.

Given the exponential growth of a tree, all but the last three or
four levels are likely to be in the cache. That makes those
comparisons _essentially_ free, since loading even one item from
memory will probably be slower than traversing whatever part of the
tree fits in the cache. In a hash table, if our hash function is good
at all, we can expect the probes to be distributed almost randomly
through memory, meaning we get no locality of reference and therefore
no meaningful use of the cache (unless the table is small enough that
nearly all of it fits in the cache).

This means a million items is an (extremely) rough approximation of a
break-even point, with the hash table clearly winning if you have
enough items, but the tree winning at the smaller end.

It's also worth noting, however, that you can _expect_ this kind of
behavior out of a tree. Algorithms for hash tables that need (for
example) to accommodate arbitrary growth aren't nearly as well known,
and some of them are pretty ugly. The algorithms (of which I'm aware)
that avoid the worst of the ugliness also lose a fair amount in
average speed.

The bottom line is that I wouldn't take for granted that hashing is
necessary to get decent performance in your situation.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 22 '05 #3
That was an interesting read, thanks Jerry.
Maybe I'll just stick with std::map.

Jerry Coffin wrote:
[ ... ]
Jul 22 '05 #4
jc*****@taeus.com (Jerry Coffin) wrote in message news:<b2*************************@posting.google.com>...
[ ... ]
Just for example, if we assume a million items in the collection, a
map should require roughly 20 operations for a search. [ ... ]


What about page faults? A tree can cause a page fault on every one of
those 20 operations. Most likely, the first couple of levels will be
cached, but once you get down several levels in the tree, you will
start getting page faults.

There are special tree implementations that reduce page faults, by
making B-Trees where each node has many children, like hundreds of
children, instead of just a 'left' and a 'right'. It's very unlikely
that std::set is going to use that optimization, as it only makes
sense on large data sets.

I would take advantage of generic programming, and implement the
application first using std::set, and then with whichever hash_set
extension you like, and profiling which actually behaves better in
practice. With generic code and a few typedefs, possibly only one line
of code will have to be changed to try out another container type.
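
For example (FileInfo and the key type are placeholders here; the
commented-out line shows the g++ spelling of the hash-based
alternative):

#include <iostream>
#include <map>
#include <string>
// #include <ext/hash_map> // g++-specific header for __gnu_cxx::hash_map

struct FileInfo {
    std::string name;
    long size;
};

// The one typedef to change when trying out another container:
typedef std::map<unsigned long, FileInfo> FileTable;
// typedef __gnu_cxx::hash_map<unsigned long, FileInfo> FileTable;

void print_name(const FileTable& files, unsigned long serial)
{
    FileTable::const_iterator it = files.find(serial);
    if (it != files.end())
        std::cout << it->second.name << "\n";
}

The rest of the code only mentions FileTable, so comparing containers
is a matter of flipping the typedef and rerunning the profiler.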

--
Dave O'Hearn
Jul 22 '05 #5
Matthias Käppler <no****@digitalraid.com> wrote in message news:<co*************@news.t-online.com>...
That was an interesting read, thanks Jerry.
Maybe I'll just stick with std::map.


Well, I'm not sure I'd _stick_ with it, but there's a pretty fair
chance that I'd at least start with it. When/if a profiler shows a
reason to do so, that's soon enough to switch to something else if
needed.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 22 '05 #6
da******@pobox.com (Dave O'Hearn) wrote in message news:<3e**************************@posting.google.com>...

[ ... ]
What about page faults? A tree can cause a page fault on every one of
those 20 operations. Most likely, the first couple of levels will be
cached, but once you get down several levels in the tree, you will
start getting page faults.

Remember that a tree grows exponentially. The single bottom level of a
(balanced) tree accounts for _half_ the data. If only the bottom four
levels of the tree have been paged out, that means only 1/16th of the
data is in memory.

That can only happen in one of two ways: either your memory is HUGELY
overcommitted, or your tree is used VERY little (or both). If memory
is overcommitted so badly that only 1/16th of frequently-used data is
in memory, NOTHING is going to be fast -- and we're still only talking
about 4 disk accesses, which means the lookup itself is still likely
to appear instantaneous.

Alternatively, the data is simply used SO rarely that essentially ALL
the data has been paged out. Even at that, the penalty isn't terrible.
With a modern disk a single lookup might be expected to take around
15 ms. 20 of those translates to 300 ms total, which is long enough
for the user to notice a pause, but that's about it. Our original
postulate to get to this point says this happens only (say) once every
few hours or so, at which point saving milliseconds becomes pointless.

In reality, if these lookups are really that rare, chances are the
user will notice a lot longer pause, but it'll be while the code to DO
the lookup gets paged in. In this case, optimizing the lookup itself
will usually make so little difference the user will never even notice
-- if the code takes (for example) 4 seconds to page in, then we might
be looking at 4.1 vs. 4.3 seconds, and the average user won't even
know which was faster.

In this case, we're clearly optimizing the wrong thing -- eliminating
the lookup time entirely still makes less difference than a 10%
improvement in the code paging time.

There are special tree implementations that reduce page faults, by
making B-Trees where each node has many children, like hundreds of
children, instead of just a 'left' and a 'right'. It's very unlikely
that std::set is going to use that optimization, as it only makes
sense on large data sets.

It's certainly true that such things exist, but based on what's been
said about the intended use, I doubt they're suitable. B-trees (and
their brethren such as B* and B+ trees) normally make little sense
unless you know up-front that the majority of the data WILL reside on
disk.

Given that he said he's storing information about files, that seems
hard for me to believe -- just for example, the computer I'm writing
this on has about 350,000 files and 1 Gig. of RAM. I'd guess typical
users have an even higher ratio of RAM:file-count than I do (at least
IME, programmers generally have more, smaller files than "normal"
users). There are certainly file servers with more files, but they
typically have more memory as well.

As such, we more or less have to postulate something like using a
wimpy workstation to store data about all the files on a big server,
or else storing a LOT of information about each file. The former case
sounds to me like a bad enough system design that no data structure can
save it. The latter case simply calls for separating the large data
from the index that finds it (or only putting data into the leaf
nodes, about like a B+ tree).

I would take advantage of generic programming, and implement the
application first using std::set, and then with whichever hash_set
extension you like, and profiling which actually behaves better in
practice. With generic code and a few typedefs, possibly only one line
of code will have to be changed to try out another container type.


I quite agree -- in fact, this is pretty much what I said in my
follow-up.

--
Later,
Jerry.

The universe is a figment of its own imagination.
Jul 22 '05 #7
jc*****@taeus.com (Jerry Coffin) wrote:
Remember that a tree grows exponentially. The single bottom level of a
(balanced) tree accounts for _half_ the data. If only the bottom four
levels of the tree have been paged out, that means only 1/16th of the
data is in memory.

That can only happen in one of two ways: either your memory is HUGELY
overcommitted, or your tree is used VERY little (or both). If memory
is overcommitted so badly that only 1/16th of frequently-used data is
in memory, NOTHING is going to be fast -- and we're still only talking
about 4 disk accesses, which means the lookup itself is still likely
to appear instantaneous.


Ah. I recently read a paper with measurements of trees vs. skip
lists, which demonstrated that 2-way trees caused far more paging on
large data sets. On a second look through, though, "large data sets"
were quite large indeed; the behavior didn't manifest until 600,000
ints were added to the container. Of course a skip list is not a hash
table, but if anything, the hash table would page less than the skip
list.

Still, it is not going to happen unless the memory is hugely
overcommitted, as you said.

--
Dave O'Hearn
Jul 22 '05 #8
