Re: Good ole gnu::hash_map, I'm impressed

On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:

Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?

It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)
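
As a minimal sketch of what using the standard container could look like once it is available (assuming a C++11 compiler; count_words is a hypothetical helper name, not code from this thread), the default hash for std::string is picked up automatically:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count how often each word occurs.
// The map uses std::hash<std::string> as its default hash function.
std::unordered_map<std::string, int>
count_words(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> freq;
    for (const std::string& w : words) {
        ++freq[w];   // operator[] value-initialises a missing count to 0
    }
    return freq;
}
```

Nothing here depends on how good the library's hash is; a poor hash would only make the lookups slower, not wrong.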

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 17 '08 #1


Lionel B

On Thu, 17 Jul 2008 01:21:34 -0700, James Kanze wrote:

On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:

[...]

>Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in the next
version of the standard, implemented using hash tables, and there will
be standard hash functions for most of the common types.

GNU g++ has supported those for quite a while in tr1, it seems.

(I wonder, however. Is the quality of the hashing function going to be
guaranteed?)

By whom/what? I don't think the standard makes any guarantees. I've only
got a draft here, which says just:

6.3.3 Class template hash [tr.unord.hash]

1 The unordered associative containers defined in this clause use
specializations of hash as the default hash function. This class template
is only required to be instantiable for integer types
([basic.fundamental]), floating point types ([basic.fundamental]),
pointer types ([dcl.ptr]), and std::string and std::wstring.

template <class T>
struct hash : public std::unary_function<T, std::size_t>
{
    std::size_t operator()(T val) const;
};

2 The return value of operator() is unspecified, except that equal
arguments yield the same result. operator() shall not throw exceptions.

Still, you can always roll your own [possibly inappropriate metaphor
alert]
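
Rolling your own for a user-defined key might look like the sketch below. It assumes a C++11 std::unordered_map; GridPos, GridPosHash and GridCounts are hypothetical names, and the mixing step is borrowed from the style of boost::hash_combine rather than from anything the draft requires.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

struct GridPos {          // hypothetical key type
    int row;
    int col;
};

// The container needs equality on the key as well as a hash.
inline bool operator==(const GridPos& a, const GridPos& b)
{
    return a.row == b.row && a.col == b.col;
}

struct GridPosHash {
    std::size_t operator()(const GridPos& p) const
    {
        std::size_t h = std::hash<int>()(p.row);
        // Mix in the second field (hash_combine-style step, an assumption).
        h ^= std::hash<int>()(p.col) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }
};

// The hash functor is passed as the third template argument.
typedef std::unordered_map<GridPos, int, GridPosHash> GridCounts;
```

The only hard requirement is the one quoted above: equal arguments must yield the same result. How well the values spread over the buckets is entirely up to the author of the functor.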

--
Lionel B

Jul 17 '08 #2

Mirco Wahab

James Kanze wrote:

On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:
>Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?

It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Ah, thanks for the links. I'll work through it. I see, you
took relatively small working sets. (I considered my 14MB
setup "small" ;-)

I'd try to use your implementation in comparison but
don't know which files are really necessary. Do you
have a .zip of the hash stuff?

>Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)

We'll see - if some usable implementations show up. In the mean time,
the old hash_map seems to be "good enough" for my kind of stuff.
I did additional tests regarding the *reading* speed from the map.

The whole problem would be now:

1) read a big text to memory (14 MB here)
2) tokenize it (by simple regex, this seems to be very fast or fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) according to their frequencies into a vector
5) report highest (2) and lowest (1) frequencies
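
Steps 2 to 5 can be sketched without a regex library and without a global map, roughly as follows. This is an illustration under assumptions, not the code measured in this post: it uses a C++11 std::unordered_map, a hand-written scanner that treats letters, digits and '_' as word characters (approximately what \w matches), and the hypothetical names WordCount, is_word_char and words_by_frequency.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Letters, digits and '_' count as word characters.
inline bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Steps 2-4: tokenize, count, then sort (word, count) pairs ascending by count.
std::vector<WordCount> words_by_frequency(const std::string& text)
{
    std::unordered_map<std::string, int> freq;
    std::size_t i = 0;
    while (i < text.size()) {
        while (i < text.size() && !is_word_char(text[i])) ++i;   // skip separators
        const std::size_t start = i;
        while (i < text.size() && is_word_char(text[i])) ++i;    // scan one word
        if (i > start) ++freq[text.substr(start, i - start)];
    }
    // Copying the pairs carries each count along with its word, so the
    // comparison does not need access to the map at all.
    std::vector<WordCount> out(freq.begin(), freq.end());
    std::sort(out.begin(), out.end(),
              [](const WordCount& a, const WordCount& b) -> bool {
                  if (a.second != b.second) return a.second < b.second;
                  return a.first < b.first;   // tie-break for a stable report
              });
    return out;
}
```

Step 5 is then just out.front() for the least frequent word and out.back() for the most frequent one.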

Now I added 4 and 5. The tree-based std::map falls further behind
(as expected). The ext/hash_map keeps its margin.

std::map (1-5) 0m8.227s real
Perl (1-5) 0m4.732s real
ext/hash_map (1-5) 0m4.465s real

Maybe I didn't find the optimal solution for copying the hash keys to the
vector (I'll add the source at the end).

From "visual inspection" of the test runs, it
can be seen that the array handling (copying
from hash to vector) is very efficient in Perl.

Furthermore, I ran into the problem of how to access the hash values
from a sort function. The only solution that (imho) doesn't involve
enormous complexity, just puts the hash module-global. How to cure that?

Regards

M.

Addendum:

[perl source] ==>
my $fn = 'fulltext.txt';
print "start slurping\n";
open my $fh, '<', $fn or die "$fn - $!";
my $data; { local $/; $data = <$fh> }

my %hash;
print "start hashing\n";
++$hash{$1} while $data =~ /(\w\w*)/g;

print "start sorting (ascending, for frequencies)\n";
my @keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "done, $fn (" . int(length($data)/1024) . " KB) has "
. (scalar keys %hash) . " different words\n";

print "infrequent : $keys[0] = $hash{$keys[0]} times\n"
. "very often: $keys[-2] = $hash{$keys[-2]} times\n"
. "most often: $keys[-1] = $hash{$keys[-1]} times\n"
<==

[hash_map source]==>
#include <boost/regex.hpp>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

// define this to use the tree-based std::map
#ifdef USE_STD_MAP
#include <map>
typedef std::map<std::string, int> StdHash;
#else
#if defined (_MSC_VER)
#include <hash_map>
typedef stdext::hash_map<std::string, int> StdHash;
#else
#include <ext/hash_map>
namespace __gnu_cxx {
    template<> struct hash< std::string > {
        size_t operator()(const std::string& s) const {
            return hash< const char* >()( s.c_str() );
        } // gcc.gnu.org/ml/libstdc++/2002-04/msg00107.html
    };    // allow the gnu hash_map to work on std::string
}
typedef __gnu_cxx::hash_map<std::string, int> StdHash;
#endif
#endif

char *slurp(const char *fname, size_t* len);
size_t word_freq(const char *block, size_t len, StdHash& hash);

// *** ouch, make it a module global? ***
StdHash hash;
// *** how do we better compare on the external hash? ***
struct ExtHashSort { // comparison functor for sort()
    bool operator()(const std::string& a, const std::string& b) const {
        return hash[a] < hash[b];
    }
};

int main()
{
    using namespace std;
    size_t len, nwords;

    const char *fn = "fulltext.txt"; // about 14 MB
    cout << "start slurping" << endl;
    char *block = slurp(fn, &len); // read file into memory

    // StdHash hash; no more!
    cout << "start hashing" << endl;
    nwords = word_freq(block, len, hash); // put words into a hash
    delete [] block; // no longer needed

    cout << "done, " << fn << " (" << len/1024
         << "KB) has " << nwords << " different words" << endl;

    vector<string> keys;
    keys.reserve(nwords);

    cout << "sorting out the longest and shortest words" << endl;
    StdHash::const_iterator p, end; // copy keys to vector
    for(p=hash.begin(),end=hash.end(); p!=end; ++p) keys.push_back(p->first);
    sort(keys.begin(), keys.end(), ExtHashSort()); // sort by hashed number value

    cout << "infrequent :" << keys[0] << "=" << hash[keys[0]] << " times\n"
         << "very often:" << keys[nwords-2] << "=" << hash[keys[nwords-2]] << " times\n"
         << "most often:" << keys[nwords-1] << "=" << hash[keys[nwords-1]] << " times\n";

    return 0;
}

char *slurp(const char *fname, size_t* len)
{
    std::ifstream fh(fname); // open
    fh.seekg(0, std::ios::end); // get to EOF
    *len = fh.tellg(); // read file pointer
    fh.seekg(0, std::ios::beg); // back to pos 0
    char* data = new char [*len+1];
    fh.read(data, *len); // slurp the file
    return data;
}

size_t word_freq(const char *block, size_t len, StdHash& hash)
{
    using namespace boost;
    match_flag_type flags = match_default;
    static regex r("\\w\\w*");
    cmatch match;

    const char *from=block, *to=block+len;
    while( regex_search(from, to, match, r, flags) ) {
        hash[ std::string(match[0].first, match[0].second) ]++;
        from = match[0].second;
    }
    return hash.size();
}
<==

Jul 17 '08 #3

James Kanze

On Jul 17, 11:45 am, Lionel B <m...@privacy.net> wrote:

On Thu, 17 Jul 2008 01:21:34 -0700, James Kanze wrote:
On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:

[...]

Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map
in the next version of the standard, implemented using hash
tables, and there will be standard hash functions for most
of the common types.

GNU g++ has supported those for quite a while in tr1, it seems.

(I wonder, however. Is the quality of the hashing function
going to be guaranteed?)

By whom/what?

By the standard.

I don't think the standard makes any guarantees. I've only got
a draft here, which says just:

6.3.3 Class template hash [tr.unord.hash]

1 The unordered associative containers defined in this clause use
specializations of hash as the default hash function. This class template
is only required to be instantiable for integer types
([basic.fundamental]), floating point types ([basic.fundamental]),
pointer types ([dcl.ptr]), and std::string and std::wstring.

template <class T>
struct hash : public std::unary_function<T, std::size_t>
{
    std::size_t operator()(T val) const;
};

2 The return value of operator() is unspecified, except that equal
arguments yield the same result. operator() shall not throw exceptions.

That's about what I expected, and more or less what I said.
(But I do hope they add a few more types. There's no way you
can write a hash function on std::type_info, for example, yet it
seems quite reasonable to me to want to use it as an index in an
unordered_map.)
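
For reference, the standard as eventually published (C++11) added std::type_index in <typeindex>, a copyable wrapper around std::type_info with a std::hash specialization, which covers this use. A minimal sketch, assuming a C++11 compiler; TypeLabels, register_label and label_of are hypothetical names:

```cpp
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Hypothetical registry mapping a type to a human-readable label.
// std::hash<std::type_index> is provided by <typeindex>.
typedef std::unordered_map<std::type_index, std::string> TypeLabels;

template <typename T>
void register_label(TypeLabels& labels, const std::string& label)
{
    labels[std::type_index(typeid(T))] = label;
}

template <typename T>
std::string label_of(const TypeLabels& labels)
{
    TypeLabels::const_iterator it = labels.find(std::type_index(typeid(T)));
    return it == labels.end() ? std::string("<unregistered>") : it->second;
}
```

With only the TR1 draft quoted above, no such wrapper exists, so the point made in the paragraph stands for that draft.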

Still, you can always roll your own [possibly inappropriate
metaphor alert]

Which, for most people, is likely to be worse than whatever is
in the library; while there are no guarantees, I'm willing to
bet that most implementations will do something which is fairly
good most of the time. (But if you're willing to consider
special data sets, it's relatively trivial to get tons of
collisions with the string hashing functions in g++'s
implementation.)
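
One way to see how a hash behaves on a particular data set is to look at the bucket sizes the container reports. The sketch below assumes a C++11 std::unordered_set and its bucket interface (bucket_count, bucket_size); longest_bucket and LengthHash are hypothetical names, and LengthHash is deliberately bad, purely to provoke collisions. It says nothing about which inputs collide under g++'s own string hash.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>

// Size of the fullest bucket: a rough indicator of how badly the keys collide.
template <typename Hash>
std::size_t longest_bucket(const std::unordered_set<std::string, Hash>& keys)
{
    std::size_t worst = 0;
    for (std::size_t b = 0; b < keys.bucket_count(); ++b) {
        worst = std::max(worst, static_cast<std::size_t>(keys.bucket_size(b)));
    }
    return worst;
}

// Deliberately poor hash: every string of the same length collides.
struct LengthHash {
    std::size_t operator()(const std::string& s) const { return s.size(); }
};
```

Filling one set with the default hash and one with LengthHash from the same keys, then comparing longest_bucket for the two, gives a quick feel for how much a "special" data set can hurt.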

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 17 '08 #4

James Kanze

On Jul 17, 2:07 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:

James Kanze wrote:
On Jul 16, 10:53 pm, Mirco Wahab <wa...@chemie.uni-halle.de> wrote:
Q1: Does anybody else (besides me) like to "hash something"?
How do you do that?

It depends. You might like to have a look at my "Hashing.hh"
header (in the code at kanze.james.neuf.fr/code-en.html---the
Hashing component is in the Basic section). Or for a discussion
and some benchmarks,
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. (That
article is a little out of date now, as I've tried quite a few
more hashing algorithms. But the final conclusions still hold,
more or less.)

Ah, thanks for the links. I'll work through it. I see, you
took relatively small working sets. (I considered my 14MB
setup "small" ;-)

Basically, I took what I had handy, or could easily generate.
And I intentionally used sets of very different sizes, because
part of my goal was to determine at what point hash tables
started significantly beating std::map. (At the time, there was
no proposal for a standard hash table, and it was a question of
how many entries did one need before going to something
non-standard.)

With regards to data sets, there are at least two others that
I'd like to add: a very big set (more than 10000 entries) of
URLs, and a set of all two-character strings. I can generate,
and in fact have generated the latter, but I don't know off hand
where to find the former.

I'd try to use your implementation in comparison but
don't know which files are really necessary. Do you
have a .zip of the hash stuff?

Not of just the hash stuff; you'd have to down-load the entire
library. There aren't too many files in the Hashing component,
however, and it shouldn't be too difficult to remove the
dependencies that it has on other files. (The only one which
comes to mind is that it depends on <gb/stdint.h> for
GB_uint32_t. If your compiler has <stdint.h>, you can use it
and uint32_t instead.)
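
To give an idea of the kind of code that needs a fixed-width 32-bit type, here is a well-known string hash, 32-bit FNV-1a, written against <cstdint>. This is only an illustration: it is not taken from Hashing.hh, and the name fnv1a_32 is made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 32-bit FNV-1a. The arithmetic relies on wrap-around modulo 2^32,
// which is why an exactly 32-bit unsigned type is wanted.
inline std::uint32_t fnv1a_32(const std::string& s)
{
    std::uint32_t h = 2166136261u;               // FNV offset basis
    for (std::size_t i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned char>(s[i]);   // fold in one byte
        h *= 16777619u;                          // FNV prime
    }
    return h;
}
```

With a pre-C++11 compiler, <stdint.h> and plain uint32_t would be used in the same way.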

You can also look at the benchmark code in the Benchmark
sub-system. There are a lot of dependencies there, since it
uses my usual BenchHarness, but it shouldn't be too difficult to
extract the actual hash algorithms to play around with.

Q2: Which "future" can be expected regarding "hashing"?

There will be an std::unordered_set and std::unordered_map in
the next version of the standard, implemented using hash tables,
and there will be standard hash functions for most of the common
types. (I wonder, however. Is the quality of the hashing
function going to be guaranteed?)

We'll see - if some usable implementations show up. In the mean time,
the old hash_map seems to be "good enough" for my kind of stuff.
I did additional tests regarding the *reading* speed from the map.

The whole problem would be now:

1) read a big text to memory (14 MB here)
2) tokenize it (by simple regex, this seems to be very fast or fast enough)
3) put the tokens (words) into a hash and/or increment their frequencies
4) sort the hash keys (the words) according to their frequencies into a vector
5) report highest (2) and lowest (1) frequencies

Now I added 4 and 5. The tree-based std::map falls further behind
(as expected). The ext/hash_map keeps its margin.

std::map (1-5) 0m8.227s real
Perl (1-5) 0m4.732s real
ext/hash_map (1-5) 0m4.465s real

Just curious, but what is the time for just reading the file? I
wouldn't be surprised if that accounts for a large part
of the time. In which case, the biggest speed up might be
there: using system level IO or memory mapping the file.
(Neither of those would be portable, though.)

Maybe I didn't find the optimal solution for copying the hash
keys to the vector (I'll add the source at the end).

Since you're reading the entire file into a single block of
contiguous memory (for which you really could use std::vector),
you really don't have to ever copy anything. Just put a pointer
to the start of the word in your data structures, and put a nul
character behind the word. (This should definitely speed things
up compared to using string: no more dynamic allocations at
all.)
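
A rough sketch of that idea, under assumptions: the file contents are already in a std::vector<char>, the table is a C++11 std::unordered_map keyed on const char*, and CStrHash, CStrEqual, WordTable, is_word_byte and count_in_place are all hypothetical names. The hash is a simple multiplicative one chosen for brevity, not for quality.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

struct CStrHash {   // hashes the characters, not the pointer value
    std::size_t operator()(const char* s) const
    {
        std::size_t h = 0;
        for (; *s != '\0'; ++s) h = h * 31 + static_cast<unsigned char>(*s);
        return h;
    }
};

struct CStrEqual {  // compares the characters, not the pointer value
    bool operator()(const char* a, const char* b) const
    {
        return std::strcmp(a, b) == 0;
    }
};

typedef std::unordered_map<const char*, int, CStrHash, CStrEqual> WordTable;

inline bool is_word_byte(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Splits `buffer` in place. The keys point into `buffer`, so it must
// outlive the table and must not be resized afterwards.
WordTable count_in_place(std::vector<char>& buffer)
{
    WordTable freq;
    buffer.push_back('\0');                  // sentinel terminates the last word
    const std::size_t n = buffer.size() - 1;
    std::size_t i = 0;
    while (i < n) {
        while (i < n && !is_word_byte(buffer[i])) ++i;   // skip separators
        const std::size_t start = i;
        while (i < n && is_word_byte(buffer[i])) ++i;    // scan one word
        if (i > start) {
            buffer[i] = '\0';                // overwrite separator (or sentinel)
            ++freq[&buffer[start]];          // no std::string, no allocation per word
            ++i;
        }
    }
    return freq;
}
```

The price is that the buffer is modified and has to stay alive, unchanged, for as long as the table is in use.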

From "visual inspection" of the test runs, it
can be seen that the array handling (copying
from hash to vector) is very efficient in Perl.

Furthermore, I ran into the problem of how to access the hash
values from a sort function. The only solution that (imho)
doesn't involve enormous complexity, just puts the hash
module-global. How to cure that?

Pass by reference?
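
Spelled out, that could look like the sketch below: the comparison functor keeps a reference to the map, which is handed to it through its constructor, so nothing needs to be global. std::map stands in for the StdHash typedef of the posted program; ByFrequency and keys_by_frequency are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> Counts;   // stand-in for StdHash

class ByFrequency {   // comparison functor for sort()
public:
    explicit ByFrequency(const Counts& counts) : counts_(counts) {}

    bool operator()(const std::string& a, const std::string& b) const
    {
        // Both keys are assumed to be present in the map.
        Counts::const_iterator ia = counts_.find(a);
        Counts::const_iterator ib = counts_.find(b);
        assert(ia != counts_.end() && ib != counts_.end());
        return ia->second < ib->second;
    }

private:
    const Counts& counts_;   // the map is passed in, not global
};

std::vector<std::string> keys_by_frequency(const Counts& counts)
{
    std::vector<std::string> keys;
    keys.reserve(counts.size());
    for (Counts::const_iterator p = counts.begin(); p != counts.end(); ++p)
        keys.push_back(p->first);
    std::sort(keys.begin(), keys.end(), ByFrequency(counts));
    return keys;
}
```

Using find() rather than operator[] lets the functor hold a const reference; operator[] is non-const because it may insert.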

--
James Kanze (GABI Software) email:ja******* **@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 17 '08 #5
