By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
457,712 Members | 1,369 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 457,712 IT Pros & Developers. It's quick & easy.

Hash table in C++? STL?

P: n/a
Hi All,
Does C++/STL have hashtable where I can do stuff like:

Hashtable h<int>;

h.store("one",1);
h.store("two",2);

and then later retrieve them like:

cout<<h.fetch("one")<<endl;

prints 1

any idea?

Thanks!
Jul 22 '05 #1
Share this Question
Share on Google+
34 Replies


P: n/a

"pembed2003" <pe********@yahoo.com> wrote in message
news:db*************************@posting.google.co m...
Hi All,
Does C++/STL have hashtable where I can do stuff like:

Hashtable h<int>;

h.store("one",1);
h.store("two",2);

and then later retrieve them like:

cout<<h.fetch("one")<<endl;

prints 1

any idea?

Thanks!


It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.

john
Jul 22 '05 #2

P: n/a
John Harrison wrote:

It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.


Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #3

P: n/a
"Kevin Goodsell" <us*********************@neverbox.com> wrote
Does anyone know what will need to be provided in
order to hash on a particular type? std::map requires
a less-than comparison operation -- what operations
would a hash-based map need?


A hashing function and an equality operation are the minimum requirements, but
requiring a relational comparison (such as less_than) instead of equality would
allow for certain optimizations on the part of the implementers (such as using
a tree-like structure instead of lists for the buckets). I don't know whether
the standard will require equality or a relational comparator.

Sadly, the hashing function is where all the complexity lies and I'd be
pleasantly surprised if more than 1% of programmers knew what constitutes a
good hashing function. With a bad hashing function (and a naive bucket
strategy), performance can quickly degenerate to much worse than a balanced
tree on large data sets.

Claudio Puviani
Jul 22 '05 #4

P: n/a
Kevin Goodsell <us*********************@neverbox.com> wrote in
news:Nx*****************@newsread2.news.pas.earthl ink.net:
John Harrison wrote:

It has a map which has the same functionality as above (but store is
called insert and fetch is called find). Map is usually based on a
red-black tree algorithm. Several library implementers also offer a
genuine hash table (called hash_map) and apparently some version of
this will make it into a future version of the standard.


Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?


That would depend on your implementation of hash_map. I would suspect
some mechanism for generating a hash value from your key type, and I
suppose some sort of comparison on the hash value as well (and probably
you'd still need some sort of comparison function for your keys since they
may need to be compared in the event of hash collisions).
Jul 22 '05 #5

P: n/a
On Wed, 14 Apr 2004 20:54:05 GMT in comp.lang.c++, Kevin Goodsell
<us*********************@neverbox.com> wrote,
Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?


Mainly, a hashing function that takes a argument of type Key
and returns an int.

http://std.dkuug.dk/jtc1/sc22/wg21/d...003/n1443.html

Jul 22 '05 #6

P: n/a
David Harmon wrote:
On Wed, 14 Apr 2004 20:54:05 GMT in comp.lang.c++, Kevin Goodsell
<us*********************@neverbox.com> wrote,
Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?

Mainly, a hashing function that takes a argument of type Key
and returns an int.

http://std.dkuug.dk/jtc1/sc22/wg21/d...003/n1443.html


(Having only glanced over the link so far...)

I thought maybe they'd supply a hashing function that you can use, for
example if there's an operator<<(ostream &, const Key&). The string
representation could be generated with that and hashed. Since the
proposal includes a hash function for std::string, I suppose it would be
fairly easy to create this yourself.

I don't know how well this would work in practice, though. Obviously the
printed form of the Key would need to be unique.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #7

P: n/a
"John Harrison" <jo*************@hotmail.com> wrote in message news:<c5************@ID-196037.news.uni-berlin.de>...
"pembed2003" <pe********@yahoo.com> wrote in message
news:db*************************@posting.google.co m...
Hi All,
Does C++/STL have hashtable where I can do stuff like:

Hashtable h<int>;

h.store("one",1);
h.store("two",2);

and then later retrieve them like:

cout<<h.fetch("one")<<endl;

prints 1

any idea?

Thanks!


It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.

john


Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!
Jul 22 '05 #8

P: n/a
"pembed2003" <pe********@yahoo.com> wrote in message
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!


It's not in the standard, but lots of compilers have it. Go to the include
folder like C:\Program Files\Borland\CBuilder6\Include\Stlport and look for
a file like hash_map. Or try your compiler documentation or
http://www.sgi.com/tech/stl/HashedAs...Container.html.
Jul 22 '05 #9

P: n/a
>
Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!


http://www.dinkumware.com/refxcpp.html

but remember because hasp_map is non standard the description here might
differ from the implementation you have (if you have one at all).

john
Jul 22 '05 #10

P: n/a
"Claudio Puviani" <pu*****@hotmail.com> wrote in message news:MMhfc.43722
"Kevin Goodsell" <us*********************@neverbox.com> wrote
Does anyone know what will need to be provided in
order to hash on a particular type? std::map requires
a less-than comparison operation -- what operations
would a hash-based map need?

A hashing function and an equality operation are the minimum requirements, but requiring a relational comparison (such as less_than) instead of equality would allow for certain optimizations on the part of the implementers (such as using a tree-like structure instead of lists for the buckets). I don't know whether the standard will require equality or a relational comparator.

Sadly, the hashing function is where all the complexity lies and I'd be
pleasantly surprised if more than 1% of programmers knew what constitutes a good hashing function. With a bad hashing function (and a naive bucket
strategy), performance can quickly degenerate to much worse than a balanced tree on large data sets.


I don't think I'm in that 1%. A naive guess is to add up all the letters
and modulo by the bucket size, which they say should be prime, but why is
this?

size_t hash(const char * s, int buckets) {
size_t h = 0;
for ( ; *s; ++s) h += *s;
return h%bucket.
}

But we need hash only the first maxchars characters.

To answer Kevin's question about the hash function, it should be like this
in general

class hash {
public:
typedef char char_type;
typedef typename std::vector<T>::size_type hash_type;
explicit hash(size_t maxchars);
hash_type operator()(char_type begin, char_type end, size_t numbuckets)
const;
private:
size_t maxchars;
};

The hash may increase numbuckets dynamically as it grows, so numbuckets is a
parameter to operator(). But the number of chars to hash is a constant, and
should not change as we increase or decrease the number of buckets, so it is
argument to the constructor and gets stored in the class.

In our code above we rely on an implicit conversion from char_type to
hash_type (which seems not reasonable), and hash_type::operator+= (which
seems reasonable as hash_type is just an unsigned integer).
Jul 22 '05 #11

P: n/a
Siemel Naran wrote:

I don't think I'm in that 1%.
I wouldn't dare claim to be either, without doing some research first.
A naive guess is to add up all the letters
and modulo by the bucket size,
I think you mean table size. Each bucket has a separate size determined
by the number of items that hash to that index.

Or, perhaps you meant "the size measured in number of buckets" and I was
just reading it wrong. I could see that.
which they say should be prime, but why is
this?
I'm not sure I ever completely understood that. The one obvious reason
doesn't apply to most implementations. In "probing" collision-resolution
schemes it makes sense, but most use "chaining" or buckets. What I'm
calling "probing" is the situation where a collision (i.e., a hash to a
location that's already taken) is handled by trying other locations. For
example, a naive approach is to try the next slot, until an empty one is
found. A better technique (I think) would be double hashing, where a
second hash function is used to determine the sequence of slots to try.
So if h1() is the primary hash function and h2() is the secondary,
suppose h1(x) = n, but n is already used. Then you would try n + h2(x),
n + 2*h2(x), n + 3*h2(x), etc. In all cases, modding by the table size.

In this case, having the table size prime ensures that you will
eventually try every table slot -- if there's an open one, you'll find it.

But nobody seems to use schemes like this very often, and it seems
obvious why. Why would you want to take up additional slots in the
table, increasing the chances of future collisions, when you can just
chain things outside the table (in buckets)? Lookup complexity for the
element is no worse (may even be better), and the table is less cluttered.

So why the prime table size? I can't see an obvious reason. If hash
values are approximately uniformly distributed between 0 and N, I'd
think that ideal table sizes would be numbers that evenly divide N,
because any remainder would split the table into two areas with
different probabilities of any particular value hashing into them.

size_t hash(const char * s, int buckets) {
size_t h = 0;
for ( ; *s; ++s) h += *s;
return h%bucket.
}
Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.

But we need hash only the first maxchars characters.


Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out to not
work well, and was changed to use the entire string.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #12

P: n/a
pe********@yahoo.com (pembed2003) wrote in message news:<db**************************@posting.google. com>...
Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!


std::map is documented in quite a number of books, e.g. Accelerated C++
(Koenig&Moo) or Generic Programming and the STL(Austern)

http://www.amazon.com/exec/obidos/ASIN/020170353X/
http://www.amazon.com/exec/obidos/ASIN/0201309564/

hash_map isn't in the standard, so there are multiple different
versions around. Get the documentation from the same source as
your code. Your compiler vendor may have included something.

Regards,
Michiel Salters
Jul 22 '05 #13

P: n/a
"Kevin Goodsell" <us*********************@neverbox.com> wrote
So why the prime table size? I can't see an obvious reason.
If hash values are approximately uniformly distributed between
0 and N, I'd think that ideal table sizes would be numbers that
evenly divide N, because any remainder would split the table
into two areas with different probabilities of any particular
value hashing into them.


If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.

Claudio Puviani

[somehow, my server missed the first time I sent this reply]
Jul 22 '05 #14

P: n/a
"Kevin Goodsell" <us*********************@neverbox.com> wrote
So why the prime table size? I can't see an obvious reason.
If hash values are approximately uniformly distributed between
0 and N, I'd think that ideal table sizes would be numbers that
evenly divide N, because any remainder would split the table
into two areas with different probabilities of any particular
value hashing into them.


If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.

Claudio Puviani
Jul 22 '05 #15

P: n/a
Claudio Puviani wrote:
"Kevin Goodsell" <us*********************@neverbox.com> wrote
So why the prime table size? I can't see an obvious reason.
If hash values are approximately uniformly distributed between
0 and N, I'd think that ideal table sizes would be numbers that
evenly divide N, because any remainder would split the table
into two areas with different probabilities of any particular
value hashing into them.

If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.


I thought it might be something like that. Although, I disagree that
there's no "ideal" size with uniformly distributed hash values. I
definitely think some sizes are better than others (though it may be a
small difference). As a trivial example, suppose the range of hash
values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
and also from 51 to 75, end up in the lower 25 slots, while only those
hash values in the range 26 to 50 end up in the upper 25 slots. As a
result, the probability of a key hashing to the first 25 slots is twice
as high as the probability of hashing to the last 25.

I doubt this makes any difference in practice, but it is what I had in
mind in my previous post.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #16

P: n/a
Mi*************@logicacmg.com (Michiel Salters) wrote in message news:<fc**************************@posting.google. com>...
pe********@yahoo.com (pembed2003) wrote in message news:<db**************************@posting.google. com>...
Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!


std::map is documented in quite a number of books, e.g. Accelerated C++
(Koenig&Moo) or Generic Programming and the STL(Austern)

http://www.amazon.com/exec/obidos/ASIN/020170353X/
http://www.amazon.com/exec/obidos/ASIN/0201309564/

hash_map isn't in the standard, so there are multiple different
versions around. Get the documentation from the same source as
your code. Your compiler vendor may have included something.


Thanks for everyone's help so it looks like map is the standard and
should be available for any platform? But hash_map is not standard.

The reason I am asking for hash table is because I need to write a
function that finds the overlap of two integer array. I am currently
doing:

#include<stdlib.h>
#include<string.h>
#include<sys/int_limits.h>

#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
}

int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!
Jul 22 '05 #17

P: n/a
"Claudio Puviani" <pu*****@hotmail.com> wrote in message news:Hexfc.66079
"Kevin Goodsell" <us*********************@neverbox.com> wrote
So why the prime table size? I can't see an obvious reason.
If hash values are approximately uniformly distributed between
0 and N, I'd think that ideal table sizes would be numbers that
evenly divide N, because any remainder would split the table
into two areas with different probabilities of any particular
value hashing into them.


If the hash values are uniformly distributed (which is one characteristic

of a good hashing function), then any table size would be equivalent; i.e., there would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade yourself of this by modeling it in a spreadsheet.


This makes sense. Can you write the barebones of a C++ program that can
illustrate this? I guess you could write a spreadshet too, but the
newsgroup only wants free text. I can test it if you want.

We could call call a function hash_function, and we could use my simple
function in the other post, maybe multiplying by the constant Kevin
suggested.
Jul 22 '05 #18

P: n/a
"Kevin Goodsell" <us*********************@neverbox.com> wrote in message
news:Gnqfc.11299
size_t hash(const char * s, int buckets) {
size_t h = 0;
for ( ; *s; ++s) h += *s;
return h%bucket.
}


Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.


I did some research, and they said multiplying by the alphabet size (ie.
number of valid chars).

But we need hash only the first maxchars characters.


Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out to not
work well, and was changed to use the entire string.


Suppose our company coding standard is to prefix namespace and class names
with ABC_. Thus ABC_Contact, ABC_Thing. Imagine if our hash function used
just the first 3 chars!

So I think the number of chars to use depends on the scenario, but it does
intuitively make sense to me. We could perhaps use the last N chars, middle
N chars, etc. A templated hashing function makes all this possible.
Jul 22 '05 #19

P: n/a
"pembed2003" <pe********@yahoo.com> wrote in message
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
}
It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
reading something wrong.
int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.
Either loop over the chars in 'one' or 'two', then over the chars in the
other loop. If you want to use STL, you can use std::count_if and supply a
binary predicate that employs std::find (the binary predicate will have a
constructor that takes the array 'one' as an argument).

Note you can also sort the elements in 'one' and 'two', then loop over both
containers simultaneously. Running time O(lg(N)) for the sort if using
std::sort which normally uses quicksort, and O(N) for the scan. I don't
think the standard provides such an algorithm though.

You could sort the array in O(N) time using radix sort (but it will use so
much memory, which is what you want to avoid), or some other specialized
algorithm. You could also sort only 'one' and loop over the chars in 'two',
but use binary_search to see if a char in 'one' exists in 'two'.

If the array size is big, what difference will a hashtable make? You're
using the hashtable to store elements visited, right? If the hashtable
bucket size is small then you'll have lots of collisions (lots of elements
mapping to the same element) and looking one element could be linear time
(if the collided elements are stored in a vector). If the hashtable bucket
size is big then you won't have lots of collisions, but then might as well
use your original solution.
Jul 22 '05 #20

P: n/a
>
Thanks for everyone's help so it looks like map is the standard and
should be available for any platform? But hash_map is not standard.

Right.
The reason I am asking for hash table is because I need to write a
function that finds the overlap of two integer array. I am currently
doing:

#include<stdlib.h>
#include<string.h>
#include<sys/int_limits.h>

#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
}

int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


I think the standard answer would be to use a set. This is untested code.

#include <set>
#include <algorithm>
#include <iterator>
#include <iostream>

void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::ostream_iterator<int>(std::cout, "\n"));
}

That is an O(N lg N) algorithm.

john
Jul 22 '05 #21

P: n/a
Siemel Naran wrote:
"Kevin Goodsell" <us*********************@neverbox.com> wrote in message
news:Gnqfc.11299

Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.

I did some research, and they said multiplying by the alphabet size (ie.
number of valid chars).


I don't recall that. The particular version I was talking about has been
discussed from time to time in the newsgroups, and I found one such
discussion here:

http://groups.google.com/groups?thre...btinternet.com

If I did that right, the link should take you to Richard Heathfield's
reply to the thread's original poster.

They mention a few points. First, K&R2 contains an almost identical
hashing function on page 144 using 31 as the constant. Second, the
values suggested by K&P were apparently 31 and 37. Finally, Chris Torek
suggests 33, a value he got from James Gosling. During widespread
experimentation for the Berkley DB library, it did better on average
than 31.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #22

P: n/a
"John Harrison" <jo*************@hotmail.com> wrote in message news:<c5************@ID-196037.news.uni-berlin.de>...

Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!


http://www.dinkumware.com/refxcpp.html

but remember because hasp_map is non standard the description here might
differ from the implementation you have (if you have one at all).

john


I read the documentation but I don't fully understand it. I don't want
to sound like a dump axx but so far, I have tried:

#include<map>
#include<string>

using namespace std;

int main(int argc,char** argv){
std::map<std::string,int>m();
m.insert("hello",1); // doesn't compile? why?
return 0;
}

I am sure I did something wrong maybe even in the constructor. What I
mean to do is:

* m is a map whoes key is a std::string and values will be of type int
* insert the string "hello" into m with value 1.

I am trying to read the man page with 'man map.h' but nothing show up.
Where is the man page and how can I find map.h if I want to look at
it?

Can someone tell me what's wrong? Can you post a short working example
so I can learn from it? Thanks!
Jul 22 '05 #23

P: n/a
pembed2003 wrote:

I read the documentation but I don't fully understand it. I don't want
to sound like a dump axx but so far, I have tried:

#include<map>
#include<string>

using namespace std;

int main(int argc,char** argv){
std::map<std::string,int>m();
I don't think you really wanted to declare a function called 'm' that
takes no arguments and returns a std::map. Lose the parentheses.
m.insert("hello",1); // doesn't compile? why?
I don't believe there is a form of map::insert that takes two arguments.
Try one of these:

m["hello"] = 1;

OR

m.insert(std::map<std::string, int>::value_type("hello", 1));

Obviously I'd recommend the former for this case. When you do need the
insert() function it might help to have a typedef handy (in fact, this
is nice in many cases):

typedef std::map<std::string, int> Map;
Map m;
m.insert(Map::value_type("hello", 1));
return 0;
}

I am sure I did something wrong maybe even in the constructor. What I
mean to do is:

* m is a map whoes key is a std::string and values will be of type int
* insert the string "hello" into m with value 1.

I am trying to read the man page with 'man map.h' but nothing show up.
Where is the man page and how can I find map.h if I want to look at
it?


1) 'man pages' are not defined by the standard. If you have them, the
names, locations, commands, etc. depend on your system.

2) Why would it be map.h? The header is <map>. Typically the file name
is the same as the name in the brackets, i.e. 'map'. This is not
required of course, but it's usually the case.

3) Locations of headers (if they exist as files -- which they typically
do) depend on your implementation. Consult the documentation or a group
that discusses your system and/or compiler.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #24

P: n/a
Kevin Goodsell wrote:

I don't believe there is a form of map::insert that takes two arguments.


Come to think of it, that's wrong. There is a form that takes a "hint"
argument, for example. But there is not one that takes the key and value
as separate arguments, I'm fairly sure.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #25

P: n/a
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<A6********************@bgtnsc04-news.ops.worldnet.att.net>...

[ ... ]
size_t hash(const char * s, int buckets) {
size_t h = 0;
for ( ; *s; ++s) h += *s;
return h%bucket.
}

But we need hash only the first maxchars characters.
As hash functions go, this is better than many, but less than ideal
for general use. Long strings might produce values larger than a
size_t can represent, producing values that vary from one
implementation to another, while short strings can only produce quite
small values (e.g. with 8-bit characters, a 3-character string can
only produce a value up to 765 and typical English text will only
produce values in a MUCH more restricted range). As one final caveat,
since it uses plain char, which might be signed, it could produce
negative hash values, which can often cause problems.

One way of dealing with this is something like this:

// warning: untested code.

size_t hash(unsigned char const *key, size_t buckets) {

const int bits = 32; // assume 32 bits is enough for the hash
value.
size_t h = 0x55555555; // alternating 0's and 1's.
size_t carry;
size_t shift = bits - CHAR_BIT;
size_t mask = ((1 << CHAR_BIT) -1) << shift;

while (*key) {
h ^= *key++;
carry = (h & mask) >> shift;
h <<= CHAR_BIT;
h |= carry;
}
return h % buckets;
}

The left-shift is to propagate the effects of the input characters
across the entire hash value more quickly. The right-shift (producing
an N-bit rotate) keeps from losing the effects of the characters from
early in the string.

The result is that every bit in the input string has an effect on the
output, and the effects of the characters get distributed throughout
the output fairly quickly.

Results are undoubtedly open to variation, but most of the
"improvements" I've tried to this scheme have taken more time than
they saved.
The hash may increase numbuckets dynamically as it grows, so numbuckets is a
parameter to operator(). But the number of chars to hash is a constant, and
should not change as we increase or decrease the number of buckets, so it is
argument to the constructor and gets stored in the class.


In typical use, you normally want to hash the entire key.
Jul 22 '05 #26

P: n/a
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<wf********************@bgtnsc04-news.ops.worldnet.att.net>...
"pembed2003" <pe********@yahoo.com> wrote in message
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
}
It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
reading something wrong.
int main(int argc,char** argv){
int a[] = {1,3,45,6,4,-1,8};
int b[] = {4,2,4,222,8,45,-1};
find(a,7,b,7);
}

this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.


Is my solution a O(N^2) solution? I think it's a O(N) right?
Becasue...

Assume array a and b both have N elements, the program loops through
both array exactly once, so we have:

for(i = 0; i < N; i++) // O(N)
array_one[i] = 1;
for(i = 0; i < N; i++) // O(N)
does array_two[i] in array_one[i]

N + N = 2N

which is a O(N)
Either loop over the chars in 'one' or 'two', then over the chars in the
other loop. If you want to use STL, you can use std::count_if and supply a
binary predicate that employs std::find (the binary predicate will have a
constructor that takes the array 'one' as an argument).

Will a binary search beat a hash lookup? I know this has no exact
answer but generally speaking will a binary search be faster than a
hash lookup?
Note you can also sort the elements in 'one' and 'two', then loop over both
containers simultaneously. Running time O(lg(N)) for the sort if using
std::sort which normally uses quicksort, and O(N) for the scan. I don't
think the standard provides such an algorithm though.


But this solution requires:

1. Sort one array with best case O(N*log N) using a quicksort
2. Loop through another array lookup for dups which is a O(N)

I think it goes like:

sort(array_one); // O(N*log N)
for(i = 0; i < N; i++) // O(N)
look for array_two[i] in array_one

which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
algr. Am I right?

After I discovered map, I am actually doing:

void find(int* a,int c1,int* b,int c2){
std::map<int,int> m;
for(int i = 0; i < c1; i++)
m[a[i]] = 1;
for(int j = 0; j < c2; j++)
if(m[b[j]])
std::cout<<b[j]<<std::endl;
}

What do you think about that? Thanks!
Jul 22 '05 #27

P: n/a
"Kevin Goodsell" <us*********************@neverbox.com> wrote in message
news:C%
I don't believe there is a form of map::insert that takes two arguments.

Come to think of it, that's wrong. There is a form that takes a "hint"
argument, for example. But there is not one that takes the key and value
as separate arguments, I'm fairly sure.


Right. There's also another 2 argument insert, namely map::insert(Iter
begin, Iter end) where Iter is a template argument. It's basically batch
insert. If the range is sorted then the running time of the algorithm is
O(N). If the range is not sorted then the running time is O(N*lg(N)). So
this is a really smart function that detects if your range is sorted! I
think it probably calls the 2 argument insert where the first argument is a
hint.
Jul 22 '05 #28

P: n/a
"pembed2003" <pe********@yahoo.com> wrote in message
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<wfCfc.34540
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
Prefer to use const size_t instead of #define.
void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
} this gives me a O(n) algr. but seems to be wasting a lot of memory so
I am thinking using a hashtable instead. am i in the right track?
Thanks!


If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.
Is my solution a O(N^2) solution? I think it's a O(N) right?
It is O(N). Sorry for confusing. The O(N^2) algorithm is when you scan all
the chars in 'one', and for each scan all the chars in 'two'.
Becasue...

Assume array a and b both have N elements, the program loops through
both array exactly once, so we have:

for(i = 0; i < N; i++) // O(N)
array_one[i] = 1;
for(i = 0; i < N; i++) // O(N)
does array_two[i] in array_one[i]

N + N = 2N

which is a O(N)
Right.
Either loop over the chars in 'one' or 'two', then over the chars in the
other loop. If you want to use STL, you can use std::count_if and supply a binary predicate that employs std::find (the binary predicate will have a constructor that takes the array 'one' as an argument).


Will a binary search beat a hash lookup? I know this has no exact
answer but generally speaking will a binary search be faster than a
hash lookup?


Guessing hash lookup is faster in general. It's worth running an experiment
to see for your case. Who knows, maybe the hash function is too complicated
to compute and the sorted array is faster. It's pretty easy to all methods
given you have an implementation of hash_map available. If you can, post
your results if you do the experiment, because it's nice to know. (State
the size of N, result using: original algorithm, hash_map, std::map,
binary_search, N^2 method, double sort).
Note you can also sort the elements in 'one' and 'two', then loop over both containers simultaneously. Running time O(lg(N)) for the sort if using
std::sort which normally uses quicksort, and O(N) for the scan. I don't
think the standard provides such an algorithm though.


But this solution requires:

1. Sort one array with best case O(N*log N) using a quicksort
2. Loop through another array lookup for dups which is a O(N)

I think it goes like:

sort(array_one); // O(N*log N)
for(i = 0; i < N; i++) // O(N)
look for array_two[i] in array_one

which is a O(N*log N) algr. It doesn't look like it will beat my O(N)
algr. Am I right?


Yes, I think your O(N) algorithm is probably best, but it consumes lots of
space. Nevertheless, it is possible that the binary search method is faster
than hash_table.

Also, you can sort both arrays, and walk through each simultaneously in
linear time.
After I discovered map, I am actually doing:

void find(int* a,int c1,int* b,int c2){
std::map<int,int> m;
for(int i = 0; i < c1; i++)
m[a[i]] = 1;
for(int j = 0; j < c2; j++)
if(m[b[j]])
std::cout<<b[j]<<std::endl;
}

What do you think about that? Thanks!


Looks right. You can use std::set too, which in principle should be a tad
faster. How fast does it run? Also note that a node in a map is pretty
expensive because you have to allocate space for one parent and two child
pointers, and if sizeof(int)==sizeof(node*), this is 4 times the memory per
node, plus the cost of maintaining the pointers. I'm willing to bet you a
$1 that sorting 'a' with std::sort and using std::binary_search instead is
faster than map and uses less memory.

You can replace std::map with std::hash_map if you are using STL. Don't
know about the other implementations, probably something similar.

original algorithm:
What you originally wrote
Still not sure why array size is double, why the multiply by 2 at the end?
const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))

std::map:
The above algorithm

hash_map (probably largest:
The above algorithm with std::map replaced by hash_map. Maybe there is a
function to set the bucket size that you can try to tweak for optimal
performance?

binary_search:
Similar to the above algorithm.
void find(int* a,int c1,int* b,int c2){
std::sort(a, a+c1);
for(int j = 0; j < c2; j++)
if(std::binary_search(a, a+c1, b[j]))
std::cout<<b[j]<<std::endl;
}

N^2 method:
void find(int* a,int c1,int* b,int c2){
for(int j = 0; j < c2; j++)
if(std::find(a, a+c1, b[j]))
std::cout<<b[j]<<std::endl;
}

double sort:
void find(int* a,int c1,int* b,int c2){
std::sort(a, a+c1);
std::sort(b, b+c1);
int i =0, j = 0;
while (true) {
if (i==c1 || j==c2) break;
const int ai = a[i];
const int bj = b[j];
if (ai < bj) ++i;
else if (ai > bj) ++j;
else {
std::cout<<bj<<std::endl;
++i;
++j;
}
}
}
It's 1:30am, so I don't claim the above is error free!

Jul 22 '05 #29

P: n/a
"John Harrison" <jo*************@hotmail.com> wrote in message
news:c5msbf$3hebc$1@ID-
void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::ostream_iterator<int>(std::cout, "\n"));
}

That is an O(N lg N) algorithm.


This is a good algorithm!
Jul 22 '05 #30

P: n/a
"John Harrison" <jo*************@hotmail.com> wrote in message news:<c5************@ID-196037.news.uni-berlin.de>...

[ The original code was: ]
void find(int* one,int c1,int* two,int c2){
int i;
int* buffer = malloc(SIZE);
memset((void*)buffer,0,SIZE);
for(i = 0; i < c1; i++)
buffer[one[i] + INT16_MAX] = 1;
for(i = 0; i < c2; i++)
if(buffer[two[i] + INT16_MAX])
printf("%d\n",two[i]);
free(buffer);
}

[ ... and you suggested replacing it with: ]
void find(int* one,int c1,int* two,int c2)
{
std::set<int> common;
common.insert(one, one + c1);
common.insert(two, two + c2);
// output common elements to cout
std::copy(common.begin(), common.end(),
std::ostream_iterator<int>(std::cout, "\n"));
}


At least as I read things, his code was finding (something similar to)
the intersection between the two sets, where it looks to me like yours
finds the union of them instead. If a true intersection of sets is
what he wants, he can use std::set_intersection to find it:

void find(int const *one, size_t c1, int const *two, size_t c2) {
std::set<int> a, b;

std::copy(one, one+c1, std::inserter(a, a.begin()));
std::copy(two, two+c2, std::inserter(b, b.begin()));
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
std::ostream_iterator<int>(std::cout, "\n"));
}

What his code produces isn't exactly the intersection between the sets
though. In a set_intersection, each item in the output is unique,
whereas his code can produce a particular number in the output the
same number of times it occurs in the second array, if it also occurs
in the first array (e.g. with his code and sample data, the output
will contain 4 twice). If that's really needed, something like this
will do the job with roughly O(N * lg N) commplexity, and O(N) extra
storage:

void find(int *one, size_t c1, int *two, size_t c2) {
std::set<int> a;

std::copy(one, one+c1, std::inserter(a, a.begin()));
for (int i=0; i<c2; ++i)
if (std::find(a.begin(), a.end(), two[i])!=a.end())
std::cout << two[i] << "\n";
}

Theoretically, std::remove_copy_if should probably be used instead of
the loop I have above, but that may be more trouble than it's worth
(though boost::lambda could help reduce the extra trouble somewhat).
Later,
Jerry.

--
The universe is a figment of its own imagination.
Jul 22 '05 #31

P: n/a
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<r1********************@bgtnsc04-news.ops.worldnet.att.net>...
"pembed2003" <pe********@yahoo.com> wrote in message
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<wfCfc.34540
> #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
Prefer to use const size_t instead of #define.


The only advantage of using a const size_t is:

* We have type checking in compile time
* size_t is more portable

Are there any other advantages?
> void find(int* one,int c1,int* two,int c2){
> int i;
> int* buffer = malloc(SIZE);
> memset((void*)buffer,0,SIZE);
> for(i = 0; i < c1; i++)
> buffer[one[i] + INT16_MAX] = 1;
> for(i = 0; i < c2; i++)
> if(buffer[two[i] + INT16_MAX])
> printf("%d\n",two[i]);
> free(buffer);
> }


I am trying to rewrite the malloc(SIZE) using:

int* buffer = new int[MAX_INT * 2];

but the compiler won't let me compile. I guess that's because MAX_INT
* 2 is outside the range of an integer and inside '[]' we can't have
anything bigger than an int. Is that correct? If so, how do you make a
really huge dynamic array:

int* buffer = new int[???];

What to put inside [] to make, say a 100000000 slots. This doesn't
work:

int* buffer = new int[10000000];

Seems like real disadvantage because malloc doesn't have that problem?

[snip]
original algorithm:
What you originally wrote
Still not sure why array size is double, why the multiply by 2 at the end?
const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))


The reason I need to multiple 2 is because if there are negative
number, I need to turn it into a positive number. For example:

-1234 is like:

array[-1234 + MAX_INT] = 1;

if I don't do that, I will end up:

array[-1234] = 1;

which is wrong. :-( Now consider we have:

1234, I need to make it so:

array[1234 + MAX_INT] = 1;

In other word, the range inside array is always:

array[-i + MAX_INT] to array[i + MAX_INT];

Suppose i = -MAX_INT, the whole thing still works because:

array[-MAX_INT + MAX_INT] to array[MAX_INT + MAX_INT];

array[0] to array[MAX_INT * 2];

so I need to multiply by 2. Maybe I am doing it wrong but if I don't
do that, how do I end up handling a negative number?

Thanks for all the help!
Jul 22 '05 #32

P: n/a
Kevin Goodsell <us*********************@neverbox.com> wrote in message news:<HZ******************@newsread2.news.pas.eart hlink.net>...
pembed2003 wrote:

I read the documentation but I don't fully understand it. I don't want
to sound like a dump axx but so far, I have tried:

#include<map>
#include<string>

using namespace std;

int main(int argc,char** argv){
std::map<std::string,int>m();


I don't think you really wanted to declare a function called 'm' that
takes no arguments and returns a std::map. Lose the parentheses.


That's it! Thanks for the help.
m.insert("hello",1); // doesn't compile? why?


I don't believe there is a form of map::insert that takes two arguments.
Try one of these:

m["hello"] = 1;

OR

m.insert(std::map<std::string, int>::value_type("hello", 1));


Very fancy! Now if I have a class Person, can I say:

Person p("first name","last name");
m.insert(std::map<Person, int>::value_type(p,1));

What functions does class Person has to define for this to work? Thanks!
Jul 22 '05 #33

P: n/a
pembed2003 wrote:

Very fancy! Now if I have a class Person, can I say:

Person p("first name","last name");
m.insert(std::map<Person, int>::value_type(p,1));

What functions does class Person has to define for this to work? Thanks!


Like all things stored in standard containers, it must meet the
requirements of Assignable and CopyConstructible types. Basically it has
to have normal value semantics.

In addition, to work as a map key, you need to supply a comparison
operation (that determines the ordering in the map). This defaults to
operator<(), so if that's defined it will work. You can also supply your
comparison criteria as a template parameter.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.
Jul 22 '05 #34

P: n/a
"pembed2003" <pe********@yahoo.com> wrote in message
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message news:<r1Nfc.36415
Prefer to use const size_t instead of #define.


The only advantage of using a const size_t is:

* We have type checking in compile time
* size_t is more portable

Are there any other advantages?


Also a practical one, you can usually see it in a debugger. I think there
are a few other ones.

I am trying to rewrite the malloc(SIZE) using:

int* buffer = new int[MAX_INT * 2];

but the compiler won't let me compile. I guess that's because MAX_INT
* 2 is outside the range of an integer and inside '[]' we can't have
anything bigger than an int. Is that correct? If so, how do you make a
really huge dynamic array:


Don't know how. Even std::malloc has this problem because its signature is

void * malloc(size_t);

but the compiler probably only checks the array size constraints when it
sees square brackets [...].

Just wondering, isn't INT_MAX*2 about the entire memory in the system, about
4GB on 32 bit int systems? Or how do these things work?

Still not sure why array size is double, why the multiply by 2 at the end? const size_t SIZE = ((sizeof(int)) * (INT16_MAX) * (2))


The reason I need to multiple 2 is because if there are negative
number, I need to turn it into a positive number. For example:


Got it. You could also convert the int to unsigned and use array size =
UINT_MAX.

Jul 22 '05 #35

This discussion thread is closed

Replies have been disabled for this discussion.