
Performance of hash_set vs. Java

Hello,

I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.

Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a hash_set
of unique nodes (std::string's). The words are then sorted and numbered by
copying the hash_set into a vector and sorting. Then, I run a series of
binary searches on the vector to allocate lines in a matrix for each
element.

I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

Also, Java provides a HashSet class that can hash any kind of element it may
contain. Since the STL only hashes selected types, I had to convert my
std::strings to const char*'s for hashing.

Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++. In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this have
to allocate new memory or just point to somewhere inside the object? I'm
kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.

On the plus side there is a great direct sparse linear solver for C. I've
only been able to find an iterative sparse solver for Java. This makes the
C++ code overall much faster than Java, but I suspect that the parse stage
should run at least as fast or faster.

Thanks,

Alex Gerdemann
University of Illinois at Urbana-Champaign

For reference, I've included a sketch of the code

So I did something like the following:

struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }
};

inline int getIndex(const std::vector<std::string>& list, const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);
}
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(), nodeVector.end());
while(!done) {
    //read line from previously stored words
    //more error checking
    thisNode = some word in line
    getIndex(nodeVector, thisNode);
    //allocate an object representing element containing proper indices representing nodes
    //push pointer to element into a vector
}
Jul 22 '05 #1
Alex Gerdemann wrote:
Hello,
struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }
};

inline int getIndex(const std::vector<std::string>& list, const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

When you use std::string, you will have *a lot* of string copying
going on. Java will just pass its "reference"/"pointer" along.
Try converting your lists, etc. to hold std::string* (pointers) rather
than the strings themselves.
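If you go that route, the sort and the binary searches just need a comparator
that dereferences the pointers. A minimal sketch of that idea (the ptrLess
functor and the contains()/nodePtrs names are illustrative, not from Alex's code):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Compare the pointed-to strings, not the pointer addresses.
    struct ptrLess {
        bool operator()(const std::string* a, const std::string* b) const {
            return *a < *b;
        }
    };

    // nodePtrs must already be sorted with the same comparator:
    //   std::sort(nodePtrs.begin(), nodePtrs.end(), ptrLess());
    bool contains(const std::vector<const std::string*>& nodePtrs,
                  const std::string& item) {
        std::vector<const std::string*>::const_iterator pos =
            std::lower_bound(nodePtrs.begin(), nodePtrs.end(), &item, ptrLess());
        return pos != nodePtrs.end() && **pos == item;
    }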

Jul 22 '05 #2
On Mon, 11 Oct 2004 11:56:38 GMT, "Alex Gerdemann"
<nu*******@hotmail.com> wrote:
Hello,

I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.
No, it probably won't be unless you use a better class than
std::string to hold the strings. java.lang.String is in several ways
more efficient than typical implementations of std::string.
Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a hash_set
of unique nodes (std::string's). The words are then sorted and numbered by
copying the hash_set into a vector and sorting. Then, I run a series of
binary searches on the vector to allocate lines in a matrix for each
element.

I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

Also, Java provides a HashSet class that can hash any kind of element it may
contain. Since the STL only hashes selected types, I had to convert my
std::strings to const char*'s for hashing.
The main benefit Java has in hashing is that Strings cache their
hashcodes, so they need only be calculated once. std::string knows
nothing about hashing, and therefore doesn't perform this
optimization.
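For reference, java.lang.String computes its hash as h = 31*h + c over the
characters. A std::string hash functor can do the same thing directly, without
going through c_str() at all; this is a purely illustrative drop-in alternative
to the hashString functor Alex posted (same name, so the hash_set declaration
would not change), not code from the thread:

    // Hash the std::string's characters directly (java.lang.String-style formula).
    struct hashString {
        size_t operator()(const std::string& s) const {
            size_t h = 0;
            for (std::string::size_type i = 0; i < s.size(); ++i)
                h = 31 * h + static_cast<unsigned char>(s[i]);
            return h;
        }
    };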
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++. In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.
How long did it take just to read in the file? I don't know much about
GCC's hash_set, but I assume it is configurable for bucket size, etc.
You might want to tweak this. What version of GCC are you using?
Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this have
to allocate new memory or just point to somewhere inside the object?
Sometimes it has to do the former, due to reference counting problems,
but usually it just returns a pointer to the storage. Tracing it in the
debugger would tell you what's going on in the hash calls.

I'm kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.
Java just iterates over the whole String, generating the hash value,
and then caches it for future calls.
On the plus side there is a great direct sparse linear solver for C. I've
only been able to find an iterative sparse solver for Java. This makes the
C++ code overall much faster than Java, but I suspect that the parse stage
should run at least as fast or faster.

Thanks,

Alex Gerdemann
University of Illinois at Urbana-Champaign

For reference, I've included a sketch of the code

So I did something like the following:

struct hashString {
    __gnu_cxx::hash<const char*> h;

    size_t operator()(const std::string& s) const {
        return h(s.c_str());
    }
I doubt that is your bottleneck, but if the strings are long, then
Java's hashcode caching might be giving it a major advantage.
};

inline int getIndex(const std::vector<std::string>& list, const std::string& item) {
    std::vector<std::string>::const_iterator i =
        std::lower_bound(list.begin(), list.end(), item);
    if (i == list.end() || *i != item) {
        return -(static_cast<int>(i - list.begin()) + 1);
    }
    return static_cast<int>(i - list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);
The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.
}
Here you definitely want:
    nodeVector.reserve(nodeSet.size());
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(), nodeVector.end());
Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.
while(!done) {
    //read line from previously stored words
    //more error checking
    thisNode = some word in line
    getIndex(nodeVector, thisNode);
    //allocate an object representing element containing proper indices representing nodes
    //push pointer to element into a vector
}


Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (compile with -pg and run gprof,
if I recall correctly), to find out where the bottleneck actually lies.

Tom
Jul 22 '05 #3
"Alex Gerdemann" <nu*******@hotm ail.com> wrote in message
news:Wxuad.3699 33$Fg5.53609@at tbi_s53...
I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.

This can often be the case. Some native Java features can be more efficient
than the C++ equivalents (which sometimes are designed for more
flexibility). This is especially true for C++ streams.
Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a
hash_set of unique nodes (std::string's). The words are then sorted and
numbered by copying the hash_set into a vector and sorting. Then, I run a
series of binary searches on the vector to allocate lines in a matrix for
each element.

As someone already pointed out, copying strings can be expensive in C++.
An alternative would be to keep the originally read std::string,
then use only a "const char*" pointer obtained using the string::c_str()
member function (I'd do this rather than use a std::string*).
I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.
Also, Java provides a HashSet class that can hash any kind of element it
may contain. Since the STL only hashes selected types, I had to convert
my std::strings to const char*'s for hashing.

All C++ hash_set implementations I've seen allow users to provide
a custom hash function as an additional parameter. But this shouldn't
be a big problem.
Other than that, my code in the Java and C++ is quite similar. However,
the parse of a 100k line file took around a second for Java, but more like
17s for C++. In fact, using std::set instead of hash_set actually
improved performance to about 15s in C++.

Unfortunately, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.
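For what it's worth, the SGI/GNU hash_set takes an initial bucket-count hint
in its constructor and has a resize() member that grows the table up front.
A sketch of pre-allocating (the 100000 figure is only a guess based on the
100k-line file):

    // Start with roughly enough buckets so inserts during the parse
    // don't trigger repeated rehashing.
    __gnu_cxx::hash_set<std::string, hashString> nodeSet(100000);

    // ...or grow an existing table before the main insertion loop:
    nodeSet.resize(100000);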
Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this
have to allocate new memory or just point to somewhere inside the object?
I'm kind of curious what Java actually uses for its hash. Any general
performance/other criticisms on my scheme are certainly welcome.

Are you sure that the hashing is the performance bottleneck?
C++ i/o streams could also be a likely cause for the weak performance.

All in all, I am very confident that better performance can be achieved
in C++ than in Java. But in some cases, especially string and file i/o,
this can require extra work :(

Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.

Cheers -Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Jul 22 '05 #4
Tom wrote:
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++. In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.
How long did it take just to read in the file? I don't know much about
GCC's hash_set, but I assume it is configurable for bucket size, etc.
You might want to tweak this. What version of GCC are you using?


I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
how much time is spent just reading the file, as I process it line by line,
doing some work between each I/O call.
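One way to get at that number anyway is to accumulate the time spent in
getline() separately from the parsing work, even though the two are
interleaved. A rough sketch using std::clock() (the function and variable
names are made up; note that clock() measures CPU time, so time spent blocked
on disk won't show up):

    #include <ctime>
    #include <fstream>
    #include <string>

    // Returns seconds of CPU time spent inside getline() only.
    double timeReadStage(std::ifstream& netlist) {
        std::string line;
        std::clock_t ioTicks = 0;
        for (;;) {
            std::clock_t t0 = std::clock();
            std::istream& s = std::getline(netlist, line);   // the I/O part
            ioTicks += std::clock() - t0;
            if (!s) break;
            // ...splitting and inserting would go here, not counted...
        }
        return double(ioTicks) / CLOCKS_PER_SEC;
    }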
while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);


The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.


Since you asked here's exactly what I did:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(netlist, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
        //process the line
    }
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while(lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if(firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}


Here you definitely want:
    nodeVector.reserve(nodeSet.size());
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(), nodeVector.end());


Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.


I added the reserve call which saved me a couple of tenths of a second.
Nice, but it must not be the major bottleneck. What is a "reference counted"
class?
Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (compile with -pg and run gprof,
if I recall correctly), to find out where the bottleneck actually lies.


I went ahead and tried this, but can't quite understand the results. The
total time it computes is way off. (It says it's 8s vs. the actual 14s.)
Also, it thinks that only 30% of run time was spent in the main() function.
This can't possibly be right.

-Alex Gerdemann
University of Illinois Urbana-Champaign
Jul 22 '05 #5
Ivan wrote:
I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.


Well the specific negative value returned is unnecessary, but I do need to
know if the string isn't found. It's kind of annoying that the function
doesn't return this information because to find it myself I have to:

1) check that the returned iterator does not point to set.end()
2) check that the iterator points to what it claims.

Certainly, the search algorithm knows internally if it found what it was
looking for. Having to do that job again will certainly cost at least some
time.

The reason the node may not be in the list has to do with the details of
circuit analysis. Specifically, the matrix describing the system should not
include a row or column representing the ground node. So, my program
collects all the nodes the user defines, searches for the ground node, and
erases it from the list. That way, I can sort the list, and use the index
of each node to allocate a line in the matrix. Since I've collected all the
nodes, I know that if a particular node isn't found in the list, it must be
the ground node. It's kind of an ugly mechanism, I guess, but I currently
don't have a better idea.
Other than that, my code in the Java and C++ is quite similar. However,
the parse of a 100k line file took around a second for Java, but more
like 17s for C++ . In fact, using std::set instead of hash_set actually
improved performance to about 15s in C++.

Unfortunately, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.


I would like to convert the vector to store pointers to strings, rather than
the strings themselves, but then I cannot use the built-in sort and
binary searches to find a particular string efficiently.
Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.


I didn't write my own hash table. I thought hash_set handled this problem on
its own. Not coming from a CS background, I don't know how to write a good
hash scheme, and it seems like this is the sort of thing that should be
provided by the built-in libraries. Since it works faster, for now, I've
switched back to the regular (tree?) set. I want to use a set so repeat
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation, which
says that a hash set is faster in most cases.

On the I/O code, I posted this in my other reply, but here's another copy:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(netlist, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
        //process the line
    }
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while(lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if(firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}

Thanks for the tips,

-Alex Gerdemann
University of Illinois Urbana-Champaign
Jul 22 '05 #6
"Alex Gerdemann" <nu*******@hotm ail.com> wrote in message
news:%WTad.2344 54$D%.43700@att bi_s51...
Ivan wrote:
I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.


Well the specific negative value returned is unnecessary, but I do need to
know if the string isn't found. It's kind of annoying that the function
doesn't return this information because to find it myself I have to:

1) check that the returned iterator does not point to set.end()
2) check that the iterator points to what it claims.

Certainly, the search algorithm knows internally if it found what it was
looking for. Having to do that job again will certainly cost at least some
time.

Actually, lower_bound will not test that the hit value is actually
equal to the one being looked for (because it only uses '<' for comparison),
but I agree the interface is cumbersome.
std::equal_range could be an alternative...
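For instance, getIndex() could be rewritten on top of std::equal_range, so the
found/not-found answer comes straight out of the algorithm. A sketch along the
lines of the original function (untested against the rest of the code; needs
<algorithm> and <utility> in addition to the headers already used):

    // Same contract as the original getIndex(): index if found,
    // -(insertionPoint + 1) if not.
    inline int getIndex(const std::vector<std::string>& list, const std::string& item) {
        std::pair<std::vector<std::string>::const_iterator,
                  std::vector<std::string>::const_iterator> r =
            std::equal_range(list.begin(), list.end(), item);
        int pos = static_cast<int>(r.first - list.begin());
        return (r.first == r.second) ? -(pos + 1) : pos;  // empty range => not found
    }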
Other than that, my code in the Java and C++ is quite similar. However,
the parse of a 100k line file took around a second for Java, but more
like 17s for C++. In fact, using std::set instead of hash_set actually
improved performance to about 15s in C++.

Unfortunately, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.


I would like to convert the vector to store pointers to strings, rather
than the strings themselves, but then I cannot use the built-in sort
and binary searches to find a particular string efficiently.


Actually you can, but you will need to provide these algorithms with an
additional parameter, which is a comparison function:

// add this declaration (needs <cstring> for std::strcmp)
struct StrPtrCompare
{
    bool operator()(char const* a, char const* b) const
    { return std::strcmp(a, b) < 0; }
};

// and from your function, call:
std::sort( vect.begin(), vect.end(), StrPtrCompare() );
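The same comparator also works for the binary searches Alex mentioned; a small
sketch, assuming vect is the sorted std::vector<const char*> from above and
item is the std::string being looked up:

    // Binary search the sorted pointer vector without copying any strings.
    std::vector<const char*>::const_iterator pos =
        std::lower_bound( vect.begin(), vect.end(), item.c_str(), StrPtrCompare() );
    bool found = pos != vect.end() && std::strcmp(*pos, item.c_str()) == 0;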

Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.


I didn't write my own hash table.


I did not suggest you should. Just some implementations of hash_set
have an equivalent of vector::reserve() to preallocate a larger
table (and avoid later reallocations). But no big deal...
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster
insertion time? Not having a CS background, I just read Java's
documentation which says that a hash set is faster in most cases.

What may be faster is to first copy and sort everything into a vector,
then look for (contiguous) duplicate items and remove them.
[ since you need a sorted vector anyway, the hash may be redundant ]
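In code, that is the standard sort / unique / erase idiom; a sketch assuming
nodeVector already holds every node name read from the file, duplicates included:

    // Sort first, then squeeze out adjacent duplicates in one pass.
    std::sort( nodeVector.begin(), nodeVector.end() );
    nodeVector.erase( std::unique( nodeVector.begin(), nodeVector.end() ),
                      nodeVector.end() );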
On the I/O code, I posted this in my other reply, but here's another copy:

std::vector< std::vector< std::string > > lines;
Unfortunately, such a data structure can be inefficient in C++,
and involve many object copies and memory allocations.
But there are a few tricks that can help...
std::string line;
std::vector<std::string> thisLine;

do {
    std::getline(netlist, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
Instead of the last two lines, the following will avoid
an intermediate copy of the objects and sub-objects:
    lines.push_back( std::vector<std::string>() );
    split( line, lines.back() );

Also, it will help if you call
    lines.reserve( someGuessOfTheNumberOfInputLines );
prior to reading the file.
Alternatively, you could use a different type:
std::vector< std::vector< std::string > > lines;
// will be faster on some platforms.
        //process the line
    }
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    while(lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if(firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
    }
}


Now, not talking about efficiency, all the input code above probably
could be simplified as follows (including <sstream> and <iterator>):

while( std::getline(netlist, line) )
{
    lines.push_back( std::vector<std::string>() ); // add empty object
    std::istringstream lineStream( line );
    lines.back().assign(
        std::istream_iterator<std::string>( lineStream ),
        std::istream_iterator<std::string>() );
}

This is the C++ input code I would start with.
[ for the rest, the tips above - use char* after this input code
and skip the hash if possible - should improve execution speed ]
If the reading code remains a critical bottleneck, the
ultimate solution would be to:
1) skip iostreams, and use memory mapping or a single fread
to bring the whole file into memory
2) parse the file in-place, and add null chars at the end of
each 'word', so I can use a simple char* to access all
strings in-place, without any memory allocation.
That would take me an extra day of programming and be less
portable/maintainable, but be blazingly fast...
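A rough sketch of what steps 1) and 2) might look like (the file name,
function names, and the minimal whitespace set are made up for illustration;
error handling omitted):

    #include <cstdio>
    #include <vector>

    // 1) bring the whole netlist into memory with a single fread
    std::vector<char> slurp(const char* fileName)
    {
        std::FILE* f = std::fopen(fileName, "rb");
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        std::vector<char> buf(size + 1);
        std::fread(&buf[0], 1, size, f);
        std::fclose(f);
        buf[size] = '\0';
        return buf;
    }

    // 2) parse in place: null-terminate each word and keep a plain char* to it
    void tokenize(std::vector<char>& buf, std::vector<char*>& words)
    {
        for (char* p = &buf[0]; *p; ) {
            while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') ++p;
            if (!*p) break;
            words.push_back(p);
            while (*p && *p != ' ' && *p != '\t' && *p != '\r' && *p != '\n') ++p;
            if (*p) *p++ = '\0';
        }
    }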

Well... I hope some of this stuff will be useful.
[I'm a bit in a rush]

Cheers,
Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Jul 22 '05 #7
I (Ivan Vecerina) wrote in message news:ck**********@newshispeed.ch...
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation which
says that a hash set is faster in most cases.

What may be faster is to first copy and sort everything into a vector,
then look for (contiguous) duplicate items and remove them.
[ since you need a sorted vector anyway, the hash may be redundant ]

NB: however this only makes sense if there are few duplicate strings to
detect.
Well... I hope some of this stuff will be useful.

And of course it is difficult to suggest solutions
without seeing the big picture...

Regards
-Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- e-mail contact form
Jul 22 '05 #8
On Tue, 12 Oct 2004 16:28:16 GMT, "Alex Gerdemann"
<nu*******@hotmail.com> wrote:
I'm using Cygwin's special version of GCC 3.3.3. I can't really separate
how much time is spent just reading the file, as I process it line by line,
doing some work between each I/O call.
Ok, I'm 95% sure that gcc 3.3 uses a copy-on-write (COW) string
implementation, with reference counting (gcc 3.4 certainly does). This
means that copying a string doesn't require that any memory is
allocated - the copy shares representation with the original and only
creates its own unique copy when it might modify it. This also means
that copying strings is cheap, as long as those strings are only
accessed through const member functions after the copy has occurred.
while(!done) {
    //read a line of file
    //split the line into words and push into a vector
    //check for syntax errors
    nextNode = some word in line;
    nodeSet.insert(nextNode);


The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.


Since you asked here's exactly what I did:

std::vector< std::vector< std::string > > lines;


You definitely want:
    lines.reserve(estimateOfNumberOfLines);
std::string line;
std::vector<std::string> thisLine;

do {
    line.reserve(enoughForALine); //may well help
    std::getline(netlist, line);
    if (line.length() != 0) {
        split(line, thisLine);
        lines.push_back(thisLine);
The above line is going to be slow, since it involves copying the
whole vector. Instead, you might do:
    //avoid copying the vector:
    lines.resize(lines.size() + 1); //add extra default element
    lines.back().swap(thisLine); //swap it with the current element
        //process the line
    }
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
    std::string::size_type firstMark = 0, lastMark = 0;
    words.clear();
    words.reserve(50); //say
    while(lastMark != std::string::npos) {
        firstMark = line.find_first_not_of(" ", lastMark);
        if(firstMark == std::string::npos) break;
        lastMark = line.find_first_of(" ", firstMark);
        words.push_back(line.substr(firstMark, lastMark - firstMark));
That might be slightly more efficient as:

    words.push_back(std::string(line, firstMark, lastMark - firstMark));

but I doubt it will make much difference with a COW string
implementation.
    }
}
Here you definitely want:
    nodeVector.reserve(nodeSet.size());
std::copy(nodeSet.begin(), nodeSet.end(), std::back_inserter(nodeVector));
std::sort(nodeVector.begin(), nodeVector.end());


Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.


I added the reserve call which saved me a couple of tenths of a second.
Nice, but it must not be the major bottleneck. What is a "reference counted"
class?


See above.
Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (compile with -pg and run gprof,
if I recall correctly), to find out where the bottleneck actually lies.


I went ahead and tried this, but can't quite understand the results. The
total time it computes is way off. (It says it's 8s vs. the actual 14s.)
Also, it thinks that only 30% of run time was spent in the main() function.
This can't possibly be right.


It might not be counting time spent in system functions, such as IO
(system vs user time?).

If you want to optimize the above, I'd do this:

Write a simple immutable string type that is initialized with a
pointer and a length, but allocates no memory and does nothing in the
destructor. Include in the string a cached hashcode value, so that the
hashcode need only be calculated once, e.g.:

class mystring
{
    char const* m_ptr;
    std::size_t m_length;
    mutable unsigned long m_hashCode;
public:
    mystring(char const* ptr, std::size_t length)
      : m_ptr(ptr), m_length(length),
        m_hashCode(static_cast<unsigned long>(-1))
    {
    }

    unsigned long hashCode() const
    {
        if (m_hashCode == static_cast<unsigned long>(-1))
        {
            //calculate hashCode (java.lang.String-style: h = 31*h + c)
            unsigned long h = 0;
            for (std::size_t i = 0; i < m_length; ++i)
                h = 31 * h + static_cast<unsigned char>(m_ptr[i]);
            m_hashCode = h;
        }
        return m_hashCode;
    }

    char operator[](std::size_t index) const
    {
        return m_ptr[index];
    }

    //operator==, <, etc.
    //compiler generated destructor, copy, assignment are fine.
};

Read the entire file into a vector<char>.

Iterate over the vector, creating "mystring"s pointing into the
vector for each word. You might also consider replacing ' ' characters
with '\0's, so that you can add a c_str() method to mystring that simply
looks like this:
char const* c_str() const
{
return m_ptr;
}

Operate on this new vector<vector<mystring> >.
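A sketch of that tokenizing pass, purely illustrative and assuming the whole
file is already sitting in a std::vector<char> named buf (the lines and
currentLine names are made up):

    // Build vector<vector<mystring> > directly over the file buffer; no string copies.
    void tokenizeLines(std::vector<char>& buf,
                       std::vector< std::vector<mystring> >& lines)
    {
        std::vector<mystring> currentLine;
        std::size_t start = 0;
        for (std::size_t i = 0; i <= buf.size(); ++i) {
            char c = (i < buf.size()) ? buf[i] : '\n';   // end-of-buffer acts like end-of-line
            if (c == ' ' || c == '\n') {
                if (i > start)
                    currentLine.push_back(mystring(&buf[start], i - start));
                if (c == '\n' && !currentLine.empty()) {
                    lines.push_back(std::vector<mystring>());
                    lines.back().swap(currentLine);      // swap to avoid copying the vector
                }
                start = i + 1;
            }
        }
    }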

That should be much much faster, since the memory allocation overhead
will be vastly decreased. If you don't need a vector of lines, just
have an overall vector of words for another speed up. Essentially,
optimization in non-numerical C++ is often about reducing the number
of calls to "new" and "delete", which are often even slower than the
Java versions (new + gc).

Tom
Jul 22 '05 #9
Alex Gerdemann wrote:
[snip]
Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++. In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.


Hmm. Just a quick question:
Are you sure the C++ optimizer ran over your code and you are not
timing a debug version?

Especially with container templates the optimizer can often dramatically
reduce execution time.

--
Karl Heinz Buchegger
kb******@gascad.at
Jul 22 '05 #10
