473,915 Members | 4,402 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Benchmarks

The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task: http://unthought.net/c++/c_vs_c++.html (ignore all the language
flaming :-), and thought that all of them did unnecessary operations,
so I wrote my own. But for some reason, my version turned out slower
that ALL of the versions in the website, even though it seems to
perform less operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>

int main(int argc, char **argv)
{
// Declare and Initialize some variables
std::string word;
std::set<std::s tringwordcount;
// Read words and insert in rb-tree
while (std::cin >word) wordcount.inser t(word);
// Print the result
std::cout << "Words: " << wordcount.size( ) << std::endl;
return 0;
}

My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

struct SetNode
{
char *word;
struct SetNode *next;
};

// An unorderd set of words
//
static struct SetNode *gSet = 0;
static int gSetSize = 0;

#define kInitWordSize 32

// Returns a word read from stdin. The returned pointer must be
// deallocated with free().
//
static char *
ReadOneWord(voi d)
{
int ch = getchar();

while (ch != EOF && isspace(ch))
ch = getchar();
if (ch == EOF)
return 0;

char *word = (char *) malloc(kInitWor dSize);
if (!word)
return 0;

int size = kInitWordSize;
int i = 0;

while (ch != EOF && !isspace(ch)) {
if (i >= size) {
size *= 2;

char *newWord = (char *) realloc(word, size);
if (!newWord) {
free(word);
return 0;
}
word = newWord;
}

word[i++] = ch;
ch = getchar();
}

if (i >= size) {
size *= 2;

char *newWord = (char *) realloc(word, size);
if (!newWord) {
free(word);
return 0;
}
word = newWord;
}
word[i] = '\0';

return word;
}

// Inserts a word into the set if it isn't in the set.
// The passed string is expected to have been allocated with
// a memory allocation function, and it should be considered
// lost after passed to this function.
//
static void
InsertWord(char *aWord)
{
struct SetNode *node;

for (node = gSet; node; node = node->next) {
if (strcmp(node->word, aWord) == 0) {
free(aWord);
return;
}
}

node = (struct SetNode *) malloc(sizeof(s truct SetNode));
if (!node) {
free(aWord);
return;
}

node->word = aWord;
node->next = gSet;
gSet = node;
++gSetSize;
}

static void
DeleteSet(void)
{
struct SetNode *node = gSet;
struct SetNode *temp;

while (node) {
temp = node;
node = node->next;
free(temp->word);
free(temp);
}

gSet = 0;
gSetSize = 0;
}

int
main(void)
{
char *word;

while ((word = ReadOneWord()))
InsertWord(word );

printf("Words: %d\n", gSetSize);

// Skip cleanup for now...
//DeleteSet();
}

Any ideas as to what causes the big slowdown?

Sebastian

Nov 6 '08
120 5684
Longjun <Lo***********@ gmail.comwrites :
On Nov 6, 11:53*pm, s0s...@gmail.co m wrote:
>The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task:http://unthought.net/c++/c_vs_c++.html(ignore all the language
flaming :-), and thought that all of them did unnecessary operations,
so I wrote my own. But for some reason, my version turned out slower
that ALL of the versions in the website, even though it seems to
perform less operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>
[...]
>>
My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
[...]
>>
Any ideas as to what causes the big slowdown?

Noticed that you've implemented your own mechanism of scanning words
from standard input and insert a new elements in your "sets". I don't
know why you implement it by yourself. Are you clear the principle of
the class cin/cout and set? Are you sure that your own function have a
better performance to the standard one?
This discussion is cross-posted to comp.lang.c and comp.lang.c++. His
solution is written in C, which doesn't have sets built into the
standard library.

But certainly the equivalent functionality could be written in C.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Nov 6 '08 #11
On Nov 6, 7:53*am, s0s...@gmail.co m wrote:
The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.
[snip]

I am surprised chuck has not chimed in here.
A hash table is *ideal* for this.

P.S. to those analyzing performance...
Since we have to examine every word in the list, the performance of
the algorithm *cannot* be faster than O(n).
The hash table solution is O(n).

Using a btree, skiplist, avltree, etc. will be O(n log n) because:
For each word, we must collect it.
For this word, we must check for duplicity. With a hash table the
check is O(1). With a logarithmic search structure, the check is
O(log n). (There is a multiplicative constant less than 1, but that
does not alter the O(log n) behavior.
Hence: O(log n) * O(n) = O(n log n) for most ordered list variants.

There is another structure that would be competitive. I guess that a
ternary search tree might beat a hash table just because of the
excellence in memory access pattern. At least for lists of less than
a million items (and it would be hard to come up with more than a
million correctly spelled real words).
http://www.cs.princeton.edu/~rs/strings/

Nov 6 '08 #12
user923005 wrote:
The hash table solution is O(n).
You would have hard time proving that.
Nov 6 '08 #13
On Nov 6, 1:21*pm, Juha Nieminen <nos...@thanks. invalidwrote:
user923005 wrote:
The hash table solution is O(n).

* You would have hard time proving that.
Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

My solution would also do the counting at insert time (IOW, the hash
insert function will return 0 if the item is already in the table and
return 1 if the item was inserted).
In that way, there is no need to scan the table and you can make it as
large as you like.

IOW:

unsigned long long count = 0;

while (item = fgets(string, sizeof string, stdin))
count += cuckoo_hash_ins ert(item);
Nov 6 '08 #14
On Nov 6, 1:51*pm, c...@tiac.net (Richard Harter) wrote:
On Thu, 6 Nov 2008 12:53:55 -0800 (PST), user923005
<dcor...@connx. comwrote:
On Nov 6, 7:53=A0am, s0s...@gmail.co m wrote:
The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.
[snip]
[...] Using a btree, skiplist, avltree, etc. will be O(n log n) because:[...]

hash table * * *- O(log n)
comparison tree - O((log n)^2)
radix trees * * - O(log n)

[etc]
I don't have any idea what anyone here is talking about. This is
clearly a "trie" problem. The performance is O(n), where n is the
length of the input (in characters). If your performance is any
different from that your implementation is just wrong. Here is my
solution, which is simpler/shorter than anything given so far or on
that website:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct trieNode {
int canTerminateHer e;
struct trieNode * letter[26];
};

static int * insertTrie (struct trieNode * tn, const char * w) {
if ('\0' == *w) return &tn->canTerminateHe re;
if (NULL == tn->letter[*w-'a']) {
if (NULL == (tn->letter[*w-'a'] = (struct trieNode *) calloc
(1, sizeof (struct trieNode)))) return NULL;
}
return insertTrie (tn->letter[*w-'a'], w+1);
}

int main () {
struct trieNode start = {0};
char buff[2048], *s, *t;
int count = 0, *ret;
while (buff == fgets (buff, 2048, stdin)) {
for (t = buff; *t;) {
s = t + strspn (t, "abcdefghijklmn opqrstuvwxyz");
if (s != t) {
char c = *s;
*s = '\0';
if (NULL == (ret = insertTrie (&start, t))) exit (-1);
*s = c;
count += 1 ^ *ret;
*ret = 1;
}
t = s + strcspn (s, "abcdefghijklmn opqrstuvwxyz");
}
}
printf ("Count: %d\n", count);
return 0;
}

This makes the assumption that all inputs are continguous words in
lines no longer than 2048 characters separated by white space or line
feeds or whatever. It also assumes that you have enough memory to
hold all the words input, at a rate of (26 * sizeof (void *) + sizeof
(int)) * (size of the dictionary of the input in characters), roughly
speaking. The program doesn't clean up after itself, but I don't
think that was in the requirements.

As for the *real* performance of this thing, it will come down to the
calloc() speed. I could waste a lot more lines of code to massively
improve the performance here. It might be worth it if, say, the
benchmark were trying to measure performance per line of code. So the
score would be, say, lines of code times time taken to output the
correct count or something, and it would then probably be worth it to
implement a calloc pooler (it would double the size at least, but
probably make the performance at least 3 times higher, if the words
had a high enough uniqueness rate).

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
Nov 7 '08 #15
On 6 Nov, 15:53, s0s...@gmail.co m wrote:
The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task:http://unthought.net/c++/c_vs_c++.html(ignore all the language
flaming :-), and thought that all of them did unnecessary operations,
so I wrote my own. But for some reason, my version turned out slower
that ALL of the versions in the website, even though it seems to
perform less operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>

int main(int argc, char **argv)
{
* * * * // Declare and Initialize some variables
* * * * std::string word;
* * * * std::set<std::s tringwordcount;
* * * * // Read words and insert in rb-tree
* * * * while (std::cin >word) wordcount.inser t(word);
* * * * // Print the result
* * * * std::cout << "Words: " << wordcount.size( ) << std::endl;
* * * * return 0;

}
the above uses an rb tree (or equivalent) in the set class

My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:
[snip version using linear search]
Any ideas as to what causes the big slowdown?

this is a very important lesson. Print it out in big letters
and post it on your wall.

CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

--
Nick Keighley

Nov 7 '08 #16
Juha Nieminen wrote:
s0****@gmail.co m wrote:
>Any ideas as to what causes the big slowdown?

Why do you expect that searching a linked list could be even
close to the speed of searching a balanced binary tree?
Which is O(log N), as compared to O(N). However a hashtable is
even faster, being O(1) for suitable organization.

F'ups set to c.l.c. only.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home .att.net>
Try the download section.
Nov 7 '08 #17
user923005 wrote:
s0s...@gmail.co m wrote:
>The task: Write a program that reads a set of words from
standard input and prints the number of distinct words.

[snip]

I am surprised chuck has not chimed in here.
A hash table is *ideal* for this.
I just did, a few minutes ago. Been having problems with the
news-server.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home .att.net>
Try the download section.
Nov 7 '08 #18
CBFalconer wrote:
Juha Nieminen wrote:
>s0****@gmail.co m wrote:
>>Any ideas as to what causes the big slowdown?
Why do you expect that searching a linked list could be even
close to the speed of searching a balanced binary tree?

Which is O(log N), as compared to O(N). However a hashtable is
even faster, being O(1) for suitable organization.

F'ups set to c.l.c. only.
Why? Because computational complexities only apply to C?
Nov 7 '08 #19
Juha Nieminen <no****@thanks. invalidwrites:
user923005 wrote:
>The hash table solution is O(n).

You would have hard time proving that.
He shouldn't. It should be pretty clear. No hash table needs
to examine each element more than once.

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic
Nov 7 '08 #20

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1448
by: aaronwmail-usenet | last post by:
I've been wondering about benchmarks recently. What is a fair benchmark? How should benchmarks be vetted or judged? I decided to see what you folks thought, so for discussion I compared two priority queue implementations I published for Python in 1995 against the "heap" priority queue implementation added to the python distribution
0
1186
by: Dathan Vance Pattishall | last post by:
------=_NextPart_000_0401_01C35A8D.94584FD0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit At the conference a few months ago, the mysql team said that the benchmarks they where running would be published here http://www.mysql.com/benchmarks but it doesn't look like this is the link.
13
1895
by: Shane Wright | last post by:
Hi, I've been trying to spec a new server for my company's database for a few weeks and one of the biggest problems I've had is trying to find meaningful performance information about how PostgreSQL will perfom under various disk configurations. But, we have now taken the plunge and I'm in a position to do some benchmarking to actually get some data. Basically I was wondering if anyone else had any particular recommendations (or...
15
6910
by: roberts.noah | last post by:
Are there any decent benchmarks of the difference between using c strings and std::string out there? Google isn't being friendly about it. Obviously this would be dependant on different implementations but I don't care. I would be happy to find ANY comparison at this point if it was valid and semi scientifically done.
4
1609
by: Dibur | last post by:
I have been told to set the performance benchmarks for all the db2 databases hosted on all servers. I have no clue as to how to approach this problem. Can this be done theoritically through extrapolation techniques.One thing that is important here is that all the servers are in production phase so practically carrying out load testing is not feasible. Any ideas , suggestions or comments are welcome
10
3065
by: StephQ | last post by:
I found that old post: http://groups.google.com/group/comp.lang.c++/browse_frm/thread/3a2562c9a5f8998/15519204726d01e8?lnk=gst&q=vector+no+surprise&rnum=2#15519204726d01e8 I just erased the #include <kubux.....lines. ****** old post for your convenince ******** You are right: #include <vector> #include <iostream>
80
3578
by: tech | last post by:
Hi, i have the following problem In file1.h namespace A { class Bar { void foo();
318
11215
by: King Raz | last post by:
The shootout site has benchmarks comparing different languages. It includes C# Mono vs Java but not C# .NET vs Java. So I went through all the benchmark on the site ... http://kingrazi.blogspot.com/2008/05/shootout-c-net-vs-java-benchmarks.html Just to keep the post on topic for my friends at comp.lang.c++, how do I play default windows sounds with C++?
84
571
by: s0suk3 | last post by:
The task: Write a program that reads a set of words from standard input and prints the number of distinct words. I came across a website that listed a few programs to accomplish this task: http://unthought.net/c++/c_vs_c++.html (ignore all the language flaming :-), and thought that all of them did unnecessary operations, so I wrote my own. But for some reason, my version turned out slower that ALL of the versions in the website, even...
0
9881
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11354
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
11066
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10542
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9732
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5943
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
6148
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4778
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
3368
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.