STL map and pthread performance problem on Linux/GCC

nan.li.g

Hello, all,
I have an interesting problem with the STL map and pthreads on Linux
with g++. The source code is as follows.
//mt_map_test.cpp
#include <string>
#include <map>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>
using namespace std;

static void* thrd_work( void* data )
{
long count = reinterpret_cast<long>( data );

const int SIZE =32;
char buf[SIZE];
memset( buf, 'h', SIZE );
map<string, string> strMap;

for ( long i=0; i< count ; i++ ) {
char key[8];
sprintf( key, "%ld", i ); // i is a long, so use %ld
strMap[ key ] = buf;
}
return NULL; // a void* function must return a value
}

int main()
{
int JOB_NUM = 320000;
pthread_t tid[THRD_NUM];

for ( int i=0; i< THRD_NUM; i++ ){
pthread_create( tid+i, NULL, &thrd_work,
reinterpret_cast<void*> (JOB_NUM) );
}

for ( int i=0; i< THRD_NUM; i++ ){
pthread_join( tid[i], NULL );
}

}
And here is what I got from my workstation ( Dual AMD Opteron machine,
RHEL 3)

[nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t mt_map_test.cpp -lpthread ; time ./a.out; done

real 0m1.390s
user 0m1.280s
sys 0m0.120s

real 0m3.450s
user 0m5.320s
sys 0m1.170s
I expected the two times to be roughly equal, but I clearly see a
significant slowdown with 2 threads. The same happened on a dual Intel
Xeon machine. I suspect the internal STL map implementation is at
fault.

I've spent hours googling without any answer. I really need advice from
a C++ expert. Thanks a lot.

Nan

Aug 16 '05 #1


Maxim Yegorushkin

na******@gmail.com wrote:

[]

I expected the two times to be roughly equal, but I clearly see a
significant slowdown with 2 threads. The same happened on a dual Intel
Xeon machine. I suspect the internal STL map implementation is at
fault.

I've spent hours googling without any answer. I really need advice from
a C++ expert. Thanks a lot.

The problem may well be in malloc(), since it uses a mutex to protect
its data structures, thus serializing your map<> inserts. Try
http://www.hoard.org/

Aug 16 '05 #2

Larry I Smith

na******@gmail.com wrote:

Hello, all,
I have an interesting problem about stl map and pthread on Linux
and g++. The source code is as follows.
//mt_map_test.cpp
#include <string>
#include <map>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>
using namespace std;

static void* thrd_work( void* data )
{
long count = reinterpret_cast<long>( data );

const int SIZE =32;
//char buf[SIZE];
// allow room for a nul-terminator so the temp std::string
// constructed from buf[] by 'strMap[ key ] = buf' will
// always be SIZE bytes in length.
char buf[SIZE + 1];
memset( buf, 'h', SIZE );
// nul-terminate buf[]. if we don't, then when a
// std::string is made from buf[] the nbr of bytes
// put into the std::string will be all bytes up to the
// next zero byte (an unknown length) - undefined behavior.
buf[SIZE] = '\0';
map<string, string> strMap;

for ( long i=0; i< count ; i++ ) {
char key[8];
sprintf( key, "%ld", i ); // i is a long, so use %ld
strMap[ key ] = buf;
}
return NULL; // a void* function must return a value
}

int main()
{
int JOB_NUM = 320000;
pthread_t tid[THRD_NUM];

for ( int i=0; i< THRD_NUM; i++ ){
pthread_create( tid+i, NULL, &thrd_work,
reinterpret_cast<void*> (JOB_NUM) );
}

for ( int i=0; i< THRD_NUM; i++ ){
pthread_join( tid[i], NULL );
}

}
And here is what I got from my workstation ( Dual AMD Opteron machine,
RHEL 3)

[nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t mt_map_test.cpp -lpthread ; time ./a.out; done

real 0m1.390s
user 0m1.280s
sys 0m0.120s

real 0m3.450s
user 0m5.320s
sys 0m1.170s
I expected the two times to be roughly equal, but I clearly see a
significant slowdown with 2 threads. The same happened on a dual Intel
Xeon machine. I suspect the internal STL map implementation is at
fault.

I've spent hours googling without any answer. I really need advice from
a C++ expert. Thanks a lot.

Nan

Just because you have 2 CPUs doesn't mean that each thread
will run on its own CPU. I've seen discussions about this in
the gcc and g++ newsgroups. I don't remember the details, but
I seem to recall that special compile/link options were involved
to force usage of multiple CPUs.
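One way to check this rather than guess is to ask the system. The following is a minimal sketch, assuming Linux with glibc, where sysconf(_SC_NPROCESSORS_ONLN) reports the number of online CPUs and sched_getcpu() reports the CPU the calling thread is on at that instant. The helper names online_cpus, record_cpu and cpu_of_new_thread are made up for illustration.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>    // sched_getcpu (glibc extension; g++ defines _GNU_SOURCE)
#include <unistd.h>   // sysconf

// Hypothetical helper: number of CPUs currently online, or -1 on error.
long online_cpus()
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

// Thread body: store the CPU this thread is running on right now.
static void* record_cpu(void* out)
{
    assert(out != NULL);
    *static_cast<int*>(out) = sched_getcpu();
    return NULL;
}

// Hypothetical helper: start one thread and return the CPU it reported
// (-1 if the thread could not be created or sched_getcpu failed).
int cpu_of_new_thread()
{
    pthread_t tid;
    int cpu = -1;
    if (pthread_create(&tid, NULL, &record_cpu, &cpu) != 0)
        return -1;
    pthread_join(tid, NULL);
    return cpu;
}
```

Calling cpu_of_new_thread() a few times while the machine is busy gives a rough idea of whether the scheduler spreads threads across CPUs. The value is only a snapshot, since a thread can migrate at any time.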

FYI, here's what I get on my old single-cpu PII-450 with
384MB of RAM:

real 0m9.407s
user 0m9.006s
sys 0m0.215s

real 0m18.671s
user 0m17.959s
sys 0m0.489s

Regards,
Larry
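On the nul-termination point in the code comments above, an alternative is to pass the length explicitly, so that no terminator is needed at all. This is a minimal sketch, not the code used in this thread; make_value is a made-up helper name, and it relies only on the standard std::string(const char*, size_t) constructor, which copies exactly the given number of bytes.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Build a value of exactly `size` 'h' characters without relying on a
// nul terminator in the source buffer.
std::string make_value(std::size_t size)
{
    const std::size_t MAX = 32;   // same buffer size as in the thread
    assert(size <= MAX);
    char buf[MAX];                // no extra byte for '\0' is required
    std::memset(buf, 'h', MAX);
    return std::string(buf, size);  // copies exactly `size` bytes
}
```

With this form, 'strMap[ key ] = std::string( buf, SIZE )' always stores SIZE bytes, whatever follows buf[] in memory.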

Aug 16 '05 #3

nan.li.g

Thanks a lot for the replies. I can use sched_setaffinity for
processor binding on Linux, but that does not appear to be the
problem. I also wrote a multi-process version using fork(), and
it scales as expected.

Below is the new code with processor binding, a multi-process mode,
and the std::string correction.

-----------------------------------------------------------
#include <string>
#include <map>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h> // exit()
#include <string.h>
#include <pthread.h>
#include <sys/wait.h>
#include <sched.h>

using namespace std;

struct WorkData {

int cpu;
long count;

};

static void* work( void* data )
{
WorkData* workData = reinterpret_cast<WorkData*>( data );

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET( workData->cpu, &mask);
sched_setaffinity(0, &mask);

const int SIZE =32;
char buf[SIZE];
memset( buf, 'h', SIZE );
buf[SIZE-1] = '\0';
map<string, string> strMap;

for ( long i=0; i< workData->count ; i++ ) {
char key[8];
sprintf( key, "%ld", i ); // i is a long, so use %ld
strMap[ key ] = buf;
}
return NULL; // a void* function must return a value
}

int main()
{
const int JOB_NUM = 320000;
WorkData workData[ CONCURRENCY ];
#ifdef MT
pthread_t tid[ CONCURRENCY ];
for ( int i=0; i< CONCURRENCY; i++ ){
workData[i].count = JOB_NUM;
workData[i].cpu = i;
pthread_create( tid+i, NULL, &work, reinterpret_cast<void*>
(&(workData[i])) );
}

for ( int i=0; i< CONCURRENCY; i++ ){
pthread_join( tid[i], NULL );
}

#endif

#ifdef MP

pid_t pid[ CONCURRENCY ];
for ( int i=0; i< CONCURRENCY; i++ ){
workData[i].count = JOB_NUM;
workData[i].cpu = i;

if ( (pid[i] = fork()) == 0 ) {

work( reinterpret_cast<void*> ( &(workData[i])));
exit( 0 );
}
}

for ( int i=0; i< CONCURRENCY; i++ ){
waitpid( pid[i], NULL, 0 );
}

#endif
}

----------------------------------------------------------
Performance:

[nan@eudyptula test]$ for m in MT MP; do for t in 1 2; do printf "\n%s %d" $m $t; g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread ; time ./a.out; done; done

MT 1
real 0m1.410s
user 0m1.380s
sys 0m0.030s

MT 2
real 0m3.650s
user 0m5.230s
sys 0m1.640s

MP 1
real 0m1.380s
user 0m1.340s
sys 0m0.040s

MP 2
real 0m1.400s
user 0m2.650s
sys 0m0.130s

I also suspect there is some locking problem. I have not got my
program to work with hoard yet. But I did not find any slowdown when I
did a simple test on malloc in a multi-threaded program. For now, I
am going to dig into the map implementation on my machine.
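A malloc-only test of the kind described above might look roughly like the following. This is a minimal sketch, not the test actually run in this thread; AllocJob, alloc_work and run_alloc_threads are made-up names, and the 32-byte block size is borrowed from the map test. Note that it frees each block immediately, so it exercises the allocator differently from a map that keeps growing.

```cpp
#include <cassert>
#include <cstdlib>
#include <pthread.h>
#include <vector>

struct AllocJob {
    long count;   // number of malloc/free pairs to perform
    long done;    // number of pairs actually completed by the thread
};

// Thread body: allocate, touch, and free a small block `count` times.
static void* alloc_work(void* data)
{
    AllocJob* job = static_cast<AllocJob*>(data);
    for (long i = 0; i < job->count; i++) {
        void* p = std::malloc(32);
        if (p == NULL)
            break;
        static_cast<volatile char*>(p)[0] = 'h';  // touch the block
        std::free(p);
        job->done++;
    }
    return NULL;
}

// Run `threads` workers concurrently; return the total number of
// malloc/free pairs completed across all of them.
long run_alloc_threads(int threads, long count)
{
    assert(threads > 0);
    std::vector<pthread_t> tid(threads);
    std::vector<AllocJob> job(threads);
    int started = 0;
    for (int i = 0; i < threads; i++) {
        job[i].count = count;
        job[i].done = 0;
        if (pthread_create(&tid[i], NULL, &alloc_work, &job[i]) != 0)
            break;
        started++;
    }
    long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += job[i].done;
    }
    return total;
}
```

Calling run_alloc_threads(1, N) and run_alloc_threads(2, N) from main() and timing each run with 'time ./a.out', as in the map test, shows whether malloc alone slows down with a second thread.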

Aug 17 '05 #4

nan.li.g

I tried the same program again tonight on a Red Hat FC3 dual-processor
machine, and the problem went away. Note that gcc on FC3 is 3.4.2;
before, I used 3.2.3 on RHEL3.

gcc -v
Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=i386-redhat-linux
Thread model: posix
gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)

$ for m in MT ; do for t in 1 2; do printf "\n%s %d" $m $t; g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread; time ./a.out; done; done

MT 1
real 0m1.707s
user 0m1.632s
sys 0m0.075s

MT 2
real 0m1.805s
user 0m3.303s
sys 0m0.240s

BTW, the signature of sched_setaffinity changed on FC3; you
need to use 'sched_setaffinity(0, sizeof(mask), &mask);'
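For reference, the three-argument form can be wrapped as follows. This is a minimal sketch assuming a glibc that provides the three-argument sched_setaffinity and sched_getaffinity (the older two-argument form used in the program above will not compile against it). first_allowed_cpu and pin_to_cpu are made-up helper names.

```cpp
#include <cassert>
#include <sched.h>   // cpu_set_t, CPU_* macros, sched_{get,set}affinity
                     // (GNU extensions; g++ defines _GNU_SOURCE by default)

// Lowest-numbered CPU the calling thread may currently run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

// Pin the calling thread (pid 0) to a single CPU.
// Returns true on success.
bool pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

Checking the return value is worthwhile: the call fails if the requested CPU is not available to the process, for example inside a restricted cpuset.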

I also got Hoard working on FC3, but it brought no further improvement.
On RHEL3, I built the .so file with the following warnings:
Compiling for Linux
In file included from libhoard.cpp:164:
usesimtls.cpp: In function `int pthread_create(pthread_t*, const
pthread_attr_t*, void*(*)(void*), void*)':
usesimtls.cpp:323: warning: `pthread_attr_setstackaddr' is deprecated
(declared
at /usr/include/nptl/pthread.h:299)
In file included from libhoard.cpp:164:
usesimtls.cpp: In function `void pthread_exit(void*)':
usesimtls.cpp:354: warning: `noreturn' function does ret

But I got a segfault when running my program with MT 2. MT 1, MP 1
and MP 2 are all OK.

Clearly, something has changed between gcc 3.2 and gcc 3.4.

Aug 17 '05 #5

Maxim Yegorushkin

na******@gmail.com wrote:

[]

And here is what I got from my workstation ( Dual AMD Opteron machine,
RHEL 3)

[nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t mt_map_test.cpp -lpthread ; time ./a.out; done

real 0m1.390s
user 0m1.280s
sys 0m0.120s

real 0m3.450s
user 0m5.320s
sys 0m1.170s
I expected the two times to be roughly equal, but I clearly see a
significant slowdown with 2 threads. The same happened on a dual Intel
Xeon machine. I suspect the internal STL map implementation is at
fault.

I've spent hours googling without any answer. I really need advice from
a C++ expert. Thanks a lot.

The problem may be in the libstdc++ caching allocator. See
http://gcc.gnu.org/onlinedocs/libstd...allocator.html

I ran your code on a Dual Xeon 2.8 box with caching disabled and
enabled. Here are my results:

my@devel:~/src/exp> cat /etc/issue
Welcome to SuSE Linux 9.2 (i586) - Kernel \r (\l).
my@devel:~/src/exp> uname -a
Linux devel 2.6.11.4-20a-smp #1 SMP Wed Mar 23 21:52:37 UTC 2005 i686
i686 i386 GNU/Linux
my@devel:~/src/exp> g++ --version
g++ (GCC) 3.3.4 (pre 3.3.5 20040809)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is
NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp -pthread ; GLIBCPP_FORCE_NEW=1 GLIBCXX_FORCE_NEW=1 time -p ./a.out; done
real 1.16
user 1.10
sys 0.05
real 1.37
user 2.55
sys 0.14

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp -pthread ; time -p ./a.out; done
real 1.12
user 1.06
sys 0.05
real 3.48
user 5.28
sys 1.40

In the former case, with caching disabled, the real-time numbers show
that the task scales well across the two processors.

The Hoard allocator gives a small additional boost:

my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp -pthread ; LD_PRELOAD=/usr/local/lib/libhoard.so:/usr/lib/libdl.so GLIBCPP_FORCE_NEW=1 GLIBCXX_FORCE_NEW=1 time -p ./a.out; done
real 1.15
user 1.09
sys 0.05
real 1.31
user 2.46
sys 0.13
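Instead of setting the FORCE_NEW environment variables, the map can also be given an allocator that goes straight to operator new, so that node allocation bypasses any pooling in the default allocator. The following is a minimal sketch under stated assumptions: it uses the C++11 minimal-allocator interface, which is newer than the compilers discussed in this thread, and PlainNewAllocator, StrMap and fill_map are made-up names. Only the map nodes use this allocator; the std::string keys and values still allocate their character storage through the default allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <map>
#include <new>
#include <string>
#include <utility>

// Hypothetical allocator that forwards every request to operator new
// and operator delete, with no caching of its own.
template <class T>
struct PlainNewAllocator {
    typedef T value_type;

    PlainNewAllocator() {}
    template <class U>
    PlainNewAllocator(const PlainNewAllocator<U>&) {}

    T* allocate(std::size_t n)
    {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t)
    {
        ::operator delete(p);
    }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U>
bool operator==(const PlainNewAllocator<T>&, const PlainNewAllocator<U>&)
{
    return true;
}
template <class T, class U>
bool operator!=(const PlainNewAllocator<T>&, const PlainNewAllocator<U>&)
{
    return false;
}

typedef std::map<std::string, std::string, std::less<std::string>,
                 PlainNewAllocator<std::pair<const std::string, std::string> > >
    StrMap;

// Same insert loop as the test program, using the custom allocator.
// Returns the number of entries stored.
std::size_t fill_map(long count)
{
    assert(count >= 0 && count <= 10000000);  // keys must fit in key[8]
    StrMap strMap;
    const std::string value(31, 'h');
    for (long i = 0; i < count; i++) {
        char key[8];
        std::snprintf(key, sizeof(key), "%ld", i);
        strMap[key] = value;
    }
    return strMap.size();
}
```

Whether this changes anything on a given toolchain depends on what its default std::allocator does, so it is worth timing both variants before adopting it.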

Aug 17 '05 #6

vic

Just to fix up the report output, I ran the following command,

cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ; \
uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14 }' ; \
gcc -v 2>&1 | grep gcc | awk '{print "gcc version : " $3 $4 }'; \
for method in MT MP; do \
for concurrency in 1 2 3 4 5 ; do \
printf "%s / %s = " $method $concurrency ; \
g++ -D$method=1 -DCONCURRENCY=$concurrency mt_map_test.cpp -lpthread ; \
for samples in 1 2 3 4 5 6 7 8 9 ; do \
/usr/bin/time ./a.out 2>&1 | \
grep -v swaps | \
sed -e "s/^\(.*\)user.*$/\1/g" ; \
done | sort | head -5 | tail -1 ; \
done ; \
done

The basic idea is to run multiple times and pull the median . . .
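The same idea can be expressed inside a program. This is a minimal sketch, assuming a C++11 compiler for std::chrono; median_of and median_seconds are made-up names. Unlike the script above, which extracts the user CPU time reported by /usr/bin/time, this version measures wall-clock time.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Middle element after sorting. For an even number of samples this
// returns the upper of the two middle values.
double median_of(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Run `work` `runs` times and return the median wall-clock duration
// of a single run, in seconds.
template <class Work>
double median_seconds(Work work, int runs)
{
    assert(runs > 0);
    std::vector<double> samples;
    for (int i = 0; i < runs; i++) {
        std::chrono::steady_clock::time_point start =
            std::chrono::steady_clock::now();
        work();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        samples.push_back(elapsed.count());
    }
    return median_of(samples);
}
```

With 9 runs, samples.size() / 2 is index 4, the fifth-smallest value, which matches the 'sort | head -5 | tail -1' step in the script.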

For convenience I also added some conditional definitions,
#ifndef CONCURRENCY
#define CONCURRENCY 1
#endif

#if !defined( MP ) && !defined( MT )
#define MT 1
#endif

My machine has only a single processor, so the results do not seem
surprising,

model name : AMD Athlon(tm) 64 Processor 2800+
cpu MHz : 1808.843
kernel name : Linux 2.6.12-1.1398_FC4 x86_64 GNU/Linux
gcc version : 4.0.120050727
MT / 1 = 1.30
MT / 2 = 2.65
MT / 3 = 4.01
MT / 4 = 5.35
MT / 5 = 6.47
MP / 1 = 1.34
MP / 2 = 2.69
MP / 3 = 4.01
MP / 4 = 5.35
MP / 5 = 6.69

Aug 17 '05 #7

nan.li.g

Thanks a lot, Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
is. I dropped the processor binding code and ran the following script
(adapted from Vic's report script) on 2 machines.

$ cat /home/nan/measure.sh
#!/bin/bash

cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ;
uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14 }';
gcc -v 2>&1 | grep '^gcc' | awk '{print "gcc version : " $3 $4 }';
printf "\n CD CE";

for m in MT MP; do
  for t in $(seq 1 4); do
    printf "\n%s / %d = " $m $t;
    g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread;
    for cache in 'export' 'export -n' ; do
      for samples in $(seq 1 5); do
        $cache GLIBCPP_FORCE_NEW=1; $cache GLIBCXX_FORCE_NEW=1; /usr/bin/time -f %e ./a.out 2>&1;
      done | sort | head -3 | tail -1 | tr '\n' ' '
    done
  done
done

Machine 1
$ sh measure.sh
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2791.459
kernel name : Linux 2.6.9-1.667smp 2004 i386
gcc version : 3.4.220041017

          CD    CE
MT / 1 = 1.69  1.68
MT / 2 = 1.79  1.86
MT / 3 = 2.85  2.93
MT / 4 = 3.12  3.11
MP / 1 = 1.54  1.54
MP / 2 = 1.56  1.57
MP / 3 = 3.49  2.45
MP / 4 = 3.51  3.14

Machine 2
$ sh measure.sh
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
model name : Intel(R) Xeon(TM) CPU 2.40GHz
cpu MHz : 2387.948
kernel name : Linux 2.4.21-4.ELsmp 2003 i386
gcc version : 3.2.320030502

          CD    CE
MT / 1 = 2.00  1.97
MT / 2 = 2.19  4.93
MT / 3 = 3.16  7.22
MT / 4 = 3.75  9.94
MP / 1 = 1.82  1.99
MP / 2 = 1.84  2.00
MP / 3 = 3.11  3.24
MP / 4 = 3.73  4.25

Note that both machines have 2 hyperthreaded CPUs (4 logical
processors in /proc/cpuinfo), so performance only scales up to a
concurrency of 2.
For this particular program, MP seems to be consistently faster
than MT.
And surprisingly, GLIBCPP_FORCE_NEW and GLIBCXX_FORCE_NEW have no
effect on gcc 3.4.
Aug 17 '05 #8

Maxim Yegorushkin

na******@gmail.com wrote:

Thanks a lot, Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
is. I dropped the processor binding code and ran the following script
( adapted from Vic's report script ) on 2 machines.
[]
Note the above machines have 2 Hyperthreaded CPUs, so it only scales
up to 2.
For this particular program, it seems that MP is consistently faster
than MT.
And surprisingly, GLIBCPP_FORCE_NEW or GLIBCXX_FORCE_NEW has no effect
on gcc 3.4.

I've just checked the gcc 3.4.3 sources, and the GLIBCXX_FORCE_NEW
check is still there.

I ran a slightly modified version of the script to include results
with the Hoard allocator. Here CE and CD stand for the C++ caching
allocator enabled and disabled, and H stands for Hoard.

my@devel:~/src/exp> ./measure
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2792.047
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2792.047
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2792.047
model name : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz : 2792.047
kernel name : Linux 2.6.11.4-20a-smp 2005 i386
gcc version : 3.3.4

          CE    CD    CEH   CDH
MT / 1 = 1.63  1.66  1.63  1.65
MT / 2 = 4.29  1.79  4.24  1.77
MT / 3 = 6.16  2.53  6.23  2.44
MT / 4 = 9.33  3.05  9.03  2.24
MP / 1 = 1.62  1.49  1.63  1.51
MP / 2 = 1.63  1.51  1.64  1.53
MP / 3 = 2.29  2.14  2.30  2.17
MP / 4 = 2.70  2.59  2.74  2.66

Aug 18 '05 #9

nan.li.g

It would be nice if you could try this again with gcc 3.4.3.

Aug 18 '05 #10
