Slow Regex Code

Still learning C++. I'm writing some regex using boost. It works great.
Only thing is... this code seems slow to me compared to equivalent Perl
and Python. I'm sure I'm doing something incorrect. Any tips?

#include <boost/regex.hpp>
#include <iostream>
#include <string>

// g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35 /usr/local/lib/libboost_regex-gcc41-mt-s.a
// g++ numbers.cpp -o numbers.exe -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

void number_search(const std::string& portion)
{
    // Compiled once, on the first call.
    static const boost::regex Numbers("\\b\\d{9}\\b");
    static const boost::regex& rNumbers = Numbers;
    boost::smatch matches;

    std::string::const_iterator Start = portion.begin();
    std::string::const_iterator End = portion.end();

    // Print every match, continuing the search after the previous one.
    while (boost::regex_search(Start, End, matches, rNumbers))
    {
        std::cout << matches.str() << std::endl;
        Start = matches[0].second;
    }
}

int main ()
{
    std::string portion;
    while (std::getline(std::cin, portion))
    {
        number_search(portion);
    }
    return 0;
}
Jun 27 '08 #1
On Jun 8, 6:32 pm, brad <byte8b...@gmail.com> wrote:
> Still learning C++. I'm writing some regex using boost. It
> works great. Only thing is... this code seems slow to me
> compared to equivalent Perl and Python.

Seems slow, or is measurably slower? There are two
possibilities:

1. it only seems slower, because the rest of the code is
significantly faster, or

2. it really is slower, because perl and python can compile it
into some sort of efficient byte code, since they already
have an "execution" machine for such byte code loaded.
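
One way to tell "seems slower" from "measurably slower" is to time the matching work on its own. A minimal sketch, assuming a C++11 compiler; elapsed_ms is a hypothetical helper name, not something from Boost or from the code in the question:

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename Work>
double elapsed_ms(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Passing only the search loop as `work` keeps program start-up out of the figure. Console output inside the loop would still be counted, so it may be worth redirecting or disabling it while measuring.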

Note that pure (non-extended) regular expressions can be made to
run considerably faster, since they can be converted to a pure
DFA. My own regular expression class does this. For most
purposes, however, boost::regex will be fast enough, and worth
the added flexibility. (My own regular expression class was
designed for a very specific use, where it doesn't need the
extensions, but it does need some additional features which
aren't in Boost. For most general use, boost::regex is
preferable.)
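
To make the DFA remark concrete: a pattern as simple as \b\d{9}\b can be checked by a single pass over the input with a small amount of state and no backtracking. The sketch below is not the regular expression class mentioned above. It is a hand-written scanner under the assumption that "word character" means ASCII letters, digits and underscore, and count_digit_tokens is a made-up name:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Word character in the ASCII regex sense: [A-Za-z0-9_].
static bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Hypothetical helper: counts maximal runs of word characters that
// consist of exactly `wanted` digits, which is what \b\d{9}\b matches
// when wanted == 9.
std::size_t count_digit_tokens(const std::string& text, std::size_t wanted = 9)
{
    std::size_t count = 0;
    std::size_t run = 0;       // length of the current word-character run
    bool all_digits = true;    // true while the run holds only digits

    // One extra iteration (i == size) flushes a run that ends at the end.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool in_word = i < text.size() && is_word_char(text[i]);
        if (in_word) {
            if (!std::isdigit(static_cast<unsigned char>(text[i])))
                all_digits = false;
            ++run;
        } else {
            if (run == wanted && all_digits)
                ++count;
            run = 0;
            all_digits = true;
        }
    }
    return count;
}
```

The trade-off is the one described above: this is fast and simple, but every new pattern needs new code, which is where a general regex library earns its keep.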

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Jun 27 '08 #2
brad wrote:
> // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35 /usr/local/lib/libboost_regex-gcc41-mt-s.a
> // g++ numbers.cpp -o numbers.exe -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

For starters, you could try adding some optimization flags, such as
-O3 and -march=<your architecture> (e.g. -march=pentium4).

(No, I don't know if that will make the regexp matching faster, but it
doesn't hurt to try.)
Jun 27 '08 #3
On Sun, 08 Jun 2008 12:32:30 -0400, brad <by*******@gmail.com> wrote:
> I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

Try PCRE.

--
Roland Pibinger
"The best software is simple, elegant, and full of drama" - Grady Booch
Jun 27 '08 #4
brad wrote:
> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

It's not necessarily slower, but most probably it is. This caught my
attention, so I did some tests. Your code mainly messes around with the
initialization stuff within the function. This has nothing to
do with Boost regex.

I modified your code to do the following:

- slurp (read into a buffer) a roughly 112 MB text file (actually,
  it's the Nietzsche full text, copied 8 times ;-)
- find all "free" numbers >= 10 (that have at least 2 digits and
  word boundaries on the left & right sides)
- show the total count of these numbers
- do the same in Perl.

The results (multicore results are "single-threaded"):

[Windows XP-32, Athlon-64/3200+,@2290MHz]
- Visual Studio 2008 + Boost 1.35.0 9.3 sec
- Perl 5.10 (Active-) 10.4 sec

[Linux 2.6.23, Pentium4,@2660MHz]
- gcc 4.3, -O2, Boost 1.33.1 13.2 sec
- Perl 5.8.8 8.2 sec

[Linux 2.6.23, Core2/Q6600,@3240MHz]
- gcc 4.3, -O2, Boost 1.33.1 6.3 sec
- Perl 5.8.8 (i586, use64bitint=undef) 3.2 sec

[Linux 2.6.24, Core2/Q9300,@3338MHz]
- gcc 4.3, -O2, Boost 1.34.1 'std::runtime_error' (??)
- Perl 5.10 (i586, use64bitint=undef) 10.4 sec

The latter system is not installed completely
(it's a test w/SuSE 11 Release Candidate),
so the results may get better soon there ;-)
Code, C++:
==>
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>

int number_count(const char* block, size_t len)
{
    boost::match_flag_type flags = boost::match_default;
    boost::regex reg("\\b\\d{2,}\\b");
    boost::cmatch m;

    const char *from = block, *to = block + len;
    int n = 0;
    while (boost::regex_search(from, to, m, reg, flags)) {
        from = m[0].second, ++n;
    }
    return n;
}

int main ()
{
    std::ifstream in("nietzsche8.txt");       // this is a 112 MB file,
                                              // it's 8 x the Nietzsche
    if (in) {                                 // fulltext in plain ASCII
        in.seekg(0, std::ios::end);           // get to EOF
        unsigned int len = in.tellg();        // read file pointer
        in.seekg(0, std::ios::beg);           // back to pos 0

        char *block = new char [len+1];       // don't be stingy
        in.read(block, len);                  // slurp the file
        int n = number_count(block, len);     // process data
        std::cout << "The text (" << len/1024 << "KB) has "
                  << n << " numbers >= 10!" << std::endl;
        delete [] block;                      // play fair
    }
    return 0;
}
<==
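
A side note on the "slurp" step: the seekg/tellg/read sequence above opens the file in text mode, where the reported position and the number of characters actually read can differ on some platforms. A minimal alternative sketch, assuming it is acceptable to hold the file in a std::string; slurp is a hypothetical helper name, and this is not necessarily the fastest way to read a file of this size:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads a whole file into a std::string.
// Returns an empty string if the file cannot be opened.
std::string slurp(const std::string& path)
{
    // Binary mode, so the bytes arrive untranslated.
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The result can be handed to the search as `text.data()` and `text.size()`, and no manual delete [] is needed.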

Code, Perl:

==>
open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;

my $n;
++$n while $block =~ /\b\d{2,}\b/g; # process data
print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";
<==

Regards

Mirco
Jun 27 '08 #5
On 8 Jun., 18:32, brad <byte8b...@gmail.com> wrote:
> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

As others have pointed out, there are probably two factors here:

- you might not be optimising your code. This can easily cause a
factor of 5-10.
- you might be measuring other parts of the library. I/O is the
obvious answer, and if you are using Microsoft's newer C++ compilers
you might also be caught by the checked ("secure") STL code that is only
disabled when you add a special define (_SECURE_SCL=0) to your build.

I would not expect this kind of code to be fast compared to e.g. Perl.
Perl is sort of built with regex in mind, and that part probably is
heavily optimised - maybe even written (partly) in assembly.
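
On the I/O point: std::endl flushes the stream after every match, which can dominate the run time when there are many matches. A minimal sketch of a variant that writes '\n' instead and leaves flushing to the stream. It uses std::regex from C++11 as a stand-in for boost::regex so that it compiles without Boost, and write_matches is a hypothetical name:

```cpp
#include <cstddef>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical variant of number_search: writes each match to `out`
// followed by '\n' (no flush) and returns the number of matches.
std::size_t write_matches(const std::string& line, const std::regex& re,
                          std::ostream& out)
{
    std::size_t n = 0;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        out << it->str() << '\n';
        ++n;
    }
    return n;
}
```

Called with std::cout as `out`, this behaves like the original loop minus the per-line flush. Passing a std::ostringstream instead makes it possible to time the matching without the console in the way.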

/Peter
Jun 27 '08 #6
On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
<pe***************@gmail.com> wrote:
> Perl is sort of built with regex in mind, and that part probably is
> heavily optimised - maybe even written (partly) in assembly.

Perl regex apparently is much slower than Tcl.

Jun 27 '08 #7
Razii wrote:
> On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
> <pe***************@gmail.com> wrote:
>> Perl is sort of built with regex in mind, and that part probably is
>> heavily optimised - maybe even written (partly) in assembly.
>
> Perl regex apparently is much slower than Tcl.

This is like saying: a rocket is much faster than an
airplane. It is true sometimes but means nothing.

From my own experience, P5-REs are much more versatile
compared to TCL-REs (P5-REs are not 'regular' anymore),
and in the hands of an experienced programmer, the speed
difference (which might be notable sometimes if many
alternations are involved) approaches zero.

For example - there used to be an algorithm-oriented language
implementation comparison (http://shootout.alioth.debian.org)
where you may find all sorts of results. In a reverse-DNA dump
test (http://shootout.alioth.debian.org/gp...vcomp&lang=all)
Perl completes in 2 seconds, TCL in 11 seconds. In another regex-heavy
test (http://shootout.alioth.debian.org/gp...xdna&lang=all),
TCL runs in 3.3 seconds, whereas the first (allowed) Perl
implementation comes in at 12 seconds. But, using a more
Perl-like approach (not allowed in this contest), the Perl
program (Perl #3, Perl #6 on the bottom) will complete in
1.2 seconds.

Regards

Mirco
Jun 27 '08 #8
On Tue, 10 Jun 2008 08:34:54 +0200, Mirco Wahab
<wa***@chemie.uni-halle.de> wrote:
> In another regex-heavy
> test (http://shootout.alioth.debian.org/gp...xdna&lang=all),
> TCL runs in 3.3 seconds, whereas the first (allowed) Perl
> implementation comes in at 12 seconds. But, using a more
> Perl-like approach (not allowed in this contest), the Perl
> program (Perl #3, Perl #6 on the bottom) will complete in
> 1.2 seconds.

How do you know that Tcl won't speed up and remain faster than Perl if
it's allowed to split the regex at |?
Jun 27 '08 #9
Mirco Wahab wrote:

I modified the expression:
...
boost::regex reg("\\b\\d{2,}\\b");
...
to:
...
boost::regex reg("\\b\\d\\d+\\b");
...

with tremendous improvements ("old" = figures quoted from my earlier
post, "new" = this run):

[Windows XP (32bit), Athlon-64/3200+ @2290MHz]
  old  Visual Studio 2008 + Boost 1.35.0          9.3 sec
  old  Perl 5.10 (Active-)                        10.4 sec
  new  Visual Studio 2008 + Boost 1.35.0          1.8 sec
  new  Perl 5.10.003 (AP, use64bitint=undef)      9.5 sec

[Linux 2.6.23 (32bit), Pentium4/NW @2660MHz]
  old  gcc 4.3, -O2, Boost 1.33.1                 13.2 sec
  old  Perl 5.8.8                                 8.2 sec
  new  gcc 4.3.1 -O2, Boost 1.33.1                1.2 sec (user)
  new  Perl 5.8.8 (32bit, use64bitint=undef)      6.2 sec (user)

[Linux 2.6.23 (32bit), Core2/Q6600 @3240MHz]
  old  gcc 4.3, -O2, Boost 1.33.1                 6.3 sec
  old  Perl 5.8.8 (i586, use64bitint=undef)       3.2 sec
  new  gcc 4.3.1 -O2, Boost 1.33.1                0.55 sec (user)
  new  Perl 5.8.8 (32bit, use64bitint=undef)      2.4 sec (user)

[Linux 2.6.24 (old) / Linux 2.6.25, 32bit (new), Core2/Q9300 @3338MHz]
  old  gcc 4.3, -O2, Boost 1.34.1                 'std::runtime_error' (??)
  old  Perl 5.10 (i586, use64bitint=undef)        10.4 sec
  new  gcc 4.3.1, -O3, Boost 1.34.1               0.42 sec (user) [*]
  new  Perl 5.10.0 (32bit, use64bitint=undef)     4.0 sec (user)

[*] = after kernel update & gcc update,
      g++ -O3 -c boostrg.cxx -o boostrg.o
      works now

modified Code, C++:
==>
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>

int number_count(const char *block, unsigned int len)
{
    boost::match_flag_type flags = boost::match_default;
    boost::regex reg("\\b\\d\\d+\\b");
    boost::cmatch what;

    const char *from = block, *to = block + len;
    int n = 0;
    while (boost::regex_search(from, to, what, reg, flags)) {
        from = what[0].second;
        ++n;
    }
    return n;
}

int main ()
{
    std::ifstream in("nietzsche8.txt");       // this is a 112 MB file,
                                              // it's 8 x the Nietzsche
    if (in) {                                 // fulltext in plain ASCII
        in.seekg(0, std::ios::end);           // get to EOF
        unsigned int len = in.tellg();        // read file pointer
        in.seekg(0, std::ios::beg);           // back to pos 0

        char *block = new char [len+1];       // don't be stingy
        in.read(block, len);                  // slurp the file
        int n = number_count(block, len);     // process data
        std::cout << "The text (" << len/1024 << "KB) has "
                  << n << " numbers >= 10!" << std::endl;
        delete [] block;                      // play fair
    }
    return 0;
}
<==

modified Code, Perl:
==>

open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;

my $n;
++$n while $block =~ /\b\d\d+\b/g; # process data
print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";

<==
At least for me, a very interesting difference.
With this change, Boost::Regex beats Perl by a significant margin.
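
Before trusting a speed-up like this, it seems worth confirming that both spellings of the pattern find the same matches. A minimal sketch of such a check, using std::regex (C++11) as a stand-in for boost::regex so it builds without Boost; count_matches is a hypothetical helper, and nothing here says std::regex shows the same timing difference:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// in `text`, scanning left to right.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

Running it with "\\b\\d{2,}\\b" and "\\b\\d\\d+\\b" on the same sample text should give equal counts. If it does not, the faster pattern is matching something different.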

Regards

Mirco
Jun 27 '08 #10
Razii wrote:
> How do you know that Tcl won't speed up and remain faster than Perl if
> it's allowed to split the regex at |?

It may or it may not. But the difference
will most probably approach zero, as I
tried to say.

Regards

Mirco
Jun 27 '08 #11
Mirco Wahab wrote:
> I modified the expression:
> ...
> boost::regex reg("\\b\\d{2,}\\b");
> ...
>
> to:
> ...
> boost::regex reg("\\b\\d\\d+\\b");

Wow... I changed my RE to use \\d nine times instead of \\d{9} and it's
now twice as fast. Amazing. I never would have thought of something as
simple as this. Thanks for the idea.
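
For anyone who would rather not type \\d nine times, the unrolled pattern can be built in a loop. A small sketch; unrolled_digits is a made-up helper name, and whether the unrolled form is faster will depend on the regex library and version:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds \b\d\d...\d\b with `n` copies of \d,
// i.e. the unrolled equivalent of \b\d{n}\b.
std::string unrolled_digits(std::size_t n)
{
    std::string pattern = "\\b";
    for (std::size_t i = 0; i < n; ++i)
        pattern += "\\d";
    pattern += "\\b";
    return pattern;
}
```

The returned string can be passed straight to the boost::regex constructor in place of the "\\b\\d{9}\\b" literal.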
Jun 27 '08 #12
On Tue, 10 Jun 2008 13:40:37 +0200, Mirco Wahab
<wa***@chemie.uni-halle.de> wrote:
>> How do you know that Tcl won't speed up and remain faster than Perl if
>> it's allowed to split the regex at |?
>
> It may or it may not. But the difference
> will most probably approach zero, as I
> tried to say.

How do you know when you have not even tried it yet? Perhaps Tcl will
still be twice as fast as Perl.

In any case, Tcl regex is faster

http://swtch.com/~rsc/regexp/regexp1.html

Jun 27 '08 #13
Razii wrote:
> How do you know when you have not even tried it yet? Perhaps Tcl will
> still be twice as fast as Perl.
> In any case, Tcl regex is faster
> http://swtch.com/~rsc/regexp/regexp1.html

I'm sure you have a reason to follow your opinion about that.
Despite that, I tested this on my box where I installed
a Tcl 8.4 into Cygwin and have a Tcl 8.5 from Activestate
around (Athlon-64/3200+).

[cygwin on WinXP 32bit]
$ time /usr/bin/tclsh84 boo.tcl ==> user 0m6.874s
$ time /usr/bin/perl boo.pl ==> user 0m4.155s

(BONUS #1: .tcl via XP-installed Active-Tcl 8.5)
$ time /cygdrive/d/Tcl/bin/tclsh85.exe boo.tcl ==> real 0m5.633s

(BONUS #2: C++, Win32-mingw-3.4.2 + Boost 1.33.1)
$ time dcboo/boostrg.exe ==> real 0m1.952s

So there is, regarding my implementation (I don't have
much experience in Tcl programming), a winner here.
Here's the code (the file in question is 112415 KB
containing 823968 numbers >= 10):

[Tcl] ==>

set fl [file size "nietzsche8.txt"]
set fh [open "nietzsche8.txt" r]
set block [read $fh $fl]
close $fh

set n [regexp -all {\y\d\d+\y} $block]
set k [expr {$fl / 1024}]
puts "The text ($k KB) has $n numbers >= 10!\n";

<==

[Perl] ==>

open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;

my $n;
++$n while $block =~ /\b\d\d+\b/g; # process data
print "The text (" . int(length($block)/1024) ."KB) has $n numbers >= 10!\n";

<==

Regards

Mirco
Jun 27 '08 #14
