Slow Regex Code

brad

Still learning C++. I'm writing some regex using boost. It works great.
Only thing is... this code seems slow to me compared to equivalent Perl
and Python. I'm sure I'm doing something incorrect. Any tips?

#include <boost/regex.hpp>
#include <iostream>
#include <string>

// g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
//     /usr/local/lib/libboost_regex-gcc41-mt-s.a
// g++ numbers.cpp -o numbers.exe
//     -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

void number_search(const std::string& portion)
{
    // Compiled once, on the first call.
    static const boost::regex Numbers("\\b\\d{9}\\b");
    static const boost::regex& rNumbers = Numbers;
    boost::smatch matches;

    std::string::const_iterator Start = portion.begin();
    std::string::const_iterator End = portion.end();

    // Print every 9-digit number in the line.
    while (boost::regex_search(Start, End, matches, rNumbers))
    {
        std::cout << matches.str() << std::endl;
        Start = matches[0].second;   // continue after the last match
    }
}

int main()
{
    std::string portion;
    while (std::getline(std::cin, portion))
    {
        number_search(portion);
    }
    return 0;
}

Jun 27 '08 #1


James Kanze

On Jun 8, 6:32 pm, brad <byte8b...@gmail.com> wrote:

> Still learning C++. I'm writing some regex using boost. It
> works great. Only thing is... this code seems slow to me
> compared to equivalent Perl and Python.

Seems slow, or is measurably slower? There are two
possibilities:

1. it only seems slower, because the rest of the code is
significantly faster, or

2. it really is slower, because Perl and Python can compile it
into some sort of efficient byte code, since they already
have an "execution" machine for such byte code loaded.

Note that pure (non-extended) regular expressions can be made to
run considerably faster, since they can be converted to a pure
DFA. My own regular expression class does this. For most
purposes, however, boost::regex will be fast enough, and worth
the added flexibility. (My own regular expression class was
designed for a very specific use, where it doesn't need the
extensions, but it does need some additional features which
aren't in Boost. For most general use, boost::regex is
preferable.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #2

Juha Nieminen

brad wrote:

> // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
> /usr/local/lib/libboost_regex-gcc41-mt-s.a
> // g++ numbers.cpp -o numbers.exe
> -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib

For starters, you could try adding some optimization flags, such as
-O3 and -march=<your architecture> (e.g. -march=pentium4).

(No, I don't know if that will make the regexp matching faster, but it
doesn't hurt to try.)

Jun 27 '08 #3

Roland Pibinger

On Sun, 08 Jun 2008 12:32:30 -0400, brad <by*******@gmail.com> wrote:

> I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

Try PCRE.

--
Roland Pibinger
"The best software is simple, elegant, and full of drama" - Grady Booch

Jun 27 '08 #4

Mirco Wahab

brad wrote:

> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?

It's not necessarily slower, but most probably it is. This caught my
attention, so I did some tests. Your code mostly spends its time on the
initialization stuff within the function. This has nothing to
do with boost regex.

I modified your code to do the following:

- slurp (read-into-buffer) a >120MB text file (actually,
  it's the Nietzsche full text, 8 times copied ;-)
- find all "free" numbers >= 10 (that have at least 2 digits
  and word boundaries on the left & right sides)
- show the total count of these numbers
- do the same in Perl.

The results (multicore results are "single-threaded"):

[Windows XP-32, Athlon-64/3200+, @2290MHz]
- Visual Studio 2008 + Boost 1.35.0            9.3 sec
- Perl 5.10 (Active-)                         10.4 sec

[Linux 2.6.23, Pentium4, @2660MHz]
- gcc 4.3, -O2, Boost 1.33.1                  13.2 sec
- Perl 5.8.8                                   8.2 sec

[Linux 2.6.23, Core2/Q6600, @3240MHz]
- gcc 4.3, -O2, Boost 1.33.1                   6.3 sec
- Perl 5.8.8 (i586, use64bitint=undef)         3.2 sec

[Linux 2.6.24, Core2/Q9300, @3338MHz]
- gcc 4.3, -O2, Boost 1.34.1                  'std::runtime_error' (??)
- Perl 5.10 (i586, use64bitint=undef)         10.4 sec

The latter system is not installed completely
(it's a test w/SuSE 11 Release Candidate),
so the results may get better soon there ;-)
Code, C++:
==>
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>

int number_count(const char *block, size_t len)
{
    boost::match_flag_type flags = boost::match_default;
    boost::regex reg("\\b\\d{2,}\\b");
    boost::cmatch m;

    const char *from = block, *to = block + len;
    int n = 0;
    while( boost::regex_search(from, to, m, reg, flags) ) {
        from = m[0].second, ++n;
    }
    return n;
}

int main ()
{
    std::ifstream in("nietzsche8.txt");      // this is a 112 MB file,
                                             // it's 8 x the Nietzsche
    if(in) {                                 // fulltext in plain ASCII
        in.seekg(0, std::ios::end);          // get to EOF
        unsigned int len = in.tellg();       // read file pointer
        in.seekg(0, std::ios::beg);          // back to pos 0

        char *block = new char [len+1];      // don't be stingy
        in.read(block, len);                 // slurp the file
        int n = number_count(block, len);    // process data
        std::cout << "The text (" << len/1024 << "KB) has "
                  << n << " numbers >= 10!" << std::endl;
        delete [] block;                     // play fair
    }
    return 0;
}
<==

Code, Perl:

==>
open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;

my $n;
++$n while $block =~ /\b\d{2,}\b/g; # process data
print "The text (" . int(length($block)/1024) . "KB) has $n numbers >= 10!\n";
<==

Regards

Mirco

Jun 27 '08 #5

peter koch

On 8 Jun., 18:32, brad <byte8b...@gmail.com> wrote:

> Still learning C++. I'm writing some regex using boost. It works great.
> Only thing is... this code seems slow to me compared to equivalent Perl
> and Python. I'm sure I'm doing something incorrect. Any tips?
>
> #include <boost/regex.hpp>
> #include <iostream>
>
> // g++ numbers.cpp -o numbers -I/usr/local/include/boost-1_35
> /usr/local/lib/libboost_regex-gcc41-mt-s.a
> // g++ numbers.cpp -o numbers.exe
> -Ic://Boost/include/boost-1_35://Boost/lib/libboost_regex-mgw34-mt-s.lib
>
> void number_search(const std::string& portion)
> {
>     static const boost::regex Numbers("\\b\\d{9}\\b");
>     static const boost::regex& rNumbers = Numbers;
>     boost::smatch matches;
>
>     std::string::const_iterator Start = portion.begin();
>     std::string::const_iterator End = portion.end();
>
>     while (boost::regex_search(Start, End, matches, rNumbers))
>     {
>         std::cout << matches.str() << std::endl;
>         Start = matches[0].second;
>     }
> }
>
> int main ()
> {
>     std::string portion;
>     while (std::getline(std::cin, portion))
>     {
>         number_search(portion);
>     }
>     return 0;
> }

As others have pointed out, there are probably two factors here:

- you might not be optimising your code. This can easily cost a
  factor of 5-10 in speed.
- you might be measuring other parts of the library. I/O is the
  obvious answer, and if you are using Microsoft's newer C++ compilers
  you might also be caught by the secure STL code that is only disabled
  when you add a special define to your build.

I would not expect this kind of code to be fast compared to e.g. Perl.
Perl is sort of built with regex in mind, and that part probably is
heavily optimised - maybe even written (partly) in assembly.

/Peter

Jun 27 '08 #6

Razii

On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
<pe***************@gmail.com> wrote:

> Perl is sort of built with regex in mind, and that part probably is
> heavily optimised - maybe even written (partly) in assembly.

Perl regex apparently is much slower than Tcl.

Jun 27 '08 #7

Mirco Wahab

Razii wrote:

> On Mon, 9 Jun 2008 14:36:52 -0700 (PDT), peter koch
> <pe***************@gmail.com> wrote:
>> Perl is sort of built with regex in mind, and that part probably is
>> heavily optimised - maybe even written (partly) in assembly.
>
> Perl regex apparently is much slower than Tcl.

This is like saying: a rocket is much faster than an
airplane. It is true sometimes but means nothing.

From my own experience, P5-REs are much more versatile
compared to TCL-RE (P5-REs are not 'regular' anymore)
and in the hands of an experienced programmer, this
difference (which might be notable sometimes if many
alternations are involved) approaches zero.

For example - there used to be an algorithm oriented language
implementation comparison (http://shootout.alioth.debian.org)
where you may find all sorts of results. In a reverse-DNA dump
test (http://shootout.alioth.debian.org/gp...vcomp&lang=all)
Perl completes in 2 seconds, TCL in 11 seconds. In another
Regex-heavy test (http://shootout.alioth.debian.org/gp...xdna&lang=all),
TCL runs in 3.3 seconds, whereas the first (allowed) Perl
implementation comes in at 12 seconds. But, using a more
Perl-like approach (not allowed in this contest), the Perl
program (Perl #3, Perl #6 on the bottom) will complete in
1.2 seconds.

Regards

Mirco

Jun 27 '08 #8

Razii

On Tue, 10 Jun 2008 08:34:54 +0200, Mirco Wahab
<wa***@chemie.uni-halle.de> wrote:

> In another Regex-heavy test
> (http://shootout.alioth.debian.org/gp...xdna&lang=all),
> TCL runs in 3.3 seconds, whereas the first (allowed) Perl
> implementation comes in at 12 seconds. But, using a more
> Perl-like approach (not allowed in this contest), the Perl
> program (Perl #3, Perl #6 on the bottom) will complete in
> 1.2 seconds.

How do you know that Tcl won't speed up and remain faster than Perl if
it's allowed to split the regex at | ?

Jun 27 '08 #9

Mirco Wahab

Mirco Wahab wrote:

I modified the expression:

...
boost::regex reg("\\b\\d{2,}\\b");
...

to:
...
boost::regex reg("\\b\\d\\d+\\b");
...

with tremendous improvements:

> [Windows XP-32, Athlon-64/3200+, @2290MHz]
> - Visual Studio 2008 + Boost 1.35.0            9.3 sec
> - Perl 5.10 (Active-)                         10.4 sec

[Windows XP(32bit), Athlon-64/3200+ @2290MHz]
Visual Studio 2008 + Boost 1.35.0                1.8 sec
Perl 5.10.003 (AP, use64bitint=undef)            9.5 sec

> [Linux 2.6.23, Pentium4, @2660MHz]
> - gcc 4.3, -O2, Boost 1.33.1                  13.2 sec
> - Perl 5.8.8                                   8.2 sec

[Linux 2.6.23(32bit), Pentium4/NW @2660MHz]
gcc 4.3.1 -O2, Boost 1.33.1                      1.2 sec (user)
Perl 5.8.8 (32bit, use64bitint=undef)            6.2 sec (user)

> [Linux 2.6.23, Core2/Q6600, @3240MHz]
> - gcc 4.3, -O2, Boost 1.33.1                   6.3 sec
> - Perl 5.8.8 (i586, use64bitint=undef)         3.2 sec

[Linux 2.6.23(32bit), Core2/Q6600, @3240MHz]
gcc 4.3.1 -O2, Boost 1.33.1                      0.55 sec (user)
Perl 5.8.8 (32bit, use64bitint=undef)            2.4 sec (user)

> [Linux 2.6.24, Core2/Q9300, @3338MHz]
> - gcc 4.3, -O2, Boost 1.34.1                  'std::runtime_error' (??)
> - Perl 5.10 (i586, use64bitint=undef)         10.4 sec

[Linux 2.6.25(32bit), Core2/Q9300, @3338MHz]
gcc 4.3.1, -O3, Boost 1.34.1                     0.42 sec (user) [*]
Perl 5.10.0 (32bit, use64bitint=undef)           4.0 sec (user)

[*] = after kernel update & gcc update,
      g++ -O3 -c boostrg.cxx -o boostrg.o
      works now

modified Code, C++:
==>
#include <boost/regex.hpp>
#include <fstream>
#include <iostream>

int number_count(const char *block, unsigned int len)
{
    boost::match_flag_type flags = boost::match_default;
    boost::regex reg("\\b\\d\\d+\\b");
    boost::cmatch what;

    const char *from = block, *to = block + len;
    int n = 0;
    while( boost::regex_search(from, to, what, reg, flags) ) {
        from = what[0].second;
        ++n;
    }
    return n;
}

int main ()
{
    std::ifstream in("nietzsche8.txt");      // this is a 112 MB file,
                                             // it's 8 x the Nietzsche
    if(in) {                                 // fulltext in plain ASCII
        in.seekg(0, std::ios::end);          // get to EOF
        unsigned int len = in.tellg();       // read file pointer
        in.seekg(0, std::ios::beg);          // back to pos 0

        char *block = new char [len+1];      // don't be stingy
        in.read(block, len);                 // slurp the file
        int n = number_count(block, len);    // process data
        std::cout << "The text (" << len/1024 << "KB) has "
                  << n << " numbers >= 10!" << std::endl;
        delete [] block;                     // play fair
    }
    return 0;
}
<==

modified Code, Perl:
==>

open my $fh, '<', 'nietzsche8.txt' or die "what? $!";
my $block;
do { local $/; $block = <$fh> };
close $fh;

my $n;
++$n while $block =~ /\b\d\d+\b/g; # process data
print "The text (" . int(length($block)/1024) . "KB) has $n numbers >= 10!\n";

<==
At least for me, a very interesting difference.
With the modified expression, Boost::Regex beats Perl
by a significant margin.

Regards

Mirco

Jun 27 '08 #10
