Bytes IT Community

Standard C Library regex performance issue

I have a small utility program written in Python which runs quite
slowly, so I've decided to reimplement it in C.
I did some benchmarking of the Python code's performance. One part
of the program uses Python's standard re (regular expressions)
module to parse the input file. As Python's routines for reading from a
file and matching regular expressions are most likely implemented via
native libraries, I would expect the C code, which reads and parses the
file using exactly the same scheme, to show approximately the same
performance (or maybe better).
I was surprised to find that the C code runs far slower
(actually about 300 times slower!) than the same code in Python.
I am running it all under Unix (I guess the exact version should not
really matter) and am using gcc with -O2 to compile the C code.

The code does exactly the same in both languages:

1. includes a regexp library (module re in Python, regex.h in C)
2. compiles the expression (the same expression is used, with small
differences caused by the slightly different syntax accepted by the
libraries)
3. reads the input file line by line
4. parses the line using the compiled regexp (in Python it's a call
to .match(...), in C a call to regexec(...)).
NOTHING MORE!

Does anyone know what's the problem?

Feb 16 '07 #1
10 Replies


ig*********@gmail.com writes:
> I was surprised to find that the C code runs far slower
> (actually about 300 times slower!) than the same code in Python.
> (...)
> The code does exactly the same in both languages:
>
> 1. includes a regexp library (module re in Python, regex.h in C)

These probably use different regexp implementations with quite different
semantics. Maybe you didn't translate your Python regexps to the
equivalent regex.h regexps. Or maybe you use some very inefficient
regexps which Python's re manages to optimize but regex.h does not.
One can write _very_ slow regexps with great ease.

> 2. compiles the expression (the same expression is used, with small
> differences caused by the slightly different syntax accepted by the
> libraries)

Hopefully you do this just once, outside the loop?

> 3. reads the input file line by line

How? One fgets() into a char buffer[], or something more clever?

> 4. parses the line using the compiled regexp (in Python it's a call
> to .match(...), in C a call to regexec(...)).
> NOTHING MORE!
>
> Does anyone know what's the problem?

Might also be related to something quite else in your code, which you
haven't mentioned. Hard to guess when you don't post your code.

--
Hallvard
Feb 16 '07 #2

ig*********@gmail.com wrote:
> The code does exactly the same in both languages:
>
> 1. includes a regexp library (module re in Python, regex.h in C)
> 2. compiles the expression (the same expression is used, with small
> differences caused by the slightly different syntax accepted by the
> libraries)
> 3. reads the input file line by line
> 4. parses the line using the compiled regexp (in Python it's a call
> to .match(...), in C a call to regexec(...)).
> NOTHING MORE!
>
> Does anyone know what's the problem?

See if Python was compiled to use PCRE or not. If so, then your version using
regcomp and regexec is not the same. comp.unix.programmer territory, btw.
Feb 16 '07 #3

On 16 Feb, 14:57, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
wrote:
> igor.kul...@gmail.com writes:
> > I was surprised to find that the C code runs far slower
> > (actually about 300 times slower!) than the same code in Python.
> These probably use different regexp implementations with quite different
> semantics. Maybe you didn't translate your Python regexps to the
> equivalent regex.h regexps.

I've actually tried running the regexps which I adjusted for regex.h
using regex.h, and they match what I would want them to match
perfectly.

> Or maybe you use some very inefficient
> regexps which Python re manages to optimize but regex.h does not.
> One can write _very_ slow regexps with great ease.

That might be true. Still, the regexp, in spite of being very long, should
be very straightforward.
Here is the regexp (I would understand if no one read it):

^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$

It simply matches a line in the log file and has no fancy stuff
involved.
> > 2. compiles the expression (the same expression is used, with small
> > differences caused by the slightly different syntax accepted by the
> > libraries)
>
> Hopefully you do this just once, outside the loop?

Sure I do.

> > 3. reads the input file line by line
>
> How? One fgets() into a char buffer[], or something more clever?

The way I read the file should not matter, as I've tried running the
read-file-line-by-line code separately (I commented out the regexp
stuff) and it ran really fast.

> > 4. parses the line using the compiled regexp (in Python it's a call
> > to .match(...), in C a call to regexec(...)).
> > NOTHING MORE!
> > Does anyone know what's the problem?
>
> Might also be related to something quite else in your code, which you
> haven't mentioned. Hard to guess when you don't post your code.

Here is the Python code I've benchmarked:

import re

PATTERN = re.compile(r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")

fp = open("some.log", "r")

for line in fp:
    mo = PATTERN.match(line)

fp.close()

And here is the C code:

#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strncpy */
#include <regex.h>

// Just a helper function.
char * read_line(FILE * in) {
    size_t line_len;
    char * buf;
    buf = fgetln(in, &line_len);

    if (!buf) return NULL;

    while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
           buf[line_len - 1] == (char) 13)) line_len--;

    char *line = malloc(line_len + 1);

    strncpy(line, buf, line_len);

    line[line_len] = (char) 0;

    return line;
}

int main(int argc, char **argv) {
    regex_t regex;
    int errc = regcomp(&regex, "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$", REG_EXTENDED | REG_ICASE);
    if (errc) {
        // error processing logic.
        // never happens, as the regex compiles normally.
    }

    int n = regex.re_nsub + 1;
    regmatch_t *pmatch = malloc(n * sizeof(regmatch_t));

    FILE * in = fopen(argv[1], "r");

    while (1) {
        char * line = read_line(in);
        if (!line) break;
        regexec(&regex, line, n, pmatch, 0); // I commented out this line
                                             // to test reading performance
        free(line);
    }

    fclose(in);

    return 0;
}

Feb 16 '07 #4

ig*********@gmail.com wrote:
> That might be true. Still, the regexp, in spite of being very long, should
> be very straightforward.
> Here is the regexp (I would understand if no one read it):
>
> ^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$

This is hideous.

> It simply matches a line in the log file and has no fancy stuff
> involved.

Lots of fancy stuff involved for dubious reasons.

> The way I read the file should not matter, as I've tried running the
> read-file-line-by-line code separately (I commented out the regexp
> stuff) and it ran really fast.

So you're saying it runs very fast when you read a file line by line but
doesn't run fast when you don't? Well yes, then of course it matters.
> Here is the Python code I've benchmarked:
>
> import re
>
> PATTERN = re.compile(r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")

----
You can set regex matching modes by specifying a special constant as a third
parameter to re.search(). re.I or re.IGNORECASE applies the pattern case
insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or
re.MULTILINE makes the caret and dollar match after and before line breaks in
the subject string. There is no difference between the single-letter and
descriptive options, except for the number of characters you have to type in.
To specify more than one option, "or" them together with the | operator:
re.search("^a", "abc", re.I | re.M).

By default, Python's regex engine only considers the letters A through Z, the
digits 0 through 9, and the underscore as "word characters". Specify the flag
re.L or re.LOCALE to make \w match all characters that are considered letters
given the current locale settings. Alternatively, you can specify re.U or
re.UNICODE to treat all letters from all scripts as word characters. The
setting also affects word boundaries.
----

The above implies that Python's newline mode is *ON* by default. POSIX
regcomp() is NOT newline-on by default.

> fp = open("some.log", "r")
>
> for line in fp:
>     mo = PATTERN.match(line)
>
> fp.close()

Which is line by line.
> And here is the C code:
>
> #include <stdio.h>
> #include <regex.h>
> // Just a helper function.
> char * read_line(FILE * in) {
>     size_t line_len;
>     char * buf;
>     buf = fgetln(in, &line_len);
>
>     if (!buf) return NULL;
>
>     while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
>            buf[line_len - 1] == (char) 13)) line_len--;

Remove that. There's no reason you should have to do any of that if fgetln()
is doing what it's told.

    DESCRIPTION
    The fgetln() function returns a pointer to the next line from the stream
    referenced by stream. This line is not a C string as it does not end
    with a terminating NUL character. The length of the line, including the
    final newline, is stored in the memory location to which len points.
    (Note, however, that if the line is the last in a file that does not end
    in a newline, the returned text will not contain a newline.)

Looks like some BSD 4.4 function. It also looks like it operates on a FILE
stream and uses a static pointer of some sort. In short, please just avoid
this function altogether and use fgets().

> char *line = malloc(line_len + 1);
>
> strncpy(line, buf, line_len);
>
> line[line_len] = (char) 0;
>
> return line;
> }

Right. So the issue here is that you don't know your maximum line length,
which is probably what led you to find a function like fgetln() in the first
place. This is one area where you've either got to establish a reasonable
boundary size and use that as the size of your temporary buffer, or use
fread() and do the buffer management yourself. What this means is: if you do
not foresee any line being longer than, let's say, 1024 characters, use a
simple temporary buffer of 1024 chars and throw an error when you hit the
maximum line length.
> int main(int argc, char **argv) {
>     regex_t regex;
>     int errc = regcomp(&regex, "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$", REG_EXTENDED | REG_ICASE);

Add REG_NEWLINE.
Please remove "\\n?" from your regex.

Also, your regex:
^<0 or 1 matches of a formatted string>: <0 or more chars>$
is not the most efficient use of regex, and you should probably examine your
logfile format as well.

However, the problem to me is that you did not set REG_NEWLINE.
Feb 16 '07 #5

On Feb 16, 5:36 am, igor.kul...@gmail.com wrote:
> The code does exactly the same in both languages:
>
> 1. includes a regexp library (module re in Python, regex.h in C)

Is regex.h a standard?

--
Zack
Feb 16 '07 #6

"Zack" <go********@gmail.com> writes:
> Is regex.h a standard?

Yes (SUSv3, for example). It is not, however, part of the C
standard.
--
Ben Pfaff
bl*@cs.stanford.edu
http://benpfaff.org
Feb 16 '07 #7

Just in case someone faces the same problem: it seems I've been
able to find the answer to the question "why is Unix's regex from
libc so slow".

The answer is:
Because it really is that slow. It seems to be just a very slow
implementation of the library (and by "very slow" I mean "very, very,
very slow": several hundred times slower than a normal
implementation).
What I did is simply replace this library with the POSIX wrapper of the
PCRE (Perl Compatible Regular Expressions) library (which is, btw,
completely free and can be used in commercial projects). That alone
gave me a dramatic performance gain.

I've also heard that there is a GNU implementation of the regex library
which is even faster than PCRE, but it's under the GPL.

Feb 20 '07 #8

On Feb 20, 6:34 pm, igor.kul...@gmail.com wrote:
> Just in case someone faces the same problem: it seems I've been
> able to find the answer to the question "why is Unix's regex from
> libc so slow".
>
> The answer is:
> Because it really is that slow. It seems to be just a very slow
> implementation of the library (and by "very slow" I mean "very, very,
> very slow": several hundred times slower than a normal
> implementation).

Although this is heavily off-topic, I'm still interested in the
particular Unix flavour you're using. The reason I'm asking is your
statement "I guess the version should not really matter", which is an
underestimation. True, they differed more in the past, but it's still
a long road to SUS/POSIX.

> What I did is simply replace this library with the POSIX wrapper of the
> PCRE (Perl Compatible Regular Expressions) library (which is, btw,
> completely free and can be used in commercial projects). That alone
> gave me a dramatic performance gain.

PCRE isn't a POSIX library; it only implements a "set of wrapper
functions that correspond to the POSIX regular expression API".

> I've also heard that there is a GNU implementation of the regex library
> which is even faster than PCRE, but it's under the GPL.

That shouldn't be a problem, should it?
--
WYCIWYG - what you C is what you get

Feb 20 '07 #9

On Fri, 16 Feb 2007 02:36:46 -0800, igor.kulkin wrote:
<snip>
> I was surprised to find that the C code runs far slower
> (actually about 300 times slower!) than the same code in Python.
> I am running it all under Unix (I guess the version should not really
> matter) and am using gcc with -O2 to compile the C code.

A 300x difference likely means there are several things going on here at
once. You should take this over to comp.unix.programmer. But, FWIW,
it's well known (by those who care to know) that many/most POSIX/SUS regex
implementations are significantly slower than both Perl's
regex implementation and the PCRE library (which I assume Python is
using).
Feb 20 '07 #10

ig*********@gmail.com wrote:
> .... snip ...
>
> I've also heard that there is a GNU implementation of the regex
> library which is even faster than PCRE, but it's under the GPL.

So? That doesn't prevent you from using it. It doesn't even make
you publish your source code. It does make you publish linkable
binaries, so that improved libraries can be linked. And that only
if you publish/sell your code. You don't even lose your copyright.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of the structure of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
Feb 21 '07 #11
