Bytes | Software Development & Data Engineering Community

RegExp performance?

Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"

(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)

however, on my 3200+ amd64, running the following:

reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
isbn10s = reISBN10.findall(contents)

(where contents is the string)

this takes about 14 minutes - and there are only one or two matches...

if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds

So what's the deal? - why would it take so long to run the correct one?
- especially when a slight modification makes it run as fast as I'd
expect from the beginning...
I'm sorry I cannot supply test data, in my case, it comes from
copyrighted material - however if it proves needed, I can probably
construct dummy data to illustrate the problem
Any and all guidance would be greatly appreciated,
kind regards
Christian Sonne

PS: be gentle - it's my first post here :-)
Feb 25 '07 #1
On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <Fr*******@gmail.com>
wrote:
Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"
Why the .* at the start and end? You don't want to match those, and they
make your regexp slow.
You didn't say exactly what an ISBN-10 number looks like, but if you want
to match 10 digits, or 9 digits followed by an X:
reISBN10 = re.compile("\d{10}|\d{9}X")
That is, just the () group in your expression. But perhaps this other one
is better (I think it should be faster, but you should measure it):
reISBN10 = re.compile("\d{9}[\dX]")
("Nine digits followed by another digit or an X")
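A quick check that both suggested patterns pull out the same candidates
(the sample text here is made up):

```python
import re

# Made-up sample text with one all-digit ISBN and one ending in X:
text = "see ISBN 0672328976 and ISBN 043942089X for details"

print(re.findall(r"\d{10}|\d{9}X", text))
print(re.findall(r"\d{9}[\dX]", text))
# both print ['0672328976', '043942089X']
```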
if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds
Using my suggested expressions you might match some garbage, but not lose
anything (except two ISBN numbers joined together without any separator in
between). Assuming you have stripped all the "-", as you said.
So what's the deal? - why would it take so long to run the correct one?
- especially when a slight modification makes it run as fast as I'd
expect from the beginning...
Those .* make the expression match a LOT of things at first, just to
discard it in the next step.

--
Gabriel Genellina

Feb 25 '07 #2
In <45***********************@news.sunsite.dk>, Christian Sonne wrote:
Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"

(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)

however, on my 3200+ amd64, running the following:

reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
isbn10s = reISBN10.findall(contents)

(where contents is the string)

this takes about 14 minutes - and there are only one or two matches...
First of all try to get rid of the '.*' at both ends of the regexp. Don't
let the re engine search for any characters that you are not interested in
anyway.

Then leave off the '*' after '\D'. It doesn't matter if there are
multiple non-digits before or after the ISBN; there just has to be at
least one. BTW, with the star it even matches *no* non-digit at all!

So the re looks like this: '\D(\d{10}|\d{9}X)\D'
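A quick sanity check (the sample string is made up):

```python
import re

# Made-up sample: two ISBNs, each surrounded by non-digits.
sample = " 0672328976 and 043942089X "
print(re.findall(r"\D(\d{10}|\d{9}X)\D", sample))
# prints ['0672328976', '043942089X']
```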

Ciao,
Marc 'BlackJack' Rintsch
Feb 25 '07 #3
On Feb 25, 7:21 pm, Christian Sonne <FreakC...@gmail.com> wrote:
Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"
All of those *s are making it work too hard.

Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
r"..." not just "....", don't you?]

Step 1: Lose the .* off each end -- this is meaningless in the context
of a search() or findall() and would slow the re engine down if it
doesn't optimise it away.

r"\D*(\d{10}|\d{9}X)\D*"

Step 2: I presume that the \D* at each (remaining) end is to ensure
that you don't pick up a number with 11 or more digits. You only need
to test that your presumed ISBN is not preceded/followed by ONE
suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b
instead of \D

r"\b(\d{10}|\d{9}X)\b"

Step 3: Now that we have only \b (which matches 0 characters) at each
end of the ISBN, we can swap the capturing () for a non-capturing group
(dropping the parentheses entirely would let the alternation escape the
\b anchors)

r"\b(?:\d{10}|\d{9}X)\b"

Step 4: In the case of 123456789X, it fails on the X and then scans
the 123456789 again -- a tiny waste compared to all the * stuff, but
still worth fixing.

r"\b\d{9}[0-9X]\b"

Give that a whirl and let us know how correct and how fast it is.
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)

however, on my 3200+ amd64, running the following:

reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
You should really get into the habit of using raw strings with re.
isbn10s = reISBN10.findall(contents)

(where contents is the string)

this takes about 14 minutes - and there are only one or two matches...
How many actual matches and how many expected matches?

Note on "and there are only one or two matches": if your data
consisted only of valid ISBNs separated by a single space or comma, it
would run much faster. It is all the quadratic/exponential mucking
about with the in-between bits that slows it down. To demonstrate
this, try timing dummy data like "1234567890 " * 1000000 and
"123456789X " * 1000000 with your various regexes, and with the step1,
step2 etc regexes above.
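For instance, a scaled-down version of that experiment (the match counts
are the point here, since absolute times vary by machine):

```python
import re

# Dummy data as above, scaled down from 1000000 to 1000 repetitions:
data = "1234567890 " * 1000

slow = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")  # the original pattern
fast = re.compile(r"\b\d{9}[0-9X]\b")            # the step-4 pattern

# The greedy .* swallows the whole line, so findall reports a single
# match; the anchored pattern finds every token without backtracking.
print(len(slow.findall(data)))  # 1
print(len(fast.findall(data)))  # 1000
```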
if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds

So what's the deal? - why would it take so long to run the correct one?
Because you have .*\D* - in other words, 0 or more occurrences of almost
anything followed by 0 or more occurrences of almost anything. Even
assuming it ignores the .*, it will find the longest possible sequence
of non-digits, then try to match the ISBN stuff. If it fails, it will
shorten that sequence of non-digits, try the ISBN stuff again, etc. etc.
until it matches the ISBN stuff or that sequence of non-digits is down
to zero length. It will do that for each character position in the
file contents. Why is it faster when you change \D to [ ]? Presumably
because in your data, sequences of non-digits are longer than
sequences of spaces, IOW there is less stuff to back-track over.

- especially when a slight modification makes it run as fast as I'd
expect from the beginning...

I'm sorry I cannot supply test data, in my case, it comes from
copyrighted material - however if it proves needed, I can probably
construct dummy data to illustrate the problem
What you call "dummy data", I'd call "test data". You should make a
sample of "dummy data" and test that your regex is (a) finding all
ISBNs that it should (b) not reporting incorrect matches, *BEFORE*
being concerned with speed.
Any and all guidance would be greatly appreciated,
For some light reading :-) borrow a copy of Friedl's book (mentioned
in the Python re docs) and read the parts on how backtracking regex
engines work.

HTH,
John

Feb 25 '07 #4
John Machin wrote:
[snip - full quote of post #4 above]
Thanks to all of you for your replies - they have been most helpful, and
my program is now running at a reasonable pace...
I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
turns out to misbehave in further testing, I'll know where to turn :-P
Feb 25 '07 #5
In article <45***********************@news.sunsite.dk>,
Christian Sonne <Fr*******@gmail.com> wrote:
Thanks to all of you for your replies - they have been most helpful, and
my program is now running at a reasonable pace...
I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
turns out to misbehave in further testing, I'll know where to turn :-P
Anything with variable-length wildcard matching (*+?) is going to
drag your performance down. There was an earlier thread on this very
topic. Another question is how you are planning on handling
ISBNs formatted with hyphens for readability.

In general I've found the following factors to be critical in
getting good performance from re:

1: Reducing the number of times you call re.match or re.search.
2: Reducing the number of bytes that re has to search through.
3: Minimizing the use of wildcards in the expression.

If you can pre-filter your input with string.find before running
re.match you will improve performance quite a bit compared to
running re expressions over all 10 pages. I played around a bit and
attached some example code below searching over 21K of text for the
ISBN number. testPrefilter() runs about 1/5th the execution time of
line-by-line re calls or a single re call over a 21K string.
Interestingly this ratio scales up to something as big as Moby
Dick.

The searchLabels() function below beats all the other functions by
searching for "ISBN" or "International Book" and then using RE on
the surrounding 500 bytes. You might also try searching for
"Copyright" or "Library of Congress" since most modern books will
have it all on the same page.

A caveat here is that this only works if you can find a reasonably
unique string at or near what you want to find with re. If you need
to run re.search on every byte of the file anyway, this isn't going
to help.

---------------
timing test code
---------------

#!/usr/bin/env python

from timeit import Timer
import re

textString = """The text of a sample page using with an ISBN 10
number ISBN 0672328976 and some more text to compare."""

# add the full text of Moby Dick to make the search functions
# work for their bread.
fileHandle = open("./mobey.txt")
junkText = fileHandle.readlines()
junkText.append(textString)
textString = ''.join(junkText)
#print textString

# compile the regex
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def testPrefilter():
    """Work through a pre-loaded array running re only on lines
    containing ISBN."""
    for line in junkText:
        # search for 'ISBN' with the fast string method first
        if line.find('ISBN') != -1:
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)

def testNofilter():
    """Run re.search on every line."""
    for line in junkText:
        # searching using RE only
        thisMatch = isbn10Re.search(line)
        if thisMatch:
            return thisMatch.group(0)

def testFullre():
    """Run re.search on a long text string."""
    thisMatch = isbn10Re.search(textString)
    if thisMatch:
        return thisMatch.group(0)

def searchLabels():
    # identify some text that might be near an ISBN number.
    isbnLabels = ["ISBN",
                  "International Book"]

    # use the fast string.find method to locate those
    # labels in the text
    isbnIndexes = [textString.find(x) for x in isbnLabels]

    # exclude labels not found in the text.
    isbnIndexes = [y for y in isbnIndexes if y != -1]

    # run re.search on a 500 character window around the text label
    # (clamped at 0 so a hit near the start doesn't wrap around).
    for x in isbnIndexes:
        thisMatch = isbn10Re.search(textString[max(x - 250, 0):x + 250])
        return thisMatch.group(0)

#print searchLabels()
#print testPrefilter()
#print testNofilter()
t = Timer("testNofilter()", "from __main__ import testNofilter")
print t.timeit(100)
u = Timer("testPrefilter()", "from __main__ import testPrefilter")
print u.timeit(100)
v = Timer("testFullre()", "from __main__ import testFullre")
print v.timeit(100)
w = Timer("searchLabels()", "from __main__ import searchLabels")
print w.timeit(100)
Feb 26 '07 #6
On Feb 26, 2:01 pm, Kirk Sluder <k...@nospam.jobsluder.net> wrote:
In article <45e1d367$0$90273$14726...@news.sunsite.dk>,
Christian Sonne <FreakC...@gmail.com> wrote:
Thanks to all of you for your replies - they have been most helpful, and
my program is now running at a reasonable pace...
I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
turns out to misbehave in further testing, I'll know where to turn :-P

Anything with variable-length wildcard matching (*+?) is going to
drag your performance down. There was an earlier thread on this very
topic. Another question is how you are planning on handling
ISBNs formatted with hyphens for readability.
According to the OP's first message, 2nd paragraph:
"""
(it should be noted that I've removed all '-'s in the string, because
they have a tendency to be mixed into ISBN's)
"""

Given a low density of ISBNs in the text, it may well be better to
avoid the preliminary pass to rip out the '-'s, and instead:

1. use an RE like r"\b\d[-\d]{8,11}[\dX]\b" (allows up to 3 '-'s
inside the number)

2. post-process the matches: strip out any '-'s, check for remaining
length == 10.
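A sketch of that two-step idea (the sample text is made up):

```python
import re

# Step 1: allow up to three '-'s inside the candidate number.
pat = re.compile(r"\b\d[-\d]{8,11}[\dX]\b")
text = "see ISBN 0-672-32897-6 for details"  # made-up sample line

# Step 2: strip the '-'s and keep only candidates of length 10.
isbns = [c.replace("-", "") for c in pat.findall(text)
         if len(c.replace("-", "")) == 10]
print(isbns)  # ['0672328976']
```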

Another thought for the OP: Consider (irrespective of how you arrive
at a candidate ISBN) validating the ISBN check-digit.
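For example, a minimal check-digit validator using the standard ISBN-10
mod-11 rule (the sample inputs are illustrative):

```python
def is_valid_isbn10(candidate):
    """Standard ISBN-10 check: 10*d1 + 9*d2 + ... + 1*d10 must be
    divisible by 11, where a trailing 'X' counts as 10."""
    s = candidate.replace("-", "")
    if len(s) != 10:
        return False
    total = 0
    for i, ch in enumerate(s):
        if ch == "X" and i == 9:   # 'X' is only legal as the check digit
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += (10 - i) * value
    return total % 11 == 0

print(is_valid_isbn10("0672328976"))  # True
print(is_valid_isbn10("1234567890"))  # False
```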

Cheers,
John

Feb 26 '07 #7
