String search vs regexp search

Anand Pillai

To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr ) != -1:
domything()

And the regexp search assuming no case restriction would be,

strre=re.compil e(substr, re.IGNORECASE)

m=strre.search( str)
if m:
domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Thanks folks!

-Anan

Jul 18 '05 #1

Subscribe Reply

39353

Duncan Booth

py*******@Hotpo p.com (Anand Pillai) wrote in
news:84******** *************** **@posting.goog le.com:

To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr ) != -1:
domything()

And the regexp search assuming no case restriction would be,

strre=re.compil e(substr, re.IGNORECASE)

m=strre.search( str)
if m:
domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Yes. The answer is 'it all depends'.

Things it depends on include:

Your two bits of code do different things, one is case sensitive, one
ignores case. Which did you need?

How long is the string you are searching? How long is the substring?

Is the substring the same every time, or are you always searching for
different strings. Can the substring contain characters with special
meanings for regular expressions?

The regular expression code has a startup penalty since it has to compile
the regular expression at least once, however the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Bottom line: write the code so it is as clean and maintainable as possible.
Only worry about optimising this if you have timed it and know that your
searches are a bottleneck.

--
Duncan Booth du****@rcp.co.u k
int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
"\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

Jul 18 '05 #2

Anand Pillai

Sorry for being too brief!

I was talking about a function which 'counts' the number
of occurences using string & regexp.

I wrote the code for the regexp search as well as the function
search and tested it on a rather large file (800 KB) for
occurences of a certain word. I find that the string search
is at least 2 times faster than the one with regexp, excluding
the time for the regexp.compile( ) method. This is particularly
noticeable when the file becomes quite large and the word is
spread out.

I also thought the regexp would beat string thumbs down and I
am suprised at the result that it is the other way around.

Here is the code. Note that I am using the 'count' methods that
count the number of occurences rather than the 'find' methods.

# Test to find out whether string search in a data
# is faster than regexp search.

# Results: String search is much faster when it comes
# to many occurences of the sub string.

import time

def strsearch1(s, substr):

t1 = time.time()
print 'Count 1 =>', s.count(substr)
t2 = time.time()
print 'Searching using string, Time taken => ', t2 - t1

def strsearch2(s, substr):

import re

r=re.compile(su bstr, re.IGNORECASE)
t1 = time.time()
print 'Count 2 =>', len(r.findall(s ))
t2 = time.time()
print 'Searching using regexp, Time taken => ', t2 - t1
data=open("test .html", "r").read()
strsearch1(data , "Miriam")
strsearch2(data , "Miriam")

# Output here...

D:\Programming\ python>python strsearch.py
Count 1 => 45
Searching using string, Time taken => 0.0599999427795
Count 2 => 45
Searching using regexp, Time taken => 0.110000014305

Test was done on a windows 98 machine using Python 2.3, running
on 248 MB RAM, Intel 1.7 GHz chipset.

I was thinking of using regexp searches in my code, but this convinces
me to stick on to the good old string search.

Thanks for the replies.

-Anand

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message news:<Xn******* *************** *****@127.0.0.1 >...

py*******@Hotpo p.com (Anand Pillai) wrote in
news:84******** *************** **@posting.goog le.com:
To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?

The string search would of course be,

if str.find(substr ) != -1:
domything()

And the regexp search assuming no case restriction would be,

strre=re.compil e(substr, re.IGNORECASE)

m=strre.search( str)
if m:
domything()

I was about to do a test, then I thought someone here might have
some data on this already.

Yes. The answer is 'it all depends'.

Things it depends on include:

Your two bits of code do different things, one is case sensitive, one
ignores case. Which did you need?

How long is the string you are searching? How long is the substring?

Is the substring the same every time, or are you always searching for
different strings. Can the substring contain characters with special
meanings for regular expressions?

The regular expression code has a startup penalty since it has to compile
the regular expression at least once, however the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Bottom line: write the code so it is as clean and maintainable as possible.
Only worry about optimising this if you have timed it and know that your
searches are a bottleneck.

Jul 18 '05 #3

Jeremy Fincher

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message news:<Xn******* *************** *****@127.0.0.1 >...

The regular expression code has a startup penalty since it has to compile
the regular expression at least once, however the actual searching may be
faster than the naive str.find. If the time spent doing the search is
sufficiently long compared with the time doing the compile, the regular
expression may win out.

Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

Jeremy

Jul 18 '05 #4

Dennis Reinhardt

> I find that the string search

is at least 2 times faster than the one with regexp
r=re.compile(su bstr, re.IGNORECASE)

Off the top, maybe the regex is taking twice the time because it is doing
twice the work looking for both the lower case character and the upper case
character. It does not seem a fair test because the string find is case
sensitive.
--

Dennis Reinhardt

De*****@dair.co m http://www.spamai.com?ng_py

Jul 18 '05 #5

Duncan Booth

tw*********@hot mail.com (Jeremy Fincher) wrote in
news:69******** *************** ***@posting.goo gle.com:

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message
news:<Xn******* *************** *****@127.0.0.1 >...
The regular expression code has a startup penalty since it has to
compile the regular expression at least once, however the actual
searching may be faster than the naive str.find. If the time spent
doing the search is sufficiently long compared with the time doing
the compile, the regular expression may win out.

Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

I may have misremembered, but I thought there was a thread discussing this
a little while back which claimed that the regular expression library
looked for constant strings at the start of the regex, and if it found one
used Boyer-Moore to do the search. If it does, then regular expressions
searching for a constant string certainly ought to be much faster than a
plain string.find (as the length of the searched string tends towards
infinity).

If it doesn't, then it should.

--
Duncan Booth du****@rcp.co.u k
int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
"\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

Jul 18 '05 #6

Duncan Booth

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in
news:Xn******** *************** ****@127.0.0.1:

Both regular expression searching and string.find will do searching
one character at a time; given that, it seems impossible to me that
the hand-coded-in-C "naive" string.find could be slower than the
machine-translated-coded-in-Python regular expression search.
Compilation time only serves to further increase string.find's
advantage.

I may have misremembered, but I thought there was a thread discussing
this a little while back which claimed that the regular expression
library looked for constant strings at the start of the regex, and if
it found one used Boyer-Moore to do the search. If it does, then
regular expressions searching for a constant string certainly ought to
be much faster than a plain string.find (as the length of the searched
string tends towards infinity).

If it doesn't, then it should.

Ok, found the code. Regular expression searches do indeed use a form of
Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have to do
case insensitive comparisons, but it also has to crawl along looking at
every character in the search string instead of skipping most of them.

--
Duncan Booth du****@rcp.co.u k
int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
"\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

Jul 18 '05 #7

Jeremy Fincher

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message news:<Xn******* *************** *****@127.0.0.1 >...

Ok, found the code. Regular expression searches do indeed use a form of
Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have to do
case insensitive comparisons, but it also has to crawl along looking at
every character in the search string instead of skipping most of them.

That's cool! Where'd you find the code?

Jeremy

Jul 18 '05 #8

Alex Martelli

Jeremy Fincher wrote:

Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message
news:<Xn******* *************** *****@127.0.0.1 >...
Ok, found the code. Regular expression searches do indeed use a form of
Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have to do
case insensitive comparisons, but it also has to crawl along looking at
every character in the search string instead of skipping most of them.

That's cool! Where'd you find the code?

Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or Modules/_sre.c in
a standard source distribution?
Alex

Jul 18 '05 #9

Duncan Booth

Alex Martelli <al***@aleax.it > wrote in
news:pC******** **************@ news2.tin.it:

Jeremy Fincher wrote:
Duncan Booth <du****@NOSPAMr cp.co.uk> wrote in message
news:<Xn******* *************** *****@127.0.0.1 >...
Ok, found the code. Regular expression searches do indeed use a form
of Boyer-Moore, but not if you are ignoring case. So by specifying
re.IGNORECASE the OP got a double hit, not only does the code have
to do case insensitive comparisons, but it also has to crawl along
looking at every character in the search string instead of skipping
most of them.

That's cool! Where'd you find the code?

Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or
Modules/_sre.c in a standard source distribution?

Close, but it is actually a little bit complicated. _sre.c has the code
that does the search, including the skipping forward using an overlap
table, but the bit which checks for a literal prefix and builds the overlap
table is in lib/src_compile.py. So its in the Python code you have to look
to find that it ignores literal prefixes when ignoring case.

--
Duncan Booth du****@rcp.co.u k
int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
"\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

Jul 18 '05 #10

Similar topics

1911

string.seach() RegEx question

by: Lord Khaos | last post by:

If I am trying to find an expression, foo, I can do something like this: rExp = /foo/gi; if(results.search(rExp) > -1){ and all work fine. however, if I want my search term to be a variable, bar: var bar= "foo";

Javascript

1282

String question

by: Arjen | last post by:

Hello, I have inside a string a complete url. http://www.somedomain.com/index.php?action=shownewsitem&id=125 At the end you see "id=", the id can be "1", "23", "234", etc.. Now I want to have the Id in a new variable. How can catch the id?

Javascript

1605

String replacement problem

by: andrewflanders | last post by:

I have an associative array of keys and values. I want to search a string for the existance of keys and replace them with the values in the array. The problem is that some of the keys resemble regular expressions but they are not so if I build the regular expression with a constructor it causes errors saying that the regular expression syntax is bad. I got around this problem by skipping the regular expression method and passing in the key...

Javascript

5756

Pattern matching with string and list

by: olaufr | last post by:

Hi, I'd need to perform simple pattern matching within a string using a list of possible patterns. For example, I want to know if the substring starting at position n matches any of the string I have a list, as below: sentence = "the color is $red" patterns = pos = sentence.find($)

Python

3407

replace string literal

by: gary | last post by:

How would one make the ECMA-262 String.replace method work with a string literal? For example, if my string was "HELLO" how would I make it work in this instance. Please note my square brackets are not regular expression syntax. Thanks,

Javascript

2896

[STRING] extract a word and text around it

by: teo | last post by:

hallo, I need to extract a word and few text that precedes and follows it (about 30 + 30 chars) from a long textual document. Like the description that Google returns when it has found a given word. In example from:

Visual Basic .NET

3558

regexp match string with word1 and not word2

by: Flyzone | last post by:

Hello, i have again problem with regexp :-P I need to match all lines that contain one word but not contain another. Like to do "grep one | grep -v two:" The syntax of the string is: (any printable char)two:(any printable char)one(any printable char) Example: Apr 30 00:00:09 v890neg0 two: findings: blablabla

Python

1491

regexp search on infinite string?

by: Paddy | last post by:

Lets say i have a generator running that generates successive characters of a 'string' characters then I would have to 'freeze' the generator and pass the characters so far to re.search. It is expensive to create successive characters, but caching could be used for past characters. is it possible to wrap the generator in a class, possibly inheriting from string, that would allow the regexp searching of the string but without terminating...

Python

11684

Using a variable name for the string.replace() regexp

by: X l e c t r i c | last post by:

Here: http://bigbangfodder.fileave.com/res/sandr.html I'm trying to use string.replace() for a basic search and replace form using textarea values as the regexp and replacement values for string.replace(). When I tried to use the textarea variable name for regexp it didn't work as I thought it would. For example:

Javascript

9568

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

9389

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9335

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

8257

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

6801

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6079

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4881

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

3320

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

2218

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General