Regular expression to match a #

Hi,

I'm trying to use a regular expression to match a string containing a #
(basically i'm looking for #include ...)

I don't seem to manage to write a regular expression that matches this.

My (probably too naive) approach is: p = re.compile(r'\b#include\b')
I also tried p = re.compile(r'\b\#include\b') in a futile attempt to use
a backslash as an escape character before the #.
Neither of the above returns a match for a string like "#include <stdio>".

I know a # is used for comments, hence my attempt to escape it...

Any suggestion on how to get a regular expression to find a #?

Thanks

Aug 11 '05 #1
Dan
> My (probably too naive) approach is: p = re.compile(r'\b#include\b')

I think your problem is the \b at the beginning. \b matches a word break
(defined as \w\W or \W\w). There would only be a word break before the #
if the preceding character were a \w (that is, [A-Za-z0-9_], and maybe
some other characters depending on your locale).

However, the \b after the "include" is exactly what you want.
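
To make the point about the leading \b concrete, here is a minimal sketch (the variable names are illustrative, not from the original post). It assumes only the standard re module and compares the pattern with and without the leading \b:

```python
import re

# \b only matches where a word character (\w) sits on exactly one side
# of the position, and '#' is not a word character.
with_leading_b = re.compile(r'\b#include\b')
without_leading_b = re.compile(r'#include\b')

line = '#include <stdio.h>'

# Nothing precedes '#', so there is no word boundary at position 0.
no_match = with_leading_b.search(line)
# The trailing \b sits between 'e' and the space, which is a boundary.
plain_match = without_leading_b.search(line)
# A word character directly before '#' does create a boundary.
glued_match = with_leading_b.search('x' + line)

print(no_match)
print(plain_match.group(0))
print(glued_match.group(0))
```

So the leading \b only succeeds in the case the original poster almost certainly does not want: a '#' glued to the end of a word.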

--
I had picked out the theme of the baby's room and done other
things. I decided to let Jon have this.
- Jamie Cusack (of the Netherlands), whose husband Jon
finally talked her into letting him name their son Jon 2.0
Aug 11 '05 #2
Thanks,
That did the trick...

Aug 11 '05 #3
Dan wrote:
My (probably too naive) approach is: p = re.compile(r'\b#include\b')


I think your problem is the \b at the beginning. \b matches a word break
(defined as \w\W or \W\w). There would only be a word break before the #
if the preceding character were a \w (that is, [A-Za-z0-9_], and maybe
some other characters depending on your locale).

However, the \b after the "include" is exactly what you want.


So the OP probably wanted '\B' the exact opposite of '\b' for the start of
the string, i.e. only match the # if it is NOT preceded by a wordbreak.

Alternatively for C style #includes search for r'^\s*#\s*include\b'.
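
A small sketch of the \B suggestion, assuming Python's re semantics for \B (it succeeds exactly where \b would fail in a non-empty string). The sample strings and names are made up for illustration:

```python
import re

# \B is the complement of \b: it matches only where there is NO word boundary.
not_after_word = re.compile(r'\B#include\b')

# At the start of the string: no word character on either side of position 0.
at_start = not_after_word.search('#include <stdio.h>')
# After whitespace: a space and '#' are both non-word characters.
after_space = not_after_word.search('  #include <stdio.h>')
# After a word character: 'o' then '#' is a boundary, so \B rejects it.
after_word = not_after_word.search('foo#include <stdio.h>')

print(at_start is not None, after_space is not None, after_word is None)
```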
Aug 11 '05 #4
Tom Deco wrote:
Hi,

I'm trying to use a regular expression to match a string containing a #
(basically i'm looking for #include ...)

I don't seem to manage to write a regular expression that matches this.

My (probably too naive) approach is: p = re.compile(r'\b#include\b')
I also tried p = re.compile(r'\b\#include\b') in a futile attempt to use
a backslash as an escape character before the #.
Neither of the above returns a match for a string like "#include <stdio>".

I know a # is used for comments, hence my attempt to escape it...

Any suggestion on how to get a regular expression to find a #?

Thanks


You definitely shouldn't have the first \b -- match() works only at the
beginning of the target string, so it is impossible for there to be a
word boundary just before the "#".

You probably shouldn't have the second \b.

You probably should read section A12 of K&R2.

You probably should be using a parser, but if you persist in using
regular expressions:

(a) read the manual.

(b) try something like this:
>>> pat1 = re.compile(r'\s*#\s*include\s*<\s*([^>\s]+)\s*>\s*$')
>>> pat1.match(" # include < fubar.h > ").group(1)
'fubar.h'

N.B. this is based on the assumption that sane programmers don't have
whitespace embedded in the names of source files ;-)
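
As a rough sketch of how that pattern might be used line by line: the helper name header_name and the sample lines below are hypothetical, and only the regular expression itself comes from the post above.

```python
import re

# Pattern from the post above: optional whitespace around '#', 'include',
# the angle brackets, and the header name (which may not contain whitespace).
INCLUDE = re.compile(r'\s*#\s*include\s*<\s*([^>\s]+)\s*>\s*$')

def header_name(line):
    """Return the header named in an angle-bracket #include line, else None."""
    m = INCLUDE.match(line)
    return m.group(1) if m else None

samples = [
    ' # include < fubar.h > ',
    '#include <stdio.h>',
    'int main(void) {',
    '#include "local.h"',   # quoted form: this pattern does not handle it
]
names = [header_name(s) for s in samples]
print(names)
```

Note that the quoted form of #include is deliberately left unhandled here, since the pattern in the post only covers angle brackets.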

HTH,
John
Aug 11 '05 #5
Duncan Booth wrote:
Dan wrote:

My (probably too naive) approach is: p = re.compile(r'\b#include\b')


I think your problem is the \b at the beginning. \b matches a word break
(defined as \w\W or \W\w). There would only be a word break before the #
if the preceding character were a \w (that is, [A-Za-z0-9_], and maybe
some other characters depending on your locale).

However, the \b after the "include" is exactly what you want.

So the OP probably wanted '\B' the exact opposite of '\b' for the start of
the string, i.e. only match the # if it is NOT preceded by a wordbreak.

Alternatively for C style #includes search for r'^\s*#\s*include\b'.


Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.

Aug 11 '05 #6
John Machin wrote:
Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.


How else would you match the beginning of a line in a multi-line string?
Aug 11 '05 #7
John Machin wrote:
Alternatively for C style #includes search for r'^\s*#\s*include\b'.


Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which
Python's re is NOT] it could be much worse. So please don't tell
newbies to search for r'^something'.

Search for r'^something' is always better than searching for r'something'
when the spec requires the search to match only at the start of a line (on
the principle that code that works is better than code which doesn't).

It appears that this may be something the original poster wanted, so I
stand by my suggestion.
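
For the multi-line case, a sketch along these lines may help. It assumes the pattern is compiled with re.MULTILINE, which is what makes ^ match at the start of each line rather than only at the start of the string; the sample text and names are invented:

```python
import re

# With re.MULTILINE, '^' matches at the start of the string and
# immediately after every newline.
pattern = re.compile(r'^\s*#\s*include\b', re.MULTILINE)

text = (
    '#include <stdio.h>\n'
    'int x; /* #include <no.h> */\n'
    '  # include <stdlib.h>\n'
)

# Record (line number, matched text) for each directive that starts a line.
hits = [
    (text.count('\n', 0, m.start()) + 1, m.group(0).strip())
    for m in pattern.finditer(text)
]
print(hits)
```

The #include inside the comment on line 2 is skipped because it does not sit at the start of a line.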
Aug 11 '05 #8
In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:

Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.


You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

The way to build large Python applications is to componentize and
loosely-couple the hell out of everything.
Aug 11 '05 #9
Jeff Schwab wrote:
John Machin wrote:
Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which
Python's re is NOT] it could be much worse. So please don't tell
newbies to search for r'^something'.

How else would you match the beginning of a line in a multi-line string?


I beg your pardon -- I should have qualified that:

"""
So please don't tell newbies to search for r'^something' when match of
r'something' does the job.
"""

Aug 11 '05 #10
Duncan Booth wrote:
John Machin wrote:

Alternatively for C style #includes search for r'^\s*#\s*include\b'.


Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which
Python's re is NOT] it could be much worse. So please don't tell
newbies to search for r'^something'.


Search for r'^something' is always better than searching for r'something'
when the spec requires the search to match only at the start of a line (on
the principle that code that works is better than code which doesn't).

It appears that this may be something the original poster wanted, so I
stand by my suggestion.

We could well be lost in a semantic fog where at least one of us is
using "match" to mean "the match() method" and at least one of us is
using match to mean something like "the outcome of using a search()
method [or a match() method]".

So I'll stand by my suggestion, too.

Aug 11 '05 #11
Aahz wrote:
In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:
Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.

You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.


I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).
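
To illustrate what a "dopey" search means here, the following is a deliberately naive sketch, not how Python's re is actually implemented. The function naive_search is hypothetical; it relies on the documented behaviour that pattern.match(string, pos) starts matching at pos while '^' still refers to the real start of the string:

```python
import re

def naive_search(pattern, text):
    """Hypothetical 'dopey' search: try match() at offsets 0, 1, 2, ..."""
    attempts = 0
    for pos in range(len(text) + 1):
        attempts += 1
        m = pattern.match(text, pos)
        if m is not None:
            return m, attempts
    return None, attempts

anchored = re.compile(r'^z')

# The pattern can only ever succeed at offset 0, yet the naive loop
# keeps trying every later offset before giving up.
m, attempts = naive_search(anchored, 'x' * 1000)
print(m, attempts)
```

The number of wasted attempts grows with the length of the string, which is the cost being described in the failing case.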

Aug 11 '05 #12
John Machin wrote:
Aahz wrote:
In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:
Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.

You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.


I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).


I don't see much difference.
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

IDLE 1.1.1
>>> import timeit
>>> t1 = timeit.Timer('re.search("^\w"," will not work")','import re')
>>> t1.timeit()
34.938577109660628
>>> t2 = timeit.Timer('re.match("\w"," will not work")','import re')
>>> t2.timeit()
31.381461330979164
>>> 3.0/1000000
3.0000000000000001e-006
>>> t1.timeit()
35.282282524734228
>>> t2.timeit()
31.403153752781463

~4 second difference after a million times through seems to be trivial.
Then again, I haven't tested it for larger patterns and strings.

Aug 11 '05 #13
Devan L wrote:
John Machin wrote:
Aahz wrote:
In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:
Search for r'^something' can never be better/faster than match for
r'something', and with a dopey implementation of search [which Python's
re is NOT] it could be much worse. So please don't tell newbies to
search for r'^something'.
You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.


I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).

I don't see much difference.


and I didn't expect that you would -- like I wrote above: "Python's re's
search() is not dopey".
Aug 11 '05 #14

John Machin wrote:
Devan L wrote:
John Machin wrote:
Aahz wrote:

In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:
>Search for r'^something' can never be better/faster than match for
>r'something', and with a dopey implementation of search [which Python's
>re is NOT] it could be much worse. So please don't tell newbies to
>search for r'^something'.
You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.

I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).

I don't see much difference.


and I didn't expect that you would -- like I wrote above: "Python's re's
search() is not dopey".


Your wording makes it hard to distinguish what exactly is "dopey".

Aug 11 '05 #15
John Machin wrote:
Devan L wrote:
John Machin wrote:
Aahz wrote:

In article <42********@news.eftel.com>,
John Machin <sj******@lexicon.net> wrote:
> Search for r'^something' can never be better/faster than match for
> r'something', and with a dopey implementation of search [which
> Python's
> re is NOT] it could be much worse. So please don't tell newbies to
> search for r'^something'.

You're somehow getting mixed up in thinking that "^" is some kind of
"not" operator -- it's the start of line anchor in this context.
I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).


I don't see much difference.

and I didn't expect that you would -- like I wrote above: "Python's re's
search() is not dopey".


*ahem*

C:\junk>python
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> t1 = timeit.Timer('re.search("^\w"," will not work")','import re')
>>> t2 = timeit.Timer('re.match("\w"," will not work")','import re')
>>> t3 = timeit.Timer('obj(" will not work")','import re;obj=re.compile("^\w").search')
>>> t4 = timeit.Timer('obj(" will not work")','import re;obj=re.compile("\w").match')
>>> t5 = timeit.Timer('obj(" will not work qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")','import re;obj=re.compile("^\w").search')
>>> t6 = timeit.Timer('obj(" will not work qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")','import re;obj=re.compile("\w").match')
>>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]
['5.510', '4.835', '1.588', '1.178']
>>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]
['5.512', '4.808', '1.584', '1.170']

Observation: factoring out the compile step makes the difference much
more apparent.

>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.578', '1.175', '2.283', '1.174']
>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.582', '1.179', '2.284', '1.172']


Conclusion: search time depends on length of searched string.

Meta-conclusion: Either I have to retract my
based-on-hope-rather-than-on-experimentation assertion, or redefine "not
dopey" to mean "surely nobody would search for ^x when match x would do,
so it would be dopey to optimise re for that" :-)

So, back to the original point:

If re.match("something") does the job you want, don't use
re.search("^something") instead.
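
For anyone wanting to repeat the comparison on a current interpreter, here is a sketch in modern Python syntax. The helper time_call, the string lengths, and the repeat count are arbitrary choices, and the timings will differ by machine and Python version, so no numbers are shown:

```python
import re
import timeit

# Pre-compiled, as in t3..t6 above, so the compile step is not timed.
anchored_search = re.compile(r'^\w').search
plain_match = re.compile(r'\w').match

# Both subjects start with a space, so both calls fail to match.
short = ' will not work'
long_ = short + ' ' + 'q' * 10000

def time_call(func, subject, number=10000):
    """Total seconds for `number` calls of func(subject)."""
    return timeit.timeit(lambda: func(subject), number=number)

for label, subject in (('short', short), ('long', long_)):
    print('%-5s search ^\\w: %.4f s   match \\w: %.4f s' % (
        label,
        time_call(anchored_search, subject),
        time_call(plain_match, subject),
    ))
```

If the search time grows with the length of the subject while the match time stays flat, that reproduces the observation above; a newer re implementation may or may not show the same gap.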
Aug 11 '05 #16
Devan L wrote:
John Machin wrote:
Devan L wrote:
John Machin wrote:
Aahz wrote:
>In article <42********@news.eftel.com>,
>John Machin <sj******@lexicon.net> wrote:
>
>
>
>>Search for r'^something' can never be better/faster than match for
>>r'something', and with a dopey implementation of search [which Python's
>>re is NOT] it could be much worse. So please don't tell newbies to
>>search for r'^something'.
>
>
>You're somehow getting mixed up in thinking that "^" is some kind of
>"not" operator -- it's the start of line anchor in this context.

I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).
I don't see much difference.


and I didn't expect that you would -- like I wrote above: "Python's re's
search() is not dopey".

Your wording makes it hard to distinguish what exactly is "dopey".


"""
dopey implementations of search() (which apply match() at offsets 0, 1,
2, .....).
"""

The "dopiness" is that the ^ operator means that the pattern cannot
possibly match starting at 1, 2, 3, etc but a non-optimised search will
not recognise that and will try all possibilities, so the failing case
takes time dependent on the length of the string.
Aug 11 '05 #17
John Machin wrote:
[...]
Observation: factoring out the compile step makes the difference much
more apparent.

>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.578', '1.175', '2.283', '1.174']
>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]
['1.582', '1.179', '2.284', '1.172']
>>>

To make it even more apparent, try:

import re
import profile

startsz = re.compile('^z')

for s in ('x' * 1000, 'x' * 100000, 'x' * 10000000):
    profile.run('startsz.search(s)')

Profile report is below.

Conclusion: search time depends on length of searched string.

Meta-conclusion: Either I have to retract my
based-on-hope-rather-than-on-experimentation assertion, or redefine "not
dopey" to mean "surely nobody would search for ^x when match x would do,
so it would be dopey to optimise re for that" :-)


No question, there's some dopiness to searching for the
beginning of the string at places other than beginning of the
string.

The tricky part would be optimizing '$'.
--
--Bryan

         4 function calls in 0.003 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(search)
        1    0.003    0.003    0.003    0.003 :0(setprofile)
        1    0.000    0.000    0.000    0.000 <string>:1(?)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.003    0.003 profile:0(startsz.search(s))

         4 function calls in 0.002 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 :0(search)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.002    0.002 <string>:1(?)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.002    0.002 profile:0(startsz.search(s))

         4 function calls in 0.228 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.228    0.228    0.228    0.228 :0(search)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    0.228    0.228 <string>:1(?)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.228    0.228 profile:0(startsz.search(s))
Aug 12 '05 #18
John Machin wrote:
The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).


Answering the question you think should have been asked rather than the
question which was actually asked is a great newsnet tradition, and often
more helpful to the poster than a straight answer would have been. However,
you do have to be careful to make it clear that is what you are doing.

The OP did not use the word 'line' once in his post. He simply said he was
searching a string. You didn't use the word 'line' either. If you are going
to read more into the question than was actually asked, please try to say
what question it is you are actually answering.

If he is using individual lines and re.match then the presence or absence
of a leading ^ makes virtually no difference. If he is looking for all
occurrences in a multiline string then re.search with an anchored match is a
correct way to do it (splitting the string into lines and using re.match is
an alternative which may or may not be appropriate).
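
The two approaches mentioned here can be sketched side by side. This is an illustration under assumptions, not the pattern suggested earlier in the thread: it uses [ \t]* instead of \s* so that a match can never spill across a line break, and the names and sample source are invented:

```python
import re

# Per-line pattern, intended for use with match() on one line at a time.
INCLUDE_LINE = re.compile(r'[ \t]*#[ \t]*include[ \t]*<[ \t]*([^>\s]+)[ \t]*>')
# Same pattern anchored with '^' and re.MULTILINE for scanning a whole string.
INCLUDE_ANYWHERE = re.compile(r'^' + INCLUDE_LINE.pattern, re.MULTILINE)

source = (
    '#include <stdio.h>\n'
    '\n'
    '  # include <stdlib.h>\n'
    'int main(void) { return 0; }\n'
)

# Approach 1: one anchored scan over the multi-line string.
via_search = INCLUDE_ANYWHERE.findall(source)

# Approach 2: split into lines and use match() on each one.
via_match = [m.group(1)
             for m in map(INCLUDE_LINE.match, source.splitlines())
             if m]

print(via_search)
print(via_match)
```

Which of the two is more appropriate depends on whether the input is already available line by line.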

Either way, putting the focus on the ^ was inappropriate: the issue is
whether to use re.search or re.match. If you assume that the search fails
on an 80 character line, then I get timings of 6.48uS (re.search), 4.68uS
(re.match with ^), 4.66uS (re.match without ^). A failing search on a
10,000 character line shows how performance will degrade (225uS for search,
no change for match), but notice that searching 1 10,000 character string
is more than twice as fast as matching 125 80 character lines.

I don't understand what you think an implementation of search() can do in
this case apart from trying for a match at offsets 0, 1, 2, ...? It could
find a match at any starting offset within the string, so it must scan the
string in some form. A clever regex implementation will use Boyer-Moore
where it can to avoid checking every index in the string, but for the
pattern I suggested it would surprise me if any implementations actually
manage much of an optimisation.
Aug 12 '05 #19
John Machin wrote:
Your wording makes it hard to distinguish what exactly is "dopey".


"""
dopey implementations of search() (which apply match() at offsets 0, 1,
2, .....).
"""

The "dopiness" is that the ^ operator means that the pattern cannot
possibly match starting at 1, 2, 3, etc but a non-optimised search will
not recognise that and will try all possibilities, so the failing case
takes time dependent on the length of the string.


The ^ operator can match at any position in the string if the preceding
character was a newline. 'Dopey' would be failing to take this into
account.
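
One detail worth spelling out: in Python's re, ^ matches after a newline only when the re.MULTILINE flag is set; by default it matches only at the very start of the string. A short sketch with a made-up two-line string:

```python
import re

text = 'int x;\n#include <stdio.h>'

# Default mode: '^' matches only at the very start of the string.
default = re.search(r'^#include', text)

# MULTILINE mode: '^' also matches right after each newline.
multiline = re.search(r'^#include', text, re.MULTILINE)

print(default)
print(multiline.start())
```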
Aug 12 '05 #20
