Bytes IT Community

Regular expression question -- exclude substring

Hi,

I'm having trouble extracting substrings using regular expressions. Here
is my problem:

I want to find the substring that comes immediately before a given
substring. For example, from
"00 noise1 01 noise2 00 target 01 target_mark",
I want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be a regular
expression that excludes the substring "target_mark", used in place of
the ".*" part above. However, I don't know how to exclude a
substring. Can anyone help with this, or maybe suggest another solution
to my problem? Thanks very much.
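
The "exclude a substring" idea can be sketched in Python's re module with a negative lookahead. This is only a minimal illustration, under the assumption that the wanted block must not contain a second "00"; it excludes the start delimiter "00" rather than "target_mark", because that is what forces the match to begin at the last "00" before the mark. The variable names are made up for the example.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# (?:(?!00).)*? consumes one character at a time, but only while the
# text at that position does not begin another "00". A match that
# starts at the first "00" therefore cannot run past the second one.
pattern = re.compile(r"(00(?:(?!00).)*?01) target_mark")

matches = pattern.findall(s)
print(matches)  # ['00 target 01']
```

Because "." does not match a newline by default, this sketch also assumes the whole block sits on one line.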

Nov 7 '05 #1
6 Replies


dr********@gmail.com wrote:
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".


If there is a character that can't appear in the part between the numbers, then use everything-but-that instead of ".". For example, if spaces can only appear as you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"

Kent
Nov 8 '05 #2
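
The two character-class patterns above can be checked against the example string from the question. This is a small sketch assuming Python's standard re module; the names s, patterns and results are just for illustration.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# Both patterns allow only non-space characters between "00 " and " 01",
# so a match cannot span "noise1 01 noise2 00 target".
patterns = (
    r"(00 [^ ]* 01) target_mark",
    r"(00 \S* 01) target_mark",
)

results = [re.findall(pat, s) for pat in patterns]
for pat, found in zip(patterns, results):
    print(pat, "->", found)
```

Both patterns return ['00 target 01'] here. They rely on the middle part being a single token without spaces, as stated in the reply.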

Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
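
The same greedy-prefix trick can also be written with re.search, which returns just the captured group instead of rewriting the string. A minimal sketch, assuming the text is on a single line (since "." does not match a newline by default); the names s, m and found are illustrative.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# The leading greedy .* swallows as much as it can and then backs off,
# so the captured group starts at the LAST "00" that still allows a match.
m = re.search(r".*(00.*?01) target_mark", s)
found = m.group(1) if m else None
print(found)  # 00 target 01

# The re.sub form from the post gives the same text for this input.
replaced = re.sub(r"(.*)(00.*?01) target_mark", r"\2", s)
print(replaced)  # 00 target 01
```

Note that re.sub leaves any text after "target_mark" in place, so re.search is probably the safer choice when the string continues past the mark.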

Nov 8 '05 #3

On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)


The non-greedy is actually acting as expected. Non-greedy operators are
"forward looking", not "backward looking": the regex engine takes the first
position where a match can start, and the non-greedy .*? then extends only as
far as the first '01' that lets the whole pattern match. A greedy .* would
instead match as much as it could, gobbling up every '01' before the last one,
because those also match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.
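
The "forward looking" point can be seen by asking the match object where the match starts. A minimal sketch using the original string and pattern from the question, assuming Python's standard re module:

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

m = re.search(r"(00.*?01) target_mark", s)

# The search commits to the leftmost possible start (the first "00");
# the non-greedy .*? only decides how far to the right the match extends.
start = m.start(1)
captured = m.group(1)
print(start)     # 0
print(captured)  # 00 noise1 01 noise2 00 target 01
```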

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Nov 8 '05 #4

James Stroud wrote:
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']


??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string, the greedy and non-greedy matches are the same in this case.

Kent
Nov 8 '05 #5

On Monday 07 November 2005 17:31, Kent Johnson wrote:
??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string, the greedy and non-greedy
matches are the same in this case.


Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

Nov 8 '05 #6

On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <js*****@mbi.ucla.edu> wrote:
My understanding is that backward looking operators are very resource
expensive to implement.

If the delimiting strings are fixed, we can use plain Python string methods, e.g.,
(not tested beyond what you see ;-)

>>> s = "00 noise1 01 noise2 00 target 01 target_mark"
>>> def findit(s, beg='00', end='01', tmk=' target_mark'):
...     start = 0
...     while True:
...         t = s.find(tmk, start)
...         if t<0: break
...         start = s.rfind(beg, start, t)
...         if start<0: break
...         e = s.find(end, start, t)
...         if e+len(end)==t: # _just_ after
...             yield s[start:e+len(end)]
...         start = t+len(tmk)
...
>>> list(findit(s))
['00 target 01']
>>> # two spaces after "almost 01", so that block is not _just_ before the mark
>>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time. Obviously it would be more efficient
to search for end+tmk instead of searching for tmk, then back to beg and forward to end ;-)
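
The end+tmk idea mentioned above could look roughly like this. It is only a sketch of that suggestion, not tested beyond the cases shown; the function name find_before_mark and the variable names are made up for the example, and the default delimiters are the ones used earlier in this post.

```python
def find_before_mark(s, beg="00", end="01", tmk=" target_mark"):
    """Yield each beg...end span that is immediately followed by tmk."""
    needle = end + tmk  # searching for both at once enforces adjacency
    pos = 0
    while True:
        hit = s.find(needle, pos)
        if hit < 0:
            break
        # Look backwards for the nearest beg before this end.
        start = s.rfind(beg, pos, hit)
        if start >= 0:
            yield s[start:hit + len(end)]
        pos = hit + len(needle)


s = "00 noise1 01 noise2 00 target 01 target_mark"
print(list(find_before_mark(s)))  # ['00 target 01']
```

Unlike findit, this version skips a mark that has no beg before it and keeps going, and it does not check for an extra end between beg and the final end, so a span such as "00 a 01 b 01" would be returned whole.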

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter
Nov 8 '05 #7
