Regular expression question -- exclude substring

Hi,

I'm having trouble extracting substrings using regular expressions. Here
is my problem:

I want to find the substring that comes immediately before a given
substring. For example, from
"00 noise1 01 noise2 00 target 01 target_mark",
I want to get
"00 target 01",
which comes just before
"target_mark".
My regular expression
"(00.*?01) target_mark"
extracts
"00 noise1 01 noise2 00 target 01"
instead.

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

Nov 7 '05 #1
If there is a character that can't appear in the bit between the numbers, then match everything but that character instead of ".". For example, if spaces can only appear as you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"
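A minimal sketch of this suggestion in Python, assuming the text between the two numbers never contains whitespace (the variable names are illustrative, not from the thread):

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark"

# \S* cannot cross a space, so the match cannot stretch from the first
# "00" across "noise1 01 noise2" to reach the "01" before target_mark.
pattern = re.compile(r"(00 \S* 01) target_mark")

match = pattern.search(text)
result = match.group(1) if match else None
print(result)  # 00 target 01
```

If the wanted span may itself contain spaces, this pattern is too strict and one of the other approaches in this thread is a better fit.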

Kent
Nov 8 '05 #2
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
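A small sketch of why the leading greedy group helps, using the sample string from the original post. The names `last_span` and `with_tail` are made up for illustration:

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark"

# The greedy prefix ".*" first swallows the whole string and then backs
# off one character at a time, so the group ends up at the right-most
# "00" from which the rest of the pattern can still match.
match = re.match(r".*(00.*?01) target_mark", text)
last_span = match.group(1) if match else None
print(last_span)  # 00 target 01

# re.sub only replaces the matched part; text after it is left alone.
with_tail = re.sub(r"(.*)(00.*?01) target_mark", r"\2", text + " tail")
print(with_tail)  # 00 target 01 tail
```

Because the prefix is greedy, this approach recovers only the span before the last target_mark when the mark occurs more than once.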

Nov 8 '05 #3
On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)


The non-greedy operator is actually acting as expected. This is because
non-greedy operators are "forward looking", not "backward looking". The
non-greedy match begins at the first possible start-of-match it comes across
and then finds the first occurrence of '01' that completes the match. The
greedy operator, by contrast, would match .* as much as it could, gobbling up
all '01's before the last because these also match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.
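One way to get the "exclude a substring" behaviour asked for in the original post is a negative lookahead, which Python's re module supports. The sketch below is not taken from any reply in this thread, and it assumes the wanted span never contains a second "00":

```python
import re

text = ("00 noise1 01 noise2 00 target 01 target_mark "
        "00 dowhat 01 target_mark")

# (?:(?!00).)* consumes one character at a time, but only while the
# next two characters are not "00", so the group cannot span a second
# opening delimiter.
pattern = re.compile(r"(00(?:(?!00).)*01) target_mark")

spans = pattern.findall(text)
print(spans)  # ['00 target 01', '00 dowhat 01']
```

The lookahead is checked at every character, so this does more work per match than a plain ".*?", which is usually acceptable for short strings like these.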

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Nov 8 '05 #4
James Stroud wrote:
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']


??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy match is the same in this case.
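A quick check of this point, as a sketch on the sample string from the original post. Both patterns begin at the left-most "00" (index 0), and with a single target_mark there is only one place the match can end:

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark"

greedy = re.search(r"(00.*01) target_mark", text)
lazy = re.search(r"(00.*?01) target_mark", text)

# The regex engine tries start positions from left to right, so both
# matches begin at index 0; laziness only affects where a match ends.
print(greedy.start(1), lazy.start(1))    # 0 0
print(greedy.group(1) == lazy.group(1))  # True
```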

Kent
Nov 8 '05 #5
On Monday 07 November 2005 17:31, Kent Johnson wrote:
??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy
match is the same in this case.


Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

Nov 8 '05 #6
On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <js*****@mbi.ucla.edu> wrote:
My understanding is that backward looking operators are very resource
expensive to implement.

If the delimiting strings are fixed, we can use plain python string methods, e.g.,
(not tested beyond what you see ;-)
>>> s = "00 noise1 01 noise2 00 target 01 target_mark"
>>> def findit(s, beg='00', end='01', tmk=' target_mark'):
...     start = 0
...     while True:
...         t = s.find(tmk, start)
...         if t<0: break
...         start = s.rfind(beg, start, t)
...         if start<0: break
...         e = s.find(end, start, t)
...         if e+len(end)==t: # _just_ after
...             yield s[start:e+len(end)]
...         start = t+len(tmk)
...
>>> list(findit(s))
['00 target 01']
>>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time, obviously it would be more efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)
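The variant described in the parenthetical above could look roughly like the sketch below. It is an interpretation rather than code from the post, and the function name `find_before_mark` is made up for illustration:

```python
def find_before_mark(s, beg="00", end="01", tmk=" target_mark"):
    """Yield each beg...end span that sits immediately before tmk."""
    tail = end + tmk              # search for "01 target_mark" directly
    pos = 0
    while True:
        t = s.find(tail, pos)     # next place where end is followed by tmk
        if t < 0:
            return
        start = s.rfind(beg, pos, t)   # nearest beg to the left of it
        if start >= 0:
            yield s[start:t + len(end)]
        pos = t + len(tail)       # continue after this mark


s = "00 noise1 01 noise2 00 target 01 target_mark"
print(list(find_before_mark(s)))  # ['00 target 01']
```

Unlike findit, this version does not check whether another end delimiter lies between beg and the one next to the mark, so whether that case should be yielded is left as an open design choice.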

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter
Nov 8 '05 #7
