Regular expression question -- exclude substring

dreamerbin

Hi,

I'm having trouble extracting substrings using regular expressions. Here
is my problem:

I want to find the substring that is immediately before a given
substring. For example, from
"00 noise1 01 noise2 00 target 01 target_mark",
I want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

Nov 7 '05 #1


Kent Johnson

dr********@gmail.com wrote:

Hi,

I'm having trouble extracting substrings using regular expression. Here
is my problem:

Want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

If there is a character that can't appear in the bit between the numbers, then use everything-but-that instead of '.'. For example, if spaces can only appear as you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"

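A quick way to check both patterns against the sample string from the question. This is only a sketch and assumes, as stated above, that the text between "00" and "01" never contains a space; the variable names are made up for the example.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# \S* cannot cross a space, so a match cannot run through "noise1 01 noise2".
no_space = re.compile(r"(00 \S* 01) target_mark")
# [^ ]* is the same idea spelled as a negated character class.
negated = re.compile(r"(00 [^ ]* 01) target_mark")

print(no_space.findall(s))
print(negated.findall(s))
```

Both patterns should pick out only the "00 target 01" span, because the attempt that starts at the first "00" fails as soon as a space shows up where " 01 target_mark" is required.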
Kent

Nov 8 '05 #2

google

Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
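The same trick (a greedy ".*" in front that pushes the start of the match as far right as possible) can also be used with re.search, which returns just the captured group. A minimal sketch; the helper name last_span_before_mark is made up for illustration, and it only finds the span before the last target_mark in the string.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# The leading greedy .* consumes as much as it can, so the captured
# "00 ... 01" starts at the last "00" that still allows a match.
pattern = re.compile(r".*(00.*?01) target_mark")

def last_span_before_mark(text):
    """Return the '00 ... 01' span just before the last target_mark, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(last_span_before_mark(s))

# re.sub only replaces the matched part, so text after target_mark survives.
print(re.sub(r"(.*)(00.*?01) target_mark", r"\2", s + " trailing text"))
```

Note that with re.sub anything following target_mark stays in the result, so re.search is probably the safer choice when the string can continue past the marker.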

Nov 8 '05 #3

James Stroud

On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:

Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy operator is actually acting as expected. Non-greedy
operators are "forward looking", not "backward looking": the engine fixes
the start of the match at the first '00' it comes across and then finds the
first occurrence of '01' that completes the match. A greedy operator would
instead match .* as far as it could, gobbling up every '01' before the last
one, because they also match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.
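For the original question (how to exclude a substring from the ".*" part), one option in Python's re module is a negative lookahead applied to every character. The sketch below assumes that the wanted span must not contain another "00"; that assumption is mine, not something stated in the question.

```python
import re

s1 = "00 noise1 01 noise2 00 target 01 target_mark"
s2 = "00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark"

# (?:(?!00).)*? consumes one character at a time, but only while no new
# "00" starts at that position, so a match can never span two "00" openers.
pattern = re.compile(r"(00(?:(?!00).)*?01) target_mark")

print(pattern.findall(s1))
print(pattern.findall(s2))
```

The attempt starting at the first "00" fails when it reaches the second "00", so the engine moves on and the match starts at the "00" closest to target_mark.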

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Nov 8 '05 #4

Kent Johnson

James Stroud wrote:

On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy
operators are "forward looking", not "backward looking". So the non-greedy
finds the start of the first start-of-the-match it comes across and then
finds the first occurrence of '01' that makes the complete match, otherwise
the greedy operator would match .* as much as it could, gobbling up all '01's
before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

??? not in my Python:

>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string, the greedy and non-greedy matches are the same in this case.

Kent

Nov 8 '05 #5

James Stroud

On Monday 07 November 2005 17:31, Kent Johnson wrote:

James Stroud wrote:
On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy
operators are "forward looking", not "backward looking". So the
non-greedy finds the start of the first start-of-the-match it comes
across and then finds the first occurrence of '01' that makes the
complete match, otherwise the greedy operator would match .* as much as
it could, gobbling up all '01's before the last because these match '.*'.
For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy
match is the same in this case.

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James


Nov 8 '05 #6

Bengt Richter

On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <js*****@mbi.ucla.edu> wrote:

On Monday 07 November 2005 16:18, go****@fatherfrost.com wrote:
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy
operators are "forward looking", not "backward looking". So the non-greedy
finds the start of the first start-of-the-match it comes across and then
finds the first occurrence of '01' that makes the complete match, otherwise
the greedy operator would match .* as much as it could, gobbling up all '01's
before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.

If the delimiting strings are fixed, we can use plain python string methods, e.g.,
(not tested beyond what you see ;-)

>>> s = "00 noise1 01 noise2 00 target 01 target_mark"
>>> def findit(s, beg='00', end='01', tmk=' target_mark'):
...     start = 0
...     while True:
...         t = s.find(tmk, start)
...         if t<0: break
...         start = s.rfind(beg, start, t)
...         if start<0: break
...         e = s.find(end, start, t)
...         if e+len(end)==t: # _just_ after
...             yield s[start:e+len(end)]
...         start = t+len(tmk)
...
>>> list(findit(s))
['00 target 01']
>>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time, obviously it would be more efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)
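A rough sketch of that variant, searching for end+tmk directly and then backing up to the nearest beg. The function name find_before_mark and the second test string are made up for illustration, and it has only been checked against the small cases shown.

```python
def find_before_mark(s, beg="00", end="01", tmk=" target_mark"):
    """Yield each beg...end span whose end sits immediately before tmk."""
    tail = end + tmk              # adjacency is enforced by searching for both at once
    pos = 0
    while True:
        t = s.find(tail, pos)     # next "01 target_mark"
        if t < 0:
            break
        start = s.rfind(beg, pos, t)   # nearest "00" before it, after the previous hit
        if start >= 0:
            yield s[start:t + len(end)]
        pos = t + len(tail)

s = "00 noise1 01 noise2 00 target 01 target_mark"
# "00 almost 01 x target_mark" is not adjacent, so it should be skipped.
s2 = s + " garbage 00 almost 01 x target_mark 00 success 01 target_mark"

print(list(find_before_mark(s)))
print(list(find_before_mark(s2)))
```

Like the original, this assumes fixed delimiter strings and does nothing clever about a stray "00" that happens to appear inside other tokens.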

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter

Nov 8 '05 #7
