Regexes: How to handle escaped characters

Hello!

I need some help with finding matches in a string that has some
characters which are marked as escaped (in a separate list of
indices). Escaped means that they must not be part of any match.

My current approach is to look for matches in substrings with the
escaped characters as boundaries between the substrings. However,
then ^ and $ in the patterns are treated wrongly. (Although I use
startpos and endpos parameters for this and no slicing.)

Another idea was to have a special unicode character that never
takes part in a match. The docs are not very promising regarding
such a thing, or did I miss something?

Any other ideas?

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)
May 17 '07 #1
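
A note on the ^ and $ behaviour mentioned in the question. The following is a minimal sketch, assuming Python's standard re module and the "Hollo" string used as an example later in the thread. With a compiled pattern's search(string, pos, endpos), endpos makes the string look shorter as far as $ is concerned, while ^ still anchors only at the real start of the string, which is probably why substring searches with startpos and endpos appear to treat the anchors "wrongly".

```python
import re

text = "Hollo"

# pos does not move the "start of string": '^' still anchors at index 0,
# so searching from pos=2 finds nothing even though text[2] is 'l'.
caret = re.compile("^l")
print(caret.search(text, 2))       # None
print(caret.search(text[2:]))      # a match: slicing creates a new start

# endpos behaves as if the string ended there, so '$' matches at endpos.
dollar = re.compile("o$")
m = dollar.search(text, 0, 2)      # only "Ho" is visible to the pattern
print(m.span() if m else None)     # (1, 2)
```

So neither slicing nor pos/endpos alone gives the wanted semantics: the anchors have to be judged against the whole string, and the escaped positions have to be checked separately.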
You should probably provide examples of what you are trying to do or you
will likely get a lot of irrelevant answers.

James
May 17 '07 #2
Hello!

James Stroud writes:

> Torsten Bronger wrote:
>
>> I need some help with finding matches in a string that has some
>> characters which are marked as escaped (in a separate list of
>> indices). Escaped means that they must not be part of any match.
>>
>> [...]
>
> You should probably provide examples of what you are trying to do
> or you will likely get a lot of irrelevant answers.

Example string: u"Hollo", escaped positions: [4]. Thus, the second
"o" is escaped and must not be found by the regexp searches.

Instead of re.search, I call the function guarded_search(pattern,
text, offset), which takes care of escaped characters. Thus, while

    re.search("o$", string)

will find the second "o",

    guarded_search("o$", string, 0)

won't find anything. But how to program "guarded_search"?
Actually, it is about changing the semantics of the regexp syntax:
"." no longer means "any character except newline" but "any
character except newline and characters marked as escaped". And so
on, for all syntax elements of regular expressions. Escaped
characters must spoil any match; however, the regexp machine should
continue to search for other matches.

Bye,
Torsten.

May 17 '07 #3
You will probably need to implement your own findall, etc., but this
seems to do it for search:

import re

def guarded_search(rgx, astring, escaped):
    m = re.search(rgx, astring)
    if m:
        s = m.start()
        e = m.end()
        # Discard the match if any escaped position falls inside it.
        for i in escaped:
            if s <= i <= e:
                m = None
                break
    return m

Here it is in use:

py> def guarded_search(rgx, astring, escaped):
...     m = re.search(rgx, astring)
...     if m:
...         s = m.start()
...         e = m.end()
...         for i in escaped:
...             if s <= i <= e:
...                 m = None
...                 break
...     return m
...
py> import re
py> escaped = [1, 5, 15]
py> print guarded_search('abc', 'xyzabcxyz', escaped)
None
py> print guarded_search('abc', 'xyzxyzabcxyz', escaped)
<_sre.SRE_Match object at 0x40379720>

James
May 17 '07 #4
On May 18, 6:00 am, Torsten Bronger <bron...@physik.rwth-aachen.de>
wrote:
> Hello!
>
> James Stroud writes:
>> Torsten Bronger wrote:
>>> I need some help with finding matches in a string that has some
>>> characters which are marked as escaped (in a separate list of
>>> indices). Escaped means that they must not be part of any match.
>>> [...]
>> You should probably provide examples of what you are trying to do
>> or you will likely get a lot of irrelevant answers.
>
> Example string: u"Hollo", escaped positions: [4]. Thus, the second
> "o" is escaped and must not be found by the regexp searches.
>
> Instead of re.search, I call the function guarded_search(pattern,
> text, offset) which takes care of escaped characters. Thus, while
>
> re.search("o$", string)
>
> will find the second "o",
>
> guarded_search("o$", string, 0)

Huh? Did you mean 4 instead of zero?

> won't find anything.

Quite apart from the confusing use of "escape", your requirements are
still as clear as mud. Try writing up docs for your "guarded_search"
function. Supply test cases showing what you expect to match and what
you don't expect to match. Is "offset" the offset in the text? If so,
don't you really want a set of "forbidden" offsets, not just one?

> But how to program "guarded_search"?
> Actually, it is about changing the semantics of the regexp syntax:
> "." doesn't mean anymore "any character except newline" but "any
> character except newline and characters marked as escaped".

Make up your mind whether you are "escaping" characters [likely to be
interpreted by many people as position-independent] or "escaping"
positions within the text.

> And so
> on, for all syntax elements of regular expressions. Escaped
> characters must spoil any match, however, the regexp machine should
> continue to search for other matches.

Whatever your exact requirement, it would seem unlikely to be so
wildly popularly demanded as to warrant inclusion in the "regexp
machine". You would have to write your own wrapper, something like the
following totally-untested example of one possible implementation of
one possible guess at what you mean:

import re

def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    regex = re.compile(pattern)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        # Reject the match if it covers any forbidden offset.
        for bad_pos in forbidden_offsets:
            if start <= bad_pos < end:
                break
        else:
            yield m
        # Resume after the match, or one character later if overlapping
        # matches are wanted.
        if overlap:
            pos = start + 1
        else:
            pos = end
8<-------

HTH,
John

May 17 '07 #5
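
For readers who want to try the wrapper idea, here is a hedged sketch of a variant, not John's code. The name guarded_finditer is made up for illustration, and two behaviours are assumptions that differ from the post above: a rejected match is retried one character further along (so a long rejected match cannot hide a shorter acceptable one), and empty matches always advance the position so the loop cannot stall.

```python
import re

def guarded_finditer(pattern, text, forbidden_offsets, overlap=False):
    """Yield matches of pattern in text that cover no forbidden offset."""
    regex = re.compile(pattern)
    forbidden = sorted(set(forbidden_offsets))
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            return
        start, end = m.span()
        # The offsets covered by a match form the half-open range [start, end).
        blocked = any(start <= bad < end for bad in forbidden)
        if not blocked:
            yield m
        if blocked or overlap or end == start:
            # Retry one character later: after a rejection, when overlapping
            # matches are wanted, or when the match was empty.
            pos = start + 1
        else:
            pos = end

# Torsten's example: the "o" at offset 4 is escaped.
for m in guarded_finditer("o", "Hollo how are you", [4]):
    print(m.start(), m.group())
print(list(guarded_finditer("o$", "Hollo", [4])))   # [] because the only candidate is escaped
```

Because the search runs on the whole string via the pos argument, ^ and $ keep their usual meaning relative to the full text.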
On May 18, 6:50 am, James Stroud <jstr...@mbi.ucla.edu> wrote:
> def guarded_search(rgx, astring, escaped):
>     m = re.search(rgx, astring)
>     if m:
>         s = m.start()
>         e = m.end()
>         for i in escaped:
>             if s <= i <= e:

Did you mean to write

    if s <= i < e:

?

>                 m = None
>                 break
>     return m

Your guarded search fails if there is a match after the rightmost bad
position, i.e. it gives up at the first bad position.

My "guarded_search" (see separate post) needs the following done to
it:
1. make a copy
2. change name of copy to "guarded_searchall" or something similar
3. change "yield" to "return" in the original

Cheers,
John

May 17 '07 #6
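
The boundary question raised above comes down to m.end() being exclusive. The sketch below illustrates that, and shows one possible "first acceptable match" function that keeps looking after a rejection. The names covers and first_guarded_match are hypothetical, and retrying from start + 1 is an assumption about the wanted behaviour.

```python
import re

m = re.search("abc", "xyzabcxyz")
print(m.span())                  # (3, 6): the match covers offsets 3, 4 and 5
print("xyzabcxyz"[m.end()])      # x, the first character after the match

def covers(match, offset):
    """True if offset lies inside the matched text (half-open span)."""
    start, end = match.span()
    return start <= offset < end

def first_guarded_match(pattern, text, forbidden_offsets):
    """Return the first match that covers no forbidden offset, else None."""
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            return None
        if not any(covers(match, bad) for bad in forbidden_offsets):
            return match
        pos = match.start() + 1   # skip the rejected match and keep looking
    return None
```

With the test "s <= i <= e", an escaped position directly after a match would wrongly reject it; with the half-open test it does not.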
Here are two pyparsing-based routines, guardedSearch and
guardedSearchByColumn. The first uses a pyparsing parse action to
reject matches at a given string location, and returns a list of
tuples containing the string location and matched text. The second
uses an enhanced version of guardedSearch that uses the pyparsing
built-ins col and lineno to filter matches by column instead of by raw
string location, and returns a list of tuples of line and column of
the match location, and the matching text. (Note that string
locations are zero-based, while line and column numbers are 1-based.)

-- Paul
from pyparsing import Regex, ParseException, col, lineno

def guardedSearch(pattern, text, forbidden_offsets):

    def offsetValidator(strng, locn, tokens):
        # Reject any match that starts at a forbidden offset.
        if locn in forbidden_offsets:
            raise ParseException(strng, locn, "can't match at offset %d" % locn)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(tokStart, toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]

print(guardedSearch(u"o", u"Hollo how are you", [4,]))

def guardedSearchByColumn(pattern, text, forbidden_columns):

    def offsetValidator(strng, locn, tokens):
        # Reject any match that starts in a forbidden (1-based) column.
        if col(locn, strng) in forbidden_columns:
            raise ParseException(strng, locn, "can't match at offset %d" % locn)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(lineno(tokStart, text), col(tokStart, text), toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guardedSearchByColumn(";", text, [1,6,11,]))

Prints:
[(1, 'o'), (7, 'o'), (15, 'o')]
[(1, 13, ';'), (2, 2, ';'), (3, 12, ';'), (5, 5, ';')]

May 17 '07 #7
On May 18, 8:16 am, Paul McGuire <p...@austin.rr.com> wrote:
> On May 17, 4:06 pm, John Machin <sjmac...@lexicon.net> wrote:
>> On May 18, 6:00 am, Torsten Bronger <bron...@physik.rwth-aachen.de>
>> wrote:
>>> Hello!
>>> James Stroud writes:
>>>> Torsten Bronger wrote:
>>>>> I need some help with finding matches in a string that has some
>>>>> characters which are marked as escaped (in a separate list of
>>>>> indices). Escaped means that they must not be part of any match.

Note: "must not be *part of* any match" [my emphasis]

[big snip]

> Here are two pyparsing-based routines, guardedSearch and
> guardedSearchByColumn. The first uses a pyparsing parse action to
> reject matches at a given string location

Seems to be somewhat less like what the OP might have in mind ...

While we're waiting for clarification from the OP, there's a
chicken-and-egg thought that's been nagging me: if the OP knows so
much about the searched string that he can specify offsets which
search patterns should not span, why does he still need to search it?

Cheers,
John

May 17 '07 #8
On May 17, 6:12 pm, John Machin <sjmac...@lexicon.net> wrote:
> Note: "must not be *part of* any match" [my emphasis]

Ooops, my bad. See this version:

from pyparsing import Regex, ParseException, col, lineno, getTokensEndLoc

# fake (and inefficient) version of any if not yet upgraded to Py2.5
any = lambda lst : sum(list(lst)) > 0

def guardedSearch(pattern, text, forbidden_offsets):

    def offsetValidator(strng, locn, tokens):
        # Inclusive start and end offsets of the matched text.
        start, end = locn, getTokensEndLoc() - 1
        if any(start <= i <= end for i in forbidden_offsets):
            raise ParseException(strng, locn, "can't match at offset %d" % locn)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(tokStart, toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]

print(guardedSearch(r"o\S", u"Hollo how are you", [8,]))

def guardedSearchByColumn(pattern, text, forbidden_columns):

    def offsetValidator(strng, locn, tokens):
        # Inclusive start and end columns (1-based) of the matched text.
        start, end = col(locn, strng), col(getTokensEndLoc(), strng) - 1
        if any(start <= i <= end for i in forbidden_columns):
            raise ParseException(strng, locn, "can't match at col %d" % start)

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(lineno(tokStart, text), col(tokStart, text), toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guardedSearchByColumn("[fa];", text, [4,12,13,]))

Prints:
[(1, 'ol'), (15, 'ou')]
[(2, 1, 'a;'), (5, 10, 'f;')]

> While we're waiting for clarification from the OP, there's a
> chicken-and-egg thought that's been nagging me: if the OP knows so
> much about the searched string that he can specify offsets which
> search patterns should not span, why does he still need to search it?

I suspect that this is column/tabular data (a log file perhaps?), and
some columns are not interesting, but produce many false hits for the
search pattern.

-- Paul

May 17 '07 #9
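
The column-based filtering can also be approximated without pyparsing. The following is only a sketch using the standard re and bisect modules, not Paul's implementation. The name guarded_search_by_column is made up, and it assumes that a match is rejected when any character it covers sits in a forbidden 1-based column.

```python
import re
from bisect import bisect_right

def guarded_search_by_column(pattern, text, forbidden_columns):
    """Return (line, column, matched_text) for matches avoiding forbidden columns."""
    regex = re.compile(pattern)
    forbidden = set(forbidden_columns)
    # Offsets at which each line starts, for offset -> (line, column) lookup.
    line_starts = [0] + [m.end() for m in re.finditer("\n", text)]

    def line_col(offset):
        line = bisect_right(line_starts, offset)        # 1-based line number
        return line, offset - line_starts[line - 1] + 1

    results = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        start, end = m.span()
        if any(line_col(i)[1] in forbidden for i in range(start, end)):
            pos = start + 1          # rejected: retry one character later
            continue
        results.append(line_col(start) + (m.group(),))
        pos = end if end > start else start + 1
    return results

# Same sample data as in the posts above.
text = "\n".join([
    "alksjdflasjf;sa",
    "a;sljflsjlaj",
    ";asjflasfja;sf",
    "aslfj;asfj;dsf",
    "aslf;lajdf;ajsf",
    "aslfj;afsj;sd",
]) + "\n"
print(guarded_search_by_column("[fa];", text, [4, 12, 13]))
```

On this sample the sketch gives the same two hits as the pyparsing version above, (2, 1, 'a;') and (5, 10, 'f;').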
On May 18, 9:46 am, Paul McGuire <p...@austin.rr.com> wrote:
> On May 17, 6:12 pm, John Machin <sjmac...@lexicon.net> wrote:
>> Note: "must not be *part of* any match" [my emphasis]
>>
>> While we're waiting for clarification from the OP, there's a
>> chicken-and-egg thought that's been nagging me: if the OP knows so
>> much about the searched string that he can specify offsets which
>> search patterns should not span, why does he still need to search it?
>
> I suspect that this is column/tabular data (a log file perhaps?), and
> some columns are not interesting, but produce many false hits for the
> search pattern.

If so, why not split the record into fields and look only at the
interesting fields? Smells to me of yet another case of re
abuse/misuse ...
May 18 '07 #10
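
If the data really is tabular, the "split into fields" suggestion might look like the sketch below. The record layout, the ";" separator and the function name search_fields are all assumptions made for illustration; nothing in the thread says what the OP's data looks like.

```python
import re

def search_fields(pattern, record, wanted, sep=";"):
    """Search only the selected fields of a delimited record.

    wanted holds zero-based field indices; returns (field_index, match) pairs.
    """
    regex = re.compile(pattern)
    fields = record.split(sep)
    hits = []
    for index in wanted:
        if index < len(fields):
            hits.extend((index, m) for m in regex.finditer(fields[index]))
    return hits

record = "2007-05-17;error;disk error on /dev/sda"   # made-up sample record
for index, m in search_fields("error", record, wanted=[2]):
    print(index, m.span())    # the span is relative to the field, not the record
```

One side effect worth knowing: ^ and $ then anchor at field boundaries rather than at the ends of the whole record, which may or may not be what is wanted.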
