472,328 Members | 1,090 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,328 software developers and data experts.

RegEx conditional search and replace

Sorry, yet another REGEX question. I've been struggling with trying to get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

Jul 5 '06 #1
6 5660
Martin Evans wrote:
Sorry, yet another REGEX question. I've been struggling with trying to get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin
What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dead(inmatch):
instr = inmatch.group(0)
if needsfixing(instr):
return instr.replace('sleeping','dead')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

Jul 5 '06 #2
"Juho Schultz" <ju**********@pp.inet.fiwrote in message
news:11*********************@l70g2000cwa.googlegro ups.com...
Martin Evans wrote:
>Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dead(inmatch):
instr = inmatch.group(0)
if needsfixing(instr):
return instr.replace('sleeping','dead')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.
Many thanks for that, the function method looks very useful. My first
working attempt had been to use the regex to locate all links. I then looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links. It
just seems overly inefficient.

Jul 6 '06 #3
>>import SE
>>Editor = SE.SE ('sleeping=dead sleeping.htm== sleeping<==')
Editor ('This parrot <a href="sleeping.htm" target="new">is
sleeping</a>. Really, it is sleeping.'
'This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
is dead.'
Or:
>>Editor ( (name of htm file), (name of output file) )
Usage: You make an explicit list of what you want and don't want after
identifying the distinctions.

I am currently trying to upload SE to the Cheese Shop which seems to be
quite a procedure. So far I have only been successful uploading the
description, but not the program. Gudidance welcome. In the interim I can
send SE out individually by request.

Regards

Frederic

----- Original Message -----
From: "Martin Evans" <ma****@browns-nospam.co.uk>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Wednesday, July 05, 2006 1:34 PM
Subject: RegEx conditional search and replace

Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

--
http://mail.python.org/mailman/listinfo/python-list
Jul 6 '06 #4
On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:
"Juho Schultz" <ju**********@pp.inet.fiwrote in message
news:11*********************@l70g2000cwa.googlegro ups.com...
>Martin Evans wrote:
>>Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dead(inmatch):
instr = inmatch.group(0)
if needsfixing(instr):
return instr.replace('sleeping','dead')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

... My first
working attempt had been to use the regex to locate all links. I then looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links. It
just seems overly inefficient.
Someone may have made use of
multiline links:

_________________________
This parrot
<a
href="sleeping.htm"
target="new" >
is sleeping
</a>.
Really, it is sleeping.
_________________________
In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string. You'll also have to make it possible
for non-printing characters to have been inserted in all sorts
of ways around the '>' and '<' and 'a' or 'A'
characters using ' *' here and there in he regex.

Jul 6 '06 #5
"mbstevens" <NO***********@XmbstevensX.comwrote in message
news:pa****************************@XmbstevensX.co m...
On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:
>"Juho Schultz" <ju**********@pp.inet.fiwrote in message
news:11*********************@l70g2000cwa.googlegr oups.com...
>>Martin Evans wrote:
Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping.htm" target="new">is sleeping</a>.
Really,
it
is sleeping.
to
This parrot <a href="sleeping.htm" target="new">is sleeping</a>.
Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text,
the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more
than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping
to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dead(inmatch):
instr = inmatch.group(0)
if needsfixing(instr):
return instr.replace('sleeping','dead')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

... My first
working attempt had been to use the regex to locate all links. I then
looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links.
It
just seems overly inefficient.

Someone may have made use of
multiline links:

_________________________
This parrot
<a
href="sleeping.htm"
target="new" >
is sleeping
</a>.
Really, it is sleeping.
_________________________
In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string. You'll also have to make it possible
for non-printing characters to have been inserted in all sorts
of ways around the '>' and '<' and 'a' or 'A'
characters using ' *' here and there in he regex.
I agree, but luckily in this case the HREF will always be formated the same
as it happens to be inserted by another automated system, not a user. Ok it
be changed by the server but AFAIK it hasn't for the last 6 years. The text
in question apart from the links should be plain (not even <bis allowed I
think).

I'd read about back and forward regex matching and thought it might somehow
be of use here ie back search for the "<A" and forward search for the "</A>"
but of course this would easily match in between two links.

Jul 6 '06 #6

mbstevens wrote:
In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string.
In general it's a very hard problem to do stateful regexes.

I recall something from last year about the new Perl implementation
that tried to address this sort of problem. But I may have been
reading old docs and it could have been done years ago.

Parsing the HTML would be the only sure way to accomplish
it. Let something that already knows the hierarchy tell you
that you're entering a URL and you can skip past all of its
recursive inclusions of strings with URLs with strings that
have URLs and so on...

Of course, that means reconstructing the HTML from the
parse tree afterward...

--Blair

Jul 6 '06 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

12
by: Peter Kleiweg | last post by:
I want to use regular expressions with less typing. Like this: A / 'b.(..)' # test for regex 'b...' in A A # get the last whole...
3
by: Jon Maz | last post by:
Hi All, Am getting frustrated trying to port the following (pretty simple) function to CSharp. The problem is that I'm lousy at Regular...
7
by: bill tie | last post by:
I'd appreciate it if you could advise. 1. How do I replace "\" (backslash) with anything? 2. Suppose I want to replace (a) every occurrence...
3
by: Craig Buchanan | last post by:
Is there a way to combine these two Replace into a single line? Regex.Replace(Subject, "\&", "&amp;") Regex.Replace(Subject, "\'", "&apos;") ...
8
by: Just Me | last post by:
I want to use regular expressions to search a string, give the user the option of replacing, and then maybe replacing the data - using reg...
6
by: Sačo Zagoranski | last post by:
Hi, could someone help me putting together a regex expression for my problem? I need my search filter to treat spaces and commas in the query...
2
by: Alex Maghen | last post by:
This is a bit of an abuse of this group. Just a nit, but I'm hoping someone really good with Regular Expressions can help me out here. I need to...
4
by: MooMaster | last post by:
I'm trying to develop a little script that does some string manipulation. I have some few hundred strings that currently look like this: ...
2
by: Jeremy | last post by:
I am trying to replace a string "P" with "\P" as long as the string "P" does not already have a "\" in front. for my search string I've used...
0
by: tammygombez | last post by:
Hey fellow JavaFX developers, I'm currently working on a project that involves using a ComboBox in JavaFX, and I've run into a bit of an issue....
0
by: concettolabs | last post by:
In today's business world, businesses are increasingly turning to PowerApps to develop custom business applications. PowerApps is a powerful tool...
0
better678
by: better678 | last post by:
Question: Discuss your understanding of the Java platform. Is the statement "Java is interpreted" correct? Answer: Java is an object-oriented...
0
by: CD Tom | last post by:
This only shows up in access runtime. When a user select a report from my report menu when they close the report they get a menu I've called Add-ins...
0
by: Naresh1 | last post by:
What is WebLogic Admin Training? WebLogic Admin Training is a specialized program designed to equip individuals with the skills and knowledge...
0
jalbright99669
by: jalbright99669 | last post by:
Am having a bit of a time with URL Rewrite. I need to incorporate http to https redirect with a reverse proxy. I have the URL Rewrite rules made...
0
by: antdb | last post by:
Ⅰ. Advantage of AntDB: hyper-convergence + streaming processing engine In the overall architecture, a new "hyper-convergence" concept was...
0
by: Matthew3360 | last post by:
Hi there. I have been struggling to find out how to use a variable as my location in my header redirect function. Here is my code. ...
0
by: AndyPSV | last post by:
HOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and on my computerHOW CAN I CREATE AN AI with an .executable...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.