RegEx conditional search and replace

Martin Evans

Sorry, yet another REGEX question. I've been struggling with trying to get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

Jul 5 '06 #1

Subscribe Reply

5896

Juho Schultz

Martin Evans wrote:

Sorry, yet another REGEX question. I've been struggling with trying to get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dea d(inmatch):
instr = inmatch.group(0 )
if needsfixing(ins tr):
return instr.replace(' sleeping','dead ')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleepi ng/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

Jul 5 '06 #2

Martin Evans

"Juho Schultz" <ju**********@p p.inet.fiwrote in message
news:11******** *************@l 70g2000cwa.goog legroups.com...

Martin Evans wrote:
>Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
it
is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dea d(inmatch):
instr = inmatch.group(0 )
if needsfixing(ins tr):
return instr.replace(' sleeping','dead ')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleepi ng/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

Many thanks for that, the function method looks very useful. My first
working attempt had been to use the regex to locate all links. I then looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links. It
just seems overly inefficient.

Jul 6 '06 #3

Anthra Norell

>>import SE

>>Editor = SE.SE ('sleeping=dead sleeping.htm== sleeping<==')
Editor ('This parrot <a href="sleeping. htm" target="new">is

sleeping</a>. Really, it is sleeping.'
'This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
is dead.'
Or:

>>Editor ( (name of htm file), (name of output file) )

Usage: You make an explicit list of what you want and don't want after
identifying the distinctions.

I am currently trying to upload SE to the Cheese Shop which seems to be
quite a procedure. So far I have only been successful uploading the
description, but not the program. Gudidance welcome. In the interim I can
send SE out individually by request.

Regards

Frederic

----- Original Message -----
From: "Martin Evans" <ma****@brown s-nospam.co.uk>
Newsgroups: comp.lang.pytho n
To: <py*********@py thon.org>
Sent: Wednesday, July 05, 2006 1:34 PM
Subject: RegEx conditional search and replace

Sorry, yet another REGEX question. I've been struggling with trying to

get

a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,

is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,

is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to

other

areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

--
http://mail.python.org/mailman/listinfo/python-list

Jul 6 '06 #4

mbstevens

On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:

"Juho Schultz" <ju**********@p p.inet.fiwrote in message
news:11******** *************@l 70g2000cwa.goog legroups.com...
>Martin Evans wrote:
>>Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
it
is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text, the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(regex , replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dea d(inmatch):
instr = inmatch.group(0 )
if needsfixing(ins tr):
return instr.replace(' sleeping','dead ')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleep ing/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

... My first
working attempt had been to use the regex to locate all links. I then looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links. It
just seems overly inefficient.

Someone may have made use of
multiline links:

_______________ __________
This parrot
<a
href="sleeping. htm"
target="new" >
is sleeping
</a>.
Really, it is sleeping.
_______________ __________
In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string. You'll also have to make it possible
for non-printing characters to have been inserted in all sorts
of ways around the '>' and '<' and 'a' or 'A'
characters using ' *' here and there in he regex.

Jul 6 '06 #5

Martin Evans

"mbstevens" <NO***********@ XmbstevensX.com wrote in message
news:pa******** *************** *****@Xmbsteven sX.com...

On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:

>"Juho Schultz" <ju**********@p p.inet.fiwrote in message
news:11******* **************@ l70g2000cwa.goo glegroups.com.. .
>>Martin Evans wrote:
Sorry, yet another REGEX question. I've been struggling with trying to
get
a regular expression to do the following example in Python:

Search and replace all instances of "sleeping" with "dead".

This parrot is sleeping. Really, it is sleeping.
to
This parrot is dead. Really, it is dead.
But not if part of a link or inside a link:

This parrot <a href="sleeping. htm" target="new">is sleeping</a>.
Really,
it
is sleeping.
to
This parrot <a href="sleeping. htm" target="new">is sleeping</a>.
Really,
it
is dead.
This is the full extent of the "html" that would be seen in the text,
the
rest of the page has already been processed. Luckily I can rely on the
formating always being consistent with the above example (the url will
normally by much longer in reality though). There may though be more
than
one link present.

I'm hoping to use this to implement the automatic addition of links to
other
areas of a website based on keywords found in the text.

I'm guessing this is a bit too much to ask for regex. If this is the
case,
I'll add some more manual Python parsing to the string, but was hoping
to
use it to learn more about regex.

Any pointers would be appreciated.

Martin

What you want is:

re.sub(rege x, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dea d(inmatch):
instr = inmatch.group(0 )
if needsfixing(ins tr):
return instr.replace(' sleeping','dead ')
else:
return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.slee ping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.

... My first
working attempt had been to use the regex to locate all links. I then
looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links.
It
just seems overly inefficient.

Someone may have made use of
multiline links:

_______________ __________
This parrot
<a
href="sleeping. htm"
target="new" >
is sleeping
</a>.
Really, it is sleeping.
_______________ __________
In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string. You'll also have to make it possible
for non-printing characters to have been inserted in all sorts
of ways around the '>' and '<' and 'a' or 'A'
characters using ' *' here and there in he regex.

I agree, but luckily in this case the HREF will always be formated the same
as it happens to be inserted by another automated system, not a user. Ok it
be changed by the server but AFAIK it hasn't for the last 6 years. The text
in question apart from the links should be plain (not even <bis allowed I
think).

I'd read about back and forward regex matching and thought it might somehow
be of use here ie back search for the "<A" and forward search for the "</A>"
but of course this would easily match in between two links.

Jul 6 '06 #6

Blair P. Houghton

mbstevens wrote:

In such a case you may need to make the page
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string.

In general it's a very hard problem to do stateful regexes.

I recall something from last year about the new Perl implementation
that tried to address this sort of problem. But I may have been
reading old docs and it could have been done years ago.

Parsing the HTML would be the only sure way to accomplish
it. Let something that already knows the hierarchy tell you
that you're entering a URL and you can skip past all of its
recursive inclusions of strings with URLs with strings that
have URLs and so on...

Of course, that means reconstructing the HTML from the
parse tree afterward...

--Blair

Jul 6 '06 #7

Similar topics

1987

regex into str

by: Peter Kleiweg | last post by:

I want to use regular expressions with less typing. Like this: A / 'b.(..)' # test for regex 'b...' in A A # get the last whole match A # get the first group in the last match A /= 'b.','X',1 # replace first occurence of regex 'b.' # in A with 'X' A /= 'b.','X' # replace all occurences of regex 'b.' # in A with 'X'

Python

2083

Translating JavaScript function with Regex to CSharp

by: Jon Maz | last post by:

Hi All, Am getting frustrated trying to port the following (pretty simple) function to CSharp. The problem is that I'm lousy at Regular Expressions.... //from http://support.microsoft.com/default.aspx?scid=kb;EN-US;246800 function fxnParseIt() { var sInputString = 'asp and database';

.NET Framework

2616

any regex gurus out there?

by: bill tie | last post by:

I'd appreciate it if you could advise. 1. How do I replace "\" (backslash) with anything? 2. Suppose I want to replace (a) every occurrence of characters "a", "b", "c", "d" with "x", (b) every occurrence of characters "p", "q", "r", "s" with "y". Right now, I do it as follows:

C# / C Sharp

8259

multiple search/replacements in a single Regex.Replace?

by: Craig Buchanan | last post by:

Is there a way to combine these two Replace into a single line? Regex.Replace(Subject, "\&", "&") Regex.Replace(Subject, "\'", "'") Perhaps Regex.Replace(Subject, "{\&|\'}", "{&|'}") Thanks, Craig

Visual Basic .NET

1983

Regex questions

by: Just Me | last post by:

I want to use regular expressions to search a string, give the user the option of replacing, and then maybe replacing the data - using reg expressions for the search and the replace strings. However all the Regex replace methods seem to combine in one call the search and replace. Is there a way of doing what I want?

Visual Basic .NET

3959

regex question

by: Sa¹o Zagoranski | last post by:

Hi, could someone help me putting together a regex expression for my problem? I need my search filter to treat spaces and commas in the query the same way no matter how many there are... Something like this: If a user enter "song, song" the filter has to find: song song

C# / C Sharp

2367

Regex Help?

by: Alex Maghen | last post by:

This is a bit of an abuse of this group. Just a nit, but I'm hoping someone really good with Regular Expressions can help me out here. I need to write a regular expression that will do the following: Search a whole blob of text (including newlines and everything). Find any text enclosed in brackets (), and replace it with a string I provide and then concatenate the text that had been enclosed in the brakets, without the brakets.

ASP.NET

1920

Regex help...pretty please?

by: MooMaster | last post by:

I'm trying to develop a little script that does some string manipulation. I have some few hundred strings that currently look like this: cond(a,b,c) and I want them to look like this: cond(c,a,b)

Python

1525

regex replace problem.

by: Jeremy | last post by:

I am trying to replace a string "P" with "\P" as long as the string "P" does not already have a "\" in front. for my search string I've used @"(P)" so my regex replace statement is: System.Text.RegularExpressions.Regex pR = new System.Text.RegularExpressions.Regex(@"(P)"); string strResult = pR.Replace(@"This is a XP test X\P", @"\P"); The poblem is that the statement replaces XP with \P where I want it to be

.NET Framework

8888

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

8752

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

9257

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9176

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

8097

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

6011

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4784

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2635

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

2157

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General