RegEx with multiple occurrences

Mike

Hi again.

I'm trying to strip all script blocks from HTML, and am using the
following re to do it:

p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
m = p.search(data)

The problem is that I'm getting everything from the 1st script's start
tag to the last script's end tag in one group - so it seems like it
parses the string from both ends therefore removing far more from that
data than I want. What am I doing wrong?

May 4 '06 #1

Tim Chase

> p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
> m = p.search(data)

First, I presume you didn't copy & paste your expression, as
it looks like you're missing a period before the second
asterisk. Otherwise, that part of the pattern matches only any
number of greater-than signs followed by a closing "</script>" tag.

Second, you're likely getting some foobar results because
you're not using a "raw" string of the form

r'(\<script...script>)'

> The problem is that I'm getting everything from the 1st
> script's start tag to the last script's end tag in one
> group - so it seems like it parses the string from both
> ends therefore removing far more from that data than I
> want. What am I doing wrong?

Looks like you want the non-greedy modifier to the "*"
described at

http://docs.python.org/lib/re-syntax.html

(searching the page for "greedy" should turn up the
paragraph on the modifiers)
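
To see the difference between the greedy and non-greedy forms, here is a minimal sketch. The sample string `data` is made up for illustration, and the patterns are cut down to the part that matters.

```python
import re

# Hypothetical sample input with two separate script blocks.
data = "<script>a()</script><p>keep</p><script>b()</script>"

flags = re.IGNORECASE | re.DOTALL

# Greedy: ".*" takes as much as it can, so the match runs from the
# first opening tag all the way to the last closing tag.
greedy = re.search(r"<script.*</script>", data, flags)

# Non-greedy: ".*?" takes as little as it can, so the match stops at
# the first closing tag it reaches.
lazy = re.search(r"<script.*?</script>", data, flags)

print(greedy.group(0))
print(lazy.group(0))
```

The greedy match swallows the `<p>keep</p>` in the middle, which is the "far more than I want" behaviour described in the question.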

You likely want something more like:

r'<script[^>]*>.*?</script>'

In the first atom, you're looking for the remainder of the
script tag (as much stuff that isn't a ">" as possible).
Then you close the tag with the ">", and then take as little
as possible (".*?") of anything until you find the closing
"</script>" tag.
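
Put together, stripping every script block might look like the sketch below. The helper name `strip_scripts` and the `sample` string are invented for this example. It also assumes the scripts never contain a literal "</script>" inside a string, which a regexp cannot tell apart from the real closing tag.

```python
import re

# Raw string, non-greedy body. IGNORECASE also catches <SCRIPT>, and
# DOTALL lets "." cross newlines inside the script body.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    # sub() replaces all non-overlapping matches, not just the first,
    # so there is no need to loop over search() results.
    return SCRIPT_RE.sub("", html)


# Hypothetical input: two script blocks with text to keep around them.
sample = (
    '<p>one</p><SCRIPT type="text/javascript">\nalert(1);\n</SCRIPT>'
    "<p>two</p><script>x = 2 > 1;</script><p>three</p>"
)

blocks = SCRIPT_RE.findall(sample)  # each script block as its own string
cleaned = strip_scripts(sample)
print(len(blocks))
print(cleaned)
```

Because the pattern has no capturing group, `findall()` returns the full text of each match, which is handy for checking what is about to be removed.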

HTH,

-tkc

May 4 '06 #2

Mike

Tim - you're a legend. Thanks.

May 4 '06 #3

Tim Chase

> Tim - you're a legend. Thanks.

A leg-end? I always knew something was a-foot. Sorry to
make myself the butt of such joking. :)

My pleasure...glad it seems to be working for you.

-tkc (not much of a legend at all...just a regexp wonk)

May 4 '06 #4
