RegEx with multiple occurrences

Hi again.

I'm trying to strip all script blocks from HTML, and am using the
following re to do it:

p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
m = p.search(data)

The problem is that I'm getting everything from the 1st script's start
tag to the last script's end tag in one group - so it seems like it
parses the string from both ends therefore removing far more from that
data than I want. What am I doing wrong?

May 4 '06 #1
> p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
> m = p.search(data)

First, I presume you didn't copy & paste your expression, as
it looks like you're missing a period before the second
asterisk. Otherwise, all you'd get is any number of
greater-than signs followed by a closing "</script>" tag.

Second, you're likely getting some foobar results because
you're not using a "raw" string of the form

r'(\<script...script>)'
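
As a rough illustration of why the raw-string prefix matters, here is a minimal sketch using "\b" rather than the pattern from the question (the sample text and variable names are made up for the example). In an ordinary string literal "\b" becomes a backspace character before the re module ever sees it; in a raw string it stays as a backslash plus "b", which re reads as a word boundary.

```python
import re

# Ordinary literal: Python turns each "\b" into a single backspace character.
plain = '\bcat\b'
# Raw literal: the backslashes survive, so re sees the word-boundary escape.
raw = r'\bcat\b'

text = 'the cat sat'  # made-up sample text

plain_hit = re.search(plain, text)  # looks for backspace + "cat" + backspace
raw_hit = re.search(raw, text)      # looks for "cat" as a whole word

print(len(plain), len(raw))
print(plain_hit, raw_hit.group())
```

The two patterns look the same in source code but are different strings, so only the raw one finds the word.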

> The problem is that I'm getting everything from the 1st
> script's start tag to the last script's end tag in one
> group - so it seems like it parses the string from both
> ends therefore removing far more from that data than I
> want. What am I doing wrong?


Looks like you want the non-greedy modifier to the "*"
described at

http://docs.python.org/lib/re-syntax.html

(searching the page for "greedy" should turn up the
paragraph on the modifiers)

You likely want something more like:

r'<script[^>]*>.*?</script>'

In the first atom, you're looking for the remainder of the
script tag (as much stuff that isn't a ">" as possible).
Then you close the tag with the ">", and then take as little
as possible (".*?") of anything until you find the closing
"</script>" tag.
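
A minimal sketch of the difference, assuming a made-up "data" string with two script blocks (the sample HTML and the variable names are illustrative, not from the original post). It compares the greedy ".*" with the non-greedy ".*?" and then uses sub() to strip each block:

```python
import re

# Hypothetical sample input with two separate script blocks.
data = (
    "<p>before</p>"
    "<script type='text/javascript'>var a = 1;</script>"
    "<p>middle</p>"
    "<SCRIPT>\nvar b = 2;\n</SCRIPT>"
    "<p>after</p>"
)

# Same flags as in the question: ignore case, and let "." match newlines.
flags = re.IGNORECASE | re.DOTALL

greedy = re.compile(r'<script[^>]*>.*</script>', flags)   # ".*" grabs as much as it can
lazy = re.compile(r'<script[^>]*>.*?</script>', flags)    # ".*?" grabs as little as it can

greedy_matches = greedy.findall(data)  # one match, first start tag to last end tag
lazy_matches = lazy.findall(data)      # one match per script block

# Replace every script block with an empty string.
stripped = lazy.sub('', data)

print(len(greedy_matches), len(lazy_matches))
print(stripped)
```

With this sample the greedy pattern swallows the "middle" paragraph along with both scripts, while the non-greedy one removes only the two blocks. A regexp like this is only a quick fix for simple, well-formed input; it knows nothing about script text that itself contains "</script>".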

HTH,

-tkc


May 4 '06 #2
Tim - you're a legend. Thanks.

May 4 '06 #3
> Tim - you're a legend. Thanks.

A leg-end? I always knew something was a-foot. Sorry to
make myself the butt of such joking. :)

My pleasure...glad it seems to be working for you.

-tkc (not much of a legend at all...just a regexp wonk)


May 4 '06 #4
