> p = re.compile("(\<script.*>*\</script>)",re.IGNORECASE | re.DOTALL)
> m = p.search(data)
First, I presume you didn't copy & paste your expression, as
it looks like you're missing a period before the second
asterisk. Otherwise, all you'd get is any number of
greater-than signs followed by a closing "</script>" tag.
Second, you're likely getting some odd results because
you're not using a raw string of the form
r'(\<script...script>)'
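As a rough illustration of why the r'' prefix matters, here is a
small sketch. The pattern and the test string are made up for the
example, not taken from your code. In a plain string Python turns
"\b" into a backspace character before the re module ever sees it,
while the raw string passes the backslash through untouched:

```python
import re

# Hypothetical patterns, chosen only to show the difference.
plain = '\bscript\b'    # "\b" becomes a backspace character (0x08)
raw = r'\bscript\b'     # "\b" reaches re as a word-boundary assertion

# The plain pattern looks for literal backspace characters.
plain_hit = re.search(plain, '<script>')
# The raw pattern looks for "script" as a whole word.
raw_hit = re.search(raw, '<script>')

print(len(plain), len(raw))   # 8 10
print(plain_hit)              # None
print(raw_hit.group(0))       # script
```

Not every backslash sequence gets rewritten like this, but using
raw strings for every pattern means you never have to check.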
> The problem is that I'm getting everything from the 1st
> script's start tag to the last script's end tag in one
> group - so it seems like it parses the string from both
> ends therefore removing far more from that data than I
> want. What am I doing wrong?
Looks like you want the non-greedy modifier to the "*"
described at
http://docs.python.org/lib/re-syntax.html
(searching the page for "greedy" should turn up the
paragraph on the modifiers)
You likely want something more like:
r'<script[^>]*>.*?</script>'
In the first atom, you're looking for the remainder of the
script tag (as much stuff that isn't a ">" as possible).
Then you close the tag with the ">", and then take as little
as possible (".*?") of anything until you find the closing
"</script>" tag.
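Here's a minimal sketch of the difference, assuming some made-up
sample data with two script blocks (the contents of "data" below
are hypothetical, as are the variable names). The flags are the
ones from your original compile() call:

```python
import re

# Hypothetical sample input containing two separate script blocks.
data = ('<p>a</p><script type="text/javascript">one()</script>'
        '<p>b</p><SCRIPT>two()</SCRIPT><p>c</p>')

flags = re.IGNORECASE | re.DOTALL

# Greedy ".*" runs to the LAST closing tag in the data.
greedy = re.compile(r'<script[^>]*>.*</script>', flags)
# Non-greedy ".*?" stops at the FIRST closing tag it can.
lazy = re.compile(r'<script[^>]*>.*?</script>', flags)

greedy_match = greedy.search(data).group(0)
lazy_matches = lazy.findall(data)

# Removing each script block individually leaves the rest intact.
stripped = lazy.sub('', data)

print(greedy_match)   # one match spanning both script blocks
print(lazy_matches)   # two matches, one per script block
print(stripped)       # <p>a</p><p>b</p><p>c</p>
```

The greedy version swallows the "<p>b</p>" sitting between the two
blocks, which is the behaviour you described. The non-greedy one
returns each block on its own.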
HTH,
-tkc