
Trying to find regex for any script in an html source

Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by one or
more characters of any kind, then ' src=' followed by one or more
characters of any kind, and finally '>'.

What am I doing wrong?
Thanks.

Dec 21 '05 #1
4 Replies


28tommy wrote:
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by one or
more characters of any kind, then ' src=' followed by one or more
characters of any kind, and finally '>'.

What am I doing wrong?


Several things.
First, re.DOTALL is a flag, a _parameter_ to be passed to
the compile function, not something you stick inside the RE
itself. Inside square brackets it is read as a set of literal
characters, not as a flag:
re.compile('<script .+ src=.+>',re.DOTALL)
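A minimal sketch of the first point, assuming Python 3 and made-up test strings (the variable names and sample text are illustrative, not from the thread). It shows that [re.DOTALL] only matches the literal characters listed between the brackets, and that passing the flag as the second argument is what lets "." cross a line break:

```python
import re

# Inside [...] the text "re.DOTALL" is only a set of literal characters
# (r, e, ., D, O, T, A, L); it does not turn on any flag.
char_class = re.compile('[re.DOTALL]+')
matches_letters = char_class.fullmatch('DOT.ALL') is not None
matches_script = char_class.fullmatch('script') is not None  # 's' is not in the set

# Passed as the second argument, re.DOTALL lets "." match newlines as well.
text = 'first line\nsecond line'
without_flag = re.compile('first.+second').search(text)
with_flag = re.compile('first.+second', re.DOTALL).search(text)

print(matches_letters, matches_script)
print(without_flag, with_flag is not None)
```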

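Second, this won't match your example above, because ' src='
requires a literal space before src, and in your sample src
follows a line break. It would also fail when src appears
immediately after script. So you probably want
something like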
re.compile('<script .*src=.+>',re.DOTALL)

Third, IIRC * and + are _greedy_ by default, which means they
will "eat up" as many characters as possible, so on a page with
several tags the match runs through to the last '>'. Try it and
see what I mean. The solution is to use the non-greedy variants,
*? and +?
re.compile('<script .*?src=.+?>',re.DOTALL)
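The difference between the greedy and non-greedy patterns above can be seen in a short sketch. It assumes Python 3; the sample page text is made up for illustration, with the second tag's URL borrowed from elsewhere in this thread.

```python
import re

# Sample page with two script tags (illustrative data, not a real page).
html = (
    '<script language="JavaScript1.2"\n'
    'src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"\n'
    'type="text/javascript"></script>\n'
    '<p>some text</p>\n'
    '<script src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"></script>\n'
)

greedy = re.compile('<script .*src=.+>', re.DOTALL)
lazy = re.compile('<script .*?src=.+?>', re.DOTALL)

greedy_hits = greedy.findall(html)
lazy_hits = lazy.findall(html)

# The greedy pattern swallows everything from the first "<script "
# through the last ">" as a single match.
print(len(greedy_hits))  # 1

# The non-greedy pattern stops at the first ">" after each src=,
# so each opening tag is matched separately.
print(len(lazy_hits))  # 2
for hit in lazy_hits:
    print(hit)
```

Even the non-greedy version is only a rough filter: it would still stop early at a '>' inside an attribute value, for instance.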

All this and more at
http://docs.python.org/lib/module-re.html
and, I'm sure, several online tutorials. To RTFM is never a
bad idea.
Dec 21 '05 #2

"28tommy" <28*****@gmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegroups.com...
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
<snip>


28tommy -

pyparsing includes a built-in HTML tag definition method that handles tag
attributes automatically. You can also tell pyparsing to *not* accept tags
found inside HTML comments, something not so easy using re's (your target
HTML pages may not have comments, so I don't know if this is of much interest
to you). Finally, accessing the results is very easy, especially for
getting at the values of attributes defined in the opening tag. See the
following example.

Note - pyparsing is considered by some to be "way overkill" for simple HTML
scraping, and is probably 20-100X slower than regular expressions. But as
quick text processing and extraction tools go, it's pretty easy to put
together fairly complex match expressions, without the noisy typography of
regular expressions.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul
from pyparsing import *

data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# next three lines define grammar for <script> and </script>,
# plus arbitrary HTML attributes on <script>, plus detection and
# ignoring of any matching expression that might be found inside
# an HTML comment
scriptStart,scriptEnd = makeHTMLTags("script")
expr = scriptStart + scriptEnd
expr.ignore(htmlComment)

# use the grammar to scan the data string
# for each match, return matching tokens as a ParseResults object
# - supports list-, dictionary-, and object-style token access
for toks,start,end in expr.scanString(data):
    print toks.startScript
    print toks.startScript[0]
    print toks.startScript.keys()
    print "src =", toks.startScript["src"]
    print "src =", toks.startScript.src
    print
====================
['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js

['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js
Dec 21 '05 #3

"28tommy" <28*****@gmail.com> writes:
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by one or
more characters of any kind, then ' src=' followed by one or more
characters of any kind, and finally '>'.

What am I doing wrong?


Trying to use an RE to parse HTML. While possible, it's not nearly as
easy as it looks, and there are lots of gotchas.

Paul has already pointed out that pyparsing comes with an HTML parser. If
your HTML is well-formed, you can use HTMLParser in the standard
library. If your HTML comes from the web at large (meaning much of it
was written by the people who handed in code that didn't compile for
their programming assignments), you'll want to try something like
BeautifulSoup.
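For the standard-library route, here is a rough sketch assuming Python 3, where the HTMLParser class lives in the html.parser module. The class name ScriptSrcCollector and the sample page are hypothetical, made up for illustration. It collects the src attribute of every script tag and, because the parser handles comments separately, skips tags that are commented out:

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of each <script> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Illustrative page text based on the samples earlier in this thread.
page = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

collector = ScriptSrcCollector()
collector.feed(page)
collector.close()

for src in collector.sources:
    print(src)
```

Inline scripts without a src attribute are simply ignored by this sketch.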

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 24 '05 #4

Thank you all.

Dec 25 '05 #5
