Trying to find regex for any script in an html source

Hi,
I'm trying to find scripts in the HTML source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by any
number of characters of any type, then ' src=' followed by any number
of characters of any type, and finally '>'.

What am I doing wrong?
Thanks.

Dec 21 '05 #1
28tommy wrote:
What am I doing wrong?


Several things.
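First, re.DOTALL is a flag, a _parameter_ to be passed to
the compile function, not something you stick inside the RE
itself. Written as [re.DOTALL] it is just a character class
that matches one of the literal characters r, e, ., D, O, T, A, L: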
re.compile('<script .+ src=.+>',re.DOTALL)

Second, this won't match your example above, because there
src= is preceded by a newline rather than a space (and on
other pages src may come right after script). So you probably
want something like
re.compile('<script .*src=.+>',re.DOTALL)

Third, IIRC * and + are _greedy_ by default; this means they
will "eat up" as many characters as possible. Try and see
what I mean. The solution is to use the non-greedy variants,
that is *? and +?
re.compile('<script .*?src=.+?>',re.DOTALL)

All this and more at
http://docs.python.org/lib/module-re.html
and, I'm sure, several online tutorials. To RTFM is never a
bad idea.
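
The three points above can be checked with a short script. This is only a sketch for Python 3: the sample string is the tag from the question with one extra `<p>` element appended (an assumption, added so the greedy behaviour is visible), and the variable names are made up for illustration.

```python
import re

# The sample tag from the question, followed by one more element so the
# effect of greedy matching becomes visible (the <p> line is an assumption).
html = '''<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>
<p>unrelated text</p>
'''

# [re.DOTALL] is a character class: one of the characters r, e, ., D, O, T, A, L.
original = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

# Flag passed to compile(), but a literal space is still required before src=.
strict = re.compile(r'<script .+ src=.+>', re.DOTALL)

# Same idea without the required space: greedy and non-greedy variants.
greedy = re.compile(r'<script .*src=.+>', re.DOTALL)
lazy = re.compile(r'<script .*?src=.+?>', re.DOTALL)

original_result = original.search(html)
strict_result = strict.search(html)
greedy_text = greedy.search(html).group(0)
lazy_text = lazy.search(html).group(0)

print(original_result)   # no match object for the original pattern
print(strict_result)     # no match either: src= follows a newline here
print(repr(greedy_text)) # runs on to the last '>' in the string
print(repr(lazy_text))   # stops at the '>' that closes the opening tag
```

Comparing the last two lines of output shows why the non-greedy form is usually what you want when a page contains more than one tag.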
Dec 21 '05 #2
"28tommy" <28*****@gmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegroups.com...
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
<snip>


28tommy -

pyparsing includes a built-in HTML tag definition method that handles tag
attributes automatically. You can also tell pyparsing to *not* accept tags
found inside HTML comments, something not so easy using re's (your target
HTML pages may not have comments, so I don't know if this is of much interest
to you). Finally, accessing the results is very easy, especially for
getting at the values of attributes defined in the opening tag. See the
following example.

Note - pyparsing is considered by some to be "way overkill" for simple HTML
scraping, and is probably 20-100X slower than regular expressions. But as
quick text processing and extraction tools go, it's pretty easy to put
together fairly complex match expressions, without the noisy typography of
regular expressions.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul
from pyparsing import makeHTMLTags, htmlComment

data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# next three lines define grammar for <script> and </script>,
# plus arbitrary HTML attributes on <script>, plus detection and
# ignoring of any matching expression that might be found inside
# an HTML comment
scriptStart,scriptEnd = makeHTMLTags("script")
expr = scriptStart + scriptEnd
expr.ignore(htmlComment)

# use the grammar to scan the data string
# for each match, return matching tokens as a ParseResults object
# - supports list-, dictionary-, and object-style token access
for toks, start, end in expr.scanString(data):
    print(toks.startScript)
    print(toks.startScript[0])
    print(list(toks.startScript.keys()))
    print("src =", toks.startScript["src"])
    print("src =", toks.startScript.src)
    print()
====================
['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js

['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js
Dec 21 '05 #3
"28tommy" <28*****@gmail.com> writes:
What am I doing wrong?


Trying to use an RE to parse HTML. While possible, it's not nearly as
easy as it looks, and there are lots of gotchas.

Paul has already pointed out that pyparsing comes with an HTML tag parser. If
your HTML is well-formed, you can use HTMLParser in the standard
library. If your HTML comes from the web at large (meaning much of it
was written by the people who handed in code that didn't compile for
their programming assignments), you'll want to try something like
BeautifulSoup.
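
For the standard-library route, a minimal sketch might look like the following. It assumes Python 3, where the class lives in `html.parser`; the `ScriptSrcCollector` name and the sample page are made up for illustration, with the sample reusing the tags quoted earlier in the thread.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs with the quotes already removed.
        if tag == "script":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


page = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

collector = ScriptSrcCollector()
collector.feed(page)
collector.close()

for src in collector.sources:
    print("src =", src)
```

The parser hands comments to a separate callback, so the script inside the HTML comment is not reported, and attribute order or line breaks inside the tag do not matter.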

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 24 '05 #4
Thank you all.

Dec 25 '05 #5
