Trying to find regex for any script in an html source

28tommy

Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by any
number of characters of any type, then ' src=' followed by any
number of characters of any type, and finally '>'

What am I doing wrong?
Thanks.

Dec 21 '05 #1

Mitja Trampus

28tommy wrote:

What am I doing wrong?

Several things.
First, re.DOTALL is a flag, a _parameter_ to be passed to
the compile function, not something you stick inside the RE
itself:
re.compile('<script .+ src=.+>',re.DOTALL)

Second, this won't match your example above, because there src
follows a newline, not a space (and it would also fail when src
appears immediately after script). So you probably want
something like
re.compile('<script .*src=.+>',re.DOTALL)

Third, * and + are _greedy_ by default, which means they
will "eat up" as many characters as possible. Try it and see
what I mean. The solution is to use the non-greedy variants,
*? and +?:
re.compile('<script .*?src=.+?>',re.DOTALL)

All this and more at
http://docs.python.org/lib/module-re.html
and, I'm sure, several online tutorials. To RTFM is never a
bad idea.
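
The three points above can be checked with a short sketch. This is only an illustration, not a recommended scraper: the `page` string is a made-up test input (the tag from the question plus a hypothetical second tag, `other.js`), and all variable names are my own.

```python
import re

# The tag from the original question.
first_tag = '''<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>'''

# Hypothetical page: the tag above plus a second, simpler script tag.
page = first_tag + '\n<p>unrelated text</p>\n<script src="other.js"></script>'

# 1. Inside a pattern string, [re.DOTALL] is only a character class:
#    it matches one character out of r, e, ., D, O, T, A, L.
char_class = re.compile(r'[re.DOTALL]+')
class_hit = char_class.fullmatch('TALL.DOe')    # only characters from the set
class_miss = char_class.fullmatch('language')   # 'l', 'a', 'n', ... are not in it

# 2. With the flag passed properly, ' src=' still demands a literal space
#    before src, but in the question's tag src follows a newline.
with_space = re.compile(r'<script .+ src=.+>', re.DOTALL).search(first_tag)

# 3. Greedy quantifiers run to the last possible 'src=' and the last '>',
#    so both tags end up in one match; the non-greedy forms stop early.
greedy = re.compile(r'<script .*src=.+>', re.DOTALL).findall(page)
lazy = re.compile(r'<script .*?src=.+?>', re.DOTALL).findall(page)

print(len(greedy), len(lazy))
```

Note that even the non-greedy pattern can still start at a script tag that has attributes but no src and run on until some later tag's src= (an img tag, say), which is one reason a real parser is safer for arbitrary pages.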

Dec 21 '05 #2

Paul McGuire

"28tommy" <28*****@gmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegroups.com...

Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
<snip>

28tommy -

pyparsing includes a built-in HTML tag definition method that handles tag
attributes automatically. You can also tell pyparsing to *not* accept tags
found inside HTML comments, something not so easy using re's (your target
HTML pages may not have comments, so I don't know if this is of much interest
to you). Finally, accessing the results is very easy, especially for
getting at the values of attributes defined in the opening tag. See the
following example.

Note - pyparsing is considered by some to be "way overkill" for simple HTML
scraping, and is probably 20-100X slower than regular expressions. But as
quick text processing and extraction tools go, it's pretty easy to put
together fairly complex match expressions, without the noisy typography of
regular expressions.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import *

data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>



<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# next three lines define grammar for <script> and </script>,
# plus arbitrary HTML attributes on <script>, plus detection and
# ignoring of any matching expression that might be found inside
# an HTML comment
scriptStart,scriptEnd = makeHTMLTags("script")
expr = scriptStart + scriptEnd
expr.ignore(htmlComment)

# use the grammar to scan the data string
# for each match, return matching tokens as a ParseResults object
# - supports list-, dictionary-, and object-style token access
for toks,start,end in expr.scanString(data):
    print toks.startScript
    print toks.startScript[0]
    print toks.startScript.keys()
    print "src =", toks.startScript["src"]
    print "src =", toks.startScript.src
    print
====================
['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js
src = http://i.cnn.net/cnn/.element/ssi/js...ainVideoMod.js

['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js
src = http://i.cnn.net/cnn/.element/ssi/js...otherScript.js

Dec 21 '05 #3

Mike Meyer

"28tommy" <28*****@gmail.com> writes:

What am I doing wrong?

Trying to use an RE to parse HTML. While possible, it's not nearly as
easy as it looks, and there are lots of gotchas.

Paul has already pointed out that pyparsing comes with an HTML tag parser. If
your HTML is well-formed, you can use HTMLParser in the standard
library. If your HTML comes from the web at large (meaning much of it
was written by the people who handed in code that didn't compile for
their programming assignments), you'll want to try something like
BeautifulSoup.
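
For the standard-library route, a minimal sketch might look like the following. It assumes Python 3, where HTMLParser lives in the html.parser module; the class name `ScriptSrcCollector` and the sample page (including `other.js` and `commented-out.js`) are invented for illustration.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list
        # of (name, value) pairs with the quotes already removed.
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical test page: the tag from the question, a commented-out
# script, an inline script without src, and a second external script.
page = '''<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>
<!-- <script src="commented-out.js"></script> -->
<script>var inline = 1;</script>
<script src="other.js"></script>'''

collector = ScriptSrcCollector()
collector.feed(page)
collector.close()
sources = collector.sources
print(sources)
```

The parser reports the commented-out tag as a comment rather than a start tag, so it is skipped without extra work, and the newline before src in the first tag makes no difference.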

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 24 '05 #4

28tommy

Thank you all.

Dec 25 '05 #5
