Regular Expression question

Hi,
I am new to Python regular expressions. I would like to use one to get an
attribute of an HTML element from an HTML file.

For example, I was able to read the HTML file using this:
import urllib2
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

My question is: how can I get just the src attribute value of an img
tag?
Something like this:
(.*)<img src="href of the image source">(.*)

I need to get the URL of the image source.

Thanks.

Jun 7 '06 #1
5 Replies


ke*********@gmail.com wrote:
I am new to Python regular expressions. I would like to use one to get an
attribute of an HTML element from an HTML file.


if you want to parse HTML, use an HTML parser. if you want to parse
sloppy HTML, use a tolerant HTML parser:

http://www.crummy.com/software/BeautifulSoup/

</F>

Jun 7 '06 #2
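
The advice in post #2 can be tried without installing anything. What follows is only a minimal sketch in Python 3 (the thread itself uses Python 2), and it swaps in the standard library's html.parser.HTMLParser rather than BeautifulSoup. The class name ImgSrcCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes,
        # so attribute order and quoting style do not matter here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Made-up sample: unquoted, double-quoted and single-quoted src values.
sample = """<html><body>
<img width=42 src=fred.jpg>
<IMG SRC="freda.jpg" alt="x">
<img title='the src="silly" title' src='another'>
</body></html>"""

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

The parser hands each start tag's attributes over as (name, value) pairs, so the lookup is a plain dictionary access and no pattern has to guess where src sits inside the tag.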

As Fredrik pointed out, re's are not the only tool out there. Here's a
pyparsing solution.

-- Paul
import pyparsing
import urllib

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site
htmlPage = urllib.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read()
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens,start,end in imgStartTag.scanString(htmlSource):
    print tokens.src

Prints:

http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/...news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/...foodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/...6q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif
Jun 7 '06 #3

pyparsing is cool.
But using only re is also OK:
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

I got these results:
http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif


Jun 8 '06 #4

"Frank Potter" <co*******@gmail.com> wrote in message
news:ma***************************************@python.org...
pyparsing is cool.
But using only re is also OK:
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute. www.yahoo.com has
several such tags.

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for that
matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.

-- Paul
Jun 8 '06 #5
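
The failure described in post #5 is easy to reproduce. Here is a small sketch in Python 3 (the thread uses Python 2) that runs the same pattern, written as a raw string, against two hypothetical tags. The sample markup is invented for illustration and does not come from any real page.

```python
import re

# Same pattern as in the post, written as a raw string for Python 3.
IMG_SRC = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical markup: the second tag puts width and height before src.
sample = (
    '<img src="first.gif" alt="found">'
    '<img width="10" height="10" src="second.gif">'
)

found = [m.group('image') for m in IMG_SRC.finditer(sample)]
print(found)  # ['first.gif'] -- second.gif is silently skipped
```

Because the pattern demands that src follow the tag name directly, the second tag produces no match and no error, which is what makes this kind of bug easy to miss.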

Paul McGuire wrote:
import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other
attribute, such as "height" or "width", before the "src" attribute.
www.yahoo.com has several such tags.


It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.

Handle all of that correctly in the regex and the beautiful soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.
ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = '''<img\s(?:'''+NOTSRC + '''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)'''
# assumed: the post never shows how r was built
r = re.compile(PAT)
htmlPage = '''<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\"> <img title='the src="silly" title'
src='another'></body></html>'''
for m in r.finditer(htmlPage):
    print m.group('image')
fred.jpg
freda.jpg

Jun 8 '06 #6
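
One way around the single-pattern problem raised in post #6 is to give up on a single pattern. The sketch below, in Python 3 (the thread uses Python 2), is not the one-named-group regex the post asks for. It is a two-pass alternative: one pattern finds each <img> tag, and a second walks that tag's attributes. The names IMG_TAG, ATTRIBUTE and img_sources are hypothetical, and the test page is the one from post #6.

```python
import re

# Pass 1: find each <img ...> start tag, stepping over quoted values
# so that a ">" inside quotes does not end the tag early.
IMG_TAG = re.compile(r'''<img\b((?:"[^"]*"|'[^']*'|[^>"'])*)>''', re.IGNORECASE)

# Pass 2: split the tag body into name[=value] pairs; the value may be
# double-quoted, single-quoted, or unquoted.
ATTRIBUTE = re.compile(
    r'''([^\s=/>"']+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+)))?'''
)


def img_sources(html):
    """Return the src value of every <img> tag, in document order."""
    sources = []
    for tag in IMG_TAG.finditer(html):
        for attr in ATTRIBUTE.finditer(tag.group(1)):
            if attr.group(1).lower() == "src":
                # Exactly one of the three value groups can be set.
                value = next((g for g in attr.groups()[1:] if g is not None), "")
                sources.append(value)
                break
    return sources


html_page = '''<html><body><img width=42 src=fred.jpg><img src="freda.jpg"> <img title='the src="silly" title'
src='another'></body></html>'''

print(img_sources(html_page))  # ['fred.jpg', 'freda.jpg', 'another']
```

Splitting the work keeps each pattern short enough to read, and the src="silly" text inside the title value is consumed as part of that attribute instead of being mistaken for the real src. It is still not a parser: comments, scripts and malformed markup can fool it, which is the point made in post #2.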
