Regular Expression question

Hi,
I am new to Python regular expressions, and I would like to use them to get
an attribute of an HTML element from an HTML file.

for example, I was able to read the html file using this:
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

my question is how can I just get the src attribute value of an img
tag?
something like this:
(.*)<img src="href of the image source">(.*)

I need to get the href of the image source.

Thanks.

Jun 7 '06 #1
ke*********@gmail.com wrote:
I am new to python regular expression, I would like to use it to get an
attribute of an html element from an html file?


if you want to parse HTML, use an HTML parser. if you want to parse
sloppy HTML, use a tolerant HTML parser:

http://www.crummy.com/software/BeautifulSoup/

</F>
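
For anyone who wants to try the parser route without installing anything, here is a minimal sketch that uses Python 3's standard-library html.parser in place of BeautifulSoup (a swapped-in stand-in, not the library linked above). The names ImgSrcCollector and img_sources and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src value of every <img> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the quotes
        # from values; attrs is a list of (name, value) pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return the src values of all <img> tags found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Made-up markup: src after another attribute, upper-case tag with single
# quotes, and an unquoted value.
sample = (
    '<html><body>'
    '<img width="42" src="a.gif">'
    "<IMG SRC='b.jpg' alt=\"x\">"
    '<img src=c.png>'
    '</body></html>'
)
print(img_sources(sample))
```

The parser does the tokenizing, so attribute order, letter case and quoting style do not need special handling in the calling code.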

Jun 7 '06 #2
<ke*********@gmail.com> wrote in message
news:11**********************@y43g2000cwc.googlegroups.com...
Hi,
I am new to python regular expression, I would like to use it to get an
attribute of an html element from an html file?

for example, I was able to read the html file using this:
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

my question is how can I just get the src attribute value of an img
tag?
something like this:
(.*)<img src="href of the image source">(.*)

I need to get the href of the image source.

Thanks.


As Fredrik pointed out, re's are not the only tool out there. Here's a
pyparsing solution.

-- Paul
import pyparsing
import urllib

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site
htmlPage = urllib.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read()
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens,start,end in imgStartTag.scanString(htmlSource):
    print tokens.src

Prints:

http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/...news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/...foodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/...6q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif
Jun 7 '06 #3
pyparsing is cool,
but using only re is also OK:
# -*- coding: UTF-8 -*-
import re
import urllib2

html = urllib2.urlopen("http://www.yahoo.com/").read()

# capture the src value of <img> tags whose first attribute is a double-quoted src
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

I got these results:
http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif

Jun 8 '06 #4
"Frank Potter" <co*******@gmail.com> wrote in message
news:ma***************************************@python.org...
pyparsing is cool.
but use only re is also OK
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute. www.yahoo.com has
several such tags.

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for
that matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.

-- Paul
Jun 8 '06 #5
Paul McGuire wrote:
import re
r=re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other
attribute, such as "height" or "width", before the "src" attribute.
www.yahoo.com has several such tags.


It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.
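
The two objections above are easy to check. A small sketch in Python 3, running the quoted pattern (written as a raw string) against made-up tags, one per case; the sample markup and variable names are assumptions for illustration only:

```python
import re

# The pattern quoted above, unchanged apart from the raw-string prefix.
IMG_SRC_FIRST = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Made-up tags, one per case discussed in the thread.
sample = (
    '<img src="plain.gif">'               # src first, double quotes
    '<img width="10" src="later.gif">'    # another attribute before src
    "<img src='single.gif'>"              # single-quoted value
    '<img src=bare.gif>'                  # unquoted value
)

found = [m.group("image") for m in IMG_SRC_FIRST.finditer(sample)]
print(found)  # only the first tag is picked up: ['plain.gif']
```

Three of the four tags are silently skipped, which is the practical risk here: the regex does not raise an error, it just returns fewer results.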

Handle all of that correctly in the regex and the beautiful soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.
import re
ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = (r'''<img\s(?:''' + NOTSRC +
       r'''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')
# the compile step was not shown in the post; no flags are assumed here
r = re.compile(PAT)
htmlPage = '''<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\"> <img title='the src="silly" title'
src='another'></body></html>'''
for m in r.finditer(htmlPage):
    print m.group('image')
fred.jpg
freda.jpg

Jun 8 '06 #6
