Bytes | Software Development & Data Engineering Community

Regular Expression question

Hi,
I am new to Python regular expressions, and I would like to use them to get an
attribute of an HTML element from an HTML file.

For example, I was able to read the HTML file using this:
import urllib2
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

My question is: how can I get just the src attribute value of an img
tag?
Something like this:
(.*)<img src="href of the image source">(.*)

I need to get the URL of the image source.

Thanks.

Jun 7 '06 #1
ke*********@gmail.com wrote:
I am new to Python regular expressions, and I would like to use them to get an
attribute of an HTML element from an HTML file.


if you want to parse HTML, use an HTML parser. if you want to parse
sloppy HTML, use a tolerant HTML parser:

http://www.crummy.com/software/BeautifulSoup/

</F>
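
A minimal sketch of the "use an HTML parser" advice, written for Python 3 and using only the standard library's html.parser module rather than BeautifulSoup (which is a separate install). The class name ImgSrcCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs with any surrounding quotes already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical markup: unquoted, double-quoted and single-quoted src values.
sample = """<html><body>
<img width=42 src=fred.jpg>
<IMG SRC="freda.jpg" alt="x">
<img title='the src="silly" title' src='another'>
</body></html>"""

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

The parser deals with attribute order, quoting style and letter case, so the calling code only has to ask for the attribute by name.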

Jun 7 '06 #2
As Fredrik pointed out, re's are not the only tool out there. Here's a
pyparsing solution.

-- Paul
import pyparsing
import urllib

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site
htmlPage = urllib.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read()
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens,start,end in imgStartTag.scanString(htmlSource):
    print tokens.src

Prints:

http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/...2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/...news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/...foodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/...6q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif
Jun 7 '06 #3
pyparsing is cool, but using only re is also OK:
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

I got these results:
http://us.i1.yimg.com/us.yimg.com/i/...edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/...orious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/...4q2/camera.gif

Jun 8 '06 #4
"Frank Potter" <co*******@gmail.com> wrote in message
news:ma***************************************@python.org...
pyparsing is cool, but using only re is also OK:
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute. www.yahoo.com has
several such tags.
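
To make the failure concrete, here is a small Python 3 check of the quoted pattern against two made-up tags. The markup is hypothetical, not taken from www.yahoo.com, and the pattern is written as a raw string, which Python 3 prefers for regex escapes.

```python
import re

# The pattern from the quoted post, as a raw string.
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical markup: the second tag puts width before src.
html = '<img src="a.gif" alt="first"><img width="10" src="b.gif">'

found = [m.group('image') for m in r.finditer(html)]
print(found)  # only a.gif is reported; b.gif is missed
```

The pattern requires src to follow the tag name directly, so any tag with a leading attribute is skipped without warning.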

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for
that matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.

-- Paul
Jun 8 '06 #5
Paul McGuire wrote:
import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')


Ouch - this fails to match any <img> tag that has some other
attribute, such as "height" or "width", before the "src" attribute.
www.yahoo.com has several such tags.


It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.
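
A quick Python 3 check of that claim, using three made-up tags that differ only in how the src value is quoted. The dictionary keys and file names are illustrative assumptions.

```python
import re

# The simple pattern under discussion, as a raw string.
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical tags, one per quoting style.
samples = {
    'double': '<img src="a.gif">',
    'single': "<img src='b.gif'>",
    'unquoted': '<img src=c.gif>',
}

# Record whether the pattern finds a match in each tag.
matched = {name: bool(r.search(tag)) for name, tag in samples.items()}
print(matched)
```

Only the double-quoted form matches, because the pattern hard-codes the double quote character on both sides of the value.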

Handle all of that correctly in the regex and the beautiful soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.
import re

ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = '''<img\s(?:'''+NOTSRC + '''\s*)*src=(?:["']?)(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)'''
# assumed: the compile step is not shown in the post; flag as in the earlier pattern
r = re.compile(PAT, re.IGNORECASE)
htmlPage = '''<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\"> <img title='the src="silly" title'
src='another'></body></html>'''
for m in r.finditer(htmlPage):
    print m.group('image')

fred.jpg
freda.jpg
another
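
For readers on Python 3, the same pattern can be sketched with raw strings and a list comprehension. The names html_page and images, and the re.IGNORECASE flag, are assumptions made for this sketch; the pattern text itself is unchanged.

```python
import re

# One attribute: a name, optionally followed by =value, where the value is
# double-quoted, single-quoted, or unquoted.
ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
# Any attribute that is not src.
NOTSRC = '(?!src=)' + ATTR
# Skip non-src attributes, then capture the src value in the named group.
PAT = (r'''<img\s(?:''' + NOTSRC + r'''\s*)*src=(?:["']?)'''
       r'''(?P<image>(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''')
r = re.compile(PAT, re.IGNORECASE)  # flag is an assumption

html_page = '''<html><body><img width=42 src=fred.jpg><img src="freda.jpg"> <img title='the src="silly" title'
src='another'></body></html>'''

images = [m.group('image') for m in r.finditer(html_page)]
print(images)
```

The lookbehind assertions pick the terminator that matches the opening quote, which is how a single named group can cover all three quoting styles.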

Jun 8 '06 #6
