store tag content with SGMLParser ... ?

Hi Python people!

I'd like to ask a question about parsing HTML with the SGMLParser class
from the sgmllib module. Is there a way to get the contents of a tag with
certain attributes? For example, I'd like to get a list of the contents
and hrefs of all <a> tags that have a certain value of the "class" attribute:

<a class="section" href="http://example.com">BLABLABLA</a>

I'd like to store the "http://example.com" and the "BLABLABLA"
whenever class="section", probably using the SGMLParser method
handle_data. Or would a different approach be better? Thanks for any answers! :)

#Luk
Jul 18 '05 #1

"Lukas Holcik" <xh******@fi.muni.cz> wrote in message
news:Hv********@news.muni.cz...
> Hi Python people!
>
> I'd like to ask a question about parsing HTML with the SGMLParser class
> from the sgmllib module. Is there a way to get the contents of a tag with
> certain attributes? For example, I'd like to get a list of the contents
> and hrefs of all <a> tags that have a certain value of the "class" attribute:
>
> <a class="section" href="http://example.com">BLABLABLA</a>
>
> I'd like to store the "http://example.com" and the "BLABLABLA"
> whenever class="section", probably using the SGMLParser method
> handle_data. Or would a different approach be better? Thanks for any answers! :)
>
> #Luk

Lukas -

If all your section definitions are as fixed in format as this one, a regexp
will probably do. If you have to deal with additional anchor attributes,
you could define a small subset of SGML using pyparsing, matching just the tag
pattern you are looking for. With pyparsing, you don't have to define the
complete SGML syntax, only the desired pattern; you then extract the results
using scanString. Here is an example that tolerates additional
attributes in the '<a class="section"...' tag.

-- Paul

===========
# get pyparsing at http://pyparsing.sourceforge.net
# (print() calls below assume Python 3)
from pyparsing import (Literal, quotedString, CharsNotIn, OneOrMore, Word,
                       alphas)

someSGML = """
<SGML>
<tag>some stuff</tag>
<a class="section" href="http://example1.com">BLABLABLA</a>
more blah blah blah...
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<SGML_TAG>sldkjflsdkjflsdkjf</SGML_TAG>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
</SGML>
"""

# text between the opening tag and the next '<'
tagBody = CharsNotIn('<').setResultsName("body")
# the href attribute; its quoted value is stored under the name "href"
href = Literal('href') + '=' + quotedString.setResultsName("href")
# any other name="value" attribute, matched but not named
otherAttrDef = Word(alphas) + "=" + quotedString
sectionDef = (Literal('<a class="section"') + OneOrMore(href | otherAttrDef)
              + '>' + tagBody + "</a>")

# scanString yields (tokens, start, end) for every match in the input
for match, start, stop in sectionDef.scanString(someSGML):
    print("href=", match.href)
    print("body=", match.body)
    print()

============
pythonw -u getSGMLrefs.py

href= "http://example1.com"
body= BLABLABLA

href= "http://example2.com"
body= BLEBLEBLE

href= "http://example3.com"
body= BLIBLIBLI
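
For the fixed-format case mentioned above, a regexp version might look roughly like the sketch below. The names SECTION_LINK and find_section_links are hypothetical, and the pattern assumes that class="section" is the first attribute, that values are double-quoted, and that the link text contains no nested tags. Treat it as a starting point for markup like the sample, not as a general HTML solution.

```python
import re

# Assumptions (not from the original post): class="section" comes first
# in the tag, attribute values are double-quoted, and the link text has
# no nested tags.
SECTION_LINK = re.compile(
    r'<a\s+class="section"[^>]*?\shref="([^"]*)"[^>]*>([^<]*)</a>',
    re.IGNORECASE,
)


def find_section_links(text):
    """Return a list of (href, body) tuples for matching <a> tags."""
    return SECTION_LINK.findall(text)


sample = """
<a class="section" href="http://example1.com">BLABLABLA</a>
more blah blah blah...
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
<a class="section" href="http://example3.com" size="venti">BLIBLIBLI</a>
"""

for link_href, link_body in find_section_links(sample):
    print("href=", link_href)
    print("body=", link_body)
```

Unlike the pyparsing version, the captured href here does not include the surrounding quotes.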

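As for the handle_data approach asked about in the question: sgmllib was removed in Python 3, but html.parser.HTMLParser in the standard library uses the same style of handler methods, so the idea carries over. Below is a minimal sketch that swaps in HTMLParser and assumes Python 3. The class name SectionLinkParser and its attributes are made up for illustration.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags with a given class value."""

    def __init__(self, wanted_class="section"):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while inside a matching <a>
        self._href = None      # href of the <a> currently being read
        self._chunks = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("class") == self.wanted_class:
            self._in_link = True
            self._href = attr_map.get("href")
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._chunks)))
            self._in_link = False


html_text = """
<p>some stuff</p>
<a class="section" href="http://example1.com">BLABLABLA</a>
<a class="other" href="http://example.com">skip me</a>
<a class="section" color="RED" href="http://example2.com">BLEBLEBLE</a>
"""

parser = SectionLinkParser()
parser.feed(html_text)
parser.close()
for link_href, link_text in parser.links:
    print("href=", link_href)
    print("body=", link_text)
```

The start-tag handler decides whether the tag is interesting, handle_data collects text only while inside such a tag, and the end-tag handler stores the finished pair. Attribute order does not matter with this approach.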

Jul 18 '05 #2