Class to extract tags from html code

Hello everybody. I am a newbie in Python, with only two weeks of experience, although I have programmed in other languages. Python has already won my heart: there are three kinds of arrays. Wow, now I hate Java :)

I am now working on HTML processing.
I found a good class for extracting HTML tags, but I don't know how to use it.
I wrote this code to get the A tags. I'm hoping for some help, please:

<%
import urllib
from sgmllib import SGMLParser
import htmlentitydefs


class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if ref in htmlentitydefs.entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


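url = 'http://google.com'
f = urllib.urlopen(url)
s = f.read()  # the HTML source
f.close()

# Feed the HTML to the parser. The parser calls unknown_starttag and the
# other handler methods itself; they are not meant to be called directly.
myparser = BaseHTMLProcessor()
myparser.feed(s)
myparser.close()

# Keep only the reconstructed <a ...> start tags collected in pieces.
links = [piece for piece in myparser.pieces if piece.startswith('<a ')]

# Write the first two links, without failing if fewer were found.
req.write('<br>'.join(links[:2]))

%>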
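One way to collect only the link targets is sketched below. It is a minimal sketch, assuming Python 3, where `sgmllib` no longer exists and `html.parser.HTMLParser` is the standard-library counterpart. `LinkCollector` and `extract_links` are hypothetical names introduced for illustration and are not part of the class above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser calls this for each start tag. Tag and attribute
        # names arrive lowercased; attrs is a list of (name, value) tuples.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed html_text to a fresh parser and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

The key point is the same as with `SGMLParser`: the HTML is passed to `feed()`, and the parser invokes the handler methods on its own. With `sgmllib` in Python 2, a comparable approach would be to define a `start_a(self, attrs)` method on a subclass.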

The traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 337, in handler
    p.run()
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 243, in run
    exec code in global_scope
  File "/var/www/html/smart/qui.psp", line 86, in <module>
    f = urllib.urlopen(url)
  File "/usr/lib/python2.5/urllib.py", line 82, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.5/urllib.py", line 325, in open_http
    h.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 856, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 728, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 695, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 663, in connect
    socket.SOCK_STREAM):
IOError: [Errno socket error] (-3, 'Temporary failure in name resolution')

Do you think the error comes from the socket?
Should I add some code to set a timeout?
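The last line of the traceback reports a name-resolution (DNS) failure, which is raised before any connection is made, so a timeout alone would probably not make the request succeed. What can help is catching the failure so the page does not crash. The following is a minimal sketch, assuming Python 3's `urllib.request` rather than the Python 2.5 `urllib` used above. `fetch_html` and its `opener` parameter are hypothetical, added only so the function can be exercised without network access.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the request fails.

    `opener` is a hypothetical injection point; by default the real
    urllib.request.urlopen is used.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except urllib.error.URLError as exc:
        # DNS failures (socket.gaierror) and connect timeouts arrive
        # wrapped in URLError; exc.reason holds the underlying error.
        print("could not fetch %s: %s" % (url, exc.reason))
        return None
    except socket.timeout:
        # Raised directly when reading the body exceeds `timeout`.
        print("timed out fetching %s" % url)
        return None
```

A caller can then test for `None` and show a friendly message instead of a traceback. If the name-resolution error persists, checking the server's DNS configuration is likely the next step, since the failure occurs in the hostname lookup rather than in the Python code.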
Aug 10 '08 #1