Class to extract tags from html code

Hello everybody. I am a newbie in Python (two weeks only), although I have programmed in other languages. Python has already won my heart: there are 3 kinds of arrays. Wow, now I hate Java :)

I am now working on HTML processing.
I found a good class for getting HTML tags, but I don't know how to use it.
I wrote this code to get the A tags. Hoping for some help, please:

<%
import urllib
from sgmllib import SGMLParser
import htmlentitydefs


class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if ref in htmlentitydefs.entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


url = 'http://google.com'
f = urllib.urlopen(url)
s = f.read()  # the HTML code
f.close()

myparser = BaseHTMLProcessor()
# feed() drives the parser; SGMLParser calls unknown_starttag and the
# other handler methods itself, so they are never called by hand.
myparser.feed(s)
myparser.close()

# keep only the reconstructed <a ...> start tags
links = [piece for piece in myparser.pieces if piece.startswith('<a ')]

# write at most the first two, even if fewer were found
req.write('<br>'.join(links[:2]))
%>
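
For reference, parsers of this kind are normally driven through feed() and close(), and handler methods such as unknown_starttag are callbacks that the parser invokes on its own. The sketch below illustrates that pattern for collecting the href of each A tag. It is only a sketch under assumptions: sgmllib was removed in Python 3, so it swaps in html.parser.HTMLParser from the Python 3 standard library, and the names LinkExtractor and extract_links as well as the sample HTML string are hypothetical, not part of the class above.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser for each start tag.
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) tuples.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the HTML text to the parser and return the collected hrefs."""
    parser = LinkExtractor()
    parser.feed(html_text)  # the parser calls handle_starttag itself
    parser.close()
    return parser.links


# Made-up sample input, used instead of a network fetch.
sample = '<p><a href="http://example.com/">one</a> <A HREF="/two">two</A></p>'
print(extract_links(sample))
```

Under Python 2.5 with sgmllib, the same idea would presumably be a subclass that defines start_a(self, attrs) and is driven with feed() and close() in the same way.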

The traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 337, in handler
    p.run()
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 243, in run
    exec code in global_scope
  File "/var/www/html/smart/qui.psp", line 86, in <module>
    f = urllib.urlopen(url)
  File "/usr/lib/python2.5/urllib.py", line 82, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.5/urllib.py", line 325, in open_http
    h.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 856, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 728, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 695, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 663, in connect
    socket.SOCK_STREAM):
IOError: [Errno socket error] (-3, 'Temporary failure in name resolution')

Do you think the error comes from the socket?
Should I add some code to set a timeout?
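
The message 'Temporary failure in name resolution' suggests that the host name lookup failed before any connection was opened, in which case a timeout setting would probably not change the outcome. A minimal sketch for checking the lookup on its own, from the same environment the page runs in, could look like this. It is written for Python 3, the function name check_resolution is hypothetical, and the numeric address in the usage line is only there so that the example runs without a DNS server.

```python
import socket


def check_resolution(host, port=80):
    """Return (True, None) if host resolves, else (False, error text)."""
    try:
        # Same kind of lookup the HTTP client performs before connecting.
        socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
        return True, None
    except socket.gaierror as err:
        # gaierror carries the resolver's error code and message.
        return False, "%s (code %s)" % (err.strerror, err.errno)


# A numeric address needs no DNS query, so this lookup should succeed.
ok, problem = check_resolution("127.0.0.1")
print(ok, problem)
```

If the same check fails for the real host name when run under the web server but succeeds from an interactive shell, the difference likely lies in the server's network or resolver configuration rather than in the parsing code.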
Aug 10 '08 #1