Bytes | Software Development & Data Engineering Community
Class to extract tags from html code

Hello everybody. I am a newbie in Python (two weeks only), although I have programmed in other languages. Python has won my heart: there are 3 kinds of arrays. Wow, now I hate Java :)

I am now working on HTML processing.
I found a good class for getting HTML tags, but I don't know how to use it.
I wrote this code to get the A tags, hoping for some help:

<%
import urllib
from sgmllib import SGMLParser
import htmlentitydefs


class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)



url='http://google.com'
f = urllib.urlopen(url)
s = f.read() # The html code

links = []

myparser = BaseHTMLProcessor()
links=myparser.output(myparser.unknown_starttag(s,'a','href'))

req.write(links[0]+'<br>'+links[1])

%>
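
A note on usage: the handler methods in the class above (unknown_starttag, handle_data and so on) are callbacks that the parser invokes itself while it reads the document, so they are normally not called by hand. The usual pattern is to feed the HTML to the parser with feed() and collect results inside a handler. With sgmllib that would typically mean a subclass defining start_a(self, attrs) followed by myparser.feed(s). Since sgmllib exists only in Python 2, the sketch below swaps in the standard html.parser.HTMLParser from Python 3 to show the same idea; the names LinkExtractor and extract_links are made up for illustration and are not part of the original class.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser for each start tag.
        # attrs is a list of (name, value) tuples; names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the HTML to the parser and return the collected href values."""
    parser = LinkExtractor()
    parser.feed(html_text)   # the parser calls handle_starttag for us
    parser.close()
    return parser.links


sample = '<p><a href="http://example.com/">one</a> and <A HREF="/two">two</A></p>'
print(extract_links(sample))
```

Here extract_links returns a list of strings, so indexing such as links[0] works on it, whereas output() in the class above returns the whole document as one string.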

The traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 337, in handler
    p.run()
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 243, in run
    exec code in global_scope
  File "/var/www/html/smart/qui.psp", line 86, in <module>
    f = urllib.urlopen(url)
  File "/usr/lib/python2.5/urllib.py", line 82, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.5/urllib.py", line 325, in open_http
    h.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 856, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 728, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 695, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 663, in connect
    socket.SOCK_STREAM):
IOError: [Errno socket error] (-3, 'Temporary failure in name resolution')

Do you think the error comes from the socket?
Should I add some code to set a timeout?
Aug 10 '08 #1
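
On the traceback: the message "Temporary failure in name resolution" is raised while the host name is being looked up, before any connection is opened, so it most likely points to DNS on the machine running the script (for example, the web server process being unable to reach a resolver) rather than to the parser class. A timeout would probably not cure that, although setting one is still reasonable so a slow server cannot hang the page. The following is a minimal sketch, assuming Python 3 and its urllib.request module rather than the Python 2.5 urllib used in the post; fetch and its opener parameter are hypothetical names introduced here so the error path can be exercised without a network. On Python 2, socket.setdefaulttimeout() is the closest standard equivalent for setting a timeout.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical injection point: any callable that accepts
    (url, timeout=...) and returns an object with read() and close().
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (urllib.error.URLError, socket.timeout) as exc:
        # URLError covers failed name resolution and refused connections;
        # socket.timeout covers a server that is too slow to answer.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

A caller can then test the result (`if html is None: ...`) and show a friendly message instead of letting the traceback reach the browser.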