Class to extract tags from html code

Hello everybody. I am a newbie in Python, only two weeks in, although I have programmed in other languages. Python has already won my heart: it has 3 kinds of arrays. Wow, now I hate Java :)

I am now working on HTML processing.
I found a good class for getting HTML tags, but I don't know how to use it.
I wrote this code to get the A tags. Hoping for some help, please:

<%
import urllib
from sgmllib import SGMLParser
import htmlentitydefs


class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


url='http://google.com'
f = urllib.urlopen(url)
s = f.read() # The html code

links = []

myparser = BaseHTMLProcessor()
links=myparser.output(myparser.unknown_starttag(s,'a','href'))

req.write(links[0]+'<br>'+links[1])

%>
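For reference, a note on how a parser class like this is normally driven. Methods such as unknown_starttag are callbacks: the parser calls them itself while it scans the text passed to feed(), so they are not meant to be called directly with the whole page. With sgmllib on Python 2, the usual pattern is to subclass SGMLParser, define a start_a(self, attrs) method, then call feed(s) and close(). The sketch below shows the same idea on Python 3, where sgmllib no longer exists, using html.parser.HTMLParser from the standard library instead. The names LinkCollector and extract_links are hypothetical and chosen only for illustration; this is a minimal sketch, not the only way to do it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by feed() for each start tag. attrs is a list of
        # (name, value) tuples; tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Hypothetical helper: return all href values found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)   # the parser invokes handle_starttag itself
    parser.close()
    return parser.links


# Small in-memory sample so the sketch runs without network access.
sample = ('<p><a href="http://example.com/a">A</a> '
          '<A HREF="/b">B</A> <a name="x">no href</a></p>')
print(extract_links(sample))
```

The result is a plain list of strings, so indexing it (links[0], links[1]) only works when the page actually contains at least that many links; checking len(links) first avoids an IndexError.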

The traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 337, in handler
    p.run()
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 243, in run
    exec code in global_scope
  File "/var/www/html/smart/qui.psp", line 86, in <module>
    f = urllib.urlopen(url)
  File "/usr/lib/python2.5/urllib.py", line 82, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.5/urllib.py", line 325, in open_http
    h.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 856, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 728, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 695, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 663, in connect
    socket.SOCK_STREAM):
IOError: [Errno socket error] (-3, 'Temporary failure in name resolution')

Do you think the error comes from the socket?
Should I add some code to set a timeout?
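A note on the error itself: "Temporary failure in name resolution" means the hostname lookup (DNS) failed before any connection was attempted, so a timeout alone would probably not make the request succeed; it is more likely that the web server process cannot reach a DNS resolver. Handling the failure and setting a timeout are still reasonable. On Python 2.5, urllib.urlopen takes no timeout argument, and socket.setdefaulttimeout(seconds) is the usual way to set one globally. The sketch below shows the same idea on Python 3 with urllib.request. The function name fetch and its opener parameter are hypothetical, added only so the example can be exercised without network access.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, timeout=10.0, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    opener is a hypothetical hook: any callable with the same
    (url, timeout=...) shape as urlopen, so a stub can stand in for it.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (URLError, socket.timeout) as exc:
        # URLError wraps lower-level failures such as DNS lookup errors;
        # socket.timeout is raised when a read exceeds the timeout.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

A caller would then check the result before parsing, for example: s = fetch(url), followed by a test that s is not None, so that a failed lookup produces a readable message on the page instead of a traceback.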
Aug 10 '08 #1