Hello everybody,
I am a newbie in Python (two weeks only), although I have programmed in other languages. Python has already won my heart: there are 3 kinds of arrays. Wow, now I hate Java :).
I am now working on HTML processing.
I found a good class for getting HTML tags, but I don't know how to use it.
I wrote this code to get the A tags. Hoping for some help, please:
<%
import urllib
from sgmllib import SGMLParser
import htmlentitydefs


class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


url = 'http://google.com'
f = urllib.urlopen(url)
s = f.read()  # The html code

links = []

myparser = BaseHTMLProcessor()
links = myparser.output(myparser.unknown_starttag(s, 'a', 'href'))

req.write(links[0] + '<br>' + links[1])
%>
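
A note on how the class seems meant to be used: `unknown_starttag` is a callback that the parser invokes by itself while `feed()` runs, so it is not normally called directly with the page text, and `output()` as defined above takes no arguments. With `sgmllib` the usual pattern is to call `feed(s)`, then `close()`, and only then read the result. The sketch below shows that pattern for collecting `href` values. It is only an illustration under assumptions: `sgmllib` was removed in Python 3, so `html.parser.HTMLParser` is swapped in as a stand-in, and `LinkCollector`, `extract_links` and `sample` are made-up names that do not appear in the code above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser itself during feed(); attrs is a list of
        # (name, value) tuples, with tag and attribute names lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Feed the HTML text to a fresh parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)   # the parser walks the text and fires the callbacks
    parser.close()      # flush anything still buffered
    return parser.links


# Hypothetical sample input, not taken from any real page.
sample = ('<p><a href="http://example.com/">one</a> '
          '<a name="x">no href</a> <A HREF="/two">two</A></p>')
print(extract_links(sample))
```

On Python 2.5 with `sgmllib`, the same idea would be a subclass that defines a `start_a(self, attrs)` method, since `SGMLParser` dispatches to `start_<tag>` handlers before falling back to `unknown_starttag`.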
-
The traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1537, in HandlerDispatch
    default=default_handler, arg=req, silent=hlist.silent)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1229, in _process_target
    result = _execute_target(config, req, object, arg)
  File "/usr/lib/python2.5/site-packages/mod_python/importer.py", line 1128, in _execute_target
    result = object(arg)
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 337, in handler
    p.run()
  File "/usr/lib/python2.5/site-packages/mod_python/psp.py", line 243, in run
    exec code in global_scope
  File "/var/www/html/smart/qui.psp", line 86, in <module>
    f = urllib.urlopen(url)
  File "/usr/lib/python2.5/urllib.py", line 82, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.5/urllib.py", line 325, in open_http
    h.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 856, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 728, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 695, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 663, in connect
    socket.SOCK_STREAM):
IOError: [Errno socket error] (-3, 'Temporary failure in name resolution')

Do you think the error comes from the socket?
Should I add some code to set a timeout?
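
For what it is worth, the traceback ends inside `httplib`'s `connect` with a name-resolution error, which suggests the server could not look up the host name at all, so a timeout alone would probably not change the outcome. What can be done in the page is to catch the failure instead of letting it propagate. The sketch below is a minimal illustration under assumptions: it uses Python 3's `urllib.request` rather than the Python 2.5 `urllib` from the traceback, and `fetch_html` and its `opener` parameter are hypothetical names introduced here so the function can be exercised without network access.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so
    that a stand-in can be supplied when no network is available.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (URLError, socket.timeout) as exc:
        # A DNS failure such as "Temporary failure in name resolution"
        # is reported by urlopen as a URLError; a stalled read can raise
        # socket.timeout directly.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

On Python 2.5 itself, `urllib.urlopen` accepts no timeout argument, so the closest equivalent would be calling `socket.setdefaulttimeout(seconds)` before the request and wrapping the call in `try` / `except IOError`.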