Hello,
I need help using sgmllib's SGMLParser to parse an HTML file and keep track of
the number of times each tag is used.
At the end of the program I need to print out the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.
I need help getting past the first steps. I already have this basic
program that returns hyperlinks, but I can't work out how to handle
any tag and keep a count of it to print out later.
I'm very frustrated, and help is appreciated!
--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."
        return self.hyperlinks

parser = HtmParser()
inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs
content = urllib.urlopen(inptAdrs)
bufff = content.read()
print 'Statistics for ', inptAdrs
print 'There are', len(bufff), 'characters in the web page'
parser.feed(bufff)
parser.close()  # flush any buffered data before reading the results
print parser.get_hyperlinks()
---------------------------------------------------------------------------------
Any help is much appreciated.
Could I make a global variable to keep track of each tag count?
Also, how would I make a list or dictionary of the tags that are found?
How can I handle any tag that is given?
On Tuesday, 2 May 2006, at 20:38, ha*********@gmail.com wrote:
> Could I make a global variable to keep track of each tag count?
> Also, how would I make a list or dictionary of the tags that are found?
> How can I handle any tag that is given?
The following snippet does what you want:
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}    # tag name -> number of start tags seen
        self.links = set()    # distinct href values from <a> tags

    # Tag count handling
    # ------------------

    def handle_starttag(self,tag,method,args):
        # Called for tags that have a start_<tag> or do_<tag> method.
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1
        method(args)

    def unknown_starttag(self,tag,args):
        # Called for every tag without a dedicated handler method.
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1

    # Argument handling
    # -----------------

    def start_a(self,args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(file("test.html").read()) # Insert your data source here...
parser.close()
print parser.tagcount
print parser.links
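See the sgmllib documentation for more on handle_starttag and
unknown_starttag. handle_starttag is called for tags that have a start_<tag>
or do_<tag> method (here, only <a>), and unknown_starttag for every other
tag, so both need to count. The counting could just as well have been done
in start_a, but if you want argument handling for more tags, it's best to
keep it in this one central place.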
--- Heiko.
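
The snippet above counts tags and collects hrefs, but not the linked text the
original question also asked for. What follows is only a rough sketch of one
way to get both, and it swaps in a different module: sgmllib was removed in
Python 3, so this uses html.parser.HTMLParser from the standard library
instead. The names TagStats and tag_stats, and the sample HTML, are made up
for illustration and are not from the thread.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Count start tags and collect (href, linked text) pairs."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()  # tag name -> number of start tags seen
        self.links = []            # (href, linked text), in document order
        self._href = None          # href of the <a> currently open, if any
        self._text = []            # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Unlike sgmllib, HTMLParser calls this one method for every tag.
        self.tagcount[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a hyperlink.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def tag_stats(html):
    """Hypothetical helper: parse a string, return (tag counts, links)."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    return parser.tagcount, parser.links


if __name__ == "__main__":
    sample = ('<html><body><p>See <a href="http://example.com/">the '
              '<b>example</b> site</a>.</p><p>Bye</p></body></html>')
    counts, links = tag_stats(sample)
    for tag, n in sorted(counts.items()):
        print(tag, n)
    for href, text in links:
        print(href, "->", text)
```

A Counter is used here in place of the dict.get(tag, 0) + 1 idiom; the
behaviour is the same, it just saves the default handling.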