Bytes | Software Development & Data Engineering Community

SGML parsing tags and keeping track

Hello,

I need help using sgmllib's SGMLParser to parse an HTML file and keep track of
the number of times each tag is used.

At the end of the program I need to print the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.

I need help getting past the first steps. I already have this basic
program that returns hyperlinks. I can't seem to understand how to parse
any tag and keep track of it so that I can print it out later.

I'm very frustrated, and help is appreciated!

--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

parser = HtmParser()

inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs

content = urllib.urlopen(inptAdrs)

bufff = content.read()
print 'Statistics for ', inptAdrs

print 'There are', len(bufff), 'characters in the web page'

parser.feed(bufff)
print parser.get_hyperlinks()
parser.close()
---------------------------------------------------------------------------------

Any help is much appreciated.

May 2 '06 #1
Could I make a global variable and keep track of each tag count?

Also, how would I make a list or dictionary of the tags that are found?
How can I handle any tag that is given?

May 2 '06 #2
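
One way to sketch the bookkeeping asked about here, without a global variable, is a dictionary keyed by tag name. The sketch below uses collections.Counter (available from Python 2.7 on); the seen_tags list is a made-up stand-in for the tag names a parser would report, not output from a real page.

```python
from collections import Counter

# Hypothetical tag names, standing in for what a parser would report
# as it walks through a document.
seen_tags = ["html", "body", "a", "p", "a"]

# Counter is a dict subclass whose missing keys default to 0,
# so no "first time seen" check is needed.
tag_counts = Counter()
for tag in seen_tags:
    tag_counts[tag] += 1

# Print the totals in alphabetical order of tag name.
for tag, count in sorted(tag_counts.items()):
    print("%s: %d" % (tag, count))
```

Keeping the counter as an attribute of the parser object, rather than as a module-level global, lets each parser instance carry its own totals.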
On Tuesday, 2 May 2006 at 20:38, ha*********@gmail.com wrote:
> Could I make a global variable and keep track of each tag count?
>
> Also, how would I make a list or dictionary of the tags that are found?
> How can I handle any tag that is given?


The following snippet does what you want:
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}
        self.links = set()

    # Tag count handling
    # ------------------

    def handle_starttag(self, tag, method, args):
        self.tagcount[tag] = self.tagcount.get(tag, 0) + 1
        method(args)

    def unknown_starttag(self, tag, args):
        self.tagcount[tag] = self.tagcount.get(tag, 0) + 1

    # Argument handling
    # -----------------

    def start_a(self, args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(open("test.html").read()) # Insert your data source here...
parser.close()

print parser.tagcount
print parser.links


See the documentation for sgmllib for more info on handle_starttag (whose
logic might just as well have been implemented in start_a, but if you want
argument handling for more tags, it's best to keep it at this one central
place) and unknown_starttag.

--- Heiko.
May 2 '06 #3
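
A note for readers on Python 3: sgmllib is no longer part of the standard library there, so the snippets in this thread will not import. A minimal sketch of the same idea using html.parser.HTMLParser instead (a different parser class, not the one used in the thread) might look like the following. The class name TagCounter, the sample string, and the (href, text) pair layout are assumptions made for illustration. The sketch also collects the linked text that the original question asked for.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and collect each link's href and text."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()
        self.links = []      # (href, text) pairs, in document order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, known or not, so one method
        # covers what handle_starttag/unknown_starttag did in sgmllib.
        self.tagcount[tag] += 1
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up sample document; replace with your own data source.
sample = ('<html><body><p>See <a href="http://example.com/">the '
          '<b>example</b> site</a>.</p><p>Bye</p></body></html>')

parser = TagCounter()
parser.feed(sample)
parser.close()

for tag, count in sorted(parser.tagcount.items()):
    print("%s: %d" % (tag, count))
for href, text in parser.links:
    print("%s -> %s" % (href, text))
```

Anchors without an href attribute are counted as tags but not recorded as links in this sketch; whether that is the right choice depends on what "linked text" should mean for your pages.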
