Making sgmllib more liberal

I've written a simple class derived from sgmllib.SGMLParser to extract
text from HTML pages. So far it has worked pretty well, except for a few
pages where I get exceptions. I've managed to work around these
problems by overriding parse_declaration.

Since parse_declaration is preceded by the comment

# Internal -- parse declaration (for use by subclasses).

I suspect my workaround may stop working with future
versions of sgmllib, so I'm looking for a more correct alternative.
Any suggestions?

Here's my code:

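import re
import sgmllib
import StringIO

# Python 2 code: sgmllib, StringIO and unichr are not available in Python 3.
_endTag = re.compile(r'>')

class SGML2TextParser(sgmllib.SGMLParser):
    def __init__(self, f, ignoretags=('script',)):
        # A tuple default avoids sharing one mutable list between instances.
        sgmllib.SGMLParser.__init__(self)
        self.f = f
        self.ignoretags = ignoretags
        self.tag = ''

    # sgmllib calls unknown_starttag(tag, attrs) for tags that have no
    # start_<tag>/do_<tag> method; handle_starttag takes (tag, method, attrs)
    # and is only called when such a method exists.
    def unknown_starttag(self, tag, attrs):
        self.tag = tag

    def unknown_endtag(self, tag):
        # Stop ignoring data once the element closes.
        self.tag = ''

    def handle_data(self, data):
        if self.tag not in self.ignoretags:
            self.f.write(data)

    def handle_charref(self, name):
        try:
            n = int(name)
            self.handle_data(unichr(n))
        except ValueError:
            pass

    # DANGER: overriding internal function
    def parse_declaration(self, i):
        try:
            return sgmllib.SGMLParser.parse_declaration(self, i)
        except sgmllib.SGMLParseError:
            # Malformed declaration: skip to just past the next '>'.
            match = _endTag.search(self.rawdata, i)
            if match:
                return match.end(0)
            return -1  # no '>' buffered yet; wait for more data

def extractText(html_text):
    s = StringIO.StringIO()
    x = SGML2TextParser(s)
    x.feed(html_text)
    x.close()  # flush any data still buffered by the parser
    return s.getvalue()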
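
One possible alternative, offered only as a sketch: sgmllib was removed in Python 3, so the code below swaps in html.parser.HTMLParser from the Python 3 standard library rather than fixing sgmllib itself. As far as I can tell, that parser treats a malformed `<!...>` declaration as a bogus comment instead of raising, so no internal method needs to be overridden. The names HTML2TextParser, extract_text and skip_depth are hypothetical and do not come from the original post.

```python
import io
from html.parser import HTMLParser


class HTML2TextParser(HTMLParser):
    """Hypothetical Python 3 counterpart of SGML2TextParser."""

    def __init__(self, out, ignoretags=("script",)):
        # convert_charrefs=True makes the parser decode &#65;-style
        # references itself and pass the result to handle_data.
        super().__init__(convert_charrefs=True)
        self.out = out
        self.ignoretags = tuple(ignoretags)
        self.skip_depth = 0  # > 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in self.ignoretags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignoretags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Write text only when we are outside every ignored element.
        if not self.skip_depth:
            self.out.write(data)


def extract_text(html_text):
    out = io.StringIO()
    parser = HTML2TextParser(out)
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return out.getvalue()
```

Tracking a depth counter rather than the last tag seen means text that follows a closing `</script>` is written again, and only documented handler methods are overridden.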
Jul 18 '05 #1