Question regarding HTMLParser module.

When parsing my HTML files, I use handle_pi to capture some embedded Python
code. I have noticed that if the embedded code itself contains HTML,
HTMLParser parses that as well, so the code I pass to exec is cut short and
raises an EOL error. My workaround is to write a different set of characters,
something like (tag) rather than <tag>, and convert it back to <tag> in
another function. Is there a way to tell HTMLParser to ignore the embedded
tags, or is there another alternative?

Any help would be greatly appreciated.
And another note, I am well aware of Zope, Webware, CherryPy, etc... for
py/html embedding options, but I want this to be a learning experience.

HTML processing instruction:
<?
import time
print time.strftime('%b-%d-%Y')
print '<tt>testing!()</tt>'


error:
Traceback (most recent call last):
  File "C:\home\Adonis\python\t.py", line 40, in -toplevel-
    x.feed(z)
  File "C:\Python23\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python23\lib\HTMLParser.py", line 154, in goahead
    k = self.parse_pi(i)
  File "C:\Python23\lib\HTMLParser.py", line 232, in parse_pi
    self.handle_pi(rawdata[i+2: j])
  File "C:\home\Adonis\python\t.py", line 33, in handle_pi
    exec(data)
  File "<string>", line 4
    print '<tt
             ^
SyntaxError: EOL while scanning single-quoted string
Jul 18 '05 #1
Adonis wrote:
When parsing my HTML files, I use handle_pi to capture some embedded Python
code. I have noticed that if the embedded code itself contains HTML,
HTMLParser parses that as well, so the code I pass to exec is cut short and
raises an EOL error. My workaround is to write a different set of characters,
something like (tag) rather than <tag>, and convert it back to <tag> in
another function. Is there a way to tell HTMLParser to ignore the embedded
tags, or is there another alternative?

Any help would be greatly appreciated.
And another note, I am well aware of Zope, Webware, CherryPy, etc... for
py/html embedding options, but I want this to be a learning experience.

Unfortunately, HTMLParser (and the similar sgmllib) miserably fail to
process inline text. I know this very well; I have an HTML-generating
package that uses a lot of scripting and verbatim text.

What's happening in your case is that HTMLParser, when processing a <?
tag, simply and naively reads text up to the next ">". HTMLParser
thinks the > in <tt> closes your <? tag. (It should at least have a
flag indicating whether it should read up to "?>" or just ">".)
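
The truncation is easy to reproduce. The sketch below assumes Python 3, where the module is named html.parser rather than HTMLParser, and PICollector is a hypothetical helper class that only records what handle_pi receives:

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class PICollector(HTMLParser):
    """Hypothetical helper: records the text passed to handle_pi."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
collector.feed("<? print('<tt>monospaced</tt>') ?>")
collector.close()

# The instruction stops at the first ">", which sits inside the string literal.
print(repr(collector.instructions[0]))
```

What reaches handle_pi is only the text before the > of <tt>, which is why exec then complains about an unterminated string.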

A workaround is to do something like this:

<? print '<tt\x3emonospaced</tt\x3e' >

where \x3e is the hex escape for >. That's not quite as bad
as replacing characters, although it's still not perfect.
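
Here is a minimal sketch of that workaround, again assuming Python 3's html.parser. PIRunner and the variable name markup are made up for illustration, and exec should only ever see code you trust:

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class PIRunner(HTMLParser):
    """Hypothetical helper: executes each processing instruction it is given."""

    def __init__(self):
        super().__init__()
        self.namespace = {}

    def handle_pi(self, data):
        # strip() removes the leading space that exec would otherwise
        # reject as unexpected indentation. Only run trusted code this way.
        exec(data.strip(), self.namespace)


# Raw string: the file contains a literal backslash, x, 3, e, not a ">".
source = r"<? markup = '<tt\x3emonospaced</tt\x3e' >"

runner = PIRunner()
runner.feed(source)
runner.close()

# Python turns the escape back into ">" when it evaluates the string literal.
print(runner.namespace["markup"])  # prints <tt>monospaced</tt>
```

Because the only literal > in the source is the one that closes the <? tag, the parser hands the whole statement to handle_pi.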

Another possibility is to use sgmllib, but that's probably way more
trouble than it's worth, and still far from perfect. Basically,
sgmllib parsers have a method called verbatim that turns off HTML tag
processing, although entities and closing tags are still processed.
(Entities and closing tags you can kind of reconstruct into the
original text, although the whitespace is lost.) This is what I do in
my own HTML-generating package.

I'll probably contribute some badly-needed remedies to HTMLParser
sometime, as the limitations of it and sgmllib are starting to get on
my nerves.
--
CARL BANKS
Jul 18 '05 #2
