Bytes IT Community

RE: convert xhtml back to html

-----Original Message-----
From: py***********************************@python.org [mailto:python-
li****************************@python.org] On Behalf Of Tim Arnold
Sent: Thursday, April 24, 2008 9:34 AM
To: py*********@python.org
Subject: convert xhtml back to html

hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).

Seems simple enough, but I'm having some trouble with it. Regexps trip up
because I also have to take into account 'img', 'meta', 'link' tags, not
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
enough of a regexp pro to figure out that lookahead stuff.

I'm not sure where to start now; I looked at BeautifulSoup and
BeautifulStoneSoup, but I can't see how to modify the actual tag.

thanks,
--Tim Arnold
--
http://mail.python.org/mailman/listinfo/python-list

One method that wouldn't require much Python code would be to run the
XHTML through a simple identity XSL transform with the output method set to
HTML. It would have the benefit that you wouldn't have to worry about any of
the specifics of the transformation, though you would need an external
dependency.

As far as I know, both 4suite and lxml (my personal favorite:
http://codespeak.net/lxml/) support XSLT in Python.
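
A rough sketch of the XSLT suggestion using lxml might look like the
following. The function name xhtml_to_html and the sample document are
made up for illustration. The stylesheet is a near-identity transform: it
copies everything but rebuilds each element without its namespace. That step
is an assumption on my part, since a plain identity copy can leave elements
in the XHTML namespace, and the HTML serializer may then still write them
XML-style.

```python
from lxml import etree

# Near-identity stylesheet: copy attributes, text, comments and PIs as-is,
# rebuild each element under its local name (no namespace), and serialize
# the result with the HTML output method.
STYLESHEET = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" indent="no"/>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*|text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
""")

_transform = etree.XSLT(STYLESHEET)


def xhtml_to_html(xhtml_bytes):
    """Parse well-formed XHTML and return it serialized as HTML text."""
    doc = etree.fromstring(xhtml_bytes)
    return str(_transform(doc))


if __name__ == "__main__":
    sample = (
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b"<head><title>t</title></head>"
        b'<body><p>a<br />b</p><img src="images/x.png" alt="x" /><hr /></body>'
        b"</html>"
    )
    print(xhtml_to_html(sample))
```

The input has to be well-formed XML for this to work. Pages that rely on
named entities such as &nbsp; would also need the XHTML DTD to be available
to the parser, which this sketch does not handle.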

It might work out fine for you, but mixing regexps and XML always seems to
work out badly in the end for me.
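
As a small illustration of how the regexp route can go wrong, here is the
pattern from the question tried against two made-up tags. It assumes the
pattern was meant to end in "/>" (the archived text appears to have lost a
character there).

```python
import re

# [^(/>)] is a character class, not a group: it excludes the four
# characters "(", "/", ">" and ")" individually.
naive = re.compile(r"<img[^(/>)]+/>")

simple = '<img src="logo.png" />'            # no "/" inside the attributes
with_path = '<img src="images/logo.png" />'  # "/" inside the src value

matches_simple = naive.search(simple) is not None
matches_with_path = naive.search(with_path) is not None

print(matches_simple, matches_with_path)
```

The first tag matches, but the second does not, because the "/" in the path
stops the character class before the closing "/>" is reached.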
---------
John Krukoff
jk******@ltgc.com

Jun 27 '08 #1