convert xhtml back to html

Tim Arnold

hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).

Seems simple enough, but I'm having some trouble with it. regexps trip up
because I also have to take into account 'img', 'meta', 'link' tags, not
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
enough of a regexp pro to figure out that lookahead stuff.

I'm not sure where to start now; I looked at BeautifulSoup and
BeautifulStoneSoup, but I can't see how to modify the actual tag.

thanks,
--Tim Arnold

Jun 27 '08 #1

Gary Herron

Whether or not you can find an application that does what you want, I
don't know, but at the very least I can say this much.

You should not be reading and parsing the text yourself! XHTML is valid
XML, and there are lots of ways to read and parse XML with Python.
(ElementTree is what I use, but other choices exist.) Once you use an
existing package to read your files into an internal tree structure
representation, it should be a relatively easy job to traverse the tree
to emit the tags and text you want.
Gary Herron
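
As a rough illustration of the parse-then-serialize approach described above, here is a minimal sketch using the standard library's xml.etree.ElementTree. The sample string and variable names are made up for illustration. It assumes the input carries no XHTML namespace declaration; a real XHTML file with an xmlns attribute would come out with prefixed tag names unless that is handled separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: well-formed XML with self-closing tags.
XHTML = '<p>hello <img src="/img.png"/> spam <br/> bye</p>'

# Parse into an in-memory tree.
root = ET.fromstring(XHTML)

# Serialize with the HTML output method, which writes empty elements
# such as br and img without the trailing slash and without an end tag.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)
```

The point of the design is that the parser, not a hand-written pattern, decides where each tag ends, so attribute values containing slashes cause no trouble.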

Jun 27 '08 #2

Arnaud Delobelle

Hi, I'm not sure if this is very helpful but the following works on
the very simple example below.

>>> import re
>>> xhtml = 'hello <img src="/img.png"/> spam <br/> bye '
>>> xtag = re.compile(r'<([^>]*?)/>')
>>> xtag.sub(r'<\1>', xhtml)
'hello <img src="/img.png"> spam <br> bye '
--
Arnaud

Jun 27 '08 #3

Tim Arnold

I agree and I'd really rather not parse it myself. However, ET will clean up
the file which in my case includes some comments required as metadata, so
that won't work. Oh, I could get ET to read it and write a new parser--I see
what you mean. I think I need to subclass so I could get ET to honor those
comments too.
That's one way to go, I was just hoping for something easier.
thanks,
--Tim
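
On the worry about losing comments: the following is a sketch only, assuming Python 3.8 or later, where the standard library's TreeBuilder accepts an insert_comments flag. The sample markup and the comment text are invented for illustration. As far as I know, comments that sit outside the root element are still not kept by this route.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with a metadata comment inside the root element.
XHTML = '<div><!-- meta: keep-me --><p>text<br/>more</p></div>'

# Ask the tree builder to insert comments into the tree (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(XHTML, parser=parser)

# The HTML output method writes the comment back out unchanged.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)
```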

Jun 27 '08 #4

Tim Arnold

Thanks for that. It is helpful--I guess I had a brain malfunction. Your
example will work for me I'm pretty sure, except in some cases where the IMG
alt text contains a gt sign. I'm not sure that's even possible, so maybe
this will do the job.
thanks,
--Tim
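
For the case of a gt sign inside an attribute value, one possible variation on the regexp idea is to let the pattern step over quoted strings as whole units. This is a sketch under assumptions, not a tested converter: the tag list is just the five names mentioned in this thread, and the helper name unclose is hypothetical.

```python
import re

# Tags named in the thread; extend as needed (assumption).
VOID_TAGS = ("br", "hr", "img", "meta", "link")

# Attribute text: any run of non-quote, non-'>' characters or complete
# quoted strings, so a '>' inside quotes does not end the tag early.
ATTRS = r"""(?:[^>"']|"[^"]*"|'[^']*')*?"""

SELF_CLOSED = re.compile(
    r"<(%s)\b(%s)\s*/>" % ("|".join(VOID_TAGS), ATTRS),
    re.IGNORECASE,
)

def unclose(xhtml):
    """Rewrite <tag ... /> as <tag ...> for the listed tags only."""
    return SELF_CLOSED.sub(r"<\1\2>", xhtml)

print(unclose('<img alt="a > b" src="/x.png"/> text <br /> more <div/>'))
```

Restricting the match to known empty elements means something like a self-closed div is left alone rather than turned into an unclosed start tag.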

Jun 27 '08 #5

Walter Dörwald

You might try XIST (http://www.livinglogic.de/Python/xist):

Code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = 'hello <img src="/img.png"/> spam <br/> bye '

doc = parsers.parsestring(xhtml)
print doc.bytes(xhtml=0)

This outputs:

hello <img src="/img.png"> spam <br> bye 

(and a warning that the alt attribute is missing in the img ;))

Servus,
Walter

Jun 27 '08 #6

bryan rasmussen

I'll second the recommendation to use xsl-t, set the output to html.
The code for an XSL-T to do it would be basically:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" />
<xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>

you would probably want to do other stuff than just copy it out but
that's another case.

Also, from my recollection the solution in CHM to make XHTML br
elements behave correctly was <br />; at any rate
I've done projects generating CHM and my output markup was well-formed
XML on all occasions.

Cheers,
Bryan Rasmussen

Jun 27 '08 #7

Stefan Behnel

Tim Arnold wrote:

hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).

This should do the job in lxml 2.x:

from lxml import etree

tree = etree.parse("thefile.xhtml")
tree.write("thefile.html", method="html")

http://codespeak.net/lxml

Stefan

Jun 27 '08 #8

bryan rasmussen

wow, that's pretty nice there.

Just to know: what's the performance like on XML instances of 1 GB?

Cheers,
Bryan Rasmussen

Jun 27 '08 #9

Stefan Behnel

bryan rasmussen top-posted:

wow, that's pretty nice there.

Just to know: what's the performance like on XML instances of 1 GB?

That's a pretty big file, although you didn't mention what kind of XML
language you want to handle and what you want to do with it.

lxml is pretty conservative in terms of memory:

http://blog.ianbicking.org/2008/03/3...r-performance/

But the exact numbers depend on your data. lxml holds the XML tree in memory,
which is a lot bigger than the serialised data. So, for example, if you have
2GB of RAM and want to parse a serialised 1GB XML file full of little
one-element integers into an in-memory tree, get prepared for lunch. With a
lot of long text string content instead, it might still fit.

However, lxml also has a couple of step-by-step and stream parsing APIs:

http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...e-and-iterwalk

They might do what you want.

Stefan
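
To give an idea of what incremental parsing looks like, here is a small sketch using iterparse from the standard library's xml.etree.ElementTree; lxml.etree provides an iterparse function with a similar calling style. The function name count_tags and the generated sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_tags(source, tag):
    """Count elements named `tag` while parsing the input incrementally."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Discard the finished element's text, attributes and children.
        # The emptied element itself stays attached to its parent.
        elem.clear()
    return count

# Hypothetical stand-in for a large file: 1000 small elements.
sample = io.BytesIO(b"<root>" + b"<n>1</n>" * 1000 + b"</root>")
print(count_tags(sample, "n"))
```

Because each element is emptied as soon as its end tag is seen, memory use depends far less on the total document size than it does with a fully built tree.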

Jun 27 '08 #10

Jim Washington

If you are operating with huge XML files (say, larger than available
RAM) repeatedly, an XML database may also be a good option.

My current favorite in this realm is Sedna (free, Apache 2.0 license).
Among other features, it has facilities for indexing within documents
and collections (faster queries) and transactional sub-document updates
(safely modify parts of a document without rewriting the entire
document). I have been working on a python interface to it recently
(zif.sedna, in pypi).

Regarding RAM consumption, a Sedna database uses approximately 100 MB of
RAM by default, and that does not change much, no matter how much (or
how little) data is actually stored.

For a quick idea of Sedna's capabilities, the Sedna folks have put up an
on-line demo serving and xquerying an extract from Wikipedia (in the
range of 20 GB of data) using a Sedna server, at
http://wikidb.dyndns.org/ . Along with the on-line demo, they provide
instructions for deploying the technology locally.

- Jim Washington

Jun 27 '08 #11

Tim Arnold

Thanks Bryan, Walter, John, Marc, and Stefan. I finally went with the xslt
transform which works very well and is simple. regexps would work, but they
just scare me somehow. Bryan, my tags were formatted as <br /> but the help
compiler would issue warnings on each one, resulting in log files with
thousands of warnings. It did finish the compile, but it made
understanding the logs too painful.

Stefan, I *really* look forward to being able to use lxml when I move to RH
linux next month. I've been using hp10.20 and never could get the requisite
libraries to compile. Once I make that move, maybe I won't have as many
markup related questions here!

thanks again to all for the great suggestions.
--Tim Arnold

Jun 27 '08 #12
