hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).
Seems simple enough, but I'm having some trouble with it. regexps trip up
because I also have to take into account 'img', 'meta', 'link' tags, not
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
enough of a regexp pro to figure out that lookahead stuff.
I'm not sure where to start now; I looked at BeautifulSoup and
BeautifulStoneSoup, but I can't see how to modify the actual tag.
thanks,
--Tim Arnold
Whether or not you can find an application that does what you want, I
don't know, but at the very least I can say this much.
You should not be reading and parsing the text yourself! XHTML is valid
XML, and there are lots of ways to read and parse XML with Python.
(ElementTree is what I use, but other choices exist.) Once you use an
existing package to read your files into an internal tree structure
representation, it should be a relatively easy job to traverse the tree
to emit the tags and text you want.
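A minimal sketch of that traverse-and-emit idea, using the standard-library ElementTree. The names `xhtml_to_html`, `emit`, `local_name` and the `VOID_TAGS` set are illustrative assumptions, not part of any library. The default parser drops comments and the doctype, and it will reject HTML entities such as &nbsp; unless they are declared, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: the elements that HTML writes without a closing tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "area", "base", "col", "param"}


def local_name(name):
    # ElementTree stores namespaced names as "{uri}name"; keep only "name".
    return name.rsplit("}", 1)[-1]


def emit(elem, out):
    # Write the opening tag with its attributes.
    name = local_name(elem.tag)
    attrs = "".join(
        ' %s="%s"' % (local_name(key), escape(value, quote=True))
        for key, value in elem.attrib.items()
    )
    out.append("<%s%s>" % (name, attrs))
    if name in VOID_TAGS:
        return  # plain HTML: no content and no closing tag
    if elem.text:
        out.append(escape(elem.text, quote=False))
    for child in elem:
        emit(child, out)
        if child.tail:  # text that follows the child inside this element
            out.append(escape(child.tail, quote=False))
    out.append("</%s>" % name)


def xhtml_to_html(source):
    # Parse the XHTML string as XML, then serialise it with HTML-style tags.
    out = []
    emit(ET.fromstring(source), out)
    return "".join(out)


print(xhtml_to_html('<p>hello <img src="/img.png"/> spam <br/> bye </p>'))
```

Elements outside `VOID_TAGS` always get an explicit closing tag, so an empty `<div/>` would come out as `<div></div>` rather than an unclosed `<div>`.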
Gary Herron
Hi, I'm not sure if this is very helpful but the following works on
the very simple example below.
    >>> import re
    >>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
    >>> xtag = re.compile(r'<([^>]*?)/>')
    >>> xtag.sub(r'<\1>', xhtml)
    '<p>hello <img src="/img.png"> spam <br> bye </p>'
--
Arnaud
Gary Herron wrote:
> You should not be reading and parsing the text yourself! XHTML is valid
> XML, and there are lots of ways to read and parse XML with Python.
> (ElementTree is what I use, but other choices exist.) Once you use an
> existing package to read your files into an internal tree structure
> representation, it should be a relatively easy job to traverse the tree to
> emit the tags and text you want.
I agree and I'd really rather not parse it myself. However, ET will clean up
the file which in my case includes some comments required as metadata, so
that won't work. Oh, I could get ET to read it and write a new parser--I see
what you mean. I think I need to subclass so I could get ET to honor those
comments too.
That's one way to go, I was just hoping for something easier.
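For what it's worth, newer versions of the standard library make the comment problem easier than subclassing. The sketch below assumes Python 3.8 or later, where `TreeBuilder` accepts `insert_comments=True`; the function name `parse_with_comments` and the sample markup are made up for illustration. It also leans on ElementTree's own `method="html"` serialiser rather than a hand-written one.

```python
import xml.etree.ElementTree as ET


def parse_with_comments(text):
    # insert_comments=True (Python 3.8+) keeps comments as nodes in the tree
    # instead of silently discarding them.
    builder = ET.TreeBuilder(insert_comments=True)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(text, parser=parser)


root = parse_with_comments('<p><!-- meta: keep me -->hello<br/></p>')

# Comment nodes are elements whose tag is the ET.Comment factory.
comments = [node.text for node in root.iter() if node.tag is ET.Comment]

# method="html" writes void elements such as br without the trailing slash.
html_text = ET.tostring(root, encoding="unicode", method="html")
print(comments)
print(html_text)
```

Pages that declare the XHTML namespace may need extra handling before serialising, since ElementTree keeps the namespace on every tag name.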
thanks,
--Tim
Arnaud Delobelle wrote:
> Hi, I'm not sure if this is very helpful but the following works on
> the very simple example below.
> >>> import re
> >>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
> >>> xtag = re.compile(r'<([^>]*?)/>')
> >>> xtag.sub(r'<\1>', xhtml)
> '<p>hello <img src="/img.png"> spam <br> bye </p>'
Thanks for that. It is helpful--I guess I had a brain malfunction. Your
example will work for me I'm pretty sure, except in some cases where the IMG
alt text contains a gt sign. I'm not sure that's even possible, so maybe
this will do the job.
thanks,
--Tim
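On the gt-sign worry: XML does allow a literal > inside a quoted attribute value, so the case can occur. A hedged sketch of a quote-aware variant of the same regex idea follows; the names `SELF_CLOSING` and `to_html_tags` are invented here, and the pattern assumes well-formed tags whose attribute values are all quoted.

```python
import re

SELF_CLOSING = re.compile(r"""
    <(                              # group 1: tag name plus attributes
        [A-Za-z][\w:.-]*            # tag name, e.g. br or img
        (?:                         # zero or more attributes
            \s+[\w:.-]+             #   attribute name
            \s*=\s*
            (?:"[^"]*"|'[^']*')     #   quoted value, which may contain a gt sign
        )*
    )
    \s*/>                           # optional space, then the XHTML-style close
""", re.VERBOSE)


def to_html_tags(xhtml):
    # Rewrite <tag ... /> as <tag ...>, leaving everything else untouched.
    return SELF_CLOSING.sub(r"<\1>", xhtml)


sample = '<p>a <img alt="x > y" src="i.png" /> b<br/></p>'
print(to_html_tags(sample))
```

Like the simpler pattern, this rewrites every self-closed tag it sees, including ones inside comments and non-void elements such as an empty `<div/>`, so it only suits input where those do not appear.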
You might try XIST (http://www.livinglogic.de/Python/xist):
Code looks like this:
    from ll.xist import parsers
    from ll.xist.ns import html
    xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
    doc = parsers.parsestring(xhtml)
    print doc.bytes(xhtml=0)
This outputs:
    <p>hello <img src="/img.png"> spam <br> bye </p>
(and a warning that the alt attribute is missing in the img ;))
Servus,
Walter
I'll second the recommendation to use xsl-t, set the output to html.
The code for an XSL-T to do it would be basically:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" />
<xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
you would probably want to do other stuff than just copy it out but
that's another case.
Also, from my recollection the solution in CHM to make XHTML br
elements behave correctly was <br /> as opposed to <br/>. At any rate,
I've done projects generating CHM and my output markup was well-formed
XML on all occasions.
Cheers,
Bryan Rasmussen
This should do the job in lxml 2.x:
    from lxml import etree
    tree = etree.parse("thefile.xhtml")
    tree.write("thefile.html", method="html")
http://codespeak.net/lxml
Stefan
wow, that's pretty nice there.
Just to know: what's the performance like on XML instances of 1 GB?
Cheers,
Bryan Rasmussen
bryan rasmussen top-posted:
> On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <st*******@behnel.de> wrote:
>> from lxml import etree
>> tree = etree.parse("thefile.xhtml")
>> tree.write("thefile.html", method="html")
>> http://codespeak.net/lxml
> wow, that's pretty nice there.
> Just to know: what's the performance like on XML instances of 1 GB?
That's a pretty big file, although you didn't mention what kind of XML
language you want to handle and what you want to do with it.
lxml is pretty conservative in terms of memory: http://blog.ianbicking.org/2008/03/3...r-performance/
But the exact numbers depend on your data. lxml holds the XML tree in memory,
which is a lot bigger than the serialised data. So, for example, if you have
2GB of RAM and want to parse a serialised 1GB XML file full of little
one-element integers into an in-memory tree, get prepared for lunch. With a
lot of long text string content instead, it might still fit.
However, lxml also has a couple of step-by-step and stream parsing APIs:
http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...e-and-iterwalk
They might do what you want.
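To give a feel for the step-by-step style, here is a small sketch that uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in; lxml ships an `iterparse` of its own with a similar shape. The function name `count_elements` and the tiny in-memory sample are assumptions made for illustration, and the sketch assumes the counted elements are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    # Walk the document event by event instead of building the whole tree.
    count = 0
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event == "end" and elem.tag == tag:
            count += 1
            root.clear()  # drop finished children so memory stays bounded
    return count


# A small stand-in for a very large file of little one-element integers.
sample = io.BytesIO(b"<root>" + b"<n>1</n>" * 1000 + b"</root>")
print(count_elements(sample, "n"))
```

Clearing the root after each finished element is what keeps the tree from growing with the input; without it the parser still accumulates every element it has seen.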
Stefan