Bytes IT Community

CDATA and lxml

Heyas

So first off, I know that CDATA is generally hated and just shouldn't
be done, but I'm simply required to parse it and spit it back out.
Parsing is pretty easy with lxml, but it's the spitting back out
that's giving me issues. The fact that lxml strips all the CDATA
stuff off isn't really a big issue either, so long as I can create
CDATA blocks later with <, > and & showing up literally instead of
as &lt;, &gt; and &amp;. I've scoured the lxml docs, but probably not
hard enough, so does anyone know the page I'm looking for, or have a
quick how-to?

Thanks!
Apr 11 '08 #1


Silfheed wrote:
> So first off I know that CDATA is generally hated and just shouldn't
> be done, but I'm simply required to parse it and spit it back out.
> Parsing is pretty easy with lxml, but it's the spitting back out
> that's giving me issues. The fact that lxml strips all the CDATA
> stuff off isnt really a big issue either, so long as I can create
> CDATA blocks later with <>&'s showing up instead of &lt;&gt;&amp; .
> I've scoured through the lxml docs, but probably not hard enough, so
> anyone know the page I'm looking for or have a quick how to?

There's nothing in the docs because lxml doesn't allow you to create CDATA
sections. You're not the first one to ask for that, but so far, no one has
really taken it on.

It's not as trivial as it sounds. Removing the CDATA sections in the parser is
just for fun. It simplifies the internal tree traversal and text aggregation,
so this would be affected if we allowed CDATA content in addition to normal
text content. It's not that hard, it's just that it hasn't been done so far.

Stefan
Apr 11 '08 #2

Stefan Behnel wrote:
> It's not as trivial as it sounds. Removing the CDATA sections in the parser is
> just for fun.

... *not* just for fun ...

obviously ...

Stefan
Apr 11 '08 #3

Hi again,

Stefan Behnel wrote:
> Silfheed wrote:
>> So first off I know that CDATA is generally hated and just shouldn't
>> be done, but I'm simply required to parse it and spit it back out.
>> Parsing is pretty easy with lxml, but it's the spitting back out
>> that's giving me issues. The fact that lxml strips all the CDATA
>> stuff off isnt really a big issue either, so long as I can create
>> CDATA blocks later with <>&'s showing up instead of &lt;&gt;&amp; .
>> I've scoured through the lxml docs, but probably not hard enough, so
>> anyone know the page I'm looking for or have a quick how to?
>
> There's nothing in the docs because lxml doesn't allow you to create CDATA
> sections. You're not the first one asking that, but so far, no one really had
> a take on this.

So I gave it a try, then. In lxml 2.1, you will be able to do this:
>>> from lxml.etree import Element, CDATA, tostring
>>> root = Element("root")
>>> root.text = CDATA('test')
>>> tostring(root)
'<root><![CDATA[test]]></root>'

This does not work for .tail content, only for .text content (no technical
reason, I just don't see why that should be enabled).
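
For the original question (getting <, > and & written literally rather than as entities), here is a minimal sketch of the difference between plain text and CDATA-wrapped text. It assumes a recent lxml where tostring() accepts encoding="unicode"; the sample string and variable names are made up for illustration.

```python
from lxml import etree

root = etree.Element("root")

# Plain text: special characters are escaped on serialization.
root.text = "a < b && c > d"
escaped = etree.tostring(root, encoding="unicode")

# Same string wrapped in CDATA: the characters are written literally
# inside a <![CDATA[...]]> section.
root.text = etree.CDATA("a < b && c > d")
literal = etree.tostring(root, encoding="unicode")

print(escaped)
print(literal)
```

Reading root.text back still gives the plain string in both cases; the CDATA wrapper only changes how the text is serialized.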

There's also a parser option "strip_cdata" now that allows you to leave CDATA
sections in the tree (pass strip_cdata=False to the parser). However, they will
*not* behave any differently from normal text, so you can't even see at the API
level that you are dealing with CDATA. If you want to be really, really sure,
you can always do this:
>>> root.text = CDATA(root.text)
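
A rough sketch of the parse-and-write-back round trip with that option, assuming lxml's XMLParser(strip_cdata=False) and a made-up input document. It is only meant to show the difference from the default parser, not a complete solution.

```python
from lxml import etree

xml = "<root><![CDATA[1 < 2 & 3]]></root>"

# Default parser: the CDATA section is turned into ordinary text,
# so it comes back out with escaped entities.
default_root = etree.fromstring(xml)
default_out = etree.tostring(default_root, encoding="unicode")

# strip_cdata=False keeps the section in the tree, so it is
# serialized as a CDATA section again.
keeping_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(xml, keeping_parser)
kept_out = etree.tostring(kept_root, encoding="unicode")

print(default_out)
print(kept_out)

# At the API level both trees expose the same plain text.
print(kept_root.text == default_root.text)
```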
Hope that helps,

Stefan
Apr 11 '08 #4

That is immensely cool. Do you plan to stick it into svn soon?
Thanks!
Apr 11 '08 #5

Ah, looks like it's there already. Very cool, very cool. Thanks
again.
Apr 11 '08 #6
