Tidy transforms "&amp;" in the source-xml into a "&"

Hi,

Two issues are left with my Tidy work:

1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
version. My XML importer cannot handle it.
2) In a long <title>-string a wrap is produced like:
<title>my very long title blab la blab la
Blabla bla </title>
The importer also has problems with this.

My tidy.bat:
tidy.exe --output-xhtml yes --show-body-only yes --new-blocklevel-tags component,bblocation,title2,short_intro,long_intro,date,reference,category,image_small,image_medium,image_large,body2,external_link_text1,external_link_url1 --indent auto --write-back yes %1

regards
Ragnar

Nov 4 '06 #1
Ragnar wrote:
> 1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
> version.

Hold it a moment -- if your source is XML, why are you going through Tidy?

Having said that, this shouldn't happen in XHTML output mode. Contact
Tidy's authors, and/or show us a failing example so we can crosscheck
this and make sure.

> 2) in a long <title>-string a wrap is produced like:
> <title>my very long title blab la blab la
> Blabla bla </title>
> Importer also has got problems with it

Turn off auto-indent.
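
For reference, Tidy's line wrapping is controlled by its `wrap` option (default 68 columns), separately from `indent`, so a wrapped <title> may persist even with indenting turned off. Below is a minimal sketch, assuming Python is used as a wrapper around the tidy.bat command. `build_tidy_command` and `run_tidy` are hypothetical helper names, and whether `--wrap 0` cures the importer problem is untested here.

```python
import subprocess

# Custom tags taken from the original tidy.bat.
NEW_BLOCK_TAGS = [
    "component", "bblocation", "title2", "short_intro", "long_intro",
    "date", "reference", "category", "image_small", "image_medium",
    "image_large", "body2", "external_link_text1", "external_link_url1",
]


def build_tidy_command(path, tidy_exe="tidy.exe"):
    """Return the tidy.bat command as an argument list, with wrapping disabled."""
    return [
        tidy_exe,
        "--output-xhtml", "yes",
        "--show-body-only", "yes",
        "--new-blocklevel-tags", ",".join(NEW_BLOCK_TAGS),
        "--indent", "auto",
        "--wrap", "0",  # 0 tells Tidy not to wrap lines at all
        "--write-back", "yes",
        path,
    ]


def run_tidy(path):
    """Run Tidy on path and return its exit code (assumes tidy.exe is on PATH)."""
    return subprocess.run(build_tidy_command(path)).returncode
```

Calling `run_tidy("source.xml")` would rewrite the file in place, as the batch file does, so it is worth trying on a copy first.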

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #2
On Sat, 04 Nov 2006 10:17:58 -0500, Joe Kesselman
<ke************@comcast.net> wrote:
> Hold it a moment -- if your source is XML, why are you going through Tidy?

Is there a better way to check the well-formedness of an XML file than
tidy -xml?
-Timo
Nov 4 '06 #3
Hi Joe

Turning off indent-auto doesn't make a difference.

Here is my file: http://www.ticope.de/tmp/source.xml

thank you!

Nov 4 '06 #4
Timo Harmo wrote:
> Is there a better way to check the well-formedness of a xml-file than
> tidy -xml ?

Tidy is not primarily an XML tool. It's a tool for repairing
sloppily-written HTML and XHTML.

To check well-formedness of XML, feed it to any proper XML parser. If
the parser doesn't accept it, the XML is not well-formed.
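
A minimal sketch of such a check, assuming Python's standard-library ElementTree parser stands in for "any proper XML parser". The helper name `well_formedness_error` is made up for illustration. Note that this tests well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return str(err)
    return None


# Quoted attribute value: accepted.
print(well_formedness_error('<doc lang="en"><title>ok</title></doc>'))
# Unquoted attribute value: rejected, so this is not XML.
print(well_formedness_error('<doc lang=en><title>bad</title></doc>'))
```

The first call prints None; the second prints the parser's complaint together with the position of the offending token.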
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #5
You never answered my question: If this is already XML, why are you
putting it through Tidy in the first place?

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #6
Ragnar wrote:
> Here is my file: http://www.ticope.de/tmp/source.xml

Not well formed, so it isn't XML, despite the file name. First obvious
error is that someone failed to put quotes around the value of the lang
attribute. I'd recommend you fix this where it originates, rather than
trying to patch it later by running it through Tidy, especially since
you say Tidy's doing things you don't expect.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #7
Tried running the most recent copy of Tidy against your input file,
using your batchfile. It is *NOT* damaging the &. Either you're
confusing yourself badly (for example, looking at the text in an XML
tool, which of course will see &amp; as the & character since that's
what &amp; represents), or you're running a damaged copy of Tidy and
need to upgrade.

I'll bet on the former.
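
The point about what an XML tool shows can be illustrated with a short sketch, assuming Python's standard-library ElementTree as the "XML tool". The sample string is invented for the example.

```python
import xml.etree.ElementTree as ET

# The file on disk contains the escaped form.
source = "<title>Fish &amp; Chips</title>"

element = ET.fromstring(source)

# What an XML-aware tool shows: the character the reference stands for.
parsed_text = element.text
print(parsed_text)  # Fish & Chips

# Serializing again restores the escaped form in the markup.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)  # <title>Fish &amp; Chips</title>
```

So seeing a bare & in a parsed view does not by itself mean the file contains one; checking the raw bytes in a plain text editor settles it.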

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #8
Oh, forgot to say: The only thing I did differently was that I named the
input file test.html.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 4 '06 #9
I may also have accidentally dropped the "--write-back yes".

Still, this does suggest that Tidy isn't your problem.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 5 '06 #10
Joe Kesselman wrote:
> Tried running the most recent copy of Tidy against your input file,
> using your batchfile. It is *NOT* damaging the &. Either you're
> confusing yourself badly (for example, looking at the text in an XML
> tool, which of course will see &amp; as the & character since that's
> what &amp; represents), or you're running a damaged copy of Tidy and
> need to upgrade.

Hi Joe

Thank you so much for your work and help.

Yes, you might be right. I was confused by the tool, which presented
&amp; as &.
So you say I don't have well-formed XML and therefore I cannot use Tidy.
The content was exported automatically from an older version of a CMS,
and the rich-text fields were not XHTML-compliant. But you are right: I
should focus more on the export and try to optimize the exporter
instead of the importer. Maybe it is enough to run Tidy there, or to
do a lot of string manipulation (Replace) in the phase where the
content is exported using SOAP.
Ragnar

Nov 5 '06 #11
Ragnar wrote:
> So you say I don't have well-formed XML and therefore I cannot use Tidy.

Tidy's job is to (take an informed guess at how to) fix ill-formed HTML,
not ill-formed XML. And even there, it should be considered a stopgap,
used only because so few people (or tools!) produce officially correct HTML.

If you're working in XML, you should start by producing real XML. That
really shouldn't be hard to do.
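
One way to read "producing real XML", sketched under the assumption that the exporter can be written in (or call into) Python: let an XML serializer emit the markup instead of concatenating strings, so escaping and attribute quoting are not left to hand-written Replace calls. The function `build_component` and the field layout are hypothetical; only the element names come from the tag list in tidy.bat.

```python
import xml.etree.ElementTree as ET


def build_component(title, body, lang):
    """Serialize one exported record; the serializer escapes and quotes for us."""
    component = ET.Element("component", {"lang": lang})
    ET.SubElement(component, "title").text = title
    ET.SubElement(component, "body2").text = body
    return ET.tostring(component, encoding="unicode")


xml_text = build_component("Research & Development", "1 < 2", "en")
print(xml_text)
# <component lang="en"><title>Research &amp; Development</title><body2>1 &lt; 2</body2></component>
```

Rich-text fields that themselves contain markup would still need to be well-formed before they can be embedded as elements rather than as escaped text.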
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 5 '06 #12

Joe Kesselman wrote:
> To check well-formedness of XML, feed it to any proper XML parser. If
> the parser doesn't accept it, the XML is not well-formed.

What would you suggest if it _isn't_ well-formed XML? (dodgy use of
HTML entities being an obvious "fixable" problem that springs to mind)

It's not an uncommon problem to have to deal with cruddy XML like this.
I'd be interested to hear what other people's favourite tools for
helping with it are.
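
As an illustration of the "fixable" entity case only, here is a rough sketch that rewrites HTML named entities as numeric character references, which XML accepts without a DTD. It assumes Python and its standard `html.entities` table. The helper name `html_entities_to_numeric` is made up, and the approach deliberately ignores CDATA sections, comments and bare ampersands, so it is a narrow patch rather than a general repair.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Replace known HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)


broken = "<title>Caf&eacute;&nbsp;&amp; bar</title>"
fixed = html_entities_to_numeric(broken)
print(fixed)  # <title>Caf&#233;&#160;&amp; bar</title>
print(repr(ET.fromstring(fixed).text))
```

The unpatched string is rejected by the parser because `&eacute;` and `&nbsp;` are undefined in plain XML; the patched one parses.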

Nov 8 '06 #13
Andy Dingley wrote:
> What would you suggest if it _isn't_ well-formed XML? (dodgy use of
> HTML entities being an obvious "fixable" problem that springs to mind)

There really is no good way to repair a damaged document without deep
knowledge of exactly what the intended document structure was -- which
is why Tidy is such a complicated application; it needs to understand
HTML well enough to make intelligent guesses about what the author's
intent was.

The *best* you can hope to do is to sweep the problem under the carpet
and guess right most of the time.

So I would, very strongly, suggest fixing the problem at the source. If
it isn't well-formed XML, fix the tool that generated it.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Nov 8 '06 #14
