Preventing the UTF-8 Parser from converting an entity?

Hello all,

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#xA;) within an attribute that we are using to work
around CSS limitations.

After the parser has gone through the document, the entity is converted
to \n, which effectively throws away the behavior we get by keeping the
entity AS IS within the document.

Is there a clean and easy way around this?

Any help will be greatly appreciated.

Regards
Jean-Francois Michaud

Sep 18 '06 #1
* Jean-François Michaud wrote in comp.text.xml:
>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#xA;) within an attribute that we are using to work
>around CSS limitations.

I don't understand your question. First, &#xA; is not an entity but a
numeric character reference. Second, processing those is independent of
character encodings like UTF-8. Third, I don't see what CSS limitation
you might be referring to here.

>After the parser has gone through the document, the entity is converted
>to \n, which effectively throws away the behavior we get by keeping the
>entity AS IS within the document.

What is "\n" here? What do you mean by "converted"? What do you mean by
keeping it? Processing white-space characters and character references
to them in attribute values is explained in the XML specification. XML
processors keep them to the extent that they are significant. If you
connect the processor to a serializer, the input and output documents
will be canonically equivalent unless one of them has a bug. So there
should be no issue here.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Sep 18 '06 #2


Jean-François Michaud wrote:

>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#xA;) within an attribute that we are using to work
>around CSS limitations.

&#xA; is neither an entity nor an entity reference; it is a numeric
character reference.
What is a "UTF-8 parser"?

>After the parser has gone through the document, the entity is converted
>to \n, which effectively throws away the behavior we get by keeping the
>entity AS IS within the document.

It is not clear what kind of tool you use and what you ultimately
produce, but if you want to serialize a DOM or an XSLT result tree to
XML markup and want that newline character to be escaped as &#xA;, that
is, as a numeric character reference, then you need an XML serializer
that does that. If you want to serialize such a tree to HTML markup,
then you need an HTML serializer that does that.

--

Martin Honnen
http://JavaScript.FAQTs.com/
Sep 18 '06 #3
In article <11**********************@e3g2000cwe.googlegroups.com>,
Jean-François Michaud <co*****@comcast.net> wrote:
>After the parser has gone through the document, the entity is converted
>to \n, which effectively throws away the behavior we get by keeping the
>entity AS IS within the document.
>Is there a clean and easy way around this?

Not using XML. XML applications are effectively required to treat
character references in content the same way that they treat the
characters referred to. A conforming XML parser will convert it in
the way you describe.

If you want to have something that's like a newline but is treated
differently, then a character reference is not the right approach.
That's not what they're for. Using an element such as <nl/> might be
a better solution.
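
To illustrate the point about conforming parsers, here is a minimal
sketch using Python's standard xml.etree.ElementTree. That choice is an
assumption made for illustration only; the thread does not say which
parser or language is actually in use. It parses the same content twice,
once with a character reference and once with the literal character, and
compares what the application receives.

```python
import xml.etree.ElementTree as ET

# Element content written with a numeric character reference...
with_reference = ET.fromstring("<p>line one&#xA;line two</p>")
# ...and the same content written with a literal line feed.
with_literal = ET.fromstring("<p>line one\nline two</p>")

# After parsing, the application receives the same string either way;
# nothing records which spelling the source document used.
print(repr(with_reference.text))
print(with_reference.text == with_literal.text)
```

Once the document has been parsed, the two spellings are
indistinguishable, which is why the reference cannot be kept "as is"
at that level.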

-- Richard
Sep 18 '06 #4

Richard Tobin wrote:
>In article <11**********************@e3g2000cwe.googlegroups.com>,
>Jean-François Michaud <co*****@comcast.net> wrote:
>>After the parser has gone through the document, the entity is converted
>>to \n, which effectively throws away the behavior we get by keeping the
>>entity AS IS within the document.
>>Is there a clean and easy way around this?
>
>Not using XML. XML applications are effectively required to treat
>character references in content the same way that they treat the
>characters referred to. A conforming XML parser will convert it in
>the way you describe.
>
>If you want to have something that's like a newline but is treated
>differently, then a character reference is not the right approach.
>That's not what they're for. Using an element such as <nl/> might be
>a better solution.

Understood, but we are using a strange combination of XML + CSS under
the VEX XML editor.

We are displaying the attribute before a bit of text, but because of a
silly CSS limitation (not being able to test for a condition in a
:before pseudo-element), we thought that appending the &#xA;
character at the end of the string would do the trick. It does indeed
work, but as soon as we save the document, the character gets converted
to UTF-8 encoding. We HAVE to use this character because VEX doesn't
deal with UTF-8 encoding directly to format its output. Using an <nl/>
element is simply not an option.

Regards
Jean-Francois Michaud

Sep 18 '06 #5

Bjoern Hoehrmann wrote:
>* Jean-François Michaud wrote in comp.text.xml:
>>I'm having a little problem. The UTF-8 parser we are using converts the
>>newline entity (&#xA;) within an attribute that we are using to work
>>around CSS limitations.
>
>I don't understand your question. First, &#xA; is not an entity but a
>numeric character reference. Second, processing those is independent of
>character encodings like UTF-8. Third, I don't see what CSS limitation
>you might be referring to here.

Alright, let me clarify. We allow numeric character references to be
included in our XML document so that special characters can be included
in the output. These numeric sequences get converted to UTF-8 encoding
for proper transformation into yet another XML document, which is then
transformed into PDF using XSLT/XSL-FO. All the way through, the
encoding has to be UTF-8, which is why the numeric sequences have to be
converted to meet this restriction. The problem is that the XML editor
that we use to display the XML content (using XML + CSS) doesn't use
UTF-8 encoded characters when dealing with formatting. It recognizes
the &#xA; character, but not the UTF-8 version of it.

The problem all stems from CSS not allowing me to test a condition
while displaying using a :before pseudo-element (I can either display
using :before, or I can test for a condition, but I can't do both at
the same time. Yay for CSS!).

The solution was to append the &#xA; character at the end of the string
attribute that we want to display so that the carriage return only
occurs when the string is non-empty. This works splendidly, but as soon
as we save the document, the engine converts everything to UTF-8
encoding (booo!).

[snip]

Regards
Jean-Francois Michaud

Sep 18 '06 #6
>The solution was to append the &#xA; character at the end of the string
>attribute

If you mean inside the attribute value... A properly functioning XML
serializer should recognize line breaks within attribute values as a
special case and escape them as necessary to write them back out,
typically as &#xA;.

However, the distinction between &#xA;, CR, LF, and CRLF will not be
preserved elsewhere. The only place where XML cares about the difference
between these is in the details of attribute value normalization and
serialization.

And while looking at the parsed version of the data (as output from the
parser but not run back through a serializer), you will always see these
as the newline character.

I'm still not sure from your description which of these applies to your
particular problem. You might want to post a very explicit description
of what your source XML looks like, how you're viewing the result of the
parse, and what you're seeing.

In any case, UTF-8 has nothing to do with any of the above; it's
strictly XML behavior.
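
The difference between the parsed form and the serialized form can be
sketched with Python's standard xml.etree.ElementTree. This is an
illustration under assumptions: the element name "item" and attribute
name "label" are hypothetical, and VEX's own parser and serializer may
behave differently. ElementTree happens to write the line feed back in
its decimal spelling, &#10;, which is equivalent to &#xA;.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: an attribute value ending in a line feed reference.
source = '<item label="Title&#xA;"/>'
element = ET.fromstring(source)

# Parsed form: the reference has become a real line feed character.
parsed_value = element.get("label")
print(repr(parsed_value))

# Text form: this serializer escapes the line feed again on output,
# so it survives being written out and read back in.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```

If a tool writes a raw line break into the attribute at the second
step instead of a character reference, the fault lies with that tool's
serializer and not with the encoding.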

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Sep 18 '06 #7
Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS).
Sep 18 '06 #8
In article <11**********************@b28g2000cwb.googlegroups.com>,
Jean-François Michaud <co*****@comcast.net> wrote:
>We are displaying the attribute before a bit of text

If the character is in an attribute, rather than content, it should be
output as &#xA; or an equivalent reference. This is because an
ordinary linefeed would be normalised to a space character when the
file is read in again.

>It does indeed
>work, but as soon as we save the document, the character gets converted
>to UTF-8 encoding.

Just to be clear about this: linefeed is an ASCII character, and is the
same in UTF-8 as in ASCII.

>We HAVE to use this character because VEX doesn't
>deal with UTF-8 encoding directly to format its output.

I really don't understand this at all. The encoding is not relevant
here. In your input file, you will have &#xA;. A program that reads
(parses) this will have a linefeed character in its data, using
whatever internal encoding it happens to use. UTF-8 only becomes
relevant when you output the file, and as I said a linefeed in an
attribute should be output as &#xA; rather than a linefeed character.
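
Both points can be checked with a few lines of Python. This is only a
sketch: the standard-library parser stands in for whatever VEX uses, and
the "item" and "label" names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A line feed is the single byte 0x0A in both ASCII and UTF-8.
ascii_bytes = "\n".encode("ascii")
utf8_bytes = "\n".encode("utf-8")
print(ascii_bytes == utf8_bytes == b"\x0a")

# If a serializer wrote the line feed literally, re-reading would lose it:
# attribute-value normalization turns it into a space.
reread_literal = ET.fromstring('<item label="Title\n"/>').get("label")

# Written as a character reference, the line feed survives the re-read.
reread_reference = ET.fromstring('<item label="Title&#xA;"/>').get("label")

print(repr(reread_literal), repr(reread_reference))
```

The encoding step changes nothing about the character; what matters is
whether the line feed is written into the attribute as a reference or
as a raw character.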
-- Richard
Sep 18 '06 #9

Joseph Kesselman wrote:
>Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
>not designed for XML processing; XSLT was (and is more powerful than CSS).

I know, that would have been my take also. The technology that we are
using is the VEX XML editor. It allows users to update XML content as
if they were in Word, which is not entirely uninteresting, but CSS is
not advanced enough for this XML + CSS combo to work perfectly when
more demanding formatting is necessary. VEX unfortunately uses CSS to
render the output on display. There is no way around this short of
throwing everything in the garbage altogether, and that's just not
going to happen.

Regards
Jeff

Sep 18 '06 #10
>(parses) this will have a linefeed character in its data [...]
>attribute should be output as &#xA; rather than a linefeed character.

Absolutely. If you're looking at the parsed form of the attribute's
value, you should see the newline character. If you're looking at the
text form, you should see &#xA;. If either is not true, your tools are
broken.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Sep 18 '06 #11
Jean-François Michaud wrote:
>Hello all,
>
>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#xA;) within an attribute that we are using to work
>around CSS limitations.
>
>After the parser has gone through the document, the entity is converted
>to \n, which effectively throws away the behavior we get by keeping the
>entity AS IS within the document.
>
>Is there a clean and easy way around this?
>
>Any help will be greatly appreciated.
>
>Regards
>Jean-Francois Michaud

hi,

Literal [CR], [LF], and [CR/LF] are normalized by XML parsers, but
character references are left as-is (the value you see is the character
that is referred to).

that is to say, if you parse the following document :

<?xml version="1.0"?>
<foo bar="abc&#xA;def
ghi"/>

(with [CR/LF] between "def" and "ghi")
you will get that value :

abc
def ghi

(with a line feed [LF] between "abc" and "def", and a single space
where the [CR/LF] was between "def" and "ghi")
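
The same document can be fed to a parser to check this. The sketch
below uses Python's standard xml.etree.ElementTree, which is an
assumption made for illustration; any conforming parser should apply the
same normalization.

```python
import xml.etree.ElementTree as ET

# The document from above, with a literal CR/LF between "def" and "ghi".
document = '<foo bar="abc&#xA;def\r\nghi"/>'

# The character reference yields a real line feed; the literal CR/LF
# is normalized to a single space in the attribute value.
value = ET.fromstring(document).get("bar")
print(repr(value))
```

Only the line break that was written as a character reference is still
a line break in the parsed value.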

--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
Sep 19 '06 #12
