Preventing the UTF-8 Parser from converting an entity?

Jean-François Michaud

Hello all,

I'm having a little problem. The UTF-8 parser we are using converts the
newline entity (&#10;) within an attribute that we are using to palliate
CSS limitations.

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

Is there a clean and easy way around this?

Any help will be greatly appreciated.

Regards
Jean-Francois Michaud

Sep 18 '06 #1

Bjoern Hoehrmann

* Jean-François Michaud wrote in comp.text.xml:

>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#10;) within an attribute that we are using to palliate
>CSS limitations.

I don't understand your question. First, &#10; is not an entity but a
numeric character reference. Second, processing those is independent of
character encodings like UTF-8. Third, I don't see what CSS limitation
you might be referring to here.

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.

What is "\n" here? What do you mean by "converted"? What do you mean by
keeping it? Processing white-space characters and character references
to them in attribute values is explained in the XML specification. XML
processors keep them to the extent that they are significant. If you
connect the processor to a serializer, the input and output documents
will be canonically equivalent unless one of them has a bug. So there
should be no issue here.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Sep 18 '06 #2

Martin Honnen

Jean-François Michaud wrote:

>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#10;) within an attribute that we are using to palliate
>CSS limitations.

&#10; is neither an entity nor an entity reference, but rather a numeric
character reference.
What is a "UTF-8 parser"?

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.

It is not clear what kind of tool you use and what you finally produce,
but if you want to serialize a DOM or an XSLT result tree to XML markup
and want that newline character to be escaped as &#10;, a numeric
character reference, then you need an XML serializer that does that. If
you want to serialize such a tree to HTML markup then you need an HTML
serializer that does that.

--

Martin Honnen
http://JavaScript.FAQTs.com/

Sep 18 '06 #3

Richard Tobin

In article <11**********************@e3g2000cwe.googlegroups.com>,
Jean-François Michaud <co*****@comcast.net> wrote:

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.

>Is there a clean and easy way around this?

Not using XML. XML applications are effectively required to treat
character references in content the same way that they treat the
characters referred to. A conforming XML parser will convert it in
the way you describe.

If you want to have something that's like a newline but is treated
differently, then a character reference is not the right approach.
That's not what they're for. Using an element such as <nl/> might be
a better solution.

-- Richard

Sep 18 '06 #4
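
Richard Tobin's point that a conforming parser treats a character
reference in content exactly like the character it refers to can be
checked with a small sketch. It assumes Python's standard
xml.etree.ElementTree module purely as a convenient conforming parser;
the element name and text are made up for illustration, and the parser
inside VEX is not involved here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: one uses the numeric character
# reference, the other a literal linefeed in the same position.
with_reference = ET.fromstring("<p>first&#10;second</p>")
with_literal = ET.fromstring("<p>first\nsecond</p>")

# After parsing, the application only sees the character itself;
# nothing records that a reference was used in the source.
print(repr(with_reference.text))
print(with_reference.text == with_literal.text)
```

Both documents yield the same text, which is why the reference cannot
be used to mean "a newline that is treated differently".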

Jean-François Michaud

Richard Tobin wrote:

>In article <11**********************@e3g2000cwe.googlegroups.com>,
>Jean-François Michaud <co*****@comcast.net> wrote:
>
>>After the parser has gone through the document, the entity is converted
>>to \n, which then effectively tosses out the window the behavior we are
>>getting by keeping the entity AS IS within the document.
>
>>Is there a clean and easy way around this?
>
>Not using XML. XML applications are effectively required to treat
>character references in content the same way that they treat the
>characters referred to. A conforming XML parser will convert it in
>the way you describe.
>
>If you want to have something that's like a newline but is treated
>differently, then a character reference is not the right approach.
>That's not what they're for. Using an element such as <nl/> might be
>a better solution.

Understood, but we are using a strange combination of XML + CSS under
the VEX XML editor.

We are displaying the attribute before a bit of text, but because of a
silly CSS limitation (not being able to test for a condition in a
:before pseudo-element), we thought that appending the &#10;
character at the end of the string would do the trick. It does indeed
work, but as soon as we save the document, the character gets converted
to UTF-8 encoding. We HAVE to use this character because VEX doesn't
deal with UTF-8 encoding directly to format its output. Using an <nl/>
element is simply not an option.

Regards
Jean-Francois Michaud

Sep 18 '06 #5

Jean-François Michaud

Bjoern Hoehrmann wrote:

>* Jean-François Michaud wrote in comp.text.xml:
>>I'm having a little problem. The UTF-8 parser we are using converts the
>>newline entity (&#10;) within an attribute that we are using to palliate
>>CSS limitations.
>
>I don't understand your question. First, &#10; is not an entity but a
>numeric character reference. Second, processing those is independent of
>character encodings like UTF-8. Third, I don't see what CSS limitation
>you might be referring to here.

Alright, let me clarify. We allow numeric character references to be
included in our XML document so that special characters can be included
in the output. These numeric sequences get converted to UTF-8 encoding
for proper transformation into yet another XML document, which is then
transformed into PDF using XSLT/XSL-FO. All the way through, the encoding
has to be UTF-8, which is why the numeric sequences have
to be converted to meet this restriction. The problem is that the XML
editor that we use to display the XML content (using XML + CSS) doesn't
use UTF-8 encoded characters when dealing with formatting. It
recognizes the &#10; character, but not the UTF-8 version of it.

The problem all stems from CSS not allowing me to test a
condition while displaying with a :before pseudo-element (I can either
display using :before, or I can test for a condition, but I can't do
both at the same time. Yay for CSS!).

The solution was to append the &#10; character at the end of the string
attribute that we want to display so that the carriage return only
occurs when the string is non-empty. This works splendidly, but as soon
as we save the document, the engine converts everything to UTF-8
encoding (booo!).

[snip]

Regards
Jean-Francois Michaud

Sep 18 '06 #6

Joseph Kesselman

>The solution was to append the &#10; character at the end of the string
>attribute

If you mean inside the attribute value... A properly functioning XML
serializer should recognize line breaks within attribute values as a
special case and escape them as necessary to write them back out,
typically as &#10;.

However, the distinction between &#10;, CR, LF, and CRLF will not be
preserved elsewhere. The only place where XML cares about the difference
between these is in the details of attribute value normalization and
serialization.

And while looking at the parsed version of the data (as output from the
parser but not run back through a serializer), you will always see these
as the newline character.

I'm still not sure from your description which of these applies to your
particular problem. You might want to post a very explicit description
of what your source XML looks like, how you're viewing the result of the
parse, and what you're seeing.

In any case, UTF-8 has nothing to do with any of the above; it's
strictly XML behaviors.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden

Sep 18 '06 #7
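
To illustrate the serialization behavior Joseph Kesselman describes,
here is a minimal sketch using Python's standard
xml.etree.ElementTree serializer. This is an assumption made for
illustration only: the element and attribute names are hypothetical,
and other serializers (including whatever VEX uses) may choose a
different but equivalent reference such as &#xA;.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: a newline at the end of an attribute value,
# and another newline inside ordinary element content.
elem = ET.Element("item", {"label": "Heading\n"})
elem.text = "line one\nline two"

serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# The newline inside the attribute value is written as a character
# reference, while the newline in element content stays a literal.
print("&#10;" in serialized)
print("line one\nline two" in serialized)
```

The attribute is the only place where the escaped form matters,
because a literal line break there would not survive the next parse.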

Joseph Kesselman

Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS).

Sep 18 '06 #8

Richard Tobin

In article <11**********************@b28g2000cwb.googlegroups.com>,
Jean-François Michaud <co*****@comcast.net> wrote:

>We are displaying the attribute before a bit of text

If the character is in an attribute, rather than content, it should be
output as &#10; or an equivalent reference. This is because an
ordinary linefeed would be normalised to a space character when the
file is read in again.

>It does indeed
>work, but as soon as we save the document, the character gets converted
>to UTF-8 encoding.

Just to be clear about this: linefeed is an ASCII character, and is the
same in UTF-8 as in ASCII.

>We HAVE to use this character because VEX doesn't
>deal with UTF-8 encoding directly to format its output.

I really don't understand this at all. The encoding is not relevant
here. In your input file, you will have &#10;. A program that reads
(parses) this will have a linefeed character in its data, using
whatever internal encoding it happens to use. UTF-8 only becomes
relevant when you output the file, and as I said a linefeed in an
attribute should be output as &#10; rather than a linefeed character.
-- Richard

Sep 18 '06 #9
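
Richard Tobin's remark that the encoding is not relevant can be
sketched as follows. The example assumes Python's standard
xml.etree.ElementTree as the parser and a hypothetical one-element
document; it only shows that the same source text, stored in two
different encodings, parses to the same attribute value.

```python
import xml.etree.ElementTree as ET

# A linefeed is the single byte 0x0A in both ASCII and UTF-8.
assert "\n".encode("ascii") == "\n".encode("utf-8") == b"\x0a"

# Hypothetical document body, stored once as UTF-8 and once as ISO-8859-1.
body = '<item label="Heading&#10;"/>'
as_utf8 = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
as_latin1 = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

label_utf8 = ET.fromstring(as_utf8).get("label")
label_latin1 = ET.fromstring(as_latin1).get("label")

# Both parses deliver the same value: "Heading" followed by a linefeed.
print(repr(label_utf8), repr(label_latin1))
```

The character reference is resolved by the XML layer, after decoding,
so the choice of encoding does not change what the application sees.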

Jean-François Michaud

Joseph Kesselman wrote:

>Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
>not designed for XML processing; XSLT was (and is more powerful than CSS).

I know, that would have been my take also. The technology that we are
using is the VEX XML editor. It allows users to update XML content as
if they were in Word, which is not entirely uninteresting, but CSS is
not advanced enough for this XML + CSS combo to work perfectly when
more demanding formatting is necessary. VEX unfortunately uses CSS to
render the output on display. No way around this short of throwing
everything in the garbage altogether, and that's just not gonna happen.

Regards
Jeff

Sep 18 '06 #10

Joseph Kesselman

>(parses) this will have a linefeed character in its data [...]

>attribute should be output as &#10; rather than a linefeed character.

Absolutely. If you're looking at the parsed form of the attribute's
value, you should see the newline character. If you're looking at the
text form, you should see &#10;. If either is not true, your tools are
broken.

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden

Sep 18 '06 #11
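
The two views described above (parsed form versus text form) can be
compared in a short round-trip sketch. It assumes Python's standard
xml.etree.ElementTree for both parsing and serializing, with a
hypothetical element and attribute name; the last step simulates a
broken serializer that writes the line break literally.

```python
import xml.etree.ElementTree as ET

source = '<item label="Heading&#10;"/>'   # hypothetical input document

parsed = ET.fromstring(source)
in_memory = parsed.get("label")                       # parsed form
text_form = ET.tostring(parsed, encoding="unicode")   # text form
round_tripped = ET.fromstring(text_form).get("label")

print(repr(in_memory))    # a real newline character
print(text_form)          # the newline written back as a reference
print(round_tripped == in_memory)

# What a broken tool would cause: a literal linefeed in the attribute
# is normalized to a space the next time the file is parsed.
damaged = ET.fromstring('<item label="Heading\n"/>').get("label")
print(repr(damaged))
```

If the saved file contains a literal line break inside the attribute
instead of a reference, the serializer is the component to look at.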

Philippe Poulard

Jean-François Michaud wrote:

>Hello all,
>
>I'm having a little problem. The UTF-8 parser we are using converts the
>newline entity (&#10;) within an attribute that we are using to palliate
>CSS limitations.
>
>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.
>
>Is there a clean and easy way around this?
>
>Any help will be greatly appreciated.
>
>Regards
>Jean-Francois Michaud

hi,

[CR], [LF], [CR/LF] are normalized by XML parsers, but character
references are left as-is (the value you see is the character that is
referred to)

that is to say, if you parse the following document :

<?xml version="1.0"?>
<foo bar="abc&#10;def
ghi"/>

(with [CR/LF] between "def" and "ghi")
you will get that value :

abc
def ghi

(with [CR/LF] between "abc" and "def")

--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !

Sep 19 '06 #12
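
Philippe Poulard's example can be reproduced with a minimal sketch,
again assuming Python's standard xml.etree.ElementTree as the parser.
The document string below is a reconstruction of his example, with a
character reference between "abc" and "def" and a literal CR/LF
between "def" and "ghi".

```python
import xml.etree.ElementTree as ET

# Character reference after "abc", literal CR/LF after "def".
document = '<?xml version="1.0"?>\n<foo bar="abc&#10;def\r\nghi"/>'

value = ET.fromstring(document).get("bar")
print(repr(value))

# The referenced linefeed survives as a real newline, while the
# literal line break is normalized to a single space.
print(value.split("\n"))
```

This is the distinction the original poster relies on: only the
line break written as a character reference reaches the application
as a newline.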
