xsl and unicode surrogate characters

Hi

In one of the data files that I have, I am seeing these characters:
\xed\xa0\xa0. They seem to break the XSL.

---------------------------------------------------------------
Extra content at the end of the document
XML/XSL Error: </data><data ><![CDATA[ Pls advice
----------------------------------------------------------------
This seems to break libxml2/libxslt.

Is this a Unicode UTF-16 surrogate pair?
For displaying it in XML/XSL, should I extract only \xa0?
Since this is higher than the 00-7f range, can I just strip it?
Under what conditions would the encoding software put this string in?
Thanks for the help,

Jan 5 '06 #1
Sakcee wrote:
Hi

In one of the data files that I have, I am seeing these characters:
\xed\xa0\xa0. They seem to break the XSL. [...] Is this a Unicode UTF-16 surrogate pair?

Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes. It's not yet a pair; there would have to
be a second such code point. So no.
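
The arithmetic behind "U+D820" can be checked by hand. The following is only a minimal sketch, assuming Python 3; the helper names `decode_three_byte`, `is_high_surrogate` and `is_low_surrogate` are made up for illustration and are not part of any library. It unpacks the payload bits of the three bytes without doing any validity checking.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte sequence of the form 1110xxxx 10yyyyyy 10zzzzzz.

    No validity checks are done; this only extracts the payload bits.
    """
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_high_surrogate(cp: int) -> bool:
    # High (leading) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    # Low (trailing) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= cp <= 0xDFFF


cp = decode_three_byte(b"\xed\xa0\xa0")
print(f"U+{cp:04X}")            # U+D820
print(is_high_surrogate(cp))    # True
print(is_low_surrogate(cp))     # False
```

So the bytes do carry a high surrogate, but only one half of a pair.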

Furthermore, in UTF-8, you should never ever have encoded surrogate
code points; instead, whoever generated the UTF-8 should have combined
the two surrogate code points into a single coded character, and should
have encoded *that* character. So no - this byte sequence isn't
even valid UTF-8.
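
To make "combine the two surrogate code points" concrete, here is a rough sketch, assuming Python 3. The function name `combine_surrogates` is hypothetical, and the low surrogate U+DC00 is an arbitrary value picked for illustration: the second half of the pair never appears in the thread, so the resulting character is not a guess at what the data originally contained.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+DC00 is an assumed, illustrative low surrogate; the real one is unknown.
cp = combine_surrogates(0xD820, 0xDC00)
encoded = chr(cp).encode("utf-8")

print(f"U+{cp:04X}")       # U+18000
print(encoded.hex(" "))    # f0 98 80 80
```

A correct encoder would emit one four-byte sequence like this for the combined character, rather than two three-byte sequences starting with \xed.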
For displaying it in XML/XSL, should I extract only \xa0?
You should tell your parser to reject the file as ill-formed.
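
How to configure libxml2 for this is not shown in the thread. As a stand-in, the sketch below uses Python 3's built-in strict UTF-8 decoder as a pre-check before the data ever reaches the XML parser; the function name `is_well_formed_utf8` is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8(b"<data>plain ascii</data>"))     # True
print(is_well_formed_utf8(b"<data>\xed\xa0\xa0</data>"))    # False
```

Python 3's decoder treats an encoded surrogate as an error, so a file containing \xed\xa0\xa0 fails this check and can be rejected up front.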
Since this is higher than the 00-7f range, can I just strip it?
Depending on what you want to achieve: sure! It will modify
the meaning of the bytes, of course.
Under what conditions would the encoding software put this string in?

If it has a bug.

Regards,
Martin
Jan 5 '06 #2
Thanks very much for the info, it really helped.

We are using the text from the file to display on a web page, and we
have a method that converts the parsed data to UTF-8 before displaying
it. All the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?
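
One caution about that pattern: inside a character class `|` is a literal character, and the surrounding spaces are part of the pattern, so it would only match a single \xed, \xa0 or `|` with a space on each side. If stripping really is acceptable, a narrower byte-level pattern might look like the sketch below (Python 3 assumed; `ENCODED_SURROGATE` and `strip_encoded_surrogates` are illustrative names). It removes only three-byte sequences that encode a surrogate code point and leaves every other byte alone.

```python
import re

# An encoded surrogate (U+D800..U+DFFF) has lead byte ED, a second byte in
# A0..BF, and a third byte in 80..BF. Legitimate characters up to U+D7FF
# use ED followed by 80..9F, so they are not matched.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def strip_encoded_surrogates(data: bytes) -> bytes:
    """Remove encoded surrogate code points from a byte string."""
    return ENCODED_SURROGATE.sub(b"", data)


sample = b"abc\xed\xa0\xa0def"
print(strip_encoded_surrogates(sample))                     # b'abcdef'
print(strip_encoded_surrogates("\ud7ff".encode("utf-8")))   # b'\xed\x9f\xbf'
```

This still throws away whatever character the producer meant to write, so it hides the upstream bug rather than fixing it.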


Jan 5 '06 #3
Sakcee wrote:
Thanks very much for the info, it really helped.

We are using the text from the file to display on a web page, and we
have a method that converts the parsed data to UTF-8 before displaying
it. All the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?


As Martin said: that alters the meaning of the bytes. Whether that should
bother you or not is yours to decide. If, for example, you stripped all
vowels from a text, it might still be comprehensible to most people, so if
vowels bother you for whatever reason, remove them.

Bt myb y bttr try nd fx th prblm n th frst plc.

Regards,

Diez
Jan 5 '06 #4
