Transforming XML containing Asian characters?

I have an XML file containing localized strings in 9 languages, encoded
in Unicode (UTF-8). I'm trying to transform this XML document via XSLT
(Apache Xalan) to selectively render localized strings depending on a
user's selected language.

The problem I'm running into is that when the XML document is sent
through the XSLT stylesheet, all European special characters (such as
umlauts, accents, etc.) are converted to HTML character entities (the
expected behavior), but the Asian characters are shown as question
marks in the page source. It seems as if the XSLT engine does not know
how to convert the Asian Unicode strings to usable character entities
(or it is incorrectly converting them to something that the browser
cannot understand).

As an example, below is the output I get using the simplest of XSLT to
take the localized XML and output it to a UTF-8 encoded HTML file. Note
the two lines of question marks, which appear correctly as Japanese and
Chinese in the XML source.

French
Français
Französisch
Français
Frans
Francese
???????????????
??????
Francés

I've been banging my head against my desk for a while now. Any ideas
gratefully accepted!
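
A possible explanation for the symptom, offered as a sketch and not as a diagnosis: question marks of this kind typically appear when text is encoded into a charset that cannot represent it. If some writer in the output chain were using ISO-8859-1 (an assumption; the thread never confirms it), accented European letters would survive while Japanese and Chinese would not. The class name and sample strings below are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    // Encode text to bytes in the given charset and decode it again, the way
    // a writer configured with that charset would round-trip it.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for ISO-8859-1.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String french = "Fran\u00E7ais";
        String japanese = "\u30D5\u30E9\u30F3\u30B9\u8A9E"; // sample Japanese text

        // Accented Latin letters fit in ISO-8859-1, so they survive.
        System.out.println(roundTrip(french, StandardCharsets.ISO_8859_1).equals(french));
        // Japanese does not fit, so every character becomes '?'.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent everything, so nothing is lost.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese));
    }
}
```

If this is what is happening, the fix lies in the encoding of the stream or writer that receives the transformer output, not in the stylesheet itself.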

Jul 20 '05 #1
For some reason when I set the xsl:output encoding attribute to
"UTF-8", Xalan does not convert Asian symbols into numeric entity
codes. It keeps the symbols in Unicode. But it does convert European
double byte characters to numeric entities. Is there any way around
this without changing the xsl:output encoding? I need it to convert all
double byte characters (including Chinese, Japanese and European) into
entity codes, while still maintaining a UTF-8 output encoding.

Is this possible?

Jul 20 '05 #2
mi**********@yahoo.com writes:
> For some reason when I set the xsl:output encoding attribute to
> "UTF-8", Xalan does not convert Asian symbols into numeric entity
> codes. It keeps the symbols in Unicode. But it does convert European
> double byte characters to numeric entities. Is there any way around
> this without changing the xsl:output encoding? I need it to convert all
> double byte characters (including Chinese, Japanese and European) into
> entity codes, while still maintaining a UTF-8 output encoding.
>
> Is this possible?


Specify US-ASCII as the output encoding. The end result is then as you
wish: the file itself will be ASCII encoded (which is the same as UTF-8
encoding for ASCII characters) and all non-ASCII characters will be
written as numeric character references.
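
To illustrate the suggestion, here is a minimal sketch using the standard JAXP API (`javax.xml.transform`) with whatever XSLT processor the JDK provides, which is not necessarily the Xalan build used in this thread. It runs an identity transform so that only the serializer is involved, and compares the output for two encodings. The class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class AsciiOutputDemo {
    // Serialize an XML string with the requested output encoding. No
    // stylesheet is used, so the result shows the serializer's behaviour only.
    static String serialize(String xml, String encoding) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)),
                               new StreamResult(bytes));
            return new String(bytes.toByteArray(), Charset.forName(encoding));
        } catch (TransformerException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // True when every character is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        String xml = "<p>Fran\u00E7ais \u65E5\u672C\u8A9E</p>";
        String ascii = serialize(xml, "US-ASCII");
        System.out.println(ascii);                            // non-ASCII appears as &#...; references
        System.out.println(isAscii(ascii));                   // true
        System.out.println(isAscii(serialize(xml, "UTF-8"))); // false: characters are written as-is
    }
}
```

The same effect can be requested in the stylesheet with `<xsl:output encoding="US-ASCII"/>`.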

David
Jul 20 '05 #3
That works, but then if I have any Unicode characters on the page, they
show up as garbage (since they are not in the US-ASCII range). You are
probably wondering why there would be any Unicode characters on the
page if they are all converted into entities during the XSLT
conversion.

This is because not all the content on the page is output by the XSLT
template. We have a JSP page containing a Java bean, which in turn
calls the Xalan engine and writes the results to the page, but that is
only a portion of the larger page, which also contains Chinese/Japanese
Unicode characters.

The output encoding set on the XSL template (US-ASCII) then overrides
the encoding of the JSP page (UTF-8), and although we see the correct
conversions within the XSLT output "zone", all the Unicode characters
outside this zone are no longer readable.

What I really need to know is, is there a way to keep the UTF-8 output
encoding and have Xalan still convert Chinese/Japanese characters into
entities? It seems like it should do this by default, as these symbols
are just another character in the Unicode range.

Thanks,

Mike

Jul 20 '05 #4
mi**********@yahoo.com writes:
> That works, but then if I have any Unicode characters on the page, it
> shows up as garble (since these are not in the US-ASCII range). You
> probably are wondering why would there be any unicode characters on the
> page if they are all converted into entities during the XSLT
> conversion.

Character references (which are not entity references, but yes).

> This is because not all the content on the page is being output by the
> XSLT template (we have a JSP page containing a java bean which in turn
> calls the Xalan engine and outputs the results to the page) but it is
> only a portion of the larger page, which contains Chinese/Japanese
> Unicode characters.

So in that case, what's the problem with the XSLT-derived portions being
in UTF-8?


> The output type set on the XSL template (US-ASCII) then overrides the
> encoding of the jsp page (UTF-8) and although we see the correct
> conversions within the XSLT output "zone", all the Unicode characters
> outside this zone are no longer readable.
>
> What I really need to know is, is there a way to keep the UTF-8 output
> encoding and have Xalan still convert Chinese/Japanese characters into
> entities?

I'm not sure about Xalan; Saxon has extension attributes on xsl:output
that can control this.

In XSLT 1 I have in the past specified US-ASCII (to get non-ASCII
characters as references) and then just removed the encoding declaration
from the result in a post-process (with sed/perl/whatever). That way the
files will be parsed as UTF-8 and UTF-8 characters will work.
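
The post-processing step could be done with sed or Perl as described; purely as an illustration, here is the same idea sketched in Java. It assumes the declaration in question is the `encoding="..."` pseudo-attribute of a leading XML declaration (the post does not say which form it took), and the class name is invented for the example.

```java
import java.util.regex.Pattern;

final class EncodingDeclStripper {
    // Matches the encoding pseudo-attribute inside an XML declaration at the
    // very start of the text, capturing everything before it in group 1.
    private static final Pattern ENCODING_DECL = Pattern.compile(
        "^(<\\?xml[^>]*?)\\s+encoding\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    // Remove the encoding pseudo-attribute, leaving the rest untouched.
    // Text without a leading XML declaration is returned unchanged.
    static String stripEncoding(String xml) {
        return ENCODING_DECL.matcher(xml).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String in = "<?xml version=\"1.0\" encoding=\"US-ASCII\"?><p>&#26085;&#26412;</p>";
        System.out.println(stripEncoding(in));
        // prints: <?xml version="1.0"?><p>&#26085;&#26412;</p>
    }
}
```

With no encoding declaration left, an XML parser falls back to UTF-8, and since the ASCII bytes are also valid UTF-8 nothing else needs to change.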

In XSLT 2 (e.g. Saxon 8) you will be able to specify US-ASCII and also
specify that no XML declaration is output; this is explicitly for use
cases such as yours, where the output from XSLT needs to be merged with
other things.
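
As a rough sketch of that combination through JAXP: the stylesheet below requests US-ASCII output with `omit-xml-declaration="yes"`, so the result is a bare, pure-ASCII fragment. The `strings`/`s` element names and the class name are hypothetical, chosen only for the example, and the code runs on the JDK's bundled XSLT processor, not specifically Xalan or Saxon.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class FragmentDemo {
    // Hypothetical stylesheet: turns <strings><s>..</s></strings> into a list.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' encoding='US-ASCII' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/strings'>"
      + "<ul><xsl:for-each select='s'><li><xsl:value-of select='.'/></li></xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transform the input and return the serialized fragment as a string.
    static String renderFragment(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(bytes));
            return new String(bytes.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<strings><s>Fran\u00E7ais</s><s>\u65E5\u672C\u8A9E</s></strings>";
        // The fragment has no XML declaration and contains only ASCII, with
        // non-ASCII characters written as numeric character references.
        System.out.println(renderFragment(xml));
    }
}
```

Because the fragment is plain ASCII and carries no encoding declaration of its own, it can be written into a page that is otherwise served as UTF-8 without affecting the rest of that page.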

> It seems like it should do this by default, as these symbols
> are just another character in the Unicode range.

All XML characters fit this description. If you specify an encoding that
includes the characters (and UTF-8 includes them all) then the normal
behaviour is that the characters are output as character data. (You
indicate that accented letters are being output as numeric references,
which would be surprising, but conformant, behaviour.)

Thanks,

Mike

Jul 20 '05 #5
