xml, character encoding, asp question

Hi...

I've been doing a lot of work both creating and consuming web services, and
I notice there seems to be a discontinuity between a number of the different
cogs in the wheel centering around windows-1252 and that it is not equivalent
to iso-8859-1.

Looking in the registry under HKEY_CLASSES_ROOT\MIME\Database\Charset and
\Codepage, it seems that all variations on iso-8859-1 (latin1, etc) are
mapped to code page 1252, which I'm assuming is windows-1252 in execution
terms. So if I set the codepage=1252 and Response.Charset=iso-8859-1 in ASP,
it seems that I'm *really* going to get out windows-1252, not iso-8859-1.
This becomes somewhat noticeable in html since a lot of commonly used elements
(like the free-floating bullet •), which *aren't* really 8859-1, get
interpreted as such in browsers.

I occasionally run into problems, however, because MSXML doesn't appear to
be using the mime database to determine how to process the encoding
declaration (or at least it's got some different mapping hidden somewhere).
MSXML appears to treat the range 128-159 the way the ansi standard defines
them - undefined control sequences. As such, when you're processing xml
(either xml to xml or xml to html via xsl), if you get what is *intended* to
be a bullet (149) or curly quotes or any of those other extensions that are
really windows-1252 in your xml, msxml won't make the association and
translate the characters properly going between character sets. And
unfortunately a lot of web services don't accept or generate "windows-1252"
as an encoding declaration.

So...
1) Am I correct in assuming that MSXML is using different encoding routines
than IIS/ASP?

2) Is there a @Codepage I can specify that will produce real latin 1 in asp?

3) Will ASP.Net be more standards compliant? and/or does ASP.Net use the
mime database under the covers too?

4) just as an aside anybody have a clue why when output via xsl for
encoding utf-8 doesn't display properly in IE?

Thanks
-Mark

Jul 22 '05 #1
Hello Mark,

MSXML has two methods to load XML: the LoadXML method and the Load method.

The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or
UTF-16 only. If you pass in anything other than a valid Unicode BSTR to
LoadXML, it will fail to load.

The Load method implements the following algorithm for determining the
character encoding or character set of the XML document:

1. If the Content-Type HTTP header defines a character set, this character
set overrides anything in the XML document itself. This obviously doesn't
apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
2. If there is a 2-byte Unicode byte-order mark, it assumes the encoding is
UTF-16. It can handle both big endian and little endian.
3. If there is a 4-byte Unicode byte-order mark (0xFF 0xFE 0x00 0x00 for
little endian, 0x00 0x00 0xFE 0xFF for big endian), it assumes the encoding
is UTF-32. It can handle both big endian and little endian.
4. Otherwise, it assumes the encoding is UTF-8 unless it finds an XML
declaration with an encoding attribute that specifies some other character
set (such as ISO-8859-1, Windows-1252, Shift-JIS, and so on).
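
The detection order in the list above can be sketched roughly as follows. This is only an illustration in Python, not MSXML's actual code: the function name `sniff_xml_encoding` and its `http_charset` parameter are made up for the example, and it uses the byte-order marks defined by the Unicode standard (FF FE 00 00 or 00 00 FE FF for UTF-32, FF FE or FE FF for UTF-16).

```python
import re
from typing import Optional


def sniff_xml_encoding(data: bytes, http_charset: Optional[str] = None) -> str:
    """Guess the encoding of an XML byte stream (hypothetical helper)."""
    # Step 1: a charset from the HTTP Content-Type header overrides the document.
    if http_charset:
        return http_charset.lower()
    # Step 3 is tested before step 2 because the little-endian UTF-32 mark
    # begins with the same two bytes as the little-endian UTF-16 mark.
    if data.startswith(b"\xff\xfe\x00\x00") or data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32"
    # Step 2: a 2-byte byte-order mark means UTF-16, either endianness.
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "utf-16"
    # Step 4: otherwise use the XML declaration if it names an encoding,
    # and fall back to UTF-8 if it does not.
    match = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```

Note that this only decides which label applies. It says nothing about which byte-to-character table is then used for that label, which is the part the rest of the thread is about.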

"Windows-1252" should be right thing to produce latin 1. ASP.NET also has
a codepage property similar to ASP's; however, characters will be
Unicode in its code-behind.

Luke

Jul 22 '05 #2
Hi Luke...

Thanks for responding, but the response is a little too narrow to address
any of the questions I asked. We're using the Load() method to load the
response from web services, so the detection of the encoding is not the
issue. The issue is that the mappings between character sets that MSXML uses
don't appear to be the same as other apis available to ASP (like
Server.HTMLEncode() and Server.URLEncode()) and other C++ apis (like
WideCharToMultiByte() and MultiByteToWideChar()).

Near as I can tell, everything other than MSXML doing encoding conversion
seems to be working from the HKEY_CLASSES_ROOT\MIME\Database\Charset &
CodePage system. Also near as I can tell, that system doesn't differentiate
between windows-1252 and iso-8859-1, even though they are *not* equivalent
(1252 is a superset of 8859-1). I probably wouldn't be running into as many
annoying inconsistencies if MSXML was standards-noncompliant in the same way,
but MSXML *does* recognize the difference between windows-1252 and iso-8859-1
and does process/output things differently. And since many of the web
services we consume come from other vendors, we don't have the option of just
telling them to use "windows-1252" instead of "iso-8859-1" in their xml
encoding headers.

First, I'm looking for ways to get MSXML and ASP to work together
consistently, if possible. If not, at least try to define what to avoid.
It's also of parenthetical interest whether ASP.Net has fixed any of these
inconsistencies; I haven't done trial cases myself to test it yet.

Take the small bullet as a good example. Putting &#149; in your html gets you a
small bullet in IE, though this is only a legitimate interpretation if your
encoding is windows-1252 - not iso-8859-1 or any other non-windows-12*
encoding. 149 is a legal character in unicode, just not the bullet character.
In unicode the bullet character is 8226. If I have a literal 149 character
in an xml document with a declared encoding of windows-1252, MSXML will
interpret that up to 8226 as part of the character set mapping when the xml
is parsed; how it gets represented when I spiel it out via xsl or
Response.Write depends on the output encoding I use.

If that same xml document, however, has a declared encoding of iso-8859-1,
MSXML doesn't map the 149 to anything at all - it doesn't recognize that it
has any particular meaning. So if my xsl stylesheet applied to that dom
outputs utf8, what comes out is a two byte representation of 149 - c2 95. IE
doesn't recognize those characters as meaning anything in particular and what
it displays is garbage. Hence the reason for my posting.
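
The two cases above can be reproduced outside MSXML. The following is a small Python sketch (Python's `cp1252` and `latin-1` codecs stand in for the two declared encodings, which is an assumption about equivalence, not a statement about MSXML internals) that decodes the single byte 149 both ways and then re-encodes the result as utf-8.

```python
raw = bytes([149])  # the single byte 0x95

as_cp1252 = raw.decode("cp1252")   # windows-1252: byte 149 is the bullet
as_latin1 = raw.decode("latin-1")  # iso-8859-1: byte 149 is a C1 control character

# Code point, then the utf-8 bytes that would be written on output.
print(ord(as_cp1252), as_cp1252.encode("utf-8").hex())  # 8226 e280a2
print(ord(as_latin1), as_latin1.encode("utf-8").hex())  # 149 c295
```

The second line of output is the c2 95 pair described above: a correctly encoded control character, which a browser has no glyph for.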

Ironically, there are some web services out there which have the same
misunderstanding of the difference between windows-1252 and iso-8859-1 that
you do. They generate xml with an encoding of "iso-8859-1" when they are
including 1252 characters between 128-159. It's frustrating that while MSXML
is more standards compliant in recognizing the difference, that standards
compliance causes garbage to come out the back end of the meat grinder.
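
One possible workaround for mislabeled input, sketched here in Python purely as an illustration, is to post-process the parsed text and reinterpret any character in the 128-159 range as its windows-1252 meaning. The helper name `repair_c1_controls` is hypothetical, and the approach assumes the sender really meant windows-1252, which cannot be verified from the data alone.

```python
def repair_c1_controls(text: str) -> str:
    """Reinterpret stray C1 controls (U+0080-U+009F) as windows-1252 characters."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
                # Python's cp1252 codec, so leave those untouched.
                pass
        out.append(ch)
    return "".join(out)


# Text as it would look after a strict iso-8859-1 decode of 1252 bytes.
parsed = "\x95 first item, \x93quoted\x94"
print(repair_c1_controls(parsed))
```

The same idea could be applied to text nodes after loading a document, before it is transformed or written out as utf-8.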

Thanks
Mark

Jul 22 '05 #3
Hi Mark,

I think we can specify the encoding in xsl, for example:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="iso-8859-1" />
  <xsl:template match="Books">
    <!-- template body not included in the original post -->
  </xsl:template>
</xsl:stylesheet>

I tested the above code in IE and it displays char 149 correctly.

Luke

Jul 22 '05 #4
I can't help you much here Mark, but I can sympathise. We're going to be
hitting this problem ourselves soon so I'm especially interested in this
thread.

I know all too well that 'Windows Latin-1' (code page 1252) is *not* the same
as the ISO latin-1 set (iso 8859/1). There are some subtle differences where
MS have tried to make better use of some of the lesser-used parts of the ISO
set.

Tony Proctor

Jul 22 '05 #5
Hi Luke...

Again, thanks for responding. We're getting closer to an understanding of
the problem, but not yet any resolution.

Yes, you can change the output encoding designation in xsl, and yes you can
use "iso-8859-1" and it will output a literal 149 and yes IE will display it
- usually. But this only delivers us to the doorstep of understanding the
inconsistencies that make this difficult to work with in ASP.

If you want to have any good support for internationalization on your
website, you really can't use windows-1252 OR iso-8859-1 (same thing as far
as ASP goes) as your ASP page's code page, because the output encoding from
IIS (or the encoding IE receives, depending on how you look at it) will
influence how IE encodes form elements for resubmission.

The big problem is that an IE page with 1252 encoding lets you copy/paste,
say, Chinese into the form element and it looks good in the form element, but
IE does a terrible job encoding those inputs on a url. It uses a
non-standard encoding format to construct the url, and the tools in ASP for
interpreting it are marginal.

To get really *good* support for url encoding from IE (or other browsers),
you have to set your page encoding to utf-8. If you do that, IE will use
utf-8 to stream international user input in the url encoding, and it does it
in a standard way.
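
To make the difference concrete, here is a small Python sketch of how one Chinese character might end up in a url under each page encoding. The utf-8 form is standard percent-encoding. The windows-1252 form is an assumption: it shows a numeric-character-reference fallback, which is one behaviour browsers are known to use for characters the page encoding cannot represent, and the thread does not say exactly which format IE produced.

```python
from urllib.parse import quote

text = "\u4e2d"  # the character 中, which windows-1252 cannot represent

# Page encoding utf-8: the character's utf-8 bytes are percent-encoded.
utf8_form = quote(text, encoding="utf-8")

# Page encoding windows-1252 (assumed fallback): the character is first
# replaced by the reference &#20013; and that string is percent-encoded.
ncr_form = quote(text, encoding="cp1252", errors="xmlcharrefreplace")

print(utf8_form)  # %E4%B8%AD
print(ncr_form)   # %26%2320013%3B
```

With the second form the server cannot tell whether the user typed the character or literally typed the text &#20013;, which is part of why decoding it reliably is hard.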

But if you use utf-8 encoding and you're working with xml in your asp page,
then the *real* difference between windows-1252 and iso-8859-1 *does* become
a problem. Because, as I've been saying, MSXML is standards-compliant and
does recognize the difference while the rest of ASP is *not* standards
compliant in how it handles the two.

So these inconsistencies really put a web developer in a bind. Which
feature do you want to drop - internationalization? Use of web services?
Use of xml? Or do you just have to bend over backward as a developer trying
to develop all of your own tools to work around the fact that the MS tools
for this are inconsistent? Seems like the last one to me, but I thought I
would ask to see if these sorts of things were on the MS radar screen.

Thanks
-mark

Jul 22 '05 #6
Hi Mark,

I understand your complaint on this issue. It is really tough to take care
of all this stuff. The best thing I can suggest is to migrate to
ASP.NET. It has better support for internationalization and web services.
You can handle the web service with the XML classes in .NET, convert it to utf8
and send the result to the client side.

Luke

Jul 22 '05 #7
Re: question (2) Mark, I've found a reference to a code page that I didn't
know existed: 28591. This is supposed to be exactly equivalent to ISO 8859/1.

If this works (I haven't tried it) then it won't solve all problems though.
The Euro symbol, for instance, is a very important character in Windows
Latin-1, but it isn't present in the ISO Latin-1. I believe ISO copes with it
using a newer ISO 8859/15 (Latin-9). The code page equivalent for this,
apparently, is 28605.
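
The Euro point is easy to check. In this Python sketch the codec names `cp1252`, `latin-1` and `iso-8859-15` stand in for the Windows code pages being discussed; that they behave identically to what ASP would produce is an assumption, not something tested here.

```python
euro = "\u20ac"  # the Euro sign

print(euro.encode("cp1252"))       # b'\x80'  (Windows Latin-1)
print(euro.encode("iso-8859-15"))  # b'\xa4'  (ISO Latin-9)

# ISO Latin-1 has no Euro sign at all, so encoding fails.
try:
    euro.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)  # False
```

So switching a page to true ISO Latin-1 trades the 128-159 ambiguity for the loss of the Euro sign and the other windows-1252 extras.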

Tony Proctor

Jul 22 '05 #8
