Encode() behaves differently with different charsets?

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript encode() function
behaves differently if a codepage has been set.

For example:
<script>
document.write(escape('Ôèëìè'));
(note: that should be five accented characters)
</script>

Produces: %D4%E8%EB%EC%E8

But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write(escape('Ôèëìè'));
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Personally, I wouldn't expect the Javascript encode() function to
change its behavior if the codepage has been changed.

Might you know of any resources that can help me better understand
what's happening there?

Many thanks!
Scott
Jul 20 '05 #1
sc***@turnstyle.com (Scott Matthews) writes:

> I've recently come upon an odd Javascript (and/or browser) behavior,
> and after hunting around the Web I still can't seem to find an answer.
>
> Specifically, I have noticed that the Javascript encode() function
> behaves differently if a codepage has been set.
>
> For example:
> <script>
> document.write(escape('Ôèëìè'));
> (note: that should be five accented characters)

It is five accented characters, because your message is encoded as
ISO-8859-1, and, e.g., the first character (byte value 212) is
O-circumflex in ISO-8859-1. It also has Unicode code point 212,
since Unicode agrees with ISO-8859-1 on values below 256.

> </script>
>
> Produces: %D4%E8%EB%EC%E8

Where D4 is 212 in hex, so as expected.

> But setting the codepage to Windows-1251:
>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=Windows-1251">
> <script>
> document.write(escape('Ôèëìè'));

Now, this *script* is interpreted as Windows-1251 characters, including
the literal string. The first character of that string is the byte 212,
which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
uses Unicode for strings, the first character of the string value becomes
Cyrillic EF, which has Unicode code point 1060.

> </script>
>
> Produces: %u0424%u0438%u043B%u043C%u0438

Here 0424 is hex for 1060, as expected.
(can be checked using 'parseInt("0424",16)')

> Personally, I wouldn't expect the Javascript encode() function to
> change its behavior if the codepage has been changed.

It doesn't. What changes is the interpretation of the string literal.
Try changing the write to
document.write('Ô'.charCodeAt(0));
or even better
document.write('Ôèëìè');
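
A minimal sketch of that check, assuming the characters are written with \u escapes so that the result does not depend on how the file itself is decoded. The variable names are only illustrative.

```javascript
// Write the characters with \u escapes so the charset of the source file
// cannot change what the literals mean.
const latinO = '\u00D4';     // O-circumflex: what byte 212 means in ISO-8859-1
const cyrillicEf = '\u0424'; // Cyrillic capital EF: what byte 212 means in Windows-1251

const latinCode = latinO.charCodeAt(0);
const cyrillicCode = cyrillicEf.charCodeAt(0);

console.log(latinCode, latinCode.toString(16));       // 212 d4
console.log(cyrillicCode, cyrillicCode.toString(16)); // 1060 424
console.log(parseInt('0424', 16));                    // 1060
```

The two strings already differ before escape() is ever called, which is why its output differs.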

> Might you know of any resources that can help me better understand
> what's happening there?


No resources, sorry. But remember that when you assign an encoding
that is different from the one used by your editor, you can't trust
the characters you see. WYSI-not-WYG!

You should learn what a codepage really does. A codepage represents a
set of (up to) 256 different characters (or code points), like capital
Roman letter A, Arabic numeral 4, Roman letter O with circumflex accent,
Cyrillic capital EF, or Chinese glyph whatnot. Those are the only
characters that can be written using that codepage. It also defines a
map from 8-bit bytes to those characters. Different code pages can
assign different code points to the same byte, as ISO-8859-1 and
Windows-1251 do to the byte 212.
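
The byte-to-character map can be sketched like this. It is a hand-written illustration, not a full decoder: decodeLatin1 and decodeCyrillicRange are made-up names, and the Windows-1251 function covers only ASCII and the range 0xC0-0xFF, where the basic Cyrillic letters U+0410-U+044F sit in order.

```javascript
// The five bytes from the example, as they sit in the file.
const bytes = [0xD4, 0xE8, 0xEB, 0xEC, 0xE8];

// ISO-8859-1: every byte maps to the Unicode code point with the same value.
function decodeLatin1(byteValues) {
  return byteValues.map((b) => String.fromCharCode(b)).join('');
}

// Windows-1251, partial: ASCII is shared, and bytes 0xC0-0xFF map to
// U+0410-U+044F. Bytes 0x80-0xBF are deliberately not handled in this sketch.
function decodeCyrillicRange(byteValues) {
  return byteValues.map((b) => {
    if (b < 0x80) return String.fromCharCode(b);
    if (b >= 0xC0) return String.fromCharCode(0x0410 + (b - 0xC0));
    throw new RangeError('byte not covered by this sketch: ' + b);
  }).join('');
}

const asLatin1 = decodeLatin1(bytes);           // five accented Latin letters
const asCyrillic = decodeCyrillicRange(bytes);  // five Cyrillic letters

console.log(asLatin1.charCodeAt(0));   // 212
console.log(asCyrillic.charCodeAt(0)); // 1060
```

Same bytes, two different strings: the only thing that changed is which map was applied.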

Javascript converts all strings to 16-bit Unicode internally, so it
doesn't need to know about code pages after the page has loaded.

Unicode:
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>

Codepage 1251 is "Cyrillic (Windows)"
<URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 20 '05 #2
Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's encode() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
encoded string?

Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

The action sets a window.location to the value of that form field --
when I'm in Windows-1251, I get a 404 but in ISO-8859-1 everything
works.

I appreciate your thoughts on how best to remedy this!

Thanks again,
Scott

Jul 20 '05 #3
sc***@turnstyle.com (Scott Matthews) writes:

> Thanks for your reply, please permit me to follow-up...
>
> I don't seem to understand why Javascript's encode() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.

That is because it is encoding different values. In the Latin-1 code
page, your string contains the Unicode character with code point 212.
It is escaped as %D4, because that is how 212 is written in hex.

In the Windows-1251 (Cyrillic) codepage, the string contains the Unicode
character with code point 1060. Since that can't be represented as a
two-digit hex number, escape uses the longer four-digit encoding:
%u0424

> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
> encoded string?

It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.
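
The rule can be sketched as follows. This illustrates the behavior discussed in this thread and is not the browser's actual code; escapeSketch is a made-up name, and the set of characters left untouched is assumed to be the one the ECMAScript specification lists for escape().

```javascript
// Characters that escape() leaves as they are (assumed from the ECMAScript
// description of escape): letters, digits and @ * _ + - . /
const UNESCAPED = /[A-Za-z0-9@*_+\-.\/]/;

function escapeSketch(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const code = text.charCodeAt(i); // one 16-bit code unit
    if (UNESCAPED.test(ch)) {
      out += ch;
    } else if (code < 256) {
      // Fits in two hex digits: short form %XX
      out += '%' + code.toString(16).toUpperCase().padStart(2, '0');
    } else {
      // Needs more than two hex digits: long form %uXXXX
      out += '%u' + code.toString(16).toUpperCase().padStart(4, '0');
    }
  }
  return out;
}

console.log(escapeSketch('\u00D4\u00E8\u00EB\u00EC\u00E8')); // %D4%E8%EB%EC%E8
console.log(escapeSketch('\u0424\u0438\u043B\u043C\u0438')); // %u0424%u0438%u043B%u043C%u0438
```

The choice between the two forms is made per character from its code value alone; the page's charset never enters into it.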

> Here's my situation: I have a FORM that asks for a URL as input. The
> page that the FORM sits on is available in a few languages, and so it
> can include a few different codepages.


Whee! Inputs and codepages. I believe there is something tricky about
that, but I don't know what. If the way the input is interpreted by the
browser is not the way it is intended by the operating system (I press
the Cyrillic EF key, browser writes an O-circumflex), then something
is bound to go wrong (or you might say that it already has).

I am afraid it is probably browser *and* operating system dependent.

/L
Jul 20 '05 #4
Scott Matthews wrote:

> Thanks for your reply, please permit me to follow-up...
>
> I don't seem to understand why Javascript's encode() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.
>
> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
> encoded string?

We had - 2 years ago - the same situation but with Japanese and
Chinese :) (my company does not support Cyrillic yet, but supports
Western European languages and Far East ones) -
and had exactly the same question!

Thanks, Lasse, your guess finally makes some sense (we were lost):

> It could, but it doesn't have to, since two hex digits are sufficient.
> It optimizes and uses the shorter representation. It could have
> generated %u00D4 instead, but that would be three bytes wasted.


So Scott, when our server-side software receives data from a form,
we have an IF-ELSE there!

That is, if it's Western (windows-1252 or iso-8859-1), we use
URLDecoding1(), which assumes the %XX format.
Otherwise, we use URLDecoding2(), which assumes the %uXXXX format.

We _always_ know - at the server side - what the encoding is -
when we send a page to a browser in the first place, creating the
HTTP header with "...charset=..." in it, we store that value on the
server side. Or, in some cases, we create a page in such a way that
a form has a hidden field that contains the encoding name, so when
data is sent from the form to the server, one of the fields will
tell the server-side software what the encoding is.
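
For illustration only, here is a sketch of a decoder that accepts both forms in one pass, assuming the value was produced by escape() as discussed earlier in this thread. decodeEscaped is a hypothetical name; it is not the URLDecoding1()/URLDecoding2() code mentioned above, which is not shown here, and it is written in JavaScript only to match the rest of the thread.

```javascript
// Turn %uXXXX and %XX sequences back into 16-bit code units.
// Anything that is not a well-formed escape sequence is left untouched.
function decodeEscaped(value) {
  return value.replace(
    /%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/g,
    (match, fourDigits, twoDigits) =>
      String.fromCharCode(parseInt(fourDigits || twoDigits, 16))
  );
}

const western = decodeEscaped('%D4%E8%EB%EC%E8');
const cyrillic = decodeEscaped('%u0424%u0438%u043B%u043C%u0438');

console.log(western.charCodeAt(0));  // 212
console.log(cyrillic.charCodeAt(0)); // 1060
```

Even on a Cyrillic page, ASCII punctuation such as a space still arrives in the short %XX form, so a decoder for the long form has to cope with both anyway.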

As for languages/encodings and Form Input - it's not really an
issue of this topic (in this topic we assume - as most Apps do - that
the data coming from a form are in the same encoding that page itself
is), you can read here:

http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html


--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/
Jul 20 '05 #5
Lasse Reichstein Nielsen wrote:

> ...
> I am afraid it is probably browser *and* operating system dependent.

Right. When we first ran into this issue (2+ years ago), we found out
that only Internet Explorer creates either %XX or %uXXXX based on the
encoding, while Netscape 4.0 does not - JavaScript in it always
converts to the %XX form.

I don't know how JavaScript in Netscape 7/Mozilla works in this case -
we do use them now, but I did not ask the guys...
Jul 20 '05 #6
