Bytes IT Community

Encode() behaves differently with different charsets?

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript encode() function
behaves differently if a codepage has been set.

For example:
<script>
document.write(escape('Ôèëìè'));
</script>
(note: that should be five accented characters)

Produces: %D4%E8%EB%EC%E8

But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write(escape('Ôèëìè'));
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Personally, I wouldn't expect the Javascript encode() function to
change its behavior if the codepage has been changed.

Might you know of any resources that can help me better understand
what's happening there?

Many thanks!
Scott
Jul 20 '05 #1
5 Replies


sc***@turnstyle.com (Scott Matthews) writes:
> I've recently come upon an odd Javascript (and/or browser) behavior,
> and after hunting around the Web I still can't seem to find an answer.
> Specifically, I have noticed that the Javascript encode() function
> behaves differently if a codepage has been set. For example:
>
> <script>
> document.write(escape('Ôèëìè'));
> </script>
> (note: that should be five accented characters)

It is five accented characters, because your message is encoded as
ISO-8859-1, and, e.g., the first character (byte value 212) is
O-circumflex in ISO-8859-1. It also has Unicode codepoint 212,
since Unicode agrees with ISO-8859-1 on values below 256.

> Produces: %D4%E8%EB%EC%E8

Where D4 is 212 in hex, so as expected.
> But setting the codepage to Windows-1251:
>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=Windows-1251">
> <script>
> document.write(escape('Ôèëìè'));
> </script>

Now, this *script* is interpreted as Windows-1251 characters, including
the literal string. The first character of that string is the byte 212,
which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
uses Unicode for strings, the first character of the string value becomes
Cyrillic EF, which has Unicode code point 1060.

> Produces: %u0424%u0438%u043B%u043C%u0438

Here 0424 is hex for 1060, as expected
(can be checked using 'parseInt("0424", 16)').
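The two results can be reproduced without any codepage tricks by building the strings from explicit \u escapes (a minimal sketch; variable names are mine, and it assumes an engine that still provides the deprecated escape()):

```javascript
// Build the two strings from explicit code points, so the codepage of
// this file cannot affect the result. Bytes D4 E8 EB EC E8 read as
// ISO-8859-1 give the first string; read as Windows-1251, the second.
var latin1Chars   = '\u00D4\u00E8\u00EB\u00EC\u00E8';
var cyrillicChars = '\u0424\u0438\u043B\u043C\u0438';

console.log(escape(latin1Chars));   // %D4%E8%EB%EC%E8
console.log(escape(cyrillicChars)); // %u0424%u0438%u043B%u043C%u0438
```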
> Personally, I wouldn't expect the Javascript encode() function to
> change its behavior if the codepage has been changed.

It doesn't. What changes is the interpretation of the string literal.
Try changing the write to
document.write('Ôèëìè'.charCodeAt(0));
or, even better,
document.write('Ôèëìè');

> Might you know of any resources that can help me better understand
> what's happening there?

No resources, sorry. But remember that when you assign an encoding
that is different from the one used by your editor, you can't trust
the characters you see. WYSI-not-WYG!

You should learn what a codepage really does. A codepage represents a
set of (up to) 256 different characters (or code points), like capital
Roman letter A, Arabic numeral 4, Roman letter O with circumflex accent,
Cyrillic capital EF, or some Chinese glyph. Those are the only
characters that can be written using that codepage. It also defines a
map from 8-bit bytes to those characters. Different code pages can
assign different code points to the same byte, as ISO-8859-1 and
Windows-1251 do for the byte 212.

Javascript converts all strings
to 16-bit Unicode internally, so it doesn't need to know about
code pages after the page has loaded.
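Since Javascript works on Unicode code points internally, the later encodeURIComponent() function sidesteps the %XX/%uXXXX split entirely: it always serializes via UTF-8, so its output does not depend on the page's codepage. A small sketch:

```javascript
// encodeURIComponent always serializes the string as UTF-8 bytes,
// so the same character encodes the same way regardless of the
// codepage the page was served with.
console.log(encodeURIComponent('\u00D4')); // %C3%94 (O-circumflex)
console.log(encodeURIComponent('\u0424')); // %D0%A4 (Cyrillic EF)
```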
Unicode:
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>

Codepage 1251 is "Cyrillic (Windows)"
<URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 20 '05 #2

Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's encode() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
encoded string?

Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

The action sets window.location to the value of that form field --
when I'm in Windows-1251, I get a 404, but in ISO-8859-1 everything
works.

I appreciate your thoughts on how best to remedy this!

Thanks again,
Scott


Jul 20 '05 #3

sc***@turnstyle.com (Scott Matthews) writes:
> Thanks for your reply, please permit me to follow-up...
>
> I don't seem to understand why Javascript's encode() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.

That is because it is encoding different values. In the Latin-1 code
page, your string contains the Unicode character with code point 212.
It is escaped as %D4, because that is how 212 is written in hex.

In the Windows-1251 (Cyrillic) codepage, the string contains the Unicode
character with code point 1060. Since that can't be represented as a
two-digit hex number, escape uses the longer four-digit encoding:
%u0424.

> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
> encoded string?

It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.

> Here's my situation: I have a FORM that asks for a URL as input. The
> page that the FORM sits on is available in a few languages, and so it
> can include a few different codepages.


Whee! Inputs and codepages. I believe there is something tricky about
that, but I don't know it. If the way the input is interpreted by the
browser is not the way it is intended by the operating system (I press
the Cyrillic EF key, the browser writes an O-circumflex), then something
is bound to go wrong (or you might say that it already has).

I am afraid it is probably browser *and* operating system dependent.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 20 '05 #4

Scott Matthews wrote:

> Thanks for your reply, please permit me to follow-up...
>
> I don't seem to understand why Javascript's encode() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.
>
> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 encode() to also produce a %uXXXX four char hex Unicode
> encoded string?
We had the same situation 2 years ago, but with Japanese and
Chinese :) (my company does not support Cyrillic yet, but supports
Western European languages and Far East ones) -
and we had exactly the same question!

Thanks, Lasse, your explanation finally makes some sense (we were lost):

> It could, but it doesn't have to, since two hex digits are sufficient.
> It optimizes and uses the shorter representation. It could have
> generated %u00D4 instead, but that would be three bytes wasted.


So Scott, when our server-side software receives data from a form,
we have an IF-ELSE there!

That is, if it's Western (windows-1252 or iso-8859-1) we use
URLDecoding1(), which assumes the %XX format.
Otherwise, we use URLDecoding2(), which assumes the %uXXXX format.

We _always_ know - on the server side - what the encoding is:
when we send a page to a browser in the first place, creating the
HTTP header with "...charset=..." in it, we store that value on the
server side. Or, in some cases, we create the page in such a way that
the form has a hidden field containing the encoding name, so when
data is sent from the form to the server, one of the fields will
tell the server-side software what the encoding is.

As for languages/encodings and form input - that's not really the
issue in this topic (here we assume - as most apps do - that the data
coming from a form is in the same encoding as the page itself);
you can read about it here:

http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html


--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/
Jul 20 '05 #5

Lasse Reichstein Nielsen wrote:
> ...
> I am afraid it is probably browser *and* operating system dependent.


Right. When we first ran into this issue (2+ years ago), we found out
that only Internet Explorer produces either %XX or %uXXXX based on the
encoding, while Netscape 4.0 does not - JavaScript in it always
converts to the %XX form.

Don't know how JavaScript in Netscape 7/Mozilla works in such a case -
we do use them now, but I haven't asked the guys...
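One way to cope with such per-browser differences is a runtime probe rather than a browser-name check (a hypothetical check, assuming escape() is available; engines following the common behavior return the %uXXXX form for non-Latin-1 characters):

```javascript
// Ask the engine itself which form escape() produces for a character
// outside the Latin-1 range, instead of sniffing the browser name.
var usesUnicodeForm = escape('\u0424').indexOf('%u') === 0;
console.log(usesUnicodeForm); // true in engines that use %uXXXX
```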
--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/
Jul 20 '05 #6
