Encode() behaves differently with different charsets?

Scott Matthews

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript escape() function
behaves differently if a codepage has been set.

For example:
<script>
document.write(escape('Ôèëìè'));
(note: that should be five accented characters)
</script>

Produces: %D4%E8%EB%EC%E8

But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write(escape('Ôèëìè'));
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Personally, I wouldn't expect the Javascript escape() function to
change its behavior if the codepage has been changed.

Might you know of any resources that can help me better understand
what's happening there?

Many thanks!
Scott

Jul 20 '05 #1


Lasse Reichstein Nielsen

sc***@turnstyle.com (Scott Matthews) writes:

> I've recently come upon an odd Javascript (and/or browser) behavior,
> and after hunting around the Web I still can't seem to find an answer.
>
> Specifically, I have noticed that the Javascript escape() function
> behaves differently if a codepage has been set.
>
> For example:
> <script>
> document.write(escape('Ôèëìè'));
> (note: that should be five accented characters)

It is five accented characters, because your message is encoded as
ISO-8859-1, and, e.g., the first character (byte value 212) is
O-circumflex in ISO-8859-1. It also has Unicode code point 212,
since Unicode agrees with ISO-8859-1 on values below 256.

> </script>
>
> Produces: %D4%E8%EB%EC%E8

Where D4 is 212 in hex, so as expected.

> But setting the codepage to Windows-1251:
>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=Windows-1251">
> <script>
> document.write(escape('Ôèëìè'));

Now, this *script* is interpreted as Windows-1251 characters, including
the literal string. The first character of that string is the byte 212,
which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
uses Unicode for strings, the first character of the string value becomes
Cyrillic EF, which has Unicode code point 1060.

> </script>
>
> Produces: %u0424%u0438%u043B%u043C%u0438

Here 0424 is hex for 1060, as expected.
(can be checked using 'parseInt("0424",16)')

> Personally, I wouldn't expect the Javascript escape() function to
> change its behavior if the codepage has been changed.

It doesn't. What changes is the interpretation of the string literal.
Try changing the write to
document.write('Ô'.charCodeAt(0));
or even better
document.write('Ôèëìè');
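
The check suggested above can be made independent of the page's charset by writing the characters as \u escapes instead of literal bytes. A minimal sketch (the variable names are only for illustration):

```javascript
// "\u00D4" is O-circumflex; "\u0424" is Cyrillic capital EF.
// Written as escapes, the literals mean the same thing under any page charset.
const latin = "\u00D4";
const cyrillic = "\u0424";

console.log(latin.charCodeAt(0));    // 212
console.log(cyrillic.charCodeAt(0)); // 1060
console.log(parseInt("0424", 16));   // 1060
console.log((212).toString(16));     // d4
```

The two code points differ, which is why the escaped output differs as well.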

> Might you know of any resources that can help me better understand
> what's happening there?

No resources, sorry. But remember that when you assign an encoding
that is different from the one used by your editor, you can't trust
the characters you see. WYSI-not-WYG!

You should learn what a codepage really does. A codepage represents a
set of (up to) 256 different characters (or code points), like capital
Roman letter A, Arabic numeral 4, Roman letter O circumflex accent,
Cyrillic capital EF, or Chinese glyph whatnot. Those are the only
characters that can be written using that codepage. It also defines a
map from 8-bit bytes to those characters. Different code pages can
assign different code points to the same byte, as ISO-8859-1 and
Windows-1251 do to the byte 212.
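
To see the byte-to-character map in action, here is a minimal sketch that decodes the same five bytes under both code pages. It assumes a runtime with the TextDecoder API and support for the "windows-1251" label (current browsers, or a Node.js build with full ICU); that API is not mentioned anywhere in this thread.

```javascript
// The same five bytes, decoded under two different code pages.
const bytes = new Uint8Array([0xD4, 0xE8, 0xEB, 0xEC, 0xE8]);

// Note: TextDecoder maps the label "iso-8859-1" to windows-1252;
// the two agree for bytes 0xA0-0xFF, which covers these five.
const asLatin1 = new TextDecoder("iso-8859-1").decode(bytes);
const asCyrillic = new TextDecoder("windows-1251").decode(bytes);

// Code point of each decoded character, in upper-case hex.
const hex = (s) =>
  Array.from(s, (c) => c.codePointAt(0).toString(16).toUpperCase()).join(" ");

console.log(hex(asLatin1));   // D4 E8 EB EC E8
console.log(hex(asCyrillic)); // 424 438 43B 43C 438
```

The bytes never change; only the map applied to them does, and that map decides which Unicode string the script ends up holding.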

Javascript converts all strings to 16-bit Unicode internally, so it
doesn't need to know about code pages after the page has loaded.
Unicode:
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>

Codepage 1251 is "Cyrillic (Windows)"
<URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 20 '05 #2

Scott Matthews

Thanks for your reply, please permit me to follow up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?

Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

The action sets a window.location to the value of that form field --
when I'm in Windows-1251, I get a 404 but in ISO-8859-1 everything
works.

I appreciate your thoughts on how best to remedy this!

Thanks again,
Scott

Jul 20 '05 #3

Lasse Reichstein Nielsen

sc***@turnstyle.com (Scott Matthews) writes:

> Thanks for your reply, please permit me to follow up...
>
> I don't seem to understand why Javascript's escape() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.

That is because it is encoding different values. In the Latin-1 code
page, your string contains the Unicode character with code point 212.
It is escaped as %D4, because that is how 212 is written in hex.

In the Windows-1251 (Cyrillic) codepage, the string contains the Unicode
character with code point 1060. Since that can't be represented as a
two-digit hex number, escape uses the longer four-digit encoding:
%u0424

> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
> encoded string?

It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.
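
The rule described above can be sketched as a small function. This is an illustrative re-implementation, not the browser's actual code; the name escapeSketch is made up, and the list of characters left untouched is the one given for escape() in ECMAScript Annex B.

```javascript
// Characters that escape() passes through unchanged (per ECMAScript Annex B).
const UNESCAPED =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

// Hypothetical helper mirroring the rule: %XX below 256, %uXXXX otherwise.
function escapeSketch(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    const unit = str.charCodeAt(i); // 16-bit code unit
    if (UNESCAPED.includes(ch)) {
      out += ch;
    } else if (unit < 256) {
      out += "%" + unit.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += "%u" + unit.toString(16).toUpperCase().padStart(4, "0");
    }
  }
  return out;
}

console.log(escapeSketch("\u00D4")); // %D4
console.log(escapeSketch("\u0424")); // %u0424
```

The function never looks at the page's charset. It only sees the code units already in the string, so the choice between the two forms depends entirely on whether each value fits in two hex digits.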

> Here's my situation: I have a FORM that asks for a URL as input. The
> page that the FORM sits on is available in a few languages, and so it
> can include a few different codepages.

Whee! Inputs and codepages. I believe there is something tricky about
that, but I don't know it. If the way the input is interpreted by the
browser is not the way it is intended by the operating system (I press
the Cyrillic EF key, browser writes an O-circumflex), then something
is bound to go wrong (or you might say that it already has).

I am afraid it is probably browser *and* operating system dependent.
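
One possible way around the %uXXXX form when building a URL, offered as a sketch rather than a confirmed fix for the 404 above: encodeURIComponent() always emits %XX escapes of the UTF-8 bytes, whatever the page's charset. Whether that helps depends on the server decoding percent-escapes as UTF-8, which is an assumption here.

```javascript
const latin = "\u00D4";    // O-circumflex
const cyrillic = "\u0424"; // Cyrillic capital EF

// Each character becomes the %XX escapes of its UTF-8 bytes.
console.log(encodeURIComponent(latin));    // %C3%94
console.log(encodeURIComponent(cyrillic)); // %D0%A4

// The round trip restores the original string.
console.log(decodeURIComponent("%D0%A4") === cyrillic); // true
```

Unlike the output of escape(), this never contains a %u sequence, which is not a valid percent-escape in a URL.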

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 20 '05 #4

Paul Gorodyansky

Scott Matthews wrote:

> Thanks for your reply, please permit me to follow up...
>
> I don't seem to understand why Javascript's escape() gives a %XX
> two-char hex encoded string when the codepage is at the default
> ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
> string when the codepage is set to Windows-1251.
>
> In other words, as I read your explanation, shouldn't I expect the
> ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
> encoded string?

We had the same situation 2 years ago, but with Japanese and
Chinese :) (my company does not support Cyrillic yet, but supports
Western European languages and Far East ones),
and had exactly the same question!

Thanks, Lasse, your guess finally makes some sense (we were lost):

> It could, but it doesn't have to, since two hex digits are sufficient.
> It optimizes and uses the shorter representation. It could have
> generated %u00D4 instead, but that would be three bytes wasted.

So Scott, when our server-side software receives data from a form,
we have an IF-ELSE there!

That is, if it's Western (windows-1252 or iso-8859-1) we use
URLDecoding1(), which assumes the %XX format.
Otherwise, we use URLDecoding2(), which assumes the %uXXXX format.
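
For illustration only, here is a sketch of a single decoder that accepts both forms. It assumes the input came from escape(), so that %XX stands for a code point below 256 rather than a byte in the page's charset. The name decodeEscaped is hypothetical and is not the URLDecoding1()/URLDecoding2() code mentioned above.

```javascript
// Hypothetical helper: decodes both %uXXXX and %XX escapes produced by escape().
function decodeEscaped(input) {
  return input.replace(
    /%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/g,
    // Exactly one of the two capture groups is set for each match.
    (match, four, two) => String.fromCharCode(parseInt(four || two, 16))
  );
}

console.log(decodeEscaped("%D4%E8%EB%EC%E8") === "\u00D4\u00E8\u00EB\u00EC\u00E8"); // true
console.log(decodeEscaped("%u0424%u0438%u043B%u043C%u0438") === "\u0424\u0438\u043B\u043C\u0438"); // true
```

The "u" after the percent sign is enough to tell the two forms apart, so under that assumption the decoder does not need to know the page's encoding.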

We _always_ know - at the server side - what the encoding is:
when we send a page to a browser in the first place, creating the
HTTP header with "...charset=..." in it, we store that value on the
server side. Or, in some cases, we create a page in such a way that
a form has a hidden field that contains the encoding name, so when
data is sent from the form to the server, one of the fields will
tell the server-side software what the encoding is.

As for languages/encodings and form input - it's not really an
issue of this topic (in this topic we assume, as most apps do, that
the data coming from a form is in the same encoding as the page
itself) - you can read here:

http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html

--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/

Jul 20 '05 #5

Paul Gorodyansky

Lasse Reichstein Nielsen wrote:

> ...
> I am afraid it is probably browser *and* operating system dependent.

Right. When we first ran into this issue (2+ years ago), we found out
that only Internet Explorer creates either %XX or %uXXXX based on the
encoding, while Netscape 4.0 does not: JavaScript in it always converts
to the %XX form.

Don't know how JavaScript in Netscape 7/Mozilla works in such a case -
we do use them now, but I did not ask the guys...
--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/

Jul 20 '05 #6
