Encode() behaves differently with different charsets?

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript escape() function
behaves differently if a codepage has been set.

For example:
<script>
document.write( escape('Ôèëìè') );
(note: that should be five accented characters)
</script>

Produces: %D4%E8%EB%EC%E8

But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write( escape('Ôèëìè') );
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Personally, I wouldn't expect the Javascript escape() function to
change its behavior if the codepage has been changed.

Might you know of any resources that can help me better understand
what's happening there?

Many thanks!
Scott
Jul 20 '05 #1
sc***@turnstyle.com (Scott Matthews) writes:
I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript escape() function
behaves differently if a codepage has been set.

For example:
<script>
document.write( escape('Ôèëìè') );
(note: that should be five accented characters)
It is five accented characters, because your message is encoded as
ISO-8859-1, and, e.g., the first character (byte value 212) is
O-circumflex in ISO-8859-1. It also has Unicode codepoint 212,
since Unicode agrees with ISO-8859-1 on values below 256.
</script>

Produces: %D4%E8%EB%EC%E8
Where D4 is 212 in hex, so as expected.
But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write( escape('Ôèëìè') );
Now, this *script* is interpreted as Windows-1251 characters, including
the literal string. The first character of that string is the byte 212,
which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
uses Unicode for strings, the first character of the string value becomes
Cyrillic EF, which has Unicode code-point 1060.
</script>

Produces: %u0424%u0438%u043B%u043C%u0438
Here 0424 is hex for 1060, as expected.
(can be checked using 'parseInt("0424", 16)')
Personally, I wouldn't expect the Javascript escape() function to
change its behavior if the codepage has been changed.
It doesn't. What changes is the interpretation of the string literal.
Try changing the write to
document.write( 'Ô'.charCodeAt( 0));
or even better
document.write( 'Ôèëìè');
Might you know of any resources that can help me better understand
what's happening there?


No resources, sorry. But remember that when you assign an encoding
that is different from the one used by your editor, you can't trust
the characters you see. WYSI-not-WYG!

You should learn what a codepage really does. A codepage represents a
set of (up to) 256 different characters (or code points), like capital
Roman letter A, Arabic numeral 4, Roman letter O with circumflex accent,
Cyrillic capital EF, or Chinese glyph whatnot. Those are the only
characters that can be written using that codepage. It also defines a
map from 8-bit bytes to those characters. Different code pages can
assign different characters to the same byte, as ISO-8859-1 and
Windows-1251 do to the byte 212.
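
To make the byte-to-character mapping concrete, here is a minimal sketch
that decodes the same five bytes under both code pages. It assumes an
environment that provides the standard TextDecoder API with the
windows-1251 encoding available (current browsers, or Node.js built with
full ICU); the variable names are illustrative only.

```javascript
// The five bytes of the string literal as they sit in the file.
const bytes = new Uint8Array([0xD4, 0xE8, 0xEB, 0xEC, 0xE8]);

// The Encoding Standard treats the label "iso-8859-1" as windows-1252,
// which maps these five bytes to the same characters as ISO-8859-1.
const asLatin1 = new TextDecoder("iso-8859-1").decode(bytes);
const asCyrillic = new TextDecoder("windows-1251").decode(bytes);

console.log(asLatin1, asLatin1.charCodeAt(0));     // Ôèëìè 212
console.log(asCyrillic, asCyrillic.charCodeAt(0)); // Фильм 1060
```

Same bytes, different characters: only the declared charset changed.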

Javascript converts all strings to 16-bit Unicode internally, so it
doesn't need to know about code pages after the page has loaded.
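
A small sketch of what that means in practice. The two characters are
written as \u escapes here (my choice, not something from the original
post) so that the result does not depend on how the script file itself is
decoded:

```javascript
// Once decoded, a string holds 16-bit code units, not code page bytes.
const oCircumflex = "\u00D4"; // O-circumflex, byte 212 in ISO-8859-1
const cyrillicEf = "\u0424";  // Cyrillic capital EF, byte 212 in Windows-1251

const latinCode = oCircumflex.charCodeAt(0);
const cyrillicCode = cyrillicEf.charCodeAt(0);

console.log(latinCode);           // 212
console.log(cyrillicCode);        // 1060
console.log(parseInt("0424", 16)); // 1060
```
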
Unicode:
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>

Codepage 1251 is "Cyrillic (Windows)"
<URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 20 '05 #2
Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?

Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

The action sets a window.location to the value of that form field --
when I'm in Windows-1251, I get a 404 but in ISO-8859-1 everything
works.

I appreciate your thoughts on how best to remedy this!

Thanks again,
Scott

Jul 20 '05 #3
sc***@turnstyle.com (Scott Matthews) writes:
Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.
That is because it is encoding different values. In the Latin-1 code
page, your string contains the Unicode character with code point 212.
It is escaped as %D4, because that is how 212 is written in hex.

In the Windows-1251 (Cyrillic) codepage, the string contains the Unicode
character with code point 1060. Since that can't be represented as a
two-digit hex number, escape uses the longer four-digit encoding:
%u0424
In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?
It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.
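
The rule described above can be sketched as a small function. This is an
illustration of the rule, not the browser's actual implementation;
sketchEscape is a hypothetical name, and the set of characters left
unescaped is the one given in the ECMAScript description of escape().

```javascript
// Characters that escape() leaves untouched.
const UNESCAPED = /[A-Za-z0-9@*_+\-.\/]/;

// Hypothetical re-implementation of the escape() rule, for illustration.
function sketchEscape(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (UNESCAPED.test(text[i])) {
      out += text[i];
    } else if (code < 256) {
      // Fits in two hex digits: short %XX form.
      out += "%" + code.toString(16).toUpperCase().padStart(2, "0");
    } else {
      // Needs more than two hex digits: long %uXXXX form.
      out += "%u" + code.toString(16).toUpperCase().padStart(4, "0");
    }
  }
  return out;
}

console.log(sketchEscape("\u00D4\u00E8\u00EB\u00EC\u00E8")); // %D4%E8%EB%EC%E8
console.log(sketchEscape("\u0424\u0438\u043B\u043C\u0438")); // %u0424%u0438%u043B%u043C%u0438
```

The function never looks at the page's charset; the two outputs differ
only because the two input strings hold different code units.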
Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.


Whee! Inputs and codepages. I believe there is something tricky about
that, but I don't know it. If the way the input is interpreted by the
browser is not the way it is intended by the operating system (I press
the Cyrillic EF key, browser writes an O-circumflex), then something
is bound to go wrong (or you might say that it already is).

I am afraid it is probably browser *and* operating system dependent.
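
For building a URL in script, one option that does not depend on the
page's charset is encodeURIComponent(), which always emits UTF-8
percent-escapes. Whether that cures the 404 depends on the server decoding
the URL as UTF-8, which is an assumption; the thread does not say what the
server expects. A minimal sketch, with illustrative variable names:

```javascript
const latin = "\u00D4\u00E8\u00EB\u00EC\u00E8";    // the Latin-1 reading
const cyrillic = "\u0424\u0438\u043B\u043C\u0438"; // the Windows-1251 reading

// Both come out in the same %XX form, as UTF-8 byte sequences.
const encodedLatin = encodeURIComponent(latin);
const encodedCyrillic = encodeURIComponent(cyrillic);

console.log(encodedLatin);    // %C3%94%C3%A8%C3%AB%C3%AC%C3%A8
console.log(encodedCyrillic); // %D0%A4%D0%B8%D0%BB%D0%BC%D0%B8

// The encoding round-trips without knowing any code page.
console.log(decodeURIComponent(encodedCyrillic) === cyrillic); // true
```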

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 20 '05 #4
Scott Matthews wrote:

Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?
We had - 2 years ago - the same situation but with Japanese and
Chinese :) (my company does not support Cyrillic yet, but supports
Western European languages and Far East ones) -
and had exactly the same question!

Thanks, Lasse, your guess finally makes some sense (we were lost):

It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.


So Scott, when our server-side software receives data from a form,
we have an IF-ELSE there!

That is, if it's Western (windows-1252 or iso-8859-1) we use
URLDecoding1(), which assumes the %XX format.
Otherwise, we use URLDecoding2(), which assumes the %uXXXX format.
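
For comparison, here is a hedged sketch of a single decoder that accepts
both forms in one pass. decodeEscaped is a hypothetical name, not one of
the functions mentioned above, and the sketch assumes the input was
produced by escape(), so that %XX means a code point below 256 rather than
a byte in the page's charset.

```javascript
// Hypothetical decoder for escape() output: handles %uXXXX and %XX.
function decodeEscaped(text) {
  return text.replace(
    /%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/g,
    // Exactly one of the two capture groups is set for each match.
    (match, wide, narrow) => String.fromCharCode(parseInt(wide || narrow, 16))
  );
}

console.log(decodeEscaped("%D4%E8%EB%EC%E8"));                // Ôèëìè
console.log(decodeEscaped("%u0424%u0438%u043B%u043C%u0438")); // Фильм
```

If the %XX data instead arrives as bytes in the page's charset (as with an
ordinary form submission), the charset still has to be known on the server,
which is what the stored value or hidden field is for.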

We _always_ know - at the server side - what the encoding is:
when we send a page to a browser in the first place, creating an
HTTP header with "...charset=..." in it, we store that value on the server
side. Or, in some cases, we create the page in such a way that
the form has a hidden field that contains the encoding name, so when
data is sent from the form to the server, one of the fields will
tell the server-side software what the encoding is.

As for languages/encodings and form input - that is not really the
issue in this topic (here we assume, as most apps do, that
the data coming from a form is in the same encoding as the page
itself). You can read more here:

http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html


--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/
Jul 20 '05 #5
Lasse Reichstein Nielsen wrote:
...
I am afraid it is probably browser *and* operating system dependent.


Right. When we first ran into this issue (2+ years ago)
we found out that only Internet
Explorer creates either %XX or %uXXXX based on the encoding, while
Netscape 4.0 does not - JavaScript in it always converts to the %XX form.

I don't know how JavaScript in Netscape 7/Mozilla works in this case -
we do use them now, but I did not ask the guys...
--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
http://ourworld.compuserve.com/homepages/PaulGor/
Jul 20 '05 #6
