Thanks for your reply. Please permit me to follow up.
I still don't understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.
In other words, as I read your explanation, shouldn't I expect
escape() under ISO-8859-1 to also produce a %uXXXX four-char hex
Unicode encoded string?
Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.
The form's action sets window.location to the value of that form field.
When the page is in Windows-1251 I get a 404, but in ISO-8859-1
everything works.
I appreciate your thoughts on how best to remedy this!
Thanks again,
Scott
Lasse Reichstein Nielsen <lrn@hotpop.com> wrote in message news:<vfmg3tna.fsf@hotpop.com>...[color=blue]
> scott@turnstyle.com (Scott Matthews) writes:
>
> > I've recently come upon an odd Javascript (and/or browser) behavior,
> > and after hunting around the Web I still can't seem to find an answer.
>
> > Specifically, I have noticed that the Javascript escape() function
> > behaves differently if a codepage has been set.
>
> > For example:
> > <script>
> > document.write(escape('Ôèëìè'));
> > (note: that should be five accented characters)
>
> It is five accented characters, because your message is encoded as
> ISO-8859-1, and, e.g., the first character (byte value 212) is
> O-circumflex in ISO-8859-1. It also has Unicode codepoint 212,
> since Unicode agrees with ISO-8859-1 on values below 256.
>
> > </script>
> >
> > Produces: %D4%E8%EB%EC%E8
>
> Where D4 is 212 in hex, so as expected.
>
> > But setting the codepage to Windows-1251:
> >
> > <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> > charset=Windows-1251">
> > <script>
> > document.write(escape('Ôèëìè'));
>
> Now, this *script* is interpreted as Windows-1251 characters, including
> the literal string. The first character of that string is the byte 212,
> which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
> uses Unicode for strings, the first character of the string value becomes
> Cyrillic EF, which has Unicode code-point 1060.
>
> > </script>
> >
> > Produces: %u0424%u0438%u043B%u043C%u0438
>
> Here 0424 is hex for 1060, as expected.
> (can be checked using 'parseInt("0424",16)')
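For anyone who wants to check this outside the browser, here is a minimal sketch of the two decodings described above. The function names decodeLatin1 and decodeCp1251Letters are made up for illustration, and the Windows-1251 mapping is deliberately partial: it assumes only bytes 0xC0-0xFF, which Windows-1251 assigns to the Cyrillic letters U+0410-U+044F in order.

```javascript
// The five bytes of the string literal as they sit in the HTML file.
const bytes = [0xD4, 0xE8, 0xEB, 0xEC, 0xE8];

// ISO-8859-1: every byte maps to the Unicode code point of the same value.
function decodeLatin1(input) {
  return input.map(b => String.fromCharCode(b)).join('');
}

// Windows-1251, partial sketch: only bytes 0xC0-0xFF are handled.
// They map to the Cyrillic letters U+0410-U+044F in the same order.
function decodeCp1251Letters(input) {
  return input.map(b => {
    if (b < 0xC0 || b > 0xFF) {
      throw new RangeError('byte outside 0xC0-0xFF: ' + b);
    }
    return String.fromCharCode(0x0410 + (b - 0xC0));
  }).join('');
}

const asLatin1 = decodeLatin1(bytes);
const asCp1251 = decodeCp1251Letters(bytes);

console.log(asLatin1.charCodeAt(0)); // 212  (O-circumflex)
console.log(asCp1251.charCodeAt(0)); // 1060 (Cyrillic capital EF)
console.log(parseInt('0424', 16));   // 1060
```

The same five bytes give two different strings, and that difference exists before escape() is ever called.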
>
> > Personally, I wouldn't expect the Javascript escape() function to
> > change its behavior if the codepage has been changed.
>
> It doesn't. What changes is the interpretation of the string literal.
> Try changing the write to
> document.write('Ô'.charCodeAt(0));
> or even better
> document.write('Ôèëìè');
>
> > Might you know of any resources that can help me better understand
> > what's happening there?
>
> No resources, sorry. But remember that when you assign an encoding
> that is different from the one used by your editor, you can't trust
> the characters you see. WYSI-not-WYG!
>
> You should learn what a codepage really does. A codepage represents a
> set of (up to) 256 different characters (or code points), like capital
> Roman letter A, Arabic numeral 4, Roman letter O circumflex accent,
> Cyrillic capital EF, or Chinese glyph whatnot. Those are the only
> characters that can be written using that codepage. It also defines a
> map from 8-bit bytes to those characters. Different code pages can
> assign different code points to the same byte, as ISO-8859-1 and
> Windows-1251 do to the byte 212.
>
> Javascript converts all strings to 16-bit Unicode internally, so it
> doesn't need to know about code pages after the page has loaded.
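To make the %XX versus %uXXXX difference concrete, here is a rough sketch of what escape() does with each 16-bit code unit, as I understand the ECMAScript definition. The name escapeSketch is hypothetical and the function is only meant to illustrate the rule, not to replace the built-in: code units below 256 come out as %XX, anything larger comes out as %uXXXX.

```javascript
// Characters that escape() leaves untouched.
const UNESCAPED = /[A-Za-z0-9@*_+\-.\/]/;

// Illustrative re-implementation of the escape() rule (hypothetical name).
function escapeSketch(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    const code = str.charCodeAt(i);
    if (UNESCAPED.test(ch)) {
      out += ch;                                   // left as-is
    } else if (code < 256) {
      out += '%' + code.toString(16).toUpperCase().padStart(2, '0');
    } else {
      out += '%u' + code.toString(16).toUpperCase().padStart(4, '0');
    }
  }
  return out;
}

// The string as read under ISO-8859-1 (all code units below 256).
console.log(escapeSketch('\u00D4\u00E8\u00EB\u00EC\u00E8'));
// %D4%E8%EB%EC%E8

// The string as read under Windows-1251 (Cyrillic, all above 255).
console.log(escapeSketch('\u0424\u0438\u043B\u043C\u0438'));
// %u0424%u0438%u043B%u043C%u0438
```

So the function applies one rule in both cases; only the code units it is handed differ.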
>
>
> Unicode:
> <URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
> <URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>
>
> Codepage 1251 is "Cyrillic (Windows)"
> <URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>
>
> /L[/color]