Bytes | Software Development & Data Engineering Community

Input Character Set Handling

Hi

I am struggling to find definitive information on how IE 5.5, 6 and 7
handle character input (I am happy with the display of text).
I have two main questions:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, ISO-8859-1, etc.) to UTF-8 prior to
sending the input back to the server?

2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
I think that the answer to question 1 is probably "YES", but I cannot
find any information on question 2!
Many thanks for your help
Kulgan.

Nov 10 '06 #1
44 replies · 9486 views
Kulgan wrote:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, ISO-8859-1, etc.) to UTF-8 prior to
sending the input back to the server?
With <form method="get">, the browser tries to pass the characters
to the server in the character set of the page, but it will only
succeed if the characters in question can be represented in that
character set. If not, browsers calculate "their best bet" based on
what's available (old style) or use a Unicode encoding (new style).

Example: western browsers send 'é' as '%E9' by default (URL encoding).
But when the page is in UTF-8, the browser will first look up the
UTF-8 encoding of 'é'. In this case it is 2 bytes, because 'é' lies in
the code point range U+0080-U+07FF, which UTF-8 encodes as two bytes.
Those two bytes, 0xC3 and 0xA9, correspond to 'Ã' and '©' in Latin-1,
and will result in '%C3%A9' (URL encoding) in the eventual query
string.
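You can reproduce this from script. As a hedged sketch (the built-in
encodeURIComponent always percent-encodes using UTF-8, regardless of
the page's charset, while the legacy escape function uses
Latin-1-style escapes):

```javascript
// encodeURIComponent always percent-encodes using UTF-8, matching
// what a browser sends for a page served as UTF-8:
console.log(encodeURIComponent('é')); // "%C3%A9"

// The legacy escape() function uses Latin-1-style escapes instead,
// matching the old default for western pages:
console.log(escape('é')); // "%E9"
```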

<form method="post" enctype="application/x-www-form-urlencoded"> is
the same as <form method="post"> and uses the same general principle
as GET.

In <form method="post" enctype="multipart/form-data"> there is no
default encoding at all, because this encoding type needs to be able
to transfer non-base64-ed binaries. 'é' will be passed as 'é' and
that's it.
2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
Browsers only encode form values between the moment that the user
submits the form and the moment that the new POST/GET request is made.
You should have no problem using any Unicode characters in javascript
as long as you haven't sent the form.

Hope this helps,

--
Bart

Nov 10 '06 #2
Browsers only encode form values between the moment that the user
submits the form and the moment that the new POST/GET request is made.
You should have no problem using any Unicode characters in javascript
as long as you haven't sent the form.
Thanks for the helpful info.

On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters? Does the browser do an
SJIS to UTF-8 conversion on the characters before they are used (e.g.
to find the length of the string?)

Thanks,

Kulgan.

Nov 10 '06 #3
VK
Kulgan wrote:
2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
That is confusion inspired by Unicode, Inc. and the W3C (I often
wonder whether they have any clue at all about Unicode).

Unicode is a *charset*: a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding). At the same time the TCP/IP protocol is an 8-bit medium:
its atomic unit is one byte. Thus one cannot directly send Unicode
entities over the Internet: the same way you cannot place a 3D box on
a sheet of paper, you can only emulate it (by making its 2D
projection). So it is necessary to use some 8-bit *encoding* algorithm
to split Unicode characters into sequences of bytes, send them over
the Internet, and glue them back together on the other end. Here the
UTF-8 *encoding* (not *charset*) comes into play. By some special
algorithm it encodes Unicode characters into base ASCII sequences and
sends them to the recipient. The recipient - informed in advance by
the Content-Type header of what is coming - uses a UTF-8 decoder to
get back the original Unicode characters.
The Fact Number One, unknown to the majority of specialists, including
the absolute majority of W3C volunteers - so feel yourself a chosen
one :-) -
The pragma <?xml version="1.0" encoding="utf-8"?> which one sees left
and right in XML and pseudo-XHTML documents *does not* mean that this
document is in UTF-8 encoding. It means that the document is in the
Unicode charset and it must be transmitted (if needed) over an 8-bit
medium using the UTF-8 encoding algorithm. Respectively, if the
document is not using the Unicode charset, then you are making a false
statement, with numerous nasty outcomes pending if it is ever used on
the Internet.
Here is even more secret knowledge, shared between myself and Sir
Berners-Lee only :-) -
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
*does not* mean that the characters you see on your screen are in a
"UTF-8 charset" (there is no such thing). It means: "The input stream
was declared as Unicode charset characters encoded using the UTF-8
transport encoding. The result you are seeing (if you are seeing
anything) is the result of decoding the input stream using a UTF-8
decoder".
The "charset" term here is a totally misleading one - it remains from
the old times of charsets with 256 entities maximum, so that the
encoding matched the charset and vice versa. The proper header W3C
should insist on is
...content="text/html; charset=Unicode; encoding=UTF-8"
As I said before, very few people on Earth know the truth, and the Web
did not collapse so far for two main reasons:
1) The Content-Type header sent by the server takes precedence over a
META tag on the page. This HTTP standard is one of the most valuable
ones left to us by the fathers. They saw in advance the ignorance
ruling, so they left server admins the chance to save the world :-)
2) All modern UAs have special heuristics built in to sort out real
UTF-8 input streams from authors' mistakes. A note for the
"Content-Type in my heart" adepts: it means that over the last years a
great amount of viewer-dependent XML/XHTML documents was produced.

Sorry for such an extremely long preface, but I considered it
dangerous to just keep giving "short fix" advice: it is fighting the
symptoms instead of the sickness. And the sickness is growing
worldwide: our helpdesk is flooded with requests like "my document is
in UTF-8 encoding, why..." etc.

Coming back to your original question: the page will be either Unicode
or ISO-8859-1 or something else, but it will *never* be UTF-8: UTF-8
exists only during the transmission and parsing stages. The maximum
one can do is to have UTF-8-encoded characters right in the document,
like %D0%82... But in that case it is just raw UTF-8 source
represented using the ASCII charset.
From the other side, JavaScript operates with Unicode only, and it
sees the page content "through the window of Unicode" no matter what
the actual charset is. So to reliably compare user input / node values
with JavaScript strings you have two options:
1) The most reliable one for an average-to-small amount of non-ASCII
characters:
Use \u Unicode escape sequences.

2) Less reliable, as it can easily be smashed once opened in a
non-Unicode editor:
Have the entire .js file in Unicode, with non-ASCII characters typed
as they are, and your server sending the file in the UTF-8 encoding.
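Option 1 can be sketched as follows; the accented strings here are
illustrative examples, not taken from the thread:

```javascript
// Comparing user input against a \u-escaped literal keeps the .js
// file pure ASCII, so no editor or server misconfiguration can
// corrupt the non-ASCII characters:
var expected = '\u00E9\u00E8';  // "éè" written as Unicode escapes
var typedLiterally = 'éè';      // the same characters typed as-is
console.log(expected === typedLiterally); // true: identical code units
```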

P.S. There is a whole other issue, which could be named "How do I
handle Unicode 32-bit characters, or How did Unicode, Inc. screw the
whole world". But your primary question is answered, and it's beer
time anyway. :-)

Nov 10 '06 #4
Kulgan wrote:
[...]
On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters?
Always the same, as their Unicode code points.
Does the browser do an SJIS to UTF-8 conversion on the characters
before they are used (e.g. to find the length of the string?)
No conversion/encoding is possible on that level. I think you're not
fully aware of the distinction between
(1) the user's (available) charsets
(2) the charset of the web page
(3) how javascript handles characters internally

Only (3) is of importance in your case:

Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>

Note that it doesn't matter whether the user has SJIS installed. It
also doesn't matter what the charset of the page is.

--
Bart

Nov 10 '06 #5
VK wrote:
[...]
Unicode is a *charset* : a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding).
[...]
I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
yes, but Unicode's primary idea is the multi-byte concept; characters
can also consist of 1 byte, or more than 2.

--
Bart

Nov 10 '06 #6
VK

Bart Van der Donck wrote:
[...]
Unicode is a *charset* : a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding).
[...]
I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
yes, but Unicode's primary idea is the multi-byte concept; characters
can also consist of 1 byte, or more than 2.
I humbly disagree: the very original Unicode idea is that 8 bits
cannot accommodate all character codes for all characters currently
used in the world. This way it was an obvious idea to use a two-byte
encoding, with respectively 65,536 possible character units, to
represent all *currently used* systems of writing. While some Far East
systems (Hangul, Traditional Chinese) would be a space challenge, the
majority of other systems are based on the Phoenician phonetic
alphabet (> Greek > Latin > others) and so are relatively very
compact. This way 65,536 storage units were more than generous for the
task.
From the other end, at the moment the project started, US English
(base ASCII) texts were absolutely prevailing in transmission, so the
task was not to double the HTTP traffic with useless 0x00 bytes. To
avoid that, it was decided that the bytes 0-127 would be treated
literally as base ASCII characters, and anything in 128-255 would be
treated as the beginning of a double-byte Unicode sequence. Alas, it
meant that 0x8000 - 0xFFFF (a good half of the table) would be
unusable. Luckily, Pike and Thompson found a way of an economic,
unambiguous transmission of any characters in the 0-65535 range,
meeting the core requirement of not doubling the traffic with
Unicode-encoded base-ASCII characters. This algorithm - later called
UTF-8 - went into wide production. It doesn't mean that the English
"A" is represented with a single byte in Unicode: it means that the
Unicode double-byte character 0x0041 (Basic Latin LATIN CAPITAL LETTER
A) has a universally recognized single-byte shortcut, 0x41.
That would be a happy ending, but unfortunately Unicode, Inc. treated
the 65,536 storage places as a teenager would treat his first credit
card - rolling it out on the first occasion without thinking of the
consequences. Any shyster coming with any kind of crap tables was
immediately welcomed and accounted for. This way Unicode, Inc. started
to work on a "first come - first served" basis, and the original idea
of "all currently used charsets" was seamlessly transformed into "all
symbolic systems ever used for any purposes by the human
civilization". Predictably for language specialists - but surprisingly
for the Unicode, Inc. amateurs - it appeared that humanity has
produced a countless number of systems to denote sounds, syllables,
words, ideas, musical sounds, chemical elements, and an endless number
of other material and spiritual entities. This way they spent all the
available storage space on rarely used crap before even fixing the
place for such "minor" issues as Chinese or Japanese. As the result,
they had to go from a 2-byte system to a 3-byte system, and now they
seem to be exploring the storage space of a 4-byte system. And this is
even without yet touching Egyptian hieratic/demotic and all variants
of Cuneiform. And there is no one so far to come, send the fn amateurs
to hell, and bring the Unicode system into order.

Come and say "Unicode" to any Java team guy (unlike with "Candyman",
one time will suffice :-) and then run away quickly before he starts
beating you.

Yes, I am biased on the matter: I hate "volunteers" who are convinced
that whatever they are doing is right just because they are doing it
for free (and seemingly for free).

Nov 10 '06 #7
VK wrote:
Kulgan wrote:
>2. Does IE Javascript do the same? So if I write a Javascript
function that compares a UTF-8 string to a string that a user has
inputted into a text box, will IE convert the user's string into
UTF-8 before doing the comparison?

That is confusion inspired by Unicode, Inc. and W3C (I'm wondering
rather often if they have any clue at all about Unicode).
Oh, here we go.
Unicode is a *charset* ...
It's a character encoding: characters are encoded as an integer within a
certain "codespace" , namely the range 0..10FFFF. There are then
"encoding forms" that transform values in this range to "code units",
specifically the three Unicode Transformation Formats, UTF-8, -16, and
-32. These code units can be used to store or transport sequences of
"encoded characters". The "encoding scheme" (which includes big- and
little-endian forms for UTF-16 and -32) defines precisely how each form
is serialised into octets.

[snip]
Here UTF-8 *encoding* (not *charset*) comes into play. By some
special algorithm it encodes Unicode characters into base ASCII
sequences and send them to the recipient.
Whilst some encoded characters will map directly to ASCII (specifically
the Unicode code points, 0..7F), most won't. For a start, ASCII is a
7-bit encoding (128 characters in the range 0..7F), whereas UTF-8 is an
8-bit, variable-width format.

The word you are looking for is "octet".

[snip]
Pragma <?xml version="1.0" encoding="utf-8"?>
It is the XML declaration and takes the form of a processing instruction.
... *does not* mean that this document is in UTF-8 encoding.
That depends on what you mean by "in UTF-8 encoding". If you meant
"serialised using the UTF-8 encoding scheme", then that's precisely what
it means. However, it is unnecessary to include an XML declaration for
documents that use either the UTF-8 or -16 encoding form (see 4.3.3
Character Encoding in Entities).
It means that the document is in Unicode charset ...
All XML documents (and HTML, for that matter) use the Unicode
repertoire. The issue is the form in which the document is transported.
Should a higher protocol not signal the encoding form in use (UTF-8,
ISO-8859-1, etc.) then the XML declaration serves that purpose.

[snip]
Coming back to your original question: the page will be either Unicode
or ISO-8859-1 or something else: but it *never* will be UTF-8: UTF-8
exists only during the transmission and parsing stages.
UTF-8 can be used any time the document needs to be serialised into a
sequence of octets. Therefore, a document might stored on disk using
UTF-8, and then transmitted verbatim across a network.

[snip]

Mike
Nov 10 '06 #8
"Bart Van der Donck" <ba**@nijlen.com> wrote in
news:11**********************@h48g2000cwc.googlegroups.com:
Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ
if (document.forms[0].i.value == '\uFF94;')
Not equal

3 Paste ヤ
if (document.forms[0].i.value == 'ヤ')
Not equal

4 Paste &amp;
if (document.forms[0].i.value == '&amp;')
Not equal

5 Paste abc
if (document.forms[0].i.value == 'abc')
Equal

6 Paste &
if (document.forms[0].i.value == '&')
Equal

7 Paste &
if (document.forms[0].i.value == '&') //ascii decimal
Equal

8 Paste &
if (document.forms[0].i.value == '\x26') //ascii hex
Equal

9 Paste &
if (document.forms[0].i.value == '\46') //ascii octal
Equal

10 Paste &
if (document.forms[0].i.value == '\u0026') //unicode
Equal

11 Paste &
if (document.forms[0].i.value == '&amp;') //html character entity
Equal

Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself, as its ASCII code (decimal, hex, or
octal), as its Unicode escape, or as its HTML character entity.

2. However, Javascript does *not* correctly recognize a character
entered by typing its ASCII code, Unicode escape, or HTML character
entity into the text box.
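Both observations follow from where decoding happens, which can be
sketched as follows (a hedged illustration; the variable names are
mine). An entity such as &amp; inside the onClick attribute is decoded
by the HTML parser before JavaScript ever sees the string, while
nothing decodes what the user types into the field:

```javascript
// What the author wrote in the attribute as the entity for '&'
// reaches the script as the single character '&', because attribute
// values are HTML-decoded during parsing:
var valueSeenByScript = '&';

// A user who literally types the five characters &, a, m, p, ;
// produces an undecoded string:
var userTyped = '&amp;';

console.log(valueSeenByScript === '&');       // true
console.log(userTyped === valueSeenByScript); // false: input stays literal
```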

Nov 11 '06 #9
On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters?

Always the same, as their Unicode code points.
Many thanks for the advice. I am starting to get an understanding of
what is going on now!! Are you saying that even if the user's Windows
character set is not Unicode, Javascript still sees characters typed
into text boxes as Unicode? Or are modern Windows (XP) installations
always Unicode for data input anyway??

Can of worms...!

Kulgan.

Nov 11 '06 #10

