Input Character Set Handling

Hi

I am struggling to find definitive information on how IE 5.5, 6 and 7
handle character input (I am happy with the display of text).
I have two main questions:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1 etc) to UTF-8 prior to sending
the input back to the server?

2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
I think that the answer to question 1 is probably "YES", but I cannot
find any information on question 2!
Many thanks for your help
Kulgan.

Nov 10 '06
Jim Land (NO SPAM) wrote:
"Bart Van der Donck" <ba**@nijlen.com> wrote in
news:11**********************@h48g2000cwc.googlegroups.com:
Posts like yours are dangerous; Google Groups displays html char/num
entities where you haven't typed them and vice versa. I can imagine
that most News Readers will have trouble with it too; that's why I've
put some work to restrict my previous post to ISO-8859-1 so everybody
sees it correctly.
Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ
if (document.forms[0].i.value == '\uFF94;')
Not equal

3 Paste ヤ
if (document.forms[0].i.value == 'ヤ')
Not equal

4 Paste &amp;
if (document.forms[0].i.value == '&amp;')
Not equal

5 Paste abc
if (document.forms[0].i.value == 'abc')
Equal

6 Paste &
if (document.forms[0].i.value == '&')
Equal

7 Paste &
if (document.forms[0].i.value == '&') //ascii decimal
Equal

8 Paste &
if (document.forms[0].i.value == '\x26') //ascii hex
Equal

9 Paste &
if (document.forms[0].i.value == '\46') //ascii octal
Equal

10 Paste &
if (document.forms[0].i.value == '\u0026') //unicode
Equal

11 Paste &
if (document.forms[0].i.value == '&amp;') //html character entity
Equal
I suppose your testing results should be fine, two thoughts:
- beware of leading/trailing spaces when you copy/paste
- (document.forms[0].i.value == '\uFF94;') doesn't equal because the
semicolon shouldn't be there
Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself,
Yes.
as its ascii code (decimal, hex, or octal),
Yes, but only when it's an ASCII character (which is nowadays too
narrow to work with).
as its unicode,
Yes.
or as its html character entity.
I'd say this is a bridge too far; there might be browser dependencies
when it comes to num/char entity handling in forms. I would tend to
not rely too much on this kind of stuff.
2. However, Javascript does *not* correctly recognize a character entered
by typing its ascii code, unicode, or html character entity into a text
box.
Correct by definition; eg when you type "\x41", it will be treated as
"\x4" and not as "A", because you typed "\x4" and not "A" :-) But it's
possible to write a script to modify such behaviour.
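As a rough illustration of the kind of script mentioned above, here is a minimal sketch that converts escape notation typed as literal text into the character it names. The function name `decodeTypedEscapes` and the set of notations it accepts (`\uXXXX`, `\xXX` and decimal `&#NNNN;`) are assumptions made for this example; nothing in the thread prescribes them.

```javascript
// Hypothetical helper (not from the thread): convert escape notation that
// was typed as literal text into the character it names.
// Assumed notations: \uXXXX, \xXX and decimal &#NNNN;
function decodeTypedEscapes(text) {
  var pattern = /\\u([0-9a-fA-F]{4})|\\x([0-9a-fA-F]{2})|&#([0-9]{1,7});/g;
  return text.replace(pattern, function (match, u, x, dec) {
    if (u !== undefined) return String.fromCharCode(parseInt(u, 16));
    if (x !== undefined) return String.fromCharCode(parseInt(x, 16));
    var codePoint = parseInt(dec, 10);
    // Leave the text untouched if the number is outside the Unicode codespace.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

// In source code '\\x41' is the four typed characters: backslash, x, 4, 1.
var typed = '\\x41';
console.log(typed === 'A');                     // false
console.log(decodeTypedEscapes(typed) === 'A'); // true
```

A single regular expression is used so that each piece of the input is decoded at most once; chaining separate replacements could decode the output of an earlier pass a second time.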

--
Bart

Nov 11 '06 #11
Kulgan wrote:
Many thanks for the advice. I am starting to get an understanding of
what is going on now!! Are you saying that if the user's Windows
character set is not Unicode that Javascript sees characters inputted
into text boxes as Unicode?
Yes, always.
Or are modern Windows (XP) installations always Unicode for data
input anyway??
I'm not sure of that, but it doesn't matter here. You can input
whatever you want from any charset on any OS using any decent browser.
Javascript will always handle it internally as Unicode code-points;
each javascript implementation is built that way.
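A small sketch of what this means in practice. ECMAScript strings are sequences of 16-bit code units, so the different escape spellings in source code all produce the same string value, whatever character set the page or the operating system uses. The variable names are only illustrative.

```javascript
// Four spellings of the same one-character string.
var literal = '&';
var hexEscape = '\x26';
var unicodeEscape = '\u0026';
var fromNumber = String.fromCharCode(38);

console.log(literal === hexEscape);      // true
console.log(literal === unicodeEscape);  // true
console.log(literal === fromNumber);     // true

// A non-ASCII character: halfwidth katakana YA, U+FF94.
var ya = '\uFF94';
console.log(ya.length);                     // 1
console.log(ya.charCodeAt(0).toString(16)); // "ff94"
```

The comparison in the test form therefore works on these code units, not on the bytes of any particular encoding.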
Can of worms...!
True, but with some basic rules and a lot of common sense, most
situations can be dealt with.

--
Bart

Nov 11 '06 #12
VK
Oh, here we go.

Oh, here we go :-): someone gonna teach me about the Unicode. For some
reasons - which I'll skip to disclose - it is funny to me, but go ahead
anyway.
It's a character encoding: characters are encoded as an integer within a
certain "codespace", namely the range 0..10FFFF.
Unicode is a charset (set of characters) with each character unit
represented by words (in the programming sense) with the smallest word
consisting of 2 bytes (16 bits). This way the range doesn't go from 0:
there is not such character in Unicode. Unicode starts from the
character 0x0000. Again you are thinking and talking about character
entities, bytes, Unicode and UTF-8 at once: which is not helpful if one
tries to understand the matter.
There are then
"encoding forms" that transform values in this range to "code units",
specifically the three Unicode Transformation Formats, UTF-8, -16, and
-32. These code units can be used to store or transport sequences of
"encoded characters". The "encoding scheme" (which includes big- and
little-endian forms for UTF-16 and -32) defines precisely how each form
is serialised into octets.
That is correct.

<snip>
For a start, ASCII is a
7-bit encoding (128 characters in the range 0..7F)
I prefer to use the old term lower-ASCII to refer to 0-127 part where
the 128-255 variable part used for extra entities and variable from one
charset to another. This way more academically correct term could be
"IBM tables" and respectively "lower part of IBM tables" but who
remembers this term now? "lower-ASCII" in the sense "0-127 characters"
or "US ASCII" is good enough for the matter.
whereas UTF-8 is an
8-bit, variable-width format.
Again you are mixing charsets and bytes. UTF-8 is a transport encoding
representing Unicode characters using "US ASCII" only character
sequences.
a document might be stored on disk using
UTF-8, and then transmitted verbatim across a network.
Technically well possible but for what reason? (besides making a copy
in another storage place). Such document is not viewable without
specially written parser and not directly usable for Internet. So what
purpose would be of such document?

Nov 11 '06 #13
"Bart Van der Donck" <ba**@nijlen.com> wrote in
news:11*********************@f16g2000cwb.googlegroups.com:
Jim Land (NO SPAM) wrote:
>"Bart Van der Donck" <ba**@nijlen.com> wrote in
news:11**********************@h48g2000cwc.googlegroups.com:

Posts like yours are dangerous; Google Groups displays html char/num
entities where you haven't typed them and vice versa. I can imagine
that most News Readers will have trouble with it too; that's why I've
put some work to restrict my previous post to ISO-8859-1 so everybody
sees it correctly.
Thank you for pointing this out. For those reading posts in a reader
that mangles, I have clarified below by inserting spaces so the string
cannot be rendered as a special character.
Paste into input field:<br>
ヤツカ \\ & # 65428; & # 65410; & # 65398;
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
\\ \ u FF94 \ u FF82 \ u FF76
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ \\ & # 65428 ;
if (document.forms[0].i.value == '\uFF94;') \\ \ u FF94 ;
Not equal

3 Paste ヤ \\ & # 65428 ;
>if (document.forms[0].i.value == 'ヤ') \\ & # 65428 ;
>Not equal

4 Paste &amp; \\ & amp ;
if (document.forms[0].i.value == '&amp;') \\ & amp ;
Not equal

5 Paste abc \\ abc
if (document.forms[0].i.value == 'abc') \\ abc
Equal

6 Paste & \\ single character
if (document.forms[0].i.value == '&') \\ single character
Equal

7 Paste & \\ single character
if (document.forms[0].i.value == '&') // & # 38; ascii decimal
Equal

8 Paste & \\ single character
if (document.forms[0].i.value == '\x26') // \ x 26 ascii hex
Equal

9 Paste & \\ single character
if (document.forms[0].i.value == '\46') // \ 46 ascii octal
Equal

10 Paste & \\ single character
if (document.forms[0].i.value == '\u0026') // \ u 0026 unicode
Equal

11 Paste & \\ single character
if (document.forms[0].i.value == '&amp;')
// & amp ; html character entity
>Equal

I suppose your testing results should be fine, two thoughts:
- beware of leading/trailing spaces when you copy/paste
- (document.forms[0].i.value == '\uFF94;') doesn't equal because the
semicolon shouldn't be there
Thanks, my typo. But still not equal when semicolon is removed.
>
>Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself,

Yes.
>as its ascii code (decimal, hex, or octal),

Yes, but only when it's an ASCII character (which is nowadays too
narrow to work with).
>as its unicode,

Yes.
>or as its html character entity.

I'd say this is a bridge too far; there might be browser dependencies
when it comes to num/char entity handling in forms. I would tend to
not rely too much on this kind of stuff.
>2. However, Javascript does *not* correctly recognize a character
entered by typing its ascii code, unicode, or html character entity
into a text box.

Correct by definition; eg when you type "\x41", it will be treated as
"\x4" and not as "A", because you typed "\x4" and not "A" :-) But it's
possible to write a script to modify such behaviour.
I believe you meant, 'when you type "\x41" (\ x 41), it will be treated
as
"\x41" (\ x 41) and not as "A", because you typed "\x41" (\ x 41) and
not "A"'
Nov 11 '06 #14
VK wrote:

[snip]
>It's a character encoding: characters are encoded as an integer
within a certain "codespace", namely the range 0..10FFFF.

Unicode is a charset (set of characters)
Character set and character encoding are synonymous, however Unicode is
not defined using the former.
with each character unit represented by words (in the programming
sense) with the smallest word consisting of 2 bytes (16 bits).
If by "character unit" you mean code point, that's nonsense. A code
point is an integer, simple as that. How it is represented varies.
This way the range doesn't go from 0: there is not such character in
Unicode.
In the Unicode Standard, the codespace consists of the integers
from 0 to 10FFFF [base 16], comprising 1,114,112 code points
available for assigning the repertoire of abstract characters.
-- 2.4 Code Points and Characters,
The Unicode Standard, Version 4.1.0
Unicode starts from the character 0x0000.
The Unicode codespace starts from the integer 0. The first assigned
character exists at code point 0.
Again you are thinking and talking about character entities, bytes,
Unicode and UTF-8 at once:
No, I'm not. I used terms that are distinctly abstract.

It seems to me that you are confusing a notational convention -
referring to characters with the form U+xxxx - for some sort of definition.
which is not helpful if one tries to understand the matter.
Quite. Why then do you try so hard to misrepresent technical issues?

[snip]
"lower-ASCII" in the sense "0-127 characters" or "US ASCII" is good
enough for the matter.
I'm not really going to debate the issue, so long as you understand what
I mean when I refer to ASCII.
>whereas UTF-8 is an 8-bit, variable-width format.

Again you are mixing charsets and bytes.
No, I'm not.
UTF-8 is a transport encoding representing Unicode characters using
"US ASCII" only character sequences.
My point was that, given your own definition of (US-)ASCII above, this
sort of statement is absurd. The most significant bit is important in
the octets generated when using the UTF-8 encoding scheme - all scalar
values greater than 7F are serialised to two or more octets, each of
which has the MSB set - yet you are describing it in terms of something
where only the lowest 7 bits are used to represent characters.

For example, U+0430 is represented by the octets D0 and B0. In binary,
these octets are 11010000 and 10110000, respectively. If UTF-8 uses "US
ASCII only character sequences", and you agree that US-ASCII is strictly
7-bit, do you care to explain that evident contradiction?
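The octets quoted for U+0430 can be checked with a few lines of code. This is a minimal sketch of the UTF-8 encoding form for a single code point; the function names `utf8Octets` and `toBinary` are made up for the example, and surrogate code points (D800..DFFF) are deliberately not handled.

```javascript
// Encode one Unicode code point (0..0x10FFFF) as an array of UTF-8 octets.
// Sketch only: surrogates (0xD800..0xDFFF) are not rejected here.
function utf8Octets(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];                       // 0xxxxxxx
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6),          // 110xxxxx
            0x80 | (codePoint & 0x3F)];       // 10xxxxxx
  }
  if (codePoint < 0x10000) {
    return [0xE0 | (codePoint >> 12),         // 1110xxxx
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
  }
  return [0xF0 | (codePoint >> 18),           // 11110xxx
          0x80 | ((codePoint >> 12) & 0x3F),
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

// Format each octet as an 8-digit binary string.
function toBinary(octets) {
  return octets.map(function (octet) {
    return ('00000000' + octet.toString(2)).slice(-8);
  });
}

console.log(toBinary(utf8Octets(0x0430)).join(' ')); // 11010000 10110000
console.log(toBinary(utf8Octets(0x0041)).join(' ')); // 01000001
```

Only code points up to 7F produce a single octet with the most significant bit clear; every octet of a multi-octet sequence has it set.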
>a document might be stored on disk using UTF-8, and then transmitted
verbatim across a network.

Technically well possible but for what reason? ...
Efficiency. Most Western texts will be smaller when the UTF-8 encoding
scheme is employed as the 0..7F code points are the most common,
encompassing both common letters, digits, and punctuation.
Such document is not viewable without specially written parser and
not directly usable for Internet.
Oh dear. Of all of the documents that use one of the Unicode encoding
schemes on the Web, I should think that the /vast/ majority of them use
UTF-8. As for "specially written parser", XML processors are required to
accept UTF-8 input and browsers at least as far back as NN4 also do so.

[snip]

Mike
Nov 11 '06 #15
VK
a document might be stored on disk using UTF-8, and then transmitted
verbatim across a network.
Technically well possible but for what reason? ...
Such document is not viewable without specially written parser and
not directly usable for Internet.
Oh dear. Of all of the documents that use one of the Unicode encoding
schemes on the Web, I should think that the /vast/ majority of them use
UTF-8. As for "specially written parser", XML processors are required to
accept UTF-8 input and browsers at least as far back as NN4 also do so.
Oh dear. So by "transmitted verbatim across a network" you meant like
"served from a server to user agent"?! OK, then we have a really "low
start"... Your homework for Monday (I'll check :-)

Given this UTF-8 encoded XML file:

<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark)
<item>%C2%AE</item>
</repository>

Investigate and explain why this (r) sign doesn't appear back no matter
what when viewed in UA.
A hint: think of a difference of 1) byte input stream from network and
2) document source text made from the received byte stream. On what
stage UA's UTF-8 decoder works?

Then create a version properly displaying (r) sign. To avoid DTD
charset hassle, it is allowed to make a (X)HTML document instead of
XML. Make sure that you see (r) sign when open in UA. What charset your
source is? A hint: do not look at UTF-8

Nov 11 '06 #16
VK

VK wrote:
To avoid DTD
charset hassle
A "repeating word" typo, of course:

"To avoid DTD subset hassle..."

Nov 11 '06 #17
VK wrote:

[MLW:]
>>>a document might be stored on disk using UTF-8, and then transmitted
verbatim across a network.
[snip]
Oh dear. So by "transmitted verbatim across a network" you meant like
"served from a server to user agent"?!
Of course.
OK, then we have a really "low start"... Your homework for Monday
(I'll check :-)
We do, but I'm not the one that doesn't understand what's going on. Once
again, you prove yourself to be totally clueless.
Given this UTF-8 encoded XML file:

<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark)
<item>%C2%AE</item>
</repository>
Moron! That doesn't use the UTF-8 encoding form.

The element, item, contains six characters, represented using six
octets. In hexadecimal (and binary) these are: 25 (00100101), 43
(01000011), 32 (00110010), 25, 41 (01000001), and 45 (01000101). Using
UTF-8, it should contain one character, represented using two octets: C2
(11000010) and AE (10101110).
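The difference between the six characters in the element and the one character they are meant to stand for can be seen in a short sketch. The variable names are illustrative; `decodeURIComponent` is the standard function that performs percent-decoding followed by UTF-8 decoding.

```javascript
var itemText = '%C2%AE'; // the text content of <item> in the example

// Six characters, all in the ASCII range.
var hexCodes = [];
for (var i = 0; i < itemText.length; i++) {
  hexCodes.push(itemText.charCodeAt(i).toString(16));
}
console.log(itemText.length);    // 6
console.log(hexCodes.join(' ')); // 25 43 32 25 41 45

// Percent-decoding is a separate, explicit step (it belongs to URL
// handling); only after it does the text become the single character U+00AE.
var decoded = decodeURIComponent(itemText);
console.log(decoded.length);                     // 1
console.log(decoded.charCodeAt(0).toString(16)); // ae
```

An XML processor does not apply percent-decoding to element content, so it sees only the six literal characters.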

[snip]

Mike
Nov 11 '06 #18
VK

Michael Winter wrote:
<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark)
<item>%C2%AE</item>
</repository>

Moron!
Halfwit.
That doesn't use the UTF-8 encoding form.
But I'm in a rather good mood so still accepting the homework by
Monday. You are even allowed (though not suggested) to extend the
assignment: make a document in "truly deeply UTF-8" encoding - whatever
it is in your mind - which one could "transmit verbatim over network".
With so many technical details in this thread a bit of fun can be
useful.

Nov 11 '06 #19
VK

Bart Van der Donck wrote:
With <form method="get", the browser tries to pass the characters
to the server in the character set of the page
Sorry to correct but it's an important one:
IE 6,7 will always pass the form data with GET as UTF-8 encoded
sequences (in the default configuration). It is regulated by Tools >
Internet Options > Advanced > Always send URLs as UTF-8
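For reference, this is what "UTF-8 encoded sequences" in a GET query string look like. The sketch does not reproduce or verify IE's behaviour; it only uses the standard `encodeURIComponent` function, which always percent-encodes the UTF-8 octets of a string. The field name `i` and the value are borrowed from the test form earlier in the thread purely as an illustration.

```javascript
// Value from the earlier test form: three halfwidth katakana characters.
var value = '\uFF94\uFF82\uFF76';

// Each character becomes three UTF-8 octets, each written as %XX.
var query = 'i=' + encodeURIComponent(value);
console.log(query); // i=%EF%BE%94%EF%BE%82%EF%BD%B6

// A server that assumes UTF-8 can reverse the process exactly.
var roundTrip = decodeURIComponent(query.slice(2));
console.log(roundTrip === value); // true
```

If the browser instead encodes the value in the page's character set, the octets in the query string differ, so the server has to know which encoding was used before decoding.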

Nov 11 '06 #20
