Input Character Set Handling

Hi

I am struggling to find definitive information on how IE 5.5, 6 and 7
handle character input (I am happy with the display of text).
I have two main questions:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1 etc) to UTF-8 prior to sending
the input back to the server?

2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
I think that the answer to question 1 is probably "YES", but I cannot
find any information on question 2!
Many thanks for your help
Kulgan.

Nov 10 '06
VK wrote:

[snip]

>> [R. Cornford:]
>> (the registered trade mark character is the C2 AE sequence just
>> before the 3C at the end of the penultimate line).
>
> Wow! Now I see. Sorry for being so slow, but it just takes a bit for
> such sophisticated hack.

A hack? No, simply how UTF-8 works. I have no idea what it was you
posted earlier, but it was not a UTF-8 encoded document (at least not in
the spirit it was meant to be).

[snip]

Mike
Nov 12 '06 #31
VK

Michael Winter wrote:
> Where do you think Internet protocols are specified?

Mostly and mainly in the same place where the [window] object is: :-)
it goes per the traditions and per the "templatic" implementation.

Any way, I did some research (damn time zone change, I cannot get any
sleep). Sorry, I cannot post URLs as I used Perl scripts on one of our
clients' servers - they will not like it. Feel free to re-evaluate
yourselves; watch the shebang path as usual.

[ Test 1 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 2 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 3 ]
#!/usr/bin/perl
print "Content-Type: text/html\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[Test 1] sets the iso-8859-1 charset in the server header.

[Test 2] sets the iso-8859-1 charset in the server header but UTF-8 in
the META tag. The server header must take priority over META unless the
UA is broken (thus iso-8859-1 remains in effect).

[Test 3] sets UTF-8 in META only.

The variant with no charset set at all is not taken into consideration.
Feel free to break your browser yourselves :-)
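
The precedence rule behind Tests 1 to 3 can be sketched in a few lines. This is only an illustration of the rule as described in the post, not how any browser implements it; the function names charset_of and effective_charset are hypothetical, and the standard-library email.message.Message class is used purely as a convenient Content-Type parser.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset()


def effective_charset(http_content_type, meta_content_type=None):
    """The HTTP header charset wins; the META declaration is only a fallback."""
    return charset_of(http_content_type) or charset_of(meta_content_type)


# Test 1: header only. Test 2: header and META disagree. Test 3: META only.
print(effective_charset("text/html; charset=iso-8859-1"))
print(effective_charset("text/html; charset=iso-8859-1",
                        "text/html; charset=utf-8"))
print(effective_charset("text/html", "text/html; charset=utf-8"))
```

When neither source names a charset the function returns None, which corresponds to the "charset not set at all" case that the tests deliberately leave out.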

In each generated form I typed in the same Russian word, which sounds
like "probah" and which means, as I understand it, "a probe". See the
first match in the search results:
<http://www.google.com/search?hl=en&q=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0&btnG=Google+Search>

//////////////
[Test 1] (iso-8859-1 set by server header)
Charset reported by all UAs: iso-8859-1

Submission results:

IE 6.0
test=%EF%F0%EE%E1%E0

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B
//////////////
[Test 2] (iso-8859-1 set by server header, overrides meta tag)
Charset reported by all UAs: iso-8859-1

Submission results (watch the change for IE):

IE 6.0
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B
//////////////
[Test 3] (UTF-8 set by meta tag)

Charset reported by all UAs: UTF-8

Submission results:

IE 6.0
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Firefox 1.5
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Opera 9.02
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Nov 12 '06 #32
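
The three distinct query strings reported in post #32 can be reproduced offline, which makes it easier to see what each browser did with the word. The sketch below is a reconstruction, not a statement about browser internals. The helper names are hypothetical, and the reading that IE 6 fell back to the Windows-1251 code page in Test 1 is an assumption drawn from the fact that its bytes match that code page.

```python
from urllib.parse import quote

WORD = "\u043f\u0440\u043e\u0431\u0430"  # the Russian word typed into each form


def as_legacy_bytes(text, charset):
    """Percent-encode the raw bytes of a legacy 8-bit code page."""
    return quote(text, safe="", encoding=charset)


def as_numeric_references(text, page_charset):
    """Replace characters the page charset cannot represent with &#NNNN;
    references, then percent-encode the resulting ASCII text."""
    data = text.encode(page_charset, errors="xmlcharrefreplace")
    return quote(data, safe="")


def as_utf8(text):
    """Percent-encode the UTF-8 bytes of the text."""
    return quote(text, safe="", encoding="utf-8")


print("test=" + as_legacy_bytes(WORD, "cp1251"))            # IE 6.0, Test 1
print("test=" + as_numeric_references(WORD, "iso-8859-1"))  # Firefox/Opera, Tests 1-2
print("test=" + as_utf8(WORD))                              # all three, Test 3
```

The middle form is lossy from the server's point of view: a user who literally types "&#1087;" into the field produces the same bytes, so the server cannot tell the two cases apart.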
VK wrote:
>>> And for sure you have checked *what* charset is indicated in
>>> browser for your "UTF-8" ?
>>
>> Are you sure you are not, once again, looking for the wrong thing
>> in the wrong place? (for example, at the Encoding item in the
>> menu for IE's post-XSLT transformation representation of the XML).
>>
>> Firefox's 'View Page Info' has no trouble reporting the resource
>> as UTF-8, and a hex dump of the bytes actually sent shows:-
>>
>> 3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
>> 0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
>> 6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
>> 20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
>> 2F 72 6F 6F 74 3E 0A
>>
>> - which certainly is UTF-8 encoded (the registered trade mark
>> character is the C2 AE sequence just before the 3C at the end
>> of the penultimate line).
>
> Wow! Now I see. Sorry for being so slow, but it just takes a
> bit for such sophisticated hack. So instead of say "CYRILLIC
> CAPITAL LETTER A" (Unicode 0x0410) we are taking its UTF-8
> encoding 208 144 and placing two 8-bit encoded characters
> matching 208 and 144. Say in Cyrillic (Windows-1251) these
> will be CYRILLIC CAPITAL LETTER R and CYRILLIC SMALL LETTER
> DJE (Serbian). With UTF-8 properly declared parser will
> take these two characters together and display as one Unicode
> character CYRILLIC CAPITAL LETTER A. Just tried it: it works
> for modern browsers. Wow... I will definitely add it to our
> knowledge base, as a sample of what people may come up with
> with enough of free time available :-)

ROTLMLOL. It all just goes straight over your head, doesn't it?

> Sorry again to everyone for being so slow: but it's really...
> sophisticated.

Sophisticated? I suppose that depends on how rudimentary your intellect
is to start with.

Richard.
Nov 12 '06 #33
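
Both claims in post #33 are easy to check by hand. The sketch below decodes the quoted hex dump as UTF-8, then shows what happens when the UTF-8 bytes of U+0410 are read as Windows-1251 instead. The variable names are illustrative only.

```python
import unicodedata

# The hex dump quoted in the post, copied as-is.
HEX_DUMP = """
3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A
"""

# Strip all whitespace, turn the hex digits into bytes, decode as UTF-8.
document = bytes.fromhex("".join(HEX_DUMP.split())).decode("utf-8")
print(document)

# One character, two bytes: CYRILLIC CAPITAL LETTER A in UTF-8.
utf8_bytes = "\u0410".encode("utf-8")
print(list(utf8_bytes))

# Reading the same two bytes with the wrong code page yields two
# unrelated characters; nothing was "placed" there as characters.
misread = utf8_bytes.decode("cp1251")
print([unicodedata.name(ch) for ch in misread])
```

The two Windows-1251 characters appear only when the bytes are decoded with the wrong charset; the file itself contains two bytes that encode a single character.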
VK wrote:
> Bart Van der Donck wrote:
>> My test seems to indicate the opposite on MSIE6 + "Always send URL's as
>> UTF-8" checked:
>>
>> http://www.dotinternet.be/temp/example.htm -> %E9
>> http://www.dotinternet.be/temp/exampleUTF-8.htm -> %C3%A9
>>
>> Am I overlooking something ?
>
> Partially. The first URL leads to illegal HTTP transmission (no charset
> provided neither by page nor by server). This way it activates error
> correction mechanics in browser. And UA's error correction is all
> separate issue of conversation.

Okay, let's disable such correction mechanisms then; say the following
example in ISO-8859-1. It shows the same result:
http://www.dotinternet.be/temp/exampleISO.htm

I think it's like Michael Winter said (RFC 2616): "Media subtypes of
the 'text' type are defined to have a default charset value of
'ISO-8859-1' when received via HTTP". This specification seems to be
well obeyed by the browsers that I tested.

> Say IE 6 SP1 / Win 98SE studies the input stream and by some formal
> signs decides that it's Cyrillic.

If that happened, it would still get encoded to %E9 in a query
string. It's only the browser that decides how to display the
character, be it HTML entity И (Cyrillic) or é (Latin-1).
When you change the character table, %E9 might point to a Latin,
Cyrillic or Swahili sign, depending on whatever table is used. That
has no effect on query string encoding; those are two separate things.

> These "formal signs" are very fragile and the source is wide open for
> the "Korean issue" and "Characters jam" effects. They don't happen here
> just because of the simplicity of the page content.

Yes, true.

--
Bart

Nov 13 '06 #34
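
The distinction drawn in post #34, between which bytes go into the query string and which glyph a given byte is displayed as, can be illustrated as follows. The choice of KOI8-R and Windows-1251 as the Cyrillic tables is an assumption made for the example; the post does not say which table was meant.

```python
from urllib.parse import quote, unquote_to_bytes

# Encoding side: the same character yields different bytes per charset.
latin1_query = quote("\u00e9", safe="", encoding="iso-8859-1")  # e with acute
utf8_query = quote("\u00e9", safe="", encoding="utf-8")
print(latin1_query, utf8_query)

# Display side: the single byte behind %E9 maps to a different character
# in each 8-bit table, but the byte itself never changes.
raw = unquote_to_bytes("%E9")
readings = {name: raw.decode(name) for name in ("iso-8859-1", "koi8_r", "cp1251")}
for name, char in readings.items():
    print(name, char)
```

Changing the table used for display changes only the second half of this picture; the first half depends solely on the charset the browser chose for the submission.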
VK wrote:
> You come to say to any Java team guy "Unicode" (unlike
> "Candyman" one time will suffice :-) and then run away quickly
> before he started beating you.

What a luxury. In the Perl world everybody starts fighting with
everybody.

--
Bart

Nov 13 '06 #35
VK wrote:
> Michael Winter wrote:
>> Where do you think Internet protocols are specified?
>
> Mostly and mainly in the same place where the [window] object is: :-)
> it goes per the traditions and per the "templatic" implementation.

You want to compare the object model of competing products to
interworking network protocols?

> Any way, I did some research ...

Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?

[snip]

Mike
Nov 14 '06 #36
VK
>> Any way, I did some research ...
>
> Why? The document I cited from Alan Flavell had already drawn the
> necessary conclusions. Did you read it?

Alan Flavell has no idea (AFAICT) neither about the Korean Issue, nor
about the Character Jam nor about the Phenomenon of the first non-ASCII
character as such. This way he is not an authority to me until
knowledge of these issues is demonstrated somewhere else in his books.

Nov 15 '06 #37
VK wrote:
>>> Any way, I did some research ...
>>
>> Why? The document I cited from Alan Flavell had already drawn the
>> necessary conclusions. Did you read it?
>
> Alan Flavell has no idea (AFAICT) neither about the Korean Issue,

The only time you've referred to a "Korean Issue" in the past was caused
by a failure in MSIE to detect an encoding scheme correctly, producing
rather odd results when it guessed UTF-7. The solution to that is
obvious, and Alan addresses it indirectly by recommending that the user
agent should never need to guess. That said, he does touch on it:

    In that analysis, I've disregarded utf-7 format (which would be
    wrongly identified as us-ascii), as being inappropriate for use
    in an HTTP context. One might mention, however, that when MSIE
    is set to auto-detect character encodings, it has been known to
    mis-identify some us-ascii pages, claiming them to be in utf-7.
    -- Heuristic recognition of utf-8?,
       FORM submission and i18n, Alan J. Flavell
       <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

> nor about the Character Jam nor about the Phenomenon of the first
> non-ASCII character as such.

If you want a sensible discussion of the issues, actually describe them
properly.

[snip]

Mike
Nov 15 '06 #38
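
The reason a UTF-7 guess is both easy to make and damaging is that UTF-7 data consists only of ASCII bytes, so one and the same byte string is valid under both labels but means different things. A minimal sketch follows; the sample bytes are made up for illustration and are not taken from any page discussed in the thread.

```python
# Every byte here is plain ASCII, so the data is valid us-ascii ...
data = b"1 +AK4- 2"
is_pure_ascii = all(byte < 0x80 for byte in data)

as_ascii = data.decode("ascii")
# ... but a decoder that guesses UTF-7 treats "+AK4-" as a shifted
# sequence and produces a different character.
as_utf7 = data.decode("utf-7")

print(is_pure_ascii)
print(repr(as_ascii))
print(repr(as_utf7))
```

Declaring the charset explicitly removes the need for the guess, which is the recommendation attributed to Flavell in the post.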
VK
Michael Winter wrote:
> If you want a sensible discussion of the issues, actually describe them
> properly.

The issue is that UAs behave unstably when no charset is indicated in
some way. That is especially true for IE6, which also happens to be the
most widely used UA at this time. IE6 is a very old, I would say ancient,
browser (by the Web time scale) with support for Unicode and UTF-__
encodings bolted on top as an add-on, somehow, anyhow.

This is only distantly related to JavaScript programming, though. Maybe
I'll make a demo page showing what an innocent page can do to IE6 if no
charset is provided.

Nov 15 '06 #39
Hello!

"VK" <sc**********@yahoo.com> wrote in message news:11**********************@e3g2000cwe.googlegroups.com...
...
>> - which certainly is UTF-8 encoded (the registered trade mark character
>> is the C2 AE sequence just before the 3C at the end of the penultimate
>> line).
>
> Wow! Now I see. Sorry for being so slow, but it just takes a bit for
> such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
> (Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
> two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
> (Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
> SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
> take these two characters together and display as one Unicode character
> CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
> Wow... I will definitely add it to our knowledge base, as a sample of
> what people may come up with with enough of free time available :-)
>
> Sorry again to everyone for being so slow: but it's really...
> sophisticated.

Sophisticated? Hack (from another message)?
But you wrote that you deal with, say, Japanese and Korean 'legacy'
encodings, so you do know what Shift_JIS is, right? Then why do you
write such nonsense:
> take these two characters together and display as one
?

"Two characters"??? UTF-8 is the same kind of multi-byte encoding as
Shift_JIS - do you write about ONE Japanese letter which is encoded by
2 bytes in Shift_JIS in the same manner, that is,
"... one byte matches... character, 2nd byte matches... character, then
these 2 characters together ... one Japanese letter"?

There are no "characters" there, just 2 bytes that represent one Cyrillic
letter in the multi-byte encoding "UTF-8" -
the same way as another 2 bytes represent one Japanese letter in the
multi-byte encoding "Shift_JIS".
As Michael wrote, you somehow did not think about the serialization,
about files on the disk.
I don't know why you did not know before about, say, .HTML files
containing pure UTF-8 text (i.e. real UTF-8 characters as multi-byte
items) to produce a multilingual page - such I18n examples and
well-known pages have existed on the Web since I became an I18n engineer
back in 1997 :-)

For example, for my Cyrillic (Russian) instructional site I prepared
a section "Multilingual HTML" many, many years ago -
it included preparation of the .htm _files_ containing UTF-8 text -
no one in their right mind would have _large_ text represented as in
your examples of UTF-8 - <item>%C2%AE</item> - how do you think
a web site owner would _edit/correct_ such a page if, instead of
_readable_ text (say Russian+German letters in UTF-8 encoding), it
contained just things like >%C2%AE?

Strange (based on your statements of I18n knowledge) that we here have
to explain to you UTF-8 facts written, say, for _beginners_ at least
6 years ago on my site in the "Multilingual HTML" section (M.Flavell's
site is listed there as a source for non-beginners):
http://RusWin.net/mix.htm

It has UTF-8 examples, too: http://RusWin.net/utf8euro.htm
and http://RusWin.net/utf8-jap.htm

The same can be said about XML. In both XML and HTML serialization
(files on disk) it is a VERY _common_ practice to have real UTF-8
text in .xml and .html files.

--
Regards,
Paul
Javascript Virtual Keyboard working in Opera, Mozilla, IE:
http://Kbd.RusWin.net


Nov 19 '06 #40
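
The parallel drawn in post #40 between UTF-8 and Shift_JIS can be checked directly: in each encoding a single letter is serialized as two bytes, and decoding those two bytes gives back one character, not two. The sketch below is only an illustration; the describe helper and the choice of the hiragana letter A as the Japanese sample are assumptions made for the example.

```python
def describe(char, charset):
    """Encode one character and report its bytes and the round-trip length."""
    data = char.encode(charset)
    return {
        "bytes": " ".join(f"{byte:02X}" for byte in data),
        "byte_count": len(data),
        "char_count": len(data.decode(charset)),
    }


cyrillic_a = describe("\u0410", "utf-8")      # CYRILLIC CAPITAL LETTER A
hiragana_a = describe("\u3042", "shift_jis")  # HIRAGANA LETTER A

print(cyrillic_a)
print(hiragana_a)
```

In both cases byte_count is 2 and char_count is 1, which is the point of the post: the bytes in the file are a serialization of one character under the declared encoding.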
