
Input Character Set Handling

Hi

I am struggling to find definitive information on how IE 5.5, 6 and 7
handle character input (I am happy with the display of text).
I have two main questions:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1 etc) to UTF-8 prior to sending
the input back to the server?

2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
I think that the answer to question 1 is probably "YES", but I cannot
find any information on question 2!
Many thanks for your help
Kulgan.

Nov 10 '06 #1
44 Replies


Kulgan wrote:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1 etc) to UTF-8 prior to sending
the input back to the server?
With <form method="get">, the browser tries to pass the characters
to the server in the character set of the page, but it will only
succeed if the characters in question can be represented in that
character set. If not, browsers calculate "their best bet" based on
what's available (old style) or use a Unicode encoding (new style).

Example: western browsers send 'é' as '%E9' by default (URL encoding).
But when the page is in UTF-8, the browser will first look up the
Unicode multibyte encoding of 'é'. In this case, that is 2 bytes,
because 'é' lies above code point 127 (the two-byte range of UTF-8).
Those two bytes are C3 and A9, and will result in '%C3%A9' (URL
encoding) in the eventual query string.

<form method="post" enctype="application/x-www-form-urlencoded" is
the same as <form method="post" and uses the same general principle
as GET.

In <form method="post" enctype="multipart/form-data" there is no
default encoding at all, because this encoding type needs to be able to
transfer non-base64-ed binaries. '' will be passed as '' and that's
it.
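A rough illustration of the two styles, using standard JavaScript functions
that happen to mirror them (a sketch only, not the browser's own submission
code):

// encodeURIComponent always percent-encodes the UTF-8 octets of a character,
// which is what submission from a UTF-8 page produces:
encodeURIComponent('\u00E9');   // "%C3%A9"  (e-acute as the two UTF-8 octets C3 A9)

// The deprecated escape() uses the old single-byte style for code points
// below 256, which is what an ISO-8859-1 page produces:
escape('\u00E9');               // "%E9"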
2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
Browsers only encode form values between the moment that the user
submits the form and the moment that the new POST/GET request is made.
You should have no problem to use any of the Unicode characters in
javascript as long as you haven't sent the form.

Hope this helps,

--
Bart

Nov 10 '06 #2

Browsers only encode form values between the moment that the user
submits the form and the moment that the new POST/GET request is made.
You should have no problem to use any of the Unicode characters in
javascript as long as you haven't sent the form.
Thanks for the helpful info.

On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters? Does the browser do an
SJIS to UTF-8 conversion on the characters before they are used (e.g.
to find the length of the string?)

Thanks,

Kulgan.

Nov 10 '06 #3

VK
Kulgan wrote:
2. Does IE Javascript do the same? So if I write a Javascript function
that compares a UTF-8 string to a string that a user has inputted into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
That is a confusion inspired by Unicode, Inc. and the W3C (I often
wonder whether they have any clue at all about Unicode).

Unicode is a *charset* : a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding). At the same time TCP/IP protocol is an 8-bit media: its
atomic unit is one byte. This way one cannot directly send Unicode
entities over the Internet: same way as you cannot place a 3D box on a
sheet of paper, you can only emulate it (making its 2D projection). So
it is necessary to use some 8-bit *encoding* algorithm to split Unicode
characters into sequences of bytes, send them over the Internet and
glue them back together on the other end. Here UTF-8 *encoding* (not
*charset*) comes into play. By some special algorithm it encodes
Unicode characters into base ASCII sequences and sends them to the
recipient. The recipient - informed in advance by the Content-Type header
of what is coming - uses a UTF-8 decoder to get back the original Unicode
characters.
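For the curious, a minimal sketch in JavaScript of the bit-shuffling such an
encoder does (code points up to U+FFFF only; illustrative, not how any
particular browser implements it):

function utf8Bytes(codePoint) {
  if (codePoint < 0x80) {                 // 1 octet: 0xxxxxxx
    return [codePoint];
  }
  if (codePoint < 0x800) {                // 2 octets: 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6),
            0x80 | (codePoint & 0x3F)];
  }
  return [0xE0 | (codePoint >> 12),       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

utf8Bytes(0x00E9); // [0xC3, 0xA9] -> "%C3%A9" in a query string
utf8Bytes(0x0410); // [0xD0, 0x90] -> the bytes for CYRILLIC CAPITAL LETTER A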
The Fact Number One, unknown to the majority of specialists, including
the absolute majority of W3C volunteers - so feel yourself a chosen
one :-) -
Pragma <?xml version="1.0" encoding="utf-8"?> which one sees left and
right in XML and pseudo-XHTML documents *does not* mean that this
document is in UTF-8 encoding. It means that the document is in Unicode
charset and it must be transmitted (if needed) over an 8-bit media
using the UTF-8 encoding algorithm. Respectively, if the document is not
using the Unicode charset then you are making a false statement, with
numerous nasty outcomes pending if it is ever used on the Internet.
Here is even more secret knowledge, shared between myself and Sir
Berners-Lee only :-) -
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
*does not* mean that the characters you see on your screen are in
"UTF-8 charset" (there is not such). It means: "The input stream was
declared as Unicode charset characters encoded using UTF-8 transport
encoding. The result you are seeing (if seeing anything) is the result
of decoding the input stream using UTF-8 decoder".
"charset" term here is totally misleading one - it remained from the
old times with charsets of 256 entities maximum thus encoding matching
charset and vice versa. The proper header W3C should insist on is
....content="text/html; charset=Unicode; encoding=UTF-8"
As I said before, very few people on Earth know the truth, and the
Web has not collapsed so far for two main reasons:
1) The Content-Type header sent by the server takes precedence over the META tag on
the page. This HTTP standard is one of the most valuable ones left to us by the
fathers. They foresaw the ruling ignorance, so they left server
admins the chance to save the world :-)
2) All modern UA's have special heuristics built in to sort out real
UTF-8 input streams from authors' mistakes. A note for the "Content-Type
in my heart" adepts: it means that over the last years a great amount
of viewer-dependent XML/XHTML documents has been produced.

Sorry for such an extremely long preface, but I considered it dangerous to
just keep giving "short fix" advice: that is fighting the symptoms
instead of the sickness. And the sickness is growing worldwide: our
helpdesk is flooded with requests like "my document is in UTF-8
encoding, why..." etc.

Coming back to your original question: the page will be either Unicode
or ISO-8859-1 or something else: but it *never* will be UTF-8: UTF-8
exists only during the transmission and parsing stages. The maximum one
can do is to have UTF-8 encoded characters right in the document like
%D0%82... But in that case it is just raw UTF-8 source represented
using the ASCII charset.
From the other side, JavaScript operates with Unicode only and it sees
the page content "through the window of Unicode" no matter what the
actual charset is. So to reliably compare user input / node values with
JavaScript strings you have two options:
1) The most reliable one for an average-small amount of non-ASCII
characters:
Use \u Unicode escape sequences (a small sketch follows below).

2) Less reliable, as it can easily be smashed once opened in a non-Unicode
editor:
Have the entire .js file in Unicode with non-ASCII characters typed as
they are, and your server sending the file in UTF-8 encoding.
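A small sketch of option 1 (the form and field names here are made up for the
example):

var expected = '\u043F\u0440\u043E\u0431\u0430';  // the word that would otherwise
                                                  // be typed as literal Cyrillic
if (document.forms[0].elements['test'].value === expected) {
  alert('match');
}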

P.S. There is a whole other issue which could be named "How do I handle
Unicode 32-bit characters, or How did Unicode, Inc. screw the whole
world". But your primary question is answered, and it's beer time
anyway. :-)

Nov 10 '06 #4

Kulgan wrote:
[...]
On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters?
Always the same, as their Unicode code points.
Does the browser do an SJIS to UTF-8 conversion on the characters
before they are used (e.g. to find the length of the string?)
No conversion/encoding is possible on that level. I think you're not
fully aware of the distinction between
(1) the user's (available) charsets
(2) the charset of the web page
(3) how javascript handles characters internally

Only (3) is of importance in your case:

Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>

Note that it doesn't matter whether the user has SJIS installed. It
also doesn't matter what the charset of the page is.

--
Bart

Nov 10 '06 #5

VK wrote:
[...]
Unicode is a *charset* : a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding).
[...]
I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
yes, but Unicode's primary idea is the multi-byte concept; characters
can also consist of 1 byte, or more than 2.

--
Bart

Nov 10 '06 #6

VK

Bart Van der Donck wrote:
[...]
Unicode is a *charset* : a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding).
[...]
I wouldn't put it that way. Some Unicode characters consist of 2 bytes,
yes, but Unicode's primary idea is the multi-byte concept; characters
can also consist of 1 byte, or more than 2.
I humbly disagree: the very original Unicode idea is that 8 bits cannot
accommodate all char codes for all characters currently used in the
world. This way it was an obvious idea to use a two-byte encoding with
65,535 possible character units: to represent all
*currently used* systems of writing. While some Far East systems
(Hangul, Traditional Chinese) would be a space challenge - the majority
of other systems are based on the Phoenician phonetic alphabet (> Greek >
Latin > others) so relatively very compact. This way 65,535 storage units were more than generous for the task.
From the other end, at the moment the project started, US English
(base ASCII) texts were absolutely prevailing in transmissions, so
the task was not to double the HTTP traffic with useless 0x00 bytes. To
avoid that it was decided that the bytes 0-127 would be treated
literally as base ASCII characters and anything 128-255 would be treated
as the beginning of a double-byte Unicode sequence. Alas, it meant that
0x8000 - 0xFFFF (a good half of the table) would be unusable. Luckily,
Pike and Thompson found a way of economic, unambiguous transmission
of any characters in the 0-65535 range meeting the core requirement: do not
double the traffic with Unicode-encoded base-ASCII characters. This
algorithm - later called UTF-8 - went into wide production. It
doesn't mean that English "A" is represented with a single byte
in Unicode: it means that the Unicode double-byte character 0x0041 (Basic
Latin LATIN CAPITAL LETTER A) has a universally recognized single-byte
shortcut 0x41.
That would be a happy ending, but unfortunately Unicode, Inc. treated
the 65,535 storage places as a teenager would treat his first credit card
- rolling it out on the first occasion without thinking of the
consequences. Any shyster coming with any kind of crap tables was
immediately welcomed and accounted for. This way Unicode, Inc. started to
work on a "first come - first served" basis and the original idea of
"all currently used charsets" was seamlessly transformed into
"all symbolic systems ever used for any purpose by human
civilization". Well, predictably for language specialists - but
surprisingly for the Unicode, Inc. amateurs - it appeared that
humanity has produced a countless number of systems to denote sounds,
syllables, words, ideas, musical sounds, chemical elements and an
endless number of other material and spiritual entities. This way they
spent all the available storage space on rarely used crap before even
fixing the place for such "minor" issues as Chinese or Japanese. As
a result they had to go from a 2-byte system to a 3-byte system and now
they seem to be exploring the storage space of a 4-byte system. And this is
even without yet touching Egyptian hieratic/demotic and all variants of
cuneiform. And there is no one so far to come, send the fn amateurs to
hell and bring the Unicode system into order.

Go and say "Unicode" to any Java team guy (unlike with
"Candyman", one time will suffice :-) and then run away quickly
before he starts beating you.

Yes, I am biased on the matter: I hate "volunteers" who are sure that
whatever they are doing is right just because they are doing it for
free (and only seemingly for free).

Nov 10 '06 #7

VK wrote:
Kulgan wrote:
>2. Does IE Javascript do the same? So if I write a Javascript
function that compares a UTF-8 string to a string that a user has
inputted into a text box, will IE convert the user's string into
UTF-8 before doing the comparison?

That is confusion inspired by Unicode, Inc. and W3C (I'm wondering
rather often if they have any clue at all about Unicode).
Oh, here we go.
Unicode is a *charset* ...
It's a character encoding: characters are encoded as an integer within a
certain "codespace", namely the range 0..10FFFF. There are then
"encoding forms" that transform values in this range to "code units",
specifically the three Unicode Transformation Formats, UTF-8, -16, and
-32. These code units can be used to store or transport sequences of
"encoded characters". The "encoding scheme" (which includes big- and
little-endian forms for UTF-16 and -32) defines precisely how each form
is serialised into octets.
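To make that concrete with a single character (illustrative only):

// The encoded character U+00E9 in the code units of each encoding form:
//   UTF-8  (8-bit code units):  C3 A9      (two code units)
//   UTF-16 (16-bit code units): 00E9       (one code unit)
//   UTF-32 (32-bit code units): 000000E9   (one code unit)
// JavaScript strings are themselves sequences of 16-bit code units:
'\u00E9'.length;        // 1
'\u00E9'.charCodeAt(0); // 233, i.e. 0x00E9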

[snip]
Here UTF-8 *encoding* (not *charset*) comes into play. By some
special algorithm it encodes Unicode characters into base ACSII
sequences and send them to the recipient.
Whilst some encoded characters will map directly to ASCII (specifically
the Unicode code points, 0..7F), most won't. For a start, ASCII is a
7-bit encoding (128 characters in the range 0..7F), whereas UTF-8 is an
8-bit, variable-width format.

The word you are looking for is "octet".

[snip]
Pragma <?xml version="1.0" encoding="utf-8"?>
It is the XML declaration and takes the form of a processing instruction.
... *does not* mean that this document is in UTF-8 encoding.
That depends on what you mean by "in UTF-8 encoding". If you meant
"serialised using the UTF-8 encoding scheme", then that's precisely what
it means. However, it is unnecessary to include an XML declaration for
documents that use either the UTF-8 or -16 encoding form (see 4.3.3
Character Encoding in Entities).
It means that the document is in Unicode charset ...
All XML documents (and HTML, for that matter) use the Unicode
repertoire. The issue is the form in which the document is transported.
Should a higher protocol not signal the encoding form in use (UTF-8,
ISO-8859-1, etc.) then the XML declaration serves that purpose.

[snip]
Coming back to your original question: the page will be either Unicode
or ISO-8859-1 or something else: but it *never* will be UTF-8: UTF-8
exists only during the transmission and parsing stages.
UTF-8 can be used any time the document needs to be serialised into a
sequence of octets. Therefore, a document might be stored on disk using
UTF-8, and then transmitted verbatim across a network.

[snip]

Mike
Nov 10 '06 #8

"Bart Van der Donck" <ba**@nijlen.comwrote in
news:11**********************@h48g2000cwc.googlegr oups.com:
Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ
if (document.forms[0].i.value == '\uFF94;')
Not equal

3 Paste ヤ
if (document.forms[0].i.value == 'ヤ')
Not equal

4 Paste &amp;
if (document.forms[0].i.value == '&amp;')
Not equal

5 Paste abc
if (document.forms[0].i.value == 'abc')
Equal

6 Paste &
if (document.forms[0].i.value == '&')
Equal

7 Paste &
if (document.forms[0].i.value == '&') //ascii decimal
Equal

8 Paste &
if (document.forms[0].i.value == '\x26') //ascii hex
Equal

9 Paste &
if (document.forms[0].i.value == '\46') //ascii octal
Equal

10 Paste &
if (document.forms[0].i.value == '\u0026') //unicode
Equal

11 Paste &
if (document.forms[0].i.value == '&amp;') //html character entity
Equal

Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself, as its ascii code (decimal, hex, or
octal), as its unicode, or as its html character entity.

2. However, Javascript does *not* correctly recognize a character entered
by typing its ascii code, unicode, or html character entity into a text
box.

Nov 11 '06 #9

On the Javascript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does Javascript see the characters?

Always the same, as their Unicode code points.
Many thanks for the advice. I am starting to get an understanding of
what is going on now!! Are you saying that if the user's Windows
character set is not Unicode that Javascript sees characters inputted
into text boxes as Unicode? Or are modern Windows (XP) installations
always Unicode for data input anyway??

Can of worms...!

Kulgan.

Nov 11 '06 #10

Jim Land (NO SPAM) wrote:
"Bart Van der Donck" <ba**@nijlen.comwrote in
news:11**********************@h48g2000cwc.googlegr oups.com:
Posts like yours are dangerous; Google Groups displays html char/num
entities where you haven't typed them and vice versa. I can imagine
that most news readers will have trouble with it too; that's why I've
put some work into restricting my previous post to ISO-8859-1 so everybody
sees it correctly.
Paste into input field:<br>
ヤツカ
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ
if (document.forms[0].i.value == '\uFF94;')
Not equal

3 Paste ヤ
if (document.forms[0].i.value == 'ヤ')
Not equal

4 Paste &amp;
if (document.forms[0].i.value == '&amp;')
Not equal

5 Paste abc
if (document.forms[0].i.value == 'abc')
Equal

6 Paste &
if (document.forms[0].i.value == '&')
Equal

7 Paste &
if (document.forms[0].i.value == '&') //ascii decimal
Equal

8 Paste &
if (document.forms[0].i.value == '\x26') //ascii hex
Equal

9 Paste &
if (document.forms[0].i.value == '\46') //ascii octal
Equal

10 Paste &
if (document.forms[0].i.value == '\u0026') //unicode
Equal

11 Paste &
if (document.forms[0].i.value == '&amp;') //html character entity
Equal
I suppose your testing results should be fine, two thoughts:
- beware of leading/trailing spaces when you copy/paste
- (document.forms[0].i.value == '\uFF94;') doesn't equal because the
semicolon shouldn't be there
Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself,
Yes.
as its ascii code (decimal, hex, or octal),
Yes, but only when it's an ASCII character (which is nowadays too
narrow to work with).
as its unicode,
Yes.
or as its html character entity.
I'd say this is a bridge too far; there might be browser dependencies
when it comes to num/char entity handling in forms. I would tend
not to rely too much on this kind of stuff.
2. However, Javascript does *not* correctly recognize a character entered
by typing its ascii code, unicode, or html character entity into a text
box.
Correct by definition; e.g. when you type "\x41", it will be treated as
"\x4" and not as "A", because you typed "\x4" and not "A" :-) But it's
possible to write a script to modify such behaviour.
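A hypothetical sketch of such a script (the helper name is made up): it turns
a literally typed sequence like \u0041 into the character it names before the
comparison is made.

function unescapeUnicode(text) {
  // replace each literal "\uXXXX" the user typed with the character it names
  return text.replace(/\\u([0-9A-Fa-f]{4})/g, function (m, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

unescapeUnicode('\\u0041bc'); // "Abc"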

--
Bart

Nov 11 '06 #11

Kulgan wrote:
Many thanks for the advice. I am starting to get an understanding of
what is going on now!! Are you saying that if the user's Windows
character set is not Unicode that Javascript sees characters inputted
into text boxes as Unicode?
Yes, always.
Or are modern Windows (XP) installations always Unicode for data
input anyway??
I'm not sure of that, but it doesn't matter here. You can input
whatever you want from any charset on any OS using any decent browser.
Javascript will always handle it internally as Unicode code-points;
each javascript implementation is built that way.
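A quick way to see that for yourself (the field reference is only an example):

var s = document.forms[0].elements[0].value; // user typed or pasted an e-acute,
                                             // however the keyboard/IME produced it
s.length;                                    // 1
s.charCodeAt(0);                             // 233, i.e. U+00E9 - a Unicode code
                                             // point, not a Latin-1 or SJIS byte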
Can of worms...!
True, but with some basic rules and a lot of common sense, most
situations can be dealt with.

--
Bart

Nov 11 '06 #12

VK
Oh, here we go.

Oh, here we go :-): someone is going to teach me about Unicode. For some
reasons - which I'll decline to disclose - it is funny to me, but go ahead
anyway.
It's a character encoding: characters are encoded as an integer within a
certain "codespace", namely the range 0..10FFFF.
Unicode is a charset (set of characters) with each character unit
represented by words (in the programming sense), with the smallest word
consisting of 2 bytes (16 bits). This way the range doesn't go from 0:
there is no such character in Unicode. Unicode starts from the
character 0x0000. Again you are thinking and talking about character
entities, bytes, Unicode and UTF-8 at once: which is not helpful if one
tries to understand the matter.
There are then
"encoding forms" that transform values in this range to "code units",
specifically the three Unicode Transformation Formats, UTF-8, -16, and
-32. These code units can be used to store or transport sequences of
"encoded characters". The "encoding scheme" (which includes big- and
little-endian forms for UTF-16 and -32) defines precisely how each form
is serialised into octets.
That is correct.

<snip>
For a start, ASCII is a
7-bit encoding (128 characters in the range 0..7F)
I prefer to use the old term lower-ASCII to refer to the 0-127 part, where
the 128-255 part is variable, used for extra entities and differing from one
charset to another. A more academically correct term could be
"IBM tables" and respectively "lower part of IBM tables" but who
remembers this term now? "lower-ASCII" in the sense "0-127 characters"
or "US ASCII" is good enough for the matter.
whereas UTF-8 is an
8-bit, variable-width format.
Again you are mixing charsets and bytes. UTF-8 is a transport encoding
representing Unicode characters using "US ASCII" only character
sequences.
a document might stored on disk using
UTF-8, and then transmitted verbatim across a network.
Technically well possible but for what reason? (besides making a copy
in another storage place). Such document is not viewable without
specially written parser and not directly usable for Internet. So what
would the purpose of such a document be?

Nov 11 '06 #13

"Bart Van der Donck" <ba**@nijlen.comwrote in
news:11*********************@f16g2000cwb.googlegro ups.com:
Jim Land (NO SPAM) wrote:
>"Bart Van der Donck" <ba**@nijlen.comwrote in
news:11**********************@h48g2000cwc.googleg roups.com:

Posts like yours are dangerous; Gougle Groups displays html char/num
entities where you haven't typed them and vice versa. I can imagine
that most News Readers will have trouble with it too; that's why I've
put some work to restrict my previous post to ISO-8859-1 so everybody
sees it correctly.
Thank you for pointing this out. For those reading posts in a reader
that mangles, I have clarified below by inserting spaces so the string
cannot be rendered as a special character.
Paste into input field:<br>
ヤツカ \\ & # 65428; & # 65410; & # 65398;
<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76') {
\\ \ u FF94 \ u FF82 \ u FF76
alert('equal') }
else {
alert('not equal')
}
">
</form>
Not equal.

2 Paste ヤ \\ & # 65428 ;
if (document.forms[0].i.value == '\uFF94;') \\ \ u FF94 ;
Not equal

3 Paste ヤ \\ & # 65428 ;
>if (document.forms[0].i.value == 'ヤ') \\ & # 65428 ;
>Not equal

4 Paste &amp; \\ & amp ;
if (document.forms[0].i.value == '&amp;') \\ & amp ;
Not equal

5 Paste abc \\ abc
if (document.forms[0].i.value == 'abc') \\ abc
Equal

6 Paste & \\ single character
if (document.forms[0].i.value == '&') \\ single character
Equal

7 Paste & \\ single character
if (document.forms[0].i.value == '&') // & # 38; ascii decimal
Equal

8 Paste & \\ single character
if (document.forms[0].i.value == '\x26') // \ x 26 ascii hex
Equal

9 Paste & \\ single character
if (document.forms[0].i.value == '\46') // \ 46 ascii octal
Equal

10 Paste & \\ single character
if (document.forms[0].i.value == '\u0026') // \ u 0026 unicode
Equal

11 Paste & \\ single character
if (document.forms[0].i.value == '&amp;')
// & amp ; html character entity
>Equal

I suppose your testing results should be fine, two thoughts:
- beware of leading/trailing spaces when you copy/paste
- (document.forms[0].i.value == '\uFF94;') doesn't equal because the
semicolon shouldn't be there
Thanks, my typo. But still not equal when semicolon is removed.
>
>Are the following conclusions correct?

1. When a single character is typed in an input box, Javascript can
correctly recognize it as itself,

Yes.
>as its ascii code (decimal, hex, or octal),

Yes, but only when it's an ASCII character (which is nowadays too
narrow to work with).
>as its unicode,

Yes.
>or as its html character entity.

I'ld say this is a bridge too far; there might be browser dependencies
when it comes too num/char entity handling in forms. I would tend to
not rely too much on this kind of stuff.
>2. However, Javascript does *not* correctly recognize a character
entered by typing its ascii code, unicode, or html character entity
into a text box.

Correct by definition; eg when you type "\x41", it will be treated as
"\x4" and not as "A", because you typed "\x4" and not "A" :-) But it's
possible to write a script too modify such behaviour.
I believe you meant, 'when you type "\x41" (\ x 41), it will be treated
as
"\x41" (\ x 41) and not as "A", because you typed "\x41" (\ x 41) and
not "A"'
Nov 11 '06 #14

VK wrote:

[snip]
>It's a character encoding: characters are encoded as an integer
within a certain "codespace", namely the range 0..10FFFF.

Unicode is a charset (set of characters)
Character set and character encoding are synonymous; however, Unicode is
not defined using the former.
with each character unit represented by words (in the programming
sense) with the smallest word consisting of 2 bytes (16 bits).
If by "character unit" you mean code point, that's nonsense. A code
point is an integer, simple as that. How it is represented varies.
This way the range doesn't go from 0: there is not such character in
Unicode.
In the Unicode Standard, the codespace consists of the integers
from 0 to 10FFFF [base 16], comprising 1,114,112 code points
available for assigning the repertoire of abstract characters.
-- 2.4 Code Points and Characters,
The Unicode Standard, Version 4.1.0
Unicode starts from the character 0x0000.
The Unicode codespace starts from the integer 0. The first assigned
character exists at code point 0.
Again you are thinking and talking about character entities, bytes,
Unicode and UTF-8 at once:
No, I'm not. I used terms that are distinctly abstract.

It seems to me that you are mistaking a notational convention -
referring to characters with the form U+xxxx - for some sort of definition.
which is not helpful if one tries to understand the matter.
Quite. Why then do you try so hard to misrepresent technical issues?

[snip]
"lower-ASCII" in the sense "0-127 characters" or "US ASCII" is good
enough for the matter.
I'm not really going to debate the issue, so long as you understand what
I mean when I refer to ASCII.
>whereas UTF-8 is an 8-bit, variable-width format.

Again you are mixing charsets and bytes.
No, I'm not.
UTF-8 is a transport encoding representing Unicode characters using
"US ASCII" only character sequences.
My point was that, given your own definition of (US-)ASCII above, this
sort of statement is absurd. The most significant bit is important in
the octets generated when using the UTF-8 encoding scheme - all scalar
values greater than 7F are serialised to two or more octets, each of
which has the MSB set - yet you are describing it in terms of something
where only the lowest 7 bits are used to represent characters.

For example, U+0430 is represented by the octets D0 and B0. In binary,
these octets are 11010000 and 10110000, respectively. If UTF-8 uses "US
ASCII only character sequences", and you agree that US-ASCII is strictly
7-bit, do you care to explain that evident contradiction?
>a document might stored on disk using UTF-8, and then transmitted
verbatim across a network.

Technically well possible but for what reason? ...
Efficiency. Most Western texts will be smaller when the UTF-8 encoding
scheme is employed, as the 0..7F code points are the most common,
encompassing the common letters, digits, and punctuation.
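A back-of-envelope way to check that from JavaScript; utf8Length is a made-up
helper using the percent-encoding trick:

function utf8Length(s) {
  // count UTF-8 octets by percent-encoding and collapsing each %XX to one char
  return encodeURIComponent(s).replace(/%[0-9A-F]{2}/gi, '.').length;
}

utf8Length('cafe');       // 4 - pure ASCII costs the same as in ISO-8859-1
utf8Length('caf\u00E9');  // 5 - only the accented letter becomes two octets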
Such document is not viewable without specially written parser and
not directly usable for Internet.
Oh dear. Of all of the documents that use one of the Unicode encoding
schemes on the Web, I should think that the /vast/ majority of them use
UTF-8. As for "specially written parser", XML processors are required to
accept UTF-8 input and browsers at least as far back as NN4 also do so.

[snip]

Mike
Nov 11 '06 #15

VK
a document might stored on disk using UTF-8, and then transmitted
verbatim across a network.
Technically well possible but for what reason? ...
Such document is not viewable without specially written parser and
not directly usable for Internet.
Oh dear. Of all of the documents that use one of the Unicode encoding
schemes on the Web, I should think that the /vast/ majority of them use
UTF-8. As for "specially written parser", XML processors are required to
accept UTF-8 input and browsers at least as far back as NN4 also do so.
Oh dear. So by "transmitted verbatim across a network" you meant like
"served from a server to user agent"?! OK, then we have a really "low
start"... You homework for Monday (I'll check :-)

Given this UTF-8 encoded XML file:

<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark) -->
<item>%C2%AE</item>
</repository>

Investigate and explain why this (r) sign doesn't appear back, no matter
what, when viewed in a UA.
A hint: think of the difference between 1) the byte input stream from the network and
2) the document source text made from the received byte stream. At what
stage does the UA's UTF-8 decoder work?

Then create a version properly displaying the (r) sign. To avoid DTD
charset hassle, it is allowed to make an (X)HTML document instead of
XML. Make sure that you see the (r) sign when opened in a UA. What charset is
your source? A hint: do not look at UTF-8.

Nov 11 '06 #16

VK

VK wrote:
To avoid DTD
charset hassle
A "repeating word" typo, of course:

"To avoid DTD subset hassle..."

Nov 11 '06 #17

VK wrote:

[MLW:]
>>>a document might stored on disk using UTF-8, and then transmitted
verbatim across a network.
[snip]
Oh dear. So by "transmitted verbatim across a network" you meant like
"served from a server to user agent"?!
Of course.
OK, then we have a really "low start"... You homework for Monday
(I'll check :-)
We do, but I'm not the one that doesn't understand what's going on. Once
again, you prove yourself to be totally clueless.
Given this UTF-8 encoded XML file:

<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark)
<item>%C2%AE</item>
</repository>
Moron! That doesn't use the UTF-8 encoding form.

The element, item, contains six characters, represented using six
octets. In hexadecimal (and binary) these are: 25 (00100101), 43
(01000011), 32 (00110010), 25, 41 (01000001), and 45 (01000101). Using
UTF-8, it should contain one character, represented using two octets: C2
(11000010) and AE (10101110).
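In JavaScript terms (purely illustrative):

'%C2%AE'.length;              // 6 - six literal characters, not two octets
decodeURIComponent('%C2%AE'); // "\u00AE" - but only because decodeURIComponent
                              // interprets the percent escapes as UTF-8 octets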

[snip]

Mike
Nov 11 '06 #18

VK

Michael Winter wrote:
<?xml version="1.0" encoding="UTF-8"?>
<repository>
<!-- item contains UTF-8 encoded
Unicode character (r) (trade mark)
<item>%C2%AE</item>
</repository>

Moron!
Halfwit.
That doesn't use the UTF-8 encoding form.
But I'm in a rather good mood so still accepting the homework by
Monday. You are even allowed (though not suggested) to extend the
assignment: make a document in "truly deeply UTF-8" encoding - whatever
it is in your mind - which one could "transmit verbatim over network".
With so many technical details in this thread a bit of fun can be
useful.

Nov 11 '06 #19

VK

Bart Van der Donck wrote:
With <form method="get">, the browser tries to pass the characters
to the server in the character set of the page
Sorry to correct but it's an important one:
IE 6,7 will always pass the form data with GET as UTF-8 encoded
sequences (in the default configuration). It is regulated by Tools >
Internet Options Advanced Always send URL's as UTF-8

Nov 11 '06 #20

VK wrote:

[snip]
Halfwit.
You'll eat your words, I promise you.
>That doesn't use the UTF-8 encoding form.

But I'm in a rather good mood so still accepting the homework by
Monday.
I already explained precisely how the character should be serialised,
but clearly that went over your head.
You are even allowed (though not suggested) to extend the assignment:
make a document in "truly deeply UTF-8" encoding - whatever it is in
your mind - which one could "transmit verbatim over network".
<http://www.mlwinter.pwp.blueyonder.co.uk/clj/utf-8.xml>

Feel free to use a protocol analyser like Ethereal to view each raw
octet returned in the response. You will notice that though the content
of the element, root, is a single character (U+00AE Registered Trade
Mark Sign), it is represented using two octets: C2 and AE.

[snip]

Mike
Nov 11 '06 #21

VK wrote:
Bart Van der Donck wrote:
With <form method="get">, the browser tries to pass the characters
to the server in the character set of the page

Sorry to correct but it's an important one:
IE 6,7 will always pass the form data with GET as UTF-8 encoded
sequences (in the default configuration). It is regulated by Tools >
Internet Options > Advanced > "Always send URLs as UTF-8".
My test seems to indicate the opposite on MSIE6 + "Always send URL's as
UTF-8" checked:

http://www.dotinternet.be/temp/example.htm -> %E9
http://www.dotinternet.be/temp/exampleUTF-8.htm -> %C3%A9

Am I overlooking something ?

--
Bart

Nov 12 '06 #22

VK

Bart Van der Donck wrote:
My test seems to indicate the opposite on MSIE6 + "Always send URL's as
UTF-8" checked:

http://www.dotinternet.be/temp/example.htm -> %E9
http://www.dotinternet.be/temp/exampleUTF-8.htm -> %C3%A9

Am I overlooking something ?
Partially. The first URL leads to illegal HTTP transmission (no charset
provided neither by page nor by server). This way it activates error
correction mechanics in the browser. And UA error correction is a whole
separate issue of conversation.
Say IE 6 SP1 / Win 98SE studies the input stream and by some formal
signs decides that it's Cyrillic. These "formal signs" are very fragile
and the source is wide open to the "Korean issue" and "Characters jam"
effects. They don't happen here just because of the simplicity of the
page content.
At the same time, IE6 SP1 / Win XP SP1 by the same formal signs decides
that it's a UTF-8 encoded Unicode document but falls into the "Characters
jam" effect, so the page comes out blank, though View Source shows the
source a bit twisted but not empty.
Illegal transmissions of this kind are called "easy money" on our
helpdesk :-) They bring us a couple hundred bucks for sure each month, and
it takes one sentence to solve the problem: "add a matching charset
either to the META tag or to the Content-Type header sent by the server".

Nov 12 '06 #23

VK
You are even allowed (though not suggested) to extend the assignment:
make a document in "truly deeply UTF-8" encoding - whatever it is in
your mind - which one could "transmit verbatim over network".
<http://www.mlwinter.pwp.blueyonder.co.uk/clj/utf-8.xml>
And for sure you have checked *what* charset is indicated in browser
for your "UTF-8" ?

Nov 12 '06 #24

VK wrote:
[MLW:]
>>You are even allowed (though not suggested) to extend
the assignment: make a document in "truly deeply UTF-8"
encoding - whatever it is in your mind - which one could
"transmit verbatim over network".
> <http://www.mlwinter.pwp.blueyonder.co.uk/clj/utf-8.xml>

And for sure you have checked *what* charset is indicated in
browser for your "UTF-8" ?
Are you sure you are not, once again, looking for the wrong thing in the
wrong place? (for example, at the Encoding item in the menu for IE's
post-XSLT transformation representation of the XML).

Firefox's 'View Page Info' has no trouble reporting the resource as
UTF-8, and a hex dump of the bytes actually sent shows:-

3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A

- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).
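For anyone wanting to repeat that check without a protocol analyser, a rough
present-day sketch (fetch and Uint8Array assume a modern environment, and the
request must be allowed to succeed, e.g. from Node 18+ or a same-origin page):

fetch('http://www.mlwinter.pwp.blueyonder.co.uk/clj/utf-8.xml')
  .then(function (response) { return response.arrayBuffer(); })
  .then(function (buffer) {
    var bytes = new Uint8Array(buffer);
    var hex = [];
    for (var i = 0; i < bytes.length; i++) {
      hex.push(('0' + bytes[i].toString(16).toUpperCase()).slice(-2));
    }
    console.log(hex.join(' ')); // ... 3C 72 6F 6F 74 3E C2 AE 3C 2F 72 6F 6F 74 3E ...
  });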

Richard.
Nov 12 '06 #25

VK
And for sure you have checked *what* charset is indicated in
browser for your "UTF-8" ?
Are you sure you are not, once again, looking for the wrong thing in the
wrong place? (for example, at the Encoding item in the menu for IE's
post-XSLT transformation representation of the XML).
Firefox's 'View Page Info' has no trouble reporting the resource as
UTF-8, and a hex dump of the bytes actually sent shows:-
3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A
- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).
Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Wow... I will definitely add it to our knowledge base, as a sample of
what people may come up with with enough of free time available :-)

Sorry again to everyone for being so slow: but it's really...
sophisticated.

Nov 12 '06 #26

VK

VK wrote:
Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Moreover: despite the physical file containing the Windows-1251 CYRILLIC
CAPITAL LETTER R + CYRILLIC SMALL LETTER DJE combo, in View Source
and by DOM methods it's reported as the Unicode CYRILLIC CAPITAL LETTER A.
It's all very sketchy but I see some client-side protection algorithms
much more elegant and effective than traditional boring obfuscators
(see for instance another recent post here by Senderos). If I come up
with something useful: this thread and your names will be mentioned.

Nov 12 '06 #27

VK wrote:
Bart Van der Donck wrote:
>My test seems to indicate the opposite on MSIE6 + "Always send URL's as
UTF-8" checked:

http://www.dotinternet.be/temp/example.htm -> %E9
http://www.dotinternet.be/temp/exampleUTF-8.htm -> %C3%A9

Am I overlooking something ?
No, I don't believe so, though I can't say I'm certain of just what that
option is supposed to do - I've never looked into it.

Alan Flavell's discussion of form submission and internationalisation[1]
notes that the encoding scheme of the document affects how form data is
transferred. This document mentions that that IE option appears only to
affect anchors, and the path component at that, not form submission or
the query component.
Partially. The first URL leads to illegal HTTP transmission (no charset
provided neither by page nor by server).
Is being wrong a hobby for you or something?

If no encoding scheme is specified, the HTTP/1.1 specification (RFC
2616) states that "media subtypes of the 'text' type are defined to have
a default charset value of 'ISO-8859-1' when received via HTTP" (3.7.1
Canonicalization and Text Defaults).

It isn't unusual for this to be ignored in practice, what with
auto-detection and user preferences, but that doesn't make omitting the
charset parameter "illegal", only ill-advised.

[snip]

Mike

[1] FORM submission and i18n, Alan J. Flavell
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
Nov 12 '06 #28

VK

Michael Winter wrote:
If no encoding scheme is specified, the HTTP/1.1 specification (RFC
2616) states that "media subtypes of the 'text' type are defined to have
a default charset value of 'ISO-8859-1' when received via HTTP" (3.7.1
Canonicalization and Text Defaults).
3.7.1
....
Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value.
....

But after all RFC is RFC: "Request For Comments" - nothing less but
nothing more; thus take it serious but with caution.
If anyone wants trouble with documents not shown or broken: no problem,
send your documents with no charset indications of any kind. When your
customers come to complain, just quote them the RFCs - maybe it will
help to save your business (I doubt it very much, but feel free to try
:-).
From my side it is even good that the W3C considers the HTTP RFCs frozen, with
nothing to correct in there. It means I'll continue to get my money for
helping freshly graduated admins fix their boo-boos and for explaining to
them that the Internet as it is is in the wires, not in the books.

That is *not* to offend anyone participating in this thread or simply
reading this thread. I just refuse to take the role of some stubborn
bastard forcing charset usage while some RFC allows it to be skipped.
As I said: anyone is free to do whatever she wants. Just more money for
me anyway.

Nov 12 '06 #29

VK wrote:

[snip]
3.7.1
...
Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value.
...
So? The data did use the ISO-8859-1 encoding form, so labelling it as
such is not technically required.
But after all RFC is RFC: "Request For Comments" - nothing less but
nothing more; thus take it serious but with caution.
Each distinct version of an Internet standards-related
specification is published as part of the "Request for
Comments" (RFC) document series. This archival series is the
official publication channel for Internet standards documents
and other publications of the IESG, IAB, and Internet
community.
-- 2.1 Requests for Comments (RFCs),
The Internet Standards Process (Revision 3), RFC 2026

Where do you think Internet protocols are specified?
Anyone wants troubles with documents not shown and broken: no
problems, send your documents with no charset indications of any
kind. When your customers will come to complain, just quote them
RFC's - maybe it will help to save your business (I doubt very much,
but feel free to try :-).
You seem to have problems reading, so let me paraphrase my previous
post: it is not wrong to omit a charset parameter if the encoding form
is ISO-8859-1, but it is not recommended.

[snip]

Mike
Nov 12 '06 #30

VK wrote:

[snip]

[R. Cornford:]
>(the registered trade mark character is the C2 AE sequence just
before the 3C at the end of the penultimate line).

Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack.
A hack? No, simply how UTF-8 works. I have no idea what it was you
posted earlier, but it was not a UTF-8 encoded document (at least not in
the spirit it was meant to be).

[snip]

Mike
Nov 12 '06 #31

VK

Michael Winter wrote:
Where do you think Internet protocols are specified?
Mostly and mainly in the same place where the [window] object is: :-)
it goes per the traditions and per the "templatic" implementation.

Anyway, I did some research (damn time zone change, I cannot get any
sleep). Sorry, I cannot post URLs as I used Perl scripts on one of our
clients' servers - they would not like it. Feel free to re-evaluate it
yourself; watch the shebang path as usual.

[ Test 1 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 2 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 3 ]
#!/usr/bin/perl
print "Content-Type: text/html\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[Test 1] sets iso-8859-1 charset in the server header

[Test 2] sets iso-8859-1 charset in the server header but UTF-8 in META
tag. Server header is obligated to take priority over meta if UA is not
broken (thus iso-8859-1 remains)

[Test 3] sets UTF-8 in meta.

The variant with no charset set at all is not taken into consideration.
Feel free to break your browser yourself :-)

In each generated form I typed in the same Russian word, which sounds like
"probah" and which means, as I understand, "a probe". See the first match
in the search results:
<http://www.google.com/search?hl=en&q=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0&btnG=Google+Search>

//////////////
[Test 1] (iso-8859-1 set by server header)
Reported charset by all UA': iso-8859-1

Submission results:

IE 6.0
test=%EF%F0%EE%E1%E0

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B
//////////////
Test 2 (iso-8859-1 set by server header, overrides meta tag)
Reported charset by all UA': iso-8859-1

Submission results (watch the change for IE):

IE 6.0
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B
//////////////
Test 3 (UTF-8 set by meta tag)

Reported charset by all UA': UTF-8

Submission results:

IE 6.0
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Firefox 1.5
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Opera 9.02
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0
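To relate those query strings back to characters (a sketch; decodeURIComponent
assumes UTF-8 percent escapes):

decodeURIComponent('%D0%BF%D1%80%D0%BE%D0%B1%D0%B0');
// "\u043F\u0440\u043E\u0431\u0430" - the Cyrillic word used in the tests

// The Test 1 result is not UTF-8 at all: %EF%F0%EE%E1%E0 are the single-byte
// windows-1251 codes for the same five letters, so decodeURIComponent
// would throw a URIError on that string.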

Nov 12 '06 #32

VK wrote:
And for sure you have checked *what* charset is indicated in
browser for your "UTF-8" ?
>Are you sure you are not, once again, looking for the wrong thing
in the wrong place? (for example, at the Encoding item in the
menu for IE's post-XSLT transformation representation of the XML).
>Firefox's 'View Page Info' has no trouble reporting the resource
as UTF-8, and a hex dump of the bytes actually sent shows:-
>3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A
>- which certainly is UTF-8 encoded (the registered trade mark
character is the C2 AE sequence just before the 3C at the end
of the penultimate line).

Wow! Now I see. Sorry for being so slow, but it just takes a
bit for such sophisticated hack. So instead of say "CYRILLIC
CAPITAL LETTER A" (Unicode 0x0410) we are taking its UTF-8
encoding 208 144 and placing two 8-bit encoded characters
matching 208 and 144. Say in Cyrillic (Windows-1251) these
will be CYRILLIC CAPITAL LETTER R and CYRILLIC SMALL LETTER
DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode
character CYRILLIC CAPITAL LETTER A. Just tried it: it works
for modern browsers. Wow... I will definitely add it to our
knowledge base, as a sample of what people may come up with
with enough of free time available :-)
ROTLMLOL. It all just goes straight over your head, doesn't it?
Sorry again to everyone for being so slow: but it's really...
sophisticated.
Sophisticated? I suppose that depends on how rudimentary your intellect
is to start with.

Richard.
Nov 12 '06 #33

VK wrote:
Bart Van der Donck wrote:
My test seems to indicate the opposite on MSIE6 + "Always send URL's as
UTF-8" checked:

http://www.dotinternet.be/temp/example.htm -> %E9
http://www.dotinternet.be/temp/exampleUTF-8.htm -> %C3%A9

Am I overlooking something ?

Partially. The first URL leads to illegal HTTP transmission (no charset
provided neither by page nor by server). This way it activates error
correction mechanics in browser. And UA's error correction is all
separate issue of conversation.
Okay, let's disable such correction mechanisms then; say the following
example in ISO-8859-1. It shows the same result:
http://www.dotinternet.be/temp/exampleISO.htm

I think it's like Michael Winter said (RFC 2616): "Media subtypes of
the 'text' type are defined to have a default charset value of
'ISO-8859-1' when received via HTTP". This specification seems to be
well obeyed by the browsers that I tested.
Say IE 6 SP1 / Win 98SE studies the input stream and by some formal
signs decides that it's Cyrillic.
If that happened, it would still get encoded to %E9 in a query
string. It's only the browser that decides how to display the
character, be it the HTML entity И (Cyrillic) or é (Latin-1).
When you change the character table, %E9 might point to a Latin,
Cyrillic or Swahili sign, depending on whatever table is used. That
has no effect on query string encoding; those are two separate things.
These "formal signs" are very fragil and the source is wide open for
the "Korean issie" and "Characters jam" effects. They don't happen here
just because of the simplicity of the page content.
Yes, true.

--
Bart

Nov 13 '06 #34

VK wrote:
You come to say to any Java team guy "Unicode" (unlike
"Candyman" one time will suffice :-) and then run away quickly
before he started beating you.
What a luxury. In the Perl world everybody starts fighting with
everybody.

--
Bart

Nov 13 '06 #35

VK wrote:
Michael Winter wrote:
>Where do you think Internet protocols are specified?

Mostly and mainly in the same place where the [window] object is: :-)
it goes per the traditions and per the "templatic" implementation.
You want to compare the object model of competing products to
interworking network protocols?
Any way, I did some research ...
Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?

[snip]

Mike
Nov 14 '06 #36

VK
Any way, I did some research ...
>
Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?
Alan Flavell has no idea (AFAICT) neither about the Korean Issue, nor
about the Character Jam, nor about the Phenomenon of the first non-ASCII
character as such. This way he is not an authority to me until
knowledge of these issues is demonstrated somewhere else in his books.

Nov 15 '06 #37

VK wrote:
>>Any way, I did some research ...
Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?

Alan Flavell has no idea (AFAICT) neither about the Korean Issue,
The only time you've referred to a "Korean Issue" in the past was caused
by a failure in MSIE to detect an encoding scheme correctly, producing
rather odd results when it guessed UTF-7. The solution to that is
obvious, and Alan addresses it indirectly by recommending that the user
agent should never need to guess. That said, he does touch on it:

In that analysis, I've disregarded utf-7 format (which would be
wrongly identified as us-ascii), as being inappropriate for use
in an HTTP context. One might mention, however, that when MSIE
is set to auto-detect character encodings, it has been known to
mis-identify some us-ascii pages, claiming them to be in utf-7.
-- Heuristic recognition of utf-8?,
FORM submission and i18n, Alan J. Flavell
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
nor about the Character Jam nor about the Phenomenon of the first
non-ASCII character as such.
If you want a sensible discussion of the issues, actually describe them
properly.

[snip]

Mike
Nov 15 '06 #38

VK
Michael Winter wrote:
If you want a sensible discussion of the issues, actually describe them
properly.
The issue is that UA's act unstably without a charset indicated somehow.
That is especially true for IE6, which also happens to be the most
widely used UA at this time. IE6 is a very old, I would say ancient,
browser (by the Web time scale), with Unicode and UTF-__ encoding
support implemented on top as an add-on, somehow anyhow.

This is only distantly related to JavaScript programming though. Maybe I'll
make a demo page showing what an innocent page can do to IE6 if
the charset is not provided.

Nov 15 '06 #39

Hello!

"VK" <sc**********@yahoo.comwrote in message news:11**********************@e3g2000cwe.googlegro ups.com...
...
- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).

Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Wow... I will definitely add it to our knowledge base, as a sample of
what people may come up with with enough of free time available :-)

Sorry again to everyone for being so slow: but it's really...
sophisticated.

Sophisticated? Hack (from another message)?
But you wrote that you deal with, say, Japanese and Korean 'legacy' encodings,
so you do know what Shift_JIS is, right? Then why do you write such nonsense:
take these two characters together and display as one
?

"Two characters"??? UTF-8 is same multi-byte encoding as Shift_JIS -
do you write about ONE Japanese letter which is encoded by 2 bytes
in Shift_JIS in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?

There are no "characters" there, just 2 bytes that represent one Cyrillic
letter in mulit-byte encoding "UTF-8" -
same way as another 2 bytes represent one Japanese letter in multi-byte
encoding "Shift_Jis".
As Michale wrote, you somehow did not thing about the serialization,
about files on the disk.
I don't know why you did not know before about, say, .HTML files
containing pure UTF-8 text (i.e. real UTF-8 characters as multi-byte items)
to produce a multilingual page - such I18n examples and well-known pages
have existed on the Web since I became an I18n engineer back in 1997 :-)

For example, for my Cyrillic (Russian) instructional site I prepared
a section "Multilingual HTML" many, many years ago -
it included preparation of the .htm _files_ containing UTF-8 text -
no one in their right mind would have _large_ text represented the way
your UTF-8 example does - <item>%C2%AE</item> - how do you think
a web site owner would _edit/correct_ such a page if, instead of
_readable_ text (say Russian+German letters in UTF-8 encoding), it
would contain just things like %C2%AE?

Strange (based on your statements of I18n knowledge) that we here have to explain
you UTF-8 facts written say for _beginners_ at least 6 years ago on my site in
"Multilingual HTML" section (M.Flavell's site is listed there as a source
for non-beginners): http://RusWin.net/mix.htm

It has UTF-8 examples, too: http://RusWin.net/utf8euro.htm
and http://RusWin.net/utf8-jap.htm

Same can be said aboit XML. In both XML and HTML serialization
(files on disk) is a VERY _common_ practice to have real UTF-8
text in .xml and .html

--
Regards,
Paul
Javascript Virtual Keyboard working in Opera, Mozilla, IE:
http://Kbd.RusWin.net


Nov 19 '06 #40

P: n/a
Hello!

"VK" <sc**********@yahoo.comwrote in message news:11**********************@e3g2000cwe.googlegro ups.com...
...
- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).

Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack.
So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Wow... I will definitely add it to our knowledge base, as a sample of
what people may come up with with enough of free time available :-)

Sorry again to everyone for being so slow: but it's really...
sophisticated.

Sophisticated? Hack? Free time? It is common _practice_, not a
"free-time curiosity" - please read below.

You wrote that you deal with, say, Japanese and Korean 'legacy' encodings,
so you do know what Shift_JIS is, right? Then why do you write such nonsense:
"take these two characters together and display as one"?
"Two characters"??? UTF-8 is a multi-byte encoding just like Shift_JIS -
would you describe ONE Japanese letter, which is encoded by 2 bytes
in Shift_JIS, in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?
I don't think so :)

There are no "characters" there, neither for Russian in UTF-8 nor for Japanese
in Shift_JIS - just _2 bytes_ that represent one Cyrillic letter in multi-byte
encoding "UTF-8" -
same way as another 2 bytes represent one Japanese letter in multi-byte
encoding "Shift_Jis".

As Michael wrote above, you somehow did not think about serialization,
about files on the disk.

I don't know why you did not already know about, say, .HTML files
containing pure UTF-8 text (i.e. real UTF-8 characters as multi-byte items)
used to produce a multilingual page - such I18n examples and well-known
pages have existed on the Web since I became an I18n engineer back in 1997 :-)

For example, for my Cyrillic (Russian) instructional site I prepared
a section "Multilingual HTML" many, many years ago -
it covered preparation of the .htm _files_ containing UTF-8 text -

and it is NOT a "free-time hack, 'just for amusement' example" -

no one in their right mind would have _large_ text represented in
_your examples_ of UTF-8 - <item>%C2%AE</item> - how do you think
a Web site owner would _maintain/edit/correct_ such a page if, instead of
_readable_ text (say Russian+German letters in UTF-8 encoding), it contained
only things like %C2%AE?

In reality most multilingual Web pages serialized as .htm files contain
real UTF-8 text, so it is not a "hack" but a practical thing used everywhere -
and in accordance with the definition of UTF-8 as a "multi-byte encoding".
It is strange (given your stated I18n knowledge) that we have to explain
to you UTF-8 facts that were written for _beginners_ at least 6 years ago
on my site in the "Multilingual HTML" section (M. Flavell's site is listed
there as a source for non-beginners): http://RusWin.net/mix.htm

It has UTF-8 examples, too: http://RusWin.net/utf8euro.htm
and http://RusWin.net/utf8-jap.htm

The same can be said about XML. In both XML and HTML serialization
(files on disk) it is a VERY _common_ practice to have real UTF-8
text in .xml and .html files.

--
Regards,
Paul
Javascript Virtual Keyboard working in Opera, Mozilla, IE:
http://Kbd.RusWin.net



Nov 19 '06 #41

P: n/a
VK
Paul Gorodyansky wrote:
You wrote that you deal with say Japanese and Korean 'legacy' encodings
so say you do know what Shift_Jis is, right? Then why you write such noncense:
I would hardly call Shift_JIS a "legacy" encoding, as it remains the only
one used in Japan itself :-) (thanks to Unicode, Inc., which screwed the
entire nation)

But no, you did not read my post correctly: I was talking about a standard
Western (Latin-1) page being interpreted as if it were written in Hangul
(the Korean alphabet) or in 16-bit Unicode, because of a missing charset
indication. I am now seriously thinking of making a demo set and posting it
to ciwah, as this seems to be terra incognita for too many people.
(We've called the relevant problem the "Korean issue" - a slang term,
because 1) an ISO Latin page gets interpreted as UTF-7 with Hangul
(Korean) characters in it, and 2) we got a number of requests on the
matter at the time of the first big USA - North Korea crisis.
No national offence intended, I hope.)
"Two characters"??? UTF-8 is same multi-byte encoding as Shift_JIS -
do you write about ONE Japanese letter which is encoded by 2 bytes
in Shift_JIS in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?
I don't think so :)
What I missed in the discussion of "storing UTF-8 directly and
delivering it verbatim" was the use of the terms "byte" and "byte sequence"
as applied to a *text document*. Naturally everything consists of
bytes, including any .html or .txt file. At the same time there is a
core distinction between a text file and a binary file: a text
file, by definition, consists of *characters*, not of bytes. And from the
point of view of any "8-bit observer", such a document is nothing but a
sequence of 8-bit characters; an extra instruction is required
to interpret it in any other way.
After I had translated the explanations from byte terms into character
terms, I understood what Mr. Winter was trying to tell me. It required an
extra abstraction effort on my side, as it is a bit like describing a
painting in terms of wavelengths. On the other hand, it helped greatly
in understanding a big category of help requests (about broken pages)
which we have been getting. Before, I thought the victims were just...
strange people. Now I understand that they are simply looking at it from
"another dimension", and from that dimension what they are doing is
totally correct and makes perfect sense. Unfortunately for them, the
Internet often operates in a dimension different from theirs.
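
As a rough sketch (using nothing beyond the old escape()/decodeURIComponent()
pair), the same "change of dimension" can even be reproduced from JavaScript:
escape() writes each 8-bit character out as a %XX byte escape, and
decodeURIComponent() then reads that byte sequence back as UTF-8:

  // Two 8-bit "characters", 0xD0 and 0x90, as the "8-bit observer" sees them:
  var eightBit = '\u00D0\u0090';
  // "%D0%90" re-read as UTF-8 yields one character:
  var reinterpreted = decodeURIComponent(escape(eightBit));
  alert(reinterpreted);        // CYRILLIC CAPITAL LETTER A (U+0410)
  alert(reinterpreted.length); // 1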

Nov 19 '06 #42

P: n/a
Hello!

VK wrote:
Paul Gorodyansky wrote:
"Two characters"??? UTF-8 is same multi-byte encoding as Shift_JIS -
do you write about ONE Japanese letter which is encoded by 2 bytes
in Shift_JIS in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?
I don't think so :)

And the text file by definition consists of *characters*, not of bytes.
It's true - and it's true for _both_ Japanese Shift_JIS and UTF-8.
And from a point of view of any "8-bit observer" such document is nothing but a
set of 8-bit characters.
No, not at all. Japanese text - using the multi-byte encoding Shift_JIS -
is, as you wrote yourself, a collection of *characters* - Japanese
ones - where a character may be represented by one byte (hankaku
katakana) or by two bytes - there are NO "8-bit characters" there -
EXACTLY the same situation holds for UTF-8 text, because it is also a
multi-byte encoding - one should NOT look at a Japanese *character*
in Shift_JIS text as "two 8-bit characters - capital a-umlaut and ..." -
no, it is ONE multi-byte character.
Exactly the same is true for UTF-8 text.
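
A small illustrative helper (just a sketch; utf8ByteLength is a made-up
name, and it simply counts the %XX escapes produced by encodeURIComponent())
makes the character-versus-byte distinction concrete in JavaScript:

  // string.length counts *characters* (UTF-16 units), never bytes:
  function utf8ByteLength(s) {
    // each unescaped ASCII character is one byte; each %XX escape is one byte
    return encodeURIComponent(s).replace(/%[0-9A-F]{2}/g, '.').length;
  }

  var hiraganaA = '\u3042';            // HIRAGANA LETTER A
  alert(hiraganaA.length);             // 1 character
  alert(utf8ByteLength(hiraganaA));    // 3 bytes when serialized as UTF-8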

That is what I wrote yesterday - see the quote at the top of _this_ message.
It is required to provide an extra instruction to interpret it in some other way.
Interpretation of text presented in a multi-byte encoding has been well
understood for many years already - from CJK texts.

After I have transformed the explanations from byte terms to character
terms I understood that Mr.Winter tried to tell me. It required an
extra abstraction effort from my side as it's a bit like describing a
painting in terms of wavelengths.
Why? Multi-byte UTF-8 text with UTF-8 characters is the same concept
as multi-byte Japanese Shift_JIS text, so it is strange that the concept
looks new to you -
I'd understand if you were a person who had never dealt with Japanese,
Chinese or Korean...

--
Regards,
Paul
http://RusWin.net

Nov 19 '06 #43

P: n/a
VK

pa*****@compuserve.com wrote:
No, not at all. Japanese text - using multi-byte encoding Shift_JIS -
is - as you wrote yourself - a collection of *character* - Japanese
ones
Only as long as it is declared or auto-recognized as Shift_JIS. Otherwise it's
an 8-bit charset. I invite you once again to read the origin of this
branch of the thread, not just my latest posts.
Why? Multi-byte UTF-8 text with UTF-8 characters the same concept
as multi-byte Japanese Shift_JIS text so it's strange that the concept
looks new for you -
I'd understand if you were aperson who never dealt with Japanese,
Chinese or Korean...
I have dealt with them a lot. Before discussing the issue further, two
things should be done:

1) the discussion should move to ciwah or even ciwam, as it is too far
off JavaScript IMHO (though somewhat connected, so maybe it can be left here?);

2) I want to show the cases I was talking about in this thread: I hate
*abstract* discussions of the kind:
- "I can do it because it's written here that I can do it."
- "You can never do it because the sh** will happen."

Nov 19 '06 #44

P: n/a
Hello!

1) the discussion moved to ciwah or even ciwam as it is too far of
JavaScript IMHO (though somehow connected so maybe can be left here ?).
Right. I just posted because I was surprised that the concept of real UTF-8
characters (vs. URL-encoding or entities) was so new to you, when it is exactly
the same as, say, Japanese Shift_JIS - both are multi-byte encodings, and for
_both_ it does NOT make any sense to describe a multi-byte *character* the
way you did. I.e., I replied to this (which is wrong for a multi-byte encoding,
be it Shift_JIS or UTF-8, because there are *no* 'two 8-bit characters'; it is
one multi-byte character):
... we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian).
Describing parts of a multi-byte character as separate characters of _another_
encoding would be wrong for a two-byte Japanese character -
and it is just as wrong to do so for a UTF-8 character.
--
Paul
http://RusWin.net
Nov 20 '06 #45
