On Mon, 14 Mar 2005, Patrick Van Esch wrote:
> [...] but (on an Win XP machine) generates
> unicode (which is indeed encoded under UTF-16, as I understand it
> now). If you open that code with anything that expects ASCII (such as
> a basic program or so reading it as a text file) you get a "funny"
> file which has as first byte a 255 code, and as second byte a 254
> code, and then all true ascii is indeed encoded by a 0 byte preceded
> by a byte containing the ascii code, and the greek characters are
> simply encoded by "first byte value" + 256 x "second byte value".
If you want to understand what this is, it could be useful to read the
appropriate section in the Unicode standard.
I know that many readers react to this by saying "this is more
complicated than I want to know", but in fact taking a little extra
time to understand this additional complication can save a lot of
confusion later; people who insist on inventing over-simplified
versions of the story inevitably waste time later when those versions
lead to confusion.
My recommendation would be to read chapter 2
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
at least from section 2.4 down to 2.6 inclusive, and in particular
to study table 2-3.
The "character encoding form" utf-16 (see section 2.5) defines the
way to represent any Unicode character by means of 16-bit-wide data
unit(s).
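
To make the idea of 16-bit code units concrete, here is a small Python sketch (Python is my choice for illustration; the helper name utf16_code_units is made up). It lists the code units for an ASCII letter, a Greek letter, and one character outside the 16-bit range, which needs two units (a surrogate pair).

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of `text` in the UTF-16 encoding form."""
    # Big-endian is used here only so the bytes can be unpacked unambiguously;
    # the code unit values themselves do not depend on byte order.
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

# Latin A, Greek small alpha, and U+1D000 (a Byzantine musical symbol)
for ch in ["A", "\u03b1", "\U0001D000"]:
    units = utf16_code_units(ch)
    print("U+%04X ->" % ord(ch), " ".join("%04X" % u for u in units))
```

The first two characters take one code unit each (0041 and 03B1); the third takes the pair D834 DC00.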
However, in practice (since computer architectures are typically
byte-oriented) you need a way to be more precise about how these
16-bit units will be stored in a computer or transmitted on a
communications channel.
For this reason the "Character Encoding Form" breaks up into a number
of "Character Encoding Schemes". For utf-16 we first break the
categories down by whether the byte-ordering is defined internally or
externally to the data stream.
With utf-16BE and utf-16LE, the byte-ordering (big-endian or
little-endian) is specified externally to the data stream, by the name
of the encoding scheme.
With the utf-16 Encoding Scheme, the byte-ordering is specified by
means of the "byte order mark" at the start of the data (and this is
what you saw on Windows). There are of course two flavours of this
encoding /scheme/, i.e. big and little endian, but they both have the
same /name/, since the BOM itself is sufficient to distinguish between
them.
And thus the Unicode specification has just three Encoding Scheme
names for what are, in a sense, four different encoding schemes for
the one utf-16 "encoding form".
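
As a rough illustration of the three scheme names, the following Python sketch serialises the same two characters four ways. The codec names "utf-16-be", "utf-16-le" and "utf-16" are Python's spellings, and the variable names are mine; the BOM is prepended by hand here so that both flavours can be shown regardless of the machine's native byte order.

```python
import codecs

text = "A\u03b1"  # Latin A followed by Greek small alpha

# Byte order given externally by the scheme name: no BOM in the data.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# "utf-16" scheme: byte order announced by a BOM at the start of the data.
bom_le = codecs.BOM_UTF16_LE + le   # starts FF FE (255, 254), as seen on Windows
bom_be = codecs.BOM_UTF16_BE + be   # starts FE FF

for name, data in [("utf-16BE", be), ("utf-16LE", le),
                   ("utf-16 (LE flavour)", bom_le),
                   ("utf-16 (BE flavour)", bom_be)]:
    print("%-20s %s" % (name, data.hex()))

# One decoder name handles both BOM flavours and strips the BOM.
assert bom_le.decode("utf-16") == text
assert bom_be.decode("utf-16") == text
```

The little-endian flavour with a BOM comes out as ff fe 41 00 b1 03, which matches the 255, 254 prefix and the "XX then 00" pattern described in the quoted message.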
Notice that the name utf-16 appears both as a Character Encoding Form
(which comprises the Character Encoding Schemes utf-16BE, utf-16LE and
both kinds of utf-16), as well as appearing as a Character Encoding
Scheme. This can be a bit confusing.
The Unicode FAQ also has an informative article on the BOM.
> So I wrote a small Reality Basic program that detects this 255 - 254
> initial two-byte sequence, and then replaces each "XX and 00" sequence
> simply by XX, and if it is "XX and YY" replaces it by "&#(value of XX
> + 256 * YY)", to make an ascii file out of it.
This is good fun for exploring the issues, of course, but you don't
want to do that in practice. There are well-tested libraries that
give the right answers (including surrogates and all that stuff).
Admittedly at the moment you aren't trying to use Byzantine musical
symbols, Linear B, or any of the other stuff that would need
surrogates, but that's no reason to dig oneself into a hole when
there's no good cause to do so.
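
For example, here is a minimal sketch of the same conversion using Python's built-in codecs rather than hand-rolled byte arithmetic. The function name utf16_to_ascii_ncr is made up, and the sketch assumes the input starts with a BOM; without one, Python's "utf-16" decoder falls back to the machine's native byte order.

```python
def utf16_to_ascii_ncr(data):
    """Decode BOM-prefixed UTF-16 bytes and return ASCII text in which
    every non-ASCII character becomes a numeric character reference."""
    # The "utf-16" codec reads the BOM to pick the byte order and
    # combines surrogate pairs into single characters.
    text = data.decode("utf-16")
    # "xmlcharrefreplace" writes each non-ASCII character as &#NNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# A, Greek alpha, and U+1D000 (needs a surrogate pair), little-endian with BOM
sample = b"\xff\xfe" + "A\u03b1\U0001D000".encode("utf-16-le")
print(utf16_to_ascii_ncr(sample))  # A&#945;&#118784;
```

A byte-pair approach would turn the last character into two references (one per surrogate), neither of which is a valid character; the library route produces the single reference &#118784;.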
None of this discussion about utf-16 should distract, of course, from
the general recommendations you've been getting in favour of using
utf-8.
best regards