Binary or text file

list

Hi folks,

I am new to Googlegroups. I have asked my questions in other forums until
now.

I have an important question: I have to check whether files are
binary (.bmp, .avi, .jpg) or text (.txt, .cpp, .h, .php, .html). How can I
check a file and find out whether it is binary or text?

Thanks for your help.

May 10 '07 #1


osmium

<li**@ubootsuxx.de> wrote:

I am new to Googlegroups. I asked my questions at other forums, since
now.

I have an important question: I have to check files if they are
binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
check a file an find out if the file is binary or text?

You can't. You can only determine with high probability what the file is.
Assuming ASCII, only a very few of the control characters ever appear in
a text file. It's like looking for a white crow: if you had looked at just one
more crow, it might have been a white one. But if you have the file
extension (as above), you can look it up at Wotsit and get an answer.

http://www.wotsit.org/

May 10 '07 #2

Keith Halligan

You can't. You can only determine with high probability what the file is.

Assuming ASCII code only a very few of the control characters ever appear in
a text file.

That's pretty much the way to do it. The Unix command `file' works
pretty much like this. It will generally take the first 512 bytes of
the file and determine the type of file from that. Binary files tend to
have a lot of padding with zeroed bytes, while ASCII files will have
nearly every byte with a value > 30.
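
A minimal C++ sketch of the "look at the first 512 bytes" idea, assuming a NUL byte is taken as the sign of a binary file. The names `readPrefix` and `containsNul` are hypothetical, and this is not how `file' itself is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>

// Reads at most `limit` bytes from the start of the file at `path`.
// Returns an empty string when the file cannot be opened.
std::string readPrefix(const std::string& path, std::size_t limit = 512)
{
    std::ifstream in(path, std::ios::binary);   // binary: no newline translation
    std::string buffer(limit, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(limit));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}

// True when the sample contains at least one zero byte.
bool containsNul(const std::string& sample)
{
    return std::find(sample.begin(), sample.end(), '\0') != sample.end();
}
```

Typical use would be `containsNul(readPrefix("picture.bmp"))`. Note that UTF-16 and UTF-32 text contains zero bytes too, so this test alone would misclassify such files as binary.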

May 10 '07 #3

Diego Martins

On May 10, 3:16 pm, Keith Halligan <keith.halli...@gmail.com> wrote:

You can't. You can only determine with high probability what the file is.
Assuming ASCII code only a very few of the control characters ever appear in
a text file.

Thats pretty much the way to do it. If you take the unix command
`file' it does it pretty much like this. It'll generally take the
first 512 bytes of the file and from that it can determine the type of
file. Binary files tend to have a lot of padding with bytes zeroed
out, while ascii files will have every byte having a value > 30.

There is a `file' utility in Unix that does the job.
Borrow source code from it :)

May 11 '07 #4

James Kanze

On May 10, 8:16 pm, Keith Halligan <keith.halli...@gmail.com> wrote:

You can't. You can only determine with high probability what the file is.
Assuming ASCII code only a very few of the control characters ever appear in
a text file.

Thats pretty much the way to do it. If you take the unix command
`file' it does it pretty much like this. It'll generally take the
first 512 bytes of the file and from that it can determine the type of
file. Binary files tend to have a lot of padding with bytes zeroed
out, while ascii files will have every byte having a value > 30.

Note, however, that the file utility has a very high error rate.
And it knows a fair amount about the formats of different types
of binary files, and can recognize those because of various
embedded magic numbers---if the file matches a known format,
then it isn't plain text.
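
To make the magic-number idea concrete, here is a small C++ sketch for the three binary formats the original poster mentioned. The signatures used (FF D8 FF for JPEG, "BM" for BMP, "RIFF" plus "AVI " at offset 8 for AVI) are the commonly documented ones; the names `KnownFormat`, `hasBytesAt` and `sniffFormat` are made up for this example, and the real file utility uses a far larger table.

```cpp
#include <cstddef>
#include <string>

enum class KnownFormat { Unknown, Bmp, Jpeg, Avi };

// Returns true when `data` contains `magic` starting at `offset`.
static bool hasBytesAt(const std::string& data, std::size_t offset,
                       const std::string& magic)
{
    return data.size() >= offset + magic.size()
        && data.compare(offset, magic.size(), magic) == 0;
}

// Guesses the format from the first bytes of a file.
KnownFormat sniffFormat(const std::string& header)
{
    if (hasBytesAt(header, 0, "\xFF\xD8\xFF")) {
        return KnownFormat::Jpeg;
    }
    if (hasBytesAt(header, 0, "RIFF") && hasBytesAt(header, 8, "AVI ")) {
        return KnownFormat::Avi;
    }
    if (hasBytesAt(header, 0, "BM")) {
        return KnownFormat::Bmp;          // weak: only two bytes of evidence
    }
    return KnownFormat::Unknown;
}
```

A two-byte signature such as "BM" also matches a text file that happens to start with those letters, which is one source of the errors mentioned above.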

In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding. A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and globally, less reliable) than back in
the days when everything was ASCII.
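
The UTF-32LE pattern described above can be sketched in C++ roughly as follows. The function name `looksLikeUtf32Le` and the 90% threshold standing in for "with few exceptions" are assumptions of this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical heuristic: true when at least 90% of the complete 4-byte
// groups look like "one non-zero byte, then three zero bytes".
bool looksLikeUtf32Le(const std::string& sample)
{
    const std::size_t groups = sample.size() / 4;
    if (groups == 0) {
        return false;                     // too short to say anything
    }
    std::size_t matching = 0;
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t p = g * 4;
        if (sample[p] != '\0' && sample[p + 1] == '\0'
            && sample[p + 2] == '\0' && sample[p + 3] == '\0') {
            ++matching;
        }
    }
    // Integer form of: matching / groups >= 9 / 10
    return matching * 10 >= groups * 9;
}
```

This only recognises text made up mostly of characters below U+0100, such as English; other scripts put non-zero values in the second byte of each group.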

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 11 '07 #5

Gianni Mariani

On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
....

In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.

Really? Most text files I see don't have any characters beyond the
ASCII set, which would make them ASCII.

.... A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and globally, less reliable) that back in
the days when everything was ASCII.

I have yet to see a UTF-32LE file in the wild. Even the UTF-16 files
I've seen are far and few between. I'd like to believe that UTF-8
will become the default text format, and there are a few tests to
determine the likelihood of a file being UTF-8 (and no, it's probably
not a BOM at the beginning of the file).

May 11 '07 #6

James Kanze

On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...

In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.

Really ? Most text files I see don't have any characters beyond the
ASCII set which would make them ASCII.

Really. You must live a very parochial life. I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.

You're reading this thread; there are non-ASCII characters in
the messages in it. (Check out my signature, for example.)
Practically, if you're connected to the network, you can forget
about ASCII; you have to be able to handle a large number of
different character encodings.

.... A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and globally, less reliable) that back in
the days when everything was ASCII.

I have yet to see a UTF-32LE file in the wild.

I haven't either, but I know that they exist. I've also created
a few for test purposes.

Even the UTF-16 files I've seen are far and few between.

Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".

I'd like to believe that utf-8
will become the default text format

I would too, but given the legacy that has to be taken into
account, I don't realistically expect it to happen any time
soon.

and there are a few tests to
determine the likliness of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).

Actually, UTF-8 isn't that difficult. If the first 500 or so
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
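
A minimal C++ sketch of such a check, assuming the sample is a prefix of the file (so a sequence cut off at the very end is tolerated). The name `isPlausibleUtf8` is made up for this example; the rules applied (no overlong forms, no surrogates, nothing above U+10FFFF) follow the usual definition of well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true when `sample` contains no ill-formed UTF-8 sequence.
// A multi-byte sequence truncated by the end of the sample is accepted,
// because a fixed-size prefix can end in the middle of a character.
bool isPlausibleUtf8(const std::string& sample)
{
    const std::size_t n = sample.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(sample[i]);
        std::size_t extra = 0;            // number of continuation bytes
        unsigned long minValue = 0;       // smallest value this length may encode
        unsigned long cp = 0;             // decoded code point
        if (lead < 0x80) {
            ++i;                          // plain ASCII byte
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1; cp = lead & 0x1F; minValue = 0x80;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2; cp = lead & 0x0F; minValue = 0x800;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3; cp = lead & 0x07; minValue = 0x10000;
        } else {
            return false;                 // stray continuation, 0xC0/0xC1, 0xF5..0xFF
        }
        ++i;
        for (std::size_t k = 0; k < extra; ++k, ++i) {
            if (i == n) {
                return true;              // truncated by the sample boundary
            }
            const unsigned char c = static_cast<unsigned char>(sample[i]);
            if ((c & 0xC0) != 0x80) {
                return false;             // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minValue) return false;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
    }
    return true;
}
```

Pure ASCII passes this test as well, since ASCII is a subset of UTF-8; text in ISO 8859-1 with accented characters usually fails it quickly.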

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 12 '07 #7

Gianni Mariani

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.
Really ? Most text files I see don't have any characters beyond the
ASCII set which would make them ASCII.

Really. You must live a very parochial life.

What is with you French ? Nuking the pacific is not enough ?

... I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.

I think my claim is valid: most, i.e. 50% or more, of the text files I use
are ASCII. If it wasn't for your .sig having a few 8859-1 characters
in it, your posts would be ASCII as well.

....

Even the UTF-16 files I've seen are far and few between.

Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".

Still, even on Windows, most text files are created as 8 bit. The
only tool I use regularly that produces UTF-16 files is regedit,
although it will read UTF-8 files correctly.

I suspect very few applications will read UTF-16 in a conforming way.
I don't know if ISO-10646 has been updated, but a while back, UTF-16 was a
stateful encoding (it still is for all intents and purposes). Any
time you read a reversed BOM you need to swap endianness. I have met
very few programmers that know what a surrogate pair is.
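
For readers who have not met either term, here is a small C++ sketch that takes the byte order from a leading BOM and combines surrogate pairs into code points. It is deliberately simplified: it only looks for a BOM at the start, assumes big-endian when there is none, and replaces unpaired surrogates with U+FFFD. Those choices, and the name `decodeUtf16`, are assumptions of this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-16 bytes into Unicode code points.
std::vector<char32_t> decodeUtf16(const std::string& bytes)
{
    std::vector<char32_t> out;
    bool littleEndian = false;
    std::size_t i = 0;

    // Reads the 16-bit code unit at `pos` in the current byte order.
    auto unitAt = [&](std::size_t pos) -> unsigned {
        const unsigned a = static_cast<unsigned char>(bytes[pos]);
        const unsigned b = static_cast<unsigned char>(bytes[pos + 1]);
        return littleEndian ? ((b << 8) | a) : ((a << 8) | b);
    };

    if (bytes.size() >= 2) {
        const unsigned first = unitAt(0);          // read as big-endian
        if (first == 0xFEFF) {
            i = 2;                                 // BOM in expected order
        } else if (first == 0xFFFE) {
            littleEndian = true;                   // reversed BOM: swap bytes
            i = 2;
        }
    }

    while (i + 1 < bytes.size()) {
        const unsigned unit = unitAt(i);
        i += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < bytes.size()) {
            const unsigned low = unitAt(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i += 2;                            // consume the low surrogate
                out.push_back(static_cast<char32_t>(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)));
                continue;
            }
        }
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(0xFFFD);                 // unpaired surrogate
        } else {
            out.push_back(static_cast<char32_t>(unit));
        }
    }
    return out;
}
```

A surrogate pair is simply two 16-bit units that together encode one character above U+FFFF, which is why code that treats every 16-bit unit as a character breaks on such text.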

I'd like to believe that utf-8
will become the default text format

I would too, but given the passive that has to be taken into
account, I don't realistically expect it to happen any time
soon.

Well, there are a lot of websites that claim to push UTF-8, and most
browsers support UTF-8 well. Even bidi selection works like it should,
which is quite cool.

and there are a few tests to
determine the likliness of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).

Actually, UTF-8 isn't that difficult. If the first 500 some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.

Yes. That's right. You need to have a lib that is robust enough to
tell you.

May 12 '07 #8

ajk

On 10 May 2007 09:58:41 -0700, li**@ubootsuxx.de wrote:

Hi folks,

I am new to Googlegroups. I asked my questions at other forums, since
now.

I have an important question: I have to check files if they are
binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
check a file an find out if the file is binary or text?

Thanks for your help.

It depends a bit on what you mean by "binary".

If you are under Windows you can determine if a file is an .exe-file
by reading the first few bytes in the file. Strictly speaking all
files are stored in binary format and it is a matter of interpreting
the contents.
/ajk
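
A minimal C++ sketch of that check, assuming the well-known two-byte "MZ" signature that Windows and DOS executables start with. The function names are hypothetical, and a match is still only strong evidence, not proof.

```cpp
#include <fstream>
#include <string>

// True when the given bytes start with the "MZ" signature.
bool hasMzSignature(const std::string& header)
{
    return header.size() >= 2 && header[0] == 'M' && header[1] == 'Z';
}

// Opens the file in binary mode and tests its first two bytes.
// Returns false when the file cannot be opened or is too short.
bool fileHasMzSignature(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char first[2] = {0, 0};
    in.read(first, 2);
    return in.gcount() == 2 && hasMzSignature(std::string(first, 2));
}
```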

May 12 '07 #9

osmium

"ajk" writes:

I am new to Googlegroups. I asked my questions at other forums, since
now.

I have an important question: I have to check files if they are
binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
check a file an find out if the file is binary or text?

Thanks for your help.

Depends a bit what you mean with "binary"

If you are under Windows you can determine if a file is an .exe-file
by reading the first few bytes in the file. Strictly speaking all
files are stored in binary format and it is a matter of interpreting
the contents.

Since he posted the question to a.l.c++, we assume he wants an answer that is
appropriate within the context of that language. I think you should think more
deeply about the difference between "highly likely" and *is*.

May 12 '07 #10

James Kanze

On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.
Really ? Most text files I see don't have any characters beyond the
ASCII set which would make them ASCII.

Really. You must live a very parochial life.

What is with you French ? Nuking the pacific is not enough ?

Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. From what I've seen of other languages, this seems to
be the usual case. Long before Unicode, different regions
developed different encodings to handle non-US ASCII characters,
because a definite need for it was felt.

... I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.

I think my claim is valid, most, i.e. 50% or more of text files I use
are ASCII. If it wasn't for your .sig having a few 8859-1 characters
in it, your posts would be ASCII as well.

Not all my posts. I frequently post to fr.comp.lang.c++ and
de.comp.lang.iso-c++ as well, and my posts there contain
characters which are not ASCII.

Formally, of course, the issue is far from simple. If you're
dealing with text data over the network, you have to be ready to
handle different code sets. In practice, most protocols will
insist on either one of the Unicode encodings or an encoding
which shares the first 128 characters with ASCII for the start
of the headers, until you've transmitted the information as to
which encoding you are actually using. And if you know that it
is text, and that it starts with a header, picking between
UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and a byte encoding is
trivial, and that allows you to get through until you've read
the real encoding.
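
The "trivial" choice can be sketched in C++ by looking at where the zero bytes fall in the first four bytes. This assumes there is no BOM and that the header starts with ASCII characters; the enum and function names are made up for this example.

```cpp
#include <string>

enum class EncodingFamily {
    Utf32Be, Utf32Le, Utf16Be, Utf16Le, ByteOriented, Unknown
};

// Guesses the encoding family from the zero-byte pattern of the first
// four bytes, assuming the text starts with ASCII characters.
EncodingFamily guessFamily(const std::string& start)
{
    if (start.size() < 4) {
        return EncodingFamily::Unknown;
    }
    const bool z0 = start[0] == '\0';
    const bool z1 = start[1] == '\0';
    const bool z2 = start[2] == '\0';
    const bool z3 = start[3] == '\0';

    if (z0 && z1 && z2 && !z3)    return EncodingFamily::Utf32Be;      // 00 00 00 xx
    if (!z0 && z1 && z2 && z3)    return EncodingFamily::Utf32Le;      // xx 00 00 00
    if (z0 && !z1 && z2 && !z3)   return EncodingFamily::Utf16Be;      // 00 xx 00 xx
    if (!z0 && z1 && !z2 && z3)   return EncodingFamily::Utf16Le;      // xx 00 xx 00
    if (!z0 && !z1 && !z2 && !z3) return EncodingFamily::ByteOriented; // xx xx xx xx
    return EncodingFamily::Unknown;
}
```

Once the family is known, the header can be read far enough to find the declared encoding, as described above.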

And of course, most of the newer protocols just say: it has to
be UTF-8.

...

Even the UTF-16 files I've seen are far and few between.

Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".

Still, even on Windows, most text files are created as 8 bit. The
only tool I use regularly that produces utf-16 files in regedit
although it will read utf-8 files correctly.

I suspect very few applications will read utf-16 in a conforming way.
I don't if ISO-10646 has been updated, but a while back, utf-16 was a
stateful encoding (it still is for all intents and purposes). Any
time you read a reversed BOM you need to swap endianness. I have met
very few programmers that know what a surrogate pair is.

I have met very few programmers who even know that there exist
character sets which aren't encoded using single, 8 bit
characters. I'm not saying that ignorance isn't widespread,
but I will try to fight it, whenever I can.

I'd like to believe that utf-8
will become the default text format

I would too, but given the passive that has to be taken into
account, I don't realistically expect it to happen any time
soon.

Well. there are alot of websites that claim to push utf-8 and most
browsers support utf-8 well - even bidi selection works like it should
which is quite cool

It's making headway. But a lot of code and text is old code and
text. And it's not going to go away anytime soon.

and there are a few tests to
determine the likliness of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).

Actually, UTF-8 isn't that difficult. If the first 500 some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.

Yes. That's right. You need to have a lib that is robust enough to
tell you.

Or write one yourself:-).

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 12 '07 #11

Gianni Mariani

On May 13, 7:39 am, James Kanze <james.ka...@gmail.com> wrote:

On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:
On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.
Really ? Most text files I see don't have any characters beyond the
ASCII set which would make them ASCII.
Really. You must live a very parochial life.
What is with you French ? Nuking the pacific is not enough ?

Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. ...

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist ?

Because someone does not use accented characters one is now
"parochial".

And because someone does not agree with you one is "inexperienced".

Yup. Sounds French to me. If you can't use facts, use personal
attacks.

... From what I've seen of other languages, this seems to
be the usual case. Long before Unicode, different regions
developed different encodings to handle non-US ASCII characters,
because a definite need for it was felt.

ISO-8859-1, -2 ... -15, JIS, ShiftJIS, EUC-*, ISO-2022, Big5, KOI-8
are ones I have personally worked with. It was a mess. That's why I
pushed for Unicode (utf-8) adoption as much as I could. Many file
formats became utf-8 because I suggested and explained to developers
what they needed to do otherwise and believe me, it was not easy to
convince people to use utf-8.

One of the nicest but most underused features of Unicode text is language
tagging. A Unicode text string is able to tell you what language it
is in (meaning that all Unicode text is stateful), but very few people
implement it.


... I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.
I think my claim is valid, most, i.e. 50% or more of text files I use
are ASCII. If it wasn't for your .sig having a few 8859-1 characters
in it, your posts would be ASCII as well.

Not all my posts. I frequently post to fr.comp.lang.c++ and
de.comp.lang.iso-c++ as well, and my posts there contain
characters which are not ASCII.

That's nice. You have such a colorful world with that accented É, and
that eszett character 'ß' is the pivot of the spice of life.
....

And of course, most of the newer protocols just say: it has to
be UTF-8.

That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

...
Even the UTF-16 files I've seen are far and few between.
Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".
... I have met
very few programmers that know what a surrogate pair is.

I have met very few programmers who even know that there exist
character sets which aren't encoded using single, 8 bit
characters. I'm not saying that ignorance isn't wide spread,
but I will try to fight it, whenever I can.

Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8. Some code will break because it
splits characters or it compares un-normalized strings, but these
problems are far easier to deal with than the mish-mash of encodings
in the past.
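
To illustrate the "splits characters" problem, here is a small C++ sketch that shortens a UTF-8 string to a byte limit without cutting a multi-byte sequence in half. It assumes the input is already valid UTF-8; the name `truncateUtf8` is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Returns the longest prefix of `text` that is at most `maxBytes` long
// and does not end in the middle of a UTF-8 sequence.
std::string truncateUtf8(const std::string& text, std::size_t maxBytes)
{
    if (text.size() <= maxBytes) {
        return text;
    }
    std::size_t cut = maxBytes;
    // Continuation bytes have the bit pattern 10xxxxxx. If the first byte
    // that would be dropped is one of them, the cut is inside a character,
    // so move back to the start of that character.
    while (cut > 0
           && (static_cast<unsigned char>(text[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return text.substr(0, cut);
}
```

A plain `text.substr(0, maxBytes)` is the kind of code that breaks: it can leave half of an accented character at the end of the result.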


I'd like to believe that utf-8
will become the default text format
I would too, but given the passive that has to be taken into
account, I don't realistically expect it to happen any time
soon.
Well. there are alot of websites that claim to push utf-8 and most
browsers support utf-8 well - even bidi selection works like it should
which is quite cool

It's making headway. But a lot of code and text is old code and
text. And it's not going to go away anytime soon.

Do you normalize your unicode strings ? Do you apply state from
unicode language tags across all strings you extract from a stream of
unicode characters ?


and there are a few tests to
determine the likliness of a file being utf-8 (and no, it's probably
not a BOM at the beginning of the file).
Actually, UTF-8 isn't that difficult. If the first 500 some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
Yes. That's right. You need to have a lib that is robust enough to
tell you.

Or write one yourself:-).

You have probably used one I wrote. Do you know where the "-l" in
iconv came from ?

May 12 '07 #12

Ian Collins

Gianni Mariani wrote:

On May 13, 7:39 am, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:

Really. You must live a very parochial life.
What is with you French ? Nuking the pacific is not enough ?
Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. ...

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist ?

Because someone does not use accented characters one is now
"parochial".

Why all the crap? Just because you and I don't see many text files with
extended character sets doesn't mean they aren't in widespread use.

If you want to pick a fight, find a rough bar.

--
Ian Collins.

May 12 '07 #13

Gianni Mariani

On May 13, 8:56 am, Ian Collins <ian-n...@hotmail.com> wrote:

Gianni Mariani wrote:
On May 13, 7:39 am, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:

Really. You must live a very parochial life.
What is with you French ? Nuking the pacific is not enough ?
Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. ...

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist ?

Because someone does not use accented characters one is now
"parochial".

Why all the crap?

Is that a technical term ?

... Just because you and I don't see many text files with
extended character sets, doesn't mean that aren't in widespread use.

The claim by James was that today "ASCII is pretty much
inexistant(sic)". Which is blatantly wrong. Having pointed that out
to him, he shoots back using "parochial" or "inexperienced" to justify
himself.

James, being of German and French background, I could hope for a more
Swiss-neutral attitude but it appears that we have a classic Parisian
arrogance with a German bureaucratic mind-set. I haven't met too many
of these guys around.

If you want to pick a fight, find a rough bar.

You're right, I should have known better.

So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.

May 12 '07 #14

Gianni Mariani

On May 12, 11:42 pm, ajk <f...@nomail.com> wrote:

On 10 May 2007 09:58:41 -0700, l...@ubootsuxx.de wrote:

...Strictly speaking all
files are stored in binary format and it is a matter of interpreting
the contents.

Strictly speaking, that is not true; it depends on who is doing the
interpretation. Some systems (VMS) didn't allow you to read the binary
stream of all files and would have "Record Management Services" (RMS)
get in the way. Those days are more or less gone (thank Unix).

May 13 '07 #15

Gianni Mariani

On May 13, 8:46 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

Perhaps bigoted is the wrong word....

http://groups.google.com/group/comp....02fe2ee90c94d7

At least I can say I called this prophetically.

May 13 '07 #16

Markus Schoder

On Sat, 12 May 2007 15:46:12 -0700, Gianni Mariani wrote:

On May 13, 7:39 am, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:
On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.
Really ? Most text files I see don't have any characters beyond
the ASCII set which would make them ASCII.
Really. You must live a very parochial life.
What is with you French ? Nuking the pacific is not enough ?

Racist, on top of it. I've worked in both France and Germany, and it
is a fact of life that both languages have characters which aren't
present in ASCII, but which are more or less necessary if the text is
to be understood, or at least appear normal. ...

OK, the French didn't nuke the Pacific now ... and by claiming they did
one is now racist ?

Because someone does not use accented characters one is now "parochial".

And because someone does not agree with you one is "inexperienced".

Yup. Sounds French to me. If you can't use facts, use personal
attacks.

Funny how nationalism rears its ugly head in the most unlikely places.

Welcome to my kill file.

--
Markus Schoder

May 13 '07 #17

Gianni Mariani

On May 13, 1:45 pm, Markus Schoder <a3vr6dsg-use...@yahoo.de> wrote:
....

Funny how nationalism rears its ugly head in the most unlikely places.

Welcome to my kill file.

That's usually written *PLONK*.

I welcome our new kill file overloads.

May 13 '07 #18

James Kanze

On May 13, 12:46 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 7:39 am, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 2:51 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 12, 7:18 pm, James Kanze <james.ka...@gmail.com> wrote:
On May 12, 1:32 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:
On May 12, 6:14 am, James Kanze <james.ka...@gmail.com> wrote:
...
In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding.
Really ? Most text files I see don't have any characters beyond the
ASCII set which would make them ASCII.
Really. You must live a very parochial life.
What is with you French ? Nuking the pacific is not enough ?

Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. ...

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist ?

What does nuking the Pacific have to do with anything? It's
racist to condemn all French because some idiotic government
officials do something stupid. If you're going to judge
everyone by their government, what would one say about the
Americans today?

Because someone does not use accented characters one is now
"parochial".

Because one doesn't take into account that they exist, one is
very parochial.

[...]

And of course, most of the newer protocols just say: it has to
be UTF-8.

That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

By who? I think that there is a consensus that UTF-8 is the way
to go. The problem is that reality isn't following that
consensus very quickly, and that as soon as a computer is
connected to the network, it has to deal with all sorts of weird
encodings. It's a lot of extra work, for everyone involved, but
that's life.

[...]

Life with Unicode is much easier. Surprising little code really needs
to care that it is parsing utf-8.

Are you kidding? What about code which uses e.g. "isalpha()".

Some code will break because it splits characters or it
compares un-normalized strings, but these problems are far
easier to deal with than the mish-mash of encodings in the
past.

Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.
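
To show the kind of problem meant here, a minimal C++ sketch: `std::isalpha` classifies one byte at a time, so it cannot see a letter that UTF-8 encodes in two bytes. The function name `countAlphaBytes` is made up for this example, and the code assumes the program is still in the default "C" locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Counts the bytes that std::isalpha accepts in the current C locale.
std::size_t countAlphaBytes(const std::string& text)
{
    std::size_t count = 0;
    for (char ch : text) {
        // Cast first: passing a negative char value to std::isalpha
        // is undefined behaviour.
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "Sémard" (`"S\xC3\xA9mard"`), this counts 5 in the "C" locale, although a reader sees 6 letters: the two bytes of 'é' are not letters as far as `isalpha` is concerned.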

Actually, UTF-8 isn't that difficult. If the first 500 some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
Yes. That's right. You need to have a lib that is robust enough to
tell you.

Or write one yourself:-).

You have probably used one I wrote. Do you know where the "-l" in
iconv came from ?

What's iconv?

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 13 '07 #19

James Kanze

On May 13, 1:51 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 8:56 am, Ian Collins <ian-n...@hotmail.com> wrote:
Gianni Mariani wrote:

... Just because you and I don't see many text files with
extended character sets, doesn't mean that aren't in widespread use.

More significantly, the software which generated what you are
processing as "pure ASCII" probably was actually using some
extended code set. There is no support for "pure ASCII" under
Linux, as far as I can see, for example. The reality is that if
your software doesn't correctly handle characters with a bit 7
set, it is broken, because even in America, most of the tools
can easily generate such files.

I know that I have a couple of files which contain a 'ÿ' (y with
a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
many programs treat it as an end of file. Would you (or Gianni,
for that matter) consider this "correct", even if the program
didn't have to deal with accented characters per se? Would you
(or Gianni) consider it OK to not test this (limit) case,
knowing that it is a frequent error?
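
The usual cause of this error is storing the result of a read in a plain `char` before comparing it with EOF: on platforms where `char` is signed, byte 0xFF ('ÿ' in ISO 8859-1) then compares equal to EOF. A minimal C++ sketch of the correct pattern, with hypothetical function names:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts the bytes in a stream. The result of get() is kept in int_type,
// so byte 0xFF stays distinguishable from end of file.
//
// The broken variant looks like this:
//     char c;
//     while ((c = in.get()) != EOF) { ... }   // may stop early at 0xFF
std::size_t countBytes(std::istream& in)
{
    typedef std::istream::traits_type Traits;
    std::size_t count = 0;
    for (Traits::int_type c = in.get();
         !Traits::eq_int_type(c, Traits::eof());
         c = in.get()) {
        ++count;
    }
    return count;
}

// Convenience overload for in-memory data.
std::size_t countBytes(const std::string& data)
{
    std::istringstream in(data);
    return countBytes(in);
}
```

A test file containing 0xFF in the middle, as described above, is a cheap way to check that a program gets this right.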

The claim by James was that today "ASCII is pretty much
inexistant(sic)". Which is blatantly wrong.

Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)

Having pointed that out to him, he shoots back using
"parochial" or "inexperienced" to justify himself.

James, being of German and French background,

James, being born and raised in the United States, and still
holding an American passport...

I could hope for a more
Swiss-neutral attitude but it appears that we have a classic Parisian
arrogance with a German bureaucratic mind-set.

More racism. I've not encountered any arrogance in Paris, and
I've not found Germany to be any more bureaucratic than anywhere
else.

People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.

[...]

So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.

Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 13 '07 #20

Gianni Mariani

On May 13, 9:09 pm, James Kanze <james.ka...@gmail.com> wrote:
....

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist ?

What does nuking the Pacific have to do with anything.

It's arrogant, just like you are being.

... It's
racist to condemn all French because some idiotic government
officials do something stupid. If you're going to judge
everyone by their government, what would one say about the
Americans today?

The majority of US citizens voted in the scariest U.S. government in
my living history.
- the "Patriot Act" - we started seeing the worst of McCarthyism coming
back.
- The VP getting the biggest defence contracts to the company he used
to run
- Letting Microsoft off the hook for criminal activity

I can go on and on. The USA is run by the corps.

Because someone does not use accented characters one is now
"parochial".

Because one doesn't take into account that they exist, one is
very parochial.

Re-read what you wrote and tell me you honestly believe that ASCII
files are non-existent. I have applications that still generate
them. I still generate them. I see them every day. Accusing me of
being parochial is very arrogant and disingenuous.

[...]

And of course, most of the newer protocols just say: it has to
be UTF-8.
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

By who?

The discussion I refer to is archived by google.

... I think that there is a consensus that UTF-8 is the way
to go.

It was not a consensus in 1996.

....

Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8.

Are you kidding? What about code which uses e.g. "isalpha()".

Ok, you need to think a little harder about what you're trying to do.

Some code will break because it splits characters or it
compares un-normalized strings, but these problems are far
easier to deal with than the mish-mash of encodings in the
past.

Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.

There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.

Actually, UTF-8 isn't that difficult. If the first 500 some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
Yes. That's right. You need to have a lib that is robust enough to
tell you.
Or write one yourself:-).
You have probably used one I wrote. Do you know where the "-l" in
iconv came from ?

What's iconv?

man iconv
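
The heuristic quoted above (no illegal UTF-8 sequence in the first 500-odd bytes) could be sketched roughly as follows. This is only an illustration, not how iconv or `file' actually do it; the function name looks_like_utf8 and the 512-byte default window are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if no illegal UTF-8 sequence starts
// within the first `window` bytes of `data`. A sequence that begins inside
// the window is checked to its end, even if it runs past the window.
bool looks_like_utf8(const std::string& data, std::size_t window = 512)
{
    const std::size_t n = data.size() < window ? data.size() : window;
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        if (lead < 0x80) { ++i; continue; }          // plain ASCII byte

        std::size_t extra = 0;      // continuation bytes expected
        unsigned long cp = 0;       // decoded code point
        unsigned long min_cp = 0;   // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;          // stray continuation byte or invalid lead

        if (i + extra >= data.size()) return false;  // sequence cut off by end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate code point
        i += extra + 1;
    }
    return true;
}
```

A file that is pure 7-bit text passes trivially, so a positive answer only means "not contradicted by the bytes examined"; ISO 8859-1 text with accented characters will usually fail quickly, because an accented byte is rarely followed by a valid continuation byte.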

May 13 '07 #21

James Kanze

On May 13, 4:26 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 8:46 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

Perhaps bigoted is the wrong word....

http://groups.google.com/group/comp...._frm/thread/4d...

At least I can say I called this prophetically.

So what are we arguing about? You obviously know that other
code sets exist, and that we have to deal with them. And we
seem to agree with regards to the ideal solution.

The problem is that most vendors don't see any value in
supporting *both* ASCII and a superset of ASCII (e.g. ISO
8859-1, UTF-8, etc.), and because the superset is in practice
necessary, only provide it. You can pretend that you only have
to deal with ASCII, but in practice, you can't prevent
characters in the superset from appearing in your text files,
and a correct program has to deal with them correctly. The
default code set for Windows 8 bits is, I believe, ISO 8859-1.
(I'm not sure that this is by design. It may just be a case of
using UCS-2, and stripping off the top byte.) Which means that
you may have characters in the text which the user thought were
legitimate characters, because they displayed as such on his
screen. And that, even though they are native English speakers:
the name on the binding of the encyclopedia I used when at
school (in northern, rural Illinois---you can't get much more
parochial) started with an Æ, for example, and at least one of my
English teachers spelled 'naïve' with the diaeresis. If the
characters are available (and they are), they will be used. And
a correct program will handle them correctly. In practice, text
encoded in "ASCII" simply doesn't exist today, and programs
which assume that their text input is pure ASCII are simply
broken.

And from the link above, it's obvious that you know this as well
as I do, if not better. So what's the argument about?

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 13 '07 #22

Gianni Mariani

On May 13, 9:24 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 13, 1:51 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 8:56 am, Ian Collins <ian-n...@hotmail.com> wrote:
Gianni Mariani wrote:
... Just because you and I don't see many text files with
extended character sets, doesn't mean they aren't in widespread use.

More significantly, the software which generated what you are
processing as "pure ASCII" probably was actually using some
extended code set. There is no support for "pure ASCII" under
Linux, as far as I can see, for example. The reality is that if
your software doesn't correctly handle characters with a bit 7
set, it is broken, because even in America, most of the tools
can easily generate such files.

I know that I have a couple of files which contain a 'ÿ' (y with
a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
many programs treat it as an end of file. Would you (or Gianni,
for that matter) consider this "correct", even if the program
didn't have to deal with accented characters per se? Would you
(or Gianni) consider it OK to not test this (limit) case,
knowing that it is a frequent error?

Chill for a sec. You said ASCII files are "inexistant(sic)" which I
suspect means nonexistent. I call "bollocks" on that one and the
result is I'm being accused of being "parochial".

The claim by James was that today "ASCII is pretty much
inexistant(sic)". Which is blatantly wrong.

Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)

And this has a bearing on (non)existence of ASCII files how ?

Having pointed that out to him, he shoots back using
"parochial" or "inexperienced" to justify himself.
James, being of German and French background,

James, being born and raised in the United States, and still
holding an American passport...

Ah, even better. A bumbling American with an arrogant Parisian
attitude driven by a German bent for precision. Life sucks sometimes.

I could hope for a more
Swiss-neutral attitude but it appears that we have a classic Parisian
arrogance with a German bureaucratic mind-set.

More racism. I've not encountered any arrogance in Paris, and

Really ? Where have you been in Paris ? Even the French I know
consider Parisians to be generally more arrogant than the rest of
France. Ah, there you go. You look at yourself in the mirror while
in Paris.

I've not found Germany to be any more bureaucratic than anywhere
else.

That's not what Germans say about themselves.

People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.

OK. Again, a very arrogant thing to say.

[...]

So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.

Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.

You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.

May 13 '07 #23

Gianni Mariani

On May 13, 9:48 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 13, 4:26 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 8:46 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.
Perhaps bigoted is the wrong word....
http://groups.google.com/group/comp...._frm/thread/4d...
At least I can say I called this prophetically.

So what are we arguing about?

The (non)existence of ASCII files. You say they don't exist and I say
they do. We're talking about files, code that consumes files.

Pretty stupid thing to be arguing about.

But I suppose you're too busy accusing me of being parochial and I'm
too busy trying to explain why.

May 13 '07 #24

James Kanze

On May 13, 2:03 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 9:24 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 13, 1:51 am, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 8:56 am, Ian Collins <ian-n...@hotmail.com> wrote:
Gianni Mariani wrote:
... Just because you and I don't see many text files with
extended character sets, doesn't mean they aren't in widespread use.

More significantly, the software which generated what you are
processing as "pure ASCII" probably was actually using some
extended code set. There is no support for "pure ASCII" under
Linux, as far as I can see, for example. The reality is that if
your software doesn't correctly handle characters with a bit 7
set, it is broken, because even in America, most of the tools
can easily generate such files.

I know that I have a couple of files which contain a 'ÿ' (y with
a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
many programs treat it as an end of file. Would you (or Gianni,
for that matter) consider this "correct", even if the program
didn't have to deal with accented characters per se? Would you
(or Gianni) consider it OK to not test this (limit) case,
knowing that it is a frequent error?

Chill for a sec. You said ASCII files are "inexistant(sic)" which I
suspect means nonexistent. I call "bollocks" on that one and the
result is I'm being accused of being "parochial".

If you refuse to recognize that most of the files you actually
have to deal with were not written in ASCII, but in some
superset of ASCII, you're living in some isolated, very
backwards community. Neither Linux nor Windows even support
"ASCII" nowadays.

If you consider that ASCII is all we'll ever need, you're being
very parochial, not looking beyond a very, very limited
community of users.

Those are the facts. Whether you like them or not.

The claim by James was that today "ASCII is pretty much
inexistant(sic)". Which is blatantly wrong.

Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)

And this has a bearing on (non)existence of ASCII files how ?

Well, if ASCII isn't supported by Windows, and it isn't
supported by Linux, it obviously cannot be the general case for
most files.

Having pointed that out to him, he shoots back using
"parochial" or "inexperienced" to justify himself.
James, being of German and French background,

James, being born and raised in the United States, and still
holding an American passport...

Ah, even better. A bumbling American with an arrogant
Parisian attitude driven by a German bent for precision. Life
sucks sometimes.

Ah, even more blatant racism.

[Lots more racism cut...]

People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.

OK. Again, a very arrogant thing to say.

What's arrogant about calling a spade a spade?

[...]

So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.

Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.

You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.

You may have a choice, but I live in the real world. The files
are there, and I have to deal with them. Whether I like it or
not.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 14 '07 #25

James Kanze

On May 13, 1:40 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 13, 9:09 pm, James Kanze <james.ka...@gmail.com> wrote:
...

[...]

... It's
racist to condemn all French because some idiotic government
officials do something stupid. If you're going to judge
everyone by their government, what would one say about the
Americans today?

The majority of US citizens voted in the scariest U.S. government in
my living history.
- the "Patriot Act" - we started seeing the worst of McCarthyism coming
back.
- The VP getting the biggest defence contracts to the company he used
to run
- Letting Microsoft off the hook for criminal activity

I can go on and on. The USA is run by the corps.

OK, you aren't racist. You just hate everybody:-).

Seriously, do you really believe that you can judge people by
their government? Even in so-called democracies, like France
and the USA. I've lived in three different countries, and I've
very close contacts with a fourth (my wife is Italian). I've
found people to be pretty much the same everywhere, and in the
vast majority, I've found them to be pretty decent.

Because someone does not use accented characters one is now
"parochial".

Because one doesn't take into account that they exist, one is
very parochial.

Re-read what you wrote and tell me you honestly believe that ASCII
files are non-existent.

I haven't seen one for ages. Neither Linux nor Windows even
supports them.

I have applications that still generate them. I still
generate them. I see them every day. Accusing me of being
parochial is very arrogant and disingenuous.

So what machine are you using? Posix requires 8 bit characters,
and it doesn't have a function "isascii" anymore---it requires
full support for an eight bit character set. And of course,
correct code will not fail because some file happens to contain
an accented character. You can pretend that your files are
ASCII, but that's just pretending.

[...]

And of course, most of the newer protocols just say: it has to
be UTF-8.
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

By who?

The discussion I refer to is archived by google.

... I think that there is a consensus that UTF-8 is the way
to go.

It was not a consensus in 1996.

That's a long time ago. (In our profession, at least.)

...

Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8.

Are you kidding? What about code which uses e.g. "isalpha()".

Ok, you need to think a little harder at what you're trying to do.

In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.
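
To make the "isalpha()" problem concrete, here is a small sketch, assuming the program runs in the default "C" locale; count_alpha is a hypothetical name for the example. The classifier works on one byte at a time, so it cannot see a multibyte UTF-8 character as a letter, and it must be given an unsigned char value, since passing a negative char is undefined behaviour.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that std::isalpha accepts in the
// current C locale. The cast to unsigned char is required: any byte >= 0x80
// is negative where char is signed, and std::isalpha is only defined for
// values representable as unsigned char (or EOF).
std::size_t count_alpha(const std::string& s)
{
    std::size_t n = 0;
    for (char ch : s)
        if (std::isalpha(static_cast<unsigned char>(ch)))
            ++n;
    return n;
}
```

With the UTF-8 string "na\xC3\xAFve" (naïve) this counts 4 letters in the "C" locale, not 5: the two bytes of 'ï' are each rejected. Getting 5 requires decoding to code points first and classifying those, which is the job a library such as ICU takes on.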

Some code will break because it splits characters or it
compares un-normalized strings, but these problems are far
easier to deal with than the mish-mash of encodings in the
past.

Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.

There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.

They seem to have done the most work in this direction to date.
On the other hand, they use UTF-16, which doesn't seem a
judicious choice today: UTF-32 or UTF-8 would seem preferable,
depending on what the program is doing.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 14 '07 #26

Gianni Mariani

On May 14, 5:24 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 13, 1:40 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

....
A tad off topic. I suppose we digressed off topic a few posts ago.

I can go on and on. The USA is run by the corps.

OK, you aren't racist. You just hate everybody:-).

I have an opinion. The older I get, the less critical I am of the
opinions I hold but the stronger the opinions are.

There is another theory I have about governments. You get the
government you deserve.

Seriously, do you really believe that you can judge people by
their government? Even in so-called democracies, like France
and the USA. I've lived in three different countries, and I've
very close contacts with a fourth (my wife is Italian). I've
found people to be pretty much the same everywhere, and in the
vast majority, I've found them to be pretty decent.

Same here. Almost exactly but they are all European.Roman in
heritage. Try spending some serious time in India, Thailand, PRC or
Hong Kong though, and then some more time in the Saudi, or UAE or
Nigeria even. The cultural skew takes some time to come to grips
with.

Just because a Parisian is arrogant to me doesn't mean I think the
worst of them. I talk about one particular incident in a Paris
restaurant all the time. It's very funny and if the listeners take
the time to think about it, it shows a very positive pride the French
have about themselves. I sometimes wish I had that a little more as
well. Nonetheless, it's arrogant and the arrogance comes out in other
ways. I have a similar incident in Switzerland (Neuchatel) and I can
tell you that I was annoyed, mostly with myself because I could not
fault the Swiss shop assistant for bending over backwards to help.

If I tried to pull off what the Parisians (allegedly) do where I live
now, I would probably come off as a total dick.

Because someone does not use accented characters one is now
"parochial".
Because one doesn't take into account that they exist, one is
very parochial.
Re-read what you wrote and tell me you honestly believe that ASCII
files are non-existent.

I haven't seen one for ages. Neither Linux nor Windows even
supports them.

Never mind. It matters not. The point I make is not so deep and
meaningful.

I have applications that still generate them. I still
generate them. I see them every day. Accusing me of being
parochial is very arrogant and disingenuous.

So what machine are you using? Posix requires 8 bit characters,
and it doesn't have a function "isascii" anymore---it requires
full support for an eight bit character set. And of course,
correct code will not fail because some file happens to contain
an accented character. You can pretend that your files are
ASCII, but that's just pretending.

You talk about processing technology, I talk about actual files. See,
not so deep and meaningful. It all works back to the context of the
original statement. Way too much energy spent here.

[...]
And of course, most of the newer protocols just say: it has to
be UTF-8.
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.
By who?
The discussion I refer to is archived by google.

... I think that there is a consensus that UTF-8 is the way
to go.
It was not a consensus in 1996.

That's a long time ago. (In our profession, at least.)

...
Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8.
Are you kidding? What about code which uses e.g. "isalpha()".
Ok, you need to think a little harder at what you're trying to do.

In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.

At one point I was asked to give a recommendation on
internationalizing an application. It was a web browser. My default
answer was "wide chars", etc. I examined the code and realized I'd
given the project a death sentence because there was no way the
project would recover so I went back to the team and said - JUST
KIDDING. What you need is utf8 with one of these special string
classes that converts a string transparently between utf-8 and utf-16
whenever it needs to and slowly move more of the application over to
wide char code. The code was migrated when it needed to and much of
the application didn't need touching.

The main point of this was that the codebase never broke uncontainably
and its i18n support improved incrementally until it was adequate
without needing to interrupt development of other parts of the
product.

Some code will break because it splits characters or it
compares un-normalized strings, but these problems are far
easier to deal with than the mish-mash of encodings in the
past.
Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.
There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.

They seem to have done the most work in this direction to date.
On the other hand, they use UTF-16, which doesn't seem a
judicious choice today: UTF-32 or UTF-8 would seem preferable,
depending on what the program is doing.

Yeah, I recall having the same thought now. You should find this one
amusing:

http://mail-archives.apache.org/mod_...orconet.com%3e

Time for a new ICU.

May 14 '07 #27

Gianni Mariani

On May 14, 10:43 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

http://mail-archives.apache.org/mod_...orconet.com%3e

http://tinyurl.com/29sglg

Google munged my url...

May 14 '07 #28

Gianni Mariani

On May 14, 5:07 pm, James Kanze <james.ka...@gmail.com> wrote:
....

Those are the facts. Whether you like them or not.

aha. Fact is, - most of *MY* text files are ASCII.

The claim by James was that today "ASCII is pretty much
inexistant(sic)". Which is blatantly wrong.
Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)
And this has a bearing on (non)existence of ASCII files how ?

Well, if ASCII isn't supported by Windows, and it isn't
supported by Linux, it obviously cannot be the general case for
most files.

aha. So. Fact is, - most of *MY* text files are ASCII.

....

Ah, even more blatant racism.

aha. I'm not sure if that's the PC American or the arrogant Parisian
talking.

[Lots more racism cut...]

People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.
OK. Again, a very arrogant thing to say.

What's arrogant about calling a spade a spade?

.... "Have you stopped beating your wife ?" ....

That's an example of what you do by what you say is "calling a spade a
spade".

I may, or may not be racist, but I am convinced that accusing someone
of racism without knowing (or wanting to know) who they are or what
they do is a display of arrogance.

[...]
So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.
Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.
You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.

You may have a choice, but I live in the real world. The files
are there, and I have to deal with them. Whether I like it or
not.

It seems you have made a choice.

I don't know what product you work on or even have decision making
power on, but, there are many ways to slice the problem.

You know, we really have to stop meeting this way, people might get
the wrong idea. I need to move on. Feel free to make whatever damage
you like, at this point I'm going to cut-n-run.

It's been fun.

May 14 '07 #29

James Kanze

On May 14, 2:43 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

On May 14, 5:24 pm, James Kanze <james.ka...@gmail.com> wrote:

On May 13, 1:40 pm, Gianni Mariani <gi3nos...@mariani.ws> wrote:

[...]

Seriously, do you really believe that you can judge people by
their government? Even in so-called democracies, like France
and the USA. I've lived in three different countries, and I've
very close contacts with a fourth (my wife is Italian). I've
found people to be pretty much the same everywhere, and in the
vast majority, I've found them to be pretty decent.

Same here. Almost exactly but they are all European.Roman in
heritage. Try spending some serious time in India, Thailand, PRC or
Hong Kong though, and then some more time in the Saudi, or UAE or
Nigeria even. The cultural skew takes some time to come to grips
with.

Yes, but it's really still very superficial. Human nature is
human nature. It does make it more difficult to recognize the
similarities, however.

[...]

I have applications that still generate them. I still
generate them. I see them every day. Accusing me of being
parochial is very arrogant and disingenuous.

So what machine are you using? Posix requires 8 bit characters,
and it doesn't have a function "isascii" anymore---it requires
full support for an eight bit character set. And of course,
correct code will not fail because some file happens to contain
an accented character. You can pretend that your files are
ASCII, but that's just pretending.

You talk about processing technology, I talk about actual files. See,
not so deep and meaningful. It all works back to the context of the
original statement. Way too much energy spent here.

They're related, but my real point was different. Perhaps if I
stated it something along the lines "a correct program cannot
assume that any file it reads contains only characters in the
ASCII character set."

It's a conceptual point of view. When I first started working
on Unix, we pretty much considered that all text files were
ASCII. In some ways, it was false even then; the OS never made
the slightest guarantee, and characters with the 8th bit set did
creep into text files from time to time. But we had a function,
isascii(), which we used to test for such characters, and if
they were present, we rejected the file as being corrupt.

Today, of course, we no longer have that function, and every
editor, on every system, is capable of generating accented
characters. So the files aren't really ASCII, but whatever
encoding the editor was generating (ISO 8859-1 seems very
common). And of course, a correct program will handle them
correctly.

Now, you may say that all, or almost all of the files you have
to deal with actually only contain characters in the subset
common to ASCII, the ISO 8859 encodings and UTF-8. That may be
(although it's not the case where I work, and hasn't been for
well over 10 years). But I insist that that is not an
appropriate way of thinking about it. Those files were created
by an editor, or some other program, which is perfectly capable
of creating characters which are not in ASCII. And considering
them "pure" ASCII will lead to carelessness in programming, and
an increased risk of errors.

In that sense, ASCII files simply do not exist. There is no way
you can open a file, and say, this file is pure ASCII, and
cannot possibly contain anything else. I also suspect that it
is exceedingly rare that you can open a text file saying: this
file should be pure ASCII, and anything else means it is
corrupt. There are doubtlessly exceptions to this, particularly
with regards to machine generated data. But most of the
exceptions I know go even further: if the file contains, say, a
list of floating point values, then it is corrupt if it contains
any alpha character, not just if it contains an accented
character.
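
For comparison, the kind of check the old isascii() test amounted to can be sketched in a few lines; is_seven_bit is a hypothetical name for this example. Note what it can and cannot tell you: it reports on the bytes actually present, and says nothing about what the program that produced them was capable of writing.

```cpp
#include <string>

// Hypothetical helper: true if every byte in `data` has the top bit clear,
// i.e. the content lies in the subset common to ASCII, the ISO 8859
// encodings and UTF-8.
bool is_seven_bit(const std::string& data)
{
    for (char ch : data)
        if (static_cast<unsigned char>(ch) > 0x7F)
            return false;
    return true;
}
```

A false result here is not evidence of corruption; it only means the reader has to decide which superset of ASCII it is looking at.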

[...]
...
Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8.
Are you kidding? What about code which uses e.g. "isalpha()".
Ok, you need to think a little harder at what you're trying to do.

In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.

At one point I was asked to give a recommendation on
internationalizing an application. It was a web browser. My default
answer was "wide chars", etc. I examined the code and realized I'd
given the project a death sentence because there was no way the
project would recover so I went back to the team and said - JUST
KIDDING. What you need is utf8 with one of these special string
classes that converts a string transparently between utf-8 and utf-16
whenever it needs to and slowly move more of the application over to
wide char code. The code was migrated when it needed to and much of
the application didn't need touching.

The main point of this was that the codebase never broke uncontainably
and its i18n support improved incrementally until it was adequate
without needing to interrupt development of other parts of the
product.

I presume you're talking about internal representation here. A
Web browser certainly has to deal with a large number of
different external encodings. If I control the entire chain,
there's no doubt that everything would be Unicode, UTF-8
externally, and either UTF-8 or UTF-32 internally, depending on
what I was doing. But I never do control the entire chain: here
at work, the powers that be haven't installed any Unicode fonts
on the machines, so I'm stuck with ISO 8859-1:-(.

Some code will break because it splits characters or it
compares un-normalized strings, but these problems are far
easier to deal with than the mish-mash of encodings in the
past.
Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.
There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.

They seem to have done the most work in this direction to date.
On the other hand, they use UTF-16, which doesn't seem a
judicious choice today: UTF-32 or UTF-8 would seem preferable,
depending on what the program is doing.

Yeah, I recall having the same thought now. You should find this one
amusing:

http://mail-archives.apache.org/mod_...orconet.com%3e

Time for a new ICU.

:-). To be fair to them: when they defined their spec, Unicode
was only 16 bits. Also, any program really treating text
seriously will have to deal with various composite characters
anyway, and handling the surrogates isn't that much more work.

On the other hand, the more I work with such characters, the
more I realize how much you can do directly in UTF-8. Multibyte
characters have a reputation for causing all sorts of problems,
but UTF-8 has addressed some of the issues (and of course, a lot
of the problems are just because the code isn't prepared for
multibyte characters). Once you're handling surrogates and
composite characters, is UTF-8 really any more difficult than
UTF-32?
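
As a rough illustration of working directly in UTF-8, counting code points needs no decoding at all, because continuation bytes are recognisable on their own (they always have the form 10xxxxxx). The sketch below assumes the input is already valid UTF-8; count_code_points is a hypothetical name for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a valid UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (char ch : utf8)
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80)
            ++n;
    return n;
}
```

The composite-character issue is the same in either encoding: "naïve" written with a precomposed 'ï' has 5 code points, while the same word written as 'i' followed by a combining diaeresis (U+0308) has 6, whether the text is held as UTF-8 or UTF-32.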

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 15 '07 #30
