Simple high-ascii character encoding

On Thu, 25 Aug 2005 ch****@totalise.co.uk wrote under the
heading:

Simple high-ascii character encoding
Hmmm. What's that supposed to mean in an HTML context?
I have an Html document that declares that it uses the utf-8 character
set.
Terminology again! utf-8 is not a "character set", but a character
encoding scheme of unicode. I can't help it that, way back, MIME chose
the attribute name of "charset=" for this, which in current terminology
is very misleading, but utf-8 still isn't a "character set".
As this document is editable via a web interface I need to make
sure that high-ascii characters that may be accidentally entered
I think you'd benefit from getting rid of this obsolete term
"high-ascii". ASCII is a 7-bit code, containing a mere 95 displayable
characters, whereas the document character set of HTML is Unicode,
containing vastly more characters than ASCII.

Modern OSes often define input methods for wide ranges of these
non-ASCII characters...
are properly represented when the document is served.
Details depend on your OS and editing application, but modern OSes don't
mind storing utf-8 documents, and serving them out as such.
My programming language allows me to get the ascii value for any
individual character
But most of the characters aren't in ASCII, so how could they have
an "ascii value"? Character representation in HTML isn't hard, but
you *do* have to use the terms with some care, if you want to make
sense.
so what I am doing when a change is saved is to look at each character
in the content and if the ascii value for a character > 127
There ARE no ASCII characters with a value above 127 !
then I replace 'character' with '&#AsciiValue;'.

There *are* no ASCII values greater than 127.

Representing non-ASCII characters as &#number; , using their character
number in Unicode, is a feasible approach - but rather voluminous if you
have many of them.
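
A minimal sketch of that approach in Python, for illustration only (the
function name to_numeric_refs is hypothetical and not from this thread).
It assumes the text is already held as Unicode, so the number written
into the reference is the character's Unicode code point and not a value
from some 8-bit code page:

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#N; reference."""
    out = []
    for ch in text:
        code_point = ord(ch)  # Unicode code point, not an "ASCII value"
        if code_point > 127:
            out.append(f"&#{code_point};")
        else:
            out.append(ch)  # ASCII passes through unchanged
    return "".join(out)


print(to_numeric_refs("Caf\u00e9 \u00a9 2005"))  # Caf&#233; &#169; 2005
```

Python's built-in error handler does the same job in one call:
"text".encode("ascii", "xmlcharrefreplace"). Neither version escapes
&, < or >, which are ASCII and need separate handling in HTML.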

I have a checklist that's been quite widely peer-reviewed: I'd
recommend that you work your way down the scenarios, and pick one that
seems to fit your needs.

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

Hope this helps a bit.

Aug 25 '05 #3

Chandy

Yep, clearly I make no sense to people who understand this better than
I do :) Okay, the language returns integer values for the standard as
well as 'extended' ascii characters (as detailed, for example, on
http://www.asciitable.com/). My document is not public but starts
with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The system is publishing content in English to the web but is
potentially for world-wide consumption. Generally the extra characters
I have to represent will be items like ®, © and ™ and
some accented letters, but I was wanting to avoid having to have a
lookup of ascii value->Html Entity by just changing the character for
&#Value; when it seemed to have a value that put it outwith the
standard ascii range. I'll re-ask the question 'is this sensible'
while I read through the document you referred to.

Thanks!

Chandy

Aug 25 '05 #4

On 25 Aug 2005, Chandy wrote:

(as detailed, for example, on http://www.asciitable.com/).

| Not Found
| The requested URL /). was not found on this server.

Did you mean http://www.asciitable.com/ ? This is just bullshit!
Please refer to
http://czyborra.com/charsets/iso646.html
http://czyborra.com/charsets/codepages.html
http://czyborra.com/charsets/iso8859.html
for reliable information.

Aug 25 '05 #5

Chandy wrote:

Yep, clearly I make no sense to people who understand this better than
I do :) Okay, the language returns integer values for the standard as
well as 'extended' ascii characters (as detailed, for example, on
http://www.asciitable.com/).

As that page itself says, "it took a while to get a single standard for
these extra characters and hence there are few varying 'extended' sets.
The most popular is presented below." This is all self-contradictory.
The point is there is no character set correctly called "extended
ASCII". Anyone using that term to refer to *a* mapping of a collection
of characters to codes 128-255 is using it because either:

(a) He thinks that "ASCII" itself refers to the numeric range 0-127, and
that "extended ASCII" therefore means, unambiguously, the range 128-255.
This is incorrect, because "ASCII" doesn't in the first place refer to
the range of numbers, it refers to a very specific set of characters (or
control codes) and its *assignment* to those numbers.

(b) He got the impression somewhere that the particular set of
characters he's seen assigned to the range 128-255 is *the* set of
characters so assigned, and that that particular set is known as
"extended ASCII". So what was introduced to MS-DOS users as "extended
ASCII" was the set of Microsoft line draw characters. Many other users
think it refers to the Western Europe (Latin-1) extension. And so forth.
The term is used by different people to refer to what they think it
means. In other words, it doesn't technically mean anything. It's a
misconception.

Aug 25 '05 #6

Harlan Messinger wrote:

As that page itself says, "it took a while to get a single standard for
these extra characters and hence there are few varying 'extended' sets.
The most popular is presented below." This is all self-contradictory.
The point is there is no character set correctly called "extended
ASCII". Anyone using that term to refer to *a* mapping of a collection
of characters to codes 128-255 is using it because either:

[snip]

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

First, a given character may appear in one or more of these schemes and
*not* appear in one or more others. Would that character be an "extended
ASCII" character or not? The answer is that it's a character in some of
those character sets or that's represented in some of those encodings,
and not in others. The question of whether it's an "extended ASCII"
character is meaningless.

Second, a given character may appear in two different character sets but
mapped to different codes. What's the "extended ASCII code" for an em
dash? Well, under the standard Windows character set, an em-dash is
character 151; if you're using Unicode, it's character 8212; and if
you're using ISO-8859-1, it isn't anything at all because the em dash
isn't part of that character set. In other words, again, it's
meaningless to talk about a character's extended ASCII code.
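
The em dash numbers above can be checked with a short Python sketch. It
is only an illustration: "cp1252" is taken here as Python's codec name
for the standard Windows (Western) character set, and the constant name
EM_DASH is made up for the example.

```python
EM_DASH = "\u2014"

# Code point in Unicode.
print(ord(EM_DASH))                  # 8212

# Single byte assigned to it in Windows-1252.
print(EM_DASH.encode("cp1252")[0])   # 151

# ISO-8859-1 has no em dash at all, so encoding fails.
try:
    EM_DASH.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)                     # False
```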

Aug 25 '05 #7

Guy Macon

Harlan Messinger wrote:

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

I was about to ask if anyone had bothered to list all the
different character sets that are identical to ASCII in the
first 127 characters, but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

Aug 25 '05 #8

On Thu, 25 Aug 2005, it was written:

but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

^
(Characters 0 to 127 are the first 128 characters.)

All of these
http://www.unicode.org/Public/MAPPIN...MICSFT/EBCDIC/
http://czyborra.com/charsets/iso646.html#EBCDIC

Aug 25 '05 #9

On Thu, 25 Aug 2005, Harlan Messinger wrote:

To be fair, *any* of the character sets of which ASCII is a subset
can legitimately be called *an* "extension of ASCII".
It could - but it's not a particularly informative statement, as I
hope you'd agree.
Latin-1 is an ASCII extension,
To be pedantic, "Latin-1" defines a repertoire of characters:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you
said Latin-1, I suspect you really meant iso-8859-1, which indeed
has ASCII as its lower half.
as is Unicode.
Indeed.
But still, it makes no sense to speak of "extended ASCII
characters".
Right!
Second, a given character may appear in two different character sets
but mapped to different codes. What's the "extended ASCII code" for
an em dash? Well, under the standard Windows character set, an
em-dash is character 151; if you're using Unicode, it's character
8212; and if you're using ISO-8859-1, it isn't anything at all
because the em dash isn't part of that character set. In other
words, again, it's meaningless to talk about a character's extended
ASCII code.

Right!!

And even in MS-DOS land, which is where this unfortunate phrase
"extended ASCII" seems to have grown, there's a bushel of different
encodings: CP-437 for the USans, CP-850 for "multinational" use (which
contains approximately an MS-DOS encoding of the Latin-1 repertoire,
but organised completely differently than iso-8859-1), plus loads of
national-specific code pages too. I've got an MS-DOS version 6 manual
somewhere which lists page after page of the wretched things.
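
A small Python sketch of how these code pages disagree, offered only as
an illustration (the codec names are Python's; the variable names are
made up for the example). The same byte decodes to different characters
under CP-437 and CP-850, and the same character sits at different
positions in CP-850 and iso-8859-1.

```python
RAW = bytes([0x9B])  # one byte above the ASCII range

# Same byte, two MS-DOS code pages, two different characters.
decoded = {codec: RAW.decode(codec) for codec in ("cp437", "cp850")}
print(decoded)

# Same character (e with acute accent), two different byte values.
E_ACUTE = "\u00e9"
positions = {codec: E_ACUTE.encode(codec)[0] for codec in ("cp850", "iso-8859-1")}
print(positions)
```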

Thank goodness we rarely have to go there these days (except where
some user has blundered and converted DOS to Windows where they ought
not, or failed to do so when they should've).

best

Aug 25 '05 #10

On Thu, 25 Aug 2005, Chandy wrote:

http://www.asciitable.com/

Bleagh.

On cursory inspection, this appears to be the US-National MS-DOS code
page, CP-437. Utterly useless in the modern world: it's absolute
nonsense for them to claim that it's the "most popular", as indeed is
their claim that "it took a while to get a single standard", since
there never *has* been a "single" standard of the kind that they are
talking about. Possibly in the distant future, when this babel of
8-bit character codes has been forgotten, Unicode *will* be that
"single standard". Possibly.

Ho hum

Aug 25 '05 #11

Guy Macon wrote:

Harlan Messinger wrote:

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

I was about to ask if anyone had bothered to list all the
different character sets that are identical to ASCII in the
first 127 characters, but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

EBCDIC, for starters.

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text and special
symbols could be rendered before more sophisticated means became
available. For example, the various Symbols and Wingdings fonts.

Aug 25 '05 #12

Alan J. Flavell wrote:

On Thu, 25 Aug 2005, Harlan Messinger wrote:
To be fair, *any* of the character sets of which ASCII is a subset
can legitimately be called *an* "extension of ASCII".

It could - but it's not a particularly informative statement, as I
hope you'd agree.

Yes. Still, it's been convenient that for purposes of composing in
English most people (pre-Unicode) haven't had to worry about whether
their editor supported a particular encoding because it hasn't mattered
with respect to the common ASCII subset.

Latin-1 is an ASCII extension,

To be pedantic, "Latin-1" defines a repertoire of characters:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you
said Latin-1, I suspect you really meant iso-8859-1, which indeed
has ASCII as its lower half.

I did, and thanks for the adjustment. I'm trying really hard to stop
mixing up character sets and encodings. (By the way--is a "repertoire"
different from a "set"?)

Aug 25 '05 #13

On Thu, 25 Aug 2005, Harlan Messinger wrote:

CP-1047 is the "EBCDIC Latin-1 character encoding". When you said
Latin-1, I suspect you really meant iso-8859-1, which indeed has
ASCII as its lower half.
I did, and thanks for the adjustment. I'm trying really hard to stop
mixing up character sets and encodings.

"character sets" versus "encodings" is yet another layer! - although
that's hardly noticeable with the old 8-bit codings, it gets quite
critical with encodings of Unicode.
(By the way--is a "repertoire" different from a "set"?)

Well, the term "character set" is usually understood to define not
only a particular repertoire of characters, but also the assignment of
each character to a "small" integer number. This assignment is of
course different in EBCDIC-based codings from what it is in
ASCII-based codings, to take the obvious example.

As such, I'd tend to avoid the use of the term "set" to refer to a
character repertoire, if I'm trying to avoid implying a particular
ordering of the characters or their assignment to "small" integers.
The "repertoire" is the unordered selection of characters, without
reference to one or other "character sets" which might be defined
comprising that repertoire.

hope that helps.

Btw, recall that after a certain point, the Latin-x repertoire is
encoded by the iso-8859-y character code, where x is no longer equal
to y. This is because some of the intervening codes weren't for Latin
at all, but for Greek, Arabic, Cyrillic, Hebrew etc. So, for example,
iso-8859-15 is the ISO encoding for Latin-9.
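
One way to see the x/y mismatch, sketched in Python. This relies on the
alias names that Python's codec registry happens to accept ("latin1",
"latin5", "latin9"); they are a convenience of that library, not
official names, and the helper same_codec is hypothetical.

```python
import codecs

# (alias for the Latin-x repertoire, ISO 8859 part that encodes it)
PAIRS = [
    ("latin1", "iso-8859-1"),
    ("latin5", "iso-8859-9"),
    ("latin9", "iso-8859-15"),
]


def same_codec(a: str, b: str) -> bool:
    """True when both names resolve to the same registered codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


for latin, iso in PAIRS:
    print(latin, iso, same_codec(latin, iso))
```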

Aug 25 '05 #14

RobG

Harlan Messinger wrote:
[...]

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

[...]

--
Rob

Aug 25 '05 #15

Henri Sivonen

In article <pQ****************@news.optus.net.au>,
RobG <rg***@iinet.net.au> wrote:

Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.

(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Aug 26 '05 #16

Jukka K. Korpela

RobG wrote:

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

It's easy to out-pedant you here: You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable. (Is horizontal ellipsis just
a presentational form of "..."? The difference is real, anyway, and
English style guides require the use of "spaced dots", and achieving
this without using the horizontal ellipsis character is awkward.)

ASCII also lacks letters like e with acute accent, o with diaeresis, and
letter (ligature) ae, which belong to _some_ forms of written English at
least.

Aug 26 '05 #17

On Thu, 25 Aug 2005, RobG wrote:

Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font
designers used in the past to map alphabets and symbol sets other
than the basic English one to the sub-128 positions so that
foreign text
While we're being pedantic about words, should the phrase 'foreign
text' be 'non-English text'?

Even English uses some characters which are "foreign" to the ASCII
character set.
Or in the context of ASCII, are the two terms identical?

Well, to be extra pedantic, ASCII is an American Standard. They don't
have an exclusive hold on English.

Aug 26 '05 #18

Pierre Goiffon

Andreas Prilop wrote:

but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

^
(Characters 0 to 127 are the first 128 characters.)

All of these
http://www.unicode.org/Public/MAPPIN...MICSFT/EBCDIC/
http://czyborra.com/charsets/iso646.html#EBCDIC

I had to deal with EBCDIC encodings recently. They are used by most IBM
platforms, for example iSeries servers (OS/400).
Just to give an idea of the number of such encodings, here is a list of
the charsets supported by the Open statement in LotusScript
(Notes/Domino R6):

IBM037   US and Canadian English, Dutch, Portuguese
IBM273   German
IBM277   Danish, Norwegian
IBM278   Finnish, Swedish
IBM280   Italian
IBM284   Spanish
IBM285   International English
IBM297   French
IBM420   Arabic
IBM424   Hebrew
IBM500   Intl. English, Latin-1, Albanian, Belgian English, French
IBM838   Thai
IBM870   Latin-2, Croatian, Czech, Hungarian, Polish
IBM871   Icelandic
IBM875   Greek
IBM1025  Bulgarian, Russian, Serbian Cyrillic
IBM1026  Turkish
IBM1047  Latin-1 Open Systems
IBM1112  Latvian, Lithuanian
IBM1122  Estonian
IBM930   Japanese Katakana
IBM933   Korean
IBM935   Simplified Chinese
IBM937   Traditional Chinese
IBM939   Japanese Latin
IBM1388  Simplified Chinese

Aug 26 '05 #19

On Fri, 26 Aug 2005, Alan J. Flavell wrote:

Even English uses some characters which are "foreign" to the ASCII
character set.

Aug 26 '05 #20

On Thu, 25 Aug 2005, RobG wrote:

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'?

No! English text is foreign text; German text is non-foreign text.
;-)

Aug 26 '05 #21

On Fri, 26 Aug 2005, Henri Sivonen wrote:

(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)

Find the non-ASCII characters on
http://www.google.com/language_tools?hl=af
http://www.google.com/language_tools?hl=nl

Aug 26 '05 #22

RobG wrote:

Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

Touché.

Aug 26 '05 #23

Henri Sivonen wrote:

In article <pQ****************@news.optus.net.au>,
RobG <rg***@iinet.net.au> wrote:

Harlan Messinger wrote:
[...]

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.

(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)

(I've set the encoding for this message to UTF-8.)

Nope. Dutch:
Ĳsselmeer (the former Zuider Zee) (the first character
should look like an IJ ligature)
Ik heb het maar één keer gezien. (I've only seen it once).
Ons tweeëns (the two of us) (funny--I tried to corroborate
my recollection of this from the web, but I can't
find it with either two or three "e"s. Can someone
tell me whether I'm making this up?)
Afrikaans:
Ek sal hê (I will have).
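
Whether such a UTF-8 message displays correctly depends on the reader's
software honouring the declared encoding. The Python sketch below is an
illustration only (the variable names are made up): it shows what
happens to the Dutch word for "one" when its UTF-8 bytes are wrongly
decoded as Windows-1252, and that the damage is reversible as long as
the garbled text itself is kept intact.

```python
WORD = "\u00e9\u00e9n"  # Dutch "one", with two e-acute characters

# Wrong decoding: each two-byte UTF-8 sequence becomes two characters.
garbled = WORD.encode("utf-8").decode("cp1252")
print(len(WORD), len(garbled))  # 3 5

# Reversing the mistake recovers the original text.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == WORD)         # True
```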

Aug 26 '05 #24

Stan Brown

On Fri, 26 Aug 2005 10:44:13 +0300, "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:

You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable.

This claim does not become true no matter how often repeated.

Aside from the question of whether _punctuation_ can properly be
said to be part of a _language_ at all, there is no divine ordinance
that says opening and closing quotes need to look different, or that
apostrophes and single quotes need to look different.

Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.

Perhaps, meaning no disrespect, you might yield just a tiny bit to a
native speaker in your pronouncements about what is and is not
correct English?

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you

Aug 26 '05 #25

Stan Brown wrote:

On Fri, 26 Aug 2005 10:44:13 +0300, "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:

You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable.

This claim does not become true no matter how often repeated.

Aside from the question of whether _punctuation_ can properly be
said to be part of a _language_ at all, there is no divine ordinance
that says opening and closing quotes need to look different, or that
apostrophes and single quotes need to look different.

There's no ordinance that says a lower-case "l" ("L") and a numeral "1"
(one) need to look different, and indeed in the typeface in which this
very sentence appears on my screen they look identical. Ditto for
capital "O" ("o") and numeral "0" (zero). That doesn't alter the fact
that they are different characters with different purposes and that they
shouldn't be treated as one in the character scheme. (Alan, don't resent
me for introducing "scheme". I couldn't resist.)
Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.

Perhaps, meaning no disrespect, you might yield just a tiny bit to a
native speaker in your pronouncements about what is and is not
correct English?

When I *write* quotation marks, my opening quotes are different from my
closing quotes, and that's the way they usually are in print. So,
indeed, I consider them separate characters. Using a single character to
represent both may not be the end of the world, but it can pose
disadvantages.

Aug 26 '05 #26