Simple high-ascii character encoding

Hi,

I have an Html document that declares that it uses the utf-8 character
set. As this document is editable via a web interface I need to make
sure that high-ascii characters that may be accidentally entered are
properly represented when the document is served. My programming
language allows me to get the ascii value for any individual character
so what I am doing when a change is saved is to look at each character
in the content and if the ascii value for a character > 127 then I
replace 'character' with '&#AsciiValue;'.

I am not very well up on character sets and document encoding
mechanisms so I would like to know, is this a sensible idea?

TIA

Chandy

Aug 25 '05 #1
ch****@totalise.co.uk wrote:
I have an Html document that declares that it uses the utf-8 character
set.
Does it do that properly? Prove it, show us the URL! :-)
As this document is editable via a web interface I need to make
sure that high-ascii characters that may be accidentally entered are
properly represented when the document is served.
There are no high-ascii characters. Ascii stops at 127, has always
stopped, and will always stop.

If your document is adequately UTF-8 encoded, then form data sent via a
form on the page will appear as UTF-8 encoded, too, though naturally it
will _also_ be encoded as specified for form data encoding in general.
My programming
language allows me to get the ascii value for any individual character
so what I am doing when a change is saved is to look at each character
in the content and if the ascii value for a character > 127 then I
replace 'character' with '&#AsciiValue;'.


Why would you do that, given the fact that there are no Ascii values
greater than 127 and the fact that your form data handler gets the data
in UTF-8 encoding? What would be the point in replacing it by a
character reference, when the page itself is UTF-8 encoded?
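
To illustrate the point, here is a minimal Python sketch. It assumes the browser submitted the form as application/x-www-form-urlencoded from a UTF-8 page; the field name `content` and the sample text are invented for the example, and the thread never says which language the original poster uses.

```python
from urllib.parse import parse_qs

# Hypothetical raw form body: "café ®" sent from a UTF-8 page.
# Each non-ASCII character arrives as percent-encoded UTF-8 bytes
# (é -> C3 A9, ® -> C2 AE); the space arrives as "+".
raw_body = "content=caf%C3%A9+%C2%AE"

# parse_qs undoes the form encoding and decodes the bytes as UTF-8.
fields = parse_qs(raw_body, encoding="utf-8")
text = fields["content"][0]

# The text can be stored and served as UTF-8 again,
# with no character references involved.
utf8_bytes = text.encode("utf-8")
print(text)
print(utf8_bytes)
```

Under those assumptions the handler never needs to look at "values above 127" at all; it decodes once on input and encodes once on output.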
Aug 25 '05 #2
On Thu, 25 Aug 2005 ch****@totalise.co.uk wrote under the
heading:
Simple high-ascii character encoding
Hmmm. What's that supposed to mean in an HTML context?
I have an Html document that declares that it uses the utf-8 character
set.
Terminology again! utf-8 is not a "character set", but a character
encoding scheme of unicode. I can't help it that, way back, MIME chose
the attribute name of "charset=" for this, which in current terminology
is very misleading, but utf-8 still isn't a "character set".
As this document is editable via a web interface I need to make
sure that high-ascii characters that may be accidentally entered
I think you'd benefit from getting rid of this obsolete term
"high-ascii". ASCII is a 7-bit code, containing a mere 95 displayable
characters, whereas the document character set of HTML is Unicode,
containing vastly more characters than ASCII.

Modern OSes often define input methods for wide ranges of these
non-ASCII characters...
are properly represented when the document is served.
Details depend on your OS and editing application, but modern OSes don't
mind storing utf-8, and serving them out as such.
My programming language allows me to get the ascii value for any
individual character
But most of the characters aren't in ASCII, so how could they have
an "ascii value"? Character representation in HTML isn't hard, but
you *do* have to use the terms with some care, if you want to make
sense.
so what I am doing when a change is saved is to look at each character
in the content and if the ascii value for a character > 127
There ARE no ASCII characters with a value above 127 !
then I replace 'character' with '&#AsciiValue;'.


There *are* no ASCII values greater than 127.

Representing non-ASCII characters as &#number; , using their character
number in Unicode, is a feasible approach - but rather voluminous if you
have many of them.
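
As a rough sketch of that approach in Python (an assumption, since the thread never names the programming language), the function below replaces each character outside ASCII with a numeric character reference built from its Unicode code point. The function name and the sample text are made up for illustration.

```python
def to_numeric_references(text: str) -> str:
    """Replace each non-ASCII character with &#N;, where N is its Unicode code point."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Sample text containing e-acute, copyright sign, em dash and trade mark sign.
sample = "Caf\u00e9 \u00a9 2005 \u2014 Acme\u2122"
converted = to_numeric_references(sample)
print(converted)

# Python's built-in "xmlcharrefreplace" error handler does the same job.
builtin = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(converted == builtin)
```

This only works if the program holds the text as Unicode characters. Applied to raw bytes in some 8-bit code page, the numbers would not be Unicode code points and the references could be wrong.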

I have a checklist that's been quite widely peer-reviewed: I'd
recommend that you work your way down the scenarios, and pick one that
seems to fit your needs.

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

Hope this helps a bit.
Aug 25 '05 #3
Yep, clearly I make no sense to people who understand this better than
I do :) Okay, the language returns integer values for the standard as
well as 'extended' ascii characters (as detailed, for example, on
http://www.asciitable.com/). My document is not public but starts
with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The system is publishing content in english to the web but is
potentially for world-wide consumption. Generally the extra characters
I have to represent will be items like &reg;, &copy; and &trade; and
some accented letters, but I was wanting to avoid having to have a
lookup of ascii value->Html Entity by just changing the character for
&#Value; when it seemed to have a value that put it outwith the
standard ascii range. I'll re-ask the question 'is this sensible'
while I read through the document you referred to.
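
One caveat with the "&#Value;" idea can be sketched in Python. The sketch assumes, purely for illustration, that the language reports byte values in Windows-1252; the thread does not say which 8-bit encoding is actually involved. For some characters the byte value happens to equal the Unicode code point, for others it does not.

```python
# Assumption: the "value" the language returns is a Windows-1252 byte value.
samples = (("copyright", 0xA9), ("registered", 0xAE), ("trade mark", 0x99))

for name, byte_value in samples:
    char = bytes([byte_value]).decode("cp1252")
    print(f"{name}: byte value {byte_value}, Unicode code point {ord(char)}")

# The number in a character reference has to be the code point, not the byte value.
trade_mark = bytes([0x99]).decode("cp1252")
trade_mark_reference = f"&#{ord(trade_mark)};"
print(trade_mark_reference)
```

Here the copyright and registered signs come out with the same number either way, but the trade mark sign is byte 153 in Windows-1252 and code point 8482 in Unicode, so only a reference built from the code point is correct.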

Thanks!

Chandy

Aug 25 '05 #4
On 25 Aug 2005, Chandy wrote:
(as detailed, for example, on http://www.asciitable.com/).


| Not Found
| The requested URL /). was not found on this server.

Did you mean http://www.asciitable.com/ ? This is just bullshit!
Please refer to
http://czyborra.com/charsets/iso646.html
http://czyborra.com/charsets/codepages.html
http://czyborra.com/charsets/iso8859.html
for reliable information.

Aug 25 '05 #5
Chandy wrote:
Yep, clearly I make no sense to people who understand this better than
I do :) Okay, the language returns integer values for the standard as
well as 'extended' ascii characters (as detailed, for example, on
http://www.asciitable.com/).


As that page itself says, "it took a while to get a single standard for
these extra characters and hence there are few varying 'extended' sets.
The most popular is presented below." This is all self-contradictory.
The point is there is no character set correctly called "extended
ASCII". Anyone using that term to refer to *a* mapping of a collection
of characters to codes 128-255 is using it because either:

(a) He thinks that "ASCII" itself refers to the numeric range 0-127, and
that "extended ASCII" therefore means, unambiguously, the range 128-255.
This is incorrect, because "ASCII" doesn't in the first place refer to
the range of numbers, it refers to a very specific set of characters (or
control codes) and its *assignment* to those numbers.

(b) He got the impression somewhere that the particular set of
characters he's seen assigned to the range 128-255 is *the* set of
characters so assigned, and that that particular set is known as
"extended ASCII". So what was introduced to MS-DOS users as "extended
ASCII" was the set of Microsoft line draw characters. Many other users
think it refers to the Western Europe (Latin-1) extension. And so forth.
The term is used by different people to refer to what they think it
means. In other words, it doesn't technically mean anything. It's a
misconception.
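
A small Python sketch of why the term is ambiguous: the same byte value above 127 decodes to different characters depending on which 8-bit encoding is assumed. The byte 196 is chosen arbitrarily, and the codec names are the ones Python's standard library uses.

```python
import unicodedata

byte = bytes([0xC4])  # decimal 196, outside the ASCII range

decoded = {}
for codec in ("cp437", "cp1252", "iso-8859-1"):
    char = byte.decode(codec)
    decoded[codec] = char
    print(f"{codec:10} U+{ord(char):04X} {unicodedata.name(char)}")

# ASCII itself has no character for this byte value.
try:
    byte.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("decodable as ASCII:", ascii_ok)
```

Under the MS-DOS code page 437 the byte is a line-drawing character; under Windows-1252 and iso-8859-1 it is a capital A with diaeresis.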
Aug 25 '05 #6
Harlan Messinger wrote:
As that page itself says, "it took a while to get a single standard for
these extra characters and hence there are few varying 'extended' sets.
The most popular is presented below." This is all self-contradictory.
The point is there is no character set correctly called "extended
ASCII". Anyone using that term to refer to *a* mapping of a collection
of characters to codes 128-255 is using it because either:


[snip]

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

First, a given character may appear in one or more of these schemes and
*not* appear in one or more others. Would that character be an "extended
ASCII" character or not? The answer is that it's a character in some of
those character sets or that's represented in some of those encodings,
and not in others. The question of whether it's an "extended ASCII"
character is meaningless.

Second, a given character may appear in two different character sets but
mapped to different codes. What's the "extended ASCII code" for an em
dash? Well, under the standard Windows character set, an em-dash is
character 151; if you're using Unicode, it's character 8212; and if
you're using ISO-8859-1, it isn't anything at all because the em dash
isn't part of that character set. In other words, again, it's
meaningless to talk about a character's extended ASCII code.
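
The em dash example can be checked with a short Python sketch. It takes "the standard Windows character set" to mean Windows-1252 (`cp1252` in Python), which is an assumption about what was meant.

```python
em_dash = "\u2014"

# Unicode code point of the em dash.
print(ord(em_dash))

# Its single-byte value in Windows-1252.
print(em_dash.encode("cp1252")[0])

# ISO-8859-1 has no em dash, so encoding fails.
try:
    em_dash.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print("in ISO-8859-1:", in_latin1)
```

Three encodings give three different answers (8212, 151, and "not present") for one character, which is the point being made above.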
Aug 25 '05 #7

Harlan Messinger wrote:
To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".


I was about to ask if anyone had bothered to list all the
different character sets that are identical to ASCII in the
first 127 characters, but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...
Aug 25 '05 #8
On Thu, 25 Aug 2005, it was written:
but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

^
(Characters 0 to 127 are the first 128 characters.)

All of these
http://www.unicode.org/Public/MAPPIN...MICSFT/EBCDIC/
http://czyborra.com/charsets/iso646.html#EBCDIC
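
For a quick check, here is a Python sketch using `cp037`, one EBCDIC code page that ships with Python's standard library. The sample word is arbitrary.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # an EBCDIC code page

print("ASCII :", ascii_bytes.hex(" "))
print("EBCDIC:", ebcdic_bytes.hex(" "))

# Even plain English letters get different numbers.
print("A in ASCII :", "A".encode("ascii")[0])
print("A in EBCDIC:", "A".encode("cp037")[0])
```

The letter A is 65 in ASCII and 193 in this EBCDIC code page, so not even the basic English alphabet sits in the lower 128 positions.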

Aug 25 '05 #9
On Thu, 25 Aug 2005, Harlan Messinger wrote:
To be fair, *any* of the character sets of which ASCII is a subset
can legitimately be called *an* "extension of ASCII".
It could - but it's not a particularly informative statement, as I
hope you'd agree.
Latin-1 is an ASCII extension,
To be pedantic, "Latin-1" defines a repertoire of characters:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you
said Latin-1, I suspect you really meant iso-8859-1, which indeed
has ASCII as its lower half.
as is Unicode.
Indeed.
But still, it makes no sense to speak of "extended ASCII
characters".
Right!
Second, a given character may appear in two different character sets
but mapped to different codes. What's the "extended ASCII code" for
an em dash? Well, under the standard Windows character set, an
em-dash is character 151; if you're using Unicode, it's character
8212; and if you're using ISO-8859-1, it isn't anything at all
because the em dash isn't part of that character set. In other
words, again, it's meaningless to talk about a character's extended
ASCII code.


Right!!

And even in MS-DOS land, which is where this unfortunate phrase
"extended ASCII" seems to have grown, there's a bushel of different
encodings: CP-437 for the USans, CP-850 for "multinational" use (which
contains approximately an MS-DOS encoding of the Latin-1 repertoire,
but organised completely differently than iso-8859-1), plus loads of
national-specific code pages too. I've got an MS-DOS version 6 manual
somewhere which lists page after page of the wretched things.
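
As an illustration in Python, the sketch below compares the upper halves of CP-437 and CP-850 as implemented by Python's `cp437` and `cp850` codecs. The byte 155 is picked arbitrarily as one example of a difference.

```python
# Collect every byte value in 128..255 that the two code pages decode differently.
differing = [
    value
    for value in range(128, 256)
    if bytes([value]).decode("cp437") != bytes([value]).decode("cp850")
]
print(len(differing), "of 128 upper-half byte values differ")

# One concrete example: byte 155.
in_437 = bytes([0x9B]).decode("cp437")
in_850 = bytes([0x9B]).decode("cp850")
print(f"byte 155: CP-437 gives U+{ord(in_437):04X}, CP-850 gives U+{ord(in_850):04X}")
```

Byte 155 is the cent sign in CP-437 and a lowercase o with stroke in CP-850, so even two MS-DOS code pages disagree about what "character 155" means.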

Thank goodness we rarely have to go there these days (except where
some user has blundered and converted DOS to Windows where they ought
not, or failed to do so when they should've).

best
Aug 25 '05 #10
On Thu, 25 Aug 2005, Chandy wrote:
http://www.asciitable.com/


Bleagh.

On cursory inspection, this appears to be the US-National MS-DOS code
page, CP-437. Utterly useless in the modern world: it's absolute
nonsense for them to claim that it's the "most popular", as indeed is
their claim that "it took a while to get a single standard", since
there never *has* been a "single" standard of the kind that they are
talking about. Possibly in the distant future, when this babel of
8-bit character codes has been forgotten, Unicode *will* be that
"single standard". Possibly.

Ho hum
Aug 25 '05 #11
Guy Macon wrote:
Harlan Messinger wrote:

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

I was about to ask if anyone had bothered to list all the
different character sets that are identical to ASCII in the
first 127 characters, but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...


EBCDIC, for starters.

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text and special
symbols could be rendered before more sophisticated means became
available. For example, the various Symbols and Wingdings fonts.
Aug 25 '05 #12
Alan J. Flavell wrote:
On Thu, 25 Aug 2005, Harlan Messinger wrote:
To be fair, *any* of the character sets of which ASCII is a subset
can legitimately be called *an* "extension of ASCII".


It could - but it's not a particularly informative statement, as I
hope you'd agree.


Yes. Still, it's been convenient that for purposes of composing in
English most people (pre-Unicode) haven't had to worry about whether
their editor supported a particular encoding because it hasn't mattered
with respect to the common ASCII subset.
Latin-1 is an ASCII extension,


To be pedantic, "Latin-1" defines a repertoire of characters:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you
said Latin-1, I suspect you really meant iso-8859-1, which indeed
has ASCII as its lower half.


I did, and thanks for the adjustment. I'm trying really hard to stop
mixing up character sets and encodings. (By the way--is a "repertoire"
different from a "set"?)
Aug 25 '05 #13
On Thu, 25 Aug 2005, Harlan Messinger wrote:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you said
Latin-1, I suspect you really meant iso-8859-1, which indeed has
ASCII as its lower half.
I did, and thanks for the adjustment. I'm trying really hard to stop
mixing up character sets and encodings.


"character sets" versus "encodings" is yet another layer! - although
that's hardly noticeable with the old 8-bit codings, it gets quite
critical with encodings of Unicode.
(By the way--is a "repertoire" different from a "set"?)


Well, the term "character set" is usually understood to define not
only a particular repertoire of characters, but also the assignment of
each character to a "small" integer number. This assignment is of
course different in EBCDIC-based codings from what it is in
ASCII-based codings, to take the obvious example.

As such, I'd tend to avoid the use of the term "set" to refer to a
character repertoire, if I'm trying to avoid implying a particular
ordering of the characters or their assignment to "small" integers.
The "repertoire" is the unordered selection of characters, without
reference to one or other "character sets" which might be defined
comprising that repertoire.
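
The distinction can be sketched in Python. The three characters are an arbitrary sample, and `cp500` (an EBCDIC code page that ships with Python) stands in for CP-1047, which Python's standard library does not provide.

```python
# A repertoire: an unordered selection of characters, with no numbers attached.
repertoire = {"A", "\u00e9", "\u00a3"}  # A, e-acute, pound sign

# Two character sets covering that repertoire, each with its own
# assignment of characters to small integers.
latin1_numbers = {ch: ch.encode("iso-8859-1")[0] for ch in repertoire}
ebcdic_numbers = {ch: ch.encode("cp500")[0] for ch in repertoire}

for ch in sorted(repertoire):
    print(f"U+{ord(ch):04X}: iso-8859-1 {latin1_numbers[ch]:3d}, cp500 {ebcdic_numbers[ch]:3d}")
```

Both mappings cover the same repertoire, yet they assign different numbers to every one of these characters.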

hope that helps.

Btw, recall that after a certain point, the Latin-x repertoire is
encoded by the iso-8859-y character code, where x is no longer equal
to y. This is because some of the intervening codes weren't for Latin
at all, but for Greek, Arabic, Cyrillic, Hebrew etc. So, for example,
iso-8859-15 is the ISO encoding for Latin-9.
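
Python's codec registry happens to reflect this numbering, which gives a quick way to see it. The sketch below relies only on the alias names `latin5` and `latin9` from Python's standard encodings.

```python
import codecs

# Latin-1 through Latin-4 line up with iso-8859-1 through iso-8859-4,
# but after that the numbers diverge.
for alias in ("latin1", "latin2", "latin3", "latin4", "latin5", "latin9"):
    print(f"{alias:7} -> {codecs.lookup(alias).name}")

same_as_9 = codecs.lookup("latin5").name == codecs.lookup("iso-8859-9").name
same_as_15 = codecs.lookup("latin9").name == codecs.lookup("iso-8859-15").name
print(same_as_9, same_as_15)

# The euro sign is part of Latin-9 (iso-8859-15); in that encoding it is byte 164.
euro_byte = "\u20ac".encode("iso-8859-15")[0]
print(euro_byte)
```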

Aug 25 '05 #14
Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text


While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

[...]

--
Rob
Aug 25 '05 #15
In article <pQ****************@news.optus.net.au>,
RobG <rg***@iinet.net.au> wrote:
Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text


While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?


Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.

(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 26 '05 #16
RobG wrote:
While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?


It's easy to out-pedant you here: You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable. (Is horizontal ellipsis just
a presentational form of "..."? The difference is real, anyway, and
English style guides require the use of "spaced dots", and achieving
this without using the horizontal ellipsis character is awkward.)

ASCII also lacks letters like e with acute accent, o with diaeresis, and
letter (ligature) ae, which belong to _some_ forms of written English at
least.
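
For reference, the characters mentioned above can be listed with their Unicode names in a few lines of Python. The selection of code points is my reading of the list in this post, not something the post itself spells out.

```python
import unicodedata

# Curly quotes, apostrophe, en dash, em dash, ellipsis, e-acute, o-diaeresis, ae.
chars = "\u201c\u201d\u2019\u2013\u2014\u2026\u00e9\u00f6\u00e6"

for ch in chars:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):32}  in ASCII: {ch.isascii()}")

none_in_ascii = not any(ch.isascii() for ch in chars)
print(none_in_ascii)
```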
Aug 26 '05 #17
On Thu, 25 Aug 2005, RobG wrote:
Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font
designers used in the past to map alphabets and symbol sets other
than the basic English one to the sub-128 positions so that
foreign text
While we're being pedantic about words, should the phrase 'foreign
text' be 'non-English text'?


Even English uses some characters which are "foreign" to the ASCII
character set.
Or in the context of ASCII, are the two terms identical?


Well, to be extra pedantic, ASCII is an American Standard. They don't
have an exclusive hold on English.
Aug 26 '05 #18
Andreas Prilop wrote:
but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...


^
(Characters 0 to 127 are the first 128 characters.)

All of these
http://www.unicode.org/Public/MAPPIN...MICSFT/EBCDIC/
http://czyborra.com/charsets/iso646.html#EBCDIC


I had to deal with EBCDIC encodings recently - they are used by most IBM
platforms, for example iSeries servers (OS/400).
Just to give an idea of the number of such encodings, here is a list of
the charsets supported by the Open statement in LotusScript
(Notes/Domino R6):

IBM037 US and Canadian English, Dutch, Portuguese
IBM273 German
IBM277 Danish, Norwegian
IBM278 Finnish, Swedish
IBM280 Italian
IBM284 Spanish
IBM285 International English
IBM297 French
IBM420 Arabic
IBM424 Hebrew
IBM500 Intl. English, Latin-1, Albanian, Belgian English, French
IBM838 Thai
IBM870 Latin-2, Croatian, Czech, Hungarian, Polish
IBM871 Icelandic
IBM875 Greek
IBM1025 Bulgarian, Russian, Serbian Cyrillic
IBM1026 Turkish
IBM1047 Latin-1 Open Systems
IBM1112 Latvian, Lithuanian
IBM1122 Estonian
IBM930 Japanese Katakana
IBM933 Korean
IBM935 Simplified Chinese
IBM937 Traditional Chinese
IBM939 Japanese Latin
IBM1388 Simplified Chinese
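
The variants differ from each other too, not just from ASCII. A minimal Python sketch, assuming that Python's `cp037` and `cp500` codecs correspond to IBM037 and IBM500 in the list above:

```python
# Letters sit at the same positions in both code pages...
letter_037 = "A".encode("cp037")[0]
letter_500 = "A".encode("cp500")[0]
print("A:", letter_037, letter_500)

# ...but some punctuation does not.
bracket_037 = "[".encode("cp037")[0]
bracket_500 = "[".encode("cp500")[0]
print("[:", bracket_037, bracket_500)
```

So text written under one EBCDIC code page and read under another can come out with the wrong brackets even though the letters look fine.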
Aug 26 '05 #19
On Fri, 26 Aug 2005, Alan J. Flavell wrote:
Even English uses some characters which are "foreign" to the ASCII
character set.


£

Aug 26 '05 #20
On Thu, 25 Aug 2005, RobG wrote:
While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'?


No! English text is foreign text; German text is non-foreign text.
;-)

Aug 26 '05 #21
On Fri, 26 Aug 2005, Henri Sivonen wrote:
(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)


Find the non-ASCII characters on
http://www.google.com/language_tools?hl=af
http://www.google.com/language_tools?hl=nl

Aug 26 '05 #22
RobG wrote:
Harlan Messinger wrote:
[...]
Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?


Touché.
Aug 26 '05 #23
Henri Sivonen wrote:
In article <pQ****************@news.optus.net.au>,
RobG <rg***@iinet.net.au> wrote:

Harlan Messinger wrote:
[...]

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text


While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.

(I suppose you might get along with ASCII when writing Dutch and
Afrikaans.)


(I've set the encoding for this message to UTF-8.)

Nope. Dutch:
IJsselmeer (the former Zuider Zee) (the first character
should look like an IJ ligature)
Ik heb het maar één keer gezien. (I've only seen it once).
Ons tweeëns (the two of us) (funny--I tried to corroborate
my recollection of this from the web, but I can't
find it with either two or three "e"s. Can someone
tell me whether I'm making this up?)
Afrikaans:
Ek sal hê (I will have).
Aug 26 '05 #24
On Fri, 26 Aug 2005 10:44:13 +0300, "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:
You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable.


This claim does not become true no matter how often repeated.

Aside from the question of whether _punctuation_ can properly be
said to be part of a _language_ at all, there is no divine ordinance
that says opening and closing quotes need to look different, or that
apostrophes and single quotes need to look different.

Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.

Perhaps, meaning no disrespect, you might yield just a tiny bit to a
native speaker in your pronouncements about what is and is not
correct English?

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
Aug 26 '05 #25
Stan Brown wrote:
On Fri, 26 Aug 2005 10:44:13 +0300, "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:

You cannot write even English
correctly using ASCII only. ASCII lacks orthographically correct
quotation marks (and apostrophe), en and am dashes, and horizontal
ellipsis, though the last one is debatable.

This claim does not become true no matter how often repeated.

Aside from the question of whether _punctuation_ can properly be
said to be part of a _language_ at all, there is no divine ordinance
that says opening and closing quotes need to look different, or that
apostrophes and single quotes need to look different.


There's no ordinance that says a lower-case "l" ("L") and a numeral "1"
(one) need to look different, and indeed in the typeface in which this
very sentence appears on my screen they look identical. Ditto for
capital "O" ("o") and numeral "0" (zero). That doesn't alter the fact
that they are different characters with different purposes and that they
shouldn't be treated as one in the character scheme. (Alan, don't resent
me for introducing "scheme". I couldn't resist.)
Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.

Perhaps, meaning no disrespect, you might yield just a tiny bit to a
native speaker in your pronouncements about what is and is not
correct English?


When I *write* quotation marks, my opening quotes are different from my
closing quotes, and that's the way they usually are in print. So,
indeed, I consider them separate characters. Using a single character to
represent both may not be the end of the world, but it can pose
disadvantages.
Aug 26 '05 #26
On Fri, 26 Aug 2005, Stan Brown wrote:
Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.


.... but with a doubled hyphen and lots of ligatures (e.g. U+FB06).

Aug 26 '05 #27
Tim
Jukka K. Korpela:
You cannot write even English correctly using ASCII only. ASCII
lacks orthographically correct quotation marks (and apostrophe),
en and am dashes, and horizontal ellipsis, though the last one
is debatable.


Stan Brown sent:
This claim does not become true no matter how often repeated.
Claiming that some of those things aren't true, no matter how many times
repeated, does NOT make it true.
Aside from the question of whether _punctuation_ can properly be
said to be part of a _language_ at all, there is no divine ordinance
that says opening and closing quotes need to look different, or that
apostrophes and single quotes need to look different.
Whatever they *look* like, apostrophes are not quote marks (neither are
they accents, as they're all-too-frequently abused for), and properly used
quote marks are not the same as what almost passes for quote marks in
ASCII (there are two separate opening and closing quote symbols, and ASCII
provides neither, certainly nothing that's opening quotes with a
corresponding something as closing quotes).

Anyone who's been taught to use the English language properly knows full
well that opening and closing quotes are two dissimilar symbols. Stylised
fonts are something else again, but standard punctuation marks do have
*proper* ways of being drawn (dots with tails going in particular
directions), even the letters have proper ways of being drawn. Anybody
who's *properly* taught children to write knows this.
Indeed, a counter-example comes readily to mind: the King James
bible, the standard English bible for centuries, was printed
entirely without quote marks, en dashes, "am dashes", and ellipses.
That a document doesn't do something is no proof. It also doesn't have
the word computer in it, nor various other parts of our current language.
Perhaps, meaning no disrespect, you might yield just a tiny bit to a
native speaker in your pronouncements about what is and is not
correct English?


Maybe you ought to listen to some native speakers who assert that you're
wrong. Perhaps some who've been taught properly.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Aug 26 '05 #28
Jukka K. Korpela wrote:
You cannot write even English correctly using ASCII only. ASCII lacks
orthographically correct quotation marks (and apostrophe), en and am
dashes, and horizontal ellipsis, though the last one is debatable.


Thanks a lot, Jukka, now I'm sweer to post any followups in
case I run afoul of ISO Standard English!!

--
Jock
Aug 26 '05 #29
On Sat, 27 Aug 2005 01:32:54 +0900, Tim <ti*@mail.localhost.invalid>
wrote:
Anyone who's been taught to use the English language properly knows full
well that opening and closing quotes are two dissimilar symbols.
Sure they are - in some type faces. In other type faces they're the
same. In other type faces they don't exist. In handwriting they're
the same. The point is, they're not part of English, they're part of
typography.
standard punctuation marks do have
*proper* ways of being drawn (dots with tails going in particular
directions), even the letters have proper ways of being drawn. Anybody
who's *properly* taught children to write knows this.


Uh-huh. Children are all taught to make commas as dots with tails.
Suuuuure they are.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
Aug 27 '05 #30
In article <rd********************************@4ax.com>,
Stan Brown <th************@fastmail.fm> wrote:
On Sat, 27 Aug 2005 01:32:54 +0900, Tim <ti*@mail.localhost.invalid>
wrote:
Anyone who's been taught to use the English language properly knows full
well that opening and closing quotes are two dissimilar symbols.


Sure they are - in some type faces. In other type faces they're the
same. In other type faces they don't exist. In handwriting they're
the same. The point is, they're not part of English, they're part of
typography.


Such typographic conventions are typically rather tightly coupled with
the language. When it comes to English, the convention in clueful
typesetting is that you start a quotation with U+201C and end it with
U+201D. In Finnish, you start and end with U+201D (or start and end with
U+00BB).
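
As a sketch of how such a convention might be applied in code, the Python snippet below picks quotation marks by language. The table and the `quote` helper are hypothetical and hold only the two conventions named in this post (for Finnish, the U+201D variant).

```python
# Hypothetical lookup table: language code -> (opening mark, closing mark).
QUOTES = {
    "en": ("\u201c", "\u201d"),  # English: U+201C ... U+201D
    "fi": ("\u201d", "\u201d"),  # Finnish: U+201D ... U+201D
}


def quote(text: str, lang: str) -> str:
    """Wrap text in the quotation marks conventional for the given language."""
    opening, closing = QUOTES[lang]
    return f"{opening}{text}{closing}"


english = quote("hello", "en")
finnish = quote("hei", "fi")
print(english, finnish)
```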

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 27 '05 #31
Stan Brown <th************@fastmail.fm> wrote:
Anyone who's been taught to use the English language properly knows full
well that opening and closing quotes are two dissimilar symbols.
Sure they are - in some type faces. In other type faces they're the
same.


Then those typefaces are as erroneous as one that has the same glyph for
"v" and "w". The difference is a character difference.
The point is, they're not part of English, they're part of
typography.


Unlike most languages of the world, English is a language that is both
spoken and written. Just as sounds and pauses are part of a language,
letters and punctuation are, too. It is highly illogical to claim that the
written form is not part of the language - especially in a context like
this, where we discuss HTML authoring for the WWW. As you know, WWW pages
are _mostly_ rendered in written (visual) form.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Aug 27 '05 #32
Tim
Tim:
Anyone who's been taught to use the English language properly knows
full well that opening and closing quotes are two dissimilar symbols.
Stan Brown sent:
Sure they are - in some type faces. In other type faces they're the
same. In other type faces they don't exist.
Don't try and use a deficiency or design idea about some fonts as a proof
for your argument, it doesn't wash.
In handwriting they're the same.
They're most certainly not, unless you're a lazy or uneducated writer.
standard punctuation marks do have *proper* ways of being drawn (dots
with tails going in particular directions), even the letters have
proper ways of being drawn. Anybody who's *properly* taught children
to write knows this.

Uh-huh. Children are all taught to make commas as dots with tails.
Suuuuure they are.


Maybe we have better teachers than you.

Some time ago our education department decided not to bother teaching
correct spelling, and letting things slide. That didn't make bad spelling
correct.

There's a correct way to draw a comma, apostrophes, quote marks, etc.,
just the same as there's a proper way to draw the letter e.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Aug 27 '05 #33
JRS: In article <hs****************************@news.fv.fi>, dated Fri,
26 Aug 2005 10:29:30, seen in news:comp.infosystems.www.authoring.html,
Henri Sivonen <hs******@iki.fi> posted :

Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.


Only if under US influence.

The true meaning of international (as in International Standard) is
something that should be understood everywhere, whereas the implication
of foreign is that it should be understood somewhere else.

International rightly includes one's own nation.

As examples, Latin was, 2000 years ago, an approximation to a truly
international language; but Icelandic has only IIRC ever been a foreign
language (except to Icelanders, of course).

To the vast majority of the world population, Finnish is foreign; but it
is not international.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 MIME. ©
Web <URL:http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc : <URL:http://www.merlyn.demon.co.uk/programs/> - see 00index.htm
Dates - miscdate.htm moredate.htm js-dates.htm pas-time.htm critdate.htm etc.
Aug 27 '05 #34
JRS: In article <Pi*******************************@ppepc56.ph.gla.ac.uk>, dated Fri,
26 Aug 2005 09:06:15, seen in news:comp.infosystems.www.authoring.html,
Alan J. Flavell <fl*****@ph.gla.ac.uk> posted :
Well, to be extra pedantic, ASCII is an American Standard. They don't
have an exclusive hold on English.


Indeed, it is said that the finest English is to be found in Scotland;
my best dictionary is of Edinburgh origin, though typeset in Cambridge
and printed in Suffolk.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk DOS 3.3, 6.20; Win98. ©
Web <URL:http://www.merlyn.demon.co.uk/> - FAQqish topics, acronyms & links.
PAS EXE TXT ZIP via <URL:http://www.merlyn.demon.co.uk/programs/00index.htm>
My DOS <URL:http://www.merlyn.demon.co.uk/batfiles.htm> - also batprogs.htm.
Aug 27 '05 #35
In article <ri**************@merlyn.demon.co.uk>,
Dr John Stockton <jr*@merlyn.demon.co.uk> wrote:
JRS: In article <hs****************************@news.fv.fi>, dated Fri,
26 Aug 2005 10:29:30, seen in news:comp.infosystems.www.authoring.html,
Henri Sivonen <hs******@iki.fi> posted :

Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
it is politically correct to say 'international' instead of 'foreign'.


Only if under US influence.

The true meaning of international


"International user" is practically just a way to say "foreigner". A
foreigner is an "international user" even if she has one nationality and
doesn't particularly interact with other nations.

An "international keyboard" is any one non-US keyboard even if targeted
at one nation.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 28 '05 #36
In <11*********************@o13g2000cwo.googlegroups.com>, on
08/25/2005
at 03:52 AM, ch****@totalise.co.uk said:
I have an Html document that declares that it uses the utf-8
character set. As this document is editable via a web interface I
need to make sure than high-ascii characters that may be accidentally
entered are properly represented when the document is served.
ASCII is only 0-127. Any character from 128-255 is not ASCII, and you
can only translate it to Unicode if you know what code page it is
supposed to be in.
My programming language allows me to get the ascii value for any
individual character
There is no ASCII value for most characters. Again, you need to know
the code page in order to translate properly.
I am not very well up on character sets and document encoding
mechanisms so I would like to know, is this a sensible idea?


No. You need to be able to control or identify the encoding mechanism
for data that are entered.
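
To illustrate the point about code pages, here is a minimal Python sketch (not the original poster's code; the function names and the sample bytes are made up for illustration). It shows that the same two bytes above 127 yield different numeric character references depending on which code page you assume, so replacing a raw byte value with '&#value;' is only right if the byte value happens to match the Unicode code point.

```python
def decode_with_known_code_page(raw: bytes, code_page: str) -> str:
    """Turn raw bytes into Unicode text, given the code page they were written in."""
    return raw.decode(code_page)


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Hypothetical input: a quoted word with bytes 0x93 and 0x94 around it.
raw = b"\x93quoted\x94"

# Assuming Windows-1252, the bytes are curly quotes (U+201C, U+201D).
as_cp1252 = to_numeric_references(decode_with_known_code_page(raw, "cp1252"))

# Assuming ISO-8859-1, the same bytes are C1 control characters (U+0093, U+0094).
as_latin1 = to_numeric_references(decode_with_known_code_page(raw, "latin-1"))

print(as_cp1252)  # &#8220;quoted&#8221;
print(as_latin1)  # &#147;quoted&#148;
```

If the form data really does arrive as UTF-8, decoding it as UTF-8 and serving the page as UTF-8 would make the references unnecessary; the sketch only covers the case where you insist on producing them.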

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Aug 28 '05 #37
JRS: In article <hs****************************@news.fv.fi>, dated Sun,
28 Aug 2005 09:08:32, seen in news:comp.infosystems.www.authoring.html,
Henri Sivonen <hs******@iki.fi> posted :
In article <ri**************@merlyn.demon.co.uk>,
Dr John Stockton <jr*@merlyn.demon.co.uk> wrote:
JRS: In article <hs****************************@news.fv.fi>, dated Fri,
26 Aug 2005 10:29:30, seen in news:comp.infosystems.www.authoring.html,
Henri Sivonen <hs******@iki.fi> posted :
>
>Isn't 'non-English' the definition of 'foreign'? :-) Although nowadays
>it is politically correct to say 'international' instead of 'foreign'.


Only if under US influence.

The true meaning of international


"International user" is practically just a way to say "foreigner". A
foreigner is an "international user" even if she has one nationality and
doesn't particularly interact with other nations.

An "international keyboard" is any one non-US keyboard even if targeted
at one nation.

As a non-American, you should know better than that.

Readers need to be aware of US terminology, but writers should avoid it;
in this case, it obscures a valuable distinction.

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©
<URL:http://www.jibbering.com/faq/> JL/RC: FAQ of news:comp.lang.javascript
<URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
Aug 29 '05 #38
