need help with agonizing struggle to standardize my code on UTF-8 encoding

In <da**************************@posting.google.com >, on 09/30/2004
at 09:33 AM, lk******@geocities.com (lawrence) said:

I'm not sure I understand you. I went to this page to look up the
character 176: http://www.asciitable.com/
ASCII only runs from 0-127.
It appears to be cursor mark, or something. The page also notes that
extended ASCII characters are entirely standardized.
No. There are a multitude of different code pages on the PC that
overlap with ASCII. There are multiple international standards that
overlap with ASCII. That's one of the reasons for the existence of
Unicode.
This page says it is a degree sign:

It is, in, e.g., ISO 8859-1 (Latin-1), but it is not in, e.g., CP 450.
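
Editorial illustration (not part of the original posts): a minimal Python sketch of the point about "character 176", using only the standard codecs. The helper name `decode_byte` is made up for this example.

```python
# Decode one byte value under several encodings to show that
# "character 176" has no single meaning outside 7-bit ASCII.

def decode_byte(value: int, encoding: str) -> str:
    """Return the character that a single byte maps to in `encoding`."""
    return bytes([value]).decode(encoding)

for name in ("iso-8859-1", "cp437", "cp850"):
    ch = decode_byte(176, name)
    print(f"{name:>10}: U+{ord(ch):04X} {ch}")

# ASCII itself stops at 127, so byte 176 is simply invalid there.
try:
    decode_byte(176, "ascii")
except UnicodeDecodeError as exc:
    print("     ascii: cannot decode ->", exc.reason)
```

In ISO 8859-1 the byte is the degree sign (U+00B0); in the DOS code pages 437 and 850 it is a shading block (U+2591), which is probably the "cursor mark" seen on the table page.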

My advice is to stick with UTF-8 and to use character entities when
possible.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #11

On Sun, 3 Oct 2004, Shmuel (Seymour J.) Metz wrote:

ASCII only runs from 0-127.

True...

This page says it is a degree sign:

It is, in, e.g., ISO 8859-1 (Latin-1),

Fine...

My advice is to stick with UTF-8

I'm not sure that advice is helpful, in the terms in which it's
offered. My advice, for those who don't have some overriding
criterion, would often be to code the source in us-ascii and then
advertise it as utf-8 (which, in a degenerate sense, it is); but
preferably to make some effort to understand the issues, and make a
choice based on that understanding. Start from e.g
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6
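
Editorial illustration (not part of the original posts): a small Python sketch of why a us-ascii source can honestly be labelled utf-8. The sample markup and the helper `is_pure_ascii` are assumptions made up for this example.

```python
# A page written only in 7-bit ASCII is byte-for-byte valid UTF-8,
# so it can be advertised as charset=utf-8 without any conversion.

def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit range 0-127."""
    return all(b < 128 for b in data)

# Non-ASCII characters are written as references, so the bytes stay ASCII.
page = b"<p>Temperature: 24&nbsp;&deg;C</p>"

print(is_pure_ascii(page))
# The same bytes decode to the same text under either label.
print(page.decode("ascii") == page.decode("utf-8"))

# A page containing a raw UTF-8 degree sign is no longer pure ASCII.
raw = "<p>24 \u00b0C</p>".encode("utf-8")
print(is_pure_ascii(raw))
```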

My page http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html may
be useful as a starting point. But of course nothing that I say
becomes true merely because I said it - naturally it should be checked
against the applicable specifications (if it fails, then feel free to
let me know, and I'll be happy to make the appropriate corrections!)

and to use character entities when possible.

If you're coding with charset=utf-8, then surely you don't really
*need* the "character entities"?

There can be advantages in using them, indeed; but I'd expect you to
offer a better justification than the above, before accepting it as
a design principle.

all the best

Jul 23 '05 #12

Jukka K. Korpela

lk******@geocities.com (lawrence) wrote (quoting a guide):

However, this shows us we can
use &deg; to show a degree sign, as in 24&nbsp;&deg;C. The non-breaking space
is required, but not for HTML reasons: see the Scientific Style
section at this SI guide page.

The statement is correct, though somewhat confusing. Using a space
between a number and a unit is indeed a requirement in the SI; and
although not explicitly mentioned in the SI standards (though mentioned
in some other standards), it is clearly undesirable to have a line break
between a number and a unit. _One_ way to achieve this is to use a
no-break space character, which can be presented in HTML e.g. as &nbsp;. But
that's not the only way; you could use

24&nbsp;&deg;C

but you could also use

<span style="white-space: nowrap">24 °C</span>

(preferably using a class and an external style sheet instead of a style
attribute, but you get the idea)

or

<nobr>24 °C</nobr>

which is nonstandard but works well, or, in the special case where the
quantity expression is the content of a table cell,

<td nowrap>24 °C</td>

The interesting and important point is that the other ways that I
mentioned will, with certain caveats (CSS might be off, <nobr> might not
be supported, browsers may have bugs), _also_ prevent a line break
between the degree sign and the letter C. And this is important, because
the most common browser, Internet Explorer, feels free to break a line
after a degree sign, so that it may display 24 °C as 24 °
C, which is even more intolerable than breaking between a number and a
unit. More notes on this: http://www.cs.tut.fi/~jkorpela/html/nobr.html

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #13

lawrence

"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.ne t>...

This page says it is a degree sign:

It is, in, e.g., ISO 8859-1 (Latin-1), but it is not in, e.g., CP 450.

My advice is to stick with UTF-8 and to use character entities when
possible.

Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice? I find that on my weblog I can
no longer copy and paste text from other weblogs to my own (for quoting)
without getting lots of garbage characters. I assume that's because
most of the websites out there are using ISO 8859-1 or the Windows
default.

Jul 23 '05 #14

lawrence

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message news:<Pi*******************************@ppepc56.ph .gla.ac.uk>...

My advice is to stick with UTF-8
I'm not sure that advice is helpful, in the terms in which it's
offered. My advice, for those who don't have some overriding
criterion, would often be to code the source in us-ascii and then
advertise it as utf-8 (which, in a degenerate sense, it is); but
preferably to make some effort to understand the issues, and make a
choice based on that understanding. Start from e.g
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.

If you're coding with charset=utf-8, then surely you don't really
*need* the "character entities"?
There can be advantages in using them, indeed; but I'd expect you to
offer a better justification than the above, before accepting it as
a design principle.

I don't understand this. The character entities, like &amp;, are from
HTML, they really have nothing to do with what charset I'm using, yes?
I'd have to use them no matter what my charset is?

Jul 23 '05 #15

Stan Brown

"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:

Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.

And since I don't have Unicode applications on my computer, when I
view source of a page in my favorite editor I get garbage characters
too.

I originally went with UTF-8 because Netscape 4 does better with a
few &#nnnn; codes if the page is served as UTF-8 than if it's served
as ISO-8859-1. But Netscape 4 is ever less important, and it will be
a few years yet before Windows computers all have native Unicode
software.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/

Jul 23 '05 #16

In <da**************************@posting.google.com >, on 10/04/2004
at 07:23 PM, lk******@geocities.com (lawrence) said:

Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice?

Only if your users are restricted to countries where that is an
appropriate code page. Even then 8859-15 might make more sense.

I find that on my weblog I can no longer copy and paste text from other
weblogs to my own (for quoting) without getting lots of garbage
characters. I assume that's because most of the websites out there
are using ISO 8859-1 or the Windows default.

No, it's because your software is broken. In a multilingual
environment the software needs to be aware of character set issues.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #17

In <da**************************@posting.google.com >, on 10/04/2004
at 07:27 PM, lk******@geocities.com (lawrence) said:

Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web.

No, the problem is software that can't handle multiple character sets.
You have the same types of problems when UTF-8 isn't involved.

I don't understand this. The character entities, like &amp;, are
from HTML, they really have nothing to do with what charset I'm
using, yes?

That's why you should use them; they're readable in any supported
character set, even if your editor doesn't handle multiple character
sets properly. If you code, e.g., "Nöther", and edit it with software
that doesn't use the right code page then you won't get the expected
results.
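
Editorial illustration (not part of the original posts): a short Python sketch of what "the wrong code page" does to a raw "Nöther", and why the entity form is immune. The variable names are made up for this example.

```python
import html

name = "N\u00f6ther"  # "Nöther" with a raw o-umlaut

utf8_bytes = name.encode("utf-8")         # two bytes for the umlaut
latin1_bytes = name.encode("iso-8859-1")  # one byte for the umlaut

# UTF-8 bytes read as ISO 8859-1: each byte of the two-byte
# sequence turns into its own (wrong) character.
misread_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO 8859-1 bytes read as UTF-8: the lone byte 0xF6 is invalid,
# so it is substituted with U+FFFD here instead of raising an error.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The entity form is plain ASCII, so both readings agree on it.
entity_bytes = b"N&ouml;ther"
same_either_way = entity_bytes.decode("utf-8") == entity_bytes.decode("iso-8859-1")

print(repr(misread_as_latin1))
print(repr(misread_as_utf8))
print(same_either_way, html.unescape(entity_bytes.decode("ascii")) == name)
```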

Note that this is a separate issue from rendering the page properly;
your HTML should indicate the character set that you are using and the
browser should honor that.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #18

On Tue, 5 Oct 2004, Shmuel (Seymour J.) Metz wrote:

In <da**************************@posting.google.com >, on 10/04/2004
at 07:27 PM, lk******@geocities.com (lawrence) said:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web.
No, the problem is software that can't handle multiple character sets.

I think you mean "multiple character encoding schemes".

In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode, and that determines the numerical
values which are to be used in &#number; representations.

The external character representation (as declared in that
unfortunately-named MIME attribute "charset", which in modern
terminology refers to a character encoding scheme) can be anything.
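
Editorial illustration (not part of the original posts): a minimal Python sketch of the distinction drawn here. The number in a &#number; reference is always the Unicode code point, while the bytes used for the same character depend on the encoding named by "charset". The choice of the euro sign and of these three encodings is just for the example.

```python
import html

# The numeric reference names a Unicode code point, whatever the
# encoding of the file that contains it.
euro = html.unescape("&#8364;")
print(f"code point: U+{ord(euro):04X}")

# The external byte representation, by contrast, varies with the
# character encoding scheme ("charset").
for encoding in ("utf-8", "iso-8859-15", "cp1252"):
    print(f"{encoding:>12}: {euro.encode(encoding)!r}")
```
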
You have the same types of problems when UTF-8 isn't involved.
Don't forget that us-ascii is a special case of utf-8, and (until
netscape 4 is finally deceased) there can be benefits in coding
a web page in us-ascii and calling it utf-8, see my "scenario 6":
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

But one size doesn't fit all requirements, hence the other options
which are explored there.

But don't use iso-8859-15 for encoding HTML - there's honestly no
point.

I don't understand this. The character entities, like &amp;, are
from HTML, they really have nothing to do with what charset I'm
using, yes?

That's why you should use them; they're readable in any supported
character set,

^^^^^^^^^^^^^

For "character set" read "encoding". HTML has only one "document
character set", as I said above. I know this seems like nit-picking -
until you really need to get it right, but there is a currently
accepted terminology, and the best way to avoid unnecessary
misunderstandings is to use it.

To get back to what you said: there's no "should" about using
&-notation, except where it's technically necessary (to distinguish
&amp; and &lt; from & and <, in other words).

There are some situations where it's essential (to express e-acute in
us-ascii encoding, for example). There are some where it's advisable
(e.g. &euro;), and others where it's pointless (e.g. if you use
iso-8859-7 encoding then it would be pointless to express Greek in
&#number; notation).
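
Editorial illustration (not part of the original posts): a rough Python check of the "essential" case, i.e. whether a character can be written directly in a given encoding or has to be expressed in &-notation. The helper `needs_reference` is hypothetical; it says nothing about the "advisable" cases, which depend on browser behaviour rather than on the encoding.

```python
def needs_reference(ch: str, encoding: str) -> bool:
    """True if `ch` cannot be written as raw bytes in `encoding`."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False

cases = [
    ("\u00e9", "ascii"),       # e-acute in us-ascii
    ("\u00e9", "iso-8859-1"),  # e-acute in Latin-1
    ("\u03b1", "iso-8859-7"),  # Greek alpha in the Greek encoding
    ("\u20ac", "iso-8859-1"),  # euro sign in Latin-1
]
for ch, encoding in cases:
    print(f"U+{ord(ch):04X} in {encoding}: needs reference = "
          f"{needs_reference(ch, encoding)}")
```
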
even if your editor doesn't handle multiple character sets properly.
If you code, e.g., "Nöther", and edit it with software that doesn't
use the right code page then you won't get the expected results.

Such software is really unfit for handling HTML, I must say.

all the best.

Jul 23 '05 #19

On Tue, 5 Oct 2004, Shmuel (Seymour J.) Metz wrote:

Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice?
Only if your users are restricted to countries where that is an
appropriate code page.

"Code Page" is IBM-Microsoft-proprietary terminology.
Even then 8859-15 might make more sense.

No. <http://www.google.com/search?q=ISO-8859-15+Science-Museum>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #20

In <Pi*******************************@ppepc56.ph.gla. ac.uk>, on
10/05/2004
at 08:24 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:

I think you mean "multiple character encoding schemes".
Yes, although a different character set would imply a different
encoding scheme.
In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode,
Even when there is a meta tag with http-equiv="content-type" and a
charset in the content tag? And is XHTML out of scope for this group?

But don't use iso-8859-15 for encoding HTML - there's honestly no
point.

What if you have a in the document? When was &euro; added; my HTML
4.0 documentation doesn't list it. Is it safe to assume a browser
supporting HTML 4.01 or XHTML?

If you code, e.g., "N Âther",

Check your newsreader settings; you're posting that without
Content-Type: and Content-Transfer-Encoding: header lines.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #21

On Thu, 7 Oct 2004, Shmuel (Seymour J.) Metz wrote:

at 08:24 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:
I think you mean "multiple character encoding schemes".
Yes, although a different character set would imply a different
encoding scheme.

Absolutely not. That's the whole point!

In (X)HTML you can (if you so choose) represent any Unicode character
by means of a markup string coded in us-ascii, even. The use of other
encoding schemes is merely a convenience when the desired character
repertoire fits a particular pattern, but whichever encoding scheme
you choose, you still - in principle - have access to any other
Unicode character you need, by means of &-notation. (Whether it's
practical to use it depends on whether you expect your readers'
browsers to render it, but in principle it's available). ☺
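
Editorial illustration (not part of the original posts): a Python sketch of reaching arbitrary Unicode characters from a us-ascii source via numeric character references. The sample string is made up, and real markup would additionally need &, < and > escaped.

```python
import html

# No-break space, degree sign, euro sign and some Greek letters.
text = "24\u00a0\u00b0C, \u20ac5, \u03b1\u03b2\u03b3"

# Anything outside ASCII becomes a &#number; reference.
ascii_markup = text.encode("ascii", "xmlcharrefreplace")
print(ascii_markup)

# Resolving the references gives back the original characters.
restored = html.unescape(ascii_markup.decode("ascii"))
print(restored == text)
```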

In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode,
(Strictly, I should have said "subsequent to RFC2070", or, if we're
talking about W3C-approved specifications, "in HTML4 and later").
Even when there is a meta tag with http-equiv="content-type" and a
charset in the content tag?
Yes, absolutely. The "charset" MIME parameter specifies the character
encoding scheme (be it iso-8859-7, iso-2022-jp, utf-16, whatever).
Some of the encoding schemes might carry a hint about the anticipated
character repertoire, but they don't change the "Document Character
Set" (that's a technical term from SGML with a rather precise meaning
in this context).
And is XHTML out of scope for this group?
XHTML is no different in this regard.
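
Editorial illustration (not part of the original posts): a Python sketch of reading the "charset" parameter from a Content-Type value and using it purely as the byte-decoding scheme. The helper `charset_of` and its ISO 8859-1 fallback are assumptions for the example, not something the thread prescribes.

```python
from email.message import Message

def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

# Bytes as they might arrive for a page sent in the Greek encoding.
body = "\u03b1\u03b2\u03b3 &#8364;".encode("iso-8859-7")

encoding = charset_of("text/html; charset=ISO-8859-7")
text = body.decode(encoding)

# The charset only governed how bytes became characters; the numeric
# reference in the text still refers to a Unicode code point.
print(encoding, repr(text))
```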

But don't use iso-8859-15 for encoding HTML - there's honestly no
point.

What if you have a in the document?

Eh? Your posting header, very reasonably, said

| Content-Type: text/plain; charset=ISO-8859-1

I can deduce what you were trying to type, but it simply isn't in the
character repertoire of the encoding that you used for your posting.

I repeat my earlier point: iso-8859-15 encoding *can* be useful for
plain text, but it offers no practical benefits for encoding HTML.
When was &euro; added; my HTML 4.0 documentation doesn't list it. Is
it safe to assume a browser supporting HTML 4.01 or XHTML?

There -are- web pages which go into excruciating detail about this.
I was just offering a practical summary.

But if you're going to get involved in arguments about these aspects
of (X)HTML, then I would strongly urge you to do a bit of background
reading on the HTML character model, or else we'll be doomed to
fruitless shouting matches based in different conceptions of the
underlying principles. This might help:
http://www.w3.org/TR/2004/WD-charmod-20040225/

Or indeed RFC2070 itself (now a bit dated, but nevertheless sound).

If you code, e.g., "N Âther",

Check your newsreader settings; you're posting that without
Content-Type: and Content-Transfer-Encoding: header lines.

Sorry about that. There *is* a problem, but it's not quite what
you presented it to be. For some reason, this new version of PINE
that I'm using saw fit to do this:
Content-Type: MULTIPART/MIXED;
BOUNDARY="616733697-1982083197-1097004245=:26474"

--616733697-1982083197-1097004245=:26474
Content-Type: TEXT/PLAIN; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
As soon as I find out why it's doing that, I'll stop it. Sorry.

(Quite how a single part can be not only "multipart" but even "mixed"
is something of a puzzle...)

Jul 23 '05 #22

Just f.y.i administrivia:

On Thu, 7 Oct 2004, Alan J. Flavell wrote:

Content-Type: MULTIPART/MIXED;
BOUNDARY="616733697-1982083197-1097004245=:26474" [...]
As soon as I find out why it's doing that, I'll stop it. Sorry.

It's discussed in the thread which contains this posting:
http://groups.google.com/groups?selm....17679%40lcore

as well as in a much more acrimonious thread on the same group.

Jul 23 '05 #23

On Thu, 7 Oct 2004, Shmuel (Seymour J.) Metz wrote:

What if you have a in the document? When was &euro; added; my HTML
4.0 documentation doesn't list it. Is it safe to assume a browser
supporting HTML 4.01 or XHTML?

Even Netscape 4.08 displays &euro; as euro sign or as abbreviation
EUR - depending on the operating system and on the available fonts.

But Netscape 4.8 for Macintosh and Windows doesn't understand
ISO-8859-15. What's worse, the Macintosh version refuses to apply
the usual ISO-8859-1 <-> MacRoman transcoding, meaning that *all*¹
special, non-ASCII characters are incorrect.

¹ Strictly speaking "all but five": ¢ £ µ ± © have the same
code positions.
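
Editorial illustration (not part of the original posts): a Python sketch that compares the upper halves of ISO 8859-1 and MacRoman, to see which code positions carry the same character in both. It relies on Python's bundled `mac_roman` codec; the helper `shared_positions` is made up for this example.

```python
def shared_positions(enc_a: str, enc_b: str) -> list:
    """Byte values 0xA0-0xFF that decode to the same character in both."""
    shared = []
    for value in range(0xA0, 0x100):
        byte = bytes([value])
        if byte.decode(enc_a) == byte.decode(enc_b):
            shared.append(value)
    return shared

same = shared_positions("iso-8859-1", "mac_roman")
for value in same:
    print(f"0x{value:02X} {bytes([value]).decode('iso-8859-1')}")
```

Every other byte in that range would need transcoding to display correctly.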

--
M. Pirard strikes again:
<http://www.alltheweb.com/search?q=don1t&_sb_lang=any>
<http://www.altavista.com/web/results?q=don1t&kgs=0&kls=0>

Jul 23 '05 #24

In <Pi*******************************@ppepc56.ph.gla. ac.uk>, on
10/07/2004
at 06:44 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:

Eh? Your posting header, very reasonably, said

| Content-Type: text/plain; charset=ISO-8859-1

Whoops! I need an 8859-15 code page :-(

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #25

On Tue, 12 Oct 2004, Shmuel (Seymour J.) Metz wrote:

X-Newsreader: MR/2 Internet Cruiser Edition for OS/2 v2.47/47

Whoops! I need an 8859-15 code page :-(

No, you would need code page 858 (cp00858, cp858) for your OS/2
and a transcoding cp00858 <-> ISO-8859-15 in your e-mail and news
programs.


Jul 23 '05 #26

lawrence

"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41***************************@news.patriot.n et>...

I find that on my weblog I can no longer copy and paste text from other
weblogs to my own (for quoting) without getting lots of garbage
characters. I assume that's because most of the websites out there
are using ISO 8859-1 or the Windows default.

No, it's because your software is broken. In a multilingual
environment the software needs to be aware of character set issues.

Well, obviously my software is broken, that is why I started this
thread. If everything was working perfectly, or if it was broken but I
knew how to fix it, then I wouldn't have needed to bother this
newsgroup with my questions, yes?

Clearly, my software is broken. I'm trying to figure out how to fix it.
If a user has a form with a TEXTAREA and can copy and paste text from
anywhere on the web (pages using any encoding) and paste it into that
TEXTAREA and then input it, then how can I process that input to make
sure that text doesn't get turned to garbage when it is output to the
web?

You can see the problem clearly on this page, where all the quotes are
full of garbage characters:

http://www.krubner.com/index.php?pageId=31475

Can you suggest a strategy for dealing with this? I notice that users
of Blogger, TypePad, MoveableType and pMachine never seem to face this
issue, so I assume that the makers of other weblog software have
already figured out how to solve this problem.
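
Editorial illustration (not part of the original posts): one possible server-side strategy, sketched in Python. It assumes the form page is served as UTF-8 so that browsers normally submit UTF-8, treats anything else as Windows-1252 (a guess, not a guarantee), and writes the result as ASCII-only markup so the output no longer depends on the page's declared encoding. The function names are hypothetical, and this is not claimed to be what any particular weblog package does.

```python
import html

def decode_submission(raw: bytes) -> str:
    """Decode submitted form bytes, preferring UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Heuristic fallback: assume Windows-1252, replacing bad bytes.
        return raw.decode("cp1252", errors="replace")

def to_ascii_html(text: str) -> str:
    """Escape markup characters and turn non-ASCII into &#number; form."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

quote = "\u201cfish & chips\u201d"  # curly quotes, as pasted from a weblog
for raw in (quote.encode("utf-8"), quote.encode("cp1252")):
    print(to_ascii_html(decode_submission(raw)))
```

Both submissions should produce the same ASCII-only output, which any browser can render regardless of the charset the page is labelled with.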

Jul 23 '05 #27

lawrence

Stan Brown <th************@fastmail.fm> wrote in message news:<MP************************@news.odyssey.net> ...

"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.

I originally went with UTF-8 because Netscape 4 does better with a
few &#nnnn; codes if the page is served as UTF-8 than if it's served
as ISO-8859-1. But Netscape 4 is ever less important, and it will be
a few years yet before Windows computers all have native Unicode
software.

It seems to me I might be luckier if I simply go with whatever the
most common encoding is on English-speaking websites. If I only market
to English-speakers, then the likelihood that they'll copy and paste
from non-English sites is low. (That might buy me a year during which
I can study this problem more. Maybe a year from now I'll understand
the issue better.)

Jul 23 '05 #28

In <da**************************@posting.google.com >, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:

Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?
You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.
http://www.krubner.com/index.php?pageId=31475

That page is 403 compliant.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #29

In <Pine.GSO.4.44.0410131549360.8242-100000@s5b004>, on 10/13/2004
at 03:52 PM, Andreas Prilop <nh******@rrzn-user.uni-hannover.de>
said:

No, you would need code page 858 (cp00858, cp858) for your OS/2

Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850). All I need is a charset for the MIME header lines that
includes the Euro, e.g., ISO-8859-15, and a proper mapping into it.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #30

On Thu, 21 Oct 2004, Shmuel (Seymour J.) Metz wrote:

No, you would need code page 858 (cp00858, cp858) for your OS/2

Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850).

If code position xD5 is a euro sign, then it cannot be cp437 nor cp850.
http://www.unicode.org/Public/MAPPIN...T/PC/CP437.TXT
http://www.unicode.org/Public/MAPPIN...T/PC/CP850.TXT
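
Editorial illustration (not part of the original posts): a Python check of code position 0xD5 in the three code pages under discussion, plus the kind of transcoding to ISO-8859-15 mentioned earlier in the thread. It uses Python's bundled codecs; the helper `transcode` is made up for this example.

```python
# What does byte 0xD5 mean in each code page?
for name in ("cp437", "cp850", "cp858"):
    ch = b"\xd5".decode(name)
    print(f"{name}: U+{ord(ch):04X} {ch}")

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    return data.decode(source).encode(target)

# A euro sign typed under code page 858, converted for a message
# labelled charset=ISO-8859-15.
print(transcode(b"\xd5", "cp858", "iso-8859-15"))
```

Only code page 858 has the euro sign at 0xD5; 437 has a box-drawing character there and 850 a dotless i.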


Jul 23 '05 #31

On Fri, 22 Oct 2004, Andreas Prilop wrote:

On Thu, 21 Oct 2004, Shmuel (Seymour J.) Metz wrote:
No, you would need code page 858 (cp00858, cp858) for your OS/2
Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850).

If code position xD5 is a euro sign, then it cannot be cp437 nor cp850.

Indeed.

Google suggests a discussion here:
http://sourceforge.net/mailarchive/f...&forum_id=1650
http://www.unicode.org/Public/MAPPIN...T/PC/CP437.TXT
http://www.unicode.org/Public/MAPPIN...T/PC/CP850.TXT
Well, that sourceforge discussion says 858 is already registered at
IANA (but as cp00858, not as cp858, woops), and here it is:
http://www.iana.org/assignments/charset-reg/IBM00858

But it seems nobody got it onto the Unicode site yet, neither in IBM's
area nor in Microsquish's DOS (ahem, /PC/) area.

Jul 23 '05 #32

On Fri, 22 Oct 2004, Alan J. Flavell wrote:

Well, that sourceforge discussion says 858 is already registered at
IANA (but as cp00858, not as cp858, woops),

^^

IBM has big plans for many more code pages, it seems :-)


Jul 23 '05 #33

lawrence

"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.ne t>...

In <da**************************@posting.google.com >, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:
Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?

You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.

I'm not sure I get you. I'm mostly worried about users who copy and
paste text from other pages.

http://www.krubner.com/index.php?pageId=31475

That page is 403 compliant.

Just keep trying. I think I'm hitting my bandwidth limit or something.
Sometimes it is 403, mostly it is not.

Jul 23 '05 #34

On Fri, 22 Oct 2004, Andreas Prilop wrote:

IANA (but as cp00858, not as cp858, woops),

^^
IBM has big plans for many more code pages, it seems :-)

Leaving room for cp10646, maybe....

Jul 23 '05 #35

lawrence

"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.ne t>...

In <da**************************@posting.google.com >, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:
Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?

You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.

Look at the garbage in the quotes on this page:

http://www.krubner.com/index.php?pageId=31475

I'm thinking I'd face less trouble if I sent these pages out as ISO-8859-15

Jul 23 '05 #36

In <da*************************@posting.google.com> , on 10/22/2004
at 03:07 PM, lk******@geocities.com (lawrence) said:

I'm not sure I get you. I'm mostly worried about users who copy and
paste text from other pages.

That involves their software, not yours. If it's broken, you can't fix
it for them.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>


Jul 23 '05 #37