need help with agonizing struggle to standardize my code on UTF-8 encoding

I'm just now trying to give my site a character encoding of UTF-8. The
site has been built in a hodge-podge way over the last 6 years. The
validator tells me I've lots of characters that don't belong to the
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?

http://validator.w3.org/check?uri=ht...krubner.com%2F
Jul 23 '05 #1

"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com...
I'm just now trying to give my site a character encoding of UTF-8. The
site has been built in a hodge-podge way over the last 6 years. The
validator tells me I've lots of characters that don't belong to the
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?

http://validator.w3.org/check?uri=ht...krubner.com%2F


Are you looking for a way that's more clever than reading the error message
that the above link gives you?

"Sorry, I am unable to validate this document because on lines 839, 846,
856, 929, 931, 933, 935, 937, 939, 1075, 1079, 1081, 1083, 1085, 1087, 1089,
1091, 1139, 1159, 1296, 1298, 1416, 1456, 1464, 1502, 1508, 1512, 1522,
1524, 1533, 1537, 1565, 1567, 1569, 1571, 1640, 1648, 1650, 1900, 1926, 2182
it contained one or more bytes that I cannot interpret as utf-8 (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication."
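The validator's list of line numbers can be reproduced locally. Here is a minimal sketch in Python, assuming the page has been saved and read in as raw bytes; the function name `invalid_utf8_lines` and the sample data are made up for illustration and are not part of the validator.

```python
def invalid_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that do not decode as UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 2 holds a lone 0xB0 byte (a Latin-1 degree sign),
# line 3 holds the same character correctly encoded as UTF-8.
sample = b"<div>plain ASCII</div>\n24 \xb0C\n24 \xc2\xb0C\n"
print(invalid_utf8_lines(sample))  # [2]
```

This only reports where the bad bytes are; it says nothing about which encoding they were originally written in.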

After looking at line 839 I had my first guess as to the source of the
problem and after seeing lines 846 and 856 I was positive.

Jul 23 '05 #2
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) writes:
I'm just now trying to give my site a character encoding of UTF-8. The
Why? Is it currently broken?
site has been built in a hodge-podge way over the last 6 years. The
Nothing necessarily wrong with that.
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?


man iconv.
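
For anyone who would rather script it, the same conversion can be sketched in Python. This assumes the old pages were written in Windows-1252, which is only a guess about the site; the function name `convert_to_utf8` is hypothetical.

```python
def convert_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from an assumed legacy encoding to UTF-8.

    The source encoding is an assumption: if the guess is wrong, the result
    is valid UTF-8 but may contain the wrong characters.
    """
    text = data.decode(source_encoding)  # raises on bytes undefined in cp1252
    return text.encode("utf-8")


legacy = b"caf\xe9 \x93quoted\x94"
print(convert_to_utf8(legacy).decode("utf-8"))
```

With GNU iconv the rough equivalent would be `iconv -f WINDOWS-1252 -t UTF-8 old.html > new.html`, where the file names are placeholders.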

--
Nick Kew

Nick's manifesto: http://www.htmlhelp.com/~nick/
Jul 23 '05 #3
"Harlan Messinger" <h.*********@comcast.net> wrote in message news:<2s*************@uni-berlin.de>...
"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com...
I'm just now trying to give my site a character encoding of UTF-8. The
site has been built in a hodge-podge way over the last 6 years. The
validator tells me I've lots of characters that don't belong to the
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?

http://validator.w3.org/check?uri=ht...krubner.com%2F
Are you looking for a way that's more clever than reading the error message
that the above link gives you?


I'm not sure I understand you. Reading the error message doesn't tell
me how to fix the problem. I'm looking for some kind of script that
might mass convert characters from the current encodings to UTF-8.


"Sorry, I am unable to validate this document because on lines 839, 846,
856, 929, 931, 933, 935, 937, 939, 1075, 1079, 1081, 1083, 1085, 1087, 1089, ......

After looking at line 839 I had my first guess as to the source of the
problem and after seeing lines 846 and 856 I was positive.


That's nice, but why don't you tell me? Line 839 is this:

</div>

Line 846 starts with the word "This".

What pattern are you seeing? You're being a bit cryptic in your reply.
Jul 23 '05 #4
ni**@hugin.webthing.com (Nick Kew) wrote in message news:<bp************@webthing.com>...
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) writes:
I'm just now trying to give my site a character encoding of UTF-8. The


Why? Is it currently broken?


It is currently broken in the sense that it is a weblog and I'd like
to put an RSS feed on it, because all weblogs have RSS feeds nowadays.
But RSS feeds won't validate if the feed is sent out without a
character encoding. So I have to give it a character encoding of some
kind. So I decided on UTF-8 after hashing it out some over on
comp.lang.php. And now that I'm forcing the issue, there is a lot of
code that was input previously that is balking.

UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?


man iconv.


Hot tip. Thanks.
Jul 23 '05 #5
"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:
I'm just now trying to give my site a character encoding of UTF-8. The
site has been built in a hodge-podge way over the last 6 years. The
validator tells me I've lots of characters that don't belong to the
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?

http://validator.w3.org/check?uri=ht...krubner.com%2F


Look for the UNIX or GNU tool "tr". You could for instance
transliterate the character 176 to the string &deg;
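
As far as I know, tr itself maps single characters to single characters, so replacing one byte with a multi-character string like &deg; needs sed or a small script. A minimal sketch in Python, assuming the source bytes are ISO-8859-1 (an assumption about the site); the function name `to_entities` is made up for illustration.

```python
from html.entities import codepoint2name


def to_entities(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Replace every non-ASCII character with an HTML entity or numeric reference."""
    text = data.decode(source_encoding)
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                          # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity, e.g. &deg;
        else:
            out.append(f"&#{code};")                # numeric fallback
    return "".join(out).encode("ascii")


print(to_entities(b"24 \xb0C"))  # b'24 &deg;C'
```

The output is pure ASCII, so it stays valid whether the page is later labelled as ISO-8859-1 or UTF-8.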

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Jul 23 '05 #6
lk******@geocities.com (lawrence) wrote:
"Harlan Messinger" <h.*********@comcast.net> wrote in message news:<2s*************@uni-berlin.de>...
"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com...
> I'm just now trying to give my site a character encoding of UTF-8. The
> site has been built in a hodge-podge way over the last 6 years. The
> validator tells me I've lots of characters that don't belong to the
> UTF-8 encoding. Other than changing them by hand, can anyone think of
> a clever way to find and convert these characters?
>
> http://validator.w3.org/check?uri=ht...krubner.com%2F


Are you looking for a way that's more clever than reading the error message
that the above link gives you?


I'm not sure I understand you. Reading the error message doesn't tell
me how to fix the problem. I'm looking for some kind of script that
might mass convert characters from the current encodings to UTF-8.


"Sorry, I am unable to validate this document because on lines 839, 846,
856, 929, 931, 933, 935, 937, 939, 1075, 1079, 1081, 1083, 1085, 1087, 1089,

......

After looking at line 839 I had my first guess as to the source of the
problem and after seeing lines 846 and 856 I was positive.


That's nice, but why don't you tell me? Line 839 is this:

</div>

Line 846 starts with the word "This".

What pattern are you seeing? You're being a bit cryptic in your reply.


My mistake, actually. I checked the box on the results page that says
"Show source", clicked Revalidate, and every line from the above list
of line numbers was full of pound signs. I thought that was the
obvious problem. I didn't notice that it was the validator that was
inserting those lines to replace the lines it couldn't read. My
apologies.
--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.
Jul 23 '05 #7
Why? Is it currently broken?


It is currently broken in the sense that it is a weblog and I'd like
to put an RSS feed on it, because all weblogs have RSS feeds nowadays.
But RSS feeds won't validate if the feed is sent out without a
character encoding. So I have to give it a character encoding of some
kind. So I decided on UTF-8 after hashing it out some over on
comp.lang.php. And now that I'm forcing the issue, there is a lot of
code that was input previously that is balking.


Have you considered trying an RSS feed creation tool for the rss piece? Not sure
how it will handle the encoding but it might be worth a try to look at
FeedForAll http://www.feedforall.com

Best,
Chip

Jul 23 '05 #8
Stan Brown <th************@fastmail.fm> wrote in message news:<MP************************@news.odyssey.net> ...
"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:
I'm just now trying to give my site a character encoding of UTF-8. The
site has been built in a hodge-podge way over the last 6 years. The
validator tells me I've lots of characters that don't belong to the
UTF-8 encoding. Other than changing them by hand, can anyone think of
a clever way to find and convert these characters?

http://validator.w3.org/check?uri=ht...krubner.com%2F


Look for the UNIX or GNU tool "tr". You could for instance
transliterate the character 176 to the string &deg;


I'm not sure I understand you. I went to this page to look up the
character 176:

http://www.asciitable.com/

It appears to be a cursor mark, or something. The page also notes that
extended ASCII characters are entirely standardized.

This page says it is a degree sign:

http://tranchant.plus.com/web/html-tutorial/characters

It says:

"This is the definition in the Document Type Declaration (DTD), which
is the machine-readable rules of HTML. However, this shows us we can
use &deg; to show a degree sign, as in 24 °C. The non-breaking space
is required, but not for HTML reasons: see the Scientific Style
section at this SI guide page."
Jul 23 '05 #9
Chip <Ch*********@newsguy.com> wrote in message news:<cj*********@drn.newsguy.com>...
Why? Is it currently broken?


It is currently broken in the sense that it is a weblog and I'd like
to put an RSS feed on it, because all weblogs have RSS feeds nowadays.
But RSS feeds won't validate if the feed is sent out without a
character encoding. So I have to give it a character encoding of some
kind. So I decided on UTF-8 after hashing it out some over on
comp.lang.php. And now that I'm forcing the issue, there is a lot of
code that was input previously that is balking.


Have you considered trying an RSS feed creation tool for the rss piece? Not sure
how it will handle the encoding but it might be worth a try to look at
FeedForAll http://www.feedforall.com


That would be fine if I just wanted to put this together in a hurry
and needed a quick solution, but one goal of this project is to donate
all code to the public domain, so we need to write our own solutions
to things. We give away our code at www.publicdomainsoftware.org.
Jul 23 '05 #10
In <da**************************@posting.google.com >, on 09/30/2004
at 09:33 AM, lk******@geocities.com (lawrence) said:
I'm not sure I understand you. I went to this page to look up the
character 176: http://www.asciitable.com/
ASCII only runs from 0-127.
It appears to be cursor mark, or something. The page also notes that
extended ASCII characters are entirely standardized.
No. There are a multitude of different code pages on the PC that
overlap with ASCII. There are multiple international standards that
overlap with ASCII. That's one of the reasons for the existence of
Unicode.
This page says it is a degree sign:


It is, in, e.g., ISO 8859-1 (Latin-1), but it is not in, e.g., CP 450.
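
The point is easy to check with a few lines of Python. This is a small sketch using two PC code pages that Python happens to ship codecs for (cp437 and cp850, chosen by me for illustration); the helper name `decode_byte` is hypothetical.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte value under the given encoding."""
    return bytes([value]).decode(encoding)


for encoding in ("iso-8859-1", "cp437", "cp850"):
    ch = decode_byte(176, encoding)
    print(f"{encoding}: U+{ord(ch):04X}")
# iso-8859-1: U+00B0
# cp437: U+2591
# cp850: U+2591
```

U+00B0 is the degree sign; U+2591 is a light-shade block character, which is presumably the "cursor mark" shown on the ASCII table page.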

My advice is to stick with UTF-8 and to use character entities when
possible.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #11
On Sun, 3 Oct 2004, Shmuel (Seymour J.) Metz wrote:
ASCII only runs from 0-127.
True...
This page says it is a degree sign:


It is, in, e.g., ISO 8859-1 (Latin-1),


Fine...
My advice is to stick with UTF-8
I'm not sure that advice is helpful, in the terms in which it's
offered. My advice, for those who don't have some overriding
criterion, would often be to code the source in us-ascii and then
advertise it as utf-8 (which, in a degenerate sense, it is); but
preferably to make some effort to understand the issues, and make a
choice based on that understanding. Start from e.g
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

My page http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html may
be useful as a starting point. But of course nothing that I say
becomes true merely because I said it - naturally it should be checked
against the applicable specifications (if it fails, then feel free to
let me know, and I'll be happy to make the appropriate corrections!)
and to use character entities when possible.


If you're coding with charset=utf-8, then surely you don't really
*need* the "character entities"?

There can be advantages in using them, indeed; but I'd expect you to
offer a better justification than the above, before accepting it as
a design principle.
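
To make the "us-ascii advertised as utf-8" idea concrete, here is a minimal sketch in Python. It relies on the standard `xmlcharrefreplace` error handler; the function name `ascii_with_references` is made up for illustration.

```python
def ascii_with_references(text: str) -> bytes:
    """Encode text as pure ASCII, writing non-ASCII characters as &#number; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


page = ascii_with_references("24 \u00b0C")
print(page)                  # b'24 &#176;C'
print(page.decode("utf-8"))  # the same bytes are also valid UTF-8
```

Because every byte is below 128, the result decodes identically as us-ascii, ISO-8859-1 or UTF-8, which is the degenerate sense mentioned above.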

all the best
Jul 23 '05 #12
lk******@geocities.com (lawrence) wrote (quoting a guide):
However, this shows us we can
use &deg; to show a degree sign, as in 24 °C. The non-breaking space
is required, but not for HTML reasons: see the Scientific Style
section at this SI guide page.


The statement is correct, though somewhat confusing. Using a space
between a number and a unit is indeed a requirement in the SI; and
although not explicitly mentioned in the SI standards (though mentioned
in some other standards), it is clearly undesirable to have a line break
between a number and a unit. _One_ way to achieve this is to use a no-
break space character, which can be presented in HTML e.g. as &nbsp;. But
that's not the only way; you could use

24&nbsp;&deg;C

but you could also use

<span style="white-space: nowrap">24 &deg;C</span>

(preferably using a class and an external style sheet instead of a style
attribute, but you get the idea)

or

<nobr>24 &deg;C</nobr>

which is nonstandard but works well, or, in the special case where the
quantity expression is the content of a table cell,

<td nowrap>24 &deg;C</td>

The interesting and important point is that the other ways that I
mentioned will, with certain caveats (CSS might be off, <nobr> might not
be supported, browsers may have bugs), _also_ prevent a line break
between the degree sign and the letter C. And this is important, because
the most common browser, Internet Explorer, feels free to break a line
after a degree sign, so that it may display 24 &deg;C as 24 °
C, which is even more intolerable than breaking between a number and a
unit. More notes on this: http://www.cs.tut.fi/~jkorpela/html/nobr.html

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #13
"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.ne t>...
This page says it is a degree sign:


It is, in, e.g., ISO 8859-1 (Latin-1), but it is not in, e.g., CP 450.

My advice is to stick with UTF-8 and to use character entities when
possible.


Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice? I find that on my weblog I can
no longer copy and paste text from other weblogs to my own (for quoting)
without getting lots of garbage characters. I assume that's because
most of the websites out there are using ISO 8859-1 or the Windows
default.
Jul 23 '05 #14
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message news:<Pi*******************************@ppepc56.ph .gla.ac.uk>...
My advice is to stick with UTF-8
I'm not sure that advice is helpful, in the terms in which it's
offered. My advice, for those who don't have some overriding
criterion, would often be to code the source in us-ascii and then
advertise it as utf-8 (which, in a degenerate sense, it is); but
preferably to make some effort to understand the issues, and make a
choice based on that understanding. Start from e.g
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6


Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.




If you're coding with charset=utf-8, then surely you don't really
*need* the "character entities"?
There can be advantages in using them, indeed; but I'd expect you to
offer a better justification than the above, before accepting it as
a design principle.


I don't understand this. The character entities, like &amp;, are from
HTML, they really have nothing to do with what charset I'm using, yes?
I'd have to use them no matter what my charset is?
Jul 23 '05 #15
"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.


And since I don't have Unicode applications on my computer, when I
view source of a page in my favorite editor I get garbage characters
too.

I originally went with UTF-8 because Netscape 4 does better with a
few &#nnnn; codes if the page is served as UTF-8 than if it's served
as ISO-8859-1. But Netscape 4 is ever less important, and it will be
a few years yet before Windows computers all have native Unicode
software.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Jul 23 '05 #16
In <da**************************@posting.google.com >, on 10/04/2004
at 07:23 PM, lk******@geocities.com (lawrence) said:
Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice?
Only if your users are restricted to countries where that is an
appropriate code page. Even then 8859-15 might make more sense.
I find that on my weblog I can no longer copy and paste text from other
weblogs to my own (for quoting) without getting lots of garbage
characters. I assume that's because most of the websites out there
are using ISO 8859-1 or the Windows default.


No, it's because your software is broken. In a multilingual
environment the software needs to be aware of character set issues.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #17
In <da**************************@posting.google.com >, on 10/04/2004
at 07:27 PM, lk******@geocities.com (lawrence) said:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web.
No, the problem is software that can't handle multiple character sets.
You have the same types of problems when UTF-8 isn't involved.
I don't understand this. The character entities, like &amp;, are
from HTML, they really have nothing to do with what charset I'm
using, yes?


That's why you should use them; they're readable in any supported
character set, even if your editor doesn't handle multiple character
sets properly. If you code, e.g., "Nöther", and edit it with software
that doesn't use the right code page then you won't get the expected
results.

Note that this is a separate issue from rendering the page properly;
your HTML should indicate the character set that you are using and the
browser should honor that.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #18
On Tue, 5 Oct 2004, Shmuel (Seymour J.) Metz wrote:
In <da**************************@posting.google.com >, on 10/04/2004
at 07:27 PM, lk******@geocities.com (lawrence) said:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web.
No, the problem is software that can't handle multiple character sets.


I think you mean "multiple character encoding schemes".

In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode, and that determines the numerical
values which are to be used in &#number; representations.

The external character representation (as declared in that
unfortunately-named MIME attribute "charset", which in modern
terminology refers to a character encoding scheme) can be anything.
You have the same types of problems when UTF-8 isn't involved.
Don't forget that us-ascii is a special case of utf-8, and (until
netscape 4 is finally deceased) there can be benefits in coding
a web page in us-ascii and calling it utf-8, see my "scenario 6":
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

But one size doesn't fit all requirements, hence the other options
which are explored there.

But don't use iso-8859-15 for encoding HTML - there's honestly no
point.
I don't understand this. The character entities, like &amp;, are
from HTML, they really have nothing to do with what charset I'm
using, yes?


That's why you should use them; they're readable in any supported
character set,

^^^^^^^^^^^^^

For "character set" read "encoding". HTML has only one "document
character set", as I said above. I know this seems like nit-picking -
until you really need to get it right, but there is a currently
accepted terminology, and the best way to avoid unnecessary
misunderstandings is to use it.

To get back to what you said: there's no "should" about using
&-notation, except where it's technically necessary (to distinguish
&amp; and &lt; from & and < in other words).

There are some situations where it's essential (to express e-acute in
us-ascii encoding, for example). There are some where it's advisable
(e.g &euro;), and others where it's pointless (e.g if you use
iso-8859-7 encoding then it would be pointless to express Greek in
&#number; notation).
even if your editor doesn't handle multiple character sets properly.
If you code, e.g., "Nöther", and edit it with software that doesn't
use the right code page then you won't get the expected results.


Such software is really unfit for handling HTML, I must say.

all the best.
Jul 23 '05 #19
On Tue, 5 Oct 2004, Shmuel (Seymour J.) Metz wrote:
Actually, I've begun to wonder if UTF-8 is a very bad idea, and ISO
8859-1 might be a much better choice?
Only if your users are restricted to countries where that is an
appropriate code page.


"Code Page" is IBM-Microsoft-proprietary terminology.
Even then 8859-15 might make more sense.


No. <http://www.google.com/search?q=ISO-8859-15+Science-Museum>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #20
In <Pi*******************************@ppepc56.ph.gla.ac.uk>, on
10/05/2004
at 08:24 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:
I think you mean "multiple character encoding schemes".
Yes, although a different character set would imply a different
encoding scheme.
In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode,
Even when there is a meta tag with http-equiv="content-type" and a
charset in the content tag? And is XHTML out of scope for this group?

But don't use iso-8859-15 for encoding HTML - there's honestly no
point.


What if you have a in the document? When was &euro; added; my HTML
4.0 documentation doesn't list it. Is it safe to assume a browser
supporting HTML 4.01 or XHTML?
If you code, e.g., "N Āther",


Check your newsreader settings; you're posting that without
Content-Type: and Content-Transfer-Encoding: header lines.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #21
On Thu, 7 Oct 2004, Shmuel (Seymour J.) Metz wrote:
at 08:24 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:
I think you mean "multiple character encoding schemes".
Yes, although a different character set would imply a different
encoding scheme.


Absolutely not. That's the whole point!

In (X)HTML you can (if you so choose) represent any Unicode character
by means of a markup string coded in us-ascii, even. The use of other
encoding schemes is merely a convenience when the desired character
repertoire fits a particular pattern, but whichever encoding scheme
you choose, you still - in principle - have access to any other
Unicode character you need, by means of &-notation. (Whether it's
practical to use it depends on whether you expect your readers'
browsers to render it, but in principle it's available). ☺
In HTML (the subject of this group) the document character set is by
definition always iso-10646/Unicode,
(Strictly, I should have said "subsequent to RFC2070", or, if we're
talking about W3C-approved specifications, "in HTML4 and later").
Even when there is a meta tag with http-equiv="content-type" and a
charset in the content tag?
Yes, absolutely. The "charset" MIME parameter specifies the character
encoding scheme (be it iso-8859-7, iso-2022-jp, utf-16, whatever).
Some of the encoding schemes might carry a hint about the anticipated
character repertoire, but they don't change the "Document Character
Set" (that's a technical term from SGML with a rather precise meaning
in this context).
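
A small sketch in Python may make the distinction concrete: the number in a &#number; reference always names a Unicode code point, while the "charset" only decides which bytes carry a literal character. The function name `encoded_forms` is hypothetical.

```python
import html


def encoded_forms(reference: str, encodings: list[str]) -> dict[str, str]:
    """Resolve a character reference, then show its bytes (as hex) per encoding."""
    char = html.unescape(reference)  # resolved against Unicode, whatever the charset
    return {enc: char.encode(enc).hex() for enc in encodings}


print(encoded_forms("&#233;", ["iso-8859-1", "utf-8"]))
# {'iso-8859-1': 'e9', 'utf-8': 'c3a9'}
```

The reference &#233; means e-acute in both cases; only the byte representation of the literal character changes with the encoding.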
And is XHTML out of scope for this group?
XHTML is no different in this regard.
But don't use iso-8859-15 for encoding HTML - there's honestly no
point.


What if you have a in the document?


Eh? Your posting header, very reasonably, said

| Content-Type: text/plain; charset=ISO-8859-1

I can deduce what you were trying to type, but it simply isn't in the
character repertoire of the encoding that you used for your posting.

I repeat my earlier point: iso-8859-15 encoding *can* be useful for
plain text, but it offers no practical benefits for encoding HTML.
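
A quick sketch in Python of why the euro sign is the sticking point, and why &-notation sidesteps it in HTML. The helper name `can_encode` is made up for illustration.

```python
def can_encode(char: str, encoding: str) -> bool:
    """Report whether a character is in the repertoire of an encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"
print(can_encode(euro, "iso-8859-1"))    # False
print(can_encode(euro, "iso-8859-15"))   # True
print(euro.encode("iso-8859-15"))        # b'\xa4'
# In HTML a character reference works under any encoding:
print(euro.encode("ascii", errors="xmlcharrefreplace"))  # b'&#8364;'
```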
When was &euro; added; my HTML 4.0 documentation doesn't list it. Is
it safe to assume a browser supporting HTML 4.01 or XHTML?


There -are- web pages which go into excruciating detail about this.
I was just offering a practical summary.

But if you're going to get involved in arguments about these aspects
of (X)HTML, then I would strongly urge you to do a bit of background
reading on the HTML character model, or else we'll be doomed to
fruitless shouting matches based in different conceptions of the
underlying principles. This might help:
http://www.w3.org/TR/2004/WD-charmod-20040225/

Or indeed RFC2070 itself (now a bit dated, but nevertheless sound).
If you code, e.g., "N Āther",


Check your newsreader settings; you're posting that without
Content-Type: and Content-Transfer-Encoding: header lines.


Sorry about that. There *is* a problem, but it's not quite what
you presented it to be. For some reason, this new version of PINE
that I'm using saw fit to do this:
Content-Type: MULTIPART/MIXED;
BOUNDARY="616733697-1982083197-1097004245=:26474"

--616733697-1982083197-1097004245=:26474
Content-Type: TEXT/PLAIN; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
As soon as I find out why it's doing that, I'll stop it. Sorry.

(Quite how a single part can be not only "multipart" but even "mixed"
is something of a puzzle...)
Jul 23 '05 #22

Just f.y.i administrivia:

On Thu, 7 Oct 2004, Alan J. Flavell wrote:
Content-Type: MULTIPART/MIXED;
BOUNDARY="616733697-1982083197-1097004245=:26474" [...]
As soon as I find out why it's doing that, I'll stop it. Sorry.


It's discussed in the thread which contains this posting:
http://groups.google.com/groups?selm....17679%40lcore

as well as in a much more acrimonious thread on the same group.
Jul 23 '05 #23
On Thu, 7 Oct 2004, Shmuel (Seymour J.) Metz wrote:
What if you have a in the document? When was &euro; added; my HTML
4.0 documentation doesn't list it. Is it safe to assume a browser
supporting HTML 4.01 or XHTML?


Even Netscape 4.08 displays &euro; as euro sign or as abbreviation
EUR - depending on the operating system and on the available fonts.

But Netscape 4.8 for Macintosh and Windows doesn't understand
ISO-8859-15. What's worse, the Macintosh version refuses to apply
the usual ISO-8859-1 <-> MacRoman transcoding, meaning that *all*¹
special, non-ASCII characters are incorrect.

¹ Strictly speaking "all but five": ¢ £ µ ± © have the same
code positions.
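
The footnote can be checked against the codec tables that ship with Python. This is only a sketch, and it assumes Python's `mac_roman` codec matches the MacRoman variant meant here; the function name `shared_positions` is hypothetical.

```python
def shared_positions(enc_a: str = "iso-8859-1", enc_b: str = "mac_roman") -> list[str]:
    """List characters in 0x80-0xFF that sit at the same byte in both encodings."""
    shared = []
    for value in range(0x80, 0x100):
        byte = bytes([value])
        a = byte.decode(enc_a)
        b = byte.decode(enc_b)
        if a == b:
            shared.append(a)
    return shared


print(shared_positions())
```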

--
M. Pirard strikes again:
<http://www.alltheweb.com/search?q=don1t&_sb_lang=any>
<http://www.altavista.com/web/results?q=don1t&kgs=0&kls=0>

Jul 23 '05 #24
In <Pi*******************************@ppepc56.ph.gla.ac.uk>, on
10/07/2004
at 06:44 PM, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> said:
Eh? Your posting header, very reasonably, said | Content-Type: text/plain; charset=ISO-8859-1


Whoops! I need an 8859-15 code page :-(

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #25
On Tue, 12 Oct 2004, Shmuel (Seymour J.) Metz wrote:
X-Newsreader: MR/2 Internet Cruiser Edition for OS/2 v2.47/47

Whoops! I need an 8859-15 code page :-(


No, you would need code page 858 (cp00858, cp858) for your OS/2
and a transcoding cp00858 <-> ISO-8859-15 in your e-mail and news
programs.
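
The transcoding step can be sketched in Python, which ships codecs named `cp858` and `iso-8859-15`. Whether an OS/2 newsreader uses the same tables is an assumption; the function name `transcode` is made up for illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of Unicode."""
    return data.decode(source).encode(target)


# In code page 858 the euro sign sits at 0xD5; in ISO-8859-15 it sits at 0xA4.
print(transcode(b"\xd5", "cp858", "iso-8859-15"))  # b'\xa4'
print(transcode(b"\xa4", "iso-8859-15", "cp858"))  # b'\xd5'
```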

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #26
"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41***************************@news.patriot.n et>...
I find that on my weblog I can no longer copy and paste text from other
weblogs to my own (for quoting) without getting lots of garbage
characters. I assume that's because most of the websites out there
are using ISO 8859-1 or the Windows default.


No, it's because your software is broken. In a multilingual
environment the software needs to be aware of character set issues.


Well, obviously my software is broken, that is why I started this
thread. If everything was working perfectly, or if it was broken but I
knew how to fix it, then I wouldn't have needed to bother this
newsgroup with my questions, yes?

Clearly, my software is broken. I'm trying to figure out how to fix it.
If a user has a form with a TEXTAREA and can copy and paste text from
anywhere on the web (pages using any encoding) and paste it into that
TEXTAREA and then input it, then how can I process that input to make
sure that text doesn't get turned to garbage when it is output to the
web?

You can see the problem clearly on this page, where all the quotes are
full of garbage characters:

http://www.krubner.com/index.php?pageId=31475

Can you suggest a strategy for dealing with this? I notice that users
of Blogger, TypePad, MoveableType and pMachine never seem to face this
issue, so I assume that the makers of other weblog software have
already figured out how to solve this problem.
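
One common strategy, sketched here in Python rather than PHP: serve the form page as UTF-8 (browsers generally submit form data in the page's encoding), then check on the server that what arrived really is UTF-8 before storing it. The Windows-1252 fallback is only a guess at what stray legacy input might be, and the function name `normalize_submission` is hypothetical.

```python
def normalize_submission(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode submitted form bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # Undefined bytes become U+FFFD instead of raising an error.
        return raw.decode(fallback, errors="replace")


print(normalize_submission("\u201cquoted\u201d".encode("utf-8")))  # valid UTF-8 input
print(normalize_submission(b"\x93quoted\x94"))                     # legacy curly quotes
```

Both calls yield the same text with proper curly quotes, which can then be stored and sent back out as UTF-8.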
Jul 23 '05 #27
Stan Brown <th************@fastmail.fm> wrote in message news:<MP************************@news.odyssey.net> ...
"lawrence" <lk******@geocities.com> wrote in
comp.infosystems.www.authoring.html:
Yes, I've begun to wonder if UTF-8 is not painfully crippling on the
web. It makes it difficult to copy and paste from other websites,
because few websites are using UTF-8.


I originally went with UTF-8 because Netscape 4 does better with a
few &#nnnn; codes if the page is served as UTF-8 than if it's served
as ISO-8859-1. But Netscape 4 is ever less important, and it will be
a few years yet before Windows computers all have native Unicode
software.


It seems to me I might be luckier if I simply go with whatever the
most common encoding is on English-speaking websites. If I only market
to English-speakers, then the likelihood that they'll copy and paste
from non-English sites is low. (That might buy me a year during which
I can study this problem more. Maybe a year from now I'll understand
the issue better.)
Jul 23 '05 #28
In <da**************************@posting.google.com>, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:
Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?
You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.
http://www.krubner.com/index.php?pageId=31475


That page is 403 compliant.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #29
In <Pine.GSO.4.44.0410131549360.8242-100000@s5b004>, on 10/13/2004
at 03:52 PM, Andreas Prilop <nh******@rrzn-user.uni-hannover.de>
said:
No, you would need code page 858 (cp00858, cp858) for your OS/2


Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850). All I need is a charset for the MIME header lines that
includes the Euro, e.g., ISO-8859-15, and a proper mapping into it.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Jul 23 '05 #30
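The "proper mapping" mentioned in the post above amounts to decoding from the local code page and re-encoding into the charset named in the MIME header. A minimal sketch in Python, assuming the local code page is the one Python's codec registry calls `cp858` (the post itself names 437 and 850, which the replies dispute):

```python
# Byte 0xD5 as typed on a code page that carries the euro at that position.
# Treating it as cp858 is an assumption about the local setup.
local_bytes = b"Price: 10 \xd5"

text = local_bytes.decode("cp858")        # local bytes -> Unicode
wire_bytes = text.encode("iso8859_15")    # Unicode -> ISO-8859-15

print(repr(text))
print(wire_bytes)
```

In ISO-8859-15 the euro sits at 0xA4, so the byte value changes in transit even though the character does not.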
On Thu, 21 Oct 2004, Shmuel (Seymour J.) Metz wrote:
No, you would need code page 858 (cp00858, cp858) for your OS/2


Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850).


If code position xD5 is a euro sign, then it cannot be cp437 nor cp850.
http://www.unicode.org/Public/MAPPIN...T/PC/CP437.TXT
http://www.unicode.org/Public/MAPPIN...T/PC/CP850.TXT

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #31
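The claim about position xD5 can be checked against published mapping tables. The sketch below uses the tables bundled with Python, which are an assumption here: they follow the Unicode and IBM mappings and may not match what a particular OS/2 installation ships.

```python
import unicodedata

# What does byte 0xD5 decode to under each code page?
at_d5 = {
    codec: bytes([0xD5]).decode(codec)
    for codec in ("cp437", "cp850", "cp858")
}

for codec, ch in at_d5.items():
    print(f"{codec}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

In these tables only cp858 has the euro sign at that position.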
On Fri, 22 Oct 2004, Andreas Prilop wrote:
On Thu, 21 Oct 2004, Shmuel (Seymour J.) Metz wrote:
No, you would need code page 858 (cp00858, cp858) for your OS/2
Why? I already have a Euro symbol as D5 on the code pages I'm using
(437, 850).


If code position xD5 is a euro sign, then it cannot be cp437 nor cp850.


Indeed.

Google suggests a discussion here:
http://sourceforge.net/mailarchive/f...&forum_id=1650
http://www.unicode.org/Public/MAPPIN...T/PC/CP437.TXT
http://www.unicode.org/Public/MAPPIN...T/PC/CP850.TXT
Well, that sourceforge discussion says 858 is already registered at
IANA (but as cp00858, not as cp858, woops), and here it is:
http://www.iana.org/assignments/charset-reg/IBM00858

But it seems nobody got it onto the Unicode site yet, neither in IBM's
area nor in Microsquish's DOS (ahem, /PC/) area.

Jul 23 '05 #32
On Fri, 22 Oct 2004, Alan J. Flavell wrote:
Well, that sourceforge discussion says 858 is already registered at
IANA (but as cp00858, not as cp858, woops),

^^

IBM has big plans for many more code pages, it seems :-)

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #33
"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.net>...
In <da**************************@posting.google.com>, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:
Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?


You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.


I'm not sure I get you. I'm mostly worried about users who copy and
paste text from other pages.
http://www.krubner.com/index.php?pageId=31475


That page is 403 compliant.


Just keep trying. I think I'm hitting my bandwidth limit or something.
Sometimes it is 403, mostly it is not.
Jul 23 '05 #34
On Fri, 22 Oct 2004, Andreas Prilop wrote:
IANA (but as cp00858, not as cp858, woops),

^^
IBM has big plans for many more code pages, it seems :-)


Leaving room for cp10646, maybe....
Jul 23 '05 #35
"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote in message news:<41**************************@news.patriot.net>...
In <da**************************@posting.google.com>, on 10/20/2004
at 07:06 PM, lk******@geocities.com (lawrence) said:
Clearly, my software is broken. I'm trying to figure out how to fix
it. If a user has a form with a TEXTAREA and can copy and paste text
from anywhere on the web (pages using any encoding) and paste it into
that TEXTAREA and then input it, then how can I process that input to
make sure that text doesn't get turned to garbage when it is output
to the web?


You can't fix bugs in your users' software, only in your own. The best
that you can do is to avoid presenting data that a lot of users can't
handle.


Look at the garbage in the quotes on this page:

http://www.krubner.com/index.php?pageId=31475

I'm thinking I'd face less trouble if I sent these pages out as ISO-8859-15
Jul 23 '05 #36
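One thing to weigh before switching to ISO-8859-15: it has the euro sign but no curly quotes or long dashes, which are typical of what gets pasted from other weblogs. A short Python sketch (the sample string is made up for illustration) shows what a page would have to do with such characters under each encoding:

```python
# Curly quotes, an em dash (U+2014) and a euro sign, as might be pasted in.
pasted = "\u201cHello\u201d \u2014 10 \u20ac"

# UTF-8 can carry every character directly.
as_utf8 = pasted.encode("utf-8")

# ISO-8859-15 cannot represent the quotes or the dash; with
# errors="xmlcharrefreplace" they become &#nnnn; references instead.
as_latin9 = pasted.encode("iso8859_15", errors="xmlcharrefreplace")

print(as_latin9)
```

So a narrower charset does not remove the need to convert input. It only changes the fallback from raw bytes to numeric character references.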
In <da*************************@posting.google.com>, on 10/22/2004
at 03:07 PM, lk******@geocities.com (lawrence) said:
I'm not sure I get you. I'm mostly worried about users who copy and
paste text from other pages.


That involves their software, not yours. If it's broken, you can't fix
it for them.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Jul 23 '05 #37
In <Pine.GSO.4.44.0410221514150.18876-100000@s5b004>, on 10/22/2004
at 03:18 PM, Andreas Prilop <nh******@rrzn-user.uni-hannover.de>
said:
If code position xD5 is a euro sign, then it cannot be cp437 nor
cp850.
COUNTRY=001,G:\OS2\SYSTEM\COUNTRY.SYS
CODEPAGE=850,437
DEVINFO=KBD,UX,G:\OS2\KEYBOARD.DCP

Alt-e displays a Euro. QED.
http://www.unicode.org/Public/MAPPIN...T/PC/CP437.TXT
http://www.unicode.org/Public/MAPPIN...T/PC/CP850.TXT


Those might have some relevance if I were running an m$ operating
system. I'm not.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Jul 23 '05 #38
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote in message news:<Pine.GSO.4.44.0410081515480.2997-100000@s5b004>...
On Thu, 7 Oct 2004, Shmuel (Seymour J.) Metz wrote:
What if you have a € in the document? When was &euro; added; my HTML
4.0 documentation doesn't list it. Is it safe to assume a browser
supporting HTML 4.01 or XHTML?


Even Netscape 4.08 displays &euro; as euro sign or as abbreviation
EUR - depending on the operating system and on the available fonts.

But Netscape 4.8 for Macintosh and Windows doesn't understand
ISO-8859-15. What's worse, the Macintosh version refuses to apply
the usual ISO-8859-1 <-> MacRoman transcoding, meaning that *all*¹
special, non-ASCII characters are incorrect.

¹ Strictly speaking "all but five": ¢ £ µ ± © have the same
code positions.
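
The "all but five" footnote can be checked mechanically. The sketch below compares Python's bundled `latin-1` and `mac_roman` codecs, used here as stand-ins for the tables Netscape would have relied on (an assumption), and lists the non-ASCII byte values that decode to the same character in both:

```python
# Byte values in 0x80..0xFF that mean the same character in
# ISO-8859-1 and MacRoman.
shared = [
    b for b in range(0x80, 0x100)
    if bytes([b]).decode("latin-1") == bytes([b]).decode("mac_roman")
]

print([f"0x{b:02X} {chr(b)}" for b in shared])
```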


You gentlemen clearly know more about character encodings than I do.
Perhaps you can help me. If I copy some text from one page is there a
way to clean it up when I paste it into another page?

For instance, I copy some text from a page with this header:
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1" />

Now I paste it into a form run by the PHP/MySQL weblog software I'm
working on. The software sends out the page with a charset of UTF-8. I
get a lot of garbage characters in the final output, as you can see
here:

http://www.krubner.com/index.php?pageId=34622

Is there anything I can do to clean up this mess? For instance, in
PHP, when a user inputs some info, I could scan the byte stream for
iso-8859-1 quote marks, if they are unique from other character
encodings. That is, if I could figure a way of determining what
characters are mistakes, and which are not. I can do the programming,
but for now I don't know what to program. Can anyone suggest a
strategy? Is there anything that would show up in a mass of iso-8859-1
text that would never, ever show up in UTF-8 text?
Jul 23 '05 #39
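On the closing question in the post above: UTF-8 has a strict byte structure, so most ISO-8859-1 text that contains any non-ASCII character is not well-formed UTF-8. Every byte of 0x80 or above in UTF-8 must be part of a multi-byte sequence, whereas in ISO-8859-1 such a byte usually stands alone between ASCII letters. A few byte values never occur in UTF-8 at all. The Python sketch below illustrates both points; `looks_like_utf8` and `NEVER_IN_UTF8` are hypothetical names chosen for the example.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 (pure ASCII counts as well-formed)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte values that cannot appear anywhere in well-formed UTF-8.
NEVER_IN_UTF8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))

# The same two words encoded both ways.
latin1_sample = "na\u00efve caf\u00e9".encode("latin-1")
utf8_sample = "na\u00efve caf\u00e9".encode("utf-8")

print(looks_like_utf8(latin1_sample))
print(looks_like_utf8(utf8_sample))
```

The test only works in one direction: any byte string is "valid" ISO-8859-1, so a failed UTF-8 check says the input is some legacy encoding without saying which one. Note also that ISO-8859-1 itself has no curly quote marks; the bytes 0x91 to 0x94 that produce them come from Windows-1252, so that is usually the better fallback assumption for text copied on Windows.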
