UTF-8 garbage characters

I'd love to ask why this page is not rendering correctly in Safari on
a Macintosh but I suspect someone will tell me to validate the page
first. Nevertheless, if anyone sees an obvious reason that I'm
missing, I'd like to know. It looks like a missing div tag but I
can't see one.

http://www.krubner.com/

Let's move on to a question that might be answerable. If I copy and
paste non-UTF-8 characters into the page, and then send out a
"charset=UTF-8" header, will I get the garbage characters that I'm
seeing?
Jul 23 '05 #1
*lawrence* wrote 2004-10-01 09:34:
> I'd love to ask why this page is not rendering correctly in Safari on
> a Macintosh but I suspect someone will tell me to validate the page
> first. Nevertheless, if anyone sees an obvious reason that I'm
> missing, I'd like to know. It looks like a missing div tag but I
> can't see one.
>
> http://www.krubner.com/
>
> Let's move on to a question that might be answerable. If I copy and
> paste non-UTF-8 characters into the page, and then send out a
> "charset=UTF-8" header, will I get the garbage characters that I'm
> seeing?

If you set charset UTF-8 in the header, you also have to save the
document as UTF-8 before uploading it. I believe it's saved in a default
charset, probably ISO-8859-1.

--
/Arne
Jul 23 '05 #2
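
A minimal Python sketch of the mismatch described above, assuming the file was saved as ISO-8859-1 while the server announces UTF-8. The function name `render_as` and the sample word are illustrative only; a real browser's error handling may differ in detail.

```python
def render_as(raw: bytes, declared_charset: str) -> str:
    """Decode bytes the way a client would if it trusted the declared charset.

    Bytes that are invalid in that charset become U+FFFD (replacement character).
    """
    return raw.decode(declared_charset, errors="replace")


word = "räksmörgås"

# File saved as ISO-8859-1 but announced as UTF-8: the accented letters are lost.
saved_latin1 = word.encode("iso-8859-1")
lost = render_as(saved_latin1, "utf-8")        # 'r\ufffdksm\ufffdrg\ufffds'

# File saved as UTF-8 and announced as UTF-8: the text survives.
saved_utf8 = word.encode("utf-8")
intact = render_as(saved_utf8, "utf-8")        # 'räksmörgås'

# The reverse mismatch (UTF-8 bytes read as ISO-8859-1) shows two
# characters for every accented letter.
doubled = render_as(saved_utf8, "iso-8859-1")  # 'rÃ¤ksmÃ¶rgÃ¥s'
```

The header and the bytes on disk have to agree; changing only one of them produces one of the two kinds of garbage shown here.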
Arne <ar**@luras.nu> wrote in message news:<SH*****************@newsb.telia.net>...
> If you set charset UTF-8 in the header, you also have to save the
> document as UTF-8 before uploading it. I believe it's saved in a default
> charset, probably ISO-8859-1.

Since sending out UTF-8 headers, I've started to get a lot of garbage
characters on the page. This is despite hitting it with encode_UTF-8 in
PHP. The problem is that I don't know what charset the text is in ahead
of time, so I can't do any kind of proper conversion on the text. The web
is too heterogeneous. I wonder how Blogger and TypePad tackle this
problem? Is it just luck that so few of their users end up with
garbage characters? Or is it because they standardize on a different
charset?
Jul 23 '05 #3
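
One possible cause of the garbage, sketched in Python. This assumes "encode_UTF-8" refers to PHP's utf8_encode(), which treats its input as ISO-8859-1 without checking; the helper `latin1_to_utf8` below is a hypothetical stand-in for that behaviour, not the poster's actual code.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes as UTF-8, assuming (without checking) they are ISO-8859-1."""
    return raw.decode("iso-8859-1").encode("utf-8")


text = "å"

# Input that really is ISO-8859-1 is converted correctly.
from_latin1 = latin1_to_utf8(text.encode("iso-8859-1"))
ok = from_latin1.decode("utf-8")       # 'å'

# Input that is already UTF-8 gets encoded a second time.
from_utf8 = latin1_to_utf8(text.encode("utf-8"))
garbled = from_utf8.decode("utf-8")    # 'Ã¥'
```

A blind conversion is only safe when the source charset is known, which is exactly the information missing here.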
*lawrence* wrote 2004-10-02 02:13:
> Since sending out UTF-8 headers, I've started to get a lot of garbage
> characters on the page. This is despite hitting it with encode_UTF-8 in
> PHP. The problem is that I don't know what charset the text is in ahead
> of time, so I can't do any kind of proper conversion on the text. The web
> is too heterogeneous. I wonder how Blogger and TypePad tackle this
> problem? Is it just luck that so few of their users end up with
> garbage characters? Or is it because they standardize on a different
> charset?

What editor do you use for building and editing the pages?

In the editor I use, I can choose whatever encoding I want, and the
file is saved in that encoding. If I open a file I can change the
encoding the same way, with a single click on a button in the editor's
toolbar. In Windows XP you can also choose the encoding for a file
when saving it in Notepad.

Because I save the files as UTF-8, I don't need to convert the text
manually when writing it. My guess is that Blogger and TypePad work the
same way: the posts that users write are saved as UTF-8, and since that
is a Unicode encoding, all kinds of characters and languages can be used.

--
/Arne
Jul 23 '05 #4
Arne <ar**@luras.nu> wrote in message news:<%9*****************@newsb.telia.net>...
> What editor do you use for building and editing the pages?

Most people use Microsoft Internet Explorer to build pages on their
weblogs. That is, they log in, type some text in a TEXTAREA and hit
"Post", and then their words appear as a new page on the web.

The problem arises when they copy and paste from other places. The
biggest problems, I think, arise when they copy text off a web page
that uses another encoding.

You can see for yourself here:

http://www.publicdomainsoftware.org/designer/

You'll need a username and password to log in. Use these:

username: designer
password: designer123

The login link is at the bottom.

> Because I save the files as UTF-8, I don't need to convert the text
> manually when writing it. My guess is that Blogger and TypePad work the
> same way: the posts that users write are saved as UTF-8, and since that
> is a Unicode encoding, all kinds of characters and languages can be used.

I guess that is the theory that now needs to be tested. My suspicion
is the opposite of yours - they are using something other than UTF-8,
which is why garbage characters appear if you copy something from them
and paste it into a web page that is using UTF-8.
Jul 23 '05 #5
lk******@geocities.com (lawrence) wrote in message news:<da**************************@posting.google. com>...
> Most people use Microsoft Internet Explorer to build pages on their
> weblogs. That is, they log in, type some text in a TEXTAREA and hit
> "Post", and then their words appear as a new page on the web.

I would have thought that bloggers would be the sort to be clueful
enough to use decent browsers, like Mozilla, instead.

--
Dan
Jul 23 '05 #6
*Daniel R. Tobias* wrote 2004-10-05 22:49:
> I would have thought that bloggers would be the sort to be clueful
> enough to use decent browsers, like Mozilla, instead.

OK, I see now what Lawrence's problem is. The browser used doesn't matter
here. When testing the blog I used some words from my native language
(Swedish) and wrote this:

The problem with charset UTF-8 on pages with forms (e.g. guestbooks,
formmail and blogs) is that writing in a non-English language can give
garbage characters for the letters that do not exist in the English
alphabet. That's because what is written in the text box doesn't get
encoded the way text produced with HTML editors does.
So it may be best to use another charset (is ISO-8859-1 the best?) for
pages with forms and text boxes. ISO-8859-1 covers a lot of languages with
"strange" characters, but far from all, so the problem may not be
solved completely.

You can write the message in separate software (such as Notepad) and
then encode foreign letters as entities, for example the letter "å"
(Latin small letter a with ring above) as &aring;. But then, will it
look like "å" or like &aring; on the page after posting?

--
/Arne
Jul 23 '05 #7
"Arne" <ar**@luras.nu> wrote in message
news:4t*********************@newsc.telia.net
> The problem with charset UTF-8 on pages with forms (e.g. guestbooks,
> formmail and blogs) is that writing in a non-English language can give
> garbage characters for the letters that do not exist in the English
> alphabet. That's because what is written in the text box doesn't get
> encoded the way text produced with HTML editors does.

I really can't understand your post. A server that sends a form to a client
with the appropriate charset headers should get back all of the user's
input encoded in that charset. If the form is sent with a UTF-8 header, you
should get all the characters encoded in UTF-8, and so the user can input
any character included in Unicode.

Jul 23 '05 #8
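
A small Python sketch of the server side of that claim, assuming the browser really does submit the form in the page's charset. The field name "msg" and the helper `parse_form_body` are made up for illustration.

```python
from urllib.parse import parse_qs


def parse_form_body(body: str, page_charset: str = "utf-8") -> dict:
    """Decode an application/x-www-form-urlencoded body using the page's charset.

    errors="strict" makes a charset mismatch raise UnicodeDecodeError
    instead of silently producing garbage.
    """
    return parse_qs(body, encoding=page_charset, errors="strict")


# Body a browser would be expected to submit from a UTF-8 page.
fields = parse_form_body("msg=r%C3%A4ksm%C3%B6rg%C3%A5s")
# fields == {'msg': ['räksmörgås']}
```

If the same word arrives as "msg=r%E4ksm%F6rg%E5s" (ISO-8859-1 bytes), the strict decode fails, which at least makes the mismatch visible to the application.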
On Wed, 6 Oct 2004, Pierre Goiffon wrote:
> I really can't understand your post. A server that sends a form to a client
> with the appropriate charset headers should get back all of the user's
> input encoded in that charset.

"Should" sounds right, but it isn't always going to work, depending on
all kinds of browser bugs and oddities, to say nothing of the
uncertainty about what should happen with method GET (which /officially/
only supports us-ascii).

> If the form is sent with a UTF-8 header, you should get all the
> characters encoded in UTF-8, and so the user can input any character
> included in Unicode.

On the other hand, the combination of user errors, browser bugs and
plain malice means that your server *could* get presented with all
kinds of rubbish: you might not be able to do anything useful with it,
but you better program defensively to ensure it won't do you any harm.
(E.g. you better validate any byte sequences that are supposed to be
UTF-8 to make sure they really are - the Unicode spec effectively
mandates such validation, and failure to do so could be rated as a
security exposure.)

I haven't worked this page over in a while, but it gives a general
flavour of what I found, reviewed against what the specifications say
(and - also important - what they /don't/ say):
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

h.t.h
Jul 23 '05 #9
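
The validation step recommended above can be sketched in Python as follows. This is only one way to do it, relying on the language's strict UTF-8 decoder; the function name `is_valid_utf8` is illustrative.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


good = is_valid_utf8("räksmörgås".encode("utf-8"))        # True
latin1 = is_valid_utf8("räksmörgås".encode("iso-8859-1"))  # False
overlong = is_valid_utf8(b"\xc0\xaf")                      # False: overlong form of "/"
```

The overlong case is the security-relevant one: a decoder that accepted it would let "/" slip past checks made on the raw bytes.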
On Wed, 6 Oct 2004, Pierre Goiffon wrote:
> A server that sends a form to a client
> with the appropriate charset headers should get in return all the users
> input encoded in that charset. If the form is sent with a UTF-8 header, you
> should get all the characters encoded in UTF-8. And so user could input any
> character included in Unicode.


Try
<http://google.com/search?q=%EA%E5%F1%DC%F4%E9%EF.%ED&ie=ISO-8859-7&oe=UTF-8>
with Netscape 4.x.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #10
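
To see what the query string in that URL carries, here is a short Python sketch. It assumes only what the URL itself states: the "q" bytes are in the encoding named by "ie" (ISO-8859-7).

```python
from urllib.parse import quote, unquote

query = "%EA%E5%F1%DC%F4%E9%EF.%ED"

# Interpreted with the declared input encoding (ie=ISO-8859-7) this is Greek text.
as_greek = unquote(query, encoding="iso-8859-7")

# Interpreted as UTF-8 the same bytes are invalid and come out as
# U+FFFD replacement characters.
as_utf8 = unquote(query, encoding="utf-8", errors="replace")

# What a client would send for the same text if it used UTF-8 instead.
utf8_query = quote(as_greek, encoding="utf-8")
```

The bytes alone do not say which reading is right; the "ie" parameter is what tells the server how to interpret them.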
*Pierre Goiffon* wrote 2004-10-06 10:29:
> I really can't understand your post. A server that sends a form to a client
> with the appropriate charset headers should get back all of the user's
> input encoded in that charset. If the form is sent with a UTF-8 header, you
> should get all the characters encoded in UTF-8, and so the user can input
> any character included in Unicode.

Yes, the servers *should* do that, but I have yet to see a server that
does. Google is the exception.

When I type the Swedish word "räksmörgås" into Google, I can see in the
query string on the results page that Google encoded it as
r%C3%A4ksm%C3%B6rg%C3%A5s, and that it is decoded back to "räksmörgås" in
the links Google finds. But that has yet to happen on other servers.

E.g. many websites by Swedes (in Swedish or other languages) use the free
guestbook services found on the Internet (the guestbook files are on a
remote server). Mostly they use the charset ISO-8859-1, but if they use
UTF-8, the garbage shows up in those guestbooks too.

--
/Arne
Jul 23 '05 #11
On Wed, 6 Oct 2004, Arne wrote:
> When I write the Swedish word "räksmörgås" to Google

"Google" is no URL!
There's <http://www.google.se/webhp?oe=ISO-8859-1> and
there's <http://www.google.se/webhp?oe=UTF-8> .

> I can see in the
> search string code on the output page that Google encoded it to
> r%C3%A4ksm%C3%B6rg%C3%A5s

Not necessarily. You can also have
<http://www.google.se/search?ie=ISO-8859-1&oe=ISO-8859-1&q=r%E4ksm%F6rg%E5s>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #12
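
The two query strings quoted above are the same word percent-encoded from two different byte encodings. A minimal Python check, using only the standard library:

```python
from urllib.parse import quote, unquote

word = "räksmörgås"

utf8_form = quote(word, encoding="utf-8")         # 'r%C3%A4ksm%C3%B6rg%C3%A5s'
latin1_form = quote(word, encoding="iso-8859-1")  # 'r%E4ksm%F6rg%E5s'

# Each form decodes back to the same word, but only with the matching encoding.
assert unquote(utf8_form, encoding="utf-8") == word
assert unquote(latin1_form, encoding="iso-8859-1") == word
```

Neither form is "the" encoding of the word; the receiver has to know which one was used.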
*Andreas Prilop* wrote 2004-10-06 17:53:
> "Google" is no URL!

Did I say that? It's a search service, or whatever you like to call
it, on the Internet. Goofle is used to search for URLs from a web page.

> There's <http://www.google.se/webhp?oe=ISO-8859-1> and
> there's <http://www.google.se/webhp?oe=UTF-8> .

I only type in www.google.se (or google.com) to get the page with the
search form, as I guess most people do. And in all cases the meta tag is
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

BTW, I can give you http://www.google.se/webhp?hl=ISO-8859-1 too, but
that doesn't change the charset, only the preferred language on the
search page, from Swedish to English.

> Not necessarily. You can also have
> <http://www.google.se/search?ie=ISO-8859-1&oe=ISO-8859-1&q=r%E4ksm%F6rg%E5s>

What are you trying to prove? Why should I use URLs like that? :-)

--
/Arne

Jul 23 '05 #13
*Arne* wrote 2004-10-06 19:16:
> it, on the Internet. Goofle is used to search for URLs from a web page.
>                         ^
Bah, damn typos! Google!!! :-)

--
/Arne
Jul 23 '05 #14
"Arne" <ar**@luras.nu> wrote in message
news:OO*****************@newsb.telia.net
>> I really can't understand your post. A server that sends a form to a
>> client with the appropriate charset headers should get back all of
>> the user's input encoded in that charset.
> (...)

Alan, Andreas, Arne, thanks for your answers.
I have seen in various technical documents that the GET method is really
buggy (characters sent back by the browser to the server aren't always
encoded using the appropriate charset). But I have never heard of such
problems with the POST method. Has anyone run a few sample tests with it?

Andreas, I didn't understand what I was supposed to see on the web page
you pointed to:
> Try
> <http://google.com/search?q=%EA%E5%F1...859-7&oe=UTF-8
> with Netscape 4.x.

I opened this using Netscape 4.75 and, well, I can't really read the Arabic
and Russian alphabets, but everything seems OK compared with Mozilla 1.7.3
(apart from the Arabic characters in the title).

Jul 23 '05 #15
On Thu, 7 Oct 2004, Pierre Goiffon wrote:
>> <http://google.com/search?q=%EA%E5%F1...859-7&oe=UTF-8
>> with Netscape 4.x.
>
> I opened this using Netscape 4.75 and, well, I can't really read the Arabic
> and Russian alphabets, but everything seems OK compared with Mozilla 1.7.3
> (apart from the Arabic characters in the title).

You are confused. I referred to
<http://google.com/search?q=%EA%E5%F1%DC%F4%E9%EF.%ED&ie=ISO-8859-7&oe=UTF-8>
which has only Greek characters in the title.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #16
On Wed, 6 Oct 2004, Arne wrote:
>> "Google" is no URL!
>
> Did I say that? It's a search service, or whatever you like to call
> it, on the Internet. Goofle is used to search for URLs from a web page.
>     ^^^^^^^^^^^^^^^

<sigh> You have to specify an address (URL).

>> There's <http://www.google.se/webhp?oe=ISO-8859-1> and
>> there's <http://www.google.se/webhp?oe=UTF-8> .
>
> I only type in www.google.se (or google.com) to get the page with the
>                                                     ^^^
> search form, as I guess most people do.

You didn't get it, did you? There are many different search forms,
in different encodings, languages, with different options, etc.
There is no such thing as *the* page.

> And in all cases the meta tag is
> <meta http-equiv="content-type" content="text/html; charset=UTF-8">

No, it isn't. Even if you do not specify the output encoding (oe),
Google sends different encodings depending on your operating system
and your browser.
BTW: This meta thingy is only an ersatz - but Google doesn't know
either, as their pages are sent without a "charset" HTTP header.

> BTW, I can give you http://www.google.se/webhp?hl=ISO-8859-1

That's nonsense since "hl" is "help language". You can have, e.g.
<http://www.google.se/webhp?hl=sa&oe=UTF-8> or
<http://www.google.se/webhp?hl=fy&oe=ISO-8859-1>

>>> I can see in the
>>> search string code on the output page that Google encoded it to
>>> r%C3%A4ksm%C3%B6rg%C3%A5s
>>
>> Not necessarily. You can also have
>> <http://www.google.se/search?ie=ISO-8859-1&oe=ISO-8859-1&q=r%E4ksm%F6rg%E5s>
>
> What are you trying to prove?

... that it is sufficient *but not necessary* to encode "räksmörgås"
as [UTF-8] "r%C3%A4ksm%C3%B6rg%C3%A5s".

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #17
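
The remark that the meta element is only a substitute for the HTTP header can be sketched in Python like this. It follows the precedence given in the HTML 4.01 specification (HTTP "charset" parameter first, meta declaration second); the helper name `declared_charset`, the simplified regular expression and the 1024-byte scan window are assumptions made for the sketch, not how any particular browser works.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for <meta ... charset=...>; real pages can be messier.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(content_type: Optional[str], html: bytes) -> Optional[str]:
    """Return the charset a client is told to use, or None if none is declared.

    The HTTP Content-Type header wins; the <meta> element is only a fallback.
    """
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    match = META_CHARSET.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
from_meta = declared_charset("text/html", page)                         # 'utf-8'
from_header = declared_charset("text/html; charset=ISO-8859-1", page)   # 'iso-8859-1'
```

When the header carries no charset, as described above for Google's pages, the meta element is all a client has to go on.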
Once upon a time *Andreas Prilop* wrote:
> ... that it is sufficient *but not necessary* to encode "räksmörgås"
> as [UTF-8] "r%C3%A4ksm%C3%B6rg%C3%A5s".

I can't say I fully understand what's happening in Google's server when
I search for "räksmörgås" :-)

But I pointed out earlier that Google's servers do what I have yet to
see any other browser do, when I see the encoding UTF-8 in the source.

--
/Arne
Jul 23 '05 #18
"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in
message news:Pine.GSO.4.44.0410071538180.29802-100000@s5b004
> You are confused.

Yes I was, thanks for the clarification!

Jul 23 '05 #19
On Thu, 7 Oct 2004, Arne wrote:
> But I earlier pointed out that Googles serves does what I have yet to
> see any other browser to do, when I see the encoding UTF-8 in source.


Come again?

Other search engines also allow you to search in different encodings:
<http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
<http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
etc.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #20
Once upon a time *Andreas Prilop* wrote:
> Other search engines also allow you to search in different encodings:
> <http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
> <http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
> etc.

As I said, *I* have yet to see it.
I understand there are other search engines, maybe even other sites
somewhere. But as I don't use them, I have not seen it on them :-)

--
/Arne

Jul 23 '05 #21
Arne <ar**@luras.nu> wrote in message news:<4t*********************@newsc.telia.net>...
> The problem with charset UTF-8 on pages with forms (e.g. guestbooks,
> formmail and blogs) is that writing in a non-English language can give
> garbage characters for the letters that do not exist in the English
> alphabet. That's because what is written in the text box doesn't get
> encoded the way text produced with HTML editors does.

Any text input through a form is, by default in most browsers, given
the same encoding as the page.

However, I am not relying on the defaults here; I am using PHP to
encode it to UTF-8.

> You can write the message in separate software (such as Notepad) and
> then encode foreign letters as entities.

No, the message has to be written in my software, the one that I'm
creating. I need to figure out what the right answer to this problem
is.
Jul 23 '05 #22
"Pierre Goiffon" <pg******@nowhere.invalid> wrote in message news:<41***********************@news.free.fr>...
> I really can't understand your post. A server that sends a form to a client
> with the appropriate charset headers should get back all of the user's
> input encoded in that charset. If the form is sent with a UTF-8 header, you
> should get all the characters encoded in UTF-8, and so the user can input
> any character included in Unicode.

I agree, Arne's post was not to the point.

The real problem comes up when other text is copied and pasted from
another page. You can see it clearly in the quotes on this page:

http://www.krubner.com/index.php?pageId=31475

The quotes come from some page that had another encoding. They show up
as garbage characters on a page that is supposed to be UTF-8. How to
handle this?
Jul 23 '05 #23
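
One common heuristic for this situation, sketched in Python. It assumes that anything which is not valid UTF-8 is Windows-1252 (the usual source of "curly" quotes in pasted text); that fallback is a guess, not a detection, and the function name `to_utf8` is illustrative.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Normalize incoming bytes to UTF-8 before storing them.

    Valid UTF-8 is passed through unchanged; anything else is assumed to be
    in the fallback charset (an assumption, not a detection).
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace").encode("utf-8")


pasted = b"\x93quoted text\x94"            # curly quotes as Windows-1252 bytes
stored = to_utf8(pasted)                   # same text, now as UTF-8 bytes
already_ok = to_utf8("å".encode("utf-8"))  # unchanged
```

This works on a whole submission at a time, so it does not help when one submission mixes two encodings.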
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote in message news:<Pine.GSO.4.44.0410081500230.2887-100000@s5b004>...
> Other search engines also allow you to search in different encodings:
> <http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
> <http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
> etc.

I guess what I'd like to do is what Google does: run some
code that figures out what charset I'm being handed on input. But I
can't figure out a way to do that when I'm getting mixed input in a single
TEXTAREA box. If part of the input text is UTF-8 and part of it is in
some other ISO charset, what to do? And why am I struggling with this
issue, when it doesn't seem to come up for people using Blogger,
Movable Type, or pMachine? How did they solve this problem?
Jul 23 '05 #24
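
For the mixed-input case, one possible approach is to decode as UTF-8 and fall back byte by byte where that fails. The Python sketch below assumes the non-UTF-8 parts are Windows-1252; the handler name "cp1252_fallback" and the function `decode_mixed` are made up for the example, and nothing here is known to be what Blogger or the other services do.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: decode the bytes UTF-8 rejected as Windows-1252 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_mixed(raw: bytes) -> str:
    """Decode bytes that are mostly UTF-8 with stray Windows-1252 passages."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# UTF-8 "å", then Windows-1252 curly quotes and "é", all in one submission.
mixed = "å".encode("utf-8") + b" \x93hi\x94 " + "é".encode("cp1252")
text = decode_mixed(mixed)
```

The limitation is that some Windows-1252 byte pairs also happen to be valid UTF-8 and will be read as such, so this reduces the garbage rather than eliminating it.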
