Bytes IT Community

can a textarea on a form be used to cast text to a specific charset like UTF-16?

I was told in another newsgroup (about XML, I was wondering how to
control user input) that most modern browsers empower the designer to
cast the user created input to a particular character encoding. This
arose in answer to my question about how to control user input. I had
complained that I had users who wrote articles in Microsoft Word or
WordPerfect and then input that to the web through a textarea box on a
form I'd created.

I've run google searches on this and I get tons of info but none to
the point. Can anyone here give me pointers on converting form input
to a particular character encoding?
Jul 20 '05 #1
16 Replies


In article <da**************************@posting.google.com>,
lk******@geocities.com (lawrence) wrote:
> I was told in another newsgroup (about XML, I was wondering how to
> control user input)

Authors on the WWW cannot control user input.

> that most modern browsers empower the designer to
> cast the user created input to a particular character encoding. This
> arose in answer to my question about how to control user input. I had
> complained that I had users who wrote articles in Microsoft Word or
> WordPerfect and then input that to the web through a textarea box on a
> form I'd created.
>
> I've run google searches on this and I get tons of info but none to
> the point. Can anyone here give me pointers on converting form input
> to a particular character encoding?


Serve the page that holds your form in the character encoding you
intend. Use an HTTP Content-Type header first, and the meta
'content-type' element as a supplement.
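For instance, declaring the encoding both at the HTTP level and in the markup might look like this (the file and field names are invented for illustration):

```html
<!-- The HTTP response itself should also carry:
     Content-Type: text/html; charset=utf-8 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<form method="post" action="submit.php" accept-charset="UTF-8">
  <textarea name="article" rows="10" cols="60"></textarea>
  <input type="submit" value="Send">
</form>
</form>
```

The accept-charset attribute is only a hint; as discussed later in this thread, browser support for it was uneven.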

Why not use utf-8? I hear utf-16 has problems on the Web. Perhaps I am
misinformed, so take the comments of better-informed people more
seriously.

--
Kris
<kr*******@xs4all.netherlands> (nl)
Jul 20 '05 #2

"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com
> I was told in another newsgroup (about XML, I was wondering how to
> control user input) that most modern browsers empower the designer to
> cast the user created input to a particular character encoding. This
> arose in answer to my question about how to control user input. I had
> complained that I had users who wrote articles in Microsoft Word or
> WordPerfect and then input that to the web through a textarea box on a
> form I'd created.

This is a vast subject. You should read a number of documents to get a
clear overall picture, and then read the ones that address your
particular problem.

First, go to the W3C:
http://www.w3.org/TR/html401/charset.html
and also have a look at the internationalization section:
http://www.w3.org/International/articles/

Then you can read the many complementary documents, for example:
http://www.cs.tut.fi/~jkorpela/chars/index.html
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

Then you will notice that theory and practice are, as commonly observed
on the web, not quite the same. In particular, strange things happen
when the user does very common things, like inserting a euro sign under
IE6 on Windows, in a form that was sent specifying a UTF-8 charset. I
haven't done a lot of testing with Office on Windows, but I suspect
similarly strange behaviour when doing a simple copy/paste... If anyone
has experience with that, I would be very pleased to hear from them!

Oh, by the way, using UTF-16 in a web context isn't recommended, even
if your document contains a lot of non-Latin characters... You should
use a usual 8-bit charset (ISO Latin-1 or Latin-9, for example,
depending on the main language you use), or UTF-8 if you really need
it. But make your choice knowing exactly all of its consequences!

Jul 20 '05 #3

On Wed, 11 Aug 2004, lawrence wrote:
> I was told in another newsgroup (about XML, I was wondering how to
> control user input) that most modern browsers empower the designer to
> cast the user created input to a particular character encoding.

If you have a captive browser population, then that might be feasible;
but in a WWW context this is rarely the case - you have to make the
best use of what you get.

> This arose in answer to my question about how to control user input.

With respect: in a WWW context you do better to pay attention to the
options that are open to you to interpret what you've been sent, since
in the final analysis you can't literally "control" anything. In your
server-side process, you need to be able to cope with (that is: either
accept if you can, or gracefully refuse if you can't) just anything
that a client will send to you, including what will be sent by
malicious or just plain broken clients.

> I had complained that I had users who wrote articles in Microsoft
> Word or WordPerfect and then input that to the web through a
> textarea box on a form I'd created.

I think you'll need to be more specific. In Word alone, I've seen so
many variations (including Mac Word users who had created Mac-coded
characters which didn't exist in the Windows encoding!) that I could
write a whole thesis on the topic.

> I've run google searches on this and I get tons of info but none to
> the point. Can anyone here give me pointers on converting form input
> to a particular character encoding?


I see that you've already been pointed to my tutorial-ish page at
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

My impression is that your best chance with modern browsers is to send
the form page with utf-8 encoding and to expect the form submission to
come back in utf-8 encoding. But if you need to deal with NN4.* this
will go horribly wrong, and some other browsers with limited scope
will do that too. Content negotiation unfortunately doesn't help
here, since NN4 *claims* to support utf-8 - and, as far as display of
web pages is concerned, that's sort-of true; but when it comes to form
submission, it goes desperately wrong.

There may be other browsers (e.g. WebTV) to worry about, but at least
they don't make Accept-charset claims which they're unable to fulfil.

My web page cited above is certainly incomplete. And there are some
more-recent notes on this topic at the W3C, I think. Feel free to
share your experiences and see if we can improve this area of
coverage, if not in the software, then at least in documentation and
tutorials, OK?

btw I don't know any reason to favour utf-16 encodings - every browser
which I'm aware of supporting utf-16 can also support utf-8, which
seems to me to be better supported (and advertised as supported via
accept-charset) in general. So I'd go for utf-8 if it's advertised by
the client, except where it's known not to work (NN4.*).
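The "go for utf-8 if it's advertised by the client" rule can be sketched as a check on the Accept-Charset request header. A minimal illustration in Python (the function name is invented, and the * wildcard is deliberately ignored, since its presence is exactly what makes the header a weak signal):

```python
def client_accepts(header: str, charset: str) -> bool:
    """Tiny Accept-Charset check: is the charset explicitly listed
    (possibly with a q-value) and not disabled with q=0?
    The '*' wildcard is ignored on purpose."""
    for item in header.split(","):
        parts = item.strip().split(";")
        name = parts[0].strip().lower()
        q = 1.0
        for p in parts[1:]:
            p = p.strip()
            if p.startswith("q="):
                q = float(p[2:])
        if name == charset.lower():
            return q > 0
    return False

# The NN4-era style of header quoted later in this thread:
print(client_accepts("ISO-8859-1,utf-8;q=0.7,*;q=0.7", "utf-8"))   # True
print(client_accepts("ISO-8859-1,utf-8;q=0.7,*;q=0.7", "utf-16"))  # False
```

Note that utf-16 fails this check for the header above, which is the point made about Mozilla further down the thread.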

good luck
Jul 20 '05 #4

Alan J. Flavell wrote:
> btw I don't know any reason to favour utf-16 encodings - every browser
> which I'm aware of supporting utf-16 can also support utf-8, which
> seems to me to be better supported (and advertised as supported via
> accept-charset) in general.

UTF-16 is better for documents written in a language where the majority
of characters used would be more than 2 bytes in UTF-8. So, for
documents that mostly use ASCII characters, with the occasional
punctuation mark (such as an em-dash —), dingbat ☺, or other symbol
outside the ASCII range, UTF-8 is better. AFAIK, the main reasons to
choose one over the other on the web are file size and UA support.
> So I'd go for utf-8 if it's advertised by
> the client, except where it's known not to work (NN4.*).

Anyone who hasn't upgraded from NN4 will have difficulty with more than
just character encodings, so I wouldn't consider that a problem worth
worrying about.

--
Lachlan Hunt
http://www.lachy.id.au/

Please direct all spam to ab***@127.0.0.1
Thank you.
Jul 20 '05 #5

On Sat, 14 Aug 2004, Lachlan Hunt wrote:
> Alan J. Flavell wrote:
>> btw I don't know any reason to favour utf-16 encodings - every
>> browser which I'm aware of supporting utf-16 can also support
>> utf-8, which seems to me to be better supported (and advertised as
>> supported via accept-charset) in general.
>
> UTF-16 is better for documents written in a language where the
> majority of characters used would be more than 2 bytes in UTF-8.

Yes, I'm sorry, my remark reads as being much wider than I had
intended: the comment was specifically related to forms submission
support in a WWW context - not meant to be interpreted more widely.

>> So I'd go for utf-8 if it's advertised by the client, except
>> where it's known not to work (NN4.*).
>
> Anyone who hasn't upgraded from NN4 will have difficulty with more
> than just character encodings, so I wouldn't consider that a problem
> worth worrying about.

No disagreement - my point was that NN4 advertises utf-8 in its
Accept-charset, and indeed it has the capability to *display* utf-8,
making utf-8 a good choice for serving i18n content to NN4, if it were
only about displaying the web document; but when it comes to forms
submission it all goes sadly wrong.
Jul 20 '05 #6

"Lachlan Hunt" <la**********@lachy.id.au.invalid> wrote in
message news:lb*******************@news-server.bigpond.net.au
> UTF-16 is better for documents written in a language where the
> majority of characters used would be more than 2 bytes in UTF-8.

But HTML, CSS, JS only use 7-bit ascii characters. So it all depends
on the ratio between the amount of text you've got for your code and
for your content... I think everyone has their own theory about
that :), but I personally think the only thing to remember is that
UTF-16 seems to have far less support in browsers than UTF-8!

Jul 20 '05 #7

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message news:<Pi*******************************@ppepc56.ph.gla.ac.uk>...
> On Wed, 11 Aug 2004, lawrence wrote:
>> I was told in another newsgroup (about XML, I was wondering how to
>> control user input) that most modern browsers empower the designer to
>> cast the user created input to a particular character encoding.
>
> If you have a captive browser population, then that might be feasible;
> but in a WWW context this is rarely the case - you have to make the
> best use of what you get.
>
>> This arose in answer to my question about how to control user input.
>
> With respect: in a WWW context you do better to pay attention to the
> options that are open to you to interpret what you've been sent, since
> in the final analysis you can't literally "control" anything. In your
> server-side process, you need to be able to cope with (that is: either
> accept if you can, or gracefully refuse if you can't) just anything
> that a client will send to you, including what will be sent by
> malicious or just plain broken clients.
>
>> I had complained that I had users who wrote articles in Microsoft
>> Word or WordPerfect and then input that to the web through a
>> textarea box on a form I'd created.
>
> I think you'll need to be more specific. In Word alone, I've seen so
> many variations (including Mac Word users who had created Mac-coded
> characters which didn't exist in the Windows encoding!) that I could
> write a whole thesis on the topic.


Thanks so much for the lengthy reply. I realize my question was too
broad. I have something like a captive population. Not really, of
course - I don't know where my customers come from - but they'll be
using my server-side software to create weblog entries. I think I only
need to force conversion to UTF-8. If the customer gets some garbage
characters, that is fine, in a sense; they can always redo the page.
The more important thing is that I need to be able to output their
input as XML and to say what character set the output XML is in. That
is what is important. My question really should have been "Can I force
conversion to UTF-8 from any kind of input?"

I've not yet read your tutorial, I'll go read that now.
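Forcing conversion server-side can be approximated by decoding with the expected charset first and falling back. A hedged sketch in Python (the function name and the fallback list are assumptions, not anything from the thread; in PHP one would reach for mb_convert_encoding or similar):

```python
def to_utf8(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")) -> str:
    """Decode raw form bytes, trying the expected charset first.

    cp1252 is a useful fallback because Word's "smart quotes"
    (bytes 0x93/0x94 etc.) are cp1252 codes that are invalid UTF-8.
    latin-1 never fails to decode, so it acts as a last resort.
    """
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list; kept for safety.
    return raw.decode("utf-8", errors="replace")

# Word-style curly quotes arrive as cp1252 bytes, not valid UTF-8:
print(to_utf8(b"\x93quoted\x94"))  # decoded via the cp1252 fallback
```

This is only a heuristic: bytes that happen to be valid UTF-8 win even if the user meant something else, which is why the echo-it-back-for-review approach discussed below in the thread is still worth having.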
Jul 20 '05 #8

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message news:<Pi*******************************@ppepc56.ph.gla.ac.uk>...
[same quoted exchange as in the previous reply snipped]
> I see that you've already been pointed to my tutorial-ish page at
> http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html
>
> My impression is that your best chance with modern browsers is to send
> the form page with utf-8 encoding and to expect the form submission to
> come back in utf-8 encoding. But if you need to deal with NN4.* this
> will go horribly wrong, and some other browsers with limited scope
> will do that too. Content negotiation unfortunately doesn't help
> here, since NN4 *claims* to support utf-8 - and, as far as display of
> web pages is concerned, that's sort-of true; but when it comes to form
> submission, it goes desperately wrong.

Okay, I've read over most of your tutorial now. It is very good. Thank
you for undertaking this work.

It occurs to me now that all I need do is to make clear to users when
they are being stupid - I need to make their garbage characters
visible to them. I guess I can achieve this by having them submit the
form with charset set to UTF-8 and then have the server resend the
input, show it to the user, and ask them, "Please review this. If you
see any garbage characters in this, you'll need to save your input as
plain text in a word processor before you submit it." Or words to that
effect.

I'm not clear from your tutorial if support on current browsers is
widespread enough to rely on the charset attribute when doing
multi-part form inputs. But it seems to me I can use PHP to get the
same effect on the server.
Jul 20 '05 #9

Pierre Goiffon wrote:
> "Lachlan Hunt" <la**********@lachy.id.au.invalid> wrote in
> message news:lb*******************@news-server.bigpond.net.au
>> UTF-16 is better for documents written in a language where the
>> majority of characters used would be more than 2 bytes in UTF-8.
>
> But HTML, CSS, JS only use 7-bit ascii characters.

Since when? They can all be in any character set that your editor
supports saving in. So, if your editor only supports 7-bit ascii, then
of course that's all they can be. I must have a more advanced editor
than you, since I've created HTML, CSS and JS files in UTF-8, UTF-16 and
ISO-8859-1 with non-ascii characters, and they work just fine.

> I personally think the only thing to remember is that UTF-16 seems to
> have far less support in browsers than UTF-8!

That could be true of older UAs, but AFAIK all modern UAs that support
UTF-8 also support UTF-16.
--
Lachlan Hunt
http://www.lachy.id.au/

Please direct all spam to ab***@127.0.0.1
Thank you.
Jul 20 '05 #10

"Lachlan Hunt" <la**********@lachy.id.au.invalid> wrote in
message news:dK*************@news-server.bigpond.net.au
>>> UTF-16 is better for documents written in a language where the
>>> majority of characters used would be more than 2 bytes in UTF-8.
>>
>> But HTML, CSS, JS only use 7-bit ascii characters.
>
> Since when? They can all be in any character set that your editor
> supports saving in.

I wasn't thinking about charsets... I meant that they only use simple
ascii characters. If you save a text file containing only HTML in
UTF-16, you get a larger file than if you save it in us-ascii, because
the characters 0-127 are encoded using two bytes each in UTF-16 rather
than one.

Jul 20 '05 #11

"lawrence" <lk******@geocities.com> wrote in message
news:da**************************@posting.google.com
> It occurs to me now that all I need do is to make clear to users when
> they are being stupid - I need to make their garbage characters
> visible to them. I guess I can achieve this by having them submit the
> form with charset set to UTF-8 and then have the server resend the
> input, show it to the user, and ask them, "Please review this. If you
> see any garbage characters in this, you'll need to save your input as
> plain text in a word processor before you submit it." Or words to that
> effect.

Very good idea!

> I'm not clear from your tutorial if support on current browsers is
> widespread enough to rely on the charset attribute when doing
> multi-part form inputs. But it seems to me I can use PHP to get the
> same effect on the server.

I don't know what you mean by "charset attribute". Is this a form tag
attribute? Is this the meta http-equiv value? Well, anyway, you
_should_ *always* specify the encoding used in the appropriate HTTP
headers. Reading the document at the URL I gave before (in the W3C
HTML spec) will certainly convince you of that.

Jul 20 '05 #12

On Tue, 17 Aug 2004, Lachlan Hunt wrote:
>> I personally think the only thing to remember is that UTF-16 seems
>> to have far less support in browsers than UTF-8!
>
> That could be true of older UAs, but AFAIK all modern UAs that
> support UTF-8, also support UTF-16.


Mozilla (1.7) says:

Accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

I wouldn't exactly call that "support" for utf-16, would you?

I won't deny that if you send it utf-16 under that wildcard, it'll
know how to display it. Although I haven't investigated its behaviour
when submitting a form under those circumstances. Have you?

best regards
Jul 20 '05 #13

On Mon, 16 Aug 2004, lawrence wrote:
> It occurs to me now that all I need do is to make clear to users when
> they are being stupid - I need to make their garbage characters
> visible to them.
Good plan, indeed. Overall, browser support for display is definitely
better than for forms submission, so if you re-play the input to them
(with proper charset specification in the HTTP headers) and they can
see what they intended to see, then they can be pretty confident that
it's working. If they don't see what they intended to see, then it
could be their fault or someone else's (yours or the browser
designer's), but whichever it is, they'll need to try something else.
So yes, yours looks a good plan all round.
> I'm not clear from your tutorial if support on current browsers is
> widespread enough to rely on the charset attribute when doing
> multi-part form inputs.
I think what it comes down to is that if you find no charset on their
multipart MIME portion(s), then you have to assume it's the same
charset as you had sent out in your original HTML page (the one that
contained the form). If you -do- find a charset, then you can rely on
it, but some browser implementers were reluctant to send a charset
because of incompetent server-side scripts which got upset by it.
Could be that's changed by now. My page is admittedly incomplete, and
not exactly up to the minute.
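That defaulting rule can be sketched as follows (Python, using the stdlib email parser purely as an illustration; the function name and the page default are invented):

```python
from email.message import Message

def part_charset(content_type, page_charset="utf-8"):
    """Charset for one multipart/form-data part: use the part's own
    charset parameter if the browser sent one, otherwise fall back to
    the charset the form page itself was served with."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    cs = msg.get_param("charset")
    return cs if cs else page_charset

print(part_charset('text/plain; charset="iso-8859-1"'))  # iso-8859-1
print(part_charset(None))                                # utf-8 (page default)
```

A part with no Content-Type header at all, or one without a charset parameter, both fall back to the page's own encoding, which matches the assumption described above.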

From your point of view, I certainly wouldn't want to try relying on
setting an accept-charset on the form. I think you just have to leave
that open, and try and cope with whatever comes back to you.
> But it seems to me I can use PHP to get the same effect on the
> server.


Sorry, I don't understand that remark. Your server-side process gets
sent whatever it gets sent by the client. Doesn't matter how you
program it on the server side, you get no more nor less data from the
client. Can you be more specific about what you were thinking at that
point?

Just a general comment. Some of the search services, especially those
with support for multiple languages and writing systems, are evidently
tracking browser support for this kind of stuff much more closely than
my limited personal resources can achieve. While you can't read their
minds about why they're doing what they do, at least you can review
what their facilities look like in the various languages, what kind of
HTML they use, and so on, and draw certain conclusions about what they
must have found to be successful.

all the best
Jul 20 '05 #14

On Tue, 17 Aug 2004, Pierre Goiffon wrote:
> "lawrence" <lk******@geocities.com> wrote in message
>> I'm not clear from your tutorial if support on current browsers is
>> widespread enough to rely on the charset attribute when doing
>> multi-part form inputs.
>
> I don't know what you mean by "charset attribute"?


http://www.w3.org/TR/html401/interac...ml#h-17.13.4.2

A "multipart/form-data" message contains a series of parts, each
representing a successful control.
[...]
As with all multipart MIME types, each part has an optional
"Content-Type" header that defaults to "text/plain". User agents
should supply the "Content-Type" header, accompanied by a
"charset" parameter.
^^^^^^^

I interpreted the question as relating to *this* charset parameter.
> Is this a form tag attribute?
Is there an HTML specification to refer to before asking such
questions? (Hint: yes, there is: the form element is documented
at http://www.w3.org/TR/html401/interact/forms.html#h-17.3 ).

More to the point of the discussion here would be "how well supported
is the accept-charset attribute of the form element?". Unfortunately
I don't have wide and up-to-date details of that, but on the grounds
of principle it's clear that coverage can only be wider if one can
devise ways of working which don't rely on that.
> Well, anyway, you _should_ *always* specify the encoding used in the
> appropriate http headers.

That would be true in its appropriate context, yes; but in the case of
a form submission it's irrelevant, because there's no place to specify
a charset in a GET submission; and in the case of a POST submission,
if multipart/form-data is used then the place for the charset(s) is in
the MIME parts inside the multipart data, not in the HTTP headers
themselves.
> Reading the document at the URL I gave before (in the W3C HTML spec)
> will certainly convince you of that.


Erm. See above :-}

good luck
Jul 20 '05 #15

Tim
Pierre Goiffon wrote:
> I personally think the only thing to remember is that UTF-16 seems to
> have far less support in browsers than UTF-8!

Lachlan Hunt <la**********@lachy.id.au.invalid> posted:
> That could be true of older UAs, but AFAIK all modern UAs that support
> UTF-8, also support UTF-16.


I recently tried UTF-16 on a few recent, and not obscure, web browsers.
Only some of them could handle it. It was enough to convince me that it
wasn't a good idea at this time. Even UTF-8 is still problematic (witness
the mess some search engines make of quoting UTF-8 encoded site content).

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.
Jul 20 '05 #16

"Pierre Goiffon" <pg******@nowhere.invalid> wrote in message
>> I'm not clear from your tutorial if support on current browsers is
>> widespread enough to rely on the charset attribute when doing
>> multi-part form inputs. But it seems to me I can use PHP to get the
>> same effect on the server.
>
> I don't know what you mean by "charset attribute"? Is this a form tag
> attribute? Is this the meta http-equiv value? Well, anyway, you
> _should_ *always* specify the encoding used in the appropriate HTTP
> headers. Reading the document at the URL I gave before (in the W3C
> HTML spec) will certainly convince you of that.


I meant that perhaps web browsers could not be trusted to encode input
into UTF-8. In which case, it might be wiser to do so on the server,
using the PHP function for that purpose. More here:

http://www.php.net/manual/en/function.utf8-encode.php
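For reference, what utf8_encode does is reinterpret ISO-8859-1 bytes as UTF-8. The same operation sketched in Python, as an illustration (the function name is invented):

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """What PHP's utf8_encode does: treat the input bytes as
    ISO-8859-1 and re-encode them as UTF-8. Every byte is a valid
    Latin-1 character, so this never fails -- but it silently mangles
    input that was *already* UTF-8 (double-encoding)."""
    return raw.decode("iso-8859-1").encode("utf-8")

print(latin1_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```

The double-encoding caveat is why checking whether the input already decodes as UTF-8 before converting, as discussed earlier in the thread, is worth the extra step.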
Jul 23 '05 #17
