Forms and encoding

I'd like to implement some sort of search function on my site, so I took
Google sample code and tried it, i.e. basically:

<form method="GET" action="http://www.google.com/search">
<input type="hidden" name="as_sitesearch" value="www.relinquiere.com">
<input type="text" name="q" size="15" value="">
<input type="image" id="submit" value="" src="..." ...>
</form>

It works fine, most of the time: if I type in accented characters, they
get somehow misinterpreted.

My test page is: http://wwww.relinquiere.com/search.html
As you can see for yourself, the charset parameter in Content-Type is
ISO-8859-1 (that's intended), so I expect my client to send the request
(when submitting the form) using the same encoding (even if it is not
required to do so).

Here is the request when I enter "préhistorique" in my search box:
GET /search?as_sitesearch=www.relinquiere.com&q=pr%E9historique&x=8&y=8 HTTP/1.1

where %E9 is actually the value for "é" in the Latin-1 repertoire. But
Google interprets it as "pr?historique". If I enter some UTF-8 data in
the search field, this works fine (accented characters are correctly
passed to Google). Does it mean that Google expects UTF-8 data? or that
something is wrong with my form?
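
To illustrate the mismatch described above, here is a minimal Python sketch (not from the original post) using the standard library's urllib.parse. It shows how the same word is percent-encoded under ISO-8859-1 and under UTF-8, and what a receiver that assumes UTF-8 would make of the Latin-1 form. That Google decodes as UTF-8 by default is only an assumption here.

```python
from urllib.parse import quote, unquote

word = "préhistorique"

# Percent-encode the same text under two different character encodings.
latin1_form = quote(word, encoding="iso-8859-1")
utf8_form = quote(word, encoding="utf-8")

print(latin1_form)  # pr%E9historique
print(utf8_form)    # pr%C3%A9historique

# A receiver that assumes UTF-8 cannot decode the lone byte 0xE9,
# so with errors="replace" it substitutes U+FFFD (often shown as "?").
misread = unquote(latin1_form, encoding="utf-8", errors="replace")
print(misread == "pr\ufffdhistorique")  # True
```

The single byte %E9 is a valid "é" only if both sides agree on Latin-1; read as UTF-8 it is an incomplete sequence, which would be consistent with the "pr?historique" result.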

Then I added a hidden field to my form:
<input type="hidden" name="ie" value="ISO88591">
as you can see in: http://www.relinquiere.com/search-latin-1.html

(I assume that this "ie" field stands for "input encoding", so that Google
can interpret the received data as Latin-1.)

Now, entering "préhistorique" as before works and returns one page. Here
is the request sent to Google:
GET /search?as_sitesearch=www.relinquiere.com&ie=ISO88591&q=pr%E9historique&x=9&y=3 HTTP/1.1
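
The query string of that second request can be rebuilt with a short Python sketch (an illustration, not part of the original post). It uses urllib.parse.urlencode with Latin-1 as the encoding, then parses the result back the way a receiver that honours the declared encoding might. The x and y image-button coordinates are left out, and the decoding step is an assumption about the receiving side, not a description of Google's code.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "as_sitesearch": "www.relinquiere.com",
    "ie": "ISO88591",        # value exactly as used in the form above
    "q": "préhistorique",
}

# Encode the values as Latin-1 bytes, then percent-encode them.
query = urlencode(params, encoding="iso-8859-1")
print(query)

# Assumed receiver: decode the percent-encoded bytes as Latin-1.
decoded = parse_qs(query, encoding="iso-8859-1")
print(decoded["q"][0])  # préhistorique
```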

What I conclude is that Google needs to be told what encoding is used
for the parameters, which is fair, but this raises a big issue: how am I
supposed to know what encoding my visitors use?

Imagine that a French-speaking Japanese visits my site: he will receive
my page encoded in ISO-8859-1, enter some text (let's assume this text
is made of latin characters - is this possible in Japanese encoding?),
submit the form, and now what? Will his input be encoded in ISO-8859-1 too?
--
Want to spend holidays in France ? Check http://www.relinquiere.com/
Jul 23 '05 #1
2 Replies


Vincent Poinot <vi***************************@wanadoo.fr> wrote:

> I'd like to implement some sort of search function on my site, so I
> took Google sample code and tried it, i.e. basically:
> - -
> It works fine, most of the time: if I type in accented characters,
> they get somehow misinterpreted.

Yep, and you're right: it's an encoding problem.

> Does it mean that Google expects UTF-8
> data? or that something is wrong with my form?

Apparently Google expects UTF-8 by default.

> <input type="hidden" name="ie" value="ISO88591">
> - -
> Now, entering "préhistorique" as before works and returns one page.

That's interesting. I don't know whether Google recognizes the misspelled
name of the encoding or just uses ISO-8859-1 when it does not understand
the value of the ie field, but in any case the correct method is to use
an IANA registered name for the encoding, preferably the preferred MIME
name:
<input type="hidden" name="ie" value="ISO-8859-1">
(Hyphens are significant in character encoding names.)

> What I conclude is that Google needs to be told what encoding is used
> for the parameters, which is fair, but this raises a big issue: how
> am I supposed to know what encoding my visitors use?

The browser normally uses, for form submission, the encoding of the page
where the form appears. (In theory, you could specify otherwise by using
the accept-charset attribute in the <form> element, but as far as I know,
no browser supports it.) So you should just check that you have specified
that encoding properly, preferably in HTTP headers, or at least in a
<meta> tag.
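
As a rough illustration of that rule (a sketch added for clarity, not code from the thread), the Python below reads the charset parameter from a Content-Type header value and percent-encodes a form value in that charset. The helper names charset_from_content_type and simulate_submission are hypothetical, and the fallback to ISO-8859-1 when no charset is declared is an assumption of this sketch; real browsers choose their own defaults.

```python
from email.message import Message
from urllib.parse import quote

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the declared name.
    return msg.get_content_charset(failobj=default)

def simulate_submission(text, content_type):
    """Percent-encode text in the page's charset, as a browser is assumed to."""
    return quote(text, encoding=charset_from_content_type(content_type))

print(simulate_submission("préhistorique", "text/html; charset=ISO-8859-1"))
print(simulate_submission("préhistorique", "text/html; charset=UTF-8"))
```

The first call yields the %E9 form and the second the %C3%A9 form, which is why declaring the page encoding properly matters for what the form sends.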

> Imagine that a French-speaking Japanese visits my site: he will
> receive my page encoded in ISO-8859-1, enter some text (let's assume
> this text is made of latin characters - is this possible in Japanese
> encoding?), submit the form, and now what? Will his input be encoded
> in ISO-8859-1 too?

As far as I've understood, his browser should send the characters as
ISO-8859-1 encoded and does so. If you tried something less common like
ISO-8859-15, problems would arise. But ISO-8859-1 should work fine, at
least in all browsing situations where your ISO-8859-1 encoded page is
legible in the first place!
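
One way such problems could show up, sketched in Python (an added illustration; the thread does not say which failure was meant): ISO-8859-15 and ISO-8859-1 assign different characters to a few byte values, so data encoded in one and read as the other is silently altered. The French word "cœur" is used as sample input.

```python
word = "cœur"

# "œ" exists in ISO-8859-15 (byte 0xBD) but not in ISO-8859-1.
raw = word.encode("iso-8859-15")
print(raw)  # b'c\xbdur'

# A receiver that assumes ISO-8859-1 reads byte 0xBD as "½".
misread = raw.decode("iso-8859-1")
print(misread)  # c½ur

# The same word cannot be encoded as ISO-8859-1 at all.
try:
    word.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```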

You might wish to check Alan Flavell's treatise on character encoding
problems in forms:
http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html

(What Google does with accented letters is an interesting story, though
beyond the scope of the group. It gives strangely different results for
préhistorique
prehistorique
+préhistorique
+prehistorique
but it generally treats e.g. e and é as equivalent.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #2

Jukka K. Korpela wrote:

> Vincent Poinot <vi***************************@wanadoo.fr> wrote:
>
>> <input type="hidden" name="ie" value="ISO88591">
>
> That's interesting. I don't know whether Google recognizes the misspelled
> name of the encoding or just uses ISO-8859-1 when it does not understand
> the value of the ie field, but in any case the correct method is to use
> an IANA registered name for the encoding, preferably the preferred MIME
> name:
> <input type="hidden" name="ie" value="ISO-8859-1">
> (Hyphens are significant in character encoding names.)

Thanks for the tip: I changed that (and it still works, of course). Just
out of curiosity, as you suggested, I also tried to give Google some
garbage instead of a proper encoding name... and it also returns correct
results!
(http://www.google.com/search?as_site...orique&x=0&y=0)
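
That observation would be consistent with a receiver that falls back to ISO-8859-1 whenever the declared name is not recognized. The Python sketch below shows that hypothesis only; it is not Google's actual logic, and decode_query_value is a hypothetical helper introduced for illustration.

```python
import codecs
from urllib.parse import unquote

def decode_query_value(raw, declared, fallback="iso-8859-1"):
    """Decode a percent-encoded value, falling back if the name is unknown."""
    try:
        # codecs.lookup raises LookupError for names Python does not know.
        encoding = codecs.lookup(declared).name
    except LookupError:
        encoding = fallback  # assumed fallback behaviour
    return unquote(raw, encoding=encoding, errors="replace")

print(decode_query_value("pr%E9historique", "ISO-8859-1"))       # préhistorique
print(decode_query_value("pr%E9historique", "no-such-charset"))  # préhistorique
```

Under this assumption a garbage name and the correct name give the same result, so a successful search does not prove that the name was understood.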
>> Imagine that a French-speaking Japanese visits my site: he will
>> receive my page encoded in ISO-8859-1, enter some text (let's assume
>> this text is made of latin characters - is this possible in Japanese
>> encoding?), submit the form, and now what? Will his input be encoded
>> in ISO-8859-1 too?
>
> As far as I've understood, his browser should send the characters as
> ISO-8859-1 encoded and does so. If you tried something less common like
> ISO-8859-15, problems would arise. But ISO-8859-1 should work fine, at
> least in all browsing situations where your ISO-8859-1 encoded page is
> legible in the first place!

Yes, I guess this is where I was heading: as long as I stick to
ISO-8859-1, everything should be fine. However, this whole mechanism
looks pretty fragile to me when it comes to more exotic encodings...
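
A sketch of where the fragility comes from (added for illustration; the thread does not describe this case): text that the page's charset cannot represent has to be substituted somehow before it is sent. The Python below replaces such characters with numeric character references, which, as far as I know, is roughly what current browsers do; treat that as an assumption. encode_for_form is a hypothetical helper.

```python
from urllib.parse import quote

def encode_for_form(text, page_charset):
    """Percent-encode text in page_charset; unencodable chars become &#NNNN;."""
    raw = text.encode(page_charset, errors="xmlcharrefreplace")
    return quote(raw)

# Representable in ISO-8859-1: sent as a single Latin-1 byte.
print(encode_for_form("préhistorique", "iso-8859-1"))

# Not representable in ISO-8859-1: each character becomes a reference
# such as &#26085;, which is then percent-encoded like any other text.
print(encode_for_form("日本", "iso-8859-1"))
```

The receiver cannot tell such a reference apart from a visitor who literally typed "&#26085;", which is one reason the mechanism is brittle outside the page's own repertoire.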

> You might wish to check Alan Flavell's treatise on character encoding
> problems in forms:
> http://ppewww.ph.gla.ac.uk/%7eflavel...form-i18n.html

Already read that: excellent and very useful indeed.
--
Want to spend holidays in France ? Check http://www.relinquiere.com/
Jul 23 '05 #3
