How do search engines index multilingual content?

I am building a website with identical content in four different
languages. On a first visit, the search engine determines the language
of the content by the IP address of the visitor. What the user sees is
content in only one language at a time. He or she can then switch to
another language and set this as the preferred language, but again he
or she sees content in only this one other language.

The question now is: How do I get search engines to index ALL of the
content, in all languages?

Should I include the non-displayed content in DIVs with display set to
"none" (like we used to include complete websites in the noframes tag)?
Or do search engines ignore invisible DIVs?

Or can I somehow detect that a search engine is visiting and deliver a
page with the complete content of all four languages in it? Or would
that get me banned?

Or do I have to rely on the search engine following the local links to
the pages in the other languages? This might be a problem, because the
varying content is always displayed on the same page, so the URI stays
the same, and only one parameter changes, thus:
"content.php?language=oneoffourlanguages". In fact it might even be
impossible, because I do not want to transfer the language information
through the URL via GET, but want to send it through a form via POST.
So the URI is exactly the same for all languages (at least in the
version I am aiming at).

If you have solved this problem on your website or know how to go about
it, I'd be grateful for some help.

Jan 29 '06 #1
64 Replies


Manfred Kooistra wrote:
I am building a website with identical content in four different
languages.
Fine. You have explicitly linked the different versions to each other,
right? With reasonable link texts like the name of the page in the other
language, or, as tolerable option, the name of the language in the
language itself, right? No flags, no dropdowns, mm'kay?
On a first visit, the search engine determines the language
of the content by the IP address of the visitor.
No it does not. The idea is absurd. There is no visitor (beyond the
search engine itself) when a search engine indexes your content.
Besides, using the IP address to determine the language of a person is
absurd, too.
What the user sees is
content in only one language at a time.
Fine, but he should have simple access (links) to the other versions, too.
He or she can then switch to
another language and set this as the preferred language,
Setting a preferred language would take place via a URL string or via
cookies. That would be extra bonus to some (many) users, but the solid
basis needs to be built first.
The question now is: How do I get search engines to index ALL of the
content, in all languages?
The way you make them find all of your pages in general: using links.
Should I include the non-displayed content in DIVs with display set to
"none" (like we used to include complete websites in the noframes tag)?
No, that would be absurd and destructive (especially when your style
sheet is not used).
Or do search engines ignore invisible DIVs?
They may, or they may not, or they may punish the page for suspected
keyword spamming or cloaking.
Or can I somehow detect that a search engine is visiting and deliver a
page with the complete content of all four languages in it?
Some indexing robots can be detected heuristically. But don't do it.
Or would that get me banned?
Hopefully yes.
Or do I have to rely on the search engine following the local links to
the pages in the other languages?
That's the general idea.
This might be a problem, because the
varying content is always displayed on the same page, so the URI stays
the same, and only one parameter changes, thus:
"content.php?language=oneoffourlanguages".
If you have reasons to suspect that this is a problem, then don't do
that. But I wouldn't be worried about search engines that ignore pages
with a simple query part of the form ?foo=bar - they probably exist, but
they are losers in the search engine competition.
In fact it might even be
impossible, because I do not want to transfer the language information
through the URL via GET, but want to send it through a form via POST.
Just don't do that. Simple, eh?
So the URI is exactly the same for all languages (at least in the
version I am aiming at).


That's a completely wrong idea. You might, however, use an _additional_
generic URL that is resolved to one of specific URLs, via language
negotiation at the HTTP level. See
http://www.cs.tut.fi/~jkorpela/multi/
Jan 30 '06 #2

Manfred Kooistra wrote:
I am building a website with identical content in four different
languages. On a first visit, the search engine determines the language
of the content by the IP address of the visitor.
I don't understand what you mean by that. The best way of specifying the
language of an HTML page is with a lang attribute in the HTML tag (e.g.,
<HTML lang="en">)
What the user sees is
content in only one language at a time. He or she can then switch to
another language and set this as the preferred language, but again he
or she sees content in only this one other language.

The question now is: How do I get search engines to index ALL of the
content, in all languages?
Use <LINK> elements to indicate where the other translations can be found.

<http://www.w3.org/TR/REC-html40/struct/links.html#edef-LINK>
Should I include the non-displayed content in DIVs with display set to
"none" (like we used to include complete websites in the noframes tag)?
Or do search engines ignore invisible DIVs?
They generally penalize you for doing that. This is a frequently abused
technique for stuffing keywords into web pages in order to rank higher in the
result listings.
Or can I somehow detect that a search engine is visiting and deliver a
page with the complete content of all four languages in it? Or would
that get me banned?
That's called "cloaking", and is also frowned upon by search engines.
Or do I have to rely on the search engine following the local links to
the pages in the other languages? This might be a problem, because the
varying content is always displayed on the same page, so the URI stays
the same, and only one parameter changes, thus:
"content.php?language=oneoffourlanguages".


So the URI is actually different. But I'm not sure this is the best way of
doing things. Apache servers have some very useful built-in features for this
sort of thing.

<http://www.google.com/search?q=apache%20content%20negotiation>

--
philronan [@] blueyonder [dot] co [dot] uk

Jan 30 '06 #3

Philip Ronan wrote:
The best way of specifying the
language of an HTML page is with a lang attribute in the HTML tag (e.g.,
<HTML lang="en">)


The lang attribute is recommendable, but it has _no_ verified effect on
search engines.
The question now is: How do I get search engines to index ALL of the
content, in all languages?


Use <LINK> elements to indicate where the other translations can be found.


There is no evidence that shows that search engines utilize such <LINK>
elements. Surely normal links (<A> elements) are better, since they are
much more widely recognized by browsers _and_ search engines.
Jan 30 '06 #4

Jukka K. Korpela wrote:
Philip Ronan wrote:
The best way of specifying the
language of an HTML page is with a lang attribute in the HTML tag (e.g.,
<HTML lang="en">)
The lang attribute is recommendable, but it has _no_ verified effect on
search engines.


What do you mean by "verified"? Google mention that some pages provide
insufficient context for guessing the language of a web page. See
<http://www.google.co.uk/intl/en/help/faq_translation.html#link>, for
example:
Why don't all the results in translatable languages have the
"Translate" link?

We only offer the "Translate" link when we have a high degree of
confidence about the language of the selected page. Some pages may
contain multiple languages or insufficient text to provide a high
degree of certainty about the language in which they were written.
Are you suggesting Google don't use lang attributes to assist in this
process? Do you think they prefer to use a whole bunch of syntactic analysis
and word frequency tools to make this decision instead?
Use <LINK> elements to indicate where the other translations can be found.


There is no evidence that shows that search engines utilize such <LINK>
elements.


Really? What about <http://www.google.com/webmasters/bot.html#whatlinks>:
12. What kinds of links does Googlebot follow?

Googlebot follows HREF links and SRC links.


Are you saying that a <LINK> element with an href attribute is somehow *not*
an HREF link? Please explain.
Surely normal links (<A> elements) are better, since they are
much more widely recognized by browsers _and_ search engines.


That would be a useful addition, but not an absolute necessity as far as
search engines are concerned. Perhaps a pop-up menu would be neater than a
long list of A links.

--
philronan [@] blueyonder [dot] co [dot] uk

Jan 30 '06 #5

On Mon, 30 Jan 2006, Philip Ronan wrote:
The lang attribute is recommendable, but it has _no_ verified effect on
search engines.
What do you mean by "verified"? Google mention that some pages provide
insufficient context for guessing the language of a web page. See

^^^^^^^^ !!
<http://www.google.co.uk/intl/en/help/faq_translation.html#link>,
They don't mention the LANG attribute here.
Are you suggesting Google don't use lang attributes to assist in this
process? Do you think they prefer to use a whole bunch of syntactic analysis
and word frequency tools to make this decision instead?


I'm afraid, yes. Until recently, Google did the same to *guess*
the encoding (charset) of a page instead of reading the HTTP or META
charset parameter. Google Groups still ignore the charset parameter
of Usenet articles. Instead they use the group name and I-don't-know-
what-else to select an encoding for an article.

Example:
http://www.seekport.de/help/webmaster_tips.html#Sprache
mentions only a META tag. These simpletons don't know the LANG
attribute of HTML.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Jan 30 '06 #6

Andreas Prilop wrote:
On Mon, 30 Jan 2006, Philip Ronan wrote:
The lang attribute is recommendable, but it has _no_ verified effect on
search engines.
What do you mean by "verified"? Google mention that some pages provide
insufficient context for guessing the language of a web page. See

^^^^^^^^ !!
<http://www.google.co.uk/intl/en/help/faq_translation.html#link>,


They don't mention the LANG attribute here.


So read between the lines. It's quite obvious Google uses certain techniques
to ascertain the language of a web page. Examining lang attributes (where
available) is an obvious method for achieving this with the least effort.

Here's something you can try: click the "Advanced search" link in Google, and
search for pages in English that contain the phrase "Der Spindoktor". Then
take a look at the lang attribute in the HTML tag of the top result.
Google Groups still ignore the charset parameter
of Usenet articles. Instead they use the group name and I-don't-know-
what-else to select an encoding for an article.


That's an inevitable problem caused by putting multiple articles (with
different charsets) in a single web page. Google Groups has plenty of other
problems, but this has nothing to do with lang attributes.

--
philronan [@] blueyonder [dot] co [dot] uk

Jan 30 '06 #7

On Mon, 30 Jan 2006, Philip Ronan wrote:
Andreas Prilop wrote:
Google Groups still ignore the charset parameter of Usenet
articles. Instead they use the group name and I-don't-know-
what-else to select an encoding for an article.


That's an inevitable problem caused by putting multiple articles
(with different charsets) in a single web page.


I can't agree.

Mozilla's Bugzilla made the same mistake, and some of the
charset-related bug reports are sheer incomprehensible as a
consequence - they contain a mish-mash of Chinese, Cyrillic and
whatever else, in their different encodings, served out as raw bytes.
But the mistake was made many years back...

At least their discussion shows that they have recognised their
mistake, and understand how to correct it - mapping the various
encodings into Unicode, and serving out the results accordingly -
probably in utf-8.

(This might cause problems for people who are discussing the finer
details of Han unification, but that can't be helped now.)

Google have already, in effect, implemented something like that for
indexing web content. Otherwise it wouldn't be possible to find texts
in koi8-r and Windows-1251 when searching with a utf-8-encoded query:
the kind of problems that Andreas was reporting some years back with
various search engines, which (to put it briefly) made a query in one
encoding, and only returned pages which used that same encoding.
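
A minimal sketch of the normalisation described above, assuming each document arrives with a correctly declared charset. The sample word and the helper name `to_utf8` are made up for illustration; nothing here is taken from how Google or Bugzilla actually do it.

```python
# Hypothetical sample: the same Russian word ("search") in two legacy encodings.
WORD = "поиск"


def to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Decode bytes with the declared charset and re-encode them as UTF-8."""
    return raw.decode(declared_charset).encode("utf-8")


koi8_bytes = WORD.encode("koi8_r")
cp1251_bytes = WORD.encode("cp1251")

# The raw bytes differ, so a byte-level query in one encoding misses the other.
print(koi8_bytes == cp1251_bytes)

# After mapping both to Unicode, a single UTF-8 query matches both documents.
query = WORD.encode("utf-8")
print(to_utf8(koi8_bytes, "koi8_r") == query)
print(to_utf8(cp1251_bytes, "cp1251") == query)
```

The same step would apply to a page that mixes articles from several sources: convert each one separately, then serve the whole page as UTF-8.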

They just need to apply the same principle to what their ggroups
thingy is serving out. Admittedly, ggroups have *other*, *serious*,
problems to attend to first, such as encouraging their users to follow
netiquette - to at least the extent needed to get them out of the
widespread killfiling that they've already earned. But I digress.

Jan 30 '06 #8

On Mon, 30 Jan 2006, Philip Ronan wrote:
So read between the lines.
Never. I prefer to read the actual text. Internet Explorer reads
between the lines to guess the Content-Type, etc.
It's quite obvious Google uses certain techniques
to ascertain the language of a web page.
But what are "certain techniques"?
Examining lang attributes (where
available) is an obvious method for achieving this with the least effort.
Yes. But many (including Google) seem to think "Warum einfach, wenn's
auch kompliziert geht?"
That's an inevitable problem caused by putting multiple articles (with
different charsets) in a single web page.
No, it isn't. All articles are converted to UTF-8 by Google and
presented to you in UTF-8.
Google Groups has plenty of other
problems, but this has nothing to do with lang attributes.


I never said that. I said:
Google's web search ignored and Google's Usenet archive still ignores
the charset parameter although this is the only valid method to
get the encoding of a web page or an article.

Therefore it is plausible that they also ignore the LANG attribute.
And we have no indication that Google does read the LANG attribute -
which is sad, btw.

Jan 30 '06 #9

On Mon, 30 Jan 2006, Philip Ronan wrote:
Here's something you can try: click the "Advanced search" link in Google, and
search for pages in English that contain the phrase "Der Spindoktor". Then
take a look at the lang attribute in the HTML tag of the top result.


There is no such thing as "the top result".
http://www.google.com/search?q=%22De...=lang_en&hl=de
http://www.google.com/search?q=%22De...=lang_en&hl=en
http://www.google.com/search?q=%22De...=lang_en&hl=fr

Jan 30 '06 #10

On Mon, 30 Jan 2006, Andreas Prilop wrote:
On Mon, 30 Jan 2006, Philip Ronan wrote:
It's quite obvious Google uses certain techniques
to ascertain the language of a web page.


But what are "certain techniques"?


Indeed. It's "quite obvious" to me that Google uses UNcertain
techniques to guess the language of a web page.
Jan 30 '06 #11

Philip Ronan <no****@example.invalid> wrote:
The lang attribute is recommendable, but it has _no_ verified effect on
search engines.
What do you mean by "verified"?


Proved to be true, on the basis of observed facts, as opposite to mere
claims.
Google mention
Quite a lot, but the lack of _any_ reference to LANG attributes is ominous.
There is no evidence that shows that search engines utilize such <LINK>
elements. - -
Googlebot follows HREF links and SRC links.

That's what they say, but is there some evidence of its being true (for <LINK>
elements)?
Surely normal links (<A> elements) are better, since they are much
more widely recognized by browsers _and_ search engines.


That would be a useful addition,


No, you are thinking it all upside down. The <A> link is the real thing.
but not an absolute necessity as far as
search engines are concerned.
You have absolutely no direct evidence of search engines actually following
<LINK> references. There might be such evidence, but you haven't got it. And
it is more than obvious that search engines _have_ to follow <A> links,
whereas there's no real need for them to follow <LINK> references. Pages that
rely on <LINK> references only are so rare (and so poorly designed) that they
can be ignored.
Perhaps a pop-up menu would be neater than
a long list of A links.


I smell a troll.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jan 30 '06 #12

Jukka K. Korpela wrote:
Philip Ronan <no****@example.invalid> wrote:
The lang attribute is recommendable, but it has _no_ verified effect on
search engines.
What do you mean by "verified"?


Proved to be true, on the basis of observed facts, as opposite to mere
claims.


So unless I can show you in detail how Google and other search engines work,
you're going to stick with your own blinkered opinion?

Here are some references for you to pick over:
<http://www.seo-
guy.com/forum/showthread.php?s=3b479461ee5ca93132729d6902260b0a& p=71027#post71
027>
<http://randomfire.fierymill.net/arch...ilingual-site-
development-part-ii-the-lang-attribute/>
<http://diveintomark.org/archives/2002/06/18/day_7_identifying_your_language>
Google mention


Quite a lot, but the lack of _any_ reference to LANG attributes is ominous.


Why so? The lack of lang attributes in most web pages [1] makes their job a
lot harder for sure, but it would be idiotic to ignore them.

[1] <http://blog.searchenginewatch.com/blog/060125-160004>
There is no evidence that shows that search engines utilize such <LINK>
elements. - -> Googlebot follows HREF links and SRC links.
That's what they say, but is there some evidence of its being true (for <LINK>
elements)?


So you think Google are telling lies? What would be the point of that?
You have absolutely no direct evidence of search engines actually following
<LINK> references.
I have their word on the matter. What more do you want?
There might be such evidence, but you haven't got it.
Yes I have: <http://www.google.com/webmasters/bot.html#whatlinks>
And
it is more than obvious that search engines _have_ to follow <A> links,
whereas there's no real need for them to follow <LINK> references.
More than obvious? I'm sorry, but can you please prove that to be true, on
the basis of observed facts, as opposite to mere claims?
Pages that
rely on <LINK> references only are so rare (and so poorly designed) that they can be ignored.


Would you mind proving that for me as well?
Perhaps a pop-up menu would be neater than
a long list of A links.


I smell a troll.


/Meh

--
philronan [@] blueyonder [dot] co [dot] uk

Jan 30 '06 #13

On Mon, 30 Jan 2006, Philip Ronan wrote:
User-Agent: Hogwasher/4.2.2

<http://www.seo-
guy.com/forum/showthread.php?s=3b479461ee5ca93132729d6902260b0a& p=71027#post71
027>
<http://randomfire.fierymill.net/arch...ilingual-site-
development-part-ii-the-lang-attribute/>
Please learn first how to post long URLs to Usenet.
<http://diveintomark.org/archives/2002/06/18/day_7_identifying_your_language>
We know that all - but there is no proof that Google actually uses
the LANG attribute.
The lack of lang attributes in most web pages [1] makes their job a
lot harder for sure, but it would be idiotic to ignore them.


So the conclusion is that the monkeys working at Google are idiotic -
I agree.

Jan 30 '06 #14

Andreas Prilop wrote:
On Mon, 30 Jan 2006, Philip Ronan wrote:
The lack of lang attributes in most web pages [1] makes their job a
lot harder for sure, but it would be idiotic to ignore them.


So the conclusion is that the monkeys working at Google are idiotic -
I agree.


Sour grapes, huh?

--
philronan [@] blueyonder [dot] co [dot] uk

Jan 30 '06 #15

Thank you all!

First: I made a booboo. I am sorry that this has led to some
confusion. My post should have read: "On a first visit, the SERVER
determines the language in which the content will be displayed." I
wanted to explain what a human visitor sees, before I ask my question
about search engines.

My web site is made for human beings. When a person visits it for the
first time, a script on my server guesses at the most likely language
that the person might be speaking by his or her IP address. (I use
http://ip-to-country.webhosting.info to do this.) So the visitor sees a
page with content in one language, but he or she can choose one of the
other languages (and set a cookie) through a menu.

I would prefer URLs without parameters, but after what you wrote, some
research, and some thinking, it seems to me that a link to
thissamepage.php?language=otherlanguage is the best version to make
sure that search engines find them.

For me it does not matter whether search engines follow LINK tags or
not, because for me the URL is the problem: I prefer URLs without
parameters - for security reasons and because what I want to do, if
possible, is a hack for a CMS where I want to change as little of the
code as possible. If I have to use URLs with parameters, it will be
easier for me to build the site myself from scratch, but I still try to
avoid that.

Anyway, the question here is: What is the _best_ way to present
identical content in different languages to a search engine?

I don't like the answer, because it means work for me, but the answer
seems to be:

A text link with language parameters.

Anyone disagree?
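
A minimal sketch of what such links could look like when generated on the server, assuming a page named content.php and a `language` parameter as in the original question. The four language codes and the function name `language_links` are hypothetical; the link text is the name of each language in that language, as suggested earlier in the thread.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical language list: code -> name of the language in that language.
LANGUAGES = {
    "en": "English",
    "de": "Deutsch",
    "fr": "Français",
    "nl": "Nederlands",
}


def language_links(page: str, current: str) -> str:
    """Return plain <a> links to every language version except the current one."""
    links = []
    for code, name in LANGUAGES.items():
        if code == current:
            continue
        # Each version gets its own URL, reachable with an ordinary GET request.
        href = f"{page}?{urlencode({'language': code})}"
        links.append(
            f'<a href="{escape(href)}" hreflang="{code}" lang="{code}">'
            f"{escape(name)}</a>"
        )
    return " | ".join(links)


print(language_links("content.php", "en"))
```

Because every version has a distinct URL, a crawler that follows ordinary links can reach all four; a POST form would leave it with only the default version.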

---

Jukka, thank you for the step by step answers. I have read your
articles, and they gave me valuable ideas. Thank you for making them
available. I don't yet completely agree regarding the country flag
question, although I admit that I need to think about this.

I want to explain: I am a graphic artist, and the website will present
my artwork. Aesthetic questions are rather important to me, and users
with text browsers will not see the relevant content of my site anyway,
so for me four flags simply look better than four words. Also, I prefer
non-verbal communication where possible. I believe that people process
images faster, more accurately, and more easily than words (although,
of course, "English | French" is rather straightforward).

Still, I am glad that you pointed the difficulties regarding the flag
practice out to me, and I will carefully consider a possible solution.

Jan 30 '06 #16

Manfred Kooistra wrote:
My web site is made for human beings. When a person visits it for the
first time, a script on my server guesses at the most likely language
that the person might be speaking by his or her IP address.


Reinventing the wheel is a classic error, and that one's square.

Don't guess. Just serve the language requested by the browser.
It's less work, because the capability is built into your server
(unless you have something unusual).

See for example the documentation at httpd.apache.org. Change your
language preferences in your browser, and it'll serve you a different
language.
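
For readers without server-level negotiation, here is a rough sketch of what "serve the language requested by the browser" means in code, assuming the site offers a short list of primary language codes. The function names are hypothetical, and the matching is deliberately lenient (a request for "de-AT" is answered with "de"), which is a simplification of the matching rules in the HTTP specification rather than a faithful copy of what Apache does.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language header into (language-range, q) pairs, best first."""
    ranges = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((tag.strip().lower(), q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def choose_language(header: str, available: list[str], default: str) -> str:
    """Pick the best available language for the header, else the default."""
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag == "*":
            return default
        for lang in available:
            # Lenient assumption: "de-AT" is satisfied by a plain "de" version.
            if tag == lang or tag.startswith(lang + "-"):
                return lang
    return default


print(choose_language("de-AT,de;q=0.8,en;q=0.5", ["en", "fr", "de", "nl"], "en"))
```

Explicit links to the other versions are still needed, since the header can be missing or wrong.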

--
Nick Kew
Jan 31 '06 #17

Philip Ronan wrote:
Andreas Prilop wrote:
On Mon, 30 Jan 2006, Philip Ronan wrote:
<http://www.google.co.uk/intl/en/help/faq_translation.html#link>,

They don't mention the LANG attribute here.


So read between the lines. It's quite obvious Google uses certain techniques
to ascertain the language of a web page. Examining lang attributes (where
available) is an obvious method for achieving this with the least effort.

Here's something you can try: click the "Advanced search" link in Google, and
search for pages in English that contain the phrase "Der Spindoktor". Then
take a look at the lang attribute in the HTML tag of the top result.


The top result I received didn't even contain a lang attribute. It
contained a meta element specifying Content-Language as "en", but that
still doesn't prove anything about what Google used to determine the
language, especially considering the majority of that page is written in
English; it could have used any form of analysis to determine it.

The only way I know to test it accurately would be to set up some test
cases somewhere that are completely identical in content, so as to avoid
any other possible interference in the experiment, with the exception of
differing lang attributes. In other words, the only possible factor
that could be used to determine the language is the lang attribute.

A small problem with this is that the actual content in the page needs
to be as language independent as physically possible. i.e. Try to use
words/phrases that are common to both languages, but with different
meanings in each language; or otherwise use an equal sampling of each
language so any form of word frequency analysis by Google will have as
little effect on the result as possible.

Then, it's a matter of waiting for Google to index the pages and
performing a language specific search for a phrase contained in the
tests (also restrict the search to the site they're hosted on to avoid
unnecessary results) and see which pages are returned.

The experiment could also be repeated with the meta element and the HTTP
headers using Content-Language.
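
The test pages for such an experiment could be generated rather than written by hand, so that they are guaranteed to differ only in the lang attribute. The following is a sketch under assumptions: the marker text is a made-up token (a variation on the "language independent content" idea above, not the shared-vocabulary approach), the file names are kept neutral so the URL gives no hint, and `build_test_pages` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical marker: a made-up token, so word-frequency analysis has nothing to use.
BODY = "<p>zxqv langtest marker 20060131</p>"


def build_test_pages(lang_codes, out_dir):
    """Write one page per code and return {path: code}; only the lang attribute differs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = {}
    for number, code in enumerate(lang_codes, start=1):
        html = (
            "<!DOCTYPE html>\n"
            f'<html lang="{code}">\n'
            '<head><meta charset="utf-8"><title>lang attribute test</title></head>\n'
            f"<body>{BODY}</body>\n"
            "</html>\n"
        )
        # Neutral file names: the language code must not leak into the URL.
        path = out / f"test-{number}.html"
        path.write_text(html, encoding="utf-8")
        pages[path] = code
    return pages
```

Once the pages are indexed, a language-restricted search for the marker, limited to the test site, would show which page is returned for which language.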

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jan 31 '06 #18

Andreas Prilop ,comp.infosystems.www.authoring.html:
The lack of lang attributes in most web pages [1] makes their job a
lot harder for sure, but it would be idiotic to ignore them.


So the conclusion is that the monkeys working at Google are idiotic -
I agree.


Actually, it may just be pragmatic to ignore lang attributes, since a
non negligible number of webpages declare a different language from what
they are written in... This may be because of copy/paste habits, or
because of strange behaviour in authoring tools. I have no statistics on
the proportion of incorrect lang attribute on the html tag, but MS Word,
in particular, often put absurd <span lang=".."> around sentences it
thought to be in a different language. It is quite possible that, even on
webpages with a lang attribute for the html tag, a better precision will
be obtained by just guessing the language rather than relying on this
attribute.
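
To make the idea concrete, here is a toy sketch of how a crawler might compare the declared language with a guess from the text. Everything in it is an assumption for illustration: the stopword lists are tiny, the regular expressions are crude, and no search engine is known to work this way.

```python
import re

# Tiny hypothetical stopword lists; a real detector would use far larger models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "fr": {"le", "la", "les", "et", "est", "des", "une"},
}


def guess_language(text: str):
    """Return the language whose stopwords occur most often, or None if none occur."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def declared_language(html: str):
    """Return the lang attribute of the <html> tag in lower case, or None."""
    match = re.search(r'<html[^>]*\blang\s*=\s*["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None


def lang_mismatch(html: str) -> bool:
    """True when the declared language and the guessed language disagree."""
    declared = declared_language(html)
    guessed = guess_language(re.sub(r"<[^>]+>", " ", html))
    if declared is None or guessed is None:
        return False
    return declared.split("-")[0] != guessed
```

A page declared as lang="en" but written in German would be flagged, which is the copy/paste situation described above.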
Jan 31 '06 #19

There is another thread, where people argue against using the language
information supplied by the browser, because this is often missing or
incorrect. I don't know about this, but I can watch my own behaviour:

I consider myself to be a web surfer with some technical knowledge, yet
I have never set my preferred language in any of the browsers that I
have used over the years. To tell the truth, I don't know how to do it
(though I would probably find the settings quickly). Why didn't I do
it? Most of the websites I visit are unilingual, so I take them as they
are. And if a website is multilingual, I simply switch to the language
I prefer (which usually is the original language, so I don't switch at
all, because on most multilingual websites the translations are bad and
many pages are missing in the additional languages).

From my experience, most surfers are like me: they do not care for the
settings of their browsers, they take them as they are. You know,
people want to buy a computer, turn it on and do what was promised.
Computers here (in Germany) are marketed that way: plug in and play.

So, I don't rely on information that may be missing or wrong. An IP
address is always there, and in most cases it points to a country of
origin, and in most countrys one language is understood by a vast
majority. For example, most Canadians understand English. I don't say
that it is the language of all Canadians, but more people have it as
their mother tongue than French, and even the French Canadians and
foreigners living there know English. So, I simply serve English to
Canadians, and a visitor from Quebec can then switch to French and set
a cookie for future visits.

Jan 31 '06 #20

"Manfred Kooistra" <ma**************@gmx.de> wrote:
So, I don't rely on information that may be missing or wrong.
It is illogical to refuse to use the information in the protocol header
that was _designed_ to carry the user's language preferences and make a
guesswork based on information that has quite a different purpose and
meaning. Of course we know that the Accept-Language header often carries
wrong information, so it should not be trusted blindly. You still need to
give users an escape, in the form of explicit links, to other versions.
Besides, there are _other_ reasons
An IP address is always there,
It's there whenever TCP/IP is used. But for all that you can know, it could
be the IP address of a proxy, perhaps an anonymizer in a country different
from the user's homeland. You cannot even deduce the country reliably, still
less the language, or _a_ language. After all, you probably have the content
available in a few languages only, not in the about 6,000 or 8,000 languages
spoken in the world. Therefore, knowing the native language of the user is
not enough; you need to know his preferences among the languages in your
supply. The Accept-Language header has been designed for such purposes.
and in most cases it points to a country of origin,
Origin of what?
and in most countries one language is understood by a vast
majority.
How many countries did you check when estimating this? For which values of
"understood"? What would the one language be in India or Nigeria, for
example? Besides, why would you serve the content in a language that you
_guess_ (through several steps of heuristics, with lots of sources of error
at each step) to be "understood" by the user, instead of the _best_ fit?
For example, most Canadians understand English. I don't say
that it is the language of all Canadians, but more people have it as
their mother tongue than French, and even the French Canadians and
foreigners living there know English. So, I simply serve English to
Canadians, and a visitor from Quebec can then switch to French and set
a cookie for future visits.


So you mean that even if a Canadian _has_ configured his browser to send a
particular piece of information about his language preferences, you throw an
English version at him? That's not friendly.

The point in language settings in browsers is that it's a _general_
mechanism. Any multilingual site can play by the protocol rules and not
bother users who _have_ set their preferences. This is far better than using
various homebrew methods, many of which rely on unreliable techniques like
cookies (which are illegal in the European Union unless the site explicitly
explains that cookies are used and what they are used for, causing yet another
disturbance, a violation of the principle "don't mention the techniques").

What you _can_ meaningfully do with the result of your language guessing game
is to select the language of the error message or the default version to be
shown to the user, in case there is _no_ language specified in the Accept-
Language header that matches a language in your repertoire. Then again, I
wouldn't bother. Using (simple) English probably works fine, as long as you
have the explicit links to other versions.
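
The order of precedence described here can be written down in a few lines. This is only a sketch: it assumes the Accept-Language header has already been parsed into a list of codes ordered by preference, that the IP-based guess arrives as a single optional code, and that `pick_version` and its parameters are hypothetical names.

```python
def pick_version(accepted, available, geo_guess=None, default="en"):
    """Choose a language version.

    accepted:  codes from Accept-Language, already ordered by preference
    available: set of primary language codes the site offers
    geo_guess: optional code guessed from the IP address
    """
    # 1. What the user asked for always wins.
    for code in accepted:
        primary = code.lower().split("-")[0]
        if primary in available:
            return primary
    # 2. Only when nothing matches may the guess decide.
    if geo_guess in available:
        return geo_guess
    # 3. Otherwise fall back to the default (simple English, per the post above).
    return default


site = {"en", "fr", "de"}
print(pick_version(["fr-CA", "en"], site, geo_guess="en"))
```

With this ordering, a Canadian visitor whose browser asks for French gets French even though the IP lookup suggests English.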

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jan 31 '06 #21

Manfred Kooistra wrote:
I consider myself to be a web surfer with some technical knowledge, yet
I have never set my preferred language in any of the browsers that I
have used over the years.


So you'll have the default. In other words, the language your operating
system is set to. Not really too hard, is it?

But clearly you're determined to hack up something complex and broken,
so I'll just ignore it.

--
Nick Kew
Jan 31 '06 #22

P: n/a
On Tue, 31 Jan 2006, Pierre Senellart wrote:
Actually, it may just be pragmatic to ignore lang attributes, since a
non-negligible number of webpages declare a different language from what
they are written in...
That's the same reasoning as to ignore the Content-Type and treat
"text/plain" as "text/html" when the latter seems to fit.
And - What do you say - Google does exactly this:
http://ppewww.ph.gla.ac.uk/~flavell/...tent-type.html
MS Word,
in particular, often put absurd <span lang=".."> around sentences it
thought to be in a different language.


Those who write their HTML documents with MS Word deserve
to be punished.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Jan 31 '06 #23

P: n/a
On Tue, 31 Jan 2006, Nick Kew wrote:
Manfred Kooistra wrote:
I consider myself to be a web surfer with some technical
knowledge, yet I have never set my preferred language in any of
the browsers that I have used over the years.
So you'll have the default. In other words, the language your
operating system is set to. Not really too hard, is it?


When I, for a while, set our web server to log the Accept-language
settings, the overwhelming majority of requests were demanding content
in en-US only.

If one takes the specification at its word, that means they reject
generic English, as well as rejecting any other kind of English that
isn't -US. Perhaps if the negotiation had been enforced more firmly
from the outset, they would soon have realised that something was
wrong with their choice (OK - with their vendor's choice). Not for
the first time, MS is in violation of a recommendation in the
applicable IETF RFC. But most other browsers are weak on this
particular issue too.[1]

Now that we've drifted into this mess, it's hard to get out of it
again.
But clearly you're determined to hack up something complex and
broken, so I'll just ignore it.


Obviously, I'm not advocating any other mechanism to substitute for
the real thing - but it does seem as if a fairly liberal
interpretation of the Accept-language setting is essential, in
addition to explicit links to non-negotiated URLs for the other
languages which are offered (no national flags, by request!).

(But I think you agree, basically, don't you?)

cheers

[1] at the end of RFC2616 section 14.4:

Note: When making the choice of linguistic preference available to
the user, we remind implementors of the fact that users are not
familiar with the details of language matching as described above,
and should provide appropriate guidance. As an example, users
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
might assume that on selecting "en-gb", they will be served any
kind of English document if British English is not available. A
user agent might suggest in such a case to add "en" to get the
best matching behavior.
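
The matching rule from that section of the RFC can be sketched in a few lines of Python. `range_matches` follows the rule as written (a range matches a tag if it equals the tag or is a prefix of it followed by "-"); `liberal_matches` is a hypothetical relaxed variant of the kind argued for above, not something the RFC defines.

```python
def range_matches(language_range, tag):
    """RFC 2616 section 14.4: exact match, or the range is a prefix of the
    tag followed by "-". The special range "*" matches every tag."""
    language_range, tag = language_range.lower(), tag.lower()
    return (language_range == "*"
            or language_range == tag
            or tag.startswith(language_range + "-"))


def liberal_matches(language_range, tag):
    """Hypothetical relaxed rule: also try the primary language of the range."""
    primary = language_range.split("-", 1)[0]
    return range_matches(language_range, tag) or range_matches(primary, tag)
```

Under the strict rule a request for `en` is satisfied by an `en-US` document, but a request for `en-US` alone is not satisfied by a generic `en` document. The relaxed variant accepts both, and as a side effect would also serve `en-GB` to someone who asked only for `en-US`.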
Jan 31 '06 #24

P: n/a
On 30 Jan 2006, Manfred Kooistra wrote:
the best version to make sure that search engines find them.
[...]
What is the _best_ way to present
identical content in different languages to a search engine?
[...]
and users
with text browsers will not see the relevant content of my site anyway,


Can't you see the contradiction?

Jan 31 '06 #25

P: n/a
On 31 Jan 2006, Manfred Kooistra wrote:
Organization: http://groups.google.com

Most of the websites I visit are unilingual,


Hint: http://www.google.com/language_tools

Jan 31 '06 #26

P: n/a
On Tue, 31 Jan 2006 06:12:28 -0800, Manfred Kooistra wrote:
From my experience, most surfers are like me: they do not care for the
settings of their browsers, they take them as they are. You know, people
want to buy a computer, turn it on and do what was promised. Computers
here (in Germany) are marketed that way: plug in and play.
Most systems require you to specify where you are and what language you
like to use when you first install them. This may be why you don't
remember setting it. Good browsers will pick this info up from the
system.
So, I don't rely on information that may be missing or wrong. An IP
address is always there


It is always good to use the best information, but what is that? I'd bet
that a user's locale setting is more reliable than the IP->geography+guess
language idea. I have no proof and no stats, but if users have not given
the correct locale info, they will also be getting menus and message boxes
in the wrong language -- so on balance I'd trust that over anything
guessed from the IP address.

--
Ben.

Jan 31 '06 #27

P: n/a
Andreas, what is the contradiction between the relevant content of my
website being the drawings and illustrations that I present (which
makes my website useless to text browsers) and me wanting search
engines to index the text that surrounds the drawings and which may
lead people searching for "drawings" or "zeichnungen" or "dessins" etc.
to my site.

Think of a book about the "Art of Leonardo". If you cannot see the
images, the book is useless to you, because what good is it to read
about artworks that you cannot see? But the title and other text of the
book are still used in library catalogues or online book shops to link
this art book to the keywords or keyphrases that someone looking for
"italian art" or "art and engineering" may search for. Without any
text, the book will never be found by anyone anywhere. And now imagine
this book having text in several languages, just like the books about
popular artists or obscure porn that Taschen publishes worldwide.

So where is the contradiction?

Jan 31 '06 #28

P: n/a
Nick and Jukka, I really don't understand why you are so aggressive
about all of this. I am asking a question (that most of the people in
this thread ignore and do not answer), and I show some ignorance with
regard to IP addresses and the accept language header. So what?
Enlighten me. I am here because I am willing to listen, so you need not
beat me up verbally.

Jan 31 '06 #29

P: n/a
On Tue, 31 Jan 2006, Manfred Kooistra wrote:
So, I don't rely on information that may be missing or wrong. An IP
address is always there, and in most cases it points to a country of
origin, and in most countries one language is understood by a vast
majority. For example, most Canadians understand English.
Yes, I really believe you *would* do that! A Quebecois visits with
their browser set to say that they prefer Canadian French, they would
be willing to accept generic French, but don't really care for English
at all - and you would send them the English version in preference to
the French version.

Expect a couple of heavies to visit from the Quebecois language
police: as a French colleague of mine said to us, in quite an ominous
voice, when we were discussing this issue: "In matters of the French
language, *NEVER* tangle with a Quebecois". He should know!
and a visitor from Quebec can then switch to French and set
a cookie for future visits.


Here we go again. Their browser already *has* a purpose-designed
solution for this; web servers already *have* carefully-designed
negotiation mechanisms implemented. It already works *for every site
that chooses to use it* - But no, you insist on designing a square
wheel, which only turns in this particular fashion on *your* site, so
the user needs something different on the next multilingual site that
they visit, and different again on the next one, that is too stubborn
to use the already-implemented mechanisms. Is it any wonder that Nick
gets completely exasperated at your attitude to this? I'm feeling
much the same.

If you want to do *anything* positive for your users, then teach them
how to use their browser's language selection configuration. The web
will be an incrementally better place if you can succeed.
Jan 31 '06 #30

P: n/a
Alan, I think I understand. But let me explain.

I am German. I speak and read three foreign languages and can negotiate
the web in a handful of others. Since school I have understood that
people in the world do not know German. Everything that I was
interested in (rock music, science fiction, science and, later, the
internet) was in English. I have American friends, I have lived in
Russia, and I go to France or Sweden to hike. For me, meeting people
and their ideas often means that I have to communicate in a language
not my own.

So, for me it is VERY hard to understand how people can get upset if
they have to read something in a language that is not their own but
which they know. For me, that is how the internet is: it is not in my
mother tongue.

I agree with the technical aspects of what you and Nick tell me, and I
agree that it is certainly better to serve my site in the language of
the visitor, if I have that information. I did not know about the
accept language header, and I am glad that you told me and will take
the time to learn about it and implement it. But I cannot for the life
of me follow your emotions regarding language.

I come from a country where until recently popular music by local bands
was sung in a foreign language.

Feb 1 '06 #31

P: n/a
Gazing into my crystal ball I observed "Manfred Kooistra"
<ma**************@gmx.de> writing in news:1138738357.948649.41170
@g49g2000cwa.googlegroups.com:
Nick and Jukka, I really don't understand why you are so aggressive
about all of this. I am asking a question (that most of the people in
this thread ignore and do not answer), and I show some ignorance with
regard to IP addresses and the accept language header. So what?
Enlighten me. I am here because I am willing to listen, so you need not
beat me up verbally.


Let's put it this way:

Say I'm at a hotel in Moscow, and the IP address you receive shows
Russia. Fine. But I'm not Russian, I'm on holiday and I'm German. The
settings on my laptop show German as my preferred language, and that is
the header that is sent to your server. Now why would you want to serve
me Russian when I am asking for German?

In other words, don't try to reinvent the wheel. Look for the
HTTP_ACCEPT_LANGUAGE header and serve that. That should always be
present, and should always be accurate because the user chose it.

--
Adrienne Boswell
http://www.cavalcade-of-coding.info
Please respond to the group so others can share
Feb 1 '06 #32

P: n/a
On Wed, 1 Feb 2006, Adrienne Boswell wrote:
In other words, don't try to reinvent the wheel. Look for the
HTTP_ACCEPT_LANGUAGE header and serve that. That should always be
present,
Well, that statement isn't quite right: if the client has no language
preferences, the header may be absent. But Apache's own content
negotiation handles that just fine. There should be no need to
re-invent content negotiation (probably badly) in CGI or PHP or
whatever it was that you had in mind.
and should always be accurate because the user chose it.


As already noted: most users, in practice, get what their installation
preconfigured for them (which may be the language that they installed
the OS in); they might need to be made aware that this is something
they can choose. But making them aware of the choice is surely a
better thing to do for them than to present them with some
non-standard works-on-one-site-only feature - and, probably, yet
more cookies to be stored.

For all its little difficulties, I still say that doing it by the
proper negotiation mechanism is the right thing to do.
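
Purely to illustrate the "header may be absent" case, and not as a substitute for the server's own negotiation, here is a minimal Python sketch. It assumes a CGI/WSGI-style `environ` dictionary in which the request header appears under the key `HTTP_ACCEPT_LANGUAGE`; the function name is made up for this example.

```python
def requested_languages(environ):
    """Return the client's language ranges, most preferred first.

    An empty list means no Accept-Language header was sent, i.e. the client
    expressed no preference at all. Ranges refused with q=0 are left out.
    """
    header = environ.get("HTTP_ACCEPT_LANGUAGE", "")
    prefs = []
    for position, item in enumerate(header.split(",")):
        lang, _, params = item.strip().partition(";")
        if not lang.strip():
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if q > 0:
            # Sort by descending q, then by position in the header.
            prefs.append((-q, position, lang.strip().lower()))
    return [lang for _, _, lang in sorted(prefs)]
```

The caller has to decide what an empty list means; serving the site's default version together with explicit links to the other languages is one reasonable reading.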
Feb 1 '06 #33

P: n/a
in comp.infosystems.www.authoring.html, Alan J. Flavell wrote:
On Wed, 1 Feb 2006, Adrienne Boswell wrote:
In other words, don't try to reinvent the wheel. Look for the
HTTP_ACCEPT_LANGUAGE header and serve that. That should always be
present,


Well, that statement isn't quite right: if the client has no language
preferences, the header may be absent. But Apache's own content
negotiation handles that just fine. There should be no need to
re-invent content negotiation (probably badly) in CGI or PHP or
whatever it was that you had in mind.


I have been thinking about making a table of contents that mixes the
languages that are accepted with more than q=0. So if someone comes to my
page, accepting fi and en with q=1 and 0.9, I will serve them a table of
contents that mixes languages, so that for example all my hiking stuff,
in English and Finnish, is in the same place. Then I put links to the rest
of the languages at the bottom of the page.

Of course, I don't have that much stuff on my pages yet, so not much
difference... But if Jukka would use the same, I would find stuff a lot
easier, as I wouldn't need to read both the English and Finnish TOC ;-)

I think that people knowing more than one language is very common, and
that it would really make sense to rethink things with this in mind.

I have no idea how SEs would like this. Does anyone know what accept
language they send, or none? If it is none, then it is easy, as I can
just put everything together, as if it understood all languages...
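
The mixed table of contents described above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the entry format (title, language code), the function names, and the choice to treat a missing header as "accepts everything". The wildcard range `*` is not handled.

```python
def accepted_languages(header):
    """Primary languages the client accepts with q > 0.

    Returns None when no header was sent, meaning "no stated preference".
    """
    if not header or not header.strip():
        return None
    accepted = set()
    for item in header.split(","):
        lang, _, params = item.strip().partition(";")
        params = params.strip()
        q = 1.0
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if lang.strip() and q > 0:
            # Keep only the primary subtag, so "en-GB" counts as "en".
            accepted.add(lang.strip().lower().split("-")[0])
    return accepted


def mixed_toc(entries, header):
    """Split (title, language) entries into (shown together, linked at bottom)."""
    accepted = accepted_languages(header)
    shown = [e for e in entries if accepted is None or e[1] in accepted]
    rest = [e for e in entries if e not in shown]
    return shown, rest
```

A client sending `fi,en;q=0.9` would get the Finnish and English entries in one list, with everything else offered as links at the bottom; a client sending no header (which may or may not be what search engine robots do) would get all entries together.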
--
Lauri Raittila <http://www.iki.fi/lr> <http://www.iki.fi/zwak/fonts>
Feb 1 '06 #34

P: n/a
Ben wrote:
Most systems require you to specify where you are and what language you
like to use when you first install them. This may be why you don't
remember setting it. Good browsers will pick this info up from the
system.


What if my system is German but my browser is English? I usually
download new versions regularly, and the newest versions are usually
English. So does this browser, which confronts me with English menus,
send the setting of my system or does it send its own language?

Because the strange thing is: I NEVER get German pages, wherever I go.
And they do exist on some websites (like official EU sites, the UNESCO
website etc.). Does that mean these guys are too lazy or too stupid to
read my language accept header?

Feb 1 '06 #35

P: n/a
On Wed, 1 Feb 2006 11:01:19 +0000, "Alan J. Flavell"
<fl*****@physics.gla.ac.uk> wrote:
On Wed, 1 Feb 2006, Adrienne Boswell wrote:
and should always be accurate because the user chose it.


As already noted: most users, in practice, get what their installation
preconfigured for them (which may be the language that they installed
the OS in); they might need to be made aware that this is something
they can choose.


One thing I did notice about the otherwise awful UI that was IE7 was
on first load it prompted you for your language preferences.

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

Feb 1 '06 #36

P: n/a
On 1 Feb 2006, Manfred Kooistra wrote:
Because the strange thing is: I NEVER get German pages, wherever I go.
And they do exist on some websites (like official EU sites, the UNESCO
website etc.). Does that mean these guys are too lazy or too stupid to
read my language accept header?


Perhaps. An example page that works is
http://www.google.com/webhp
Set your preferred language to "sa" and you get the page in
Sanskrit (though still a mess). You might need to remove or disable
a Google cookie first.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Feb 1 '06 #37

P: n/a
On 31 Jan 2006, Manfred Kooistra wrote:
So where is the contradiction?


Search engine bots *are* text browsers.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Feb 1 '06 #38

P: n/a
On Wed, 01 Feb 2006 05:29:01 -0800, Manfred Kooistra wrote:
Ben wrote:
Most systems require you to specify where you are and what language you
like to use when you first install them. This may be why you don't
remember setting it. Good browsers will pick this info up from the
system.


What if my system is German but my browser is English? I usually download
new versions regularly, and the newest versions are usually English. So
does this browser, which confronts me with English menus, send the
setting of my system or does it send its own language?

Because the strange thing is: I NEVER get German pages, wherever I go.


Indeed. I will have to do a rare thing for Usenet (though sadly not so
rare for me): admit I am quite wrong!!

Alan J Flavell posted a message that, during an experiment, he saw lots of
HTTP headers asking for en-US (and nothing else). This got me thinking:
are there really so many "pre-configured" systems out there? Surely mine
is set up how I want? So I went to see what my browsers are asking for.
I got this:

lynx & links: "en"
epiphany & wget: no Accept-Language header
firefox: "en-us,en;q=0.5"

As far as I can tell, my locale information is correctly set up for GB
English but this fact is ignored by at least some of the browsers I use
(the console browsers *may* be getting the "en" from the "en_GB" locale).
Is this also what happens on Windows machines and other OSes? Does anyone
know why this seemingly obvious source of reliable information is being
ignored?
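
For what it's worth, deriving a header value from a POSIX-style locale name is not hard. The following Python sketch shows one way it could be done; it is a guess at a sensible mapping, not a description of what any of the browsers listed above actually do, and the function name is invented here.

```python
def locale_to_accept_language(locale_name):
    """Turn a POSIX locale name such as "en_GB.UTF-8" into a header value.

    Returns None for "C", "POSIX" or an empty name, which carry no language.
    """
    # Drop the codeset (".UTF-8") and any modifier ("@euro").
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    if base in ("", "C", "POSIX"):
        return None
    language, _, territory = base.partition("_")
    language = language.lower()
    if not territory:
        return language
    # Offer the regional variant first, then the generic language as fallback.
    return f"{language}-{territory.lower()},{language};q=0.5"
```

For `en_GB.UTF-8` this yields `en-gb,en;q=0.5`, which names the regional variant but still accepts generic English.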

NOTE: I am not now advocating IP->language mapping. As an erstwhile
protocol designer, I'd say this is using the wrong information from the
wrong level and is full of pitfalls as already pointed out (another one is
scaling it to IP v6).

[Aside to Alan J F: just in case you spotted this gross error and decided
to hold back from correcting another of my posts, please fear not -- I
took no offence last time and would not have done so this time.]

--
Ben.

Feb 1 '06 #39

P: n/a
On Wed, 1 Feb 2006, Ben Bacarisse wrote:
On Wed, 01 Feb 2006 05:29:01 -0800, Manfred Kooistra wrote:
Because the strange thing is: I NEVER get German pages, wherever I
go.

If your language selection is appropriately set, you can expect to see
German versions (where available) on the Apache httpd web site,
httpd.apache.org, for a start; zum Beispiel
http://httpd.apache.org/docs/2.0/ - the server then says in its HTTP
response headers:

| Date: Wed, 01 Feb 2006 17:34:04 GMT
| Server: Apache/2.2.0 (Unix)
| Content-Location: index.html.de
| Vary: negotiate,accept-language,accept-charset

I just moved "de" to the top of my preferences list and revisited
Google, and, lo and behold, it too was then in German. (I might need
to add that my default cookie policy is rejection, since Andreas
thinks they could take a cookie selection into account too).
Indeed. I will have to do a rare thing for Usenet (though sadly not
so rare for me): admit I am quite wrong!!
Good for you. :-)
Alan J Flavell posted a message that, during an experiment, he saw
lots of HTTP headers asking for en-US (and nothing else).


There's more detail about the topic, including my short study of
Accept-language headers logged on our server, on my page
http://ppewww.ph.gla.ac.uk/~flavell/www/lang-neg.html

regards
Feb 1 '06 #40

P: n/a
in comp.infosystems.www.authoring.html, Manfred Kooistra wrote:
Ben wrote:
Most systems require you to specify where you are and what language you
like to use when you first install them. This may be why you don't
remember setting it. Good browsers will pick this info up from the
system.
What if my system is German but my browser is English? I usually
download new versions regularly, and the newest versions are usually
English. So does this browser, which confronts me with English menus,
send the setting of my system or does it send its own language?


You can set it in all browsers except some on Mac OS X. What the default is
depends on lots of things, but usually mostly on the language of the browser.
Because the strange thing is: I NEVER get German pages, wherever I go.
And they do exist on some websites (like official EU sites, the UNESCO
website etc.). Does that mean these guys are too lazy or too stupid to
read my language accept header?


Likely.

--
Lauri Raittila <http://www.iki.fi/lr> <http://www.iki.fi/zwak/fonts>
Feb 1 '06 #41

P: n/a
in comp.infosystems.www.authoring.html, Jim Ley wrote:
On Wed, 1 Feb 2006 11:01:19 +0000, "Alan J. Flavell"
<fl*****@physics.gla.ac.uk> wrote:
On Wed, 1 Feb 2006, Adrienne Boswell wrote:
and should always be accurate because the user chose it.


As already noted: most users, in practice, get what their installation
preconfigured for them (which may be the language that they installed
the OS in); they might need to be made aware that this is something
they can choose.


One thing I did notice about the otherwise awful UI that was IE7 was
on first load it prompted you for your language preferences.


Was it possible to set multiple values, or was it a dumbed-down version
that works for people who speak no more than 2 languages?

--
Lauri Raittila <http://www.iki.fi/lr> <http://www.iki.fi/zwak/fonts>
Feb 1 '06 #42

P: n/a
Alan J. Flavell wrote:
As already noted: most users, in practice, get what their installation
preconfigured for them (which may be the language that they installed
the OS in); they might need to be made aware that this is something
they can choose.


If they can choose! Lots of Internet users are using computers they
can't customize: shared computers, or corporate configurations.

As someone else said elsewhere in the thread, if you're using
content negotiation techniques, it is anyway highly recommended to offer
the user proper means to access all the available translations.

For my part, I really think content negotiation should take precedence
only at the first visit. Lots of browsers aren't configured as their
users would expect them to be, but even if a header corresponding to
the user's general preferences is sent, their language preferences could
vary for one or more sites. For example, because translations from
English to French are generally very bad, or made by non-technical people,
I want to read technical support information in its original version.
And of course there are all the update issues: you can never say that all
the language versions of your web site are identical and up to date.

A very good model, I think, is the Wikipedia websites: a generic homepage
in English, all the pointers you need to access the other languages, a
distinct URL for each language (for example fr.wikipedia.com) for future
direct access to the localized homepage, and on each document all the
direct links to the existing translations of the document.
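
The "distinct URL per language plus links to every translation" model is easy to generate mechanically. Here is a minimal Python sketch, assuming a hypothetical layout with one directory per language (`/de/...`, `/en/...`, `/fr/...`); the language list, paths and function name are made up for illustration.

```python
from html import escape

# Hypothetical site layout: one directory per language, e.g. /de/drawings.html
LANGUAGES = {"de": "Deutsch", "en": "English", "fr": "Français"}


def translation_links(page, current, languages=LANGUAGES):
    """Return HTML links to every other language version of `page`."""
    links = []
    for code, name in languages.items():
        if code == current:
            continue  # no link to the version the reader is already on
        href = f"/{code}/{escape(page, quote=True)}"
        links.append(
            f'<a href="{href}" hreflang="{code}" lang="{code}">{escape(name)}</a>'
        )
    return " | ".join(links)
```

Because each version has its own address and each page links to its siblings with plain `<a>` elements, both readers and robots that simply follow links can reach every language version without any negotiation, cookies or form posts.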
Feb 2 '06 #43

P: n/a
On Thu, 2 Feb 2006, Pierre Goiffon wrote:
Alan J. Flavell wrote:
As already noted: most users, in practice, get what their
installation preconfigured for them (which may be the language
that they installed the OS in); they might need to be made aware
that this is something they can choose.
If they can choose! Lots of Internet users are using computers they
can't customize: shared computers, or corporate configurations.


But that's no reason to go implementing some one-site-only botch, that
only works via cookies (which are not going to work very well either,
on the model of "shared computers or corporate configurations" which
you seem to have in mind). It *is* a good reason to follow the advice
that I've already been offering on my page
http://ppewww.ph.gla.ac.uk/~flavell/www/lang-neg.html :-

||There are numerous reasons why server-driven language negotiation
||should not be the only selection mechanism available to users. [...]

In truth, our "shared computers" in the office use my personal
browsing profile, kept on the file space assigned to my logon user,
and my settings (language choices, history stack, any cookies which I
might accept - though there are very few of those) are personal to me,
and do not interfere in any way with other users of the same computer.
For my part, I really think content negotiation should take
precedence only at the first visit.
Since a properly-defended "shared" or "corporate" browser (to use your
terminology, and making some assumptions about the kind of
configuration you had in mind) is going to discard all cookies,
session information, history etc. when the user leaves, it follows
that every "visit" counts as a "first visit" for such users. The same
goes for internet cafes, etc. etc.
Lots of browsers aren't configured as their users would expect them
to be, but even if a header corresponding to the user's general
preferences is sent, their language preferences could vary for one
or more sites.
As my page already says:

|| Users will likely, on occasion, wish to refer to some other
|| language version than the one which their customary language
|| preferences would indicate, and this may change on a page by page
|| basis.

This is still no reason for not using language negotiation. It *is* a
good reason for offering other navigation methods, alongside language
negotiation. But I already said that.
For example, because translations from English to French are
generally very bad, or made by non-technical people, I want to read
technical support information in its original version. [...]


That's what the source quality value (qs=) is for.

You haven't said anything which seems to indicate that "language
negotiation is bad". You've only confirmed what we already knew, that
"language negotiation shouldn't be the only navigation route offered",
and you've pointed out a few issues that mean if it's done at all, it
needs to be done properly. Apache's content negotiation already forms
the basis of doing it properly. Every botch that I've seen yet in CGI
or PHP scripts has had at least one gaping error in it.

h t h
Feb 2 '06 #44

P: n/a
JRS: In article <Xn***************************@69.28.186.121>, dated
Wed, 1 Feb 2006 09:10:11 remote, seen in
news:comp.infosystems.www.authoring.html,
Adrienne Boswell <ar********@sbcglobal.net> posted :

In other words, don't try to reinvent the wheel. Look for the
HTTP_ACCEPT_LANGUAGE header and serve that. That should always be
present, and should always be accurate because the user chose it.


It's wrong to assume that the installer and the user are the same
person, or prefer the same language; or that the installer did it right.

Related example : In our local library, which is in England,
IE/javascript is installed to show American dates. Many of the users
are Korean. Koreans may prefer ISO 8601 dates (Japanese probably do).
There's no obvious indication that an individual user can indicate a
preference, either for those dates or for HTTP_ACCEPT_LANGUAGE.

--
John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4
<URL:http://www.jibbering.com/faq/> JL/RC: FAQ of news:comp.lang.javascript
<URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
Feb 2 '06 #45

P: n/a
On Thu, 2 Feb 2006, Dr John Stockton wrote:
It's wrong to assume that the installer and the user are the same
person, or prefer the same language; or that the installer did it right.


It's equally wrong (and DAMNED ANNOYING) when authors always assume
that they know better what the user wants, than what the user is
telling them they want. It starts with font size and, evidently,
doesn't end with language preferences.
Feb 2 '06 #46

P: n/a
Alan, after reading your article and the relevant apache docs, I have
been trying to redirect visitors of my home directory to different
subdirectories, depending on the accept language header sent by the
browser. For this I wrote a .htaccess file with the following content:
RewriteEngine On

RewriteCond %{HTTP_ACCEPT_LANGUAGE} ^de [NC]
RewriteRule ^/ http://www.domain.com/de/ [L,R=301]

RewriteCond %{HTTP_ACCEPT_LANGUAGE} ^en [NC]
RewriteRule ^/ http://www.domain.com/en/ [L,R=301]

RewriteRule ^/ http://www.domain.com/no/ [L,R=301]
But when I go to www.domain.com (with my browser set to "de" or "en" -
I checked that this is indeed transferred), I always end up in the
.../no directory. Does RewriteCond not understand the
HTTP_ACCEPT_LANGUAGE variable, or did I make some mistake?

All of this is happening on Apache 1.3.33

(I understand the approach with AddLanguage and different files with
language code extensions, e.g. index.php.de and index.php.en, but I
would prefer to direct to subdirectories instead.)

Feb 2 '06 #47

P: n/a
Alan J. Flavell wrote:
You haven't said anything which seems to indicate that "language
negotiation is bad". You've only confirmed what we already knew, that
"language negotiation shouldn't be the only navigation route offered"


So please let me rephrase.

This might be a too quickly made generalization, but I really think most
people know nothing about the language preferences in their
browsers. OK, this is not a reason not to use the accept-language (you
can educate users), but I mean, content negotiation is a choice made
server side without asking the user anything. And she/he could feel
very bad about this: "what the hell?!!".

Maybe language negotiation is a good tool from time to time, but for the
reason explained it shocks me somehow. A global homepage with links to
localized versions is simple, and I think simplicity is good.
Feb 3 '06 #48

P: n/a
On Fri, 3 Feb 2006, Pierre Goiffon wrote:
So please let me rephrase.

This might be a too quickly made generalization, but I really think most people
know nothing about the language preferences in their browsers.
This argument just keeps going round and round. To quote from my
already-cited page
http://ppewww.ph.gla.ac.uk/~flavell/www/lang-neg.html
___
/
Authors are often seen arguing that it is pointless to apply language
negotiation, because most people have no idea how to configure their
browsers to use it properly.

They are then inclined to respond to this belief (which may very well
be true) by designing their own weird and wonderful language selection
mechanism, exclusively for their own site, based typically on some
kind of user dialogue leading to the setting of a cookie, or even on
making guesses based on the user's IP address or domain name.

I would suggest that this is perverse in several important respects[..]
\___
OK, this is not a reason not to use the accept-language (you can
educate users),
Just so...
but I mean, content negotiation is a choice made
server side without asking the user anything.
As far as the WWW protocol is concerned, the user /has/ asked for the
language, just as clearly as they asked for the URL. It would be
distinctly rude for an author to override what they had asked for,
just as clearly as it would be rude to ignore the URL that they asked
for and send them a different URL that you "guessed" (on the basis of
some tenuous heuristics) would better fit their needs.

I hear what you say about users who don't yet know how to use their
browsers, but I repeat, the only constructive solution I can see to
that is to teach them. Any other approach just leads to additional
fragmentation of the web, which is bad.

[...] Maybe language negotiation is a good tool from time to time, but for
the reason explained it shocks me somehow. A global homepage with
links to localized versions is simple, and I think simplicity is
good.


___
/
For this and other reasons, it's recommended that pages that are
available to readers in more than one language variant, should offer
explicit links to the other languages.
\___

I seem to have a prepared answer to quote from my page, to each
objection that's been raised so far. Please can I have a new
objection - one that isn't already dealt with on the existing page?
Thanks.

Feb 3 '06 #49

P: n/a
On Thu, 2 Feb 2006, Manfred Kooistra wrote:
Alan, after reading your article and the relevant apache docs, I have
been trying to redirect visitors of my home directory to different
subdirectories, depending on the accept language header sent by the
browser.
MultiViews already works well, but it doesn't care for that kind of
structure.
For this I wrote a .htaccess file with the following content: [...]
RewriteCond %{HTTP_ACCEPT_LANGUAGE} ^de [NC]
RewriteRule ^/ http://www.domain.com/de/ [L,R=301]
When I said before that:

||Apache's content negotiation already forms the basis of doing it
||properly. Every botch that I've seen yet in CGI or PHP scripts has
||had at least one gaping error in it.

- I should have included mod_rewrite in that list!!!

Here's what I believe to be a protocol-correct languages preference
header:

Accept-language: de;q=0,fr-CA,fr;q=0.5

That says the reader prefers Canadian French, failing which they'll
accept generic French, but under no circumstances do they want German.

It's left as an exercise to the student to work out what your rewrite
recipe will do for them.
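
For readers who would rather check the exercise than work it out, here is a small Python sketch. It only mimics the condition `^de [NC]` as a case-insensitive test on the start of the raw header text, leaving aside whether the RewriteRule pattern itself matches in a given setup; the helper name `acceptable` is invented for this example.

```python
import re

HEADER = "de;q=0,fr-CA,fr;q=0.5"

# What a condition like `^de [NC]` tests: does the raw header text start
# with "de", ignoring case? The q-values are never looked at.
prefix_test_says_german = re.match(r"de", HEADER, re.IGNORECASE) is not None


def acceptable(header):
    """Return {language_range: q} for ranges the client has not refused (q > 0)."""
    result = {}
    for item in header.split(","):
        lang, _, params = item.strip().partition(";")
        params = params.strip()
        q = 1.0
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if lang.strip() and q > 0:
            result[lang.strip().lower()] = q
    return result
```

The prefix test is satisfied, so the condition would pick German, while parsing the same header shows that German is the one language this reader has explicitly refused and that `fr-ca` (q=1) and `fr` (q=0.5) are what they asked for.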
(I understand the approach with AddLanguage and different files with
language code extensions, e.g. index.php.de and index.php.en, but I
would prefer to direct to subdirectories instead.)


I *think* you could implement that by means of a type-map file; or you
could create a bundle of symlinks in a collective subdirectory,
pointing to the various language-specific subdirectories; but, if you
do that, take care that relative URLs get resolved correctly. I
haven't actually tried either approach in practice myself, so YMMV.

MultiViews actually works pretty well for most kinds of requirements;
there were a few little glitches in Apache 1.3 versions - in as much
as a pedantic interpretation of the negotiation rules could result in
some surprises in practice, such as all those USA users who would be
told there's nothing available for them (because they refused generic
English), but my page shows ways to get around that. I think in
Apache 2.0 the algorithms have been tweaked a bit, to get better
practical results without needing workarounds.

good luck
Feb 3 '06 #50
