How do search engines index multilingual content?

I am building a website with identical content in four different
languages. On a first visit, the site determines the language
of the content by the IP address of the visitor. What the user sees is
content in only one language at a time. He or she can then switch to
another language and set this as the preferred language, but again he
or she sees content in only this one other language.

The question now is: How do I get search engines to index ALL of the
content, in all languages?

Should I include the non-displayed content in DIVs with display set to
"none" (like we used to include complete websites in the noframes tag)?
Or do search engines ignore invisible DIVs?

Or can I somehow detect that a search engine is visiting and deliver a
page with the complete content of all four languages in it? Or would
that get me banned?

Or do I have to rely on the search engine following the local links to
the pages in the other languages? This might be a problem, because the
varying content is always displayed on the same page, so the URI stays
the same, and only one parameter changes, thus:
"content.php?language=oneoffourlanguages". In fact it might even be
impossible, because I do not want to transfer the language information
through the URL via GET, but want to send it through a form via POST.
So the URI is exactly the same for all languages (at least in the
version I am aiming at).

If you have solved this problem on your website or know how to go about
it, I'd be grateful for some help.

Jan 29 '06
64 Replies


Alan J. Flavell wrote:
Here's what I believe to be a protocol-correct languages preference
header:

Accept-language: de;q=0,fr-CA,fr;q=0.5


Thank you for pointing this out. Actually my RewriteCond has a more
elaborate regular expression in it, but I did not want to post it here,
as I was asking about HTTP_ACCEPT_LANGUAGE being supported here.
Anyway, I am glad you told me, because I hadn't thought about it, since
my browsers do not transmit q=0. All languages that I define as wanted
have a positive value or, in the case of the first language, no
value (meaning, I guess, q=1). What sense does q=0 make anyway? It would
mean that users had to define every single language they do not
understand and do not want to be served - which I wouldn't know how to
do in any of my browsers, nor does it make much sense, because it might
leave a user with no page at all, instead of one he cannot read but
which might offer visual information or links that might be more useful
than nothing. For example, I often browse Japanese or Korean pages for
the images displayed there, although I cannot read the surrounding
text.

I have to read up on multiviews and symlinks again. Thanks for those
ideas, also, although I seem to remember having read that multiviews
won't work on my Apache version (1.3.33), but I may be mistaken.
Anyway, that was the reason I switched to using mod_rewrite.
But again: Is HTTP_ACCEPT_LANGUAGE supported as an argument for
RewriteCond as in my example in my last post, or does Apache simply not
understand it here?

At this moment, I am doing everything that I want with a PHP script,
which reads the accepted language from the browser's headers and returns
a permanent 301 redirect to the proper subdirectory. (The routine in
PHP actually evaluates all the languages and qualities in the accept
language header, matches them to a qualified hierarchical list of
available languages, and decides which one to return.) I would prefer
to use .htaccess for this, as this seems much quicker to me, but if it
won't work, I'll stick with PHP.
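
The routine described above (read every language and quality from the header, then match them against the available languages) can be sketched roughly as follows. This is a Python illustration, not the PHP script from this post, which is not shown in the thread. The names parse_accept_language and pick_language, the AVAILABLE list, and the "en" default are assumptions made for the example, and the "*" wildcard is deliberately left out.

```python
def parse_accept_language(header):
    """Return a list of (language_range, q) pairs, highest q first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        q = 1.0  # a range without an explicit q counts as q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q as "not acceptable"
        ranges.append((tag, q, position))
    # Sort by q descending; ties keep the order in which they were sent.
    ranges.sort(key=lambda r: (-r[1], r[2]))
    return [(tag, q) for tag, q, _ in ranges]


def pick_language(header, available, default):
    """Pick the first acceptable range that matches an available language."""
    for tag, q in parse_accept_language(header):
        if q <= 0 or tag == "*":
            continue  # q=0 means "do not send"; the wildcard is not handled here
        for lang in available:
            # Range "fr" matches "fr" itself and any "fr-XX" variant on offer.
            # Letting a request for "fr-ca" fall back to a plain "fr" page is
            # a leniency of this sketch, not strict RFC 2616 prefix matching.
            if (lang == tag or lang.startswith(tag + "-")
                    or tag.startswith(lang + "-")):
                return lang
    return default


AVAILABLE = ["de", "en", "fr", "nl"]  # hypothetical list of site languages
print(pick_language("en-gb,en;q=0.7,de;q=0.3", AVAILABLE, "en"))  # en
print(pick_language("de;q=0,fr-CA,fr;q=0.5", AVAILABLE, "en"))    # fr
```

Sorting by quality first and only then matching keeps the decision independent of the order in which the browser happens to list its languages.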

Feb 3 '06 #51

Alan J. Flavell wrote:
So please let me rephrase.

This might be too hasty a generalization, but I really think most people
don't know anything about the language preferences in their browsers.
This argument just keeps going round and round.


You're leaving out my second point: I go to support.a_company.com and
a French page is automatically sent to me, but I wanted to see the
international one. For example, a_company is a software vendor, I'm
using the original English-language version of their software, and I
need a fix that applies only to this original version.

Of course, language and national matters are mixed here: the French
entity of a_company maintains the French section of the website, and the
English one is maintained at the a_company headquarters.

Going back to the user: OK, I'll just have to click on the English or
"international" link, but I really won't welcome a choice that was
the opposite of what I wanted to get.

You were talking about q= in the Accept-Language header, which is exactly
about that... But this preference is global, and it should be customized
site by site - indeed for each different kind of content.
I seem to have a prepared answer to quote from my page, to each
objection that's been raised so far.


I'll take time this weekend to read your page - don't answer me
before then, as I'll check something on the other point. Anyway, thanks for
taking the time to quote fragments from your document in your response!

By the way, a big thanks for the resources you brought online; I
read with great interest all your very interesting documents about
character encodings!
Feb 3 '06 #52

On Fri, 3 Feb 2006, Pierre Goiffon wrote:
You're leaving out my second point: I go to support.a_company.com
and a French page is automatically sent to me, but I wanted to see
the international one. For example, a_company is a software vendor,
I'm using the original English-language version of their software,
and I need a fix that applies only to this original version. [...]
Going back to the user: OK, I'll just have to click on the English or
"international" link, but I really won't welcome a choice that
was the opposite of what I wanted to get.
You will get what (according to the protocol) best represents what you
asked for. It's a kind of negotiation, after all, which depends on
choices that have been made both by you and by your provider.
You were talking about q= in the Accept-Language header, which is
exactly about that... But this preference is global, and it should
be customized site by site - indeed for each different kind of
content.
There is nothing in the protocol which prevents a user from doing
that! Quite a number of existing browser options can be configured
differently for different sites: if enough users asked for it, there
is no reason that a browser could not implement per-site language
preferences as a browser option.

But first people are complaining that nobody knows how to configure an
existing and rather simple feature of their browser, and that some
users aren't allowed to configure it; and then you ask for a more
elaborate configuration scheme to be implemented. This seems a bit
paradoxical, you know...
By the way, a big thanks for the resources you brought online; I
read with great interest all your very interesting documents
about character encodings!


You will probably have seen the quote, from someone who had read one
of my pages in that area:

|| I have just saved it to my bookmarks, it is so rich in
|| lessons... and in lost illusions

;-)

best
Feb 3 '06 #53

On Fri, 3 Feb 2006, Manfred Kooistra wrote:
Anyway, I am glad you told me, because I hadn't thought about it,
since my browsers do not transmit q=0.
It's already taken account of in the Apache negotiation algorithm,
I'm sure. This is why I'm trying to convince you that there *is no
need* for you to learn all the fine details of the negotiation RFC
and then try to implement it in some other way. It's already *built
in to the server that you use*.
What sense does q=0 make anyway?
Perhaps I should have written as my example

Accept-language: de;q=0,fr-CA,fr;q=0.5,*;q=0.1

As I understand it, that means (if French isn't available) "you can
send me any other language which you have, as long as it isn't
German". At least, I just tried something analogous[*], and got the
results that I expected.
[*] I actually tried Accept-language: *,en;q=0
which I reckon means "send me any language you have, as long
as it isn't English". The Apache site sent me German, in fact, for
those pages which I visited.
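
To make the q=0 and wildcard reading above concrete, here is a small Python sketch of how a server might assign a quality to each language it has on offer. It follows the RFC 2616 rule that the longest matching language range decides, with "*" covering only what no other range matches. It is an illustration, not Apache's actual code, and the names quality_for and choose are assumptions.

```python
def quality_for(header, language):
    """Return the q value that an Accept-Language header gives `language`.

    The longest matching language range wins; "*" applies only when no
    specific range matches; with no match at all the language gets 0.
    """
    language = language.lower()
    best_len, best_q, wildcard_q = -1, 0.0, None
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        q = 1.0  # no explicit q means q=1
        name, _, value = params.partition("=")
        if name.strip().lower() == "q":
            q = float(value)
        if tag == "*":
            wildcard_q = q
        elif language == tag or language.startswith(tag + "-"):
            if len(tag) > best_len:
                best_len, best_q = len(tag), q
    if best_len >= 0:
        return best_q
    return wildcard_q if wildcard_q is not None else 0.0


def choose(header, available):
    """Return the available language with the highest non-zero q, or None."""
    scored = [(quality_for(header, lang), lang) for lang in available]
    acceptable = [(q, lang) for q, lang in scored if q > 0]
    if not acceptable:
        return None
    return max(acceptable, key=lambda pair: pair[0])[1]


# French is not on offer, German is excluded, so the wildcard lets English in.
print(choose("de;q=0,fr-CA,fr;q=0.5,*;q=0.1", ["de", "en"]))  # en
# "Anything but English": German is picked when both are on offer.
print(choose("*,en;q=0", ["en", "de"]))  # de
```

When choose returns None, nothing on offer is acceptable to the client, which is the situation in which a server may answer with status 406.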

although I seem to remember having read that multiviews
won't work on my Apache version (1.3.33),
I don't see why not!!! It seems to me that you're accepting bogus
advice from somewhere.
but I may be mistaken.
I think so.
But again: Is HTTP_ACCEPT_LANGUAGE supported as an argument for
RewriteCond as in my example in my last post,
The important point that I was dealing with in my posting was that
this was the *wrong* solution, so I didn't go into the technical
details. I still say it's the *wrong* solution - don't be misled by
the fact that I'm now commenting on a technical detail.

But, speaking purely hypothetically, I don't see why it would not be
feasible to test that; it's certainly present in the environment when
I run a CGI script, for example. Why would it not be?

HTTP_ACCEPT_LANGUAGE = en-gb,en;q=0.7,de;q=0.3
At this moment, I am doing everything that I want with a PHP script,
which reads the accepted language from the browser's headers and
returns a permanent 301 redirect to the proper subdirectory.
That's got all the wrong properties with respect to intermediate
caches, for a start!
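
One possible reading of this caching objection, offered as an assumption because the thread never spells it out: a 301 response is cacheable by default, so a shared cache sitting between users and the site may store the redirect chosen for the first visitor and replay it to later visitors who sent a different Accept-Language header. A response that depends on that header would normally announce it with "Vary: Accept-Language". The Python sketch below only builds the status and headers for such a redirect; the function name, the host name, the "/de/" style paths, and the choice of 302 are all hypothetical.

```python
def language_redirect(lang, host="www.example.com"):
    """Build (status, headers) for a language redirect that caches can handle.

    `lang` is assumed to be already negotiated from Accept-Language;
    the host name and the "/<lang>/" URL layout are placeholders.
    """
    status = "302 Found"  # temporary: the target depends on who is asking
    headers = [
        ("Location", "http://%s/%s/" % (host, lang)),
        # Tell caches that the response depends on this request header,
        # so a copy made for one visitor is not reused for another whose
        # Accept-Language differs.
        ("Vary", "Accept-Language"),
        ("Content-Length", "0"),
    ]
    return status, headers


status, headers = language_redirect("de")
print(status)
for name, value in headers:
    print("%s: %s" % (name, value))
```

With server-side negotiation as recommended in this post, no redirect is needed at all, since the server selects the variant at the original URL.
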
(The routine in PHP actually evaluates all the languages and
qualities in the accept language header, matches them to a qualified
hierarchical list of available languages, and decides which one to
return.) I would prefer to use .htaccess for this, as this seems
much quicker to me,


I can only say that IMHO, Apache's built-in negotiation is likely to
be both quicker and more accurate than either. If you're not
satisfied with what MultiViews offers, you can set your own rules for
writing a type-map file.

good luck
Feb 3 '06 #54

On Fri, 3 Feb 2006, Alan J. Flavell wrote:
On Fri, 3 Feb 2006, Manfred Kooistra wrote:
But again: Is HTTP_ACCEPT_LANGUAGE supported as an argument for
RewriteCond as in my example in my last post,


The important point that I was dealing with in my posting was that
this was the *wrong* solution, so I didn't go into the technical
details. I still say it's the *wrong* solution - don't be misled by
the fact that I'm now commenting on a technical detail.

But, speaking purely hypothetically, I don't see why it would not be
feasible to test that; it's certainly present in the environment when
I run a CGI script, for example. Why would it not be?


Ho hum. I see that you've been raising the same question on another
group, and Andreas has helpfully read the documentation for you.

http://httpd.apache.org/docs/1.3/mod...ml#RewriteCond

Evidently mod_rewrite does not give access to the whole range of CGI
environment settings. Sorry for the above misinformation. But see
"Special note 3" in that subsection. Or rather, *don't* see special
note 3, because you are just wasting time and effort on the wrong
approach. Use the built-in negotiation, and get the right result,
instead of cobbling something up and getting anomalies from start to
finish.

Feb 3 '06 #55

JRS: In article <Pi******************************@ppepc62.ph.gla.ac.uk>,
dated Thu, 2 Feb 2006 19:20:51 remote, seen in
news:comp.infosystems.www.authoring.html, Alan J. Flavell
<fl*****@physics.gla.ac.uk> posted :
On Thu, 2 Feb 2006, Dr John Stockton wrote:
It's wrong to assume that the installer and the user are the same
person, or prefer the same language; or that the installer did it right.


It's equally wrong (and DAMNED ANNOYING) when authors always assume
that they know better what the user wants, than what the user is
telling them they want. It starts with font size and, evidently,
doesn't end with language preferences.


Indeed. The only generally-valid assumption can be that if the user
him/herself is asked to choose from comprehensible available
possibilities, and makes that choice personally after being asked, then
the user should be given the chosen possibility and ought to be willing
to accept the logical consequences.

Your systems, I suppose, are set up for British (Scottish?) preferences,
and your browser will indicate a preference for English over all foreign
languages. But if someone from the Continent phones to ask about
something that seems strange about a page served in Foreign, then you'll
want to look at it in Foreign. Of course, *you*'ll know how to set that
up as a browser preference; but few others will remember.

Information pre-configured to be sent by the browser cannot be trusted,
unless it can be established that the user's OS/browser combination has
configuration facilities which are completely obvious and easy to use.

Perhaps software should be written such that directly after compilation
all choices are explicitly undefined. The intended consequence of that
will be that systems are designed to make choosing easy and obvious.

--
John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
For news:borland.*, use their server newsgroups.borland.com ; but first read
Guidelines <URL:http://www.borland.com/newsgroups/guide.html> ff. with care.
Feb 3 '06 #56

Alan J. Flavell wrote:
I actually tried Accept-language: *,en;q=0
How did you get your browser to send q=0? (And which browser are you
setting up to do it?)
At this moment, I am doing everything that I want with a PHP script,
which reads the accepted language from the browser's headers and
returns a permanent 301 redirect to the proper subdirectory.


That's got all the wrong properties with respect to intermediate
caches, for a start!


Could you spell that out for me? I'm not sure I understand.
Ho hum. I see that you've been raising the same question on another
group, and Andreas has helpfully read the documentation for you.
:-)

Well, in fact I did read the documentation, several times - it was open
in the background all day yesterday -, but I am an Apache novice who
does not know what he is NOT reading (alternative modules or
directives), and so the text is very hard for me to make sense of:
extremely technical lingo with very few illustrative examples.

I would really wish for advanced users to be able to post moderated
comments as on php.net, where much of the more valuable information can
be found below the official content.

... But see
"Special note 3" in that subsection. Or rather, *don't* see special
note 3, ...
You might have guessed by now that my psychological markup forces me to
try everything that THE_REQUEST, mentioned in note 3, offers :-)
... because you are just wasting time and effort on the wrong
approach. Use the built-in negotiation, and get the right result ...


But seriously, why is it wrong to use mod_rewrite?

---

By the way, how do I set up groups.google.de to quote "Alan wrote ..."
(in English) instead of "Alan schrieb ..." (in German)? I can choose
all kinds of languages from Arabian to Hungarian, but English is not
among them. Isn't that strange? (Just a comment, no answers expected.)

Feb 3 '06 #57

On Fri, 3 Feb 2006, Dr John Stockton wrote:
Your systems, I suppose, are set up for British (Scottish?)
preferences,
To be honest, I don't know what they're set up for by default; on my
own account, they always seem to inherit the values which I set
before, and the original settings of /those/ are lost in the depths of
history.
and your browser will indicate a preference for English over all
foreign languages.
Well, I just checked the MSIE setting on this system, and it's sending
en-GB,de;q=0.7,en;q=0.3

That says "give me British English if you've got it, otherwise prefer
any kind of German to any other kind of English". But it's only set
that way because of an earlier test...

As we've already discussed: MSIE's *initial* setting is in flagrant
disregard of the useful advice in RFC2616 - so what's new?
But if someone from the Continent phones to ask about something that
seems strange about a page served in Foreign, then you'll want to
look at it in Foreign. Of course, *you*'ll know how to set that up
as a browser preference; but few others will remember.
Then whoever's on the Helldesk should refer the problem upwards until
it reaches someone who /does/ understand. I'm not sure what point
you're trying to make here.
Information pre-configured to be sent by the browser cannot be
trusted,
It's arguable that when a user *first* runs the browser, or otherwise
initialises a browser profile, they should be forced^Wstrongly urged
to make a choice of default text size, language preferences, and
anything else that can't easily be deduced. Alright: in Windoze I
just fired-up the Mozilla Profile Manager and tried to create a new
profile called Stockton. It offered me two buttons: "Choose Folder"
and "Region Selection..."

Naturally I took a look at "Region Selection", and found that it
defaulted the language to "English US". Looking tolerable so far -
but when I tried to investigate the options on the Region pulldown, I
found that it offered me precisely one choice: "US Region". So that's
not very friendly. Seems that, after all, it would be necessary to
visit the rather obvious language preferences dialogue /after/ the
browser has been started up.

Anyhow, I completed the profile, and then looked at the resulting
default settings, and here they are:

HTTP_ACCEPT_LANGUAGE = en-us,en;q=0.5

Which is a reasonable choice for a USAn - unlike MSIE which, as I
said, configures by US default to refuse all kinds of English other
than en-US.

You'll note that in doing this, Mozilla took no account of my
Windows locale setting which, not surprisingly, is set to
"English (United Kingdom)".
unless it can be established that the user's OS/browser combination
has configuration facilities which are completely obvious and easy
to use.
You're not trying to tell me that /any/ of the worthwhile options in
Windows are "completely obvious and easy to use", Shirley?
Perhaps software should be written such that directly after
compilation all choices are explicitly undefined.
I must agree with you that the absence of a language selection list
would be a better initial choice than what the vast majority of
readers evidently got in MSIE (I'm referring back to that web server
study that I mentioned earlier).
The intended consequence of that will be that systems are designed
to make choosing easy and obvious.


Pull the other one! Many "surfers" have no idea that they are using
MSIE as their browser, nor do they have a clue what a URL is: they
think only that they are "opening the Internet", period.

But that doesn't change the fact that there is an IETF-specified
negotiation protocol, whether they know or care about it or not. As
and when I see fit to use it in the interests of clue-endowed
readers, I refuse to be discouraged by some amorphous mass of people
who, even if smeared with clue pheromone and dumped in a field of
randy clues... (well, you know the analogy).

But I *will* go so far as to adjust my settings so that even if they
demand en-US and nothing else, I won't go sending them the ominous
Status-406 page.
Feb 3 '06 #58

On Fri, 3 Feb 2006, Manfred Kooistra wrote:
Alan J. Flavell wrote:
I actually tried Accept-language: *,en;q=0


How did you get your browser to send q=0? (And which browser are you
setting up to do it?)


For that particular test, I used Lynx, since its language selection
string can be set just as I choose. But web clients don't /have/ to
be browsers, remember. Any client can present Accept* headers and
initiate server-side negotiation if the mechanism is enabled on the
server.

Sorry, I'm running out of time for now, and your other questions would
need a lot of time to answer. But perhaps you start to grasp the idea
that you're evidently trying to re-implement language negotiation from
a position of only understanding some part of what the requirements
are, whereas Apache has been implemented by folks who really do know
what they're doing. Judging by their products, anyway.

hope this helps a bit.
Feb 3 '06 #59

JRS: In article <Pi******************************@ppepc62.ph.gla.ac.uk>,
dated Fri, 3 Feb 2006 20:45:46 remote, seen in
news:comp.infosystems.www.authoring.html, Alan J. Flavell
<fl*****@physics.gla.ac.uk> posted :
On Fri, 3 Feb 2006, Dr John Stockton wrote:
Your
That's a singular "your".
systems, I suppose, are set up for British (Scottish?)
preferences,
To be honest, I don't know what they're set up for by default; on my
own account, they always seem to inherit the values which I set
before, and the original settings of /those/ are lost in the depths of
history.
and your browser will indicate a preference for English over all
foreign languages.


Well, I just checked the MSIE setting on this system, and it's sending
en-GB,de;q=0.7,en;q=0.3

That says "give me British English if you've got it, otherwise prefer
any kind of German to any other kind of English". But it's only set
that way because of an earlier test...


So it's hardly fair to cite it. A well-designed system would keep the
delivery preferences and the user's normal preferences as well as what
he currently wants. Such are rare. I vaguely recall that the VT100
terminal had it.

As we've already discussed: MSIE's *initial* setting is in flagrant
disregard of the useful advice in RFC2616 - so what's new?
But if someone from the Continent phones to ask about something that
seems strange about a page served in Foreign, then you'll want to
look at it in Foreign. Of course, *you*'ll know how to set that up
as a browser preference; but few others will remember.
Then whoever's on the Helldesk should refer the problem upwards until
it reaches someone who /does/ understand. I'm not sure what point
you're trying to make here.


That "phones" means "phones you", singular : AJF himself.
You're not trying to tell me that /any/ of the worthwhile options in
Windows are "completely obvious and easy to use", Shirley?


Windows is not the only OS.
Perhaps software should be written such that directly after
compilation all choices are explicitly undefined. The intended
consequence of that will be that systems are designed to make
choosing easy and obvious.


Pull the other one! Many "surfers" have no idea that they are using
MSIE as their browser, nor do they have a clue what a URL is: they
think only that they are "opening the Internet", period.


Of course. I'm assuming that the software is tested after initial
compilation at the authoring establishment, and that the testers will
feed back to the coders and designers.

When, as is common, the preferences of the authors are built in or pre-
loaded, customisation for outlandish places such as Glasgow (or,
generally, anywhere outside AL..WY) will be at best moderately tested.
If the authoring establishment itself is impelled to customise, then the
authors will be driven to make a better job of getting it convenient and
obvious. That will benefit the end users.

--
John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
Feb 4 '06 #60

Dan
Pierre Goiffon wrote:
A very good model, I think, is the Wikipedia websites: a generic homepage
in English, all the pointers you need to access the other languages, a
distinct URL for each language (for example fr.wikipedia.com) for future
direct access to the localized homepage, and on each document all the
direct links to the existing translations of the document.


fr.wikipedia.org, actually. Wikipedia is a noncommercial project, and
properly reflects this in its domain name.

And the different-language Wikipedias are properly in different
subdomains rather than served from the same URI via language
negotiation because they are not simply translations of one another;
they are each separate wiki projects with many differences.

--
Dan

Feb 4 '06 #61

Thank you to everyone who has participated in this thread so far. This
is my solution to half the problem (redirect to subdirectories based on
language preferences), here with two languages:

RewriteEngine On

RewriteBase /

RewriteCond %{HTTP:Accept-Language} ^.*de.*$ [NC]
RewriteRule ^(index\.php)?$ http://www.domain.com/de/ [L,R=301]

RewriteCond %{HTTP:Accept-Language} ^.*en.*$ [NC]
RewriteRule ^(index\.php)?$ http://www.domain.com/en/ [L,R=301]

RewriteRule ^(index\.php)?$ http://www.domain.com/de/ [L,R=301]

Yes, the regular expression will be refined not to serve "de" to
someone preferring English but knowing German, nor to someone not
wanting German and stating "de; q=0". This post is about
"HTTP:Accept-Language".

I have dropped the IP to country thing as I agree that we have to
utilize the standards to make them standards. If anyone has sincere
objections to my solution, please let me know the reasons.
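
The refinement that the patterns above still need can be checked outside Apache. The Python sketch below imitates the rule chain (German pattern first, then English, then the unconditional German default) using the same regular expressions. Here re.search with re.IGNORECASE stands in for RewriteCond with [NC]; the function name and the sample headers are made up for illustration.

```python
import re

# Same patterns as the RewriteCond lines, tested in the same order.
RULES = [
    (re.compile(r"^.*de.*$", re.IGNORECASE), "/de/"),
    (re.compile(r"^.*en.*$", re.IGNORECASE), "/en/"),
]
DEFAULT_TARGET = "/de/"  # the final unconditional RewriteRule


def rewrite_target(accept_language):
    """Return the directory the rule chain would redirect to."""
    for pattern, target in RULES:
        if pattern.search(accept_language):
            return target
    return DEFAULT_TARGET


for header in ("en-US,en;q=0.5", "en,de;q=0.5", "en,de;q=0"):
    print("%-15s -> %s" % (header, rewrite_target(header)))
```

Only the first sample header reaches /en/. The other two are sent to /de/ because "de" appears somewhere in the header, even though one of them prefers English and the other refuses German with q=0.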

Feb 5 '06 #62

Thank you to everyone who has participated in this thread so far. This
is my solution to half the problem (redirect to subdirectories based on
language preferences), here with two languages:

RewriteEngine On

RewriteBase /

RewriteCond %{HTTP:Accept-Language} ^.*de.*$ [NC]
RewriteRule ^(index\.php)?$ http://www.domain.com/de/ [L,R=301]

RewriteCond %{HTTP:Accept-Language} ^.*en.*$ [NC]
RewriteRule ^(index\.php)?$ http://www.domain.com/en/ [L,R=301]

RewriteRule ^(index\.php)?$ http://www.domain.com/de/ [L,R=301]

The regular expression will be refined not to serve "de" to someone
preferring English but knowing German, nor to someone not wanting German
and stating "de; q=0".

I have dropped the IP to country thing as I agree that we have to
utilize the standards to make them standards. If anyone has sincere
objections to my solution, please let me know the reasons.

Feb 5 '06 #63

Dan wrote:
A very good model, I think, is the Wikipedia websites: a generic homepage
in English, all the pointers you need to access the other languages, a
distinct URL for each language (for example fr.wikipedia.com)
fr.wikipedia.org, actually.


Oops yes, sorry and thanks for correction.
And the different-language Wikipedias are properly in different
subdomains rather than served from the same URI via language
negotiation because they are not simply translations of one another;
they are each separate wiki projects with many differences.


This is particularly true for Wikipedia, of course. But I don't know any
multilingual website that presents exactly the same content in each
available language... There's always some difference, and that's why I
read so many web pages in English.
Feb 6 '06 #64

Alan J. Flavell wrote:
Going back to the user: OK, I'll just have to click on the English or
"international" link, but I really won't welcome a choice that
was the opposite of what I wanted to get.
You will get what (according to the protocol) best represents what you
asked for.


Once more, I don't think this will always match exactly what
the user was expecting to get.
But first people are complaining that nobody knows how to configure an
existing and rather simple feature of their browser, and that some
users aren't allowed to configure it; and then you ask for a more
elaborate configuration scheme to be implemented. This seems a bit
paradoxical, you know...


If you're referring to my posts, then there was no "first" and "then":
I tried to include these 2 points in
<43***********************@news.free.fr>

I don't see any paradox in saying that lots of people don't know anything
about language negotiation possibilities, and that even those who do know
don't find implementations corresponding to their needs. Of course, that's
something very usual in web authoring - and
By the way, a big thanks for the resources you brought online


You will probably have seen the quote, from someone who had read one
of my pages in that area:

|| I have just saved it to my bookmarks, it is so rich in
|| lessons... and in lost illusions


:)
I bookmarked your pages too! But of course there is so much to read!
Feb 6 '06 #65
