Multilingual websites and web-crawlers

I would like to have my website translated into several languages and take
advantage of language negotiation to let users choose their preferred
version of the site.

I read with great interest the invaluable information on
http://www.cs.tut.fi/~jkorpela/, http://ppewww.ph.gla.ac.uk/~flavell/www
and http://webtips.dan.info/, but I still have some questions (maybe I
overlooked the answers on these sites, though...).

First, my ISP runs an Apache server, but as far as I can see MultiViews
is not activated (I'm still waiting for confirmation), so I decided to
use the type-map method: for each page I provide a variant (.var) file
that lists the available versions, and Apache sends the appropriate one
back to the user agent. For example, cave.var points to cave.fr.html
and cave.en.html.
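
With many pages, the .var files could be generated rather than written by hand. Here is a minimal Python sketch, assuming the cave.en.html / cave.fr.html naming used in this post and the type-map layout described in the Apache content negotiation documentation. The function name build_type_map is made up for illustration.

```python
def build_type_map(base, languages, content_type="text/html"):
    """Return the text of a type-map (.var) file for one page.

    `base` is the page name without extension (e.g. "cave"); each
    language code in `languages` is assumed to have a matching file
    named <base>.<lang>.html in the same directory.
    """
    blocks = [f"URI: {base}"]
    for lang in languages:
        blocks.append(
            f"URI: {base}.{lang}.html\n"
            f"Content-type: {content_type}\n"
            f"Content-language: {lang}"
        )
    # Variant records are separated from each other by a blank line.
    return "\n\n".join(blocks) + "\n"


# Example: the text that would go into cave.var
cave_var = build_type_map("cave", ["en", "fr"])
```

Writing cave_var to a file named cave.var next to the two HTML files would give the setup described above, provided the server has a type-map handler enabled for .var files.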

Then if the user agent asks for http://server/cave.var or
http://server/cave, it gets either cave.en.html or cave.fr.html according
to its language settings. Fine.

Now my cave.en.html and cave.fr.html contain links to other pages that
are themselves translated. The href attribute of the link is then a
generic one, say href="lascaux" instead of href="lascaux.var" (and
lascaux.var in turn points to lascaux.fr.html and lascaux.en.html).

Here is my problem: let's assume my home pages are cave.fr.html and
cave.en.html. I submit these pages to a web crawler, which analyzes
them, finds a link to "lascaux", and tries to fetch a hypothetical
lascaux.html that doesn't exist. It then stops without indexing either
lascaux.fr.html or lascaux.en.html...

Am I wrong? If not, is there a workaround to tell a robot to scan all
pages even though there is no explicit reference to them in the HTML files
(apart from submitting all of them to the robot...)?

Thanks,

Vincent.

Jul 20 '05 #1
Vincent <vi************ @wanadoo.fr> writes:
> Now my cave.en.html and cave.fr.html contain links to other pages that
> are themselves translated. The href attribute of the link is then a
> generic one, say href="lascaux" instead of href="lascaux.var" (and
> lascaux.var in turn points to lascaux.fr.html and lascaux.en.html).
>
> Here is my problem: let's assume my home pages are cave.fr.html and
> cave.en.html, I submit these pages to a web-crawler that is going to
> analyze them and find a link to "lascaux" and try to scan a
> hypothetical lascaux.html that doesn't exist so that it will stop
> without indexing lascaux.fr.html nor lascaux.en.html ...


Hold on, why would a web crawler attempt to access lascaux.html on
finding a link to lascaux?

And why wouldn't lascaux->lascaux.var->lascaux.(en|fr).html work the
same for a web crawler as for any other user agent?

Only thing I can think of is if the crawler doesn't send any language
preferences, but shouldn't .var fall back to one of the others as
appropriate in that case, rather than arbitrarily redirecting to
lascaux.html?

Try it out with something like wget or telnet where it's easy to set
custom headers for testing, but I don't think there should be a problem.
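
The header test suggested above can also be scripted. This is a rough sketch using Python's standard urllib, not a definitive test tool: http://server/lascaux is the placeholder URL from this thread, the function names are made up, and which response headers come back depends on the server configuration.

```python
import urllib.request


def build_request(url, accept_language=None):
    """Build a GET request, optionally carrying an Accept-Language header.

    With accept_language=None no language preference is sent at all,
    which is roughly what a crawler without language settings would do.
    """
    headers = {}
    if accept_language is not None:
        headers["Accept-Language"] = accept_language
    return urllib.request.Request(url, headers=headers)


def fetch_variant(url, accept_language=None):
    """Return (status, Content-Language, Content-Location) of the response.

    The two header values are None if the server does not send them.
    """
    with urllib.request.urlopen(build_request(url, accept_language)) as resp:
        return (
            resp.status,
            resp.headers.get("Content-Language"),
            resp.headers.get("Content-Location"),
        )


# Usage against a real server (placeholder URL, needs network access):
#   for lang in ("fr", "en", None):
#       print(lang, fetch_variant("http://server/lascaux", lang))
```

Comparing the three results shows which variant each kind of client receives, including the one that sends no language preference.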

--
Chris
Jul 20 '05 #2
On Mon, Sep 1, Vincent inscribed on the eternal scroll:
> First, my ISP runs an Apache server but as far as I can see, MultiViews
> is not activated (I'm still waiting for some confirmation) so I decided
> to use the type-map method:
I'm a bit puzzled as to why a server should be set up that allows you
to use typemaps but prevents you from using multiviews. But maybe
it came out that way without them really thinking about it - who
knows?

You could at least try sticking-in a .htaccess file to see what
happens. Here's a clue:

1. put some complete junk into a .htaccess file temporarily, and then
try accessing one of the pages that it controls. You should get a
server error. If you don't, then clearly the server is paying no
attention to the .htaccess, and there's nothing further you can do in
this direction.

2. take out the junk, and put in

Options +MultiViews

instead. Try again. If you still get a server error, then it
suggests the server is configured to prohibit that directive in your
.htaccess. Too bad. If the server responds OK, on the other hand,
then you should be all set for MultiViews.
> I associate to each of my pages a variant
> file that directs Apache to the desired version of the page which is in
> turn sent back to the user agent, i.e. cave.var points to cave.fr.html
> and cave.en.html for example.
That's the idea, if you're using a typemap, yes.
> Then if the user agent asks for http://server/cave.var or
> http://server/cave, it gets either cave.en.html or cave.fr.html according
> to its language settings. Fine.
Well, the user agent is only reasonably going to ask for the URLs
which you nominate in your links. (Or do you mean that there are
already other sites linking to your cave.fr.html etc. URLs
explicitly?)
> Now my cave.en.html and cave.fr.html contain links to other pages that
> are themselves translated. The href attribute of the link is then a
> generic one, say href="lascaux" instead of href="lascaux.var"
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which mechanism are you using to achieve that? MultiViews would do
that for you, but you're saying you aren't using it. mod_speling would
do it for you, I think, but by redirecting "lascaux" to "lascaux.var",
so you could save a network transaction by linking directly to
lascaux.var in the first place.
> (and lascaux.var in turn points to lascaux.fr.html and
> lascaux.en.html).
Sure...
> Here is my problem: let's assume my home pages are cave.fr.html and
> cave.en.html, I submit these pages to a web-crawler that is going to
> analyze them and find a link to "lascaux" and try to scan a hypothetical
> lascaux.html that doesn't exist so that it will stop without indexing
> lascaux.fr.html nor lascaux.en.html ...


It would present the URL (which it got from your href="...", OK?) to
the server, and get whatever a browser would have got if it had
presented the same request.

Why do you suppose the indexer would append an unsolicited ".html"
to your URL? That would be improper of it!

The only issue is that the indexer might be making the request without
including Accept-language preferences. But surely you have a default
language set for such requests?
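
To illustrate the case of a request that carries no language preferences, here is a deliberately simplified Python model. It is not Apache's actual negotiation algorithm: it only does exact tag matching, and the names parse_accept_language and choose_language, as well as the idea of passing a default explicitly, are assumptions made for the sketch.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into [(tag, q)], highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header, available, default):
    """Pick a language from `available` (lowercase tags), else `default`.

    `header` may be None or empty, as with an indexer that sends no
    Accept-Language header at all.
    """
    if not header:
        return default
    for tag, q in parse_accept_language(header):
        if q > 0 and tag in available:
            return tag
    return default
```

For example, choose_language(None, {"en", "fr"}, "en") gives "en", while choose_language("fr, en;q=0.5", {"en", "fr"}, "en") gives "fr". A real server may instead answer a request whose languages match nothing with an error response rather than a default, which is why the default has to be configured deliberately.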

In any case, it's good practice to include explicit links to the other
language versions, so that readers can "switch" languages temporarily
if they want, without needing to reconfigure their browsers. So the
indexer should get to see links to all of the pages quite naturally as
it browses around your site.

good luck
Jul 20 '05 #3
Alan J. Flavell wrote:
> Options +MultiViews
This is what I tried, but I get the following message:
"Problem on restriction by .htaccess file
.htaccess file in this directory is not valid and cannot be interpreted
by the web server."

Maybe my ISP didn't set AllowOverride for the Options directive? I sent
them an e-mail, but I'm still waiting for the answer...

If I remove the Options line, everything works fine...
>> Now my cave.en.html and cave.fr.html contain links to other pages that
>> are themselves translated. The href attribute of the link is then a
>> generic one, say href="lascaux" instead of href="lascaux.var"
>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Which mechanism are you using to achieve that? MultiViews would do
> that for you, but you're saying you aren't using it


Well, I'm just using the type-map mechanism, i.e. I added to my
.htaccess file the line:
AddHandler type-map .var
and then the result is the same whether I have a link with
href="lascaux" or href="lascaux.var".
I don't know if this is the standard behaviour, but this is what I get...

After reading your explanations, I understand that I can safely use the
lascaux.var version of the link: what was unclear to me was the
behaviour of a web-crawler. From what you say, it's just another user
agent that gets the same results as a web browser would. Since the
lascaux.var version yields the correct result to my web browser, it will
also work with a robot: great.
> The only issue is that the indexer might be making the request without
> including Accept-language preferences. But surely you have a default
> language set for such requests?

This is achieved by using the DefaultLanguage directive, I guess?
> In any case, it's good practice to include explicit links to the other
> language versions, so that readers can "switch" languages temporarily
> if they want, without needing to reconfigure their browsers. So the
> indexer should get to see links to all of the pages quite naturally as
> it browses around your site.
Yes, but I have already read your site, so I knew this :-)
> good luck


Thanks

Jul 20 '05 #4
On Mon, Sep 1, Vincent inscribed on the eternal scroll:
> Well, I'm just using the type-map mechanism, i.e. I added to my
> .htaccess file the line:
> AddHandler type-map .var
> and then the result is the same whether I have a link with an
> href="lascaux" or href="lascaux.var".
> I don't know if this is the standard behaviour, but this is what I get...


Well, I think you've caught me out here - if that happens (other than
by having MultiViews enabled) then I hadn't realised it, and I don't
think it's even clear from the documentation. I'll try taking a look
at it later (it's not on your critical path, evidently) and if it's
unclear, maybe move the question to the appropriate servers group
where the Apache experts might be found...
>> The only issue is that the indexer might be making the request without
>> including Accept-language preferences. But surely you have a default
>> language set for such requests?
>
> This is achieved by using the DefaultLanguage directive I guess?


No, this *is* confusing! DefaultLanguage is a directive to declare to
the server that all documents (within the directive's scope) which
don't have an explicit language extension like .fr etc., can be
assumed to be in the specified language.

In order to control the response when the client has no language
preferences (which is what we were discussing), you could provide a
non-language-specific variant (foo.html) alongside the
language-specific variants (foo.html.en, foo.html.fr etc.).

If you know enough Unix you can use "ln -s" to symlink them together,
e.g.

ln -s foo.html.en foo.html

(for the purpose of illustration I'm assuming you want to serve
the English version out if the client has no preference), rather than
keeping duplicate files.

That would work fine with multiviews too.
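
On a site with many pages, the same symlink step could be scripted. This is only a sketch of the idea in Python, assuming a Unix-like filesystem and the foo.html.en / foo.html naming used above. The helper name link_default_variant is hypothetical.

```python
import os


def link_default_variant(directory, base, default_lang):
    """Create <directory>/<base> as a symlink to <base>.<default_lang>.

    Equivalent in spirit to running `ln -s foo.html.en foo.html` inside
    `directory`. The link target is relative, so the directory can be
    moved or uploaded as a whole.
    """
    target = f"{base}.{default_lang}"          # e.g. "foo.html.en"
    link_path = os.path.join(directory, base)  # e.g. "<dir>/foo.html"
    if os.path.islink(link_path):
        os.remove(link_path)                   # replace a stale link
    elif os.path.exists(link_path):
        # Refuse to overwrite a real file that happens to use this name.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return link_path
```

Calling link_default_variant(".", "foo.html", "en") in the document directory would serve the English file under the language-neutral name, as in the ln -s example.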

If you're using Apache 2 (as opposed to 1.3), then you can
alternatively handle this situation as shown under "Language
Negotiation Exceptions" at
http://httpd.apache.org/docs-2.0/con...gotiation.html

cheers
Jul 20 '05 #5
Alan J. Flavell wrote:
> Well, I think you've caught me out here - if that happens (other than
> by having MultiViews enabled) then I hadn't realised it, and I don't
> think it's even clear from the documentation. I'll try taking a look
> at it later (it's not on your critical path, evidently) and if it's
> unclear, maybe move the question to the appropriate servers group
> where the Apache experts might be found...


Don't bother making further tests, I was wrong: I was too hasty in
drawing conclusions. Remember I told you that the line "Options
+MultiViews" in my .htaccess produced an error from the HTTP server?
Well, I concluded that MultiViews was disabled, and I was wrong: I
removed this line and left only:
AddLanguage en .en
AddLanguage fr .fr

and everything was all right. I guess that this time I can conclude that
MultiViews is enabled by default... (I removed all the type-map stuff).

Sorry for the inconvenience.

Jul 20 '05 #6
On Tue, Sep 2, Vincent inscribed on the eternal scroll:
> and everything was all right. I guess that this time I can conclude that
> MultiViews is enabled by default... (I removed all the type-map stuff).
Thanks for posting the solution! Good to see it came out OK in the
end.
> Sorry for the inconvenience.


No problem. Maybe the answer will help someone else too.

all the best.

Jul 20 '05 #7
Vincent wrote:
> I simply removed this line and simply left:
> AddLanguage en .en
> AddLanguage fr .fr
>
> and everything was all right.


Does anybody know if it makes sense to add things like
AddLanguage en-GB .en
AddLanguage en-US .en
or other language variants?
For example, for people who only have en-US as an accepted language.
I would guess they could understand en-GB too ;-)

Greetings
Jan

--
Due to financial reasons the light at the end of the tunnel has been
shut down until further notice.

Jul 20 '05 #8
On Wed, Sep 3, Jan Steffen inscribed on the eternal scroll:
> Does anybody know

Anyone who's attentively read the RFC, certainly ;-)

> if it makes sense to add things like
> AddLanguage en-GB .en
> AddLanguage en-US .en
> or other language variants?

If you're not offering a specifically British-English version then you
shouldn't advertise it as British-English. If you _are_ offering a
specifically British-English version, and not offering a separate
generic version, then you can advertise it as en-GB, and then someone
who requests a generic English version will get it, thanks to the
so-called "prefix rule".
> For example for people who only have en-US as accepted language.
In theory that says that they refuse all other variants of English -
even the generic English variant.
> I would guess they could understand en-GB too ;-)


Maybe they don't want to? That's what the RFC says that it means!
But OK, there is a more practical answer, within the scope of the RFC.
Read the Apache tutorial and the other materials that it cites, and
reach your own conclusion.

There *are* acceptable solutions for the dilemma; but pretending that
one and the same version of the document is in several language
variants is not a good answer. You never know what kind of English
they'd ask for next (Indian variant? New-Zealand? Cockney?).
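
The "prefix rule" mentioned above can be written down in a few lines. This is a sketch of the matching rule as RFC 2616 (section 14.4) describes it, namely that a language-range matches a tag if it equals the tag or equals a prefix of it that is followed by "-". The function name range_matches is made up, and a real server does more than this one comparison.

```python
def range_matches(language_range, tag):
    """Return True if an Accept-Language range matches a document's tag.

    The range matches when it equals the tag, or when it is a prefix of
    the tag and the next character of the tag is "-". The special range
    "*" matches any tag. Comparison is case-insensitive.
    """
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


# A reader asking for generic "en" accepts a document labelled en-GB ...
assert range_matches("en", "en-GB")
# ... but a reader asking only for "en-US" matches neither en-GB nor plain en.
assert not range_matches("en-US", "en-GB")
assert not range_matches("en-US", "en")
```

The rule works in one direction only: the requested range may be shorter than the document's tag, never longer. That is why a browser configured with only en-US does not, in theory, accept a document labelled plain en.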

cheers
Jul 20 '05 #9
Alan J. Flavell wrote:
> On Wed, Sep 3, Jan Steffen inscribed on the eternal scroll:
>> Does anybody know
> Anyone who's attentively read the RFC, certainly ;-)


Ok, I've got to admit, I didn't.
>> if it makes sense to add things like
>> AddLanguage en-GB .en
>> AddLanguage en-US .en
>> or other language variants?
>
> If you're not offering a specifically British-English version then you
> shouldn't advertise it as British-English. If you _are_ offering a
> specifically British-English version, and not offering a separate
> generic version, then you can advertise it as en-GB, and then someone
> who requests a generic English version will get it, thanks to the
> so-called "prefix rule".


OK, in reality the differences are not that big. I could replace
color with colour and that sort of thing. But I would think that in
everyday text *I* wouldn't spot much difference. (OK, I'm not a native
speaker.) In fact I don't even know if the pages I translated myself are
now en-GB or en-US or en-WithTypicalGermanErrors!

>> For example for people who only have en-US as accepted language.
>
> In theory that says that they refuse all other variants of English -
> even the generic English variant.


Yes, but that is theory. In real life, users just have the setting
their browser came with, and that is often just en-US. It is not an
informed decision.

>> I would guess they could understand en-GB too ;-)
>
> Maybe they don't want to? That's what the RFC says that it means!
> But OK, there is a more practical answer, within the scope of the RFC.
> Read the Apache tutorial and the other materials that it cites, and
> reach your own conclusion.
>
> There *are* acceptable solutions for the dilemma; but pretending that
> one and the same version of the document is in several language
> variants is not a good answer.


I just reread some of the documentation, but could not get to a
final conclusion. Are you willing to enlighten me?
> You never know what kind of English
> they'd ask for next (Indian variant? New-Zealand? Cockney?).


Do you have an example of a site offering content in different
flavours of English? Just out of curiosity.

HAND, Jan

--
Warning: This message may contain traces of peanuts.

Jul 20 '05 #10
