
Search engines continue to ignore LANG markup

P: n/a
I have three test pages that are marked as Italian, Spanish,
Portuguese, resp. by

Content-Language: it
<html lang="it">
<body lang="it">

and the same for "es" and "pt".

Yahoo regards all three pages as Italian:
http://search.yahoo.com/search?p=%22...l=1&vl=lang_it

Google regards one as English (What??) and two as Spanish:
http://www.google.com/search?q=%22id...%22&lr=lang_en
http://www.google.com/search?q=%22id...%22&lr=lang_es

:-(

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Feb 28 '07 #1
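
The three declarations listed in the post above (the HTTP header plus the lang attributes on <html> and <body>) can be read back from a fetched page to check what was actually sent. This is a minimal sketch, assuming the headers are already available as a plain dict; the helper names `LangAttrParser` and `declared_languages` are made up for this illustration and do not describe how any search engine works.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Collects the lang attribute of <html> and <body>, if present."""

    def __init__(self):
        super().__init__()
        self.langs = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in ("html", "body"):
            lang = dict(attrs).get("lang")
            if lang:
                self.langs[tag] = lang.lower()


def declared_languages(headers, html_text):
    """Return every language declaration found in headers and markup."""
    parser = LangAttrParser()
    parser.feed(html_text)
    parser.close()
    found = dict(parser.langs)
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "content-language" in lowered:
        found["http"] = lowered["content-language"].strip().lower()
    return found


page = '<html lang="it"><body lang="it"><p>casa</p></body></html>'
result = declared_languages({"Content-Language": "it"}, page)
print(result)
```

If the three values disagree, the page itself is inconsistent, which is worth ruling out before blaming the search engine.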
16 Replies


P: n/a
Andreas Prilop <An***************@trashmail.net> wrote:
> I have three test pages that are marked as Italian, Spanish,
> Portuguese, resp. by
>
> Content-Language: it
> <html lang="it">
> <body lang="it">
>
> and the same for "es" and "pt".
>
> Yahoo regards all three pages as Italian:
> http://search.yahoo.com/search?p=%22...l=1&vl=lang_it
>
> Google regards one as English (What??) and two as Spanish:
> http://www.google.com/search?q=%22id...%22&lr=lang_en
> http://www.google.com/search?q=%22id...%22&lr=lang_es

I'd be surprised if author-provided metadata such as language info were
broadly reliable on the web. I'd expect better results from using
heuristics to determine a document's language, so I'd expect SEs to use
heuristics; that serves their users better.

I don't speak any of the test languages, but comparing two of the test
pages, it seems to me that they do not contain words that are
characteristic of each language; in fact, the content appears to have
been chosen to confuse heuristic guessing.

The choice of using a list of words instead of natural language probably
also hinders heuristic guessing since it makes it impossible to use
context for similar words in the various languages.

--
Spartanicus
Feb 28 '07 #2
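
The kind of heuristic guessing discussed in this post can be sketched with function-word counting. This is only an illustration under stated assumptions: the tiny word lists below are hypothetical, real detectors use far more data (and usually character n-grams), and nothing here reflects what Yahoo or Google actually do. It does show why a bare list of words shared by all three languages gives such a guesser nothing to work with.

```python
# Hypothetical, deliberately tiny function-word lists (illustration only).
STOPWORDS = {
    "it": {"il", "che", "non", "per", "sono", "della", "gli"},
    "es": {"el", "los", "que", "por", "las", "una", "pero"},
    "pt": {"o", "os", "que", "não", "uma", "para", "com"},
}


def guess_language(text):
    """Return (language or None, per-language scores) for a text."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores.values())
    winners = sorted(lang for lang, score in scores.items() if score == best)
    # No evidence at all, or a tie, means the guesser cannot decide.
    if best == 0 or len(winners) > 1:
        return None, scores
    return winners[0], scores


# Running text contains function words, so the guess has something to use.
print(guess_language("il gatto non dorme"))
# A bare list of nouns that exist in all three languages offers no clue.
print(guess_language("casa problema sala banana"))
```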

P: n/a
On Wed, 28 Feb 2007, Spartanicus wrote:
> I'd be surprised if author provided meta data like language info on the
> web was broadly reliable.

Mostly, LANG markup is *missing* from documents. However, if the author
supplies LANG markup, it should be taken as ... well ... authoritative.
The author knows best in which language he writes.

> I'd expect better results from using
> heuristics to determine a document's language. So I'd expect SEs to use
> heuristics, it serves their users better.

That's the same argument used by Internet Explorer 6:

| The server sends "text/plain" but I take "text/html"
| because it seems to make more sense to me.

They can still guess when LANG markup is *missing*.

> in fact the content appears to be chosen to confuse heuristic guessing.

Exactly.

> The choice of using a list of words instead of natural language probably
> also hinders heuristic guessing since it makes it impossible to use
> context for similar words in the various languages.

But only with such a list of words, you can take different LANG
parameters. All the words exist in Italian, Spanish, Portuguese.
Each page could be IT or ES or PT.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Feb 28 '07 #3

P: n/a
Andreas Prilop wrote:
> I have three test pages that are marked as Italian, Spanish,
> Portuguese, resp. by
>
> Content-Language: it
> <html lang="it">
> <body lang="it">
>
> and the same for "es" and "pt".
>
> Yahoo regards all three pages as Italian:
> http://search.yahoo.com/search?p=%22...l=1&vl=lang_it
>
> Google regards one as English (What??) and two as Spanish:
> http://www.google.com/search?q=%22id...%22&lr=lang_en
> http://www.google.com/search?q=%22id...%22&lr=lang_es
>
> :-(

Shouldn't you have <meta lang="it"/> in the head rather than specifying
the language of elements?
--
am

laurus : rhodophyta : brethoneg : smalltalk : stargate

--
Posted via a free Usenet account from http://www.teranews.com

Feb 28 '07 #4

P: n/a
Andreas Prilop <An***************@trashmail.net> wrote:
>> I'd be surprised if author provided meta data like language info on the
>> web was broadly reliable.
>
> Mostly, LANG markup is *missing* from documents. However, if the author
> supplies LANG markup, it should be taken as ... well ... authoritative.
> The author knows best in which language he writes.

I don't have any statistics, but I'd expect that many documents on the
web are produced by authoring tools that use templates which may contain
false language info. I've done it myself, even as a hand coder: my
default document template contains lang="en", and on more than one
occasion I have published "Lorem ipsum" demo pages with the default
lang="en" still in there.

>> I'd expect better results from using
>> heuristics to determine a document's language. So I'd expect SEs to use
>> heuristics, it serves their users better.
>
> That's the same argument used by Internet Explorer 6:
>
> | The server sends "text/plain" but I take "text/html"
> | because it seems to make more sense to me.

That is a spec violation (must). There is no spec requirement on a UA to
use language meta data:
"Language information specified via the lang attribute may be used by a
user agent"
http://www.w3.org/TR/html4/struct/dirlang.html#h-8.1

>> in fact the content appears to be chosen to confuse heuristic guessing.
>
> Exactly.
>
>> The choice of using a list of words instead of natural language probably
>> also hinders heuristic guessing since it makes it impossible to use
>> context for similar words in the various languages.
>
> But only with such a list of words, you can take different LANG
> parameters. All the words exist in Italian, Spanish, Portuguese.
> Each page could be IT or ES or PT.

I don't think that it is realistic to expect SEs to use language meta
data if they cannot determine the language via heuristics. And as I've
noted before, I find their decision to use heuristics logical.

--
Spartanicus
Feb 28 '07 #5
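
The two positions argued so far differ only in which source wins when both are available. The sketch below puts them side by side; the function names and the sample cases (including the guessed values) are hypothetical and chosen to mirror examples from the thread, not measured behaviour of any engine.

```python
def declaration_first(declared, guessed):
    """Trust the author's markup; guess only when it is missing."""
    return declared or guessed


def heuristics_first(declared, guessed):
    """Trust the guess; use the declaration only when guessing fails."""
    return guessed or declared


# (description, declared language, guessed language); all values assumed.
cases = [
    ("Lorem ipsum page with template default", "en", "la"),
    ("word-list test page marked Portuguese", "pt", "es"),
    ("page without any LANG markup", None, "es"),
]

for description, declared, guessed in cases:
    print(
        f"{description}: "
        f"declaration_first={declaration_first(declared, guessed)}, "
        f"heuristics_first={heuristics_first(declared, guessed)}"
    )
```

The first case favours heuristics (the template default is wrong), the second favours the declaration (the guess is wrong), and the third is uncontroversial because both policies agree.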

P: n/a
Scripsit Andreas Prilop:

[ Search engines ignore lang attributes and Content-Language headers,
apparently using some guesswork instead. ]

Sadly enough, this will probably not improve. The problem is that there are
too many phoney lang attributes on web pages, typically resulting from
authoring software that spits them out, though clueless authors write them,
too. There are also wrong lang attributes due to simple carelessness. What
would you do, then, if you were a search engine that tried to be useful?

For example, http://www.kko.fi/29566.htm is a page by the Supreme Court of
Finland, actually in Sámi language, but with lang="sa", i.e. claiming to be
in Sanskrit, despite the detailed explanation of the mistake that I sent
months ago. Someone took the trouble of actually typing in the lang
attribute but didn't get it right, and apparently it is impossible to fix
it.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Feb 28 '07 #6
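
The "sa" mix-up described in this post is easy to make because two-letter codes are not mnemonic. A small lookup shows the difference; the table is a hand-picked excerpt of ISO 639-1 codes, and treating the page as Northern Sami ("se") is an assumption, since the post does not say which Sámi language is meant.

```python
# Hand-picked excerpt of ISO 639-1 codes (not a complete registry).
LANGUAGE_NAMES = {
    "sa": "Sanskrit",
    "se": "Northern Sami",
    "sm": "Samoan",
    "fi": "Finnish",
    "sv": "Swedish",
}


def describe_lang(tag):
    """Return the language name for the primary subtag of a lang value."""
    primary = tag.split("-")[0].lower()
    return LANGUAGE_NAMES.get(primary, "unknown")


print(describe_lang("sa"))  # what the page claims
print(describe_lang("se"))  # what was probably intended (assumption)
```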

P: n/a
Scripsit António Marques:
> Shouldn't you have <meta lang="it"/> in the head rather than
> specifying the language of elements?

No. By definition, the lang attribute specifies the language of the text in
the element and its attributes. The <meta> element never has any content, so
the above element is completely pointless. Using lang with other attributes
could make sense in odd cases, if you have metainformation in a language
other than the document's overall language, but that would normally be
keyword spamming and could be treated as such.

Followups trimmed; there was no reason for a silent addition of sci.lang.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Feb 28 '07 #7

P: n/a
mb
On Feb 28, 8:54 am, Andreas Prilop <AndreasPrilop2...@trashmail.net>
wrote:
> I have three test pages that are marked as Italian, Spanish,
> Portuguese, resp. by
>
> Content-Language: it
> <html lang="it">
> <body lang="it">
>
> and the same for "es" and "pt".

Can you inform a total ignorant? All these, and many other languages
too, can be typed on an international keyboard. Where is the sense of
arbitrarily assigning "language" to texts then?

Feb 28 '07 #8

P: n/a
Jukka K. Korpela wrote:
> Scripsit António Marques:
>> Shouldn't you have <meta lang="it"/> in the head rather than
>> specifying the language of elements?
>
> No. By definition, the lang attribute specifies the language of the text
> in the element and its attributes.

Yes, my fault. I intended to write <meta http-equiv="Content-Language"
content="it"/>, as I suspect that's what search engines look at.

> The <meta> element never has any
> content, so the above element is completely pointless. Using lang with
> other attributes could make sense in odd cases, if you have
> metainformation in a language other than the document's overall
> language, but that would normally be keyword spamming and could be
> treated as such.
>
> Followups trimmed; there was no reason for a silent addition of sci.lang.

There was no reason for the original posting to sci.lang either.
--
am

laurus : rhodophyta : brethoneg : smalltalk : stargate

--
Posted via a free Usenet account from http://www.teranews.com

Feb 28 '07 #9

P: n/a
mb wrote:
>> Content-Language: it
>> <html lang="it">
> Can you inform a total ignorant? All these, and many other languages
> too, can be typed on an international keyboard. Where is the sense of
> arbitrarily assigning "language" to texts then?

It isn't arbitrary; it describes the language the content is written in.
This has implications for (among other things) which pronunciation
dictionary a screen reader should use, and for what an automated system
asked to fetch some data can do if it knows which languages the user
can understand.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Feb 28 '07 #10

P: n/a
mb
On Feb 28, 1:20 pm, David Dorward <dorw...@yahoo.com> wrote:
> mb wrote:
>>> Content-Language: it
>>> <html lang="it">
>> Can you inform a total ignorant? All these, and many other languages
>> too, can be typed on an international keyboard. Where is the sense of
>> arbitrarily assigning "language" to texts then?
>
> It isn't arbitrary, it describes the language the content is written in.
> This has implications for (among other things) what pronunciation
> dictionary a screen reader should use and what an automated system could do
> given an instruction to get some data if it knows what languages the user
> can understand.

Thank you, wasn't thinking of that.
Meaning that if I somehow could get a tag on Word documents I could
stop that @##! Word spellchecker from "automatically" deciding what
dictionary to use for each %$@@! word?

Mar 1 '07 #11

P: n/a
mb wrote:
> Meaning that if I somehow could get a tag on Word documents I could
> stop that @##! Word spellchecker from "automatically" deciding what
> dictionary to use for each %$@@! word?

Yes. And you can. The easiest way is to use Word's GUI for the task. My
Word is in Finnish so you have to press F1 for help :)

Osmo


Mar 1 '07 #12

P: n/a
On Wed, 28 Feb 2007, mb wrote:
> Can you inform a total ignorant? All these, and many other languages
> too, can be typed on an international keyboard. Where is the sense of
> arbitrarily assigning "language" to texts then?

Search engines allow you to restrict your search to certain languages.
For example, you might want to restrict your search to English or
to French or to German when looking for the word "elf".
Of course, this goes wrong when the search engine is unable
to detect the language correctly.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Mar 1 '07 #13
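
The "elf" example can be made concrete with a toy index in which every document carries a language label. Everything below is invented for illustration (the documents, the labels and the `search` helper); it only shows how a wrong label puts a page into the wrong language-restricted result list.

```python
# Toy index; doc-3 is German text that was wrongly labelled "en".
DOCUMENTS = [
    {"url": "doc-1", "lang": "en", "text": "The elf hid in the forest."},
    {"url": "doc-2", "lang": "de", "text": "Es ist elf Uhr."},
    {"url": "doc-3", "lang": "en", "text": "Es sind elf Spieler auf dem Feld."},
]


def search(term, lang=None):
    """Return URLs of documents containing term, optionally limited by language."""
    term = term.lower()
    hits = []
    for doc in DOCUMENTS:
        words = doc["text"].lower().replace(".", " ").split()
        if term in words and (lang is None or doc["lang"] == lang):
            hits.append(doc["url"])
    return hits


print(search("elf", lang="de"))  # misses doc-3 because of its wrong label
print(search("elf", lang="en"))  # includes doc-3 although it is German
```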

P: n/a
On Wed, 28 Feb 2007, António Marques wrote:
> I intended to write <meta http-equiv="Content-Language"
> content="it"/>, as I suspect that's what search engines look at.

First, the slash is wrong in HTML.

Second, *everything* called <meta http-equiv> is only a poor ersatz,
a cheapo surrogate, a plastic imitation from China. What you should
have instead, is the *real* HTTP header

Content-Language: it

And that's exactly what I wrote in my original posting.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Mar 1 '07 #14
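
Sending the real header is a server-side matter rather than a markup one. As a rough sketch, assuming a Python WSGI setup (the thread does not say which server the test pages use), an application can set Content-Language next to Content-Type; the function name `app` and the page content are placeholders.

```python
def app(environ, start_response):
    """Minimal WSGI application that sends a real Content-Language header."""
    body = '<html lang="it"><body lang="it"><p>casa</p></body></html>'.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Language", "it"),  # the real HTTP header, not <meta http-equiv>
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# To try it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

On a static site the same effect is normally achieved through the web server's configuration instead of application code.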

P: n/a
mb
On Mar 1, 1:34 am, Osmo Saarikumpu <o...@weppipakki.com> wrote:
> mb wrote:
>> Meaning that if I somehow could get a tag on Word documents I could
>> stop that @##! Word spellchecker from "automatically" deciding what
>> dictionary to use for each %$@@! word?
>
> Yes. And you can. The easiest way is to use Word's GUI for the task. My
> Word is in Finnish so you have to press F1 for help :)

Nah. When you have multiple languages installed the damn thing only
does "automatic" recognition, because "tools-language" selects them
all by default.

Mar 1 '07 #15

P: n/a
mb wrote:
> On Mar 1, 1:34 am, Osmo Saarikumpu <o...@weppipakki.com> wrote:
>> Yes. And you can. The easiest way is to use Word's GUI for the task. My
>> Word is in Finnish so you have to press F1 for help :)
> Nah. When you have multiple languages installed the damn thing only
> does "automatic" recognition, because "tools-language" selects them
> all by default.

I don't understand. Are you saying that when you have multiple languages
installed Word GUI does not allow user defined language markup? I'm
sorry, but I have to ask: did you select part of the text before
applying the relevant language information?

An example:

I'm from Finland. Suomi is Finland in Finnish.

In the above text the word Suomi would be flagged as misspelled because
Word's automatic language recognition does not find the word in the
English dictionary. I guess that the correct procedure (using the GUI)
would be to select the word and then change its language to Finnish. And
that would end the (understandably mistaken) automatic recognition.

The code underneath the hood concerning our word before the change would be:

<span class=SpellE>Suomi</span>

And after the change:

<span lang=FI>Suomi</span>

HTH, Osmo


Mar 1 '07 #16
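
A consumer of such markup (a spellchecker, a screen reader) has to work out the effective language of each text run: the nearest enclosing lang attribute wins. The sketch below does that with the standard-library HTML parser. The class name `LangRuns`, the surrounding <p lang=EN-US> wrapper and the default language are assumptions for the example, and void elements such as <br> are not handled.

```python
from html.parser import HTMLParser


class LangRuns(HTMLParser):
    """Records (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self, default_lang):
        super().__init__()
        self.stack = [default_lang]  # effective language per open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        # An element without its own lang inherits the enclosing one.
        self.stack.append(lang.lower() if lang else self.stack[-1])

    def handle_endtag(self, tag):
        # Assumes properly nested markup without void elements like <br>.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))


markup = "<p lang=EN-US>I'm from Finland. <span lang=FI>Suomi</span> is Finland in Finnish.</p>"
parser = LangRuns(default_lang="und")
parser.feed(markup)
parser.close()
for lang, text in parser.runs:
    print(lang, "->", text)
```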

P: n/a
mb
On Mar 1, 2:02 pm, Osmo Saarikumpu <o...@weppipakki.com> wrote:
> mb wrote:
>> On Mar 1, 1:34 am, Osmo Saarikumpu <o...@weppipakki.com> wrote:
>>> Yes. And you can. The easiest way is to use Word's GUI for the task. My
>>> Word is in Finnish so you have to press F1 for help :)
>> Nah. When you have multiple languages installed the damn thing only
>> does "automatic" recognition, because "tools-language" selects them
>> all by default.
>
> I don't understand. Are you saying that when you have multiple languages
> installed Word GUI does not allow user defined language markup? I'm
> sorry, but I have to ask: did you select part of the text before
> applying the relevant language information?

Yeah. Selecting the whole text does not allow you to define one
dictionary language for all. They say they fixed it in later editions.

Mar 1 '07 #17

This discussion thread is closed
