Spidering only English webpages

I am working on a spider script but I only want to parse English pages.
Is there a way I can check to see what language the content is in? I
suppose I could restrict my spider to just .com, .org, etc. so that pages
from foreign countries would not get parsed.

Oct 26 '05 #1
On 26 Oct 2005 10:06:02 -0700, el*************@yahoo.com wrote:
> I am working on a spider script but I only want to parse english pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com , .org, etc so foreign
> countries would not get parsed.


http://www.deutsch-online.com/
--
Regards, Paul Herber, Sandrila Ltd. http://www.pherber.com/
Electronics stencils for Visio http://www.electronics.pherber.com/
Oct 26 '05 #2
> I am working on a spider script but I only want to parse english pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com , .org, etc so foreign
> countries would not get parsed.


Lots of websites in any domain are multi-lingual. Lots of websites
in non-English-speaking countries are in English (at least partly).

Your spider might manage to use content negotiation to try to select
the English content over other versions of it, but I suspect most
websites aren't really set up to use content negotiation.
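
As a rough illustration of the content negotiation idea, a spider could
send an Accept-Language request header and then look at the
Content-Language response header, if the server sends one. The sketch
below is Python using only the standard library. The function names
build_english_request and declares_english are made up for this example,
and many servers will simply ignore the stated preference.

```python
from urllib.request import Request


def build_english_request(url):
    """Build a request that states a preference for English content.

    Hypothetical helper for illustration. Accept-Language is a standard
    HTTP request header; the q-value marks all other languages as
    acceptable but much less preferred.
    """
    return Request(url, headers={"Accept-Language": "en, *;q=0.1"})


def declares_english(content_language):
    """Return True if a Content-Language header value lists English.

    Returns False when the header is missing or empty, so a False
    result only means "not declared", not "not English".
    """
    if not content_language:
        return False
    tags = [tag.strip().lower() for tag in content_language.split(",")]
    return any(tag == "en" or tag.startswith("en-") for tag in tags)
```

To use it, pass the request to urllib.request.urlopen() and feed
response.headers.get("Content-Language") into declares_english(). A
missing header is the common case, so this can only be one hint among
several.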

There are probably some word frequency tests you can use to guess
what language a web page is in. sci.crypt often uses such info to
try to crack ciphers if they think they know what language the
message is in. This might fall flat on its face if the web site
is discussing another language (e.g. computer programming languages,
or something laced heavily with technical jargon).
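
A very crude version of such a word frequency test might look like the
Python sketch below. It counts what fraction of the words on a page are
common English function words. The word list, the 0.2 threshold, and the
names english_score and looks_english are all assumptions picked for
illustration, not tuned values, and the caveat above about jargon-heavy
pages applies in full.

```python
import re

# A small hand-picked set of very common English function words.
# This list is an illustrative assumption, not a tuned vocabulary.
ENGLISH_STOPWORDS = frozenset(
    "the of and to a in is that it for was on are as with be at this "
    "have from or by not but".split()
)


def english_score(text):
    """Return the fraction of words in text that are common English words."""
    # Only ASCII letters and apostrophes count as word characters, so
    # accented words get split into fragments that rarely match the list.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text, threshold=0.2):
    """Guess whether text is English; threshold is an arbitrary assumption."""
    return english_score(text) >= threshold
```

The text passed in should be the visible page text with the markup
already stripped, otherwise tag and attribute names will dilute the
score.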

Gordon L. Burditt
Oct 26 '05 #3
"Paul Herber" wrote:
> On 26 Oct 2005 10:06:02 -0700, el*************@yahoo.com wrote:
>> I am working on a spider script but I only want to parse english pages.
>> Is there a way I can check to see what language the content is in? I
>> suppose I could restrict my spider to just .com , .org, etc so foreign
>> countries would not get parsed.
>
> http://www.deutsch-online.com/


Here are some more for you:

http://www.clemi.org/
http://www.tottori.co.uk/

Not even Google can work out a web page's language with 100% reliability
(see <http://www.google.com/help/faq_translation.html#link>)

As Gordon suggests, you might achieve *some* success by checking things like
word frequency, but this is computationally expensive, and you still have to
consider things like speling mistaiks and typign errors.

Some pages have a lang attribute in the HTML tag (e.g., <HTML lang="en">),
but most don't.
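
For the pages that do carry it, reading the attribute is cheap. Here is a
minimal Python sketch using the standard library's html.parser module.
The class name and the declared_lang function are made up for this
example, and a return value of None means only that the page did not
declare a language.

```python
from html.parser import HTMLParser


class _HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HTML LANG="en">
        # arrives here as tag "html" with an attribute named "lang".
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_lang(html_text):
    """Return the declared language code in lowercase, or None if absent."""
    parser = _HtmlLangParser()
    parser.feed(html_text)
    return parser.lang
```

A spider could accept a page when declared_lang() returns "en" or
something starting with "en-", and fall back to another test when it
returns None.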

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 26 '05 #4
> I am working on a spider script but I only want to parse english pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com , .org, etc so foreign
> countries would not get parsed.


There are literally thousands of Spanish pages using dot com or dot org
domains (I own a couple of them). AFAIK, anyone in any country of any
language can register a dot com, and I can't imagine why you would
assume otherwise.

Greetings.

Oct 28 '05 #5
el*************@yahoo.com wrote:
> I am working on a spider script but I only want to parse english pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com , .org, etc so foreign
> countries would not get parsed.


If the website is well developed, the language code will be in the lang
attribute <http://www.w3.org/TR/REC-html40/struct/dirlang.html> and/or
in a META tag. But again, it's not dependable.
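
For the META case, one common form is
<meta http-equiv="Content-Language" content="en">. The Python sketch
below collects whatever languages such tags declare. It assumes that
http-equiv form specifically, and the class name and the meta_languages
function are made up for this example. An empty result means nothing was
declared, which says nothing about the actual language of the page.

```python
from html.parser import HTMLParser


class _MetaLanguageParser(HTMLParser):
    """Collects language codes from Content-Language META tags."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None when the
        # attribute has no value, so normalise those to empty strings.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "content-language":
            for part in attr.get("content", "").split(","):
                code = part.strip().lower()
                if code:
                    self.languages.append(code)


def meta_languages(html_text):
    """Return the list of language codes declared via META, possibly empty."""
    parser = _MetaLanguageParser()
    parser.feed(html_text)
    return parser.languages
```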

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Oct 29 '05 #6
