Bytes | Software Development & Data Engineering Community

Spidering only English webpages

I am working on a spider script, but I only want to parse English pages.
Is there a way I can check what language the content is in? I
suppose I could restrict my spider to just .com, .org, etc. so that sites
in foreign countries would not get parsed.

Oct 26 '05 #1
On 26 Oct 2005 10:06:02 -0700, el************* @yahoo.com wrote:
> I am working on a spider script but I only want to parse English pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com, .org, etc so foreign
> countries would not get parsed.


http://www.deutsch-online.com/
--
Regards, Paul Herber, Sandrila Ltd. http://www.pherber.com/
Electronics stencils for Visio http://www.electronics.pherber.com/
Oct 26 '05 #2
> I am working on a spider script but I only want to parse English pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com, .org, etc so foreign
> countries would not get parsed.


Lots of websites in any domain are multi-lingual. Lots of websites
in non-English-speaking countries are in English (at least partly).

Your spider might manage to use content negotiation to try to select
the English content over other versions of it, but I suspect most
websites aren't really set up to use content negotiation.
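
A minimal sketch of what content negotiation could look like from the
spider's side, assuming a Python spider built on the standard library. The
function names `english_request` and `declared_english` are hypothetical.
The spider sends an `Accept-Language` header and then checks whether the
server answered with a `Content-Language` header that lists English. Many
servers ignore the request header or omit the response header, so this is
a hint at best.

```python
from urllib.request import Request


def english_request(url):
    """Build a request that asks the server for English content."""
    # Accept-Language is the standard HTTP request header for
    # language negotiation; servers are free to ignore it.
    return Request(url, headers={"Accept-Language": "en"})


def declared_english(content_language):
    """Return True if a Content-Language header value lists English.

    `content_language` is the raw header value (or None if absent),
    for example "en-US, fr".
    """
    if not content_language:
        return False
    tags = [tag.strip().lower() for tag in content_language.split(",")]
    return any(tag == "en" or tag.startswith("en-") for tag in tags)
```

In a real crawl the request would be passed to `urllib.request.urlopen`,
and the value of `response.headers.get("Content-Language")` handed to
`declared_english`.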

There are probably some word frequency tests you can use to guess
what language a web page is in. sci.crypt often uses such info to
try to crack ciphers if they think they know what language the
message is in. This might fall flat on its face if the web site
is discussing another language (e.g. computer programming languages,
or something laced heavily with technical jargon).
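
One very rough way to sketch such a word frequency test in Python is to
count how many words in the page text are common English function words.
The word list, the 0.2 threshold, and the function names below are
illustrative assumptions, not tuned values; a serious version would use a
larger list built from a corpus, and it would still misjudge pages heavy
with jargon or code.

```python
import re

# A tiny hand-picked set of very common English words (an assumption
# for illustration; a real list would be larger and corpus-derived).
ENGLISH_STOPWORDS = {
    "the", "and", "of", "to", "in", "is", "that", "it",
    "for", "with", "as", "was", "on", "are", "this",
}


def english_score(text):
    """Return the fraction of words in `text` that are common English words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text, threshold=0.2):
    """Guess whether `text` is English; the threshold is an arbitrary choice."""
    return english_score(text) >= threshold
```

The text passed in should be the visible page text with markup already
stripped, otherwise tag and attribute names dilute the score.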

Gordon L. Burditt
Oct 26 '05 #3
"Paul Herber" wrote:
> On 26 Oct 2005 10:06:02 -0700, el************* @yahoo.com wrote:
>> I am working on a spider script but I only want to parse English pages.
>> Is there a way I can check to see what language the content is in? I
>> suppose I could restrict my spider to just .com, .org, etc so foreign
>> countries would not get parsed.
>
> http://www.deutsch-online.com/


Here are some more for you:

http://www.clemi.org/
http://www.tottori.co.uk/

Not even Google can work out a web page's language with 100% reliability
(see <http://www.google.com/help/faq_translation.html#link>).

As Gordon suggests, you might achieve *some* success by checking things like
word frequency, but this is computationally expensive, and you still have to
consider things like speling mistaiks and typign errors.

Some pages have a lang attribute in the HTML tag (e.g., <HTML lang="en">),
but most don't.
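
For pages that do carry the attribute, reading it is cheap. A minimal
sketch with Python's standard `html.parser` module follows; the class name
`HtmlLangParser` and the helper `declared_lang` are hypothetical. It
returns `None` when the attribute is missing, which by the point above
will be the common case.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Record the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <HTML LANG="en">
        # arrives here as tag "html" with attribute name "lang".
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_lang(html_text):
    """Return the page's declared lang value, or None if it has none."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    return parser.lang
```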

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 26 '05 #4
> I am working on a spider script but I only want to parse English pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com, .org, etc so foreign
> countries would not get parsed.


There are literally thousands of Spanish pages using .com or .org
domains (I own a couple of them). AFAIK, anyone in any country, speaking any
language, can register a .com, and I can't imagine why you would
assume otherwise.

Greetings.

Oct 28 '05 #5
el************* @yahoo.com wrote:
> I am working on a spider script but I only want to parse English pages.
> Is there a way I can check to see what language the content is in? I
> suppose I could restrict my spider to just .com, .org, etc so foreign
> countries would not get parsed.


If the website is well developed, the language code will be in the lang
attribute <http://www.w3.org/TR/REC-html40/struct/dirlang.html> and/or
in a META element. But again, that's not dependable.
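
As a rough illustration of the META case, the sketch below looks for a
`<meta http-equiv="Content-Language" content="...">` element using Python's
standard `html.parser`. The names `MetaLanguageParser` and `meta_language`
are hypothetical, and the result should be treated as one weak signal among
several, since many pages omit the element or set it incorrectly.

```python
from html.parser import HTMLParser


class MetaLanguageParser(HTMLParser):
    """Record the content of the first Content-Language META element."""

    def __init__(self):
        super().__init__()
        self.content_language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content_language is not None:
            return
        attributes = dict(attrs)
        # Attribute names are lowercased by HTMLParser; values are not,
        # so compare the http-equiv value case-insensitively.
        if (attributes.get("http-equiv") or "").lower() == "content-language":
            self.content_language = attributes.get("content")


def meta_language(html_text):
    """Return the META Content-Language value, or None if not declared."""
    parser = MetaLanguageParser()
    parser.feed(html_text)
    return parser.content_language
```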

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Oct 29 '05 #6
