I’m new to programming. I’m currently learning Python and writing a web crawler that extracts all the text from a web page and then crawls on to further URLs to collect the text there as well. The idea is to put all the extracted text into a .txt file, one word per line, so the text has to be tokenized. All punctuation marks, duplicate words and stop words have to be removed.
The program should crawl the web to a certain depth and collect the URLs and text at each depth (level). I decided on a depth of 3. I divided the code into two parts: part one collects the URLs and part two extracts the text. Here are my problems:
- The program is extremely slow.
- I'm not sure if it functions properly.
- Is there a better way to extract text?
- Are there any available modules to help clean the text, i.e. removing duplicates and stop words? (A rough sketch of what I have in mind follows this list.)
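Here is that sketch, assuming NLTK is a suitable module for this (I haven't verified the details against my setup):

# Rough idea for the cleaning step. Assumes nltk is installed and its
# 'punkt' and 'stopwords' data have been fetched with nltk.download().
import string
import nltk
from nltk.corpus import stopwords

def clean_words(text):
    stop = set(stopwords.words('english'))
    seen = set()
    words = []
    for token in nltk.word_tokenize(text.lower()):
        token = token.strip(string.punctuation)    # drop punctuation marks
        if token and token not in stop and token not in seen:
            seen.add(token)                         # skip stop words and duplicates
            words.append(token)
    return words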
(Please note: the majority of the code (the first part) was written by James Mills. I found it online, it looked helpful, so I used it; I just modified it and added my own code to it.)
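To show what I mean by collecting URLs level by level in part one, this is the rough shape of it (just an illustration of the idea, using the Fetcher class from the listing below, not the actual code):

# Illustration only: collect URLs level by level down to a fixed depth.
def collect_levels(start_url, depth=3):
    current = [start_url]
    levels = []                     # levels[0] = root URLs, levels[1] = level 2, ...
    for _ in range(depth):
        found = []
        for url in current:
            fetcher = Fetcher(url)
            fetcher.fetch()
            for link in fetcher:
                if link not in found:   # avoid duplicates within a level
                    found.append(link)
        levels.append(found)
        current = found
    return levels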
Thanks,
Kal
import sys
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup, NavigableString

__version__ = "0.1"
__copyright__ = "CopyRight (C) 2008 by James Mills"
__license__ = "GPL"
__author__ = "James Mills"
__author_email__ = "James Mills, James dot Mills st dotred dot com dot au"

USAGE = "%prog [options] <url>"
VERSION = "%prog v" + __version__
AGENT = "%s/%s" % (__name__, __version__)
def encodeHTML(s=""):
    """encodeHTML(s) -> str

    Encode HTML special characters from their ASCII form to
    HTML entities.
    """
    return s.replace("&", "&amp;") \
            .replace("<", "&lt;") \
            .replace(">", "&gt;") \
            .replace("\"", "&quot;") \
            .replace("'", "&#039;") \
            .replace("--", "&mdash;")
class Fetcher(object):
    """Fetches a page and collects the absolute URLs of its <a> links."""

    def __init__(self, url):
        self.url = url
        self.urls = []

    def __contains__(self, x):
        return x in self.urls

    def __getitem__(self, x):
        return self.urls[x]

    def _addHeaders(self, request):
        request.add_header("User-Agent", AGENT)

    def open(self):
        url = self.url
        #print "\nFollowing %s" % url
        try:
            request = urllib2.Request(url)
            handle = urllib2.build_opener()
        except IOError:
            return None
        return (request, handle)

    def fetch(self):
        result = self.open()
        if result is None:    # open() failed, nothing to fetch
            return
        request, handle = result
        self._addHeaders(request)
        tags = []
        try:
            content = unicode(handle.open(request).read(), errors="ignore")
            soup = BeautifulSoup(content)
            tags = soup('a')
        except urllib2.HTTPError, error:
            if error.code == 404:
                print >> sys.stderr, "ERROR: %s -> %s" % (error, error.url)
            else:
                print >> sys.stderr, "ERROR: %s" % error
        except urllib2.URLError, error:
            print >> sys.stderr, "ERROR: %s" % error
        for tag in tags:
            try:
                href = tag["href"]
                if href is not None:
                    url = urlparse.urljoin(self.url, encodeHTML(href))
                    if url not in self:
                        #print " Found: %s" % url
                        self.urls.append(url)
            except KeyError:
                pass
################################################################################
# I created 3 lists (root, level2 and level3). Each list saves the URLs of
# that level, i.e. depth. I chose to create 3 lists so I have the flexibility
# of testing the text at each level. Also, the 3 lists can easily be combined
# into one list.
################################################################################

# Level 1:
root = Fetcher('http://www.wantasimplewebsite.co.uk/index.html')
root.fetch()   # fetch() already skips duplicate links on the page

print "\nRoot URLs are:"
for i, url in enumerate(root):
    print "%d. %s" % (i+1, url)

# Level 2:
level2 = []
for url in root:   # traverse every URL in root and fetch the URLs from it
    temp = Fetcher(url)
    temp.fetch()
    for link in temp:
        if link not in level2:   # avoid duplicate links
            level2.append(link)

print "\nLevel2 URLs are:"
for i, url in enumerate(level2):
    print "%d. %s" % (i+1, url)

# Level 3:
level3 = []
for url in level2:   # traverse every URL in level2 and fetch the URLs from it
    temp = Fetcher(url)
    temp.fetch()
    for link in temp:
        if link not in level3:   # avoid duplicate links
            level3.append(link)

print "\nLevel3 URLs are:"
for i, url in enumerate(level3):
    print "%d. %s" % (i+1, url)
# To do for part two:
# 1. Traverse every link in the lists and extract its web page content.
# 2. Tokenize the text.
# 3. Remove stop words (i.e. and, but, to...)
# 4. Remove duplicates.
# 5. What about stemming?
# 6. Check the spelling.
# 7. Save the result in a file.

html = urllib2.urlopen('http://www.wantasimplewebsite.co.uk/index.html').read()
soup = BeautifulSoup(html)

def printText(tags):
    # Recursively print every piece of text inside the given tags.
    for tag in tags:
        if isinstance(tag, NavigableString):
            print tag,
        else:
            printText(tag)

printText(soup.findAll("body"))
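Eventually I want printText to feed into something that covers steps 1-4 and 7, roughly like this (untested sketch; clean_words is the helper I sketched above, and words.txt is just a placeholder name):

# Untested sketch: extract the visible text from every collected URL, clean it
# with the clean_words() helper sketched earlier, and write one word per line.
def extractText(url):
    try:
        html = urllib2.urlopen(url).read()
    except (urllib2.URLError, IOError):
        return u""
    soup = BeautifulSoup(html)
    # findAll(text=True) returns every NavigableString in the document
    return u" ".join(soup.findAll(text=True))

allWords = []
for url in [root.url] + root.urls + level2 + level3:
    for word in clean_words(extractText(url)):
        if word not in allWords:   # remove duplicates across pages
            allWords.append(word)

output = open("words.txt", "w")
for word in allWords:
    output.write(word.encode("utf-8") + "\n")
output.close()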