Bytes | Software Development & Data Engineering Community

How to retrieve URLs and text from web pages

Hi,

I’m new to programming. I’m currently learning Python by writing a web crawler that extracts all the text from a web page, then crawls the URLs it finds and collects the text there too. The idea is to write all the extracted text to a .txt file with one word per line, so the text has to be tokenized, and all punctuation marks, duplicate words and stop words have to be removed.

The program should crawl the web to a certain depth and collect the URLs and text at each depth (level). I chose a depth of 3 and divided the code into two parts: part one collects the URLs and part two extracts the text. Here are my problems:
  1. The program is extremely slow.
  2. I'm not sure if it functions properly.
  3. Is there a better way to extract text?
  4. Are there any modules available to help clean the text, e.g. removing duplicates and stop words?
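On point 4, NLTK ships a stop-word corpus and tokenizers, but the cleaning itself needs nothing beyond the standard library. Here is a minimal sketch of the tokenize / strip punctuation / drop stop words / drop duplicates pipeline; the tiny `STOP_WORDS` set is purely illustrative (a real run would use a fuller list), and the function name is my own:

```python
import re

# A tiny illustrative stop-word list; in practice use a fuller one
# (e.g. NLTK's stopwords corpus).
STOP_WORDS = set(["a", "an", "and", "but", "the", "to", "of", "in", "is"])

def clean_tokens(text):
    """Lowercase, tokenize, strip punctuation, and drop stop words
    and duplicate words, preserving first-seen order."""
    words = re.findall(r"[a-z0-9']+", text.lower())  # tokenize; punctuation is discarded
    seen = set()
    result = []
    for w in words:
        if w in STOP_WORDS or w in seen:
            continue
        seen.add(w)
        result.append(w)
    return result

tokens = clean_tokens("The quick brown fox, and the quick red fox.")
print("\n".join(tokens))  # one word per line, as for the .txt output
```

Writing the result out is then just `open("words.txt", "w").write("\n".join(tokens))`.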

(Please note: the majority of the code (the first part) was written by “James Mills”. I found it online and it looked helpful, so I used it, modified it and added my own code.)

Thanks,
Kal

import sys
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup, NavigableString

__version__ = "0.1"
__copyright__ = "CopyRight (C) 2008 by James Mills"
__license__ = "GPL"
__author__ = "James Mills"
__author_email__ = "James Mills, James dot Mills st dotred dot com dot au"

USAGE = "%prog [options] <url>"
VERSION = "%prog v" + __version__

AGENT = "%s/%s" % (__name__, __version__)

def encodeHTML(s=""):
    """encodeHTML(s) -> str

    Encode HTML special characters from their ASCII form to
    HTML entities.
    """
    return s.replace("&", "&amp;") \
            .replace("<", "&lt;") \
            .replace(">", "&gt;") \
            .replace("\"", "&quot;") \
            .replace("'", "&#39;") \
            .replace("--", "&mdash;")

class Fetcher(object):

    def __init__(self, url):
        self.url = url
        self.urls = []

    def __contains__(self, x):
        return x in self.urls

    def __getitem__(self, x):
        return self.urls[x]

    def _addHeaders(self, request):
        request.add_header("User-Agent", AGENT)

    def open(self):
        try:
            request = urllib2.Request(self.url)
            handle = urllib2.build_opener()
        except IOError:
            return None
        return (request, handle)

    def fetch(self):
        opened = self.open()
        if opened is None:  # open() failed; don't unpack None
            return
        request, handle = opened
        self._addHeaders(request)
        tags = []
        try:
            content = unicode(handle.open(request).read(), errors="ignore")
            # BeautifulSoup has no feed() method; parse via the constructor.
            soup = BeautifulSoup(content)
            tags = soup('a')
        except urllib2.HTTPError, error:
            if error.code == 404:
                print >> sys.stderr, "ERROR: %s -> %s" % (error, error.url)
            else:
                print >> sys.stderr, "ERROR: %s" % error
        except urllib2.URLError, error:
            print >> sys.stderr, "ERROR: %s" % error
        for tag in tags:
            try:
                href = tag["href"]
                if href is not None:
                    url = urlparse.urljoin(self.url, encodeHTML(href))
                    if url not in self:  # Fetcher skips duplicates itself
                        self.urls.append(url)
            except KeyError:
                pass


################################################################################
# I created 3 lists (root, level2 and level3). Each list saves the URLs of
# that level, i.e. depth. I chose 3 separate lists for the flexibility of
# testing the text in each level; they can also easily be combined into one.
################################################################################

# Level 1:
root = Fetcher('http://www.wantasimplewebsite.co.uk/index.html')
root.fetch()
# Fetcher.fetch() already skips duplicate links, so root.urls needs no
# further de-duplication. (The original loop here tested "url not in root"
# while iterating over root -- always false -- and called a non-existent
# root.append() method.)

print "\nRoot URLs are:"
for i, url in enumerate(root):
    print "%d. %s" % (i + 1, url)


# Level 2:
level2 = []
for url in root:  # Fetch the URLs linked from every URL in root
    temp = Fetcher(url)
    temp.fetch()
    for link in temp:  # renamed from "url" to avoid shadowing the outer loop variable
        if link not in level2:  # Avoid duplicate links
            level2.append(link)

print "\nLevel2 URLs are:"
for i, url in enumerate(level2):
    print "%d. %s" % (i + 1, url)


# Level 3:
level3 = []
for url in level2:  # Fetch the URLs linked from every URL in level2
    temp = Fetcher(url)
    temp.fetch()
    for link in temp:
        if link not in level3:  # Avoid duplicate links
            level3.append(link)

print "\nLevel3 URLs are:"
for i, url in enumerate(level3):
    print "%d. %s" % (i + 1, url)
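On points 1 and 2: the three near-identical level blocks above can be collapsed into one breadth-first loop over a configurable depth, which also de-duplicates across levels (re-fetching a page already seen at an earlier depth is one reason the crawl is slow). A sketch, with the page-fetching step abstracted into a `get_links` callable so the logic can be exercised without the network; with the Fetcher class above, `get_links` would be `lambda u: (lambda f: (f.fetch(), f.urls)[1])(Fetcher(u))` or an equivalent small helper:

```python
def crawl(start_url, depth, get_links):
    """Breadth-first crawl to `depth` levels.

    Returns a list `levels` where levels[d] holds the de-duplicated
    URLs discovered at depth d.  `get_links(url)` must return the list
    of URLs linked from `url`.
    """
    seen = set([start_url])
    levels = [[start_url]]
    for _ in range(depth - 1):
        next_level = []
        for url in levels[-1]:
            for link in get_links(url):
                if link not in seen:  # skip duplicates across all levels
                    seen.add(link)
                    next_level.append(link)
        levels.append(next_level)
    return levels

# Toy link graph standing in for real pages (no network needed):
graph = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}
levels = crawl("a", 3, lambda u: graph.get(u, []))
print(levels)  # -> [['a'], ['b', 'c'], ['d', 'e']]
```

Each level is still available separately (`levels[0]`, `levels[1]`, ...), so the flexibility of the three-list design is preserved.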
# 1. Traverse every link in the lists and extract its web page content.
# 2. Tokenize the text.
# 3. Remove stop words (i.e. and, but, to ...).
# 4. Remove duplicates.
# 5. What about stemming?
# 6. Check the spelling.
# 7. Save the result in a file.


html = urllib2.urlopen('http://www.wantasimplewebsite.co.uk/index.html').read()
soup = BeautifulSoup(html)

def printText(tags):
    for tag in tags:
        if tag.__class__ == NavigableString:
            print tag,
        else:
            printText(tag)

printText(soup.findAll("body"))
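On point 3: with BeautifulSoup, `soup.findAll(text=True)` returns all text nodes directly, without the recursion above. If you want to avoid the dependency entirely, the standard library's HTML parser (`HTMLParser` on Python 2, `html.parser` on Python 3) can do the same job; a sketch, with class and function names of my own choosing, that also skips script/style content (which `printText` would print as if it were page text):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script>/<style>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self._skip = 0  # depth of open script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

sample = "<html><body><h1>Hello</h1><script>var x=1;</script><p>world</p></body></html>"
print(extract_text(sample))  # -> Hello world
```

The returned string can then be fed straight into the tokenizing/cleaning step before writing the .txt file.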
Jun 26 '10 #1