Bytes | Software Development & Data Engineering Community

Extracting text from a Webpage using BeautifulSoup

Hi,

I wish to extract all the words on a set of webpages and store them in
a large dictionary. I then wish to produce a list of the most common
words for the language under consideration. So, my code below reads
the page -

http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm

a Welsh-language page. I hope then to establish the 1,000 most commonly
used words in Welsh. The problem I'm having is that
soup.findAll(text=True) is returning the likes of -

u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
www.w3.org/TR/REC-html40/loose.dtd"'

and -

<a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"

Any suggestions how I might overcome this problem?

Thanks,

Barry.
Here's my code -

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

# proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
# opener = urllib2.build_opener(proxy_support)
# urllib2.install_opener(opener)

page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
soup = BeautifulSoup(page)

pageText = soup.findAll(text=True)
print pageText
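Once the text fragments are extracted, the counting step described above (a large dictionary, then the most common words) might look like the sketch below. The Welsh strings here are made-up stand-ins for what soup.findAll(text=True) would return, not output from the BBC page -

```python
import re

# Stand-ins for the strings soup.findAll(text=True) would return;
# these Welsh words are illustrative, not taken from the BBC page.
fragments = [u'Newyddion o Gymru', u'Newyddion y dydd']

counts = {}
for fragment in fragments:
    # Split each fragment into lowercase words and tally them.
    for word in re.findall(r'\w+', fragment.lower(), re.UNICODE):
        counts[word] = counts.get(word, 0) + 1

# Sort by descending count; slice to 1000 entries for the real word list.
most_common = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(most_common[:3])
```

For larger corpora, collections.Counter does the same tallying and exposes a most_common(n) method directly.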

Jun 27 '08 #1
On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list of the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm
>
> a Welsh-language page. I hope then to establish the 1,000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
> www.w3.org/TR/REC-html40/loose.dtd"'

Just extract the text from the body of the document.

    body_texts = soup.body(text=True)

> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?

Ask the BBC to produce HTML that's less buggy. ;-)

http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
or closing tags without opening ones and so on.
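For reference, the same body-only idea can be sketched without BeautifulSoup at all, using only the standard library's HTML parser (html.parser in current Python). This is shown against a tiny inline document, not the BBC page -

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect text that appears inside <body>, skipping <script>/<style>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_body = False
        self.skip_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text nodes that are inside <body>
        # and not nested in a <script> or <style> element.
        if self.in_body and not self.skip_depth and data.strip():
            self.texts.append(data.strip())

parser = BodyTextExtractor()
parser.feed('<html><head><title>t</title></head>'
            '<body><p>Croeso</p><script>var x=1;</script></body></html>')
print(parser.texts)
```

Unlike BeautifulSoup, this does nothing to repair broken markup, so it is only a fallback for reasonably well-formed pages.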

Ciao,
Marc 'BlackJack' Rintsch
Jun 27 '08 #2
On 27 May, 12:54, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> Just extract the text from the body of the document.
>
>     body_texts = soup.body(text=True)
>
> Ask the BBC to produce HTML that's less buggy. ;-)
Great, thanks!
Jun 27 '08 #3
On May 27, 5:01 am, Magnus.Morab...@gmail.com wrote:
As an alternative datapoint, you can try out the htmlStripper example
on the pyparsing wiki: http://pyparsing.wikispaces.com/spac...tmlStripper.py

-- Paul
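If pyparsing isn't to hand, a rough regex-based stripper gives a similar tags-out effect for reasonably well-formed input. This is a sketch, not a real parser, and it will misbehave on pathological HTML -

```python
import re

def strip_html(html):
    # Remove <script>/<style> blocks first (their content isn't prose),
    # then drop any remaining tags, then collapse runs of whitespace.
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", html).strip()

print(strip_html('<p>Hello <b>world</b><script>x()</script></p>'))
```

A proper parser (BeautifulSoup, html.parser, or the pyparsing example above) is still the safer choice for real-world pages.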
Jun 27 '08 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

