Bytes | Developer Community

Extracting text from a Webpage using BeautifulSoup

Hi,

I wish to extract all the words on a set of webpages and store them in
a large dictionary. I then wish to produce a list of the most common
words for the language under consideration. So, my code below reads
the page -

http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm

a Welsh-language page. I hope then to establish the 1000 most commonly
used words in Welsh. The problem I'm having is that
soup.findAll(text=True) is returning the likes of -

u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
www.w3.org/TR/REC-html40/loose.dtd"'

and -

<a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"

Any suggestions how I might overcome this problem?

Thanks,

Barry.
Here's my code -

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

# proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
# opener = urllib2.build_opener(proxy_support)
# urllib2.install_opener(opener)

page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
soup = BeautifulSoup(page)

pageText = soup.findAll(text=True)
print pageText
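[Editor's note: the post stops at printing the extracted strings. For the follow-on step it describes, tallying word frequencies to find the most common words, a minimal sketch using the standard library's collections.Counter could look like this. It is written in modern Python 3, and the names `most_common_words` and `sample` are illustrative, not from the original post.]

```python
# Sketch of the counting step the post aims at: given plain text,
# tally word frequencies and list the n most common words.
from collections import Counter
import re

def most_common_words(text, n=1000):
    # Lowercase and split on word characters; in Python 3, \w matches
    # Unicode letters, so accented Welsh characters are kept.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)

sample = "y mae y ci yn y ty y mae"
print(most_common_words(sample, 3))  # [('y', 4), ('mae', 2), ('ci', 1)]
```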

Jun 27 '08 #1
On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list of the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm
>
> a Welsh-language page. I hope then to establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'

Just extract the text from the body of the document:

body_texts = soup.body(text=True)

> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?

Ask the BBC to produce HTML that's less buggy. ;-)

http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
or closing tags without opening ones and so on.

Ciao,
Marc 'BlackJack' Rintsch
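[Editor's note: with the modern bs4 package (successor to the old BeautifulSoup module used in the thread), the body-only idea above can be sketched as follows. The HTML snippet and variable names are illustrative, not from the thread; removing script/style subtrees first keeps their contents out of the extracted text.]

```python
# Sketch with bs4: extract only the visible body text of a page,
# skipping the doctype, the <head>, and any script/style content.
from bs4 import BeautifulSoup

html = """<html><head><title>t</title><script>var x = 1;</script></head>
<body><p>Mae hwn yn destun.</p><script>var y = 2;</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Drop script/style subtrees so their contents don't leak into the text.
for tag in soup(["script", "style"]):
    tag.decompose()
# get_text() on the body skips the doctype and everything in <head>.
body_text = soup.body.get_text(" ", strip=True)
print(body_text)  # Mae hwn yn destun.
```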
Jun 27 '08 #2
On 27 May, 12:54, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:

> Just extract the text from the body of the document:
>
> body_texts = soup.body(text=True)
>
> Ask the BBC to produce HTML that's less buggy. ;-)
>
> http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
> or closing tags without opening ones and so on.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Great, thanks!
Jun 27 '08 #3
On May 27, 5:01 am, Magnus.Morab...@gmail.com wrote:

> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list of the most common
> words for the language under consideration.
>
> Any suggestions how I might overcome this problem?
As an alternative datapoint, you can try out the htmlStripper example
on the pyparsing wiki: http://pyparsing.wikispaces.com/spac...tmlStripper.py

-- Paul
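[Editor's note: the linked pyparsing example isn't reproduced in the thread. As a rough illustration of the same tag-stripping idea using only the standard library, one might write something like the following; the class name `TextExtractor` and the sample markup are hypothetical.]

```python
# A minimal tag stripper in the spirit of the linked htmlStripper example,
# built on the standard library's html.parser (illustrative sketch).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

    def text(self):
        # Join the collected fragments and normalise whitespace.
        return " ".join(" ".join(self.chunks).split())

parser = TextExtractor()
parser.feed('<p>Croeso</p><script>var x;</script><p>i Gymru</p>')
print(parser.text())  # Croeso i Gymru
```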
Jun 27 '08 #4

This discussion thread is closed. Replies have been disabled for this discussion.