Help with using findAll() in BeautifulSoup


Okay, I am not sure if there is a better way of doing this than findAll(), but
that is how I am doing it right now. I am making an app that screen scrapes
dictionary.com for definitions. However, I would like to have the type of
the word for each definition. For example, if def1 and def2 are noun
definitions but def3 isn't:
noun
def1
def2
verb
def3

Something like that. Now, I can get the definitions just fine. The
problem comes when I want to get the type. I can get the types, but I don't
know which definitions they go with. So I can get noun and verb, but for
all I know noun is def1, and verb is def2 and def3. I am wondering if there is a
way to use findAll() but have it stop once it hits a certain element, or some
other way to do just that. For example, if I have

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but
have it tell me how many <table> things come after each span and before the
next one, so I know how many definitions each type has.

Here is the code I am using (I used "cheese" because that is kind of my test
word for everything in the app):

import urllib
from BeautifulSoup import BeautifulSoup

class defWord:
    def __init__(self, word):
        self.word = word

        def get_types(term):
            soup = BeautifulSoup(urllib.urlopen('http://dictionary.reference.com/search?q=%s' % term))

            for tabs in soup.findAll('span', {'class': 'pg'}):
                yield tabs.contents[0].string

        self.mainList = list(get_types(self.word))
        print self.mainList

type = defWord("cheese")

I don't know if this is really something anyone can help me fix or if I have
to do it on my own. But I would love some help.
--
View this message in context: http://www.nabble.com/Help-with-usin...p18415792.html
Sent from the Python - python-list mailing list archive at Nabble.com.

Jul 12 '08 #1
Alexnb wrote:
> Okay, I am not sure if there is a better way of doing this than findAll() but
> that is how I am doing it right now.

Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/

> I am making an app that screen scrapes
> dictionary.com for definitions.

Do they have a policy for doing that?

> noun
> <table blah>
> <table blah>
> verb
> <table blah>
>
> I want to be able to do like findAll('span', {'class': 'pg'}), but tell me
> how many <table> things are after it, or before the next so I know how many
> definitions it has.

You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

"span.pg table"

might work. There's also a CSS syntax for siblings.
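
If the spans and tables turn out to be siblings rather than nested, another option is to walk the document in order and start a new group each time a type span appears. The following is only a rough sketch of that idea, not the lxml.cssselect approach itself: it uses the standard library's html.parser (Python 3) instead of lxml or BeautifulSoup, and the sample markup, the TypeGrouper class and the count_definitions function are made up for illustration, since the real page structure wasn't shown.

```python
from html.parser import HTMLParser


class TypeGrouper(HTMLParser):
    """Count the <table> tags that follow each <span class="pg">.

    Hypothetical helper: assumes every definition is one <table> and
    every word type is the text of a <span class="pg">.
    """

    def __init__(self):
        super().__init__()
        self.groups = []       # one [type_name, table_count] per span.pg
        self._in_type = False  # True while inside a span.pg

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pg" in classes:
            # A new word type starts a new group.
            self._in_type = True
            self.groups.append(["", 0])
        elif tag == "table" and self.groups:
            # Any table seen before the next span.pg belongs to the
            # most recent type (nested tables would be counted too).
            self.groups[-1][1] += 1

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_type = False

    def handle_data(self, data):
        if self._in_type:
            self.groups[-1][0] += data.strip()


def count_definitions(html):
    """Return a list of (type_name, number_of_tables) in page order."""
    parser = TypeGrouper()
    parser.feed(html)
    parser.close()
    return [(name, count) for name, count in parser.groups]


# Assumed markup, modelled on the simplified example in the question.
SAMPLE = """
<span class="pg">noun</span>
<table><tr><td>def1</td></tr></table>
<table><tr><td>def2</td></tr></table>
<span class="pg">verb</span>
<table><tr><td>def3</td></tr></table>
"""

print(count_definitions(SAMPLE))
```

The same "remember the last type seen" logic should carry over to whichever parser you end up using, as long as it lets you visit elements in document order.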

Stefan
Jul 12 '08 #2
On Jul 12, 12:55 am, Stefan Behnel <stefan...@behnel.de> wrote:
> Alexnb wrote:
> > I am making an app that screen scrapes
> > dictionary.com for definitions.
>
> Do they have a policy for doing that?

From the Dictionary.com Terms of Use (http://dictionary.reference.com/help/terms.html):

3.2 You will not modify, publish, transmit, participate in the
transfer or sale, create derivative works, or in any way exploit, any
of the content, in whole or in part, found on the Site. You will
download copyrighted content solely for your personal use, but will
make no other use of the content without the express written
permission of Lexico and the copyright owner. You will not make any
changes to any content that you are permitted to download under this
Agreement, and in particular you will not delete or alter any
proprietary rights or attribution notices in any content. You agree
that you do not acquire any ownership rights in any downloaded
content.

IANAL, but it seems pretty clear that, unless this content scraper is
"solely for your personal use," you'll need to get written permission
to include content that you have scraped from Dictionary.com into your
app.

-- Paul
Jul 12 '08 #3
