Help with using findAll() in BeautifulSoup


Okay, I am not sure if there is a better way of doing this than findAll(), but
that is how I am doing it right now. I am making an app that screen-scrapes
dictionary.com for definitions. However, I would like to have the type of
the word for each definition. For example, if def1 and def2 are noun
definitions but def3 isn't:
noun
def1
def2
verb
def3

Something like that. Now, I can get the definitions just fine. The
problem comes when I want to get the type. I can get the types, but I don't
know which definitions they go with. So I can get noun and verb, but for
all I know noun is def1, and verb is def2 and def3. I am wondering if there is a
way to use findAll() but have it stop once it hits a certain thing, or some
other way to do just that. For example, if I have

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but have
it tell me how many <table> things come after it, or before the next span, so I
know how many definitions it has.

Here is the code I am using (I used "cheese" because that is kinda my test
word for everything in the app):

import urllib
from BeautifulSoup import BeautifulSoup

class defWord:
    def __init__(self, word):
        self.word = word

        def get_types(term):
            # Fetch the results page and yield the text of each word-type span.
            soup = BeautifulSoup(urllib.urlopen(
                'http://dictionary.reference.com/search?q=%s' % term))

            for tabs in soup.findAll('span', {'class': 'pg'}):
                yield tabs.contents[0].string

        self.mainList = list(get_types(self.word))
        print self.mainList

type = defWord("cheese")

I don't know if this is really something anyone can help me fix or if I have
to do it on my own. But I would love some help.
--
View this message in context: http://www.nabble.com/Help-with-usin...p18415792.html
Sent from the Python - python-list mailing list archive at Nabble.com.

Jul 12 '08 #1
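
One way to get the grouping asked for above is to walk the page in document order and attach every table to the most recent type heading. The following is only a minimal sketch under stated assumptions: it uses Python 3 and the standard library's html.parser rather than BeautifulSoup, the names TypeGrouper, group_definitions and SAMPLE are made up for illustration, and the sample markup is a guess based on the description in the post (each definition in its own <table>, each type in a <span class="pg">), not the real dictionary.com HTML.

```python
from html.parser import HTMLParser


class TypeGrouper(HTMLParser):
    """Count the <table> tags that follow each <span class="pg"> heading."""

    def __init__(self):
        super().__init__()
        self.groups = []   # one [type_text, table_count] entry per heading
        self._depth = 0    # > 0 while inside a <span class="pg">

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth:
                # A span nested inside the heading: just track the nesting.
                self._depth += 1
            elif "pg" in classes:
                # A new type heading starts a new group.
                self._depth = 1
                self.groups.append(["", 0])
        elif tag == "table" and self.groups:
            # Every table belongs to the most recent heading seen so far.
            self.groups[-1][1] += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.groups[-1][0] += data


def group_definitions(html):
    """Return a list of (type, number_of_tables) pairs in page order."""
    parser = TypeGrouper()
    parser.feed(html)
    parser.close()
    return [(name.strip(), count) for name, count in parser.groups]


# Assumed markup, modelled on the example in the post.
SAMPLE = """
<span class="pg">noun</span>
<table><tr><td>def1</td></tr></table>
<table><tr><td>def2</td></tr></table>
<span class="pg">verb</span>
<table><tr><td>def3</td></tr></table>
"""

print(group_definitions(SAMPLE))
```

Because the counts come back in page order, the first two definitions found can be labelled with the first type, the next one with the second type, and so on. Note that this sketch counts every <table> start tag, so tables nested inside a definition table would inflate the count; the real page layout would need to be checked before relying on it.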
Alexnb wrote:
> Okay, I am not sure if there is a better way of doing this than findAll(), but
> that is how I am doing it right now.

Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/

> I am making an app that screen-scrapes
> dictionary.com for definitions.

Do they have a policy for doing that?

> noun
> <table blah>
> <table blah>
> verb
> <table blah>
>
> I want to be able to do something like findAll('span', {'class': 'pg'}), but have
> it tell me how many <table> things come after it, or before the next span, so I
> know how many definitions it has.

You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

"span.pg table"

might work. There's also a CSS syntax for siblings (the "+" and "~" combinators).

Stefan
Jul 12 '08 #2
On Jul 12, 12:55 am, Stefan Behnel <stefan...@behnel.de> wrote:
> Alexnb wrote:
> > I am making an app that screen-scrapes
> > dictionary.com for definitions.
>
> Do they have a policy for doing that?

From the Dictionary.com Terms of Use (http://dictionary.reference.com/help/terms.html):

3.2 You will not modify, publish, transmit, participate in the
transfer or sale, create derivative works, or in any way exploit, any
of the content, in whole or in part, found on the Site. You will
download copyrighted content solely for your personal use, but will
make no other use of the content without the express written
permission of Lexico and the copyright owner. You will not make any
changes to any content that you are permitted to download under this
Agreement, and in particular you will not delete or alter any
proprietary rights or attribution notices in any content. You agree
that you do not acquire any ownership rights in any downloaded
content.

IANAL, but it seems pretty clear that, unless this content scraper is
"solely for your personal use," you'll need to get written permission
to include content that you have scraped from Dictionary.com into your
app.

-- Paul
Jul 12 '08 #3
