Help with using findAll() in BeautifulSoup


Okay, I am not sure if there is a better way of doing this than findAll(), but
that is how I am doing it right now. I am making an app that screen-scrapes
dictionary.com for definitions. However, I would like to have the type of
the word for each definition. For example, if def1 and def2 are noun
definitions but def3 isn't:
noun
def1
def2
verb
def3

Something like that. Now, I can get the definitions just fine. The
problem comes when I want to get the type. I can get the types, but I don't
know which definitions they go with. So I can get noun and verb, but for
all I know noun is def1, and verb is def2 and def3. I am wondering if there is a
way to use findAll() but have it stop once it hits a certain thing, or some
other way to do just that. For example, if I have

noun
<table blah>
<table blah>
verb
<table blah>

I want to be able to do something like findAll('span', {'class': 'pg'}), but
have it tell me how many <table> things are after it, or before the next one,
so I know how many definitions it has.

Here is the code I am using (I used "cheese" because that is kind of my test
word for everything in the app):

import urllib
from BeautifulSoup import BeautifulSoup

class defWord:
    def __init__(self, word):
        self.word = word

        def get_types(term):
            soup = BeautifulSoup(urllib.urlopen('http://dictionary.reference.com/search?q=%s' % term))

            for tabs in soup.findAll('span', {'class': 'pg'}):
                yield tabs.contents[0].string

        self.mainList = list(get_types(self.word))
        print self.mainList

type = defWord("cheese")

I don't know if this is really something anyone can help me fix or if I have
to do it on my own. But I would love some help.
--
View this message in context: http://www.nabble.com/Help-with-usin...p18415792.html
Sent from the Python - python-list mailing list archive at Nabble.com.

Jul 12 '08 #1
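
One way to get the grouping asked about above is to make a single pass over the spans and tables together, instead of calling findAll() for each separately. The sketch below is only an illustration under assumptions: it uses the newer bs4 package (find_all) rather than the BeautifulSoup 3 import in the post, and SAMPLE_HTML is hypothetical markup in which each span.pg is followed by its definition tables, since the real page layout is not shown in the thread. The group_definitions name is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the actual dictionary.com layout is not shown in the thread.
SAMPLE_HTML = """
<div>
  <span class="pg">noun</span>
  <table><tr><td>def1</td></tr></table>
  <table><tr><td>def2</td></tr></table>
  <span class="pg">verb</span>
  <table><tr><td>def3</td></tr></table>
</div>
"""

def group_definitions(html):
    """Return [(word_type, [definition, ...]), ...] in page order."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    # Passing a list of tag names returns all matches in document order,
    # so every table is seen after the span that precedes it on the page.
    for tag in soup.find_all(["span", "table"]):
        if tag.name == "span":
            # Only spans with class "pg" start a new group.
            if "pg" in tag.get("class", []):
                groups.append((tag.get_text(strip=True), []))
        elif groups:
            # A table belongs to the most recently seen span.pg.
            groups[-1][1].append(tag.get_text(strip=True))
    return groups

groups = group_definitions(SAMPLE_HTML)
for word_type, definitions in groups:
    print(word_type, len(definitions), definitions)
```

If the real page nests tables inside other tables, the inner ones would be counted too, so the loop would need an extra check for that case.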
Alexnb wrote:
> Okay, I am not sure if there is a better way of doing this than findAll() but
> that is how I am doing it right now.
Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/

> I am making an app that screen scrapes
> dictionary.com for definitions.
Do they have a policy for doing that?

> noun
> <table blah>
> <table blah>
> verb
> <table blah>
>
> I want to be able to do like findAll('span', {'class': 'pg'}), but tell me
> how many <table> things are after it, or before the next so I know how many
> definitions it has.
You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

"span.pg table"

might work. There's also a CSS syntax for siblings.

Stefan
Jul 12 '08 #2
On Jul 12, 12:55 am, Stefan Behnel <stefan...@behnel.de> wrote:
> Alexnb wrote:
> > I am making an app that screen scrapes
> > dictionary.com for definitions.
>
> Do they have a policy for doing that?

From the Dictionary.com Terms of Use
(http://dictionary.reference.com/help/terms.html):

3.2 You will not modify, publish, transmit, participate in the
transfer or sale, create derivative works, or in any way exploit, any
of the content, in whole or in part, found on the Site. You will
download copyrighted content solely for your personal use, but will
make no other use of the content without the express written
permission of Lexico and the copyright owner. You will not make any
changes to any content that you are permitted to download under this
Agreement, and in particular you will not delete or alter any
proprietary rights or attribution notices in any content. You agree
that you do not acquire any ownership rights in any downloaded
content.

IANAL, but it seems pretty clear that, unless this content scraper is
"solely for your personal use," you'll need to get written permission
to include content that you have scraped from Dictionary.com into your
app.

-- Paul
Jul 12 '08 #3
