scraping question

hey everyone, here's probably an easy question, but i'm new to this...

i'm doing a web scrape and the lines i want in the source code look like this:

    <td>Return on Average Equity</td>
    <td align=right>
    16.08%
    </td>
    <td align=right>
    10.58%
    </td>
    <td align=right>
    11.71%
    </td>
The number i want to get out is 10.58. However, that number will change, so i can't just search for it. Any ideas how i should go about this?

I was hoping there might be a way to look for "<td align=right>" and then, say, get the next line.

any thoughts?

thanks
Oct 2 '07 #1
bvdet
This should work, but you probably need a way to terminate the for loop. What kind of data follows?
    # Python 2 syntax (f.next() and the print statement)
    import re

    patt = re.compile(r'(Return on Average Equity)')
    fn = 'test.txt'

    f = open(fn)

    # skip to 'Return on Average Equity'
    s = f.next()
    while not patt.search(s):
        s = f.next()

    # after each '<td align=right>' line, keep the line that follows it
    returnList = []
    for line in f:
        if '<td align=right>' in line:
            returnList.append(f.next().strip())

    print returnList
>>> ['16.08%', '10.58%', '11.71%']
>>>
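For readers on Python 3, the same line-scanning idea might be sketched as follows. This is only an illustration, not the code posted above: the function name `values_after_label`, the `limit` parameter, and the in-memory `SAMPLE` text are assumptions made for the example. Stopping after `limit` values is one possible way to terminate the loop, and it assumes the row always holds three figures.

```python
import io

# Sample input copied from the question; io.StringIO stands in for an open file.
SAMPLE = """<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
"""


def values_after_label(lines, label, cell_marker="<td align=right>", limit=3):
    """Return the line following each cell_marker once label has been seen."""
    it = iter(lines)

    # Skip ahead to the row label. If it never appears, the loop just ends
    # and the second loop below has nothing left to read.
    for line in it:
        if label in line:
            break

    values = []
    for line in it:
        if cell_marker in line:
            # next() with a default returns None at end of input
            # instead of raising StopIteration.
            value = next(it, None)
            if value is None:
                break
            values.append(value.strip())
            if len(values) == limit:
                break  # assumed row width; stops before any later rows
    return values


values = values_after_label(io.StringIO(SAMPLE), "Return on Average Equity")
print(values)     # ['16.08%', '10.58%', '11.71%']
print(values[1])  # 10.58%
```

The second entry is the figure asked for in the question. If the label is missing, the function returns an empty list instead of raising an error.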
Oct 2 '07 #2
When I try your method I get an error that is like this...
    >>> s = f.next()
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    StopIteration
    >>>
Also, if i plan to do this a few hundred/thousand times, would I need to create a few hundred/thousand test.txt files beforehand?

Thanks
Oct 3 '07 #3
bvdet
If you are scraping a website, you may be using the urllib module. Your code may look something like this:
    f = urllib.urlopen('http://www.bvdetailing.com')

'f' is a file-like object on which you can iterate. The file method next() is similar to readline(), except that next() raises a StopIteration error when the end of the file is reached, whereas readline() returns an empty string. In the example below, read() consumes the rest of the page, so the second next() call has nothing left to return:

    >>> import urllib
    >>> f = urllib.urlopen('http://www.bvdetailing.com')
    >>> f.next()
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=win.............
    >>> s = f.read()
    >>> f.next()
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
      File "C:\Python23\lib\socket.py", line 405, in next
        raise StopIteration
    StopIteration
    >>>
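The end-of-input behaviour can be reproduced without a network connection. The sketch below uses Python 3, where the built-in next() replaces the .next() method, and an io.StringIO object as a stand-in for the response (an assumption made so the example is self-contained). One plausible cause of a StopIteration on the very first call is a file object that is empty or has already been read to the end.

```python
import io

# io.StringIO stands in for the HTTP response; both are file-like iterators.
f = io.StringIO("<html>\n<body>\n</body>\n</html>\n")

first = next(f)            # reads one line, much like readline()
rest = f.read()            # consumes everything that is left
after_end = next(f, None)  # the default is returned instead of raising

try:
    next(f)                # without a default, the exhausted iterator raises
    raised = False
except StopIteration:
    raised = True

print(repr(first), after_end, raised)  # '<html>\n' None True
```

Passing a default to next() is a simple way to keep a scanning loop from failing when the input ends earlier than expected.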
There is no need to create a disk file, because the object returned by urlopen() can be read directly. Consider using an HTML parser to get the data you need.
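As one possible illustration of the parser suggestion, here is a minimal Python 3 sketch built on the standard library's html.parser module. The class name `CellCollector`, the helper `row_values`, and the `count` parameter are hypothetical names chosen for this example, and the `SAMPLE` string is the snippet from the question. With a live page, the text would come from something like `urllib.request.urlopen(url).read().decode(...)`, the Python 3 counterpart of `urllib.urlopen`, with no file on disk involved.

```python
from html.parser import HTMLParser

# The snippet from the question, held in memory rather than in a file.
SAMPLE = """<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


def row_values(html, label, count=3):
    """Return up to `count` cell texts that follow the cell equal to `label`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    if label not in cells:
        return []
    start = cells.index(label) + 1
    return cells[start:start + count]


values = row_values(SAMPLE, "Return on Average Equity")
print(values)     # ['16.08%', '10.58%', '11.71%']
print(values[1])  # 10.58%
```

Unlike the line-by-line scan, this does not depend on each value sitting on its own line, so it should tolerate changes in the page's whitespace. The `count=3` default is an assumption about how many figures follow the label.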
Oct 3 '07 #4
