python screen scraping/parsing

Hi...

got a short test app that i'm playing with. the goal is to get data off the
page in question.

basically, i should be able to get a list of "tr" nodes, and then to
iterate/parse them. i'm missing something, as i think i can get a single
node, but i can't figure out how to display the contents of the node.. nor
how to get the list of the "tr" nodes....

my test code is:
--------------------------------
#!/usr/bin/python
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
from mechanize import Browser
import mechanize
#import tidy
import os.path
import cookielib
from libxml2dom import Node
from libxml2dom import NodeList

########################
#
# Parse pricegrabber.com
########################
# datafile
tfile = open("price.dat", 'wr+')
efile = open("price_err.dat", 'wr+')
urlopen = urllib2.urlopen
##cj = urllib2.cookielib.LWPCookieJar()
Request = urllib2.Request
br = Browser()
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
url ="http://www.pricegrabber.com/rating_summary.php/page=1"

#=======================================
if __name__ == "__main__":
    # main app

    txdata = None

    #----------------------------
    # get the kentucky test pages

    #br.set_cookiejar(cj)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Firefox')]
    br.open(url)
    #cj.save(COOKIEFILE) # resave cookies

    res = br.response() # this is a copy of response
    s = res.read()

    # s contains HTML not XML text
    d = libxml2dom.parseString(s, html=1)

    print "d = d",d

    #get the input/text dialogs
    #tn1 = "//div[@id='main_content']/form[1]/input[position()=1]/@name"

    t1 = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody"
    tr = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"

    tr_=d.xpath(tr)

    print "len =",tr_[1].nodeValue

    print "fin"

-----------------------------------------------

my issue appears to be related to the last "tbody", or tbody/tr[4]...

if i leave off the tbody, i can display data, as the tr_ is an array with
data...

with the "tbody" it appears that the tr_ array is not defined, or it has no
data... however, i can use the DOM tool with firefox to observe the fact
that the "tbody" is there...

so.. what am i missing...
thoughts/comments are most welcome...

also, i'm willing to send a small amount via paypal!!

-bruce

Jun 27 '08 #1

BeautifulSoup is a pretty nice python module for screen scraping (not
necessarily well formed) web pages.
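
To make the "not necessarily well formed" point concrete, here is a minimal sketch of tolerant row extraction. It is not BeautifulSoup: it uses only the standard library's `html.parser` so it runs without extra installs, and it is written for Python 3 rather than the Python 2 of the original script. The `RowCollector` class and the `SAMPLE` markup are made up for illustration and are not taken from the pricegrabber page.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper: tolerates missing </td> and </tr> end tags by
    closing the open cell/row whenever the next one starts.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def close(self):
        super().close()
        self._close_row()  # flush a row left open at end of input


# Made-up, deliberately sloppy markup: several end tags are missing.
SAMPLE = "<table><tr><td>Acme<td>4.5<tr><td>Globex</td><td>3.9</table>"

parser = RowCollector()
parser.feed(SAMPLE)
parser.close()

for row in parser.rows:
    print(row)
```

Each entry of `parser.rows` is one "tr" with its cell text, which covers both halves of the question: getting the list of rows and displaying what is in them.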

On Fri, 13 Jun 2008 11:10:09 -0700, bruce wrote:
[...]
Jun 27 '08 #2
On 13 Jun, 20:10, "bruce" <bedoug...@earthlink.net> wrote:
> url ="http://www.pricegrabber.com/rating_summary.php/page=1"
[...]
> tr = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
>
> tr_=d.xpath(tr)
[...]
> my issue appears to be related to the last "tbody", or tbody/tr[4]...
>
> if i leave off the tbody, i can display data, as the tr_ is an array with
> data...

Yes, I can confirm this.

> with the "tbody" it appears that the tr_ array is not defined, or it has no
> data... however, i can use the DOM tool with firefox to observe the fact
> that the "tbody" is there...

Yes, but the DOM tool in Firefox probably inserts virtual nodes for
its own purposes. Remember that it has to do a lot of other stuff like
implement CSS rendering and DOM event models.

You can confirm that there really is no tbody by printing the result
of this...

d.xpath("/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]")[0].toString()

This should fetch the second table in a single element list and then
obviously give you the only element of that list. You'll see that the
raw HTML doesn't have any tbody tags at all.
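
The same effect can be reproduced offline. The sketch below is an assumption-laden stand-in, not the original setup: it uses Python 3's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of libxml2dom, and `SAMPLE` is a made-up, well-formed fragment that mimics a page served without any tbody tags. Only the two div ids are borrowed from the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the served markup: note there is no <tbody>.
SAMPLE = """
<html><body>
<div id="pgSiteContainer"><div id="pgPageContent">
<table><tr><td>first table</td></tr></table>
<table>
<tr><td>row 1</td></tr>
<tr><td>row 2</td></tr>
<tr><td>row 3</td></tr>
<tr><td>row 4</td></tr>
</table>
</div></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Path to the second table, relative to <html>.
base = "./body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]"

# Path copied from a browser DOM inspector: includes the implied tbody.
with_tbody = root.findall(base + "/tbody/tr")

# Path that matches the markup as actually served.
without_tbody = root.findall(base + "/tr")

print("rows via tbody/tr:", len(with_tbody))
print("rows via tr:", len(without_tbody))

# Display the contents of every row found.
for row in without_tbody:
    print([cell.text for cell in row.findall("td")])
```

Here the tbody query comes back as an empty list while the tbody-free query returns all four rows. Note also that a path ending in `tr[4]` can match at most one row under that table, so the result should be indexed with `[0]`, not `[1]`; dropping the `[4]` yields the full list of rows to iterate over.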

Paul
Jun 27 '08 #3
Dan Stromberg wrote:
> BeautifulSoup is a pretty nice python module for screen scraping (not
> necessarily well formed) web pages.

FYI: Mechanize includes all of BeautifulSoup's methods and adds additional
functionality (like forms handling).

Larry
Jun 27 '08 #4
