Hi...
I've got a short test app that I'm playing with. The goal is to get data off
the page in question.
Basically, I should be able to get a list of "tr" nodes and then
iterate/parse them. I'm missing something: I think I can get a single
node, but I can't figure out how to display the contents of the node, nor
how to get the list of "tr" nodes.
My test code is:
--------------------------------
#!/usr/bin/python
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
from mechanize import Browser
import mechanize
#import tidy
import os.path
import cookielib
from libxml2dom import Node
from libxml2dom import NodeList
########################
#
# Parse pricegrabber.com
########################
# datafile
tfile = open("price.dat", 'w+')       # 'wr+' is not a valid mode; 'w+' opens for read/write
efile = open("price_err.dat", 'w+')
urlopen = urllib2.urlopen
##cj = urllib2.cookielib.LWPCookieJar()
Request = urllib2.Request
br = Browser()
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name': 'Michael Foord',
           'location': 'Northampton',
           'language': 'Python'}
headers = { 'User-Agent' : user_agent }
url ="http://www.pricegrabber.com/rating_summary.php/page=1"
#=======================================
if __name__ == "__main__":
    # main app
    txdata = None
    #----------------------------
    # get the kentucky test pages
    #br.set_cookiejar(cj)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Firefox')]
    br.open(url)
    #cj.save(COOKIEFILE) # resave cookies
    res = br.response()  # this is a copy of the response
    s = res.read()
    # s contains HTML, not XML, text
    d = libxml2dom.parseString(s, html=1)
    print "d =", d
    # get the input/text dialogs
    #tn1 = "//div[@id='main_content']/form[1]/input[position()=1]/@name"
    t1 = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody"
    tr = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
    tr_ = d.xpath(tr)
    print "len =", tr_[1].nodeValue
    print "fin"
-----------------------------------------------
My issue appears to be related to the last "tbody", or tbody/tr[4].
If I leave off the tbody, I can display data, as tr_ is an array with
data.
With the "tbody", it appears that the tr_ array is not defined, or it has no
data. However, I can use the DOM tool in Firefox to observe that the
"tbody" is there.
So, what am I missing?
Thoughts/comments are most welcome.
Also, I'm willing to send a small amount via PayPal!
-bruce
BeautifulSoup is a pretty nice Python module for screen scraping (not
necessarily well-formed) web pages.
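For example, here is a minimal sketch of pulling the rows of the second
table with BeautifulSoup 3 (the 2008-era module); the URL and the table
index are taken from bruce's post, and the exact cells you get back depend
on the live page:
--------------------------------
#!/usr/bin/python
# sketch: scrape the "tr" rows with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.pricegrabber.com/rating_summary.php/page=1"
soup = BeautifulSoup(urllib2.urlopen(url).read())

# XPath's table[2] is 1-based, so it's index 1 in a Python list
table = soup.findAll('table')[1]
for row in table.findAll('tr'):
    # collect the text of every cell in the row
    cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
    print cells
--------------------------------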
On Fri, 13 Jun 2008 11:10:09 -0700, bruce wrote:
[bruce's original message, quoted in full; snipped]
On 13 Jun, 20:10, "bruce" <bedoug...@earthlink.net> wrote:
> url ="http://www.pricegrabber.com/rating_summary.php/page=1"
[...]
> tr = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
> tr_=d.xpath(tr)
[...]
> my issue appears to be related to the last "tbody", or tbody/tr[4]...
> if i leave off the tbody, i can display data, as the tr_ is an array with
> data...
Yes, I can confirm this.
with the "tbody" it appears that the tr_ array is not defined, or it has no
data... however, i can use the DOM tool with firefox to observe the fact
that the "tbody" is there...
Yes, but the DOM tool in Firefox probably inserts virtual nodes for
its own purposes. Remember that it has to do a lot of other stuff like
implement CSS rendering and DOM event models.
You can confirm that there really is no tbody by printing the result
of this...
d.xpath("/html/body/div[@id='pgSiteContainer']/
div[@id='pgPageContent']/table[2]")[0].toString()
This should fetch the second table in a single-element list and then give
you the only element of that list. You'll see that the raw HTML doesn't
have any tbody tags at all.
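Once you drop the tbody from the expression, iterating the rows is
straightforward. A minimal sketch, reusing the d already parsed in the
original code and the toString() serialization from the snippet above (the
path is the one from the original post):
--------------------------------
rows = d.xpath("/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tr")
print "row count =", len(rows)
for row in rows:
    # serialize each row back to HTML to inspect its contents
    print row.toString()
--------------------------------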
Paul
Dan Stromberg wrote:
> BeautifulSoup is a pretty nice python module for screen scraping (not
> necessarily well formed) web pages.
> On Fri, 13 Jun 2008 11:10:09 -0700, bruce wrote:
> [bruce's original message, quoted in full; snipped]
FYI: Mechanize includes all of BeautifulSoup's methods and adds additional
functionality (like forms handling).
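As a minimal sketch of the forms side of mechanize (the form index and the
field name below are hypothetical; take the real ones from the page you are
scraping):
--------------------------------
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.pricegrabber.com/rating_summary.php/page=1")

# hypothetical: select the first form on the page and fill in one field --
# use the form position and control names the actual page exposes
br.select_form(nr=0)
br["search"] = "widgets"
response = br.submit()
print response.read()[:200]
--------------------------------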
Larry