
RE: python screen scraping/parsing

Hi Paul...

Thanks for the reply. Came to the same conclusion a few minutes before I saw
your email.

Another question:

tr=d.xpath(foo)

gets me an array of nodes.

is there a way for me to then iterate through the node tr[x] to see if a
child node exists???

"d" is a document object, while "tr" would be a node object?, or would i
convert the "tr[x]" to a string, and then feed that into the
libxml2dom.parseString()...
thanks

-----Original Message-----
From: py*****************************************@python.org
[mailto:py*****************************************@python.org] On Behalf
Of Paul Boddie
Sent: Friday, June 13, 2008 12:49 PM
To: py*********@python.org
Subject: Re: python screen scraping/parsing
On 13 Jun, 20:10, "bruce" <bedoug...@earthlink.net> wrote:
>
> url ="http://www.pricegrabber.com/rating_summary.php/page=1"
> [...]
> tr =
> "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
>
> tr_=d.xpath(tr)
> [...]
> my issue appears to be related to the last "tbody", or tbody/tr[4]...
>
> if i leave off the tbody, i can display data, as the tr_ is an array with
> data...

Yes, I can confirm this.

> with the "tbody" it appears that the tr_ array is not defined, or it has no
> data... however, i can use the DOM tool with firefox to observe the fact
> that the "tbody" is there...

Yes, but the DOM tool in Firefox probably inserts virtual nodes for
its own purposes. Remember that it has to do a lot of other stuff like
implement CSS rendering and DOM event models.

You can confirm that there really is no tbody by printing the result
of this...

d.xpath("/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]")[0].toString()

This should fetch the second table as a single-element list and then
give you the only element of that list. You'll see that the
raw HTML doesn't have any tbody tags at all.
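
The effect described above can be reproduced without the live page. What follows is only a minimal sketch: the markup is a small hypothetical snippet rather than the real pricegrabber HTML, and the standard library's xml.etree.ElementTree stands in for libxml2dom (its findall supports only a subset of XPath, and paths are written relative to the root element). A path that includes tbody matches nothing, because the raw markup has no such element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw HTML: tables with no tbody element.
RAW = """<html><body>
<div id="pgSiteContainer"><div id="pgPageContent">
<table><tr><td>a</td></tr></table>
<table><tr><td>r1</td></tr><tr><td>r2</td></tr></table>
</div></div>
</body></html>"""

root = ET.fromstring(RAW)  # root is the <html> element

# Path to the second table, relative to <html>.
base = "./body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]"

with_tbody = root.findall(base + "/tbody/tr")   # tbody is not in the markup
without_tbody = root.findall(base + "/tr")      # rows sit directly under table

print(len(with_tbody))
print(len(without_tbody))

# Serialising the table shows the markup as parsed, with no tbody tags.
print(ET.tostring(root.find(base), encoding="unicode"))
```

A browser's DOM inspector would show a tbody for the same table, which is why a path copied from the inspector can fail against the raw document.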

Paul
--
http://mail.python.org/mailman/listinfo/python-list

Jun 27 '08 #1
1 Reply

On 13 Jun, 23:09, "bruce" <bedoug...@earthlink.net> wrote:
>
> Thanks for the reply. Came to the same conclusion a few minutes before I saw
> your email.
>
> Another question:
>
> tr=d.xpath(foo)
>
> gets me an array of nodes.
>
> is there a way for me to then iterate through the node tr[x] to see if a
> child node exists???

You can always use the DOM or perform another XPath query:

for node in tr[x].childNodes:
    pass  # do something with node

for node in tr[x].xpath(some_other_query_inside_tr):
    pass  # do something with node

> "d" is a document object, while "tr" would be a node object?, or would i
> convert the "tr[x]" to a string, and then feed that into the
> libxml2dom.parseString()...

There's no need to parse anything again: just use the methods on the
object that tr[x] produces, including the xpath method, of course.
Remember that the document object is just a special node object, so
most of the methods are available on both. If in doubt, run your
program using Python's -i option and then inspect the objects at the
interactive prompt.
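
As a rough illustration of checking for child nodes without re-parsing, here is a minimal sketch. It assumes a small hypothetical table in place of the real page and uses the standard library's xml.dom.minidom as a stand-in for libxml2dom, so there is no xpath method here; the helper names child_elements and has_child are made up for this example. With libxml2dom, the equivalent starting point would be the tr[x].childNodes and tr[x].xpath(...) calls shown earlier in this reply.

```python
from xml.dom import minidom, Node

# Hypothetical markup; in the thread the rows come from tr = d.xpath(foo).
doc = minidom.parseString(
    "<table><tr><td>1</td><td><a href='#'>x</a></td></tr><tr/></table>"
)
rows = doc.getElementsByTagName("tr")


def child_elements(node):
    """Return the element children of node, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


def has_child(node, tag):
    """Return True if node has a direct child element named tag."""
    return any(c.tagName == tag for c in child_elements(node))


for row in rows:
    # No re-parsing: each row is already a node with its own childNodes.
    print([c.tagName for c in child_elements(row)], has_child(row, "td"))

# The document is itself a node, so the same helper accepts it unchanged.
print([c.tagName for c in child_elements(doc)])
```

Saving this under any file name and running it with Python's -i option leaves doc and rows available at the interactive prompt for further inspection.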

Paul
Jun 27 '08 #2
