python - firefox dom/xpath question/issue

Hi.

Got a test web page, that basically has two "<html" tags in it. Examining
the page via Firefox/Dom Inspector, I can create a test xpath query
"/html/body/form" which gets the target form for the test.

The issue comes when I examine the page's source html. It looks like:
<html>
<body>
</body>
</html>

<html>
<body>
..
..
..
</body>
</html>

I've simplified things a bit... but basically, the 1st "html/body" is empty,
with the 2nd containing the data/nodes I need.

In using xpath("/html/body/form"), the app returns nothing/crashes.. I've
tried to do something like xpath("/html[position()=0]") as well with no
luck... It's as if xpath only looks at the 1st html that it sees in a given
page. I can't seem to find any docs for xpath to work around this. I'm using
the libxml2dom for python 2.5.1.

Any thoughts/comments...

If I comment out the 1st html section, things work as they should. The test
code is below...

thanks

------------------------------------------
#!/usr/bin/python
#
# test.py
#
# scrapes/extracts the basic data for the college
#
#
# the app gets/stores
# name
# url
# address (street/city/state)
# phone
#
########################################################################
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
from mechanize import Browser
import mechanize
#import tidy
import os.path
import cookielib
from libxml2dom import Node
from libxml2dom import NodeList
import subprocess
import time

########################
#
# Parse pricegrabber.com
########################
##cj = "p"
##COOKIEFILE = 'cookies.lwp'
#cookielib = 1
urlopen = urllib2.urlopen
Request = urllib2.Request
br = Browser()
br2 = Browser()

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name' : 'Michael Foord',
           'location' : 'Northampton',
           'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

url="http://schedule.psu.edu/"
#=======================================
if __name__ == "__main__":
    # main app

    txdata = None

    #----------------------------

    ##br.set_cookiejar(cj)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Firefox')]

    print "url =",url
    #br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    #print "slen=",len(s)
    tfile = open("/college/psu1.dat")
    s = tfile.read()
    print s
    # s contains HTML not XML text
    d=[]
    d = libxml2dom.parseString(s, html=1)
    print "d",d

    name_=[]
    len_=0

    br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    print "slen=",len(s)

    # s contains HTML not XML text
    #d=[]
    #d = libxml2dom.parseString(s, html=1)
    #print "d",d

    #name_ = d.xpath("//form")
    name_ = d.xpath("/html/body/form")
    len_ = len(name_)
    print "len=",len_

    print "name1",name_
    print "len",len(name_)
    #print "sdlfs"
    sys.exit()
    # else:
    # print "err in form_ID"
    print "here..."
Aug 25 '08 #1
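
The post mentions that commenting out the first html section makes the query work. A minimal sketch of doing that in code, assuming the wanted data always sits in the last "<html" block of the page. The helper name last_html_document and the SAMPLE_PAGE string are hypothetical, and the sketch is written for current Python with no third-party modules rather than for the Python 2.5.1 setup in the post:

```python
# Hypothetical stand-in for the page described in the post.
SAMPLE_PAGE = """<html>
<body>
</body>
</html>

<html>
<body>
<form name="search"></form>
</body>
</html>
"""


def last_html_document(markup):
    """Return the text starting at the last '<html' tag in markup.

    If no '<html' tag is found, the markup is returned unchanged.
    """
    # Search case-insensitively, since HTML tag names may be upper case.
    start = markup.lower().rfind("<html")
    if start == -1:
        return markup
    return markup[start:]


trimmed = last_html_document(SAMPLE_PAGE)
print(trimmed)
```

The trimmed string could then be handed to the parser in place of s in the test script. Separately, XPath positions are 1-based, so a predicate such as position()=0 never matches any node.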
bruce wrote:
> Hi.
>
> Got a test web page, that basically has two "<html" tags in it. Examining
> the page via Firefox/Dom Inspector, I can create a test xpath query
> "/html/body/form" which gets the target form for the test.
>
> The issue comes when I examine the page's source html. It looks like:
> <html>
> <body>
> </body>
> </html>
>
> <html>
> <body>
> .
> .
> .
> </body>
> </html>
>
> I've simplified things a bit... but basically, the 1st "html/body" is empty,
> with the 2nd containing the data/nodes I need.

If that's your document, it is invalid XML - XML only allows *one* root.
Thus the parser's failure isn't too surprising.

Try to wrap the whole document under an arbitrary root tag, and include
that as the first part of the xpath. See if that helps.

Diez
Aug 25 '08 #2
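
The wrapping suggestion can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree on current Python instead of libxml2dom, the root tag name "wrapper" and the SAMPLE markup are made up, and the sample is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two top-level <html> elements, as in the question.
SAMPLE = """<html>
<body>
</body>
</html>

<html>
<body>
<form name="search"><input name="q"/></form>
</body>
</html>
"""


def wrap_in_root(markup, root_tag="wrapper"):
    """Enclose the markup in one arbitrary root element."""
    return "<%s>%s</%s>" % (root_tag, markup, root_tag)


def find_forms(markup):
    """Parse the wrapped markup and return every html/body/form element."""
    root = ET.fromstring(wrap_in_root(markup))
    # The path is relative to the wrapper element, so it covers
    # both <html> children rather than only the first one.
    return root.findall("html/body/form")


forms = find_forms(SAMPLE)
print(len(forms), forms[0].get("name"))
```

With an XPath engine that takes absolute paths, the equivalent query would presumably start at the new root, for example "/wrapper/html/body/form".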
