python - firefox dom/xpath question/issue

bruce

Hi.

Got a test web page that basically has two "<html>" tags in it. Examining
the page in Firefox's DOM Inspector, I can create a test xpath query,
"/html/body/form", which gets the target form for the test.

The issue comes when I examine the page's source html. It looks like:
<html>
<body>
</body>
</html>

<html>
<body>
..
..
..
</body>
</html>

I've simplified things a bit... but basically, the 1st "html/body" is empty,
with the 2nd containing the data/nodes I need.

With xpath("/html/body/form"), the app returns nothing or crashes. I've
also tried something like xpath("/html[position()=0]"), with no luck. It's
as if xpath only looks at the 1st html that it sees in a given page. I can't
find any xpath docs that cover working around this. I'm using libxml2dom
with Python 2.5.1.

Any thoughts/comments...

If I comment out the 1st html section, things work as they should. The test
code is below...

thanks

------------------------------------------
#!/usr/bin/python
#
# test.py
#
# scrapes/extracts the basic data for the college
#
#
# the app gets/stores
# name
# url
# address (street/city/state)
# phone
#
########################################################################
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
from mechanize import Browser
import mechanize
#import tidy
import os.path
import cookielib
from libxml2dom import Node
from libxml2dom import NodeList
import subprocess
import time

########################
#
# Parse pricegrabber.com
########################
##cj = "p"
##COOKIEFILE = 'cookies.lwp'
#cookielib = 1
urlopen = urllib2.urlopen
Request = urllib2.Request
br = Browser()
br2 = Browser()

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

url="http://schedule.psu.edu/"
#=======================================
if __name__ == "__main__":
    # main app

    txdata = None

    #----------------------------

    ##br.set_cookiejar(cj)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Firefox')]

    print "url =",url
    #br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    #print "slen=",len(s)
    tfile = open("/college/psu1.dat")
    s = tfile.read()
    print s
    # s contains HTML not XML text
    d=[]
    d = libxml2dom.parseString(s, html=1)
    print "d",d

    name_=[]
    len_=0

    br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    print "slen=",len(s)

    # s contains HTML not XML text
    #d=[]
    #d = libxml2dom.parseString(s, html=1)
    #print "d",d

    #name_ = d.xpath("//form")
    name_ = d.xpath("/html/body/form")
    len_ = len(name_)
    print "len=",len_

    print "name1",name_
    print "len",len(name_)
    #print "sdlfs"
    sys.exit()
    # else:
    # print "err in form_ID"
    print "here..."

Aug 25 '08 #1

Diez B. Roggisch

bruce wrote:

> Hi.
>
> Got a test web page, that basically has two "<html" tags in it. Examining
> the page via Firefox/Dom Inspector, I can create a test xpath query
> "/html/body/form" which gets the target form for the test.
>
> The issue comes when I examine the page's source html. It looks like:
> <html>
> <body>
> </body>
> </html>
>
> <html>
> <body>
> .
> .
> .
> </body>
> </html>
>
> I've simplified things a bit... but basically, the 1st "html/body" is empty,
> with the 2nd containing the data/nodes I need.

If that's your document, it is invalid XML: XML allows only *one* root
element. So the parser's failure isn't too surprising.

Try wrapping the whole document in an arbitrary root tag, and include
that tag as the first step of the xpath. See if that helps.

Diez

Aug 25 '08 #2
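
The wrapping idea from the reply above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python 3's standard-library xml.etree.ElementTree instead of libxml2dom, the page string is a made-up stand-in for the real page, and "wrapper" is an arbitrary tag name. It also assumes the markup is well-formed enough to parse as XML and has no doctype or XML declaration, which would not survive being wrapped. Note that XPath positions start at 1, so a predicate like position()=0 can never match; the second html element is html[2].

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: two top-level <html> elements,
# the first one empty, the second one holding the form.
page = """<html>
<body>
</body>
</html>

<html>
<body>
<form name="search"><input name="q"/></form>
</body>
</html>"""

# Wrap everything in a single arbitrary root so the text is one document.
# "wrapper" is a made-up tag name, not something the page defines.
wrapped = "<wrapper>" + page + "</wrapper>"
root = ET.fromstring(wrapped)

# root is the <wrapper> element, so paths are written relative to it.
forms = root.findall("html/body/form")

# XPath positions are 1-based: html[1] is the empty one, html[2] has the form.
first_html_forms = root.findall("html[1]/body/form")
second_html_forms = root.findall("html[2]/body/form")

print("forms found:", len(forms))
print("forms under html[1]:", len(first_html_forms))
print("forms under html[2]:", len(second_html_forms))
```

With a full XPath engine the same query would be written as an absolute path starting at the added root, for example /wrapper/html[2]/body/form.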
