Bytes | Software Development & Data Engineering Community

How to use XML parsing tools on this one specific URL?

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.minidom.dom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris

Mar 4 '07 #1

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.minidom.dom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.
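The same forgiving-parser idea is also in today's standard library. As a minimal stdlib-only sketch (the class name and the sample markup are invented for illustration, not taken from the MSN page), this pulls table-cell text out of deliberately broken HTML without a single regex:

```python
from html.parser import HTMLParser

# Collect the text of every <td> cell, even from tag soup with
# unclosed tags; the stdlib parser never raises on bad markup.
class CellGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

# Deliberately ill-formed: an unclosed <td> and a stray </b>
soup = "<table><tr><td>Last Price<td><b>39.55</b></tr></table></b>"
p = CellGrabber()
p.feed(soup)
print(p.cells)  # ['Last Price', '39.55']
```

BeautifulSoup and tidy do much more recovery work than this, but for scraping a handful of labeled cells a handler class like the one above is often enough.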

Skip
Mar 4 '07 #2
"se******@spawar.navy.mil" <se******@spawar.navy.mil> writes:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:

Yes... And Microsoft is responsible for a lot of the ill-formed pages on the
web, be it on their website or made by their applications.

> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?

It all depends on what data you want. Probably a non-validating parser would
be able to extract some things. Another option is to pass the page through
some validator that can fix the page, like tidy...
--
Jorge Godoy <jg****@gmail.com>
Mar 4 '07 #3
In article <11**********************@i80g2000cwc.googlegroups.com>,
"se******@spawar.navy.mil" <se******@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup as others have suggested.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Mar 4 '07 #4
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!

> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import libxml2dom
import urllib
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.
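Once the DOM exists, extraction is plain DOM traversal. A sketch using the stdlib's xml.dom.minidom on a small, well-formed stand-in for the page (the fragment is invented; libxml2dom documents support the same getElementsByTagName-style access):

```python
import xml.dom.minidom

# A well-formed stand-in for the table of statistics on the page.
fragment = """<table>
  <tr><td>Last Price</td><td>39.55</td></tr>
  <tr><td>52 Week High</td><td>43.32</td></tr>
</table>"""

doc = xml.dom.minidom.parseString(fragment)

# Every <td> text, then pair label cells with value cells.
cells = [td.firstChild.data for td in doc.getElementsByTagName("td")]
stats = dict(zip(cells[0::2], cells[1::2]))
print(stats["Last Price"])  # 39.55
```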

Paul

Mar 4 '07 #5
On Mar 4, 11:42 am, "seber...@spawar.navy.mil"
<seber...@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
>
> Chris

Chris
How about a pyparsing hack instead? With English-readable expression
names and a few comments, I think this is fairly easy to follow. Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t: int(t[0]))
real = Combine(Word(nums) + Word(".", nums)).setParseAction(lambda t: float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart, tdEnd = map(Suppress, makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name, label, statExpr=real):
    if isinstance(statExpr, And):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t: None))

def bistatPattern(name, label, statExpr1=real, statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t: None))

stats = [
    statPattern("last", "Last Price"),
    statPattern("hi", "52 Week High"),
    statPattern("lo", "52 Week Low"),
    statPattern("vol", "Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk", "Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day", "50 Day Moving Average"),
    statPattern("movingAve_200day", "200 Day Moving Average"),
    statPattern("volatility", "Volatility (beta)"),
    bistatPattern("relStrength_last3", "Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6", "Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12", "Last 12 Months", pct, integer),
    bistatPattern("sales", "Sales", real + Suppress(dollarUnits), pct),
    bistatPattern("income", "Income", real + Suppress(dollarUnits), pct),
    bistatPattern("divRate", "Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield", "Dividend Yield", pct, pct),
    statPattern("curQtrEPSest", "Qtr(" + date + ") EPS Estimate"),
    statPattern("curFyEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("curPE", "Current P/E"),
    statPattern("fwdEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("fwdPE", "Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that
# we are positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum(statSearchPattern.searchString(stockHTML), ParseResults([]))

# print them out
print ticker.dump()
print ticker.last, ticker.hi, ticker.lo, ticker.vol, ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73

Mar 5 '07 #6
P.S. Please send me 1% of all the money you make from your automated-
stock speculation program. On the other hand, if you lose money with
your program, don't bother sending me a bill.

-- Paul

Mar 5 '07 #7
sk**@pobox.com wrote:
Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.minidom.dom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

> Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
> available.
ElementTree can also use BeautifulSoup:

http://effbot.org/zone/element-soup.htm

as noted on that page, tidy is a bit too picky for this kind of use; it's better suited
for "normalizing" HTML that you're producing yourself than for parsing arbitrary
HTML.

</F>

Mar 5 '07 #8
On 4 Mar, 20:21, Nikita the Spider <NikitaTheSpi...@gmail.com> wrote:
> In article <1173030156.276363.174...@i80g2000cwc.googlegroups.com>,
> > I can't validate it and xml.minidom.dom.parseString won't work on it.
> [...]
> Valid XHTML is scarcer than hen's teeth.
It probably doesn't need to be valid: being well-formed would be
sufficient for the operation of an XML parser, and for many
applications it'd be sufficient to consider the content as vanilla XML
without the XHTML overtones.
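The difference is easy to demonstrate: a fragment using a tag no XHTML DTD allows still parses fine as plain XML, while a single unclosed tag does not. A stdlib sketch (both fragments are invented):

```python
import xml.etree.ElementTree as ET

# <blink> is not valid XHTML, but the markup is well-formed,
# so an XML parser accepts it without a DTD in sight.
well_formed = "<div><blink>BBBY 39.55</blink></div>"
tree = ET.fromstring(well_formed)
print(tree[0].text)  # BBBY 39.55

# An unclosed tag breaks well-formedness, and the parse fails.
try:
    ET.fromstring("<div><p>oops</div>")
except ET.ParseError:
    print("not well-formed")
```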

Paul

Mar 5 '07 #9
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.

Interestingly, no-one mentioned lxml so far:

http://codespeak.net/lxml
http://codespeak.net/lxml/dev/parsing.html#parsers

Parse it as HTML and then use anything from XPath to XSLT to treat it.
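A minimal sketch of that parse-then-XPath flow, with a scrap of invented tag soup standing in for the page (assumes lxml is installed):

```python
from lxml import etree

# lxml's HTMLParser recovers from tag soup (unclosed <td>, no </tr>),
# and the resulting tree supports full XPath.
soup = "<table><tr><td>Last Price<td>39.55</table>"
root = etree.fromstring(soup, etree.HTMLParser())
print(root.xpath("//td/text()"))  # ['Last Price', '39.55']
```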

Have fun,
Stefan
Mar 5 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.
