Bytes | Software Development & Data Engineering Community

How to use XML parsing tools on this one specific URL?

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.minidom.dom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris

Mar 4 '07 #1

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.minidom.dom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.
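The same forgiving-parser idea is also in today's standard library. As a minimal stdlib-only sketch (the class name and the sample markup are invented for illustration, not taken from the MSN page), this pulls table-cell text out of deliberately broken HTML without a single regex:

```python
from html.parser import HTMLParser

# Collect the text of every <td> cell, even from tag soup with
# unclosed tags; the stdlib parser never raises on bad markup.
class CellGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

# Deliberately ill-formed: an unclosed <td> and a stray </b>
soup = "<table><tr><td>Last Price<td><b>39.55</b></tr></table></b>"
p = CellGrabber()
p.feed(soup)
print(p.cells)  # ['Last Price', '39.55']
```

BeautifulSoup and tidy do much more recovery work than this, but for scraping a handful of labeled cells a handler class like the one above is often enough.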

Skip
Mar 4 '07 #2
"se******@spawar.navy.mil" <se******@spawar.navy.mil> writes:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:

Yes... And Microsoft is responsible for a lot of the ill-formed pages on the
web, be it on their website or made by their applications.

> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?

It all depends on what data you want. Probably a non-validating parser would
be able to extract some things. Another option is to pass the page through
some validator that can fix the page, like tidy...
--
Jorge Godoy <jg****@gmail.com>
Mar 4 '07 #3
In article <11**********************@i80g2000cwc.googlegroups.com>,
"se******@spawar.navy.mil" <se******@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup as others have suggested.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Mar 4 '07 #4
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!

> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import libxml2dom
import urllib
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.
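Once the DOM exists, extraction is plain DOM traversal. A sketch using the stdlib's xml.dom.minidom on a small, well-formed stand-in for the page (the fragment is invented; libxml2dom documents support the same getElementsByTagName-style access):

```python
import xml.dom.minidom

# A well-formed stand-in for the table of statistics on the page.
fragment = """<table>
  <tr><td>Last Price</td><td>39.55</td></tr>
  <tr><td>52 Week High</td><td>43.32</td></tr>
</table>"""

doc = xml.dom.minidom.parseString(fragment)

# Every <td> text, then pair label cells with value cells.
cells = [td.firstChild.data for td in doc.getElementsByTagName("td")]
stats = dict(zip(cells[0::2], cells[1::2]))
print(stats["Last Price"])  # 39.55
```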

Paul

Mar 4 '07 #5
On Mar 4, 11:42 am, "seber...@spawar.navy.mil"
<seber...@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
>
> Chris

Chris
How about a pyparsing hack instead? With English-readable expression
names and a few comments, I think this is fairly easy to follow. Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t: int(t[0]))
real = Combine(Word(nums) + Word(".", nums)).setParseAction(lambda t: float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart, tdEnd = map(Suppress, makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name, label, statExpr=real):
    if isinstance(statExpr, And):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t: None))

def bistatPattern(name, label, statExpr1=real, statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t: None))

stats = [
    statPattern("last", "Last Price"),
    statPattern("hi", "52 Week High"),
    statPattern("lo", "52 Week Low"),
    statPattern("vol", "Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk", "Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day", "50 Day Moving Average"),
    statPattern("movingAve_200day", "200 Day Moving Average"),
    statPattern("volatility", "Volatility (beta)"),
    bistatPattern("relStrength_last3", "Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6", "Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12", "Last 12 Months", pct, integer),
    bistatPattern("sales", "Sales", real + Suppress(dollarUnits), pct),
    bistatPattern("income", "Income", real + Suppress(dollarUnits), pct),
    bistatPattern("divRate", "Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield", "Dividend Yield", pct, pct),
    statPattern("curQtrEPSest", "Qtr(" + date + ") EPS Estimate"),
    statPattern("curFyEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("curPE", "Current P/E"),
    statPattern("fwdEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("fwdPE", "Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that
# we are positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum(statSearchPattern.searchString(stockHTML), ParseResults([]))

# print them out
print ticker.dump()
print ticker.last, ticker.hi, ticker.lo, ticker.vol, ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73

Mar 5 '07 #6
P.S. Please send me 1% of all the money you make from your automated-
stock speculation program. On the other hand, if you lose money with
your program, don't bother sending me a bill.

-- Paul

Mar 5 '07 #7
sk**@pobox.com wrote:
Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.minidom.dom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

> Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
> available.
ElementTree can also use BeautifulSoup:

http://effbot.org/zone/element-soup.htm

as noted on that page, tidy is a bit too picky for this kind of use; it's better suited
for "normalizing" HTML that you're producing yourself than for parsing arbitrary
HTML.

</F>

Mar 5 '07 #8
On 4 Mar, 20:21, Nikita the Spider <NikitaTheSpi...@gmail.com> wrote:
> In article <1173030156.276363.174...@i80g2000cwc.googlegroups.com>,
> > I can't validate it and xml.minidom.dom.parseString won't work on it.
> [...]
> Valid XHTML is scarcer than hen's teeth.
It probably doesn't need to be valid: being well-formed would be
sufficient for the operation of an XML parser, and for many
applications it'd be sufficient to consider the content as vanilla XML
without the XHTML overtones.
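The difference is easy to demonstrate: a fragment using a tag no XHTML DTD allows still parses fine as plain XML, while a single unclosed tag does not. A stdlib sketch (both fragments are invented):

```python
import xml.etree.ElementTree as ET

# <blink> is not valid XHTML, but the markup is well-formed,
# so an XML parser accepts it without a DTD in sight.
well_formed = "<div><blink>BBBY 39.55</blink></div>"
tree = ET.fromstring(well_formed)
print(tree[0].text)  # BBBY 39.55

# An unclosed tag breaks well-formedness, and the parse fails.
try:
    ET.fromstring("<div><p>oops</div>")
except ET.ParseError:
    print("not well-formed")
```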

Paul

Mar 5 '07 #9
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.

Interestingly, no-one mentioned lxml so far:

http://codespeak.net/lxml
http://codespeak.net/lxml/dev/parsing.html#parsers

Parse it as HTML and then use anything from XPath to XSLT to treat it.
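A minimal sketch of that parse-then-XPath flow, with a scrap of invented tag soup standing in for the page (assumes lxml is installed):

```python
from lxml import etree

# lxml's HTMLParser recovers from tag soup (unclosed <td>, no </tr>),
# and the resulting tree supports full XPath.
soup = "<table><tr><td>Last Price<td>39.55</table>"
root = etree.fromstring(soup, etree.HTMLParser())
print(root.xpath("//td/text()"))  # ['Last Price', '39.55']
```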

Have fun,
Stefan
Mar 5 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.
