
How to use XML parsing tools on this one specific URL?

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.minidom.dom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris

Mar 4 '07 #1

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.minidom.dom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.
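For instance, a minimal BeautifulSoup sketch (assuming BeautifulSoup 3.x is
installed; it just dumps every table cell's text rather than targeting
specific report fields):

import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Rough sketch: let BeautifulSoup's tolerant parser chew on the page,
# then walk the <td> cells and print whatever text they contain.
page = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
soup = BeautifulSoup(page.read())
page.close()

for cell in soup.findAll("td"):
    text = "".join(cell.findAll(text=True)).strip()
    if text:
        print text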

Skip
Mar 4 '07 #2
"se******@spawa r.navy.mil" <se******@spawa r.navy.milwrite s:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
Yes... and Microsoft is responsible for a lot of the ill-formed pages on the
web, be it on their own site or in pages produced by their applications.
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
It all depends on what data you want. Probably a non-validating parser would
be able to extract some things. Another option is to pass the page through
some validator that can fix the page, like tidy...
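For example, a rough sketch of the tidy route (assuming the tidy command-line
tool is installed and on the PATH; this is an illustration, not part of the
original post):

import subprocess
import urllib
import xml.dom.minidom

# Sketch: run the page through HTML Tidy to get well-formed XHTML,
# then hand the cleaned-up output to a standard XML parser.
raw = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY").read()

# -asxhtml: emit XHTML; -numeric: use numeric entities (so an XML parser
# can cope); --force-output yes: emit a document even if tidy finds errors
p = subprocess.Popen(["tidy", "-asxhtml", "-numeric", "-quiet",
                      "--force-output", "yes"],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
cleaned, _ = p.communicate(raw)

doc = xml.dom.minidom.parseString(cleaned)
print len(doc.getElementsByTagName("td")), "table cells parsed"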
--
Jorge Godoy <jg****@gmail.c om>
Mar 4 '07 #3
In article <11**********************@i80g2000cwc.googlegroups.com>,
"se******@spawar.navy.mil" <se******@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup as others have suggested.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Mar 4 '07 #4
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
Yes, thank you Microsoft!
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import libxml2dom
import urllib
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.
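For example, something along these lines should then pull the cell text out of
that DOM (a sketch only; it assumes libxml2dom exposes the standard
getElementsByTagName/childNodes DOM calls and is not tailored to the page's
actual layout):

# Sketch: walk the libxml2dom document built above and print the text
# content of each <td> cell.
for cell in d.getElementsByTagName("td"):
    text = "".join(node.nodeValue for node in cell.childNodes
                   if node.nodeType == node.TEXT_NODE).strip()
    if text:
        print text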

Paul

Mar 4 '07 #5
On Mar 4, 11:42 am, "seber...@spawar.navy.mil"
<seber...@spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
>
> Chris
How about a pyparsing hack instead? With English-readable expression
names and a few comments, I think this is fairly easy to follow. Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t: int(t[0]))
real = Combine(Word(nums) + Word(".", nums)).setParseAction(lambda t: float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart, tdEnd = map(Suppress, makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name, label, statExpr=real):
    if isinstance(statExpr, And):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t: None))

def bistatPattern(name, label, statExpr1=real, statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t: None))

stats = [
    statPattern("last", "Last Price"),
    statPattern("hi", "52 Week High"),
    statPattern("lo", "52 Week Low"),
    statPattern("vol", "Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk", "Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day", "50 Day Moving Average"),
    statPattern("movingAve_200day", "200 Day Moving Average"),
    statPattern("volatility", "Volatility (beta)"),
    bistatPattern("relStrength_last3", "Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6", "Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12", "Last 12 Months", pct, integer),
    bistatPattern("sales", "Sales", real + Suppress(dollarUnits), pct),
    bistatPattern("income", "Income", real + Suppress(dollarUnits), pct),
    bistatPattern("divRate", "Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield", "Dividend Yield", pct, pct),
    statPattern("curQtrEPSest", "Qtr(" + date + ") EPS Estimate"),
    statPattern("curFyEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("curPE", "Current P/E"),
    statPattern("fwdEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("fwdPE", "Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that we are
# positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum(statSearchPattern.searchString(stockHTML), ParseResults([]))

# print them out
print ticker.dump()
print ticker.last, ticker.hi, ticker.lo, ticker.vol, ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73

Mar 5 '07 #6
P.S. Please send me 1% of all the money you make from your automated
stock-speculation program. On the other hand, if you lose money with
your program, don't bother sending me a bill.

-- Paul

Mar 5 '07 #7
sk**@pobox.com wrote:
> Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> Chris> I can't validate it and xml.minidom.dom.parseString won't work on
> Chris> it.
>
> Chris> If this was just some teenager's web site I'd move on. Is there
> Chris> any hope avoiding regular expression hacks to extract the data
> Chris> from this page?
>
> Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
> available.
ElementTree can also use BeautifulSoup:

http://effbot.org/zone/element-soup.htm

As noted on that page, tidy is a bit too picky for this kind of use; it's
better suited for "normalizing" HTML that you're producing yourself than for
parsing arbitrary HTML.
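For instance, a sketch of the ElementSoup route (assuming the ElementSoup
helper from the page above and BeautifulSoup are both importable, and that its
parse() returns an ElementTree element as that page describes):

import urllib
import ElementSoup  # the helper module from the effbot.org page above

# Sketch: parse the messy HTML via BeautifulSoup, but navigate the
# resulting tree through the ElementTree API.
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
root = ElementSoup.parse(f)
f.close()

for cell in root.findall(".//td"):
    if cell.text:
        print cell.text.strip()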

</F>

Mar 5 '07 #8
On 4 Mar, 20:21, Nikita the Spider <NikitaTheSpi...@gmail.com> wrote:
> In article <1173030156.276363.174...@i80g2000cwc.googlegroups.com>,
> > I can't validate it and xml.minidom.dom.parseString won't work on it.
> [...]
> Valid XHTML is scarcer than hen's teeth.
It probably doesn't need to be valid: being well-formed would be
sufficient for the operation of an XML parser, and for many
applications it'd be sufficient to consider the content as vanilla XML
without the XHTML overtones.

Paul

Mar 5 '07 #9
se******@spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
Interestingly, no-one mentioned lxml so far:

http://codespeak.net/lxml
http://codespeak.net/lxml/dev/parsing.html#parsers

Parse it as HTML and then use anything from XPath to XSLT to treat it.
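For instance, a minimal sketch of that approach (assuming lxml is installed;
the XPath here just dumps cell text rather than targeting specific report
fields):

import urllib
from lxml import etree

# Sketch: let lxml's forgiving HTML parser build a tree, then query it
# with XPath; XSLT would work on the same tree as well.
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
tree = etree.parse(f, etree.HTMLParser())
f.close()

for text in tree.xpath("//td/text()"):
    text = text.strip()
    if text:
        print text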

Have fun,
Stefan
Mar 5 '07 #10
