Bytes IT Community

Parsing Baseball Stats

I would like to parse a couple of tables within an individual player's
SHTML page. For example, I would like to get the "Actual Pitching
Statistics" and the "Translated Pitching Statistics" portions of Babe
Ruth's page (http://www.baseballprospectus.com/dt/ruthba01.shtml) and
store that info in a CSV file.

Also, I would like to do this for numerous players whose IDs I have
stored in a text file (e.g.: cobbty01, ruthba01, speaktr01, etc.).
These IDs should change the URL to get the corresponding player's
stats. Is this doable and if yes, how? I have only recently finished
learning Python (used the book: How to Think Like a Computer Scientist:
Learning with Python). Thanks for your help...

Jul 24 '06 #1
9 Replies


<an********@gmail.com> wrote in message
news:11**********************@p79g2000cwp.googlegroups.com...
Pyparsing and BeautifulSoup are both useful options to look into.
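For readers without either library, the same tag-stripping idea can be sketched with the standard library's html.parser (Python 3 shown; this is just a minimal stand-in, not the pyparsing or BeautifulSoup API):

```python
from html.parser import HTMLParser

class TextGrabber(HTMLParser):
    """Collect the text content of a page, one chunk per text node."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def page_text(html):
    """Return the non-blank text chunks of an HTML fragment, in order."""
    grabber = TextGrabber()
    grabber.feed(html)
    return grabber.chunks
```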

Also, take care not to run afoul of the terms of service for this site
(included below). A strict interpretation of them probably prohibits what
you intend to do.

-- Paul
Restrictions

You agree to not use the Service:

* to upload, post, email, transmit or otherwise make available (A) any
content that you do not have a right to make available and any commercial
publication or exploitation of the Service or any content provided in
connection therewith is specifically prohibited and anyone wishing to do so
must first request and receive prior written permission from PEV to do so;
(B) any content that infringes any patent, trademark, trade secret,
copyright or other proprietary rights ("Rights") of any party; (C) any
unsolicited or unauthorized advertising, promotional materials, "junk mail,"
"spam," "chain letters," "pyramid schemes," or any other form of
solicitation except for vendors so authorized to do so; or (D) any content
that is unlawful, harmful, threatening, abusive, harassing, tortious,
defamatory, vulgar, obscene, libelous, invasive of another's privacy,
hateful, or racially, ethnically or otherwise objectionable;

* to forge headers or otherwise manipulate identifiers in order to disguise
the origin of any content transmitted through or made available through the
Service;

* to collect or store personal data about other users; or deleting or
revising any content (including, but not limited to, legal notices) posted
by PEV or attempting to decipher, decompile, disassemble or reverse engineer
any of the software or content provided through, comprising or making up the
Service;

* to use or attempt to use any engine, software, tool, agent or other device
or mechanism (including without limitation browsers, spiders, robots,
avatars or intelligent agents) to navigate or search this Site other than
the search engine and search agents available from Experience on this Site
and other than generally available third-party web browsers (e.g., Netscape
Navigator, Microsoft Explorer); or

* to intentionally or unintentionally violate any applicable local, state,
national or international law.
Jul 24 '06 #2

Hi,

Below is your solution, ready to run. Put get_statistics () in a loop that feeds it the names from your file, makes an output file
name from each, and passes both 'statistics' and the output file name to file_statistics ().

Cheers,

Frederic
----- Original Message -----
From: <an********@gmail.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Monday, July 24, 2006 5:48 PM
Subject: Parsing Baseball Stats


import SE, urllib

Tag_Stripper = SE.SE ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=" ')
CSV_Maker = SE.SE (' "~\s+~=(9)" ')

# SE is the hacker's Swiss army knife. You find it in the Cheese Shop.
# It strips your tags and puts in the CSV separator, and if you needed other
# translations, it would do those too in two lines of code.
# If you don't want tabs, define the CSV_Maker accordingly, putting
# your separator in the place of '(9)':
# CSV_Maker = SE.SE ('"~\s+~=,"')   # Now it's a comma

def get_statistics (name_of_player):

    statistics = {
        # Uncomment those you want
        # 'Actual Batting Statistics' : [],
        'Actual Pitching Statistics' : [],
        # 'Advanced Batting Statistics' : [],
        'Advanced Pitching Statistics' : [],
        # 'Fielding Statistics as Center Fielder' : [],
        # 'Fielding Statistics as First Baseman' : [],
        # 'Fielding Statistics as Left Fielder' : [],
        # 'Fielding Statistics as Pitcher' : [],
        # 'Fielding Statistics as Right Fielder' : [],
        # 'Statistics as DH/PH/Other' : [],
        # 'Translated Batting Statistics' : [],
        # 'Translated Pitching Statistics' : [],
    }

    url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
    htm_page = urllib.urlopen (url)
    htm_lines = htm_page.readlines ()
    htm_page.close ()
    current_list = None
    for line in htm_lines:
        text_line = Tag_Stripper (line).strip ()
        if line.startswith ('<h3'):
            if statistics.has_key (text_line):
                current_list = statistics [text_line]
                current_list.append (text_line)
            else:
                current_list = None
        else:
            if current_list != None:
                if text_line:
                    current_list.append (CSV_Maker (text_line))

    return statistics

def show_statistics (statistics):
    for category in statistics:
        for record in statistics [category]:
            print record
        print

def file_statistics (file_name, statistics):
    f = file (file_name, 'w')   # 'wa' is not a valid mode; plain write is meant
    for category in statistics:
        f.write ('%s\n' % category)
        for line in statistics [category][1:]:
            f.write ('%s\n' % line)
    f.close ()
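The driver loop described above (one get_statistics () call per ID, one output file per player) could be sketched as follows. Here `get_stats` and `write_stats` stand in for the get_statistics ()/file_statistics () pair, and the `.csv` suffix is an assumption:

```python
def process_players(player_ids, get_stats, write_stats):
    """For each non-blank ID, fetch its statistics and write them to
    '<id>.csv'.  get_stats and write_stats are stand-ins for the
    get_statistics()/file_statistics() functions sketched above."""
    written = []
    for raw_id in player_ids:
        player_id = raw_id.strip()
        if not player_id:
            continue
        out_name = '%s.csv' % player_id
        write_stats(out_name, get_stats(player_id))
        written.append(out_name)
    return written
```

In the real script, `player_ids` would simply be the open ID file, e.g. `process_players(open('players.txt'), get_statistics, file_statistics)`.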
Jul 25 '06 #3

"Anthra Norell" <an***********@tiscalinet.ch> wrote in message
news:ma***************************************@python.org...
What Perlish line noise is this?

Tag_Stripper = SE.SE('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=" ')

A pyparsing version certainly won't fit in two lines, but at least I'll still
respect myself in the morning, *and* 6 months from now... (Unfortunately,
I'm loath to post the pyparsing program, given the target site's TOS -
sorry.)

-- Paul
Jul 25 '06 #4


----- Original Message -----
From: "Paul McGuire" <pt***@austin.rr._bogus_.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Tuesday, July 25, 2006 7:48 PM
Subject: Re: Parsing Baseball Stats

Paul,

I think self-respect is an essential attitude and I am pleased to know that you value it as much as I do.
The off topic thus dispatched, let me first express my appreciation of your interest. Next, I agree with anything you say about
pyparsing, because I know little about it. Perhaps you will agree that parsing is overkill if you don't need it. Correct me if I'm
wrong, but I have a hunch that HTML tags don't nest. So you can actually strip them with two or three lines of plain Python code,
looking at the letters one by one.
As to my proposal I can only invite you to run my program first and then pass judgment, unlike the clerical Authorities in the
story who steadfastly refused to take a peek through Galileo's telescope, arguing that, since planets didn't have rings, taking a
peek was an exercise of utter futility.
And do tell me what a TOS is. If I am missing a point it is rather this one. Does it have to do with copyright? How would it
be then if you did your pyparsing demo on a site with a more accommodating TOS?

Regards

Frederic
Jul 25 '06 #5

"Anthra Norell" <an***********@tiscalinet.ch> wrote in message
news:ma***************************************@python.org...

Frederic -

HTML parsing is one of those slippery slopes - or perhaps "tar babies" might
be a better metaphor - that starts out as a simple problem, but then one
exception after the next drags the solution out for daaaays. Probably once
or twice a week, there is a posting here from someone trying to extract data
from a website, usually something like trying to pull the href's out of some
random <a> ("a" is for "anchor") tags, usually followed by a well-meaning regexp
suggestion along the lines of "Use '<a href="[^"]*">'", followed by "oh
wait, I forgot to mention that sometimes there are other attributes in the
tag besides href, or sometimes the referenced url is not in double quotes,
but single quotes, or no quotes, or href is sometimes HREF, or Href, or
there are spaces around the '=' sign, or spaces before the closing ">", or
there are two spaces between the 'a' and 'href', followed by "no problem,
just use *this* regexp: (@#\$(#&$)q(\/\/\<*w&)(Q&!!!!". You are correct -
in general, HTML tags themselves do not nest, and this is certainly a
blessing. But HTML tags expressions *do* nest - lists within lists, tables
within tables, bold within italic, italic within bold, etc. And HTML, being
less well-behaved than XML, also allows tags to overlap, as in
<foo><bar></foo></bar>. Not to mention, some HTML contains comments, so
href's inside comments *probably* want to be ignored.
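That pitfall list is exactly what a dedicated parser absorbs for you. As a point of comparison (a stdlib sketch, not a pyparsing example), Python 3's html.parser already normalizes casing, quoting, attribute order, and stray whitespace:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, whatever the casing,
    quoting style, attribute order, or internal whitespace."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                      # tag names arrive lowercased
            for name, value in attrs:
                if name == 'href':          # attribute names too
                    self.hrefs.append(value)

def collect_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs
```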

So utilities like BeautifulSoup, targeted specifically at cracking
less-than-pristine HTML, come to the fore, to bring some order to this
parsing chaos. pyparsing is more generic than BeautifulSoup, and I believe
less HTML-capable, especially with overlapping tags. But it does include
some out-of-the-box helpers, such as makeHTMLTags("tagname"), which
generates opening and closing <tagname> and </tagname> tag expressions,
including handling for odd casing, unpredictable whitespace, and
unanticipated tag attributes. And for applications such as this, pyparsing
(I believe) provides some programming interfaces that are easier to deal
with than the BeautifulSoup accessors I've seen posted here.

I'm less and less open to the "parsing is overkill, I'll just slap together
a regexp" line of reasoning. With a little practice, pyparsing apps can
come together fairly quickly, and keep their readability for a long time. I
find regexp's, which I *don't* use every day, and whose typoglyphics quickly
bleed out of my frontal lobe, to lose their readability over time,
especially when the expression to be cracked contains any of the regex
special characters, namely (,),[,],.,/,^,$,+,*,?,\,... they become a riddle
of backslashes. The difficulty is that regexp's commingle their semantic
punctuation with the text they are trying to process, and it demands an
extra mental focus to decipher them (doubly so when trying to debug them!).

I try to save my (limited number of remaining) mentally focused moments for
things like overall system design, and so I wrote pyparsing to help separate
the syntax and mental processing used for text content vs. pattern
symbology. Pyparsing uses explicit Python classes, such as Optional,
OneOrMore, Group, Literal, and Word, with *some* operator shortcuts, such as
+ for And, | for match-first-alternative Or, ^ for match-longest-alternative
Or, and ~ for Not. Yes, the expressions are not as terse as regexps, and
they are not as fast at runtime. But they are quickly graspable at a
glance. Here is a BNF for part of the Verilog language, and the
corresponding code in the Verilog language parser:

<UDP> ::= primitive <name_of_UDP> ( <name_of_variable>
              <,<name_of_variable>>* ) ;
          <UDP_declaration>+
          <UDP_initial_statement>?
          <table_definition>
          endprimitive

udp = Group( "primitive" + identifier +
"(" + Group( delimitedList( identifier ) ) + ")" + semi +
OneOrMore( udpDeclaration ) +
Optional( udpInitialStmt ) +
udpTableDefn +
"endprimitive" )

and the udp expression is inherently tolerant of random embedded whitespace
(and comments too, as provided for later in the code).

Is it overkill to use this same tool to allow me to write:

anchorStart,anchorEnd = makeHTMLTags("a")
print [ t.href for t,s,e in anchorStart.scanString(htmlSourceText) ]

to list out the href's in an arbitrary body of HTML, independent of
whitespace, presence of other attributes, casing of tag and attribute names,
and other typical pitfalls of HTML parsing? (Note: this 2-liner does not
handle HTML comments, this would require at least one more line, something
like "anchorStart.ignore(htmlComment)", but I haven't tested this for this
mini-manifesto.)

Frederic, I'm sorry, but I probably won't get around to giving SE a try. I'm
not that interested in becoming proficient in writing (or even just being
able to read) expressions such as these:
Tag_Stripper = SE.SE ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=" ')
CSV_Maker = SE.SE (' "~\s+~=(9)" ')
CSV_Maker = SE.SE ('"~\s+~=,"')
If these make sense to you, more power to you. If I were to puzzle them out
this evening, the gift would be lost within a week. I can't afford to lose
my grasp of code I've written in so short a time.

As for "TOS", it is "Terms of Service," and it's kind of like copyright.
It's one of those unilateral contracts that basically says, "I offer this
information, but only if you follow my terms - such as, don't rip off my
data just to set up a competing service, or to sell the data in some
repackaged form - and if you don't like my terms, you're welcome to not use
my service" (liberally translated from the legalese). Friendly consumer
sites like Google, UPS, FedEx, Amazon all list some form of TOS stating
"you're welcome to use our website as long as it is within the bounds of
normal consumer use - and no bots, agents, etc. allowed, 'cause they'll
bring our server to its knees and the innocent bystander users end up losing
out because of your selfishness/greed/lack of consideration for your fellow
human person." As much as the content on the Internet is freely available,
it isn't always free. (Pyparsing comes with some simple HTML scrapers that
use public websites, such as NIST's listing of public NTP servers.)

So what started out as a little joke (microscopic, even) has eventually
touched a nerve, so thanks and apologies to those who have read this whole
mess. Frederic, SE looks like a killer - may it become the next regexp!

-- Paul

Jul 25 '06 #6


----- Original Message -----
From: "Paul McGuire" <pt***@austin.rr._bogus_.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Wednesday, July 26, 2006 1:01 AM
Subject: Re: Parsing Baseball Stats

Paul,

A year ago or so someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using
Python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the
random number generator was not to be used for cryptographic applications, since the doc specifically said so. I was also given good
advice on what to read.
I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem which I considered
to be the issue. I hoped the OP would come back with his opinion, but he didn't.
Not then and there. He did some time later, off list, telling me privately that he had incorporated my solution with some
adaptations and that it was exactly what he had been looking for.

So let me pursue this on two lines: A) your response and B) the issue.

A) I thank you for the considerable time you must have taken to explain pyparsing in such detail. I didn't know you're the author.
Congratulations! It certainly looks very professional. I have no doubt that it is an excellent and powerful tool.
Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it. But I don't believe
it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out
there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nonetheless.
We have little cause to assume that the OP is setting up a baseball information service and have much cause to assume that he is
not. So let us reserve the benefit of the doubt because this is what the others do. And work by plausible assumption--necessarily,
because the realm of certainty is too small an action base.
SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while
being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's
argument ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=" ') is not the frightful incarnation of a novel, yet more arcane regular expression
syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be
written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace the regex to identify it as such. The equal sign says replace
what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (tags).
That's all. SE allows--encourages--to break down a complex search into any number of simple components.
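Written as or-ed alternatives, the same deletions can be approximated with the standard re module (a sketch only; like the original patterns, it is a tag stripper, not an HTML parser):

```python
import re

# Comments are matched first so that tags inside a comment disappear
# with the comment; re.DOTALL lets both patterns span line breaks.
MARKUP = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)

def strip_markup(text):
    """Replace every comment or tag with nothing, i.e. delete it."""
    return MARKUP.sub('', text)
```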
(Having just said 'easy to use' I notice a mistake. I correct it below in section C.)

B) I would welcome the OP's opinion.

Regards

Frederic
C) Correction: The second and third expression were meant to catch tags spanning lines. There weren't any such tags and so the
expressions were useless--and inoffensive too: the second one, as a matter of fact, could also delete text. The Tag Stripper should
be defined like this:

Tag_Stripper = SE.SE ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')

It now deletes tags even if they span lines and it incorporates a second definition that deletes comments which, as you made me
aware, may contain tags. I now have to run the whole file through this before I look at the lines.

import StringIO   # needed for the buffering below

def get_statistics (name_of_player):

    statistics = {
        'Actual Pitching Statistics' : [],
        'Advanced Pitching Statistics' : [],
    }

    url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
    htm_page = urllib.urlopen (url)
    lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
    htm_page.close ()
    current_list = None
    for line in lines:
        line = line.strip ()
        if line == '':
            continue
        if 'Statistics' in line:   # That's the section headings.
            if statistics.has_key (line):
                current_list = statistics [line]
                current_list.append (line)
            else:
                current_list = None
        else:
            if current_list != None:
                current_list.append (CSV_Maker (line))

    return statistics
show_statistics (statistics) displays this tab-delimited CSV:

Advanced Pitching Statistics
AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10

Actual Pitching Statistics
AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17

(The last line remains to be shifted three columns to the right.)
Jul 26 '06 #7

Frederic,

Thanks for posting the solution. I used the original solution you
posted and it worked beautifully.

Paul,

I understand your concern for the site's TOS. Although this may not
mean anything, the reason I wanted this "parser" was to get the
Advanced and Translated Stats for personal use. I don't have any
commercial motives; playing with baseball stats is my hobby.
The site does allow one to download stuff for personal use, which I
abide by. Also, I am only looking to get the aforementioned stats for
some players. The site has player pages for over 16,000 players. I
think it would be unfair to the site owners if I went to download all
16,000 players using the script. In the end, they might just move the
stats in to their premium package (not free) and then I would be really
screwed.

So, I understand your concerns and thank you for posting them.

Ankit

Anthra Norell wrote:
----- Original Message -----
From: "Paul McGuire" <pt***@austin.rr._bogus_.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Wednesday, July 26, 2006 1:01 AM
Subject: Re: Parsing Baseball Stats

"Anthra Norell" <an***********@tiscalinet.chwrote in message
news:mailman.8551.1153861590.27775.py*********@pyt hon.org...
>
snip
>
Frederic -

HTML parsing is one of those slippery slopes - or perhaps "tar babies" might
be a better metaphor - that starts out as a simple problem, but then one
exception after the next drags the solution out for daaaays. Probably once
or twice a week, there is a posting here from someone trying to extract data
from a website, usually something like trying to pull the href's out of some

snip
So what started out as a little joke (microscopic, even) has eventually
touched a nerve, so thanks and apologies to those who have read this whole
mess. Frederic, SE looks like a killer - may it become the next regexp!

-- Paul

Paul,

A year ago or so someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using
python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the
random number generator was not to be used for cryptographic applications, since the doc specifically said so. I was also given good
advice on what to read.
I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem which I considered
to be the issue. I hoped the OP would come back with his opinion, but he didn't.
Not then and there. He did some time later, off list, telling me privately that he had incorporated my solution with some
adaptations and that it was exactly what he had been looking for.

So let me pursue this on two lines: A) your response and B) the issue.

A) I thank you for the considerable time you must have taken to explain pyparse in such detail. I didn't know you're the author.
Congratulations! It certainly looks very professional. I have no doubt that it is an excellent and powerful tool.
Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it. But I don't believe
it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out
there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nontheless.
We have little cause to assume that the OP is setting up a baseball information service and have much cause to assume that he is
not. So let us reserve the benefit of the doubt because this is what the others do. And work by plausible assumption--necessarily,
because the realm of certainty is too small an action base.
SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while
being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's
argument ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~=") is not the frightful incarnation of a novel, yet more arcane regular expression
syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be
written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace the regex to identify it as such. The equal sign says replace
what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (tags).
That's all. SE allows--encourages--to break down a complex search into any number of simple components.
(Having just said 'easy to use' I notice a mistake. I correct it below in section C.)

B) I would welcome the OP's opinion.

Regards

Frederic
C) Correction: The second and third expression were meant to catch tags spanning lines. There weren't any such tags and so the
expressions were useless--and inoffensive too: the second one, as a matter of fact, could also delete text. The Tag Stripper should
be defined like this:

Tag_Stripper = ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')

It now deletes tags even if they span lines and it incorporates a second definition that deletes comments which, as you made me
aware, may contain tags. I now have to run the whole file through this before I look at the lines.

def get_statistics (name_of_player):

statistics = {
'Actual Pitching Statistics' : [],
'Advanced Pitching Statistics' : [],
}

url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
htm_page = urllib.urlopen (url)
lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
htm_page.close ()
current_list = None
for line in lines:
line = line.strip ()
if line == '':
continue
if 'Statistics' in line: # That's the section headings.
if statistics.has_key (line):
current_list = statistics [line]
current_list.append (line)
else:
current_list = None
else:
if current_list != None:
current_list.append (CSV_Maker (line))

return statistics
show_statistics (statistics) displays this tab-delimited CSV:

Advanced Pitching Statistics
AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10

Actual Pitching Statistics
AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17

(The last line remains to be shifted three columns to the right.)
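That shift is straightforward when writing the file: pad the career-totals row with three empty fields so it lines up under the W, L, SV, ... columns. Here is a minimal sketch using the standard csv module (the sample row is truncated, and the snippet uses modern Python purely for illustration):

```python
import csv
import io

career = ["94", "46", "4", "2.28"]  # truncated career-totals row

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
# three leading blanks stand in for the missing AGE, YEAR, TEAM columns
writer.writerow(["", "", ""] + career)
print(buf.getvalue())
```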
Jul 26 '06 #8

P: n/a
"Ankit" <an********@gmail.com> wrote in message
news:11*********************@h48g2000cwc.googlegroups.com...
Frederic,

Thanks for posting the solution. I used the original solution you
posted and it worked beautifully.

Paul,

I understand your concern for the site's TOS. Although, this may not
mean anything, the reason I wanted this "parser" was because I wanted
to get the Advanced, and Translated Stats for personal use. I don't
have any commercial motives but play with baseball stats is my hobby.
The site does allow one to download stuff for personal use, which I
abide by. Also, I am only looking to get the aforementioned stats for
some players. The site has player pages for over 16,000 players. I
think it would be unfair to the site owners if I went to download all
16,000 players using the script. In the end, they might just move the
stats into their premium package (not free) and then I would be really
screwed.

So, I understand your concerns and thank you for posting them.

Ankit
Frederic and Ankit -

I guess you may have caught me in a more-than-curmudgeon-ly mood. Thanks
for giving me the benefit of the doubt.

I guess I should put more faith in our "consenting adults" environment - if
someone wants to use posted code to create a bot or virus or TOS-violating
web page scraper, that is their business, not mine. I've noticed that the
esteemed C. Titus Brown in his twill intro gives an example violating
Google's TOS, but at least he gives a suitable admonition in the code to the
effect of "this is just an example, but don't do it."

So in that spirit, for EDUCATION AND PERSONAL USE PURPOSES ONLY, here is a
pyparsing rendition that processes the HTML of the previously cited web
site. Ankit, you already know the suitable url's to use for this, so I
don't need to post them again (in a weak attempt to shield that web site
from casual slamming).

At first glance, this is *way* more complicated than Frederic's SE-based
solution. The catch is that the pattern we are keying off of has a lot of
HTML junk in it. Frederic just dumps it on the floor, and really this
program doesn't do much more with it. Note that we suppress almost all of
the parsed HTML tags, which is just pyparsing's way of saying "don't need
this...", but the tag expression still needs to be included in the pattern
we are scanning for.

There are a couple of beyond-beginner pyparsing techniques in this example:
- Using a parse action to reject text that matches syntax, but not
semantics. In this case, we reject <h3> tags that don't have the right
section name. From a parsing standpoint, all <h3>'s match the h3Start
expression, so we attach a parse action to perform the additional filtering.
- Using Dict is always kind of magic. At parse time, the Dict class
instructs the parser to build a dict-style result, use the first token in
each matched group as a key, and the remainder as the value. This gives us
a keyed lookup by age to the yearly stats values.
- We have to stop reading stats at the line break, so we first check if we
are not at the end-of-line before accepting the next number. That is why
the expression reads "OneOrMore(~lineEnd + number)" to parse in the actual
statistics values.
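The parse-action technique in the first bullet can be shown in isolation. This sketch is not from the thread; the grammar and the section names are invented, but the mechanism (raising ParseException inside a parse action so that searchString skips the match) is the same one the onlyAcceptSectionNamed filter below relies on:

```python
from pyparsing import Word, alphas, ParseException

heading = Word(alphas)

# reject any word that is not one of the section names we want;
# raising ParseException turns a syntactic match into a parse failure
def only_sections(tokens):
    if tokens[0] not in ("Actual", "Translated"):
        raise ParseException("", 0, "not a section we want")

heading.setParseAction(only_sections)

# searchString silently skips spans whose parse action raised
hits = heading.searchString("Actual Career Translated Other")
print([h[0] for h in hits])
```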

Once the parsing is done, I go through a little extra work showing different
ways to get at the parsed results. pyparsing does much more than just
return nested lists of strings. In this case, we are associating field
names with some content, and also dynamically generating dict-style access
to statistics by age. Finally, there is also the output to CSV format,
which was the original intent.

I think that as HTML-scraping apps go, this is fairly typical for a
pyparsing approach. The feedback I get is that people take an hour or two
getting their programs just the way they want them, but then the resulting
code is pretty robust over time, as minor page changes or enhancements
require simple updates to the scraper, if any. For instance, if new stat
columns were added to this page, there would be *no* change to the parser.

Anyway, here is the pyparsing datapoint for your comparison.

-- Paul
(... and what was Babe Ruth doing between the ages of 26 and 35? Did he
retire for 9 years and then come back?)

from pyparsing import *
import urllib

playerURL = "http://rest_of_URL_goes_here"

# define start/end HTML tags for key items
# makeHTMLTags takes care of unexpected attributes, whitespace, case, etc.
h3Start,h3End = makeHTMLTags("h3")
aStart,aEnd = makeHTMLTags("a")
preStart,preEnd = makeHTMLTags("pre")
aStart = aStart.suppress()
aEnd = aEnd.suppress()
preStart = preStart.suppress()
preEnd = preEnd.suppress()

# spell out some of the specific HTML patterns we are looking for
sectionStart = (h3Start + aStart + SkipTo(aEnd).setResultsName("section") + aEnd + h3End) | \
               (h3Start + SkipTo(h3End).setResultsName("section") + h3End)
sectionHeading = OneOrMore(aStart + SkipTo(aEnd) + aEnd).setResultsName("statsNames")
sectionHeading2 = OneOrMore(~lineEnd + Word(alphanums.upper()+"/")).setResultsName("statsNames")

integer = Combine(Optional("-") + Word(nums))
real = Combine(Optional("-") + Optional(Word(nums)) + "." + Word(nums))
number = real | integer
teamName = Word(alphas.upper() + "_-")

# create parse action that will filter for sections of a particular name
wrongSectionName = ParseException("", 0, "")
def onlyAcceptSectionNamed(sec):
    def parseAction(tokens):
        if tokens.section != sec:
            raise wrongSectionName
    return parseAction

import pprint

def getStatistics(url):
    htm_page = urllib.urlopen(url)
    htm_lines = htm_page.read()
    htm_page.close()

    actualPitchingStats = \
        sectionStart.copy().setParseAction(onlyAcceptSectionNamed("Actual Pitching Statistics ")) + \
        preStart + \
        sectionHeading + \
        Dict(OneOrMore(Group(integer + aStart.suppress() + integer + teamName + aEnd.suppress() +
             OneOrMore(~lineEnd + number).setResultsName("stats")))).setResultsName("statsByAge") + \
        Group(OneOrMore(number)).setResultsName("careerStats") + preEnd
    aps = actualPitchingStats.searchString(htm_lines)[0]

    translatedPitchingStats = \
        sectionStart.copy().setParseAction(onlyAcceptSectionNamed("Translated Pitching Statistics")) + \
        preStart + lineEnd + \
        sectionHeading2 + \
        Dict(OneOrMore(Group(integer + aStart.suppress() + integer + teamName + aEnd.suppress() +
             OneOrMore(~lineEnd + number).setResultsName("stats")))).setResultsName("statsByAge") + \
        Suppress("Career") + Group(OneOrMore(number)).setResultsName("careerStats") + preEnd
    tps = translatedPitchingStats.searchString(htm_lines)[0]

    # examples of accessing data fields in returned parse results
    for res in (aps, tps):
        print res.section
        print '-'*len(res.section.rstrip())
        for k in res.keys():
            print "- %s: %s" % (k, res[k])
        # career stats don't have age, year, or team name, so skip over those stats names
        pprint.pprint(zip(res.statsNames[3:], res.careerStats))
        print
        # print stats for year at age 24
        # by-age stats don't include age, so skip over first stats name
        pprint.pprint(zip(res.statsNames[1:], res.statsByAge["24"]))
        print
        # output CSV-style data, for each year and then for career
        for yearlyStats in res.statsByAge:
            print ", ".join(yearlyStats)
        print " , , ,", ", ".join(res.careerStats)
        print

getStatistics(playerURL)

Gives this output:

Actual Pitching Statistics
--------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '2', '1', '0', '3.91', '4', '3',
'96', '23.0', '21', '12', '10', '1', '7', '3', '0', '0', '0', '0', '1',
'0'], ['20', '1915', 'BOS-A', '18', '8', '0', '2.44', '32', '28', '874',
'217.7', '166', '80', '59', '3', '85', '112', '6', '0', '9', '1', '16',
'1'], ['21', '1916', 'BOS-A', '23', '12', '1', '1.75', '44', '41', '1272',
'323.7', '230', '83', '63', '0', '118', '170', '8', '0', '3', '1', '23',
'9'], ['22', '1917', 'BOS-A', '24', '13', '2', '2.01', '41', '38', '1277',
'326.3', '244', '93', '73', '2', '108', '128', '11', '0', '5', '0', '35',
'6'], ['23', '1918', 'BOS-A', '13', '7', '0', '2.22', '20', '19', '660',
'166.3', '125', '51', '41', '1', '49', '40', '2', '0', '3', '1', '18', '1'],
['24', '1919', 'BOS-A', '9', '5', '1', '2.97', '17', '15', '570', '133.3',
'148', '59', '44', '2', '58', '30', '2', '0', '5', '1', '12', '0'], ['25',
'1920', 'NY_-A', '1', '0', '0', '4.50', '1', '1', '17', '4.0', '3', '4',
'2', '0', '2', '0', '0', '0', '0', '0', '0', '0'], ['26', '1921', 'NY_-A',
'2', '0', '0', '9.00', '2', '1', '49', '9.0', '14', '10', '9', '1', '9',
'2', '0', '0', '0', '0', '0', '0'], ['35', '1930', 'NY_-A', '1', '0', '0',
'3.00', '1', '1', '39', '9.0', '11', '3', '3', '0', '2', '3', '0', '0', '0',
'0', '1', '0'], ['38', '1933', 'NY_-A', '1', '0', '0', '5.00', '1', '1',
'42', '9.0', '12', '5', '5', '0', '3', '0', '0', '0', '0', '0', '1', '0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Actual Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'W', 'L', 'SV', 'ERA', 'G', 'GS',
'TBF', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'SO', 'HBP', 'IBB', 'WP', 'BK',
'CG', 'SHO']
- careerStats: ['94', '46', '4', '2.28', '163', '148', '4896', '1221.3',
'974', '400', '309', '10', '441', '488', '29', '0', '25', '4', '107', '17']
- class: cardsect
- empty: False
[('W', '94'),
('L', '46'),
('SV', '4'),
('ERA', '2.28'),
('G', '163'),
('GS', '148'),
('TBF', '4896'),
('IP', '1221.3'),
('H', '974'),
('R', '400'),
('ER', '309'),
('HR', '10'),
('BB', '441'),
('SO', '488'),
('HBP', '29'),
('IBB', '0'),
('WP', '25'),
('BK', '4'),
('CG', '107'),
('SHO', '17')]

[('YEAR', '1919'),
('TEAM', 'BOS-A'),
('W', '9'),
('L', '5'),
('SV', '1'),
('ERA', '2.97'),
('G', '17'),
('GS', '15'),
('TBF', '570'),
('IP', '133.3'),
('H', '148'),
('R', '59'),
('ER', '44'),
('HR', '2'),
('BB', '58'),
('SO', '30'),
('HBP', '2'),
('IBB', '0'),
('WP', '5'),
('BK', '1'),
('CG', '12'),
('SHO', '0')]

19, 1914, BOS-A, 2, 1, 0, 3.91, 4, 3, 96, 23.0, 21, 12, 10, 1, 7, 3, 0, 0, 0, 0, 1, 0
20, 1915, BOS-A, 18, 8, 0, 2.44, 32, 28, 874, 217.7, 166, 80, 59, 3, 85, 112, 6, 0, 9, 1, 16, 1
21, 1916, BOS-A, 23, 12, 1, 1.75, 44, 41, 1272, 323.7, 230, 83, 63, 0, 118, 170, 8, 0, 3, 1, 23, 9
22, 1917, BOS-A, 24, 13, 2, 2.01, 41, 38, 1277, 326.3, 244, 93, 73, 2, 108, 128, 11, 0, 5, 0, 35, 6
23, 1918, BOS-A, 13, 7, 0, 2.22, 20, 19, 660, 166.3, 125, 51, 41, 1, 49, 40, 2, 0, 3, 1, 18, 1
24, 1919, BOS-A, 9, 5, 1, 2.97, 17, 15, 570, 133.3, 148, 59, 44, 2, 58, 30, 2, 0, 5, 1, 12, 0
25, 1920, NY_-A, 1, 0, 0, 4.50, 1, 1, 17, 4.0, 3, 4, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0
26, 1921, NY_-A, 2, 0, 0, 9.00, 2, 1, 49, 9.0, 14, 10, 9, 1, 9, 2, 0, 0, 0, 0, 0, 0
35, 1930, NY_-A, 1, 0, 0, 3.00, 1, 1, 39, 9.0, 11, 3, 3, 0, 2, 3, 0, 0, 0, 0, 1, 0
38, 1933, NY_-A, 1, 0, 0, 5.00, 1, 1, 42, 9.0, 12, 5, 5, 0, 3, 0, 0, 0, 0, 0, 1, 0
 , , , 94, 46, 4, 2.28, 163, 148, 4896, 1221.3, 974, 400, 309, 10, 441, 488, 29, 0, 25, 4, 107, 17

Translated Pitching Statistics
------------------------------
- endH3: </h3>
- statsByAge: [['19', '1914', 'BOS-A', '20.0', '19', '15', '5', '6', '0',
'4', '6.75', '1', '1', '0', '8.6', '2.2', '2.7', '1.8'], ['20', '1915',
'BOS-A', '191.3', '163', '87', '24', '74', '6', '134', '4.09', '13', '9',
'0', '7.7', '1.1', '3.5', '6.3'], ['21', '1916', 'BOS-A', '274.0', '212',
'82', '21', '101', '9', '212', '2.69', '22', '8', '1', '7.0', '.7', '3.3',
'7.0'], ['22', '1917', 'BOS-A', '277.3', '239', '107', '29', '98', '13',
'178', '3.47', '20', '11', '2', '7.8', '.9', '3.2', '5.8'], ['23', '1918',
'BOS-A', '149.0', '128', '69', '19', '51', '3', '65', '4.17', '9', '8', '0',
'7.7', '1.1', '3.1', '3.9'], ['24', '1919', 'BOS-A', '123.3', '147', '65',
'14', '59', '3', '47', '4.74', '7', '6', '1', '10.7', '1.0', '4.3', '3.4'],
['25', '1920', 'NY_-A', '3.3', '3', '4', '0', '2', '0', '0', '10.80', '0',
'1', '0', '8.1', '.0', '5.4', '.0'], ['26', '1921', 'NY_-A', '7.7', '10',
'9', '2', '9', '0', '3', '10.57', '0', '1', '0', '11.7', '2.3', '10.6',
'3.5'], ['35', '1930', 'NY_-A', '8.7', '11', '3', '0', '2', '0', '4',
'3.12', '1', '0', '0', '11.4', '.0', '2.1', '4.2'], ['38', '1933', 'NY_-A',
'8.7', '15', '6', '0', '3', '0', '1', '6.23', '0', '1', '0', '15.6', '.0',
'3.1', '1.0']]
- startH3: ['h3', ['class', 'cardsect'], False]
- section: Translated Pitching Statistics
- statsNames: ['AGE', 'YEAR', 'TEAM', 'IP', 'H', 'ER', 'HR', 'BB', 'HBP',
'SO', 'ERA', 'W', 'L', 'SV', 'H/9', 'HR/9', 'BB/9', 'SO/9']
- careerStats: ['1063.3', '947', '447', '114', '405', '34', '648', '3.78',
'73', '46', '6', '8.0', '1.0', '3.4', '5.5']
- class: cardsect
- empty: False
[('IP', '1063.3'),
('H', '947'),
('ER', '447'),
('HR', '114'),
('BB', '405'),
('HBP', '34'),
('SO', '648'),
('ERA', '3.78'),
('W', '73'),
('L', '46'),
('SV', '6'),
('H/9', '8.0'),
('HR/9', '1.0'),
('BB/9', '3.4'),
('SO/9', '5.5')]

[('YEAR', '1919'),
('TEAM', 'BOS-A'),
('IP', '123.3'),
('H', '147'),
('ER', '65'),
('HR', '14'),
('BB', '59'),
('HBP', '3'),
('SO', '47'),
('ERA', '4.74'),
('W', '7'),
('L', '6'),
('SV', '1'),
('H/9', '10.7'),
('HR/9', '1.0'),
('BB/9', '4.3'),
('SO/9', '3.4')]

19, 1914, BOS-A, 20.0, 19, 15, 5, 6, 0, 4, 6.75, 1, 1, 0, 8.6, 2.2, 2.7, 1.8
20, 1915, BOS-A, 191.3, 163, 87, 24, 74, 6, 134, 4.09, 13, 9, 0, 7.7, 1.1, 3.5, 6.3
21, 1916, BOS-A, 274.0, 212, 82, 21, 101, 9, 212, 2.69, 22, 8, 1, 7.0, .7, 3.3, 7.0
22, 1917, BOS-A, 277.3, 239, 107, 29, 98, 13, 178, 3.47, 20, 11, 2, 7.8, .9, 3.2, 5.8
23, 1918, BOS-A, 149.0, 128, 69, 19, 51, 3, 65, 4.17, 9, 8, 0, 7.7, 1.1, 3.1, 3.9
24, 1919, BOS-A, 123.3, 147, 65, 14, 59, 3, 47, 4.74, 7, 6, 1, 10.7, 1.0, 4.3, 3.4
25, 1920, NY_-A, 3.3, 3, 4, 0, 2, 0, 0, 10.80, 0, 1, 0, 8.1, .0, 5.4, .0
26, 1921, NY_-A, 7.7, 10, 9, 2, 9, 0, 3, 10.57, 0, 1, 0, 11.7, 2.3, 10.6, 3.5
35, 1930, NY_-A, 8.7, 11, 3, 0, 2, 0, 4, 3.12, 1, 0, 0, 11.4, .0, 2.1, 4.2
38, 1933, NY_-A, 8.7, 15, 6, 0, 3, 0, 1, 6.23, 0, 1, 0, 15.6, .0, 3.1, 1.0
 , , , 1063.3, 947, 447, 114, 405, 34, 648, 3.78, 73, 46, 6, 8.0, 1.0, 3.4, 5.5


Jul 26 '06 #9

P: n/a
Hi.

The webpage you need to parse is not very well-formed (I think), but
no problem. Perhaps the best option is to locate the portion of HTML you
want, in this case from "<h3 class="cardsect">Actual Pitching
Statistics </h3><pre>" to "</pre>". Between these markers you have a few
entries like this one: " 19 <a
href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>1914
BOS-A</a> 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0".

I'll put here a little portion of code using RE that I think will help
you develop the rest of the app.

import re
data = (" 19 <a href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>1914 "
        "BOS-A</a> 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0")
pt = re.compile("(<a.*?>|</a>)")  # this and the next line delete the html tags
data1 = pt.sub("", data)  # now data1 doesn't contain any html tag
pt = re.compile(" +")  # this and the next line substitute runs of spaces with "-"
data2 = pt.sub("-", data1)
arrange_data = data2.split("-")  # this makes a list with the data

After these few sentences you'll have a list with the data you need,
like the following:
['', '19', '1914', 'BOS', 'A', '2', '1', '0', '3.91', '4', '3', '96', '23.0', '21', '12', '10', '1', '7', '3', '0', '0', '0', '0', '1', '0']
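One caveat with the recipe above: substituting runs of spaces with "-" and then splitting on "-" also splits the team name "BOS-A", which is why 'BOS' and 'A' show up as separate items in that list. If that matters, a sketch that strips the tags and then splits on whitespace avoids it (the sample line and URL here are abbreviated stand-ins):

```python
import re

data = " 19 <a href=http://example.invalid/1914BOS-A.shtml>1914 BOS-A</a> 2 1 0 3.91"
data1 = re.sub(r"</?a[^>]*>", "", data)  # remove the <a ...> and </a> tags
fields = data1.split()                   # split on any run of whitespace
print(fields)
```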

I think this is a good start for you.

Tell me if you can resolve the problem with this or if you need
more help.

Bye

Jul 26 '06 #10
