help!! *extra* tricky web page to extract data from...

seberino@spawar.navy.mil
#1: Mar 13 '07
How can I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Sincerely,

Chris


Diez B. Roggisch
#2: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


seberino@spawar.navy.mil wrote:
> How can I extract the visible numerical data from this Microsoft financial
> web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow right
> in a Python script?
>
> Any help greatly appreciated.

It's an AJAX site. You have to analyze it carefully and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.

Diez
Max Erickson
#3: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


"seberino@spawar.navy.mil" <seberino@spawar.navy.mil> wrote:
> How can I extract the visible numerical data from this Microsoft
> financial web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow
> right in a Python script?
>
> Any help greatly appreciated.
>
> Sincerely,
>
> Chris

The URL for the data is in an iframe. If you need to scrape the
original page for some reason (instead of fetching the iframe URL
directly), you can use urlparse.urljoin to resolve the relative URL.
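
A minimal sketch of the iframe approach, written for Python 3 (where
urlparse.urljoin now lives at urllib.parse.urljoin). The page URL, the
iframe markup, and the helper names IframeFinder and iframe_urls are made
up for illustration; the real page will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # Python 2 spelled this urlparse.urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes found in html."""
    finder = IframeFinder()
    finder.feed(html)
    return [urljoin(page_url, src) for src in finder.sources]


# Hypothetical page URL and markup, for illustration only.
page_url = "http://example.com/investor/page.html"
sample_html = (
    "<html><body>"
    '<iframe src="../data/statement.asp?sym=MSFT"></iframe>'
    "</body></html>"
)
print(iframe_urls(page_url, sample_html))
```

Each URL returned can then be downloaded like any other page.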


max

Diez B. Roggisch
#4: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


> It's an AJAX site. You have to analyze it carefully and see what
> actually happens in the JavaScript, then use that. Maybe something like
> the HTTP header plugin for Firefox helps you there.

Oops, obviously I wasn't looking closely enough at the site. Sorry for the confusion.

Still, some pages are AJAX, and you won't be able to scrape them easily
without analyzing the JS code.

Diez
Paul Rubin
#5: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


"Diez B. Roggisch" <deets@nospam.web.de> writes:
> Still, some pages are AJAX, and you won't be able to scrape them easily
> without analyzing the JS code.

Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.
Diez B. Roggisch
#6: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


Paul Rubin wrote:
> "Diez B. Roggisch" <deets@nospam.web.de> writes:
> > Still, some pages are AJAX, and you won't be able to scrape them easily
> > without analyzing the JS code.
>
> Sooner or later it would be great to have a JS interpreter written in
> Python for this purpose. It would do all the same operations on an
> HTML/XML DOM that a browser does, basically all the stuff of a browser
> except rendering into pixels. JS semantics are similar enough to
> Python that maybe the JS could be compiled into Python byte code.

Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Diez
Paul Rubin
#7: Mar 13 '07

re: help!! *extra* tricky web page to extract data from...


"Diez B. Roggisch" <deets@nospam.web.de> writes:
> Nice idea, but not really helpful in the end. Besides the rather nasty
> parts of the DOMs that make JS programming the PITA it is, I think the
> whole event-based stuff makes this basically impossible.

Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.
Diez B. Roggisch
#8: Mar 14 '07

re: help!! *extra* tricky web page to extract data from...


Paul Rubin wrote:
> "Diez B. Roggisch" <deets@nospam.web.de> writes:
> > Nice idea, but not really helpful in the end. Besides the rather nasty
> > parts of the DOMs that make JS programming the PITA it is, I think the
> > whole event-based stuff makes this basically impossible.
>
> Obviously the Python interface would need ways to send events into the
> DOM, simulating timer ticks, mouse clicks, and so forth, just like
> urllib in a sense simulates a user navigating a browser.

Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. This is especially true if the
site does not _want_ to be scrapable: think of a simple "click on the
images in the order of the numbers shown on them" captcha.

Most of the time it's easier to sniff the HTTP stream and grab the data directly.
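
Once a sniffer or header plugin has shown which URL the page really loads
its data from, grabbing it directly might look roughly like this. Both
URLs, the header values, and the helper names build_request and fetch_text
are placeholders, not what any particular site uses.

```python
import urllib.request


def build_request(data_url, referer):
    """Build a request that mimics what the browser was seen sending."""
    return urllib.request.Request(
        data_url,
        headers={
            # Some sites check these; copy whatever the sniffer showed.
            "Referer": referer,
            "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        },
    )


def fetch_text(request, timeout=30):
    """Download the response body and decode it as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URLs; substitute the ones observed in the HTTP stream.
request = build_request(
    "http://example.com/data/statement.asp?sym=MSFT",
    "http://example.com/investor/page.html",
)
# body = fetch_text(request)  # uncomment to perform the real download
```

The download itself is left commented out so the sketch runs without
network access.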

Diez
Paul Rubin
#9: Mar 14 '07

re: help!! *extra* tricky web page to extract data from...


"Diez B. Roggisch" <deets@nospam.web.de> writes:
> Obviously this wouldn't really help, as you can't predict which events a
> website actually wants, or in which order. This is especially true if the
> site does not _want_ to be scrapable: think of a simple "click on the
> images in the order of the numbers shown on them" captcha.

Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

> Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Certainly true, but there are times when you have to pull stuff out of
the JS. It's usually possible to do that without actually
interpreting the JS, but an interpreter would make it a lot more
convenient some of the time.
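
As a rough illustration of pulling a value out of JS without interpreting
it: if the data sits in a simple literal assignment, a regular expression
plus the json module is often enough. The variable names and numbers below
are invented, extract_js_value is a hypothetical helper, and this only
works when the literal happens to be valid JSON (double-quoted strings, no
trailing commas).

```python
import json
import re


def extract_js_value(script_text, var_name):
    """Return the JSON-compatible literal assigned to var_name, or None."""
    # Match "var NAME = <literal>;" where the statement ends a line.
    pattern = r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*(.+?);\s*$"
    match = re.search(pattern, script_text, re.MULTILINE | re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:
        # The literal was not valid JSON (e.g. single-quoted strings).
        return None


# Invented script fragment, for illustration only.
sample_script = """
var label = "Revenue";
var revenue = [100, 200, 300];
"""
print(extract_js_value(sample_script, "revenue"))
```

Anything computed at run time by the script is out of reach for this
trick, which is where an interpreter would come in.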
John Nagle
#10: Mar 14 '07

re: help!! *extra* tricky web page to extract data from...


seberino@spawar.navy.mil wrote:
> How can I extract the visible numerical data from this Microsoft financial
> web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow right
> in a Python script?
>
> Any help greatly appreciated.

Been there, done that, years ago. Try this:

http://www.downside.com/cgi/testfina...-06-034196.txt

That will get you the data you're looking for.
If you want to try other companies, start at the query box on
"http://www.downside.com".

The data is actually coming from the United States Securities and Exchange
Commission's EDGAR web site, where companies are required to file their
financial statements. The filings are intended to be read by humans, but
it's possible to parse many filings mechanically. They're supposed to be
in HTML 3.2, but this isn't enforced.
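
To give a flavor of what mechanical parsing involves, here is a small
sketch that pulls rows out of a sloppy HTML table using only the Python 3
standard library. The sample markup and figures are invented, not taken
from a real filing, and TableRows, table_rows, and parse_amount are
hypothetical helper names. Real filings are far messier than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table rows as lists of cell text, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # a new <tr> implicitly ends the previous row
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()  # a new cell implicitly ends the previous one
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def close(self):
        super().close()
        self._flush_row()


def table_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def parse_amount(text):
    """Convert '1,234' or '(1,234)' to an int; return None if not numeric."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # accounting style: parentheses mean negative
    if not cleaned.isdigit():
        return None
    return -int(cleaned) if negative else int(cleaned)


# Invented markup with unclosed <tr>/<td> tags, for illustration only.
sample_table = """
<table>
<tr><td>Revenue<td>1,000
<tr><td>Net loss<td>(250)
</table>
"""
for label, amount in table_rows(sample_table):
    print(label, parse_amount(amount))
```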

There are many EDGAR parsers, some better than ours. To do a really good one,
you have to license a patent from Price Waterhouse. Try
"http://www.10kwizard.com/", which has an API for retrieving this info.
It's not free.

John Nagle
Steve Holden
#11: Mar 14 '07

re: help!! *extra* tricky web page to extract data from...


Paul Rubin wrote:
> "Diez B. Roggisch" <deets@nospam.web.de> writes:
> > Obviously this wouldn't really help, as you can't predict which events a
> > website actually wants, or in which order. This is especially true if the
> > site does not _want_ to be scrapable: think of a simple "click on the
> > images in the order of the numbers shown on them" captcha.
>
> Sure, but most sites don't go to such lengths, and even captchas can
> be defeated if you're trying to scrape a specific site and are willing
> to spend effort on the particular captcha generator that it uses.
> Plus there is always www.captchasolver.com (!).

I especially like the terms and conditions they ask you to acknowledge if
you want to sign up as a worker:

http://www.captchasolver.com/join/worker#

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Paul Rubin
#12: Mar 14 '07

re: help!! *extra* tricky web page to extract data from...


Steve Holden <steve@holdenweb.com> writes:
> I especially like the terms and conditions they ask you to acknowledge
> if you want to sign up as a worker:
> http://www.captchasolver.com/join/worker#

Heh, cute, I guess you have to solve a different type of puzzle to
read them.

I'm surprised anyone is purporting to pay actual money for captcha
solutions. The usual scheme I've heard of (dunno if anyone actually does
it) is to feed the captchas you want to solve into a porn site, so
people give you solutions in order to keep viewing porn. You then
funnel the solutions back to the forms you're actually trying to
automate.

I think captchas are proving reasonably effective as a speed bump but
they do get defeated all the time, whether through automatic means or
otherwise.