help!! *extra* tricky web page to extract data from...

How do I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Sincerely,

Chris

Mar 13 '07 #1
se******@spawar.navy.mil wrote:
> How do I extract the visible numerical data from this Microsoft financial
> web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow right
> in a Python script?
>
> Any help greatly appreciated.

It's an AJAX site. You have to carefully analyze it and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.

Diez
Mar 13 '07 #2
"se******@spawar.navy.mil" <se******@spawar.navy.mil> wrote:
> How do I extract the visible numerical data from this Microsoft
> financial web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow
> right in a Python script?
>
> Any help greatly appreciated.
>
> Sincerely,
>
> Chris

The URL for the data is in an iframe. If you need to scrape the
original page for some reason (instead of using the iframe URL directly),
you can use urlparse.urljoin to resolve the relative URL.
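
A minimal sketch of that approach, assuming the data page is referenced by a plain <iframe src="..."> tag. The page URL and the sample HTML below are made up for illustration, and iframe_urls is a hypothetical helper name. In Python 3 the function lives at urllib.parse.urljoin; urlparse.urljoin is the Python 2 spelling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes found in html (hypothetical helper)."""
    finder = IframeFinder()
    finder.feed(html)
    # urljoin resolves relative references against the URL of the outer page
    return [urljoin(page_url, src) for src in finder.sources]


# Made-up page URL and markup, for illustration only
sample = '<html><body><iframe src="../data/table.html"></iframe></body></html>'
print(iframe_urls("http://example.com/reports/msft/index.html", sample))
# ['http://example.com/reports/data/table.html']
```

Once you have the absolute iframe URL you can download that document and scrape it instead of the outer page.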
max

Mar 13 '07 #3
> It's an AJAX site. You have to carefully analyze it and see what
> actually happens in the JavaScript, then use that. Maybe something like
> the HTTP header plugin for Firefox helps you there.

Oops, obviously I didn't look closely enough at the site. Sorry for the confusion.

Still, some pages are AJAX, and you won't be able to scrape them easily
without analyzing the JS code.

Diez
Mar 13 '07 #4
"Diez B. Roggisch" <de***@nospam.web.de> writes:
> Still, some pages are AJAX, and you won't be able to scrape them easily
> without analyzing the JS code.

Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.
Mar 13 '07 #5
Paul Rubin wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Still, some pages are AJAX, and you won't be able to scrape them easily
>> without analyzing the JS code.
>
> Sooner or later it would be great to have a JS interpreter written in
> Python for this purpose. It would do all the same operations on an
> HTML/XML DOM that a browser does, basically all the stuff of a browser
> except rendering into pixels. JS semantics are similar enough to
> Python that maybe the JS could be compiled into Python byte code.

Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Diez
Mar 13 '07 #6
"Diez B. Roggisch" <de***@nospam.web.de> writes:
> Nice idea, but not really helpful in the end. Besides the rather nasty
> parts of the DOMs that make JS programming the PITA it is, I think the
> whole event-based stuff makes this basically impossible.

Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.
Mar 13 '07 #7
Paul Rubin wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Nice idea, but not really helpful in the end. Besides the rather nasty
>> parts of the DOMs that make JS programming the PITA it is, I think the
>> whole event-based stuff makes this basically impossible.
>
> Obviously the Python interface would need ways to send events into the
> DOM, simulating timer ticks, mouse clicks, and so forth, just like
> urllib in a sense simulates a user navigating a browser.

Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. Especially if the
site does not _want_ to be scrapable: think of a simple "click on the
images in the order of the numbers shown on them" captcha.

Most of the time it's easier to sniff the HTTP stream and grab the data directly.
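
A rough sketch of what "grab the data directly" can look like once a sniffer or header plugin has shown you the request the browser makes. The URLs are placeholders, and build_data_request and fetch are hypothetical helper names. Whether a given site needs the Referer or User-Agent header is an assumption you have to check against the captured traffic.

```python
import urllib.request


def build_data_request(data_url, referer):
    """Build a request that mimics what the browser sent for the data URL."""
    return urllib.request.Request(
        data_url,
        headers={
            # Header values here are illustrative; copy the real ones
            # from the traffic you captured.
            "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
            "Referer": referer,
        },
    )


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder URLs; substitute the ones observed in the HTTP stream
request = build_data_request(
    "http://example.com/data/table.html",
    referer="http://example.com/reports/index.html",
)
# body = fetch(request)  # uncomment to actually hit the network
```

Building the request separately from sending it makes it easy to compare your headers with the ones the browser sent before you touch the network.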

Diez
Mar 13 '07 #8
"Diez B. Roggisch" <de***@nospam.web.de> writes:
> Obviously this wouldn't really help, as you can't predict which events
> a website actually wants, or in which order. Especially if the site
> does not _want_ to be scrapable: think of a simple "click on the
> images in the order of the numbers shown on them" captcha.

Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

> Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Certainly true, but there are times when you have to pull stuff out of
the JS. It's usually possible to do that without actually
interpreting the JS, but an interpreter would make it a lot more
convenient some of the time.
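
As an illustration of pulling data out of JS without interpreting it, here is a sketch that assumes the page assigns an array literal to a variable and that the literal happens to be valid JSON (double-quoted strings, no trailing commas). The variable name, the sample script, and extract_js_array are all made up. A "];" inside a string literal would fool the regex, which is exactly where a real interpreter would be more convenient.

```python
import json
import re


def extract_js_array(script_text, var_name):
    """Return the array assigned to `var name = [...];`, or None if absent.

    Only works when the array literal is also valid JSON.
    """
    pattern = r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up script content, for illustration only
sample_script = "<script>var points = [1, 2.5, 3];</script>"
print(extract_js_array(sample_script, "points"))
# [1, 2.5, 3]
```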
Mar 13 '07 #9
se******@spawar.navy.mil wrote:
> How do I extract the visible numerical data from this Microsoft financial
> web site?
>
> http://tinyurl.com/yw2w4h
>
> If you simply download the HTML file you'll see the data is *not*
> embedded in it but loaded from some other file.
>
> Surely if I can see the data in my browser I can grab it somehow right
> in a Python script?
>
> Any help greatly appreciated.

Been there, done that, years ago. Try this:

http://www.downside.com/cgi/testfina...-06-034196.txt

That will get you the data you're looking for.
If you want to try other companies, start at the query box on
"http://www.downside.com".

The data is actually coming from the United States Securities and Exchange
Commission's EDGAR web site, where companies are required to file their
financial statements. The filings are intended to be read by humans, but
it's possible to parse many filings mechanically. They're supposed to be
in HTML 3.2, but this isn't enforced.
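
For a sense of what parsing such a filing mechanically might involve, here is a small sketch (not the downside.com parser) that collects table cells from loosely written HTML and converts the numeric ones. It assumes old-style markup where closing </td> and </tr> tags may be missing, and accounting-style negatives written in parentheses. The sample table and its figures are invented, and the names TableCells, parse_rows and to_number are hypothetical.

```python
import re
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect table rows as lists of cell text, tolerating missing end tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()      # a new <tr> implicitly closes the last row
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()     # a new cell implicitly closes the last cell
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell text
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def close(self):
        super().close()
        self._flush_row()


def parse_rows(html):
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_number(text):
    """Convert '1,234' or '(567)' to a float; return None for non-numeric text."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


# Invented sample in the loose style described above (no closing tags)
sample_table = """<table>
<tr><td>Revenue<td>1,234
<tr><td>Net loss<td>(567)
</table>"""

for label, amount in parse_rows(sample_table):
    print(label, to_number(amount))
```

Real filings vary a lot more than this (nested tables, footnote markers, units stated in a header row), so treat it as a starting point only.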

There are many EDGAR parsers, some better than ours. To do a really good one,
you have to license a patent from Price Waterhouse. Try
"http://www.10kwizard.com/", which has an API for retrieving this info.
It's not free.

John Nagle
Mar 13 '07 #10
