difference between urllib2.urlopen and firefox view 'page source'?

cjl
Hi.

I am trying to screen scrape some stock data from yahoo, so I am
trying to use urllib2 to retrieve the html and beautiful soup for the
parsing.

Maybe (most likely) I am doing something wrong, but when I use
urllib2.urlopen to fetch a page, and when I view 'page source' of the
exact same URL in firefox, I am seeing slight differences in the raw
html.

Do I need to set a browser agent so yahoo thinks urllib2 is firefox?
Is yahoo detecting that urllib2 doesn't process javascript, and
passing different data?

-cjl

Mar 20 '07 #1
5 Replies
http://developer.yahoo.com/yui/articles/gbs/index.html seems to
indicate that Yahoo! serves different markup depending on which
grade your browser falls into. I'm not sure I'd spoof your
User-Agent; after all, your client is unlikely to support the features
that they're looking for in Firefox (JavaScript, CSS, SVG).
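
For reference, if you do decide to send a browser-like User-Agent, here is a minimal sketch of how the header is attached. It assumes Python 3, where urllib2 became urllib.request; the URL and the User-Agent string are placeholders of mine, not anything Yahoo! is known to require.

```python
from urllib.request import Request

# Hypothetical values, for illustration only.
URL = "http://finance.yahoo.com/q?s=YHOO"
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"


def build_request(url, user_agent):
    """Return a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(URL, FIREFOX_UA)
# Passing req to urllib.request.urlopen() would send the request with
# this header; the sketch stops short of touching the network.
```

Building the Request separately from opening it makes it easy to check which headers will go out before anything is sent.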

Mar 20 '07 #2
It's almost certainly a browser detection issue. This may not matter for
your application.
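
One way to check whether the differences matter is to save both versions and diff them. A rough sketch, assuming Python 3 and that you already have the two HTML strings in hand (one from urlopen, one pasted from the browser's 'page source'); the function name and file labels are my own:

```python
import difflib


def html_diff(fetched, browser):
    """Return unified-diff lines between two HTML strings."""
    return list(
        difflib.unified_diff(
            fetched.splitlines(),
            browser.splitlines(),
            fromfile="urllib",
            tofile="firefox",
            lineterm="",
        )
    )


# Tiny made-up inputs, just to show the shape of the output.
from_urllib = "<html>\n<body>\n<p>quote</p>\n</body>\n</html>"
from_browser = "<html>\n<body>\n<script src='x.js'></script>\n<p>quote</p>\n</body>\n</html>"

for line in html_diff(from_urllib, from_browser):
    print(line)
```

If the only added or removed lines are scripts and styling, the data you want to scrape is probably the same in both versions.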

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Mar 20 '07 #3
Unless the data you need depends on the site detecting a specific
browser, you will probably receive 'cleaner' code that's more easily
parsed if you don't set a user agent. Usually the browser optimization
they do is just eye candy, bells and whistles anyway, in order to give
you a more 'pleasing experience'. I doubt that your program will care
about that ;)
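
To see what a site is told when you don't set a user agent yourself, you can inspect the default header. This is a small sketch assuming Python 3's urllib.request (the successor to urllib2); the helper name is mine.

```python
from urllib.request import build_opener


def default_user_agent():
    """Return the User-Agent string a freshly built opener sends by default."""
    headers = dict(build_opener().addheaders)
    return headers.get("User-agent")


# The value starts with "Python-urllib/" followed by a version number.
print(default_user_agent())
```

So "not setting a user agent" still identifies the client as Python's urllib, which is what a site doing browser detection would be reacting to.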

Tina
Mar 20 '07 #4
You can do this fairly easily. I found a similar program in the book
Core Python Programming. It actually sticks the stocks into an Excel
spreadsheet. The code is below. You can easily modify it to send the
output elsewhere.
# Core Python Chp 23, pg 994
# estock.pyw

from Tkinter import Tk
from time import sleep, ctime
from tkMessageBox import showwarning
from urllib import urlopen
import win32com.client as win32

warn = lambda app: showwarning(app, 'Exit?')
RANGE = range(3, 8)
TICKS = ('AMZN', 'AMD', 'EBAY', 'GOOG', 'MSFT', 'YHOO')
COLS = ('TICKER', 'PRICE', 'CHG', '%AGE')
URL = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1c1p2'

def excel():
    app = 'Excel'
    xl = win32.gencache.EnsureDispatch('%s.Application' % app)
    ss = xl.Workbooks.Add()
    sh = ss.ActiveSheet
    xl.Visible = True
    sleep(1)

    sh.Cells(1, 1).Value = 'Python-to-%s Stock Quote Demo' % app
    sleep(1)
    sh.Cells(3, 1).Value = 'Prices quoted as of: %s' % ctime()
    sleep(1)
    for i in range(4):
        sh.Cells(5, i+1).Value = COLS[i]
    sleep(1)
    sh.Range(sh.Cells(5, 1), sh.Cells(5, 4)).Font.Bold = True
    sleep(1)
    row = 6

    u = urlopen(URL % ','.join(TICKS))
    for data in u:
        tick, price, chg, per = data.split(',')
        sh.Cells(row, 1).Value = eval(tick)
        sh.Cells(row, 2).Value = ('%.2f' % round(float(price), 2))
        sh.Cells(row, 3).Value = chg
        sh.Cells(row, 4).Value = eval(per.rstrip())
        row += 1
        sleep(1)
    u.close()

    warn(app)
    ss.Close(False)
    xl.Application.Quit()

if __name__ == '__main__':
    Tk().withdraw()
    excel()

# Have fun - Mike
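
A side note on the parsing step above: the script calls eval() on text that came from a remote server just to strip the quotes, which is risky. Here is a sketch of the same split done with the csv module instead. It assumes Python 3 and the four-field layout implied by f=sl1c1p2 (ticker, price, change, percent); the sample rows are invented for illustration, not real quotes.

```python
import csv

# Made-up sample in the assumed ticker, price, change, percent layout.
SAMPLE = '"AMZN",38.50,-0.25,"-0.65%"\r\n"YHOO",30.10,+0.40,"+1.35%"\r\n'


def parse_quotes(text):
    """Parse quote lines into dicts without calling eval() on remote data."""
    rows = []
    for row in csv.reader(text.splitlines()):
        if not row:
            continue  # skip blank lines
        tick, price, chg, per = row
        rows.append(
            {
                "ticker": tick,
                "price": float(price),
                "change": float(chg),
                "percent": per,
            }
        )
    return rows


for quote in parse_quotes(SAMPLE):
    print(quote)
```

csv.reader removes the surrounding double quotes itself, so the ticker and percentage fields come back as plain strings.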

Mar 20 '07 #5
Here's a useful online tool that might help you see what's happening:

http://www.sitetruth.com/experimental/viewer.html

We use this to help webmasters see what our web crawler is seeing.

This reads a page, using Python and FancyURLopener, with a
USER-AGENT string of "SiteTruth.com site rating system."
Then it parses the page with BeautifulSoup, removes all
<SCRIPT>, <EMBED>, and <OBJECT> tags, makes all the links
absolute, then writes the page back out in UTF-8 Unicode.
The resulting cleaned-up page is displayed.

If the page you're trying to read looks OK with our viewer,
you should be able to read it from Python with no problems.
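
To give a feel for that kind of cleanup, here is a rough sketch of the same idea. It is not the SiteTruth code: it assumes Python 3 and swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra installs. The class and function names are mine, and comments and doctype declarations are simply dropped.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_WITH_CONTENT = {"script", "object"}  # drop the tag and everything inside
SKIP_VOID = {"embed"}                     # no closing tag, drop the tag itself
URL_ATTRS = {"href", "src"}


class PageCleaner(HTMLParser):
    """Re-emit HTML without script/embed/object and with absolute links."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if tag in SKIP_VOID or self.skip_depth:
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append('%s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        if tag in SKIP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in SKIP_VOID and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)


def clean(page, base_url):
    """Return the cleaned page as a str; call .encode('utf-8') to write it out."""
    cleaner = PageCleaner(base_url)
    cleaner.feed(page)
    cleaner.close()
    return "".join(cleaner.out)
```

Running a fetched page through something like clean() before comparing it with the browser's source removes most of the script-related noise from the comparison.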

John Nagle

Mar 20 '07 #6
