Bytes IT Community

question about urllib and parsing a page

Hey there,
I am using Beautiful Soup to parse a few pages (screen scraping).
Easy stuff.
The issue I am having is with one particular web page that uses
JavaScript to display some numbers in tables.

Now if I open the page in Mozilla and "Save As", I get the numbers in
the source. Cool. But if I click "View Source" or download the URL
with urlretrieve, I get the source, but not the numbers.

Is there a way around this?

Thanks

Nov 2 '05 #1
4 Replies


Yeah, this tends to be silly, but a workaround (for Firefox at least)
is to select the content and, rather than choosing View Source,
right-click and choose View Selection Source...

Nov 2 '05 #2

That's cool, but I want to do this automatically with Python.
What can I do to have urllib download the source with the numbers in
it?

OK, not necessarily urllib, whichever one is best for the occasion.
Thanks
shawn

Nov 2 '05 #3

ne*****@xit.net wrote:
> Hey there,
> I am using Beautiful Soup to parse a few pages (screen scraping).
> Easy stuff.
> The issue I am having is with one particular web page that uses
> JavaScript to display some numbers in tables.
>
> Now if I open the page in Mozilla and "Save As", I get the numbers in
> the source. Cool. But if I click "View Source" or download the URL
> with urlretrieve, I get the source, but not the numbers.
>
> Is there a way around this?
>
> Thanks


If the JavaScript is automatically generated by the server with the
numbers in a known location, you can use a regular expression to
extract them. For example, if there's something in the code like:

var numbersToDisplay = [123,456,789];

Then you could use (warning: this is not fully tested):

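import re

# Placeholder: substitute the text found inside the page's <script> tag.
js_source = "... the source inside the <script> tag ..."
match = re.search(r'numbersToDisplay\s*=\s*\[([^\]]*)\];', js_source)
# re.search returns None when the pattern is absent, so check before using it.
numbers_str = match.group(1) if match else ""
numbers_list = [n.strip() for n in numbers_str.split(",") if n.strip()]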

You'll obviously have to vary this to match your particular script.
Bear in mind that this won't work if the values are computed in
JavaScript, instead of on the server. If that's the case, then unless
you feel like implementing a complete IE- and Mozilla-compatible
browser DOM and JavaScript interpreter, you're out of luck.
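
For the server-generated case, here is a rough end-to-end sketch in Python 3 using only the standard library. It is an illustration under assumptions, not a tested recipe for your page: the variable name numbersToDisplay comes from my made-up example above, and ScriptCollector, extract_numbers and fetch_html are hypothetical helper names. It pulls the text out of every inline <script> element of the downloaded HTML and then applies the same kind of regular expression.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class ScriptCollector(HTMLParser):
    """Collect the text found inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # The parser may deliver script text in several chunks, so append.
        if self.in_script:
            self.scripts[-1] += data


def extract_numbers(html, var_name="numbersToDisplay"):
    """Return the values assigned to var_name in an inline script, as strings."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    pattern = re.compile(re.escape(var_name) + r"\s*=\s*\[([^\]]*)\]")
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            return [n.strip() for n in match.group(1).split(",") if n.strip()]
    return []  # variable not found in any inline script


def fetch_html(url):
    """Download a page as text; assumes UTF-8, which may not hold for every site."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration on a hand-written page, so no network access is needed.
sample = "<html><script>var numbersToDisplay = [123,456,789];</script></html>"
print(extract_numbers(sample))  # prints ['123', '456', '789']
```

For a real page you would call extract_numbers(fetch_html(url)) with your own URL and variable name. As noted above, this only finds values that the server writes literally into the script; it does not run any JavaScript.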

-- David

Nov 2 '05 #4

Well, I think that's the case. Looking at the code, there is a long
string of math functions in the page, JavaScript math functions. Hmmmm.
I guess I'm up that famous creek.
Thanks for the info, though.
shawn

Nov 2 '05 #5
