question about urllib and parsing a page

hey there,
i am using beautiful soup to parse a few pages (screen scraping)
easy stuff.
the issue i am having is with one particular web page that uses a
javascript to display some numbers in tables.

now if i open the page in mozilla and "save as" i get the numbers in
the source. cool. but if i click on "view source" or download the url
with urlretrieve, i get the source, but not the numbers.

is there a way around this ?

thanks

Nov 2 '05 #1
Yeah, this tends to be silly, but a workaround (for firefox at least)
is to select the content and rather than saying view source, right
click and click View Selection Source...

Nov 2 '05 #2
that's cool, but i want to do this automatically with python.
what can i do to have urllib download the source with the numbers in
it?

ok, not necessarily urllib, whichever one is best for the occasion
thanks
shawn

Nov 2 '05 #3
ne*****@xit.net wrote:
hey there,
i am using beautiful soup to parse a few pages (screen scraping)
easy stuff.
the issue i am having is with one particular web page that uses a
javascript to display some numbers in tables.

now if i open the page in mozilla and "save as" i get the numbers in
the source. cool. but if i click on "view source" or download the url
with urlretrieve, i get the source, but not the numbers.

is there a way around this ?

thanks


If the Javascript is automatically generated by the server with the
numbers in a known location, you can use a regular expression to
extract them. For example, if there's something in the code like:

var numbersToDisplay = [123,456,789];

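Then you could use something like this (warning: not fully tested):

import re

# Stand-in for the real source found inside the page's <script> tag.
js_source = "var numbersToDisplay = [123,456,789];"
# Capture everything between the square brackets of the array literal.
match = re.search(r'numbersToDisplay\s*=\s*\[([^\]]*)\];', js_source)
if match:
    # Split on commas; each item is still a string at this point.
    numbers_list = [n.strip() for n in match.group(1).split(",")]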
You'll obviously have to vary this to match your particular script.
Bear in mind that this won't work if the values are computed in
JavaScript, instead of on the server. If that's the case, then unless
you feel like implementing a complete IE- and Mozilla-compatible
browser DOM and JavaScript interpreter, you're out of luck.
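
For the server-generated case, here is a rough sketch of how the regex approach above could be wired up end to end. It is only an illustration, not a tested solution for any particular site: the ScriptCollector class, the extract_numbers function and the sample page are all made up for this example, and the variable name numbersToDisplay is just the one from the snippet earlier in this post. It uses the standard library's html.parser to pull out the <script> contents; Beautiful Soup could do the same job.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text while inside <script>.
        if self.in_script:
            self.scripts[-1] += data


def extract_numbers(html, var_name="numbersToDisplay"):
    """Return the array literal assigned to var_name as a list of strings.

    Returns an empty list if no script assigns a literal array to var_name.
    """
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    pattern = re.compile(re.escape(var_name) + r"\s*=\s*\[([^\]]*)\]")
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            return [s.strip() for s in match.group(1).split(",") if s.strip()]
    return []


# Made-up page for illustration. In practice the HTML would come from
# something like urllib.request.urlopen(url).read().decode("utf-8").
sample_html = """
<html><body>
<script type="text/javascript">
var numbersToDisplay = [123, 456,789];
</script>
<table id="numbers"></table>
</body></html>
"""

print(extract_numbers(sample_html))
```

As with the regex on its own, this only finds values that the server wrote into the script as literals. It does not run any JavaScript, so numbers that are calculated in the browser will not show up.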

-- David

Nov 2 '05 #4
well, i think that's the case. looking at the code, there is a long
string of math functions in the page, javascript math functions. hmmmm.
i guess i'm up that famous creek.
thanks for the info, though
shawn

Nov 2 '05 #5
