question about urllib and parsing a page

nephish

hey there,
i am using beautiful soup to parse a few pages (screen scraping)
easy stuff.
the issue i am having is with one particular web page that uses
javascript to display some numbers in tables.

now if i open the page in mozilla and "save as" i get the numbers in
the source. cool. but if i click on "view source" or download the url
with urlretrieve, i get the source, but not the numbers.

is there a way around this ?

thanks

Nov 2 '05 #1

matt

Yeah, this tends to be silly, but a workaround (for firefox at least)
is to select the content and, rather than choosing View Source, right
click and choose View Selection Source...

Nov 2 '05 #2

nephish

that's cool, but i want to do this automatically with python.
what can i do to have urllib download the source with the numbers in
it?

ok, not necessarily urllib, whatever one is best for the occasion
thanks
shawn

Nov 2 '05 #3

David Wahler

ne*****@xit.net wrote:

> hey there,
> i am using beautiful soup to parse a few pages (screen scraping)
> easy stuff.
> the issue i am having is with one particular web page that uses a
> javascript to display some numbers in tables.
>
> now if i open the file in mozilla and "save as" i get the numbers in
> the source. cool. but i click on the "view source" or download the url
> with urlretrieve, i get the source, but not the numbers.
>
> is there a way around this ?
>
> thanks

If the Javascript is automatically generated by the server with the
numbers in a known location, you can use a regular expression to
extract them. For example, if there's something in the code like:

var numbersToDisplay = [123,456,789];

Then you could use (warning: this is not fully tested):

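import re

js_source = "... the source inside the <script> tag ..."
# Capture everything between the square brackets of the array literal.
match = re.search(r'numbersToDisplay\s*=\s*\[([^\]]*)\];', js_source)
# re.search returns None when the pattern is absent, so check first.
if match:
    numbers_list = [n.strip() for n in match.group(1).split(",")]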

You'll obviously have to vary this to match your particular script.
Bear in mind that this won't work if the values are computed in
JavaScript, instead of on the server. If that's the case, then unless
you feel like implementing a complete IE- and Mozilla-compatible
browser DOM and JavaScript interpreter, you're out of luck.
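
For what it's worth, here is a slightly fuller sketch of the regular-expression approach, assuming the server really does write the values into the page as a literal array of integers. It is written for Python 3 (urllib.request rather than the older urllib.urlretrieve). The helper names extract_numbers and fetch_numbers are made up for illustration, and numbersToDisplay is just the example variable from above, not something taken from your actual page.

```python
import re
from urllib.request import urlopen


def extract_numbers(page_source, var_name="numbersToDisplay"):
    """Return the integers in `<var_name> = [...]`, or [] if not found."""
    # Build the pattern from the variable name; the name is an assumption
    # and has to be changed to match the real script.
    pattern = r"\b" + re.escape(var_name) + r"\s*=\s*\[([^\]]*)\]"
    match = re.search(pattern, page_source)
    if match is None:
        return []
    # int() tolerates surrounding whitespace; empty items are skipped.
    return [int(item) for item in match.group(1).split(",") if item.strip()]


def fetch_numbers(url, var_name="numbersToDisplay"):
    """Download the raw (unrendered) page source and extract the array."""
    with urlopen(url) as response:
        page_source = response.read().decode("utf-8", errors="replace")
    return extract_numbers(page_source, var_name)


# Offline check against an inline sample instead of a real URL.
sample = "<script>var numbersToDisplay = [123, 456, 789];</script>"
print(extract_numbers(sample))  # prints [123, 456, 789]
```

This only reads what the server sent; it never runs the script. If the array holds anything other than plain integers, int() will raise ValueError and the conversion step would need adjusting.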

-- David

Nov 2 '05 #4

nephish

well, i think that's the case. looking at the code, there is a long
string of math functions in the page, javascript math functions. hmmmm.
i guess i'm up that famous creek.
thanks for the info, though
shawn

Nov 2 '05 #5
