Urllib vs. FireFox

Hello

After scratching my head over why I couldn't find data in a web page
using the "re" module, I discovered that the page as downloaded by
urllib doesn't match what is displayed when viewing the page source in
Firefox.

For instance, when searching Amazon for "Wargames":

URLLIB:
<a
href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H"><span
class="srTitle">Wargames</span></a>

~ Matthew Broderick, Dabney Coleman, John Wood, et Ally Sheedy
<span class="bindingBlock">(<span class="binding">Cassette
vidéo</span> - 2000)</span></td></tr>

FIREFOX:
<div class="productTitle"><a
href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H/ref=sr_1_1?ie=UTF8&s=dvd&qid=1224872998&sr=8-1">
Wargames</a> <span class="binding">~ Matthew Broderick, Dabney
Coleman, John Wood, et Ally Sheedy</span><span class="binding">
(<span class="format">Cassette vidéo</span> - 2000)</span></div>

Why do they differ?

Thank you.
Oct 24 '08 #1
Gilles Ganault wrote:
> [...] a web page as downloaded by urllib doesn't match what is
> displayed when viewing the source page in FireFox.
> [...]
> Why do they differ?

The browser sends a different client identifier (the User-Agent HTTP
header) than urllib does, and the server sends back different page
content depending on which client is asking.
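
To make the point above concrete, here is a minimal sketch that prints the identifier urllib sends when you don't set one. It is written for Python 3's urllib.request, which is an assumption since the thread predates it. The pick_template function is purely hypothetical: it only illustrates the kind of server-side branching described above and is not Amazon's actual logic.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent string urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # The opener keeps its default headers as a list of (name, value) pairs.
    return dict(opener.addheaders)["User-agent"]


def pick_template(user_agent):
    """Hypothetical server-side branching on the client identifier."""
    if "Firefox" in user_agent:
        return "productTitle"  # class seen in the Firefox source
    return "srTitle"           # class seen in the urllib download


# Assumed example of a Firefox identifier; real values vary by version and OS.
firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

print(default_user_agent())
print(pick_template(default_user_agent()))
print(pick_template(firefox_ua))
```

The default identifier starts with "Python-urllib/", so a server that branches on this header can easily tell a script from a browser.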

Stefan
Oct 24 '08 #2
Rex
Right. If you want to get the same results with your Python script
that you did with Firefox, you can modify the browser headers in your
code.

Here's an example with urllib2:
http://vsbabu.org/mt/archives/2003/0...p_headers.html
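
For readers on Python 3, where urllib2 was folded into urllib.request, setting the header might look roughly like the sketch below. The User-Agent value and the helper names build_request and fetch are assumptions made for illustration; they are not taken from the linked article.

```python
import urllib.request

# Assumed value: any identifier a real Firefox would send can go here.
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=FIREFOX_UA):
    """Build a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=FIREFOX_UA, timeout=10):
    """Download url and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Inspect the header without touching the network.
# Request stores header names capitalized, hence "User-agent".
req = build_request("http://www.amazon.fr/")
print(req.get_header("User-agent"))
```

Calling fetch(url) would then perform the download with the browser-like identifier. The server may still vary its output for other reasons, so the result is not guaranteed to match Firefox byte for byte.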

By the way, if you're doing non-trivial web scraping, the mechanize
module might make your work much easier. You can install it with
easy_install.
http://wwwsearch.sourceforge.net/mechanize/

Oct 24 '08 #3
On Oct 24, 2:53 pm, Rex <rex.eastbou...@gmail.com> wrote:
> Right. If you want to get the same results with your Python script
> that you did with Firefox, you can modify the browser headers in your
> code.
>
> Here's an example with urllib2: http://vsbabu.org/mt/archives/2003/0...g_http_headers...
>
> By the way, if you're doing non-trivial web scraping, the mechanize
> module might make your work much easier. You can install it with
> easy_install. http://wwwsearch.sourceforge.net/mechanize/

Or if you just need to query stuff on Amazon, then you might find this
module helpful:

http://pypi.python.org/pypi/Python-Amazon/

-------------------
Mike Driscoll

Blog: http://blog.pythonlibrary.org
Python Extension Building Network: http://www.pythonlibrary.org
Oct 24 '08 #4
On Fri, 24 Oct 2008 20:38:37 +0200, Gilles Ganault wrote:
> Hello
>
> After scratching my head as to why I failed finding data from a web
> using the "re" module, I discovered that a web page as downloaded by
> urllib doesn't match what is displayed when viewing the source page in
> FireFox.

Cookies?
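
If cookies are part of the difference, note that urllib does not keep them between requests unless you ask it to. A minimal sketch for Python 3 follows, assuming the standard http.cookiejar module; the helper name make_cookie_opener is hypothetical.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(handler)
    return opener, jar


opener, jar = make_cookie_opener()

# No request has been made yet, so the jar starts out empty.
print(len(jar))
```

Requests made through opener.open(url) would then store any cookies the server sets in jar and send them back on later requests, which is closer to what a browser does.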

Oct 25 '08 #5
Lie Ryan <li******@gmail.com> wrote:
> Cookies?

Yes, please. I'll take two. Chocolate chip. With milk.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Oct 27 '08 #6
On Fri, 24 Oct 2008 13:15:49 -0700 (PDT), Mike Driscoll
<ky******@gmail.com> wrote:
>> On Oct 24, 2:53 pm, Rex <rex.eastbou...@gmail.com> wrote:
>> By the way, if you're doing non-trivial web scraping, the mechanize
>> module might make your work much easier. You can install it with
>> easy_install. http://wwwsearch.sourceforge.net/mechanize/
>
> Or if you just need to query stuff on Amazon, then you might find this
> module helpful:
>
> http://pypi.python.org/pypi/Python-Amazon/

Thanks a bunch. I didn't know about the AWS service.
Oct 28 '08 #7
