Hello
After scratching my head as to why I failed to find data in a web page
using the "re" module, I discovered that a web page as downloaded by
urllib doesn't match what is displayed when viewing the page source in
Firefox.
For instance, when searching Amazon for "Wargames":
URLLIB:
<a
href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H"><span
class="srTitle">Wargames</span></a>
~ Matthew Broderick, Dabney Coleman, John Wood, et Ally Sheedy
<span class="bindingBlock">(<span class="binding">Cassette
vidéo</span> - 2000)</span></td></tr>
FIREFOX:
<div class="productTitle"><a
href="http://www.amazon.fr/Wargames-Matthew-Broderick/dp/B00004RJ7H/ref=sr_1_1?ie=UTF8&s=dvd&qid=1224872998&sr=8-1">
Wargames</a> <span class="binding"> ~ Matthew Broderick, Dabney
Coleman, John Wood, et Ally Sheedy</span><span class="binding">
(<span class="format">Cassette vidéo</span> - 2000)</span></div>
Why do they differ?
Thank you.
Gilles Ganault wrote:
> Why do they differ?
The browser identifies itself to the server differently than urllib does
(via the User-Agent request header), and the server sends back different
page content depending on which client is asking.
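A minimal sketch of how one might send a browser-like identifier, assuming Python 3's urllib.request (the thread itself predates it and used the old urllib module). The User-Agent string and the helper names build_request and fetch are illustrative assumptions, not something taken from this thread, and a server may still vary its output for other reasons.

```python
import urllib.request

# Assumed, illustrative browser identifier; substitute the string your own
# browser actually sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Download url and return the body as text (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=30) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch() with the search URL and comparing the result against the default urllib download should show whether the identifier alone explains the difference.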
Stefan
On Fri, 24 Oct 2008 20:38:37 +0200, Gilles Ganault wrote:
> Hello
> After scratching my head as to why I failed to find data in a web page
> using the "re" module, I discovered that a web page as downloaded by
> urllib doesn't match what is displayed when viewing the page source in
> Firefox.
Cookies?
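If cookies are the cause, a rough sketch like the following keeps them across requests, assuming Python 3's urllib.request and http.cookiejar. The helper name build_cookie_opener is hypothetical, and whether Amazon actually varies its markup by cookie is not established in this thread.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

Requests made with opener.open(url) then send back any cookies the server set earlier, which is closer to what a browser session does.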
Lie Ryan <li******@gmail.com> wrote:
> Cookies?
Yes, please. I'll take two. Chocolate chip. With milk.
--
Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
On Fri, 24 Oct 2008 13:15:49 -0700 (PDT), Mike Driscoll
<ky******@gmail.com> wrote:
>On Oct 24, 2:53 pm, Rex <rex.eastbou...@gmail.com> wrote:
>By the way, if you're doing non-trivial web scraping, the mechanize module might make your work much easier. You can install it with easy_install.
>http://wwwsearch.sourceforge.net/mechanize/
>Or if you just need to query stuff on Amazon, then you might find this module helpful:
>http://pypi.python.org/pypi/Python-Amazon/
Thanks a bunch. I didn't know about the AWS service.