Deflate with urllib2...

Sam
I'm using urllib2 and accepting gzip and deflate.

It turns out that almost every site returns either normal text or
gzip. But I finally found one that returns deflate.

Here's how I un-gzip:
compressedstream = StringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()

Un-gzipping works great!
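
For anyone trying the gzip steps on a current interpreter, here is a minimal sketch of the same approach. It assumes Python 3, where io.BytesIO takes the place of StringIO.StringIO, and it builds its own gzip body in memory instead of fetching one, so the `original` payload is made up for illustration.

```python
import gzip
import io

original = b"<html>gzip round trip</html>" * 10

# Build a gzip body in memory to stand in for what a server would send.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as writer:
    writer.write(original)
data = buffer.getvalue()

# Same steps as above: wrap the bytes in a file-like object and read
# them back through GzipFile.
compressedstream = io.BytesIO(data)
gzipper = gzip.GzipFile(fileobj=compressedstream)
restored = gzipper.read()

print(restored == original)
```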

Here's how I un-deflate (inflate??)
data = zlib.decompress(data)

Un-deflating doesn't work. I get "zlib.error: Error -3 while
decompressing data: incorrect header check"
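
The error can be reproduced without a web server. The sketch below is an illustration, not the poster's code: it assumes Python 3, uses a made-up `payload`, and a hypothetical helper `try_decompress`. It compresses the same bytes twice, once in the zlib container and once as a bare deflate stream, and shows that a plain zlib.decompress() call only accepts the first.

```python
import zlib

payload = b"<html>hello deflate</html>" * 20

# zlib.compress() emits the zlib container:
# 2-byte header, deflate body, 4-byte Adler-32 trailer.
wrapped = zlib.compress(payload)

# A negative wbits makes compressobj() emit a bare deflate stream
# with no header and no trailer.
raw_compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
raw = raw_compressor.compress(payload) + raw_compressor.flush()


def try_decompress(blob):
    """Return the decompressed bytes, or the zlib.error message as text."""
    try:
        return zlib.decompress(blob)
    except zlib.error as exc:
        return str(exc)


wrapped_result = try_decompress(wrapped)
raw_result = try_decompress(raw)

print(wrapped_result == payload)  # the zlib container decodes fine
print(raw_result)                 # the bare stream yields an "Error -3 ..." message
```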

I'm using python 2.5.2. Can someone tell me exactly how to handle
deflated web pages?

Thanks
Sep 9 '08 #1
Try this
http://www.paul.sladen.org/projects/pyflate/

Sep 9 '08 #2
On Tue, 09 Sep 2008 16:38:54 -0300, Sam <sa*******@gmail.com> wrote:
> I'm using python 2.5.2. Can someone tell me exactly how to handle
> deflated web pages?

zlib.decompress should work - can you provide a site that uses deflate to
test?

--
Gabriel Genellina

Sep 10 '08 #3
Sam
Gabriel, et al.

It's hard to find a web site that uses deflate these days.

Luckily, slashdot to the rescue.

I even wrote a test script.

If someone can tell me what's wrong that would be great.

Here's what I get when I run it:
Data is compressed using deflate. Length is: 107160
Traceback (most recent call last):
File "my_deflate_test.py", line 19, in <module>
data = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

And here's my test script:

#!/usr/bin/env python

import urllib2
import zlib

opener = urllib2.build_opener()
opener.addheaders = [('Accept-encoding', 'deflate')]

stream = opener.open('http://www.slashdot.org')
data = stream.read()
encoded = stream.headers.get('Content-Encoding')

if encoded == 'deflate':
    print "Data is compressed using deflate. Length is: ", str(len(data))
    data = zlib.decompress(data)
    print "After uncompressing, length is: ", str(len(data))
else:
    print "Data is not deflated."


Sep 17 '08 #4
Sam
On Sep 18, 2:10 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
> On Tue, 16 Sep 2008 21:58:31 -0300, Sam <samsli...@gmail.com> wrote:
> The code is correct - try with another server. I tested it with a
> LightHTTPd server and worked fine.

Gabriel...

I found a bunch of servers to test it on. It fails on every server I
could find (sans one).

Here's the ones it fails on:
slashdot.org
hotmail.com
godaddy.com
linux.com
lighttpd.net

I did manage to find one web server it succeeded on: kenrockwel.com,
a domain squatter sitting on a typo of one of my favorite
photographer's websites (the actual website is kenrockwell.com).

This squatter's site is indeed running lighttpd---but it appears to be
an earlier version, because the official lighttpd site fails on this
test.

We have all the major web servers failing the test:
* Apache 1.3
* Apache 2.2
* Microsoft-IIS/6.0
* lighttpd/1.5.0

So I think it's the python side that is wrong, regardless of what the
standard is.

What should I do next?

I've rewritten the code to make it easier to test. Just run it as is
and it will try all my test cases; or pass in a site on the command
line, and it will try just that.

Thanks!

#!/usr/bin/env python
"""Put the site you want to test as a command line parameter.
Otherwise tests the list of defaults."""

import urllib2
import zlib
import sys

opener = urllib2.build_opener()
opener.addheaders = [('Accept-encoding', 'deflate')]

try:
    sites = [sys.argv[1]]
except IndexError:
    sites = ['http://slashdot.org', 'http://www.hotmail.com',
             'http://www.godaddy.com', 'http://www.linux.com',
             'http://www.lighttpd.net', 'http://www.kenrockwel.com']

for site in sites:
    print "Trying: ", site
    stream = opener.open(site)
    data = stream.read()
    encoded = stream.headers.get('Content-Encoding')
    server = stream.headers.get('Server')

    print "  %s - %s (%s)" % (site, server, encoded)

    if encoded == 'deflate':
        before = len(data)
        try:
            data = zlib.decompress(data)
            after = len(data)
            print "  Able to decompress...went from %i to %i." % (before, after)
        except zlib.error:
            print "  Errored out on this site."
    else:
        print "  Data is not deflated."
    print
Sep 19 '08 #5
Sam
For those that are interested, but don't want to bother running the
program themselves, here's the output I get.

Trying: http://slashdot.org
  http://slashdot.org - Apache/1.3.41 (Unix) mod_perl/1.31-rc4 (deflate)
  Errored out on this site.

Trying: http://www.hotmail.com
  http://www.hotmail.com - Microsoft-IIS/6.0 (deflate)
  Errored out on this site.

Trying: http://www.godaddy.com
  http://www.godaddy.com - Microsoft-IIS/6.0 (deflate)
  Errored out on this site.

Trying: http://www.linux.com
  http://www.linux.com - Apache/2.2.8 (Unix) PHP/5.2.5 (deflate)
  Errored out on this site.

Trying: http://www.lighttpd.net
  http://www.lighttpd.net - lighttpd/1.5.0 (deflate)
  Errored out on this site.

Trying: http://www.kenrockwel.com
  http://www.kenrockwel.com - lighttpd (deflate)
  Able to decompress...went from 414 to 744.

Sep 19 '08 #6
On Thu, 18 Sep 2008 23:29:30 -0300, Sam <sa*******@gmail.com> wrote:
> I found a bunch of servers to test it on. It fails on every server I
> could find (sans one).

I'll try to check later. Anyway, why are you so interested in deflate?
Both "deflate" and "gzip" coding use the same algorithm and generate
exactly the same compressed stream, the only difference being the header
and tail format. Have you found any server supporting deflate that doesn't
support gzip as well?
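
The point about the shared compressed stream can be checked directly. This is a sketch under stated assumptions: Python 3, a made-up `data` payload, a hypothetical helper `compress_with`, and the default 10-byte gzip header that zlib writes when no extra header fields are set. It compresses the same bytes in all three formats and compares the bodies after slicing off each wrapper.

```python
import zlib


def compress_with(wbits, data, level=6):
    """Compress data in one shot using the container selected by wbits."""
    comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


data = b"The quick brown fox jumps over the lazy dog. " * 50

raw = compress_with(-zlib.MAX_WBITS, data)            # bare deflate (RFC 1951)
zlib_fmt = compress_with(zlib.MAX_WBITS, data)        # zlib container (RFC 1950)
gzip_fmt = compress_with(16 + zlib.MAX_WBITS, data)   # gzip container (RFC 1952)

# zlib container: 2-byte header + deflate stream + 4-byte Adler-32.
zlib_body = zlib_fmt[2:-4]
# gzip container as written here: 10-byte header + deflate stream
# + 4-byte CRC-32 + 4-byte length.
gzip_body = gzip_fmt[10:-8]

print(len(raw), len(zlib_fmt), len(gzip_fmt))
print(zlib_body == raw, gzip_body == raw)
```

With these settings the three outputs differ only by their wrappers, so the gzip form comes out 18 bytes longer than the bare stream and 12 bytes longer than the zlib form.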

--
Gabriel Genellina

Sep 19 '08 #7
On Thu, 18 Sep 2008 23:29:30 -0300, Sam <sa*******@gmail.com> wrote:
> We have all the major web servers failing the test:
> * Apache 1.3
> * Apache 2.2
> * Microsoft-IIS/6.0
> * lighttpd/1.5.0
>
> So I think it's the python side that is wrong, regardless of what the
> standard is.

I've found the problem. The zlib header is missing (2 bytes); the data begins
right with the compressed stream. You can decode it if you pass a negative
value for wbits:

try:
    data = zlib.decompress(data)
except zlib.error:
    data = zlib.decompress(data, -zlib.MAX_WBITS)

Note that this is clearly in violation of RFC 1950, the zlib format that
HTTP's "deflate" coding is defined to use: the header is *not* optional.
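
The fallback above can be wrapped in a small helper and exercised offline. This is a minimal sketch assuming Python 3; the function name `inflate_http_deflate` and the `page` payload are made up for illustration, and the headerless body is simulated by slicing the 2-byte header and 4-byte checksum off a normal zlib stream.

```python
import zlib


def inflate_http_deflate(body):
    """Decode a 'deflate' response body, with or without the zlib wrapper."""
    try:
        # The form the spec describes: deflate data inside the zlib container.
        return zlib.decompress(body)
    except zlib.error:
        # The form many servers send: a bare deflate stream.
        return zlib.decompress(body, -zlib.MAX_WBITS)


page = b"<html><body>" + b"deflate test " * 100 + b"</body></html>"

with_header = zlib.compress(page)
# Drop the 2-byte header and the 4-byte Adler-32 trailer to imitate
# what the failing servers return.
headerless = with_header[2:-4]

print(inflate_http_deflate(with_header) == page)
print(inflate_http_deflate(headerless) == page)
```

Trying the standard form first keeps well-behaved servers on the checked path, since the zlib container carries a checksum that the bare stream lacks.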

BTW, the curl developers had this same problem some time ago
<http://curl.haxx.se/mail/lib-2005-12/0130.html> and the proposed solution
is the same as above.

This is the output from your test script modified as above. (Note that in
some cases, the compressed stream is larger than the uncompressed data):

Trying: http://slashdot.org
  http://slashdot.org - Apache/1.3.41 (Unix) mod_perl/1.31-rc4 (deflate) len(deflate)=73174 len(gzip)=73208
  Able to decompress...went from 73174 to 73073.

Trying: http://www.hotmail.com
  http://www.hotmail.com - Microsoft-IIS/6.0 (deflate) len(deflate)=1609 len(gzip)=1635
  Able to decompress...went from 1609 to 3969.

Trying: http://www.godaddy.com
  http://www.godaddy.com - Microsoft-IIS/6.0 (deflate) len(deflate)=40646 len(gzip)=157141
  Able to decompress...went from 40646 to 157141.

Trying: http://www.linux.com
  http://www.linux.com - Apache/2.2.8 (Unix) PHP/5.2.5 (deflate) len(deflate)=52862 len(gzip)=52880
  Able to decompress...went from 52862 to 52786.

Trying: http://www.lighttpd.net
  http://www.lighttpd.net - lighttpd/1.5.0 (deflate) len(deflate)=5669 len(gzip)=5687
  Able to decompress...went from 5669 to 15746.

Trying: http://www.kenrockwel.com
  http://www.kenrockwel.com - lighttpd (deflate) len(deflate)=414 len(gzip)=426
  Able to decompress...went from 414 to 744.
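
On the note that a compressed stream can come out larger than the data: one way this happens, shown here as a sketch and not as a diagnosis of the sites above, is input with little redundancy. The example assumes Python 3, uses seeded pseudo-random bytes as a stand-in for such input, and a hypothetical helper `raw_deflate`.

```python
import random
import zlib


def raw_deflate(data):
    """Compress data as a bare deflate stream (no zlib header or trailer)."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()


rng = random.Random(0)  # fixed seed so every run uses the same bytes
noise = bytes(bytearray(rng.randrange(256) for _ in range(4096)))
text = b"deflate " * 512  # same length, highly repetitive

noise_out = raw_deflate(noise)
text_out = raw_deflate(text)

print(len(noise), len(noise_out))  # noise does not shrink; framing adds bytes
print(len(text), len(text_out))    # repetitive text shrinks a lot
```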

--
Gabriel Genellina

Sep 19 '08 #8
Sam
Gabriel...

Awesome! Thank you so much for the solution.

And yeah, I found exactly one website that, strangely enough, only does
deflate, not gzip. I'd rather not say what website it is, since it's
small and not mine. They may be few and far between, but they do
exist.

Thanks
Sep 20 '08 #9
