removing the header from a gzip'd string

Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining? (Obviously
I do not intend to perform the decompression!)

Thanks,

Dec 21 '06 #1
Rajarshi wrote:
> Hi, I have some code that takes a string and obtains a compressed
> version using zlib.compress
>
> Does anybody know how I can remove the header portion of the compressed
> bytes, such that I only have the compressed data remaining?

what makes you think there's a "header portion" in the data you get
from zlib.compress? it's just a continuous stream of bits, all of
which are needed by the decoder.

> (Obviously I do not intend to perform the decompression!)

oh. in that case, this should be good enough:

data[random.randint(0,len(data)):]

</F>

Dec 21 '06 #2
Rajarshi wrote:
> Does anybody know how I can remove the header portion of the
> compressed bytes, such that I only have the compressed data
> remaining? (Obviously I do not intend to perform the
> decompression!)

Just curious: What's your goal? :) A home made hash function?

Regards,
Björn

--
BOFH excuse #80:

That's a great computer you have there; have you considered how it
would work as a BSD machine?

Dec 21 '06 #3
At Thursday 21/12/2006 18:32, Fredrik Lundh wrote:
>> Hi, I have some code that takes a string and obtains a compressed
>> version using zlib.compress
>>
>> Does anybody know how I can remove the header portion of the compressed
>> bytes, such that I only have the compressed data remaining?
>
> what makes you think there's a "header portion" in the data you get
> from zlib.compress ? it's just a continuous stream of bits, all of
> which are needed by the decoder.

No. The first 2 bytes (or more if using a preset dictionary) are
header information. The last 4 bytes are for checksum. In-between
lies the encoded bit stream.
Using the default options ("deflate", default compression level, no
custom dictionary) will make those first two bytes 0x78 0x9c.
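
A rough illustration of that layout, as a sketch rather than anything posted in the thread: the helper name `split_zlib_stream` is made up here, and the slicing assumes the input is one complete stream from a single `zlib.compress` call, so the 4-byte checksum sits at the very end.

```python
import zlib


def split_zlib_stream(stream: bytes):
    """Split a zlib (RFC 1950) stream into header, raw deflate data, trailer.

    Hypothetical helper for illustration; assumes `stream` is one complete
    zlib stream, e.g. the return value of zlib.compress().
    """
    cmf, flg = stream[0], stream[1]
    # RFC 1950: CMF*256 + FLG must be a multiple of 31.
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("not a zlib stream")
    # FDICT (bit 5 of FLG) adds a 4-byte dictionary id after the two bytes.
    header_len = 6 if flg & 0x20 else 2
    header = stream[:header_len]
    trailer = stream[-4:]            # Adler-32 of the uncompressed data
    payload = stream[header_len:-4]  # the deflate bit stream itself
    return header, payload, trailer


data = b"hello world, hello world, hello world"
header, payload, trailer = split_zlib_stream(zlib.compress(data))

# The trailer is the big-endian Adler-32 checksum of the original data.
assert int.from_bytes(trailer, "big") == zlib.adler32(data)
# A negative wbits value tells zlib to expect raw deflate data.
assert zlib.decompress(payload, -15) == data
```

If the goal is only to obtain the bare deflate data, `zlib.compressobj(level, zlib.DEFLATED, -15)` produces it directly, with no header or checksum to strip.
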
If you want to encrypt a compressed text, you must remove redundant
information first. Knowing part of the clear message is a security
hole. Using a structured container (like a zip/rar/... file) makes it
worse because the fixed (or "guessable") part is longer, but anyway,
2 bytes may be bad enough.

See RFC 1950 <ftp://ftp.isi.edu/in-notes/rfc1950.txt>
--
Gabriel Genellina
Softlab SRL



Dec 22 '06 #4
Gabriel Genellina wrote:
> Using the default options ("deflate", default compression level, no
> custom dictionary) will make those first two bytes 0x78 0x9c.
>
> If you want to encrypt a compressed text, you must remove redundant
> information first.

encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

> Knowing part of the clear message is a security hole.

well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format").
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...

</F>

Dec 22 '06 #5

Fredrik Lundh wrote:
> Gabriel Genellina wrote:
>> Using the default options ("deflate", default compression level, no
>> custom dictionary) will make those first two bytes 0x78 0x9c.
>>
>> If you want to encrypt a compressed text, you must remove redundant
>> information first.
>
> encryption? didn't the OP say that he *didn't* plan to decompress the
> resulting data stream?
>
>> Knowing part of the clear message is a security hole.
>
> well, knowing the algorithm used to convert from the original clear
> text to the text that's actually encrypted also gives an attacker
> plenty of clues (especially if the original is regular in some way,
> such as "always an XML file" or "always a record having this format").
> sounds to me like trying to address this potential hole by stripping
> off 16 bits from the payload won't really solve that problem...
>
> </F>

Yes, I'm also interested to know why the OP wants to remove the header.

Though I'm not an expert on the zip format, my understanding is that
most binary formats are not of much use in pieces (though some
composite formats might be, e.g. you might be able to meaningfully
extract a piece, such as an image embedded in a Word file). I somehow
don't think a compressed zip file would be of use in pieces (except
possibly for the header itself). But I could be wrong of course.

Vasudev Ram
http://www.dancingbison.com

Dec 22 '06 #6
Fredrik Lundh wrote:
> Gabriel Genellina wrote:
>> If you want to encrypt a compressed text, you must remove redundant
>> information first.
>
> encryption? didn't the OP say that he *didn't* plan to decompress the
> resulting data stream?

I was trying to imagine any motivation for asking that question. And I
considered the second part as "I'm not the guy who will reconstruct the
original data". But I'm still intrigued by the actual use case...

--
Gabriel Genellina

Dec 22 '06 #7

Bjoern Schliessmann wrote:
> Rajarshi wrote:
>> Does anybody know how I can remove the header portion of the
>> compressed bytes, such that I only have the compressed data
>> remaining? (Obviously I do not intend to perform the
>> decompression!)
>
> Just curious: What's your goal? :) A home made hash function?

Actually I was implementing the use of the normalized compression
distance to evaluate molecular similarity as described in an article in
J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z, subscriber
access only, unfortunately).

Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

So I was interested to see if the NCD behaved like a metric if I
removed everything that was not the compressed string. And since I only
need to calculate similarity between two strings, I do not need to do
any decompression.
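
For readers who want to try this, here is a minimal sketch of the idea. It assumes the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes, and it uses raw deflate output so that no container header or checksum is counted. The function names are made up for illustration, and the article may compute things differently.

```python
import zlib


def raw_deflate_size(data: bytes, level: int = 9) -> int:
    """Size of the bare deflate stream: no zlib/gzip header, no checksum."""
    # wbits=-15 asks zlib for raw deflate output with a 32 KiB window.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return len(compressor.compress(data) + compressor.flush())


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (hypothetical helper, see lead-in)."""
    cx = raw_deflate_size(x)
    cy = raw_deflate_size(y)
    cxy = raw_deflate_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Placeholder inputs; any two byte strings will do.
a = b"the quick brown fox jumps over the lazy dog"
b = b"pack my box with five dozen liquor jugs"
print(ncd(a, a), ncd(a, b))
```

Even without a header, the deflate stream still carries fixed per-block overhead (block type bits and an end-of-block code), so on very short inputs that overhead remains a noticeable part of C.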

Dec 23 '06 #8
de**************@gmail.com wrote:
> Actually I was implementing the use of the normalized compression
> distance to evaluate molecular similarity as described in an
> article in J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z,
> subscriber access only, unfortunately).

Interesting. Thanks for the reply.

Regards,
Björn

--
BOFH excuse #438:

sticky bit has come loose

Dec 23 '06 #9
de**************@gmail.com wrote:
> Essentially, they note that the NCD does not always behave like a
> metric and one reason they put forward is that this may be due to the
> size of the header portion (they were using the command line gzip and
> bzip2 programs) compared to the strings being compressed (which are on
> average 48 bytes long).

gzip datastreams have a real header, with a file type identifier,
optional filenames, comments, and a bunch of flags.

but even if you strip that off (which is basically what happens if you
use zlib.compress instead of gzip), I doubt you'll get representative
"compressibility" metrics on strings that short. like most other
compression algorithms, those algorithms are designed for much larger
datasets.
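
To put numbers on the container overhead, a small sketch that measures it against the bare deflate stream. The helper name `container_overhead` is invented for this example. `gzip.compress` writes the minimal gzip header (10 bytes, no file name) plus an 8-byte trailer, so the command-line gzip, which normally also stores the file name, would add a little more.

```python
import gzip
import zlib


def container_overhead(data: bytes, level: int = 6) -> dict:
    """Bytes added by each container on top of the raw deflate stream.

    Hypothetical helper; all three outputs use the same compression level
    so that only the wrapping differs.
    """
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    raw = compressor.compress(data) + compressor.flush()
    return {
        "raw": len(raw),
        # zlib: 2-byte header + 4-byte Adler-32
        "zlib": len(zlib.compress(data, level)) - len(raw),
        # gzip: 10-byte header + 4-byte CRC-32 + 4-byte length
        "gzip": len(gzip.compress(data, compresslevel=level)) - len(raw),
    }


# A 48-byte input, matching the average string length mentioned above.
print(container_overhead(b"x" * 48))
```

On inputs this small, the fixed wrapper bytes can be comparable in size to the compressed data itself, which is the effect the article's authors suspected.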

</F>

Dec 24 '06 #10
