removing the header from a gzip'd string

Rajarshi

Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress.

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining? (Obviously
I do not intend to perform the decompression!)

Thanks,

Dec 21 '06 #1

Fredrik Lundh

Rajarshi wrote:

Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining?

what makes you think there's a "header portion" in the data you get
from zlib.compress ? it's just a continuous stream of bits, all of
which are needed by the decoder.

(Obviously I do not intend to perform the decompression!)

oh. in that case, this should be good enough:

data[random.randint(0,len(data)):]

</F>

Dec 21 '06 #2

Bjoern Schliessmann

Rajarshi wrote:

Does anybody know how I can remove the header portion of the
compressed bytes, such that I only have the compressed data
remaining? (Obviously I do not intend to perform the
decompression!)

Just curious: What's your goal? :) A home made hash function?

Regards,
Björn

--
BOFH excuse #80:

That's a great computer you have there; have you considered how it
would work as a BSD machine?

Dec 21 '06 #3

Gabriel Genellina

On Thursday 21/12/2006 at 18:32, Fredrik Lundh wrote:

Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining?

what makes you think there's a "header portion" in the data you get
from zlib.compress ? it's just a continuous stream of bits, all of
which are needed by the decoder.

No. The first 2 bytes (or more if using a preset dictionary) are
header information. The last 4 bytes are an Adler-32 checksum of the
uncompressed data. In between lies the encoded bit stream.
Using the default options ("deflate", default compression level, no
custom dictionary) will make those first two bytes 0x78 0x9c.
If you want to encrypt a compressed text, you must remove redundant
information first. Knowing part of the clear message is a security
hole. Using a structured container (like a zip/rar/... file) is
worse because the fixed (or "guessable") part is longer, but anyway,
2 bytes may be bad enough.
See RFC1950 <ftp://ftp.isi.edu/in-notes/rfc1950.txt>
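
The layout described above can be checked directly. The following is a minimal sketch in modern Python 3 (where zlib.compress takes bytes), not code from the thread; strip_zlib_wrapper and the sample input are hypothetical names chosen for illustration. It slices off the RFC 1950 header and the 4-byte trailer, then confirms that what remains is still a valid raw deflate stream.

```python
import zlib


def strip_zlib_wrapper(data: bytes) -> bytes:
    """Return only the raw deflate stream from a zlib (RFC 1950) stream."""
    # Bit 5 of the second header byte is FDICT. When it is set, a 4-byte
    # preset-dictionary ID follows the 2-byte header.
    header_len = 6 if data[1] & 0x20 else 2
    # The final 4 bytes are the Adler-32 checksum of the uncompressed data.
    return data[header_len:-4]


text = b"CC(=O)Oc1ccccc1C(=O)O"  # hypothetical sample input
wrapped = zlib.compress(text)
raw = strip_zlib_wrapper(wrapped)

print(wrapped[:2].hex())        # the two header bytes
print(len(wrapped) - len(raw))  # number of wrapper bytes removed
# A negative wbits value tells zlib to expect a raw deflate stream.
print(zlib.decompress(raw, -15) == text)
```

If the wrapper is never wanted in the first place, zlib.compressobj accepts a negative wbits value (for example -15) and emits the raw deflate stream directly, which avoids the slicing.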
--
Gabriel Genellina
Softlab SRL

Dec 22 '06 #4

Fredrik Lundh

Gabriel Genellina wrote:

Using the default options ("deflate", default compression level, no
custom dictionary) will make those first two bytes 0x78 0x9c.

If you want to encrypt a compressed text, you must remove redundant
information first.

encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

Knowing part of the clear message is a security hole.

well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format").
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...

</F>

Dec 22 '06 #5

vasudevram

Fredrik Lundh wrote:

Gabriel Genellina wrote:

Using the default options ("deflate", default compression level, no
custom dictionary) will make those first two bytes 0x78 0x9c.

If you want to encrypt a compressed text, you must remove redundant
information first.

encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

Knowing part of the clear message is a security hole.

well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format").
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...

</F>

Yes, I'm also interested to know why the OP wants to remove the header.

Though I'm not an expert on the zip format, my understanding is that
most binary formats are not of much use in pieces (though some
composite formats might be, e.g. you might be able to meaningfully
extract a piece, such as an image embedded in a Word file). I somehow
don't think a compressed zip file would be of use in pieces (except
possibly for the header itself). But I could be wrong of course.

Vasudev Ram
http://www.dancingbison.com

Dec 22 '06 #6

Gabriel Genellina

Fredrik Lundh wrote:

Gabriel Genellina wrote:

If you want to encrypt a compressed text, you must remove redundant
information first.

encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

I was trying to imagine any motivation for asking that question. And I
considered the second part as "I'm not the guy who will reconstruct the
original data". But I'm still intrigued by the actual use case...

--
Gabriel Genellina

Dec 22 '06 #7

debarchana.ghosh

Bjoern Schliessmann wrote:

Rajarshi wrote:

Does anybody know how I can remove the header portion of the
compressed bytes, such that I only have the compressed data
remaining? (Obviously I do not intend to perform the
decompression!)

Just curious: What's your goal? :) A home made hash function?

Actually I was implementing the use of the normalized compression
distance to evaluate molecular similarity as described in an article in
J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z, subscriber
access only, unfortunately).

Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

So I was interested to see if the NCD behaved like a metric if I
removed everything that was not the compressed string. And since I only
need to calculate similarity between two strings, I do not need to do
any decompression.
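
For reference, the normalized compression distance is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. A minimal sketch follows, assuming Python 3 and raw deflate output so that no header or checksum bytes are counted. The function names and the SMILES strings are illustrative only and are not taken from the article.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the raw deflate stream for data (no header, no checksum)."""
    # wbits=-15 asks zlib for a bare deflate stream without the zlib wrapper.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return len(compressor.compress(data) + compressor.flush())


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


aspirin = b"CC(=O)Oc1ccccc1C(=O)O"            # hypothetical inputs
caffeine = b"CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

print(ncd(aspirin, caffeine))
print(ncd(caffeine, aspirin))  # a true metric would give the same value
print(ncd(aspirin, aspirin))   # a true metric would give exactly 0
```

Comparing the last three values is one quick way to see how far the measure departs from a metric on short inputs, even with the wrapper bytes excluded.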

Dec 23 '06 #8

Bjoern Schliessmann

de**************@gmail.com wrote:

Actually I was implementing the use of the normalized compression
distance to evaluate molecular similarity as described in an
article in J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z,
subscriber access only, unfortunately).

Interesting. Thanks for the reply.

Regards,
Björn

--
BOFH excuse #438:

sticky bit has come loose

Dec 23 '06 #9

Fredrik Lundh

de**************@gmail.com wrote:

Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

gzip datastreams have a real header, with a file type identifier,
optional filenames, comments, and a bunch of flags.

but even if you strip that off (which is basically what happens if you
use zlib.compress instead of gzip), I doubt you'll get representative
"compressibility" metrics on strings that short. like most other
compression algorithms, those algorithms are designed for much larger
datasets.
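
To make the overhead concrete: a gzip stream (RFC 1952) carries at least a 10-byte header and an 8-byte trailer, while the zlib wrapper adds 2 header bytes and a 4-byte checksum. The sketch below, written for Python 3 with a made-up sample string and a hypothetical raw_deflate helper, prints the output size under each framing so the fixed cost can be compared with the input length.

```python
import gzip
import zlib


def raw_deflate(data: bytes, level: int = 6) -> bytes:
    """Compress to a bare deflate stream with no header or checksum."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


sample = b"CC(C)Cc1ccc(cc1)C(C)C(=O)O"  # hypothetical short input

# Use the same compression level everywhere so only the framing differs.
sizes = {
    "input": len(sample),
    "gzip": len(gzip.compress(sample, compresslevel=6)),
    "zlib": len(zlib.compress(sample, 6)),
    "raw deflate": len(raw_deflate(sample)),
}

for name, size in sizes.items():
    print(f"{name:12} {size:3d} bytes")
```

For inputs this short the framing can be a sizable fraction of the total output, and the compressed body itself may be barely smaller than the input.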

</F>

Dec 24 '06 #10
