
Copying zlib compression objects

I'm writing a program in Python that creates tar files of a certain
maximum size (to fit onto CD/DVD). One of the problems I'm running
into is that, when using compression, it's pretty much impossible to
determine in advance whether adding a file to an archive will push the
archive size past the maximum.

I believe that to do this properly, you need to copy the state of the
tar file (basically the current file offset as well as the state of the
compression object), then add the file. If the new size of the archive
exceeds the maximum, you need to restore the original state.
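
For the uncompressed case, the save / add / roll back idea can be
sketched as follows. This is only an illustration in current Python 3,
not part of the patch; append_if_fits and the sizes are made-up names
and values, and an in-memory stream stands in for the archive file.

```python
import io

def append_if_fits(stream, payload, max_size):
    """Append payload to stream unless that would push it past max_size.

    Returns True if the payload was kept, False if it was rolled back.
    """
    checkpoint = stream.tell()      # saved state: just the current offset
    stream.write(payload)
    if stream.tell() > max_size:
        stream.seek(checkpoint)     # rewind to the saved offset
        stream.truncate()           # discard everything written since
        return False
    return True

archive = io.BytesIO()
kept_first = append_if_fits(archive, b"a" * 600, max_size=1000)
kept_second = append_if_fits(archive, b"b" * 600, max_size=1000)
final_size = len(archive.getvalue())
```

With compression in the picture the offset alone is not enough state,
which is where copying the compression object comes in.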

The critical part is being able to copy the compression object.
Without compression it is trivial to determine if a given file will
"fit" inside the archive. When using compression, the compression
ratio of a file depends partially on all the data that has been
compressed prior to it.
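
A small experiment makes the dependence on earlier data visible. This
is a sketch in current Python 3; compressed_size_after is a
hypothetical helper, and Z_SYNC_FLUSH is used only so that the output
belonging to the second piece of data can be measured on its own.

```python
import zlib

def compressed_size_after(prefix, data):
    """Bytes emitted for `data` when it follows `prefix` in one stream."""
    comp = zlib.compressobj()
    # Push the prefix through and force its output out (result discarded),
    # without ending the stream, so only `data` is counted below.
    comp.compress(prefix)
    comp.flush(zlib.Z_SYNC_FLUSH)
    return len(comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH))

data = b"the quick brown fox jumps over the lazy dog\n" * 20
unrelated = bytes(range(256)) * 4

size_after_unrelated = compressed_size_after(unrelated, data)
size_after_same = compressed_size_after(data, data)
```

The same bytes come out smaller when the compressor has already seen
similar data, so the size a file will occupy cannot be computed from
the file alone.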

The current implementation in the standard library does not allow you
to copy these compression objects in a useful way, so I've made some
minor modifications (patch follows) to the Python 2.4.2 standard
library:
- Add copy() method to zlib compression object. This returns a new
compression object with the same internal state. I named it copy() to
keep it consistent with things like sha.copy().
- Add snapshot() / restore() methods to GzipFile and TarFile. These
work only in write mode. snapshot() returns a state object. Passing
in this state object to restore() will restore the state of the
GzipFile / TarFile to the state represented by the object.
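
As a rough illustration of how a copy() method on the compression
object gets used: Python releases after 2.4 do ship a copy() method on
zlib compression objects, so the sketch below runs on a current
Python 3 without the patch. try_add, noise and MAX_SIZE are
hypothetical names for this example only, and it works on a raw zlib
stream rather than a GzipFile or TarFile.

```python
import hashlib
import zlib

def noise(seed, n):
    """Deterministic, poorly compressible filler bytes (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def try_add(comp, emitted, data, max_size):
    """Compress `data` only if the finished stream stays within max_size.

    `emitted` is the number of compressed bytes `comp` has returned so far.
    Returns (accepted, output_bytes).
    """
    # Rehearse on a copy: compress and finish the stream there to learn the
    # exact final size, leaving the real compressor untouched.
    trial = comp.copy()
    trial_size = emitted + len(trial.compress(data)) + len(trial.flush())
    if trial_size > max_size:
        return False, b""
    return True, comp.compress(data)

MAX_SIZE = 600
comp = zlib.compressobj()
archive = bytearray()
accepted = []
for chunk in (noise(b"a", 400), noise(b"b", 400), b"\0" * 400):
    ok, output = try_add(comp, len(archive), chunk, MAX_SIZE)
    archive += output
    accepted.append(ok)
archive += comp.flush()
```

Here the trial runs on the copy and the original is only touched once
the data is known to fit, which avoids having to restore anything. The
snapshot() / restore() approach in the patch does it the other way
round: write first, then roll back.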

Future work:
- Decompression objects could use a copy() method too
- Add support for copying bzip2 compression objects

Does this seem like a good approach?

Cheers,
Chris

diff -ur Python-2.4.2.orig/Lib/gzip.py Python-2.4.2/Lib/gzip.py
--- Python-2.4.2.orig/Lib/gzip.py 2005-06-09 10:22:07.000000000 -0400
+++ Python-2.4.2/Lib/gzip.py 2006-02-14 13:12:29.000000000 -0500
@@ -433,6 +433,17 @@
         else:
             raise StopIteration

+    def snapshot(self):
+        if self.mode == READ:
+            raise IOError("Can't create a snapshot in READ mode")
+        return (self.size, self.crc, self.fileobj.tell(), self.offset, self.compress.copy())
+
+    def restore(self, s):
+        if self.mode == READ:
+            raise IOError("Can't restore a snapshot in READ mode")
+        self.size, self.crc, offset, self.offset, self.compress = s
+        self.fileobj.seek(offset)
+        self.fileobj.truncate()

 def _test():
     # Act like gzip; with -d, act like gunzip.
diff -ur Python-2.4.2.orig/Lib/tarfile.py Python-2.4.2/Lib/tarfile.py
--- Python-2.4.2.orig/Lib/tarfile.py 2005-08-27 06:08:21.000000000 -0400
+++ Python-2.4.2/Lib/tarfile.py 2006-02-14 16:50:41.000000000 -0500
@@ -1825,6 +1825,28 @@
         """
         if level <= self.debug:
             print >> sys.stderr, msg
+
+    def snapshot(self):
+        """Save the current state of the tarfile
+        """
+        self._check("_aw")
+        if hasattr(self.fileobj, "snapshot"):
+            return self.fileobj.snapshot(), self.offset, self.members[:]
+        else:
+            return self.fileobj.tell(), self.offset, self.members[:]
+
+    def restore(self, s):
+        """Restore the state of the tarfile from a previous snapshot
+        """
+        self._check("_aw")
+        if hasattr(self.fileobj, "restore"):
+            snapshot, self.offset, self.members = s
+            self.fileobj.restore(snapshot)
+        else:
+            offset, self.offset, self.members = s
+            self.fileobj.seek(offset)
+            self.fileobj.truncate()
+
 # class TarFile

 class TarIter:
diff -ur Python-2.4.2.orig/Modules/zlibmodule.c Python-2.4.2/Modules/zlibmodule.c
--- Python-2.4.2.orig/Modules/zlibmodule.c 2004-12-28 15:12:31.000000000 -0500
+++ Python-2.4.2/Modules/zlibmodule.c 2006-02-14 14:05:35.000000000 -0500
@@ -653,6 +653,36 @@
     return RetVal;
 }

+PyDoc_STRVAR(comp_copy__doc__,
+"copy() -- Return a copy of the compression object.");
+
+static PyObject *
+PyZlib_copy(compobject *self, PyObject *args)
+{
+    compobject *retval;
+
+    retval = newcompobject(&Comptype);
+
+    /* Copy the zstream state */
+    /* TODO: Are the ENTER / LEAVE needed? */
+    ENTER_ZLIB
+    deflateCopy(&retval->zst, &self->zst);
+    LEAVE_ZLIB
+
+    /* Make references to the original unused_data and unconsumed_tail
+     * They're not used by compression objects so we don't have to do
+     * anything special here */
+    retval->unused_data = self->unused_data;
+    retval->unconsumed_tail = self->unconsumed_tail;
+    Py_INCREF(retval->unused_data);
+    Py_INCREF(retval->unconsumed_tail);
+
+    /* Mark it as being initialized */
+    retval->is_initialised = 1;
+
+    return (PyObject*)retval;
+}
+
 PyDoc_STRVAR(decomp_flush__doc__,
 "flush() -- Return a string containing any remaining decompressed data.\n"
 "\n"
@@ -723,6 +753,8 @@
      comp_compress__doc__},
     {"flush", (binaryfunc)PyZlib_flush, METH_VARARGS,
      comp_flush__doc__},
+    {"copy", (binaryfunc)PyZlib_copy, METH_VARARGS,
+     comp_copy__doc__},
     {NULL, NULL}
 };

Feb 14 '06 #1
1 Reply


No comments?

I found a small bug in TarFile.snapshot() / restore() - they need to
save and restore self.inodes as well.
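
For reference, here is roughly what the snapshot state looks like with
self.inodes included, written as free functions against a current
Python 3 tarfile rather than as a patch. It only covers the
uncompressed case, and it relies on the undocumented TarFile
attributes offset, members and inodes (the same ones the patch
touches), so treat it as a sketch that may not hold across versions.
add_bytes is a made-up helper for the demo. As far as I can tell,
inodes is what gettarinfo() consults to detect hard links, so a stale
entry could make a later file be stored as a link to a member that was
rolled back.

```python
import io
import tarfile

def snapshot(tar):
    """Capture enough state to roll an uncompressed, writable TarFile back."""
    # offset, members and inodes are undocumented TarFile attributes.
    return tar.fileobj.tell(), tar.offset, list(tar.members), dict(tar.inodes)

def restore(tar, state):
    """Discard everything written to `tar` since `state` was taken."""
    position, tar.offset, members, inodes = state
    tar.members = list(members)
    tar.inodes = dict(inodes)
    tar.fileobj.seek(position)
    tar.fileobj.truncate()

def add_bytes(tar, name, payload):
    """Add an in-memory payload as a regular file (demo helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buffer = io.BytesIO()
tar = tarfile.open(fileobj=buffer, mode="w")
add_bytes(tar, "a.txt", b"first")
state = snapshot(tar)
add_bytes(tar, "b.txt", b"second")
restore(tar, state)              # b.txt is rolled back
tar.close()

names = tarfile.open(fileobj=io.BytesIO(buffer.getvalue())).getnames()
```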

Feb 16 '06 #2
