Bytes | Software Development & Data Engineering Community
Add a file to a compressed tarfile

Hi,

I'm trying to write a function that adds a file-like object to a
compressed tarfile... e.g. ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!
Jul 18 '05 #1
On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson <dj********@hotmail.com>
wrote:
Hi,

I'm trying to write a function that adds a file-like-object to a
compressed tarfile... eg ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

From the tarfile docs in python 2.3:-

New in version 2.3.

The tarfile module makes it possible to read and create tar archives. Some
facts and figures:

reads and writes gzip and bzip2 compressed archives.
creates POSIX 1003.1-1990 compliant or GNU tar compatible archives.
reads GNU tar extensions longname, longlink and sparse.
stores pathnames of unlimited length using GNU tar extensions.
handles directories, regular files, hardlinks, symbolic links, fifos,
character devices and block devices and is able to acquire and restore
file information like timestamp, access permissions and owner.
can handle tape devices.

open([name[, mode[, fileobj[, bufsize]]]])
Return a TarFile object for the pathname name. For detailed information on
TarFile objects, see TarFile Objects (section 7.19.1).

mode has to be a string of the form 'filemode[:compression]', it defaults
to 'r'. Here is a full list of mode combinations:

mode action
'r' Open for reading with transparent compression (recommended).
'r:' Open for reading exclusively without compression.
'r:gz' Open for reading with gzip compression.
'r:bz2' Open for reading with bzip2 compression.
'a' or 'a:' Open for appending with no compression.
'w' or 'w:' Open for uncompressed writing.
'w:gz' Open for gzip compressed writing.
'w:bz2' Open for bzip2 compressed writing.

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to
open a certain (compressed) file for reading, ReadError is raised. Use
mode 'r' to avoid this. If a compression method is not supported,
CompressionError is raised.

If fileobj is specified, it is used as an alternative to a file object
opened for name.
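A quick sketch of what the "'a:gz' is not possible" note means in practice (the file name is made up; the exact exception type raised for a compressed-append mode has varied between Python versions, so the sketch catches both plausible ones):

```python
import io
import tarfile

# Writing a gzip-compressed archive works fine...
data = b"hello"
with tarfile.open("demo.tar.gz", "w:gz") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# ...but asking for compressed append is refused outright.
try:
    tarfile.open("demo.tar.gz", "a:gz")
except (ValueError, tarfile.CompressionError) as exc:
    print("refused:", exc)
```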
HTH,
Martin.
--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

Jul 18 '05 #2
On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin
<mf********@gatwick.westerngeco.slb.com> wrote:
On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson
<dj********@hotmail.com> wrote:
<snip - quoted question>

<snip - useless info from myself>

Sorry I just re-read your message after sending my reply....

Jul 18 '05 #3
On Fri, 05 Nov 2004 13:40:22 +0000, Martin Franklin wrote:
On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin
<mf********@gatwick.westerngeco.slb.com> wrote:
On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson
<dj********@hotmail.com> wrote:
<snip - quoted question>

Sorry I just re-read your message after sending my reply....


Ahh ok... Yeah, I've already seen the docs... thanks anyway! :D

I'm currently trying to read all of the files inside the tarfile, then
write them all back. Bit of a kludge, but it should work..
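Sketched out, that read-everything-and-rewrite kludge looks something like this (a modern-Python sketch; the function name and parameters are made up, not part of the tarfile API):

```python
import io
import os
import tarfile

def append_to_compressed_tar(archive_path, name, fileobj):
    """Append a file-like object to a .tar.gz by rewriting the archive.

    tarfile cannot append to compressed archives, so read every
    existing member into memory and write them all back out.
    """
    members = []
    if os.path.exists(archive_path):
        with tarfile.open(archive_path, "r:gz") as tar:
            for info in tar.getmembers():
                extracted = tar.extractfile(info)  # None for non-regular files
                members.append((info, extracted.read() if extracted else None))

    with tarfile.open(archive_path, "w:gz") as tar:
        # Re-add the old members first...
        for info, data in members:
            tar.addfile(info, io.BytesIO(data) if data is not None else None)
        # ...then the new file.
        payload = fileobj.read()
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
```

Note this holds every member in memory at once, which is fine for small archives but not for huge ones.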

Cheers!

Dennis
Jul 18 '05 #4
Dennis Hotson <dj********@hotmail.com> writes:

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..


There isn't really any other way. A tar file is terminated by two empty
blocks. To append to a tar file you simply write a new tar file starting two
blocks from the end of the original. If it's uncompressed you just seek
back from the end and write, but if it's compressed you can't find that point
without decompressing[1]. In some cases a more time-efficient but less
space-efficient method would be to compress individual files in a directory
and then tar them up before the final distribution (or whatever you do with
your tar file)

Eddie

[1] I think, unless there's a clever way of just decompressing the last few
blocks.
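That seek-back trick for the uncompressed case is exactly what tarfile's plain append mode does for you; a minimal sketch (the file and member names are made up):

```python
import io
import tarfile

def add_bytes(tar, name, payload):
    # Illustrative helper: wrap raw bytes in a TarInfo and add them.
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Create an uncompressed archive with one member.
with tarfile.open("plain.tar", "w") as tar:
    add_bytes(tar, "one.txt", b"first file")

# Reopen in append mode: tarfile seeks back over the zero-filled
# end-of-archive blocks and writes the new member there.
with tarfile.open("plain.tar", "a") as tar:
    add_bytes(tar, "two.txt", b"second file")
```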
Jul 18 '05 #5

ed***@holyrood.ed.ac.uk (Eddie Corns) wrote:

<snip - quoted reply>

[1] I think, unless there's a clever way of just decompressing the last few
blocks.


I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.

A 'resume compression friendly' algorithm would necessarily need to
describe its internal state at the end of the byte stream. In the case
of gzip (or other similar compression algorithms), really the only way
this is reasonable is to just give an offset in the file to the last
reset/initialization. Of course the internal state must still be
regenerated from the remaining portion of the file (which may be the
entire file), so isn't really a win over just processing the entire file
again with an algorithm that discovers when/where to pick up where it
left off before.

- Josiah

Jul 18 '05 #6
Am Freitag, 5. November 2004 19:19 schrieb Josiah Carlson:
I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.


This is not entirely true... There is a full flush which is done every n bytes
(n > 100000 bytes, IIRC), and can also be forced by the programmer. In case
you do a full flush, the block which you read is complete as is up till the
point you did the flush.

From the documentation:

"""flush([mode])

All pending input is processed, and a string containing the remaining
compressed output is returned. mode can be selected from the constants
Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH
and Z_FULL_FLUSH allow compressing further strings of data and are used to
allow partial error recovery on decompression, while Z_FINISH finishes the
compressed stream and prevents compressing any more data. After calling
flush() with mode set to Z_FINISH, the compress() method cannot be called
again; the only realistic action is to delete the object."""

Anyway, the state is reset to the initial state after the full flush, so that
the next block of data is independent from the block that was flushed. So,
you might start writing after the full flush, but you'd have to make sure
that the compressed stream was of the same format specification as the one
previously written (see the compression level parameter of
compress/decompress), and you'd also have to make sure that the gzip header
is suppressed, and that the FINISH compression block correctly reflects the
data that was appended (because you basically overwrite the finish block of
the first compress).

Little example:

>>> import zlib
>>> x = zlib.compressobj(6)
>>> x
<zlib.Compress object at 0xb7e39de0>
>>> a = x.compress("hahahahahaha"*20)
>>> a += x.flush(zlib.Z_FULL_FLUSH)
>>> a
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> b = x.flush(zlib.Z_FINISH)
>>> b
'\x03\x00^\x84^9'
>>> x = zlib.compressobj(6)  # New compression object with same compression.
>>> c = x.compress("hahahahahaha"*20)
>>> c += x.flush(zlib.Z_FULL_FLUSH)
>>> c
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> d = x.flush(zlib.Z_FINISH)
>>> d
'\x03\x00^\x84^9'
>>> e = a + c[2:]  # Strip header of second block.
>>> x = zlib.decompressobj()
>>> f = x.decompress(e)
>>> len(f)  # Two times 240 = 480.
480
>>> f  # Rest stripped for clarity.
'haha...'

So, as far as this goes, it works. But:

>>> x = zlib.decompressobj()
>>> e = a + c[2:] + d
>>> f = x.decompress(e)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
zlib.error: Error -3 while decompressing: incorrect data check

You see here that if you append the new end of stream marker of the second
block (which is written by x.flush(zlib.Z_FINISH)), the data checksum is
broken, as the data checksum is always written for the entire data, but
leaving out the end of stream marker doesn't cause data-decompression to
fail.

I know too little about the internal format of a gzip file (which appends more
header data, but otherwise is just a zlib compressed stream) to tell whether
an approach such as this one would also work on gzip-files, but I presume it
should.
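One concrete property of the gzip container worth noting here (easy to verify with Python's `gzip` module): complete gzip members may simply be byte-concatenated, and decompressors process them in sequence:

```python
import gzip

# Two independently compressed gzip members, simply concatenated...
stream = gzip.compress(b"hello ") + gzip.compress(b"world")

# ...decompress as one continuous run of the original bytes.
print(gzip.decompress(stream))  # b'hello world'
```

Note that this alone doesn't make a .tar.gz appendable: the end-of-archive blocks of the first tar stream would still sit in the middle of the concatenated data, so a plain tar reader may stop there.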

Hope this little explanation helps!

Heiko.
Jul 18 '05 #7

Thanks Heiko, that's really interesting.

To tell you the truth though, I'm not that familiar with the structure of
tar or gzip files. I've got a much better idea of how it works now though.
:D

I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.

Speed isn't a huge issue in my case anyway because this is for a web app
I'm writing... It's a directory tree which allows people to download and
upload files into/from directories as well as compressed archives.

Anyway.. thanks a lot for your help. I really appreciate it. Cheers mate!
:)
Jul 18 '05 #8
Dennis Hotson wrote:
I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.


If you want a solution that allows appending files to an archive while
still using compression, take a look at FileNode, a module that has been added
to the latest PyTables package (www.pytables.org). You can see the
documentation (and tutorials) for the module here:

http://pytables.sourceforge.net/html-doc/c3616.html

It supports the zlib, ucl and lzo compressors, as well as the shuffle
compression pre-conditioner.

HTH,

Francesc Altet
Jul 18 '05 #9
