Add a file to a compressed tarfile

Dennis Hotson

Hi,

I'm trying to write a function that adds a file-like-object to a
compressed tarfile... eg ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

Jul 18 '05 #1


Martin Franklin

On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson <dj********@hotmail.com>
wrote:

Hi,

I'm trying to write a function that adds a file-like-object to a
compressed tarfile... eg ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

From the tarfile docs in python 2.3:-

New in version 2.3.

The tarfile module makes it possible to read and create tar archives. Some
facts and figures:

reads and writes gzip and bzip2 compressed archives.
creates POSIX 1003.1-1990 compliant or GNU tar compatible archives.
reads GNU tar extensions longname, longlink and sparse.
stores pathnames of unlimited length using GNU tar extensions.
handles directories, regular files, hardlinks, symbolic links, fifos,
character devices and block devices and is able to acquire and restore
file information like timestamp, access permissions and owner.
can handle tape devices.

open( [name[, mode [, fileobj[, bufsize]]]])
Return a TarFile object for the pathname name. For detailed information on
TarFile objects, see TarFile Objects (section 7.19.1).

mode has to be a string of the form 'filemode[:compression]', it defaults
to 'r'. Here is a full list of mode combinations:

mode            action
'r'             Open for reading with transparent compression (recommended).
'r:'            Open for reading exclusively without compression.
'r:gz'          Open for reading with gzip compression.
'r:bz2'         Open for reading with bzip2 compression.
'a' or 'a:'     Open for appending with no compression.
'w' or 'w:'     Open for uncompressed writing.
'w:gz'          Open for gzip compressed writing.
'w:bz2'         Open for bzip2 compressed writing.

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to
open a certain (compressed) file for reading, ReadError is raised. Use
mode 'r' to avoid this. If a compression method is not supported,
CompressionError is raised.

If fileobj is specified, it is used as an alternative to a file object
opened for name.
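
As a quick check of the mode table above, here is a minimal sketch for a current Python 3 interpreter (the quoted docs are for 2.3, so the exact exception type is an assumption and the code catches both likely candidates). The helper name try_open and the file names are made up for illustration.

```python
import io
import os
import tarfile
import tempfile

# Build a small gzip-compressed archive to experiment with.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.tar.gz")

with tarfile.open(path, "w:gz") as tar:
    info = tarfile.TarInfo("first.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))


def try_open(archive_path, mode):
    """Return "ok" if the mode is accepted, else the exception class name."""
    try:
        tarfile.open(archive_path, mode).close()
        return "ok"
    except (ValueError, tarfile.TarError) as exc:
        return type(exc).__name__


# Reading works with or without naming the compression; 'a:gz' is refused.
results = {mode: try_open(path, mode) for mode in ("r", "r:gz", "a:gz")}
print(results)
```

So the limitation is in the mode handling itself, not in the particular archive.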
HTH,
Martin.

Jul 18 '05 #2

Martin Franklin

On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin
<mf********@gatwick.westerngeco.slb.com> wrote:

On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson
<dj********@hotmail.com> wrote:
Hi,

I'm trying to write a function that adds a file-like-object to a
compressed tarfile... eg ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't
support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

<snip - useless info from myself>

Sorry I just re-read your message after sending my reply....

Jul 18 '05 #3

Dennis Hotson

On Fri, 05 Nov 2004 13:40:22 +0000, Martin Franklin wrote:

On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin
<mf********@gatwick.westerngeco.slb.com> wrote:
On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson
<dj********@hotmail.com> wrote:
Hi,

I'm trying to write a function that adds a file-like-object to a
compressed tarfile... eg ".tar.gz" or ".tar.bz2"

I've had a look at the tarfile module but the append mode doesn't
support
compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

<snip - useless info from myself>

Sorry I just re-read your message after sending my reply....

Ahh ok... Yeah, I've already seen the docs... thanks anyway! :D

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..
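
A minimal sketch of that read-everything-and-write-it-back approach, written for Python 3 rather than the 2.3 of this thread. The function name, its parameters, and the rule of picking the compression from the file extension are assumptions for illustration, not code from the original post.

```python
import io
import os
import tarfile
import tempfile


def add_fileobj_to_compressed_tar(archive_path, arcname, fileobj, size):
    """Rebuild archive_path with one extra member read from fileobj."""
    # Assumption: the extension tells us which compression to write.
    write_mode = "w:bz2" if archive_path.endswith(".bz2") else "w:gz"
    target_dir = os.path.dirname(os.path.abspath(archive_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    os.close(fd)
    try:
        with tarfile.open(archive_path, "r") as src, \
                tarfile.open(tmp_path, write_mode) as dst:
            for member in src:
                # Only regular files carry data; other members are header-only.
                data = src.extractfile(member) if member.isreg() else None
                dst.addfile(member, data)
            info = tarfile.TarInfo(arcname)
            info.size = size
            dst.addfile(info, fileobj)
        # Swap the rebuilt archive into place only after it is complete.
        os.replace(tmp_path, archive_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: start with a one-member .tar.gz, then add a second member.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
with tarfile.open(demo_path, "w:gz") as tar:
    first = tarfile.TarInfo("old.txt")
    first.size = 3
    tar.addfile(first, io.BytesIO(b"old"))

add_fileobj_to_compressed_tar(demo_path, "new.txt", io.BytesIO(b"fresh"), 5)

with tarfile.open(demo_path, "r") as tar:
    names = tar.getnames()
    new_content = tar.extractfile("new.txt").read()
print(names)
```

Writing to a temporary file in the same directory means a failure part-way through leaves the original archive untouched.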

Cheers!

Dennis

Jul 18 '05 #4

Eddie Corns

Dennis Hotson <dj********@hotmail.com> writes:

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..

There isn't really any other way. A tar file is terminated by two empty
blocks. In order to append to a tar file you simply append a new tar file two
blocks from the end of the original. If it was uncompressed you just seek
back from the end and write but if it's compressed you can't find that point
without decompressing[1]. In some cases a more time efficient but less space
efficient method would be to just compress individual files in a directory and
then tar them up before the final distribution (or whatever you do with your
tar file)
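
The layout described above can be checked with a short Python 3 sketch. The 512-byte block size is the standard tar block size; the file names and the add_bytes helper are invented for the example, and USTAR format is requested explicitly so that the first member is exactly one header block plus one data block.

```python
import io
import os
import tarfile
import tempfile

BLOCK = 512  # tar block size in bytes


def add_bytes(tar, name, data):
    """Add an in-memory byte string to an open TarFile as a regular file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


path = os.path.join(tempfile.mkdtemp(), "demo.tar")

with tarfile.open(path, "w", format=tarfile.USTAR_FORMAT) as tar:
    add_bytes(tar, "first.txt", b"hello")

with open(path, "rb") as fh:
    raw = fh.read()

# One header block plus one padded data block, then only zero bytes.
end_of_members = 2 * BLOCK
tail = raw[end_of_members:]
tail_is_zero = tail == bytes(len(tail))
print(len(tail) >= 2 * BLOCK, tail_is_zero)

# Uncompressed append: tarfile finds the end marker and writes over it.
with tarfile.open(path, "a") as tar:
    add_bytes(tar, "second.txt", b"world")

with tarfile.open(path, "r") as tar:
    names = tar.getnames()
print(names)
```

With a compressed archive those zero blocks are buried inside the compressed stream, which is why the same seek-and-overwrite trick is not available there.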

Eddie

[1] I think, unless there's a clever way of just decompressing the last few
blocks.

Jul 18 '05 #5

Josiah Carlson

ed***@holyrood.ed.ac.uk (Eddie Corns) wrote:

Dennis Hotson <dj********@hotmail.com> writes:

I'm currently trying to read all of the files inside the tarfile... then
writing them all back. Bit of a kludge, but it should work..

There isn't really any other way. A tar file is terminated by two empty
blocks. In order to append to a tar file you simply append a new tar file two
blocks from the end of the original. If it was uncompressed you just seek
back from the end and write but if it's compressed you can't find that point
without decompressing[1]. In some cases a more time efficient but less space
efficient method would be to just compress individual files in a directory and
then tar them up before the final distribution (or whatever you do with your
tar file)

Eddie

[1] I think, unless there's a clever way of just decompressing the last few
blocks.

I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.

A 'resume compression friendly' algorithm would necessarily need to
describe its internal state at the end of the byte stream. In the case
of gzip (or other similar compression algorithms), really the only way
this is reasonable is to just give an offset in the file to the last
reset/initialization. Of course the internal state must still be
regenerated from the remaining portion of the file (which may be the
entire file), so isn't really a win over just processing the entire file
again with an algorithm that discovers when/where to pick up where it
left off before.

- Josiah

Jul 18 '05 #6

Heiko Wundram

Am Freitag, 5. November 2004 19:19 schrieb Josiah Carlson:

I am not aware of any such method. I am fairly certain gzip (and the
associated zlib) does the following:

while bytes remaining:
    reset/initialize state
    while state is not crappy and bytes remaining:
        compress portion of remaining bytes
        update state

Even if one could discover the last reset/initialization of state, one
would still need to decompress the data from then on in order to
discover the two empty blocks.

This is not entirely true... There is a full flush which is done every n bytes
(n > 100000 bytes, IIRC), and can also be forced by the programmer. In case
you do a full flush, the block which you read is complete as is up till the
point you did the flush.

From the documentation:

"""flush([mode])

All pending input is processed, and a string containing the remaining
compressed output is returned. mode can be selected from the constants
Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH
and Z_FULL_FLUSH allow compressing further strings of data and are used to
allow partial error recovery on decompression, while Z_FINISH finishes the
compressed stream and prevents compressing any more data. After calling
flush() with mode set to Z_FINISH, the compress() method cannot be called
again; the only realistic action is to delete the object."""

Anyway, the state is reset to the initial state after the full flush, so that
the next block of data is independent from the block that was flushed. So,
you might start writing after the full flush, but you'd have to make sure
that the compressed stream was of the same format specification as the one
previously written (see the compression level parameter of
compress/decompress), and you'd also have to make sure that the gzip header
is suppressed, and that the FINISH compression block correctly reflects the
data that was appended (because you basically overwrite the finish block of
the first compress).

Little example:

>>> import zlib
>>> x = zlib.compressobj(6)
>>> x
<zlib.Compress object at 0xb7e39de0>
>>> a = x.compress("hahahahahaha"*20)
>>> a += x.flush(zlib.Z_FULL_FLUSH)
>>> a
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> b = x.flush(zlib.Z_FINISH)
>>> b
'\x03\x00^\x84^9'
>>> x = zlib.compressobj(6) # New compression object with same compression.
>>> c = x.compress("hahahahahaha"*20)
>>> c += x.flush(zlib.Z_FULL_FLUSH)
>>> c
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> d = x.flush(zlib.Z_FINISH)
>>> d
'\x03\x00^\x84^9'
>>> e = a+c[2:] # Strip header of second block.
>>> x = zlib.decompressobj()
>>> f = x.decompress(e)
>>> len(f)
480 # Two times 240 = 480.
>>> f
'haha...' # Rest stripped for clarity.

So, as far as this goes, it works. But:

>>> x = zlib.decompressobj()
>>> e = a+c[2:]+d
>>> f = x.decompress(e)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
zlib.error: Error -3 while decompressing: incorrect data check

You see here that if you append the new end of stream marker of the second
block (which is written by x.flush(zlib.Z_FINISH)), the data checksum is
broken, as the data checksum is always written for the entire data, but
leaving out the end of stream marker doesn't cause data-decompression to
fail.
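
For anyone repeating the experiment on Python 3, here is a rough equivalent of the session above using bytes instead of str. The last step goes beyond what was tried above and rests on an assumption: that the final four bytes of a zlib stream are the big-endian Adler-32 of all the uncompressed data, so a trailer recomputed over both pieces should be accepted. The helper name compress_parts is made up.

```python
import struct
import zlib

payload = b"hahahahahaha" * 20  # 240 bytes, same data as the session above


def compress_parts(data, level=6):
    """Return (body ending in a full flush, end-of-stream trailer)."""
    comp = zlib.compressobj(level)
    body = comp.compress(data) + comp.flush(zlib.Z_FULL_FLUSH)
    trailer = comp.flush(zlib.Z_FINISH)
    return body, trailer


a, b = compress_parts(payload)
c, d = compress_parts(payload)

# Drop the 2-byte zlib header of the second stream and splice it on.
spliced = a + c[2:]
partial = zlib.decompressobj().decompress(spliced)

# The second stream's own trailer only covers the second 240 bytes.
try:
    zlib.decompressobj().decompress(spliced + d)
    checksum_failed = False
except zlib.error:
    checksum_failed = True

# Assumption: keep the final empty block, recompute Adler-32 over everything.
fixed_trailer = d[:-4] + struct.pack(">I", zlib.adler32(payload + payload))
full = zlib.decompress(spliced + fixed_trailer)

print(len(partial), checksum_failed, len(full))
```

This only covers raw zlib streams; a gzip file uses a different header and a CRC-32 plus length trailer, so the same patching would need adjusting there.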

I know too little about the internal format of a gzip file (which appends more
header data, but otherwise is just a zlib compressed stream) to tell whether
an approach such as this one would also work on gzip-files, but I presume it
should.

Hope this little explanation helps!

Heiko.

Jul 18 '05 #7

Dennis Hotson

Thanks Heiko, that's really interesting.

To tell you the truth though, I'm not that familiar with the structure of
tar or gzip files. I've got a much better idea of how it works now though.
:D

I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.

Speed isn't a huge issue in my case anyway because this is for a web app
I'm writing... It's a directory tree which allows people to download and
upload files into/from directories as well as compressed archives.

Anyway.. thanks a lot for your help. I really appreciate it. Cheers mate!
:)

Jul 18 '05 #8

Francesc Alted

Dennis Hotson wrote:

I managed to get my function working... although it decompresses
everything and then compresses it back... Not the best, but good enough I
think.

If you want a solution that lets you append files to an archive while still
allowing compression, take a look at FileNode, a module that has been added
to the latest PyTables package (www.pytables.org). You can see the
documentation (and tutorials) for the module here:

http://pytables.sourceforge.net/html-doc/c3616.html

It supports the zlib, ucl and lzo compressors, as well as the shuffle
compression pre-conditioner.

HTH,

Francesc Altet

Jul 18 '05 #9
