
How to process a very large (4Gb) tarfile from python?

I am trying to do something with a very large tarfile from within
Python, and am running into memory constraints. The tarfile in
question is a 4-gigabyte datafile from freedb.org,
http://ftp.freedb.org/pub/freedb/ , and has about 2.5 million members
in it.

Here's a simple toy program that just goes through and counts the
number of members in the tarfile, printing a status message every N
records (N=10,000 for the smaller file; N=100,000 for the larger).

I'm finding that memory usage goes through the roof, simply iterating
over the tarfile. I'm using over 2G when I'm barely halfway through
the file. This surprises me; I'd expect the memory associated with
each iteration to be released at the end of the iteration; but
something's obviously building up.

On one system, this ends with a MemoryError exception. On another
system, it just hangs, bringing the system to its knees, to the point
that it takes a minute or so to do simple task switching.

Any suggestions to process this beast? I suppose I could just untar
the file, and process 2.5 million individual files, but I'm thinking
I'd rather process it directly if that's possible.

Here's the toy code. (One explanation about the "import tarfilex as
tarfile" statement. I'm running Activestate Python 2.5.0, and the
tarfile.py module of that vintage was buggy, to the point that it
couldn't read these files at all. I brought down the most recent
tarfile.py from http://svn.python.org/view/python/trunk/Lib/tarfile.py
and saved it as tarfilex.py. It works, at least until I start
processing some very large files, anyway.)

import tarfilex as tarfile
import os, time

SOURCEDIR = "F:/Installs/FreeDB/"
smallfile = "freedb-update-20080601-20080708.tar"  # 63M file
smallint = 10000
bigfile = "freedb-complete-20080708.tar"  # 4,329M file
bigint = 100000

TARFILENAME, INTERVAL = smallfile, smallint
# TARFILENAME, INTERVAL = bigfile, bigint

def filetype(filename):
    return os.path.splitext(filename)[1]

def memusage(units="M"):
    import win32process
    current_process = win32process.GetCurrentProcess()
    memory_info = win32process.GetProcessMemoryInfo(current_process)
    bytes = 1
    Kbytes = 1024*bytes
    Mbytes = 1024*Kbytes
    Gbytes = 1024*Mbytes
    unitfactors = {'B':1, 'K':Kbytes, 'M':Mbytes, 'G':Gbytes}
    return memory_info["WorkingSetSize"]//unitfactors[units]

def opentar(filename):
    modes = {".tar":"r", ".gz":"r:gz", ".bz2":"r:bz2"}
    openmode = modes[filetype(filename)]
    openedfile = tarfile.open(filename, openmode)
    return openedfile

TFPATH = SOURCEDIR + '/' + TARFILENAME
assert os.path.exists(TFPATH)
assert tarfile.is_tarfile(TFPATH)

tf = opentar(TFPATH)
count = 0
print "%s memory: %sM count: %s (starting)" % (time.asctime(), memusage(), count)
for tarinfo in tf:
    count += 1
    if count % INTERVAL == 0:
        print "%s memory: %sM count: %s" % (time.asctime(), memusage(), count)
print "%s memory: %sM count: %s (completed)" % (time.asctime(), memusage(), count)
Results with the smaller (63M) file:

Thu Jul 17 00:18:21 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:18:23 2008 memory: 18M count: 10000
Thu Jul 17 00:18:26 2008 memory: 32M count: 20000
Thu Jul 17 00:18:28 2008 memory: 46M count: 30000
Thu Jul 17 00:18:30 2008 memory: 55M count: 36128 (completed)
Results with the larger (4.3G) file:

Thu Jul 17 00:18:47 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:19:40 2008 memory: 146M count: 100000
Thu Jul 17 00:20:41 2008 memory: 289M count: 200000
Thu Jul 17 00:21:41 2008 memory: 432M count: 300000
Thu Jul 17 00:22:42 2008 memory: 574M count: 400000
Thu Jul 17 00:23:47 2008 memory: 717M count: 500000
Thu Jul 17 00:24:49 2008 memory: 860M count: 600000
Thu Jul 17 00:25:51 2008 memory: 1002M count: 700000
Thu Jul 17 00:26:54 2008 memory: 1145M count: 800000
Thu Jul 17 00:27:59 2008 memory: 1288M count: 900000
Thu Jul 17 00:29:03 2008 memory: 1430M count: 1000000
Thu Jul 17 00:30:07 2008 memory: 1573M count: 1100000
Thu Jul 17 00:31:11 2008 memory: 1716M count: 1200000
Thu Jul 17 00:32:15 2008 memory: 1859M count: 1300000
Thu Jul 17 00:33:23 2008 memory: 2001M count: 1400000
Traceback (most recent call last):
File "C:\test\freedb \tardemo.py", line 40, in <module>
for tarinfo in tf:
File "C:\test\freedb \tarfilex.py", line 2406, in next
tarinfo = self.tarfile.ne xt()
File "C:\test\freedb \tarfilex.py", line 2311, in next
tarinfo = self.tarinfo.fr omtarfile(self)
File "C:\test\freedb \tarfilex.py", line 1235, in fromtarfile
obj = cls.frombuf(buf )
File "C:\test\freedb \tarfilex.py", line 1193, in frombuf
if chksum not in calc_chksums(bu f):
File "C:\test\freedb \tarfilex.py", line 261, in calc_chksums
unsigned_chksum = 256 + sum(struct.unpa ck("148B", buf[:148]) +
struct.unpack(" 356B", buf[156:512]))
MemoryError
Jul 17 '08 #1
On 17 Jul., 10:01, Terry Carroll <carr...@nospam-tjc.com> wrote:
>I am trying to do something with a very large tarfile from within
>Python, and am running into memory constraints. The tarfile in
>question is a 4-gigabyte datafile from freedb.org,
>http://ftp.freedb.org/pub/freedb/ , and has about 2.5 million members
>in it.
>[...]
>Any suggestions to process this beast? I suppose I could just untar
>the file, and process 2.5 million individual files, but I'm thinking
>I'd rather process it directly if that's possible.
I had a look at tarfile.py in my current Python 2.5 installation's
lib path. The iterator caches TarInfo objects in a list,
tf.members. If you only want to iterate and are not interested
in more functionality, you can set "tf.members = []" inside
your loop. This is a dirty hack!
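
For concreteness, here is that hack dropped into the toy loop, as a
minimal sketch in Python 2 to match the code above (the filename is
the one from the original post):

import tarfilex as tarfile

tf = tarfile.open("freedb-complete-20080708.tar", "r")
count = 0
for tarinfo in tf:
    count += 1
    # The iterator appends every TarInfo it reads to tf.members;
    # emptying the list on each pass keeps the working set flat.
    tf.members = []
print "total members:", count
tf.close()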

Greetings, Uwe
Jul 17 '08 #2
On Thu, 17 Jul 2008 06:14:45 -0700 (PDT), Uwe Schmitt
<ro*************@googlemail.com> wrote:
>I had a look at tarfile.py in my current Python 2.5 installation's
>lib path. The iterator caches TarInfo objects in a list,
>tf.members. If you only want to iterate and are not interested
>in more functionality, you can set "tf.members = []" inside
>your loop. This is a dirty hack!
Thanks, Uwe. That works fine for me. It now reads through all 2.5
million members, in about 30 minutes, never going above a 4M working
set.
Jul 17 '08 #3
On 17 Jul., 17:55, Terry Carroll <carr...@nospam-tjc.com> wrote:
>Thanks, Uwe. That works fine for me. It now reads through all 2.5
>million members, in about 30 minutes, never going above a 4M working
>set.
Maybe we should post this issue to python-dev mailing list.
Parsing large tar-files is not uncommon.

Greetings, Uwe
Jul 17 '08 #4
On Thu, Jul 17, 2008 at 10:39:23AM -0700, Uwe Schmitt wrote:
>Maybe we should post this issue to python-dev mailing list.
>Parsing large tar-files is not uncommon.
This issue is known and was fixed for Python 3.0, see
http://bugs.python.org/issue2058.

--
Lars Gustäbel
la**@gustaebel.de

It is not enough to have no thoughts;
one must also be incapable of expressing them.
(anonymous)
Jul 17 '08 #5
On 17 Jul., 22:21, Lars Gustäbel <l...@gustaebel.de> wrote:
>Maybe we should post this issue to python-dev mailing list.
>Parsing large tar-files is not uncommon.

This issue is known and was fixed for Python 3.0, see
http://bugs.python.org/issue2058.
The proposed patch does not avoid caching the previous values of the
iterator, it just reduces the size of each cached object.
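
A common way to shrink small Python objects is __slots__, which drops
the per-instance __dict__. Whether the issue-2058 patch uses exactly
this mechanism is an assumption on my part, but a minimal sketch shows
the idea:

class PlainInfo(object):
    def __init__(self, name, size):
        self.name = name          # attributes live in a per-instance __dict__
        self.size = size

class SlottedInfo(object):
    __slots__ = ("name", "size")  # fixed slots, no per-instance __dict__
    def __init__(self, name, size):
        self.name = name
        self.size = size

With millions of cached TarInfo objects the saved dict overhead adds
up, but the cache itself still grows with the archive.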
It would be nice to be able to avoid caching on demand, which would
make iteration independent of the size of the tar file.
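
One way to get that behaviour today, without patching tarfile itself,
is a small wrapper that empties the cache as it goes. A sketch,
assuming the TarFile.next()/members interface of the 2.5-era module:

def iter_tar(tf):
    """Yield each TarInfo while keeping tf.members empty."""
    while True:
        tarinfo = tf.next()    # reads the next header block
        if tarinfo is None:    # end of archive
            break
        tf.members = []        # discard whatever next() cached
        yield tarinfo

# usage:
# for tarinfo in iter_tar(opentar(TFPATH)):
#     ... process tarinfo ...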

Greetings Uwe
Jul 18 '08 #6
Due to the discussion above I wrote a Python module for scanning
large tarfiles.
You can get it from http://www.procoders.net/wp-content/tarfile_scanner.zip

Greetings, Uwe

Jul 18 '08 #7
