Bytes | Software Development & Data Engineering Community
How to process a very large (4Gb) tarfile from python?

I am trying to do something with a very large tarfile from within
Python, and am running into memory constraints. The tarfile in
question is a 4-gigabyte datafile from freedb.org,
http://ftp.freedb.org/pub/freedb/ , and has about 2.5 million members
in it.

Here's a simple toy program that just goes through and counts the
number of members in the tarfile, printing a status message every N
records (N=10,000 for the smaller file; N=100,000 for the larger).

I'm finding that memory usage goes through the roof, simply iterating
over the tarfile. I'm using over 2G when I'm barely halfway through
the file. This surprises me; I'd expect the memory associated with
each iteration to be released at the end of the iteration; but
something's obviously building up.

On one system, this ends with a MemoryError exception. On another
system, it just hangs, bringing the system to its knees, to the point
that it takes a minute or so to do simple task switching.

Any suggestions to process this beast? I suppose I could just untar
the file, and process 2.5 million individual files, but I'm thinking
I'd rather process it directly if that's possible.

Here's the toy code. (One explanation about the "import tarfilex as
tarfile" statement. I'm running Activestate Python 2.5.0, and the
tarfile.py module of that vintage was buggy, to the point that it
couldn't read these files at all. I brought down the most recent
tarfile.py from http://svn.python.org/view/python/trunk/Lib/tarfile.py
and saved it as tarfilex.py. It works, at least until I start
processing very large files.)

import tarfilex as tarfile
import os, time

SOURCEDIR = "F:/Installs/FreeDB/"
smallfile = "freedb-update-20080601-20080708.tar"  # 63M file
smallint = 10000
bigfile = "freedb-complete-20080708.tar"  # 4,329M file
bigint = 100000

TARFILENAME, INTERVAL = smallfile, smallint
# TARFILENAME, INTERVAL = bigfile, bigint

def filetype(filename):
    return os.path.splitext(filename)[1]

def memusage(units="M"):
    import win32process
    current_process = win32process.GetCurrentProcess()
    memory_info = win32process.GetProcessMemoryInfo(current_process)
    bytes = 1
    Kbytes = 1024*bytes
    Mbytes = 1024*Kbytes
    Gbytes = 1024*Mbytes
    unitfactors = {'B':1, 'K':Kbytes, 'M':Mbytes, 'G':Gbytes}
    return memory_info["WorkingSetSize"]//unitfactors[units]

def opentar(filename):
    modes = {".tar":"r", ".gz":"r:gz", ".bz2":"r:bz2"}
    openmode = modes[filetype(filename)]
    openedfile = tarfile.open(filename, openmode)
    return openedfile

TFPATH = SOURCEDIR + '/' + TARFILENAME
assert os.path.exists(TFPATH)
assert tarfile.is_tarfile(TFPATH)

tf = opentar(TFPATH)
count = 0
print "%s memory: %sM count: %s (starting)" % (time.asctime(), memusage(), count)
for tarinfo in tf:
    count += 1
    if count % INTERVAL == 0:
        print "%s memory: %sM count: %s" % (time.asctime(), memusage(), count)
print "%s memory: %sM count: %s (completed)" % (time.asctime(), memusage(), count)
Results with the smaller (63M) file:

Thu Jul 17 00:18:21 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:18:23 2008 memory: 18M count: 10000
Thu Jul 17 00:18:26 2008 memory: 32M count: 20000
Thu Jul 17 00:18:28 2008 memory: 46M count: 30000
Thu Jul 17 00:18:30 2008 memory: 55M count: 36128 (completed)
Results with the larger (4.3G) file:

Thu Jul 17 00:18:47 2008 memory: 4M count: 0 (starting)
Thu Jul 17 00:19:40 2008 memory: 146M count: 100000
Thu Jul 17 00:20:41 2008 memory: 289M count: 200000
Thu Jul 17 00:21:41 2008 memory: 432M count: 300000
Thu Jul 17 00:22:42 2008 memory: 574M count: 400000
Thu Jul 17 00:23:47 2008 memory: 717M count: 500000
Thu Jul 17 00:24:49 2008 memory: 860M count: 600000
Thu Jul 17 00:25:51 2008 memory: 1002M count: 700000
Thu Jul 17 00:26:54 2008 memory: 1145M count: 800000
Thu Jul 17 00:27:59 2008 memory: 1288M count: 900000
Thu Jul 17 00:29:03 2008 memory: 1430M count: 1000000
Thu Jul 17 00:30:07 2008 memory: 1573M count: 1100000
Thu Jul 17 00:31:11 2008 memory: 1716M count: 1200000
Thu Jul 17 00:32:15 2008 memory: 1859M count: 1300000
Thu Jul 17 00:33:23 2008 memory: 2001M count: 1400000
Traceback (most recent call last):
  File "C:\test\freedb\tardemo.py", line 40, in <module>
    for tarinfo in tf:
  File "C:\test\freedb\tarfilex.py", line 2406, in next
    tarinfo = self.tarfile.next()
  File "C:\test\freedb\tarfilex.py", line 2311, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "C:\test\freedb\tarfilex.py", line 1235, in fromtarfile
    obj = cls.frombuf(buf)
  File "C:\test\freedb\tarfilex.py", line 1193, in frombuf
    if chksum not in calc_chksums(buf):
  File "C:\test\freedb\tarfilex.py", line 261, in calc_chksums
    unsigned_chksum = 256 + sum(struct.unpack("148B", buf[:148]) +
        struct.unpack("356B", buf[156:512]))
MemoryError
Jul 17 '08 #1
On 17 Jul., 10:01, Terry Carroll <carr...@nospam-tjc.com> wrote:
> [original post quoted in full; trimmed]
I had a look at tarfile.py in my current Python 2.5 installation's
lib path. The iterator caches TarInfo objects in the list
tf.members. If you only want to iterate and are not interested
in more functionality, you can set "tf.members = []" inside
your loop. This is a dirty hack!
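Applied to the toy program, the change is one line in the loop body. Here is a self-contained sketch that builds a small in-memory archive instead of the 4 GB file (Python 3 syntax, unlike the 2.5 code above; note that tf.members is an undocumented implementation detail, so this may break with other tarfile versions):

```python
import io
import tarfile

# Build a small in-memory archive so the sketch runs standalone.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    for i in range(100):
        data = b"payload"
        info = tarfile.TarInfo("member%d" % i)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
buf.seek(0)

tf = tarfile.open(fileobj=buf, mode="r")
count = 0
for tarinfo in tf:
    count += 1
    tf.members = []   # the "dirty hack": throw away the cached TarInfo list
print(count)            # 100
print(len(tf.members))  # 0 -- nothing accumulates between iterations
```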

Greetings, Uwe
Jul 17 '08 #2
On Thu, 17 Jul 2008 06:14:45 -0700 (PDT), Uwe Schmitt
<ro*************@googlemail.com> wrote:
> I had a look at tarfile.py in my current Python 2.5 installation's
> lib path. The iterator caches TarInfo objects in the list
> tf.members. If you only want to iterate and are not interested
> in more functionality, you can set "tf.members = []" inside
> your loop. This is a dirty hack!
Thanks, Uwe. That works fine for me. It now reads through all 2.5
million members, in about 30 minutes, never going above a 4M working
set.
Jul 17 '08 #3
On 17 Jul., 17:55, Terry Carroll <carr...@nospam-tjc.com> wrote:
> On Thu, 17 Jul 2008 06:14:45 -0700 (PDT), Uwe Schmitt
> <rocksportroc...@googlemail.com> wrote:
> > [the tf.members suggestion quoted above; trimmed]
>
> Thanks, Uwe. That works fine for me. It now reads through all 2.5
> million members, in about 30 minutes, never going above a 4M working
> set.

Maybe we should post this issue to the python-dev mailing list.
Parsing large tar-files is not uncommon.

Greetings, Uwe
Jul 17 '08 #4
On Thu, Jul 17, 2008 at 10:39:23AM -0700, Uwe Schmitt wrote:
> On 17 Jul., 17:55, Terry Carroll <carr...@nospam-tjc.com> wrote:
> > Thanks, Uwe. That works fine for me. It now reads through all 2.5
> > million members, in about 30 minutes, never going above a 4M working
> > set.
>
> Maybe we should post this issue to the python-dev mailing list.
> Parsing large tar-files is not uncommon.
This issue is known and was fixed for Python 3.0, see
http://bugs.python.org/issue2058.
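For context, the fix in that issue does not drop the cache; it shrinks each cached TarInfo by giving the class __slots__, which suppresses the per-instance attribute dict. A standalone illustration of the mechanism (the class names here are invented for the demo; Python 3 syntax):

```python
import sys

class PlainInfo:
    """Ordinary class: every instance carries its own __dict__."""
    def __init__(self, name):
        self.name = name

class SlottedInfo:
    """Slotted class: attributes live in fixed slots, no __dict__."""
    __slots__ = ("name",)
    def __init__(self, name):
        self.name = name

plain = PlainInfo("a")
slotted = SlottedInfo("a")
print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False
# The saving is per instance, so it multiplies across millions of members.
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
      > sys.getsizeof(slotted))      # True
```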

--
Lars Gustäbel
la**@gustaebel.de

It is not enough merely to have no thoughts;
one must also be incapable of expressing them.
(anonymous)
Jul 17 '08 #5
On 17 Jul., 22:21, Lars Gustäbel <l...@gustaebel.de> wrote:
> > Maybe we should post this issue to the python-dev mailing list.
> > Parsing large tar-files is not uncommon.
>
> This issue is known and was fixed for Python 3.0, see
> http://bugs.python.org/issue2058.
The proposed patch does not avoid caching the previous values of the
iterator; it just reduces the size of each cached object.
It would be nice to be able to avoid caching on demand, which would
make iteration independent of the size of the tar file.
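One way to approximate that with today's tarfile (a sketch, not the proposed patch: it simply automates the tf.members hack discussed above, and relies on the same undocumented implementation detail) is a generator that empties the cache as it iterates:

```python
import io
import tarfile

def iter_tarfile(tf):
    """Yield each TarInfo while keeping tf.members empty, so memory
    use stays flat no matter how large the archive is."""
    while True:
        tarinfo = tf.next()   # read the next member header
        if tarinfo is None:   # next() returns None at end of archive
            break
        tf.members = []       # drop the cache that next() builds up
        yield tarinfo

# Demo on a tiny in-memory archive (Python 3 syntax).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    for name in ("a.txt", "b.txt", "c.txt"):
        data = name.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
buf.seek(0)

tf = tarfile.open(fileobj=buf, mode="r")
names = [ti.name for ti in iter_tarfile(tf)]
print(names)            # ['a.txt', 'b.txt', 'c.txt']
print(len(tf.members))  # 0
```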

Greetings Uwe
Jul 18 '08 #6
Following the discussion above, I wrote a Python module for scanning
large tarfiles. You can get it from
http://www.procoders.net/wp-content/tarfile_scanner.zip

Greetings, Uwe

Jul 18 '08 #7

This discussion thread is closed

Replies have been disabled for this discussion.

