Is there no compression support for large sized strings in Python?

Claudio Grondi

What started as a simple test of whether it is faster to load uncompressed data
directly from the hard disk or to load compressed data and decompress it
(Windows XP SP 2, Pentium 4 3.0 GHz system with 3 GByte RAM) seems to show that
none of the compression libraries available in Python really works for large
(i.e. 500 MByte) strings.

Test the provided code and see for yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large strings in Python?
Am I doing something wrong here?
Is there a theoretical upper limit on the string size that each of the
compression libraries can process, and if so, what is it?

The only limit I know about is 2 GByte for the python.exe process itself,
but this seems not to be the actual problem in this case.
There are also some other strange effects when trying to create large
strings using the following code:
m = 'm'*1048576
# str1024MB = 1024*m # fails with memory error, but:
str512MB_01 = 512*m # works ok
# str512MB_02 = 512*m # fails with memory error, but:
str256MB_01 = 256*m # works ok
str256MB_02 = 256*m # works ok
and so on, down to allocating each single MByte in a separate string, which
pushes python.exe to the observed upper limit of 2.065.352 KByte reported by
the Windows Task Manager.
Is the question of why the str1024MB = 1024*m instruction failed, when the
memory is apparently there and the target size of 1 GByte can be reached,
outside the scope of this thread, or is it the same problem that causes
the compression libraries to fail? And why is no memory error raised in those
cases?

Any hints towards understanding what is going on and why and/or towards a
workaround are welcome.

Claudio

============================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py

strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

print
print ' Created files: '
print ' %s \n %s \n %s \n %s' %(
r'c:\strSize500MB.dat'
,r'c:\strSize500MBCompressed.zlib'
,r'c:\strSize500MBCompressed.pylzma'
,r'c:\strSize500MBCompressed.bz2'
)

raw_input(' EXIT with Enter /> ')

============================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time

startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print ' loading uncompressed data from file: %7.3f seconds'%(time.clock()-startTime,)

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import zlib
try:
    startTime = time.clock()
    strSize500MB = zlib.decompress(strSize500MBCompressed)
    print 'decompressing zlib data: %7.3f seconds'%(time.clock()-startTime,)
except:
    print 'decompressing zlib data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import pylzma
try:
    startTime = time.clock()
    strSize500MB = pylzma.decompress(strSize500MBCompressed)
    print 'decompressing pylzma data: %7.3f seconds'%(time.clock()-startTime,)
except:
    print 'decompressing pylzma data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds'%(time.clock()-startTime,)
import bz2
try:
    startTime = time.clock()
    strSize500MB = bz2.decompress(strSize500MBCompressed)
    print 'decompressing bz2 data: %7.3f seconds'%(time.clock()-startTime,)
except:
    print 'decompressing bz2 data FAILED'

raw_input(' EXIT with Enter /> ')

Dec 1 '05 #1

Fredrik Lundh

Claudio Grondi wrote:

What started as a simple test if it is better to load uncompressed data
directly from the harddisk or
load compressed data and uncompress it (Windows XP SP 2, Pentium4 3.0 GHz
system with 3 GByte RAM)
seems to show that none of the in Python available compression libraries
really works for large sized
(i.e. 500 MByte) strings.

Test the provided code and see yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?

you're probably measuring windows' memory management rather than the
compression libraries themselves (Python delegates all memory allocations
>256 bytes to the system).

I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.
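
A minimal sketch of what incremental processing could look like for zlib and bz2, written for a current Python 3 interpreter (where the data is `bytes`) rather than the Python 2.4 used in this thread. The helper name `compress_stream`, the chunk size, and the small in-memory stand-in for the 500 MByte file are assumptions made for illustration; pylzma is left out because its streaming interface is not shown in this thread.

```python
import bz2
import io
import zlib

CHUNK_SIZE = 256 * 1024  # arbitrary chunk size chosen for this sketch


def compress_stream(src, dst, compressor, chunk_size=CHUNK_SIZE):
    """Feed src to the compressor chunk by chunk and write the output to dst."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(compressor.compress(chunk))
    # flush() emits whatever the compressor is still buffering internally
    dst.write(compressor.flush())


# Small in-memory stand-in for the large test file (hypothetical data)
data = b'1234567890' * 200000

zlib_out = io.BytesIO()
compress_stream(io.BytesIO(data), zlib_out, zlib.compressobj())

bz2_out = io.BytesIO()
compress_stream(io.BytesIO(data), bz2_out, bz2.BZ2Compressor())
```

With real files, `src` and `dst` would be file objects opened in binary mode, so that only one chunk of input and its compressed output are held in memory at any time.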

</F>

Dec 1 '05 #2

jepler

On this system (Linux 2.6.x, AMD64, 2 GB RAM, python2.4) I am able to
construct a 1 GB string by repetition, as well as compress a 512MB
string with gzip in one gulp.

$ cat claudio.py
s = '1234567890'*(1048576*50)

import zlib
c = zlib.compress(s)
print len(c)
open("/tmp/claudio.gz", "wb").write(c)

$ python claudio.py
1017769

$ python -c 'print len("m" * (1048576*1024))'
1073741824

I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).

$ python -c 'print len("m" * 1024*1024*1024)'
1073741824

I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.

Jeff

Dec 1 '05 #3

Claudio Grondi

"Fredrik Lundh" <fr*****@pythonware.com> wrote in message
news:ma***************************************@python.org...

Claudio Grondi wrote:
What started as a simple test if it is better to load uncompressed data
directly from the harddisk or
load compressed data and uncompress it (Windows XP SP 2, Pentium4 3.0 GHz system with 3 GByte RAM)
seems to show that none of the in Python available compression libraries
really works for large sized
(i.e. 500 MByte) strings.

Test the provided code and see yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?

you're probably measuring windows' memory management rather than the
compression libraries themselves (Python delegates all memory allocations
>256 bytes to the system).

I suggest using incremental (streaming) processing instead; from what I can
tell, all three libraries support that.

</F>

Have solved the problem with bz2 compression the way Fredrik suggested:

fObj = file(r'd:\strSize500MBCompressed.bz2', 'wb')
import bz2
objBZ2Compressor = bz2.BZ2Compressor()
lstCompressBz2 = []
for indx in range(0, len(strSize500MB), 1048576):
    lowerIndx = indx
    upperIndx = indx+1048576
    if(upperIndx > len(strSize500MB)): upperIndx = len(strSize500MB)
    lstCompressBz2.append(objBZ2Compressor.compress(strSize500MB[lowerIndx:upperIndx]))
#:for
lstCompressBz2.append(objBZ2Compressor.flush())
strSize500MBCompressed = ''.join(lstCompressBz2)
fObj.write(strSize500MBCompressed)
fObj.close()

:-)
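
The same chunked idea can be sketched for decompression. This is only an illustration under assumptions: it targets a current Python 3 interpreter, the helper name `decompress_chunks` and the small test data are made up, and it was not run against the 500 MByte case discussed here.

```python
import bz2
import zlib


def decompress_chunks(compressed, decompressor, chunk_size=1024 * 1024):
    """Yield decompressed pieces while feeding the input chunk by chunk."""
    for start in range(0, len(compressed), chunk_size):
        piece = decompressor.decompress(compressed[start:start + chunk_size])
        if piece:
            yield piece


# Small stand-in for the large string (hypothetical data)
original = b'1234567890' * 200000
packed = bz2.compress(original)

total = 0
for piece in decompress_chunks(packed, bz2.BZ2Decompressor(), chunk_size=4096):
    total += len(piece)  # process or write each piece instead of joining them

# zlib.decompressobj() offers the same decompress() method
restored = b''.join(decompress_chunks(zlib.compress(original), zlib.decompressobj()))
```

Note that with highly compressible data a small compressed chunk can expand into a very large piece, so the memory saving depends on how the output is consumed.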

So I suppose that the decompression problems can be solved the same way, but
this still does not answer for me what the core of the problem was, how to
avoid it, and which memory request limits should be considered when working
with large strings.
Is it really the case that on systems other than Windows 2000/XP there is no
problem with the original code I provided?
Maybe that is a good reason to go for Linux instead of Windows? Do e.g. Suse or
Mandriva Linux also have a limit on the memory a single Python process can use?
Please let me know about your experience.

Claudio

Dec 1 '05 #4

Gerald Klix

Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.
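
A rough sketch of how mmap might be combined with incremental compression, assuming a current Python 3 interpreter. The temporary file is a hypothetical stand-in for the large data file, and the chunk size is arbitrary. Mapping a whole file still consumes address space, so on a 32-bit process this alone does not lift the 2 GB restriction.

```python
import mmap
import os
import tempfile
import zlib

CHUNK = 1024 * 1024  # arbitrary slice size for this sketch

# Hypothetical stand-in for the large data file
payload = b'1234567890' * 300000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

compressor = zlib.compressobj()
compressed_parts = []
with open(path, 'rb') as f:
    # Length 0 maps the whole file; the OS loads pages lazily on access
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for start in range(0, len(mapped), CHUNK):
            # Slicing the map copies only this one chunk into a bytes object
            compressed_parts.append(compressor.compress(mapped[start:start + CHUNK]))
        compressed_parts.append(compressor.flush())
    finally:
        mapped.close()
os.remove(path)

compressed = b''.join(compressed_parts)
```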

HTH,
Gerald


Dec 1 '05 #5

Christophe

Gerald Klix wrote:

Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

As a matter of fact, it's Windows that reserves the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.

Dec 1 '05 #6

Fredrik Lundh

Christophe wrote:

Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

As a matter of fact, it's Windows that reserves the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.

random raymond chen link:

http://blogs.msdn.com/oldnewthing/ar...05/208908.aspx

</F>

Dec 1 '05 #7

Claudio Grondi

I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).
$ python -c 'print len("m" * 1024*1024*1024)'
1073741824
I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.
Jeff
----------

Here is my experience hunting for the memory limit exactly the way Jeff
did (Windows 2000, Intel Pentium4, 3GB RAM, Python 2.4.2):

\>python -c "print len('m' * 1024*1024*1024)"
1073741824

\>python -c "print len('m' * 1136*1024*1024)"
1191182336

\>python -c "print len('m' * 1236*1024*1024)"
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError

Anyone on a big Linux machine able to do e.g. :
\>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?

I suppose that on the dual Intel Xeon, even with 8 GByte RAM, the upper
limit for available memory will not be larger than 4 GByte.
Can someone point me to an Intel-compatible PC which is able to provide more
than 4 GByte of RAM to Python?

Claudio

Dec 1 '05 #8

Harald Karner

Claudio Grondi wrote:

Anyone on a big Linux machine able to do e.g. :
\>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?

I tried on a Sun with 16GB Ram (Python 2.3.2)
seems like 2GB is the limit for string size:

python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
  File "<string>", line 1, in ?
OverflowError: repeated string is too long
python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647

Dec 1 '05 #9

Fredrik Lundh

Harald Karner wrote:

I tried on a Sun with 16GB Ram (Python 2.3.2)
seems like 2GB is the limit for string size:
python -c "print len('m' * 2048*1024*1024)"

Traceback (most recent call last):
File "<string>", line 1, in ?
OverflowError: repeated string is too long
python -c "print len('m' * ((2048*1024*1024)-1))"

2147483647

the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...
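
As a small worked check, assuming a 32-bit C int as on the systems discussed in this thread, the largest value such a field can hold matches the 2147483647 reported above. The variable names are illustrative only.

```python
import struct

INT_BITS = 32  # assumption: C int is 32 bits wide
max_int = 2 ** (INT_BITS - 1) - 1
print(max_int)  # 2147483647

# '=i' is a standard-size (4 byte) signed int: max_int fits, max_int + 1 does not
struct.pack('=i', max_int)
overflowed = False
try:
    struct.pack('=i', max_int + 1)
except struct.error:
    overflowed = True
```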

anyone out there with an ILP64 system?

</F>

Dec 1 '05 #10

Claudio Grondi

"Harald Karner" <ha***********@a1.net> wrote in message
news:ne*********************@inet.ecofinance.com...

Claudio Grondi wrote:
Anyone on a big Linux machine able to do e.g. :
\>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?

I tried on a Sun with 16GB Ram (Python 2.3.2)
seems like 2GB is the limit for string size:
> python -c "print len('m' * 2048*1024*1024)"

Traceback (most recent call last):
File "<string>", line 1, in ?
OverflowError: repeated string is too long
> python -c "print len('m' * ((2048*1024*1024)-1))"

2147483647

In this context I am very curious how many such
2 GByte strings it is possible to create within a
single Python process,
i.e. at which of the following lines, executed
as one script, is there a memory error?

dataStringA = 'A'*((2048*1024*1024)-1) # 2 GByte
dataStringB = 'B'*((2048*1024*1024)-1) # 4 GByte
dataStringC = 'C'*((2048*1024*1024)-1) # 6 GByte
dataStringD = 'D'*((2048*1024*1024)-1) # 8 GByte
dataStringE = 'E'*((2048*1024*1024)-1) # 10 GByte
dataStringF = 'F'*((2048*1024*1024)-1) # 12 GByte
dataStringG = 'G'*((2048*1024*1024)-1) # 14 GByte

leave 2 GByte for the system on a 16 GByte machine ... ;-)

Claudio

Dec 1 '05 #11

Claudio Grondi

the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */

If this is what you mean,

#define PyObject_VAR_HEAD \
PyObject_HEAD \
int ob_size; /* Number of items in variable part */

and if I understand it properly
(i.e. that all Python types are derived from Python objects),
then the unlimited-size integers are also limited to integers which
fit into 2 GByte of memory, right?
And a list or dictionary is likewise not designed to have
more than 2 Giga elements, etc.

So the question which still remains open is: can Python by design
handle an address space larger than 2 GByte?

I can't check it myself, being on a Windows system which
already limits a single process to this address space.
With lists I hit the memory limit at around:
python -c "print len(280*1024*1024*[None])"
(where the memory required for this list is around 1.15 GByte
or more - on Windows 2000, Pentium4,
with 3GByte RAM and Python 2.4.2).
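
The figure can be cross-checked with simple arithmetic, assuming a 32-bit build where a list stores one 4-byte pointer per element (all elements here point at the single shared None object). The constant names are illustrative.

```python
POINTER_BYTES = 4  # assumption: 32-bit build, one pointer per list slot
n_items = 280 * 1024 * 1024

pointer_array_bytes = n_items * POINTER_BYTES
print(pointer_array_bytes)                # 1174405120
print(pointer_array_bytes / 1024.0 ** 3)  # 1.09375

# The element count itself is still far below the C int limit
below_int_limit = n_items < 2 ** 31 - 1
```

So the list needs roughly 1.1 to 1.2 GByte as one contiguous block, which suggests the limit hit here is available address space and not the size of ob_size.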

Claudio

Dec 2 '05 #12

Alex Martelli

Claudio Grondi <cl************@freenet.de> wrote:
...

In this context I am very curious how many of such
2 GByte strings is it possible to create within a
single Python process?

VM (Virtual Memory) may make the issue difficult to answer precisely.

With a Python build for 64-bit addressing (and running, of course, on a
64-bit machine), you could go on for a long time. If your virtual
memory space is large enough (say a nice entire terabyte RAID diskset),
and you don't use resource limiting to throttle the process, you could
be thrashing (with about 1000 GB of VM backed by only 14 GB of physical
RAM, I predict *LOTS AND LOTS* of disk activity!) for a very, very long
time before you finally get an out-of-memory error.

Change the parameters and the answer will change, of course -- Python
has relatively little to do with it, as you can build it for either
64-bit or 32-bit addressing, on suitable CPUs; the OS's VM
implementation (and of course the CPU) essentially dominate this
"problem space".
Alex
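
To see which kind of build is actually running, a quick check along these lines may help. It assumes an interpreter that provides `sys.maxsize` (newer than the versions used in this thread); `struct.calcsize('P')` is the size of a C pointer in bytes.

```python
import struct
import sys

# Pointer width of the running interpreter, in bits
pointer_bits = struct.calcsize('P') * 8
print(pointer_bits)

# sys.maxsize is the largest value a container length can take on this build
print(sys.maxsize)
is_64bit = sys.maxsize > 2 ** 32
```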

Dec 2 '05 #13

Claudio Grondi

"Gerald Klix" <Ge*********@klix.ch> wrote in message
news:ma***************************************@python.org...

Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

HTH,
Gerald

>>> objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in -toplevel-
    objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
OverflowError: memory mapped size is too large (limited by C int)
>>> os.fstat(fHdl)[6]
4498001104L

The max. allowed value here is 256*256*256*128-1,
i.e. 2147483647

The 'Jurassic' era sends its regards in Python, too.

The only existing 'workaround' seems to be
to go for a 64-bit machine with a 64-bit Python version.

Is there no other known way?
Can the Python code not be adjusted so that C long long is used instead of
C int?
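
One possible direction, sketched under the assumption of a later Python release whose `mmap.mmap` accepts an `offset` argument (the 2.4 version used here does not), is to map a large file one window at a time. The helper name `iter_windows` and the window size are made up for illustration.

```python
import mmap
import os


def iter_windows(path, window_size):
    """Yield the file's contents as bytes, mapping one window at a time."""
    # Offsets passed to mmap must be multiples of ALLOCATIONGRANULARITY
    gran = mmap.ALLOCATIONGRANULARITY
    window_size = max(gran, (window_size // gran) * gran)
    file_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        for offset in range(0, file_size, window_size):
            length = min(window_size, file_size - offset)
            window = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                yield window[:]  # copy this window out before unmapping it
            finally:
                window.close()
```

Each window only needs `window_size` bytes of address space, so the total file size is no longer bounded by what a single mapping can cover.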

Claudio

Dec 2 '05 #14

Christopher Subich

Fredrik Lundh wrote:

Harald Karner wrote:

python -c "print len('m' * ((2048*1024*1024)-1))"

2147483647

the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...

anyone out there with an ILP64 system?

I have access to an itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit python, though
(any easy way of checking?). Old version of Python, but I'm not the
sysadmin and "I want to play around with python" isn't a good enough
reason for an upgrade. :)
Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> str = 'm'*2047*1024*1024 + 'n'*2047*1024*1024
>>> len(str)
-2097152

Yes, that's a negative length. And I don't really care about rebinding
str for this demo. :)

>>> str[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> str[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> str[-1]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
SystemError: error return without exception set
>>> len(str[:])
-2097152
>>> l = list(str)
>>> len(l)
0
>>> l
[]

The string is actually created -- top reports 4.0GB of memory usage.
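
The negative length is what one would expect if the true length is stored in a signed 32-bit field and wraps around. A short worked check, assuming two's complement wrap-around (the helper name is illustrative):

```python
def wrap_to_signed_32(value):
    """Reduce value to the range of a signed 32-bit integer (two's complement)."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


true_length = 2 * 2047 * 1024 * 1024
print(true_length)                     # 4292870144
print(wrap_to_signed_32(true_length))  # -2097152
```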

Dec 2 '05 #15

Fredrik Lundh

Christopher Subich wrote:

anyone out there with an ILP64 system?

I have access to an itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit python

an ILP64 system is a system where int, long, and pointer are all 64 bits,
so a 32-bit python on a 64-bit platform doesn't really qualify.

/... snip examples that show that python's string handling could need
some work for the len(s) > maxint case .../

</F>

Dec 3 '05 #16

Christopher Subich

Fredrik Lundh wrote:

Christopher Subich wrote:

I have access to an itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit python

an ILP64 system is a system where int, long, and pointer are all 64 bits,
so a 32-bit python on a 64-bit platform doesn't really qualify.

Did a quick check, and int is 32 bits, while long and pointer are each 64:
Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import struct
>>> struct.calcsize('i'),struct.calcsize('l'),struct.calcsize('P')
(4, 8, 8)

So, as of 2.2.3, there might still be a problem.

Dec 5 '05 #17
