
Is there no compression support for large sized strings in Python?

What started as a simple test of whether it is better to load uncompressed data
directly from the hard disk, or to load compressed data and decompress it
(Windows XP SP 2, Pentium 4 3.0 GHz system with 3 GByte RAM), seems to show
that none of the compression libraries available in Python really works for
large (i.e. 500 MByte) strings.

Run the provided code and see for yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?
Am I doing something the wrong way here?
If there is, what is the theoretical upper limit on the string size that each
of the compression libraries can process?

The only limit I know about is 2 GByte for the python.exe process itself,
but this does not seem to be the actual problem here.
There are also some other strange effects when trying to create large
strings using the following code:

m = 'm'*1048576
# str1024MB = 1024*m # fails with a memory error, but:
str512MB_01 = 512*m # works ok
# str512MB_02 = 512*m # fails with a memory error, but:
str256MB_01 = 256*m # works ok
str256MB_02 = 256*m # works ok
# ... and so on, down to allocating each single MByte in a separate string,
# pushing python.exe to the observed upper limit of memory available to it,
# which the Windows task manager reports as 2,065,352 KByte.

Is the question of why the str1024MB = 1024*m instruction fails, even though
the memory is apparently there and the target size of 1 GByte can be reached
in smaller pieces, out of the scope of this discussion thread, or is it the
same problem that causes the compression libraries to fail? And if it is,
why is no memory error raised there?

Any hints towards understanding what is going on, and/or towards a workaround,
are welcome.

Claudio

================================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py

strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

print
print ' Created files: '
print ' %s \n %s \n %s \n %s' % (
    r'c:\strSize500MB.dat'
    , r'c:\strSize500MBCompressed.zlib'
    , r'c:\strSize500MBCompressed.pylzma'
    , r'c:\strSize500MBCompressed.bz2'
    )

raw_input(' EXIT with Enter /> ')

================================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time

startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print ' loading uncompressed data from file: %7.3f seconds' % (time.clock()-startTime,)

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import zlib
try:
    startTime = time.clock()
    strSize500MB = zlib.decompress(strSize500MBCompressed)
    print 'decompressing zlib data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing zlib data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import pylzma
try:
    startTime = time.clock()
    strSize500MB = pylzma.decompress(strSize500MBCompressed)
    print 'decompressing pylzma data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing pylzma data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import bz2
try:
    startTime = time.clock()
    strSize500MB = bz2.decompress(strSize500MBCompressed)
    print 'decompressing bz2 data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing bz2 data FAILED'

raw_input(' EXIT with Enter /> ')
Dec 1 '05 #1
Claudio Grondi wrote:
What started as a simple test of whether it is better to load uncompressed data
directly from the hard disk, or to load compressed data and decompress it
(Windows XP SP 2, Pentium 4 3.0 GHz system with 3 GByte RAM), seems to show
that none of the compression libraries available in Python really works for
large (i.e. 500 MByte) strings.

Run the provided code and see for yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?


you're probably measuring Windows' memory management rather than the compression
libraries themselves (Python delegates all memory allocations >256 bytes to the
system).

I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.

</F>
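
To illustrate what such incremental processing could look like, here is a minimal
sketch using zlib's streaming objects (zlib.compressobj / zlib.decompressobj);
the file names and the 1 MByte chunk size are merely assumptions for the example,
and bz2 and pylzma offer analogous incremental interfaces:

# Sketch: stream a large file through zlib chunk by chunk, so neither the
# whole uncompressed string nor the whole compressed string has to be built
# in one piece (Python 2 style; paths and chunk size are assumptions).
import zlib

CHUNKSIZE = 1048576  # 1 MByte per read

compressor = zlib.compressobj()
fIn = file(r'c:\strSize500MB.dat', 'rb')
fOut = file(r'c:\strSize500MB_streamed.zlib', 'wb')
while True:
    chunk = fIn.read(CHUNKSIZE)
    if not chunk:
        break
    fOut.write(compressor.compress(chunk))
fOut.write(compressor.flush())
fIn.close()
fOut.close()

decompressor = zlib.decompressobj()
fIn = file(r'c:\strSize500MB_streamed.zlib', 'rb')
fOut = file(r'c:\strSize500MB_restored.dat', 'wb')
while True:
    chunk = fIn.read(CHUNKSIZE)
    if not chunk:
        break
    fOut.write(decompressor.decompress(chunk))
fOut.write(decompressor.flush())
fIn.close()
fOut.close()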

Dec 1 '05 #2
On this system (Linux 2.6.x, AMD64, 2 GB RAM, python2.4) I am able to
construct a 1 GB string by repetition, as well as compress a 512MB
string with gzip in one gulp.

$ cat claudio.py
s = '1234567890'*(1048576*50)

import zlib
c = zlib.compress(s)
print len(c)
open("/tmp/claudio.gz", "wb").write(c)

$ python claudio.py
1017769

$ python -c 'print len("m" * (1048576*1024))'
1073741824

I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).

$ python -c 'print len("m" * 1024*1024*1024)'
1073741824

I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.

Jeff


Dec 1 '05 #3

"Fredrik Lundh" <fr*****@python ware.com> schrieb im Newsbeitrag
news:ma******** *************** *************** *@python.org...
Claudio Grondi wrote:
What started as a simple test of whether it is better to load uncompressed data
directly from the hard disk, or to load compressed data and decompress it
(Windows XP SP 2, Pentium 4 3.0 GHz system with 3 GByte RAM), seems to show
that none of the compression libraries available in Python really works for
large (i.e. 500 MByte) strings.

Run the provided code and see for yourself.

At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?

you're probably measuring Windows' memory management rather than the compression
libraries themselves (Python delegates all memory allocations >256 bytes to the
system).

I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.

</F>


I have solved the problem with bz2 compression the way Fredrik suggested:

fObj = file(r'd:\strSize500MBCompressed.bz2', 'wb')
import bz2
objBZ2Compressor = bz2.BZ2Compressor()
lstCompressBz2 = []
for indx in range(0, len(strSize500MB), 1048576):
    lowerIndx = indx
    upperIndx = indx + 1048576
    if(upperIndx > len(strSize500MB)): upperIndx = len(strSize500MB)
    lstCompressBz2.append(objBZ2Compressor.compress(strSize500MB[lowerIndx:upperIndx]))
#:for
lstCompressBz2.append(objBZ2Compressor.flush())
strSize500MBCompressed = ''.join(lstCompressBz2)
fObj.write(strSize500MBCompressed)
fObj.close()

:-)

so I suppose the decompression problems can also be solved that way, but:

This still doesn't answer for me what the core of the problem was, how to
avoid it, and what memory-request limits should be considered when working
with large strings.
Is it actually the case that on systems other than Windows 2000/XP there is
no problem with the original code I provided?
Maybe a good reason to go for Linux instead of Windows? Do e.g. SuSE or
Mandriva Linux also have a memory limit on what a single Python process can
use?
Please let me know about your experience.

Claudio
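
For the decompression side, which Claudio supposes can be handled the same way,
a matching sketch using bz2.BZ2Decompressor might look as follows; the chunk
size and the d:\ path are taken from the post above, but this is only an
untested outline, not a verified fix:

import bz2

objBZ2Decompressor = bz2.BZ2Decompressor()
lstDecompressed = []
fObj = file(r'd:\strSize500MBCompressed.bz2', 'rb')
while True:
    chunk = fObj.read(1048576)
    if not chunk:
        break
    # feed the compressed stream in 1 MByte pieces instead of one big string
    lstDecompressed.append(objBZ2Decompressor.decompress(chunk))
fObj.close()
strSize500MB = ''.join(lstDecompressed)

Note that the final join still builds the full 500 MByte string in memory, so it
may run into the same address-space pressure; writing the pieces to a file
instead would avoid that.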
Dec 1 '05 #4
Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

HTH,
Gerald
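
To make the mmap idea concrete, here is a rough sketch that memory-maps the
uncompressed file and feeds slices of the map to an incremental compressor, so
the 500 MByte string never has to be held on the Python heap at once; the
paths, the chunk size, and the choice of bz2 are just assumptions for
illustration:

import mmap
import bz2

fIn = file(r'c:\strSize500MB.dat', 'rb')
# map the whole file read-only; slices are read from disk on demand
mappedData = mmap.mmap(fIn.fileno(), 0, access=mmap.ACCESS_READ)

objBZ2Compressor = bz2.BZ2Compressor()
fOut = file(r'c:\strSize500MBCompressed_mmap.bz2', 'wb')
for indx in range(0, len(mappedData), 1048576):
    fOut.write(objBZ2Compressor.compress(mappedData[indx:indx+1048576]))
fOut.write(objBZ2Compressor.flush())
fOut.close()
mappedData.close()
fIn.close()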

Claudio Grondi wrote:
[...]
Dec 1 '05 #5
Gerald Klix wrote:
Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.


As a matter of fact, it's Windows which reserved the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.
Dec 1 '05 #6
Christophe wrote:
As a matter of fact, it's Windows which reserved the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.


random raymond chen link:

http://blogs.msdn.com/oldnewthing/ar...05/208908.aspx

</F>

Dec 1 '05 #7
I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).

$ python -c 'print len("m" * 1024*1024*1024)'
1073741824

I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.

Jeff
----------

Here is my experience hunting for the memory limit exactly the way Jeff did it
(Windows 2000, Intel Pentium 4, 3 GB RAM, Python 2.4.2):

\>python -c "print len('m' * 1024*1024*1024) "
1073741824

\>python -c "print len('m' * 1136*1024*1024) "
1191182336

\>python -c "print len('m' * 1236*1024*1024) "
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError

Anyone on a big Linux machine able to do e.g. :
\>python -c "print len('m' * 2500*1024*1024) "
or even more without a memory error?

I suppose that on the Dual Intel Xeon, even with 8 GByte RAM, the upper limit
of memory available will not be larger than 4 GByte.
Can someone point me to an Intel-compatible PC which is able to provide more
than 4 GByte of RAM to Python?

Claudio
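
As an aside, whether a single Python process can get more than 4 GByte at all
depends on whether the interpreter itself is a 64-bit build running on a 64-bit
OS; a quick, hedged way to check which kind of build is running:

import struct, platform
# 32 on a 32-bit build (at most 2-4 GByte of address space per process),
# 64 on a 64-bit build
print struct.calcsize("P") * 8
print platform.architecture()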

Dec 1 '05 #8
Claudio Grondi wrote:
Anyone on a big Linux machine able to do e.g. :
\>python -c "print len('m' * 2500*1024*1024) "
or even more without a memory error?
I tried on a Sun with 16GB RAM (Python 2.3.2);
seems like 2GB is the limit for string size:

python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
  File "<string>", line 1, in ?
OverflowError: repeated string is too long

python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647
Dec 1 '05 #9
Harald Karner wrote:
I tried on a Sun with 16GB Ram (Python 2.3.2)
seems like 2GB is the limit for string size:
python -c "print len('m' * 2048*1024*1024) "

Traceback (most recent call last):
File "<string>", line 1, in ?
OverflowError: repeated string is too long
python -c "print len('m' * ((2048*1024*1024)-1))"

2147483647


the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...

anyone out there with an ILP64 system?

</F>
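
For reference, on a 32-bit build sys.maxint is also 2**31 - 1, so both the int
type and ob_size top out at 2147483647 there; on a 64-bit (LP64) build
sys.maxint grows to 2**63 - 1 while ob_size stays a 32-bit int, which is
presumably why Harald's 16 GB Sun still stops just under 2 GByte. A small
check, assuming a Python 2 interpreter:

import sys
# 2147483647 on a 32-bit build; larger on LP64, where string length is
# still capped by the C int ob_size unless the build is ILP64
print sys.maxint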

Dec 1 '05 #10


