What started as a simple test of whether it is better to load uncompressed data
directly from the hard disk or
to load compressed data and uncompress it (Windows XP SP2, Pentium 4 3.0 GHz
system with 3 GByte RAM)
seems to show that none of the compression libraries available in Python
really works for large
(i.e. 500 MByte) strings.
Test the provided code and see for yourself.
At least on my system:
zlib fails to decompress raising a memory error
pylzma fails to decompress running endlessly consuming 99% of CPU time
bz2 fails to compress running endlessly consuming 99% of CPU time
The same works with a 10 MByte string without any problem.
So what? Is there no compression support for large strings in Python?
Am I doing something wrong here?
If there is such support, what is the theoretical upper limit on the string size
that each of the compression libraries can process?
The only limit I know about is 2 GByte for the python.exe process itself,
but that does not seem to be the actual problem in this case.
There are also some other strange effects when trying to create large
strings using the following code:
m = 'm'*1048576
# str1024MB = 1024*m # fails with memory error, but:
str512MB_01 = 512*m # works ok
# str512MB_02 = 512*m # fails with memory error, but:
str256MB_01 = 256*m # works ok
str256MB_02 = 256*m # works ok
and so on, down to allocating each single MByte in a separate string,
in order to push python.exe to the experienced upper limit of memory
available to it, which the Windows Task Manager reports as
2,065,352 KByte.
Is the question of why the str1024MB = 1024*m instruction fails,
even though the memory is apparently there and the target size of 1 GByte
can be reached,
out of the scope of this discussion thread, or is it the same problem
that causes
the compression libraries to fail? And why is no memory error raised in that case?
Any hints towards understanding what is going on and why, and/or towards a
workaround, are welcome.
Claudio
============================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py
strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()
print
print ' Created files: '
print ' %s \n %s \n %s \n %s' %(
r'c:\strSize500MB.dat'
,r'c:\strSize500MBCompressed.zlib'
,r'c:\strSize500MBCompressed.pylzma'
,r'c:\strSize500MBCompressed.bz2'
)
raw_input(' EXIT with Enter /> ')
============================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time
startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print ' loading uncompressed data from file: %7.3f seconds' % (time.clock()-startTime,)
startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import zlib
try:
    startTime = time.clock()
    strSize500MB = zlib.decompress(strSize500MBCompressed)
    print 'decompressing zlib data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing zlib data FAILED'
startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import pylzma
try:
    startTime = time.clock()
    strSize500MB = pylzma.decompress(strSize500MBCompressed)
    print 'decompressing pylzma data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing pylzma data FAILED'
startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import bz2
try:
    startTime = time.clock()
    strSize500MB = bz2.decompress(strSize500MBCompressed)
    print 'decompressing bz2 data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing bz2 data FAILED'
raw_input(' EXIT with Enter /> ')
Claudio Grondi wrote: What started as a simple test if it is better to load uncompressed data directly from the harddisk or load compressed data and uncompress it (Windows XP SP 2, Pentium4 3.0 GHz system with 3 GByte RAM) seems to show that none of the in Python available compression libraries really works for large sized (i.e. 500 MByte) strings.
Test the provided code and see yourself.
At least on my system: zlib fails to decompress raising a memory error pylzma fails to decompress running endlessly consuming 99% of CPU time bz2 fails to compress running endlessly consuming 99% of CPU time
The same works with a 10 MByte string without any problem.
So what? Is there no compression support for large sized strings in Python?
you're probably measuring windows' memory management rather than the
compression libraries themselves (Python delegates all memory allocations
>256 bytes to the system).
I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.
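For example, a minimal sketch of incremental zlib decompression along those lines, reading the compressed file in fixed-size chunks and writing the result straight back to disk instead of building one huge string (the file names are taken from the test scripts above, except the restored-file name, which is a hypothetical choice, as is the 1 MByte chunk size):

import zlib

objDecompressor = zlib.decompressobj()
fIn = file(r'c:\strSize500MBCompressed.zlib', 'rb')
fOut = file(r'c:\strSize500MB_restored.dat', 'wb')    # hypothetical output file
while 1:
    strChunk = fIn.read(1048576)                      # 1 MByte of compressed data per pass
    if not strChunk:
        break
    fOut.write(objDecompressor.decompress(strChunk))
fOut.write(objDecompressor.flush())                   # write whatever is still buffered
fIn.close()
fOut.close()

This way neither the compressed nor the decompressed data ever has to exist as a single 500 MByte Python string.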
</F>
On this system (Linux 2.6.x, AMD64, 2 GB RAM, python2.4) I am able to
construct a 1 GB string by repetition, as well as compress a 512MB
string with gzip in one gulp.
$ cat claudio.py
s = '1234567890'*(1048576*50)
import zlib
c = zlib.compress(s)
print len(c)
open("/tmp/claudio.gz", "wb").write(c)
$ python claudio.py
1017769
$ python -c 'print len("m" * (1048576*1024))'
1073741824
I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).
$ python -c 'print len("m" * 1024*1024*1024)'
1073741824
I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.
Jeff
"Fredrik Lundh" <fr*****@pythonware.com> schrieb im Newsbeitrag
news:ma***************************************@pyt hon.org... Claudio Grondi wrote:
What started as a simple test if it is better to load uncompressed data directly from the harddisk or load compressed data and uncompress it (Windows XP SP 2, Pentium4 3.0
GHz system with 3 GByte RAM) seems to show that none of the in Python available compression libraries really works for large sized (i.e. 500 MByte) strings.
Test the provided code and see yourself.
At least on my system: zlib fails to decompress raising a memory error pylzma fails to decompress running endlessly consuming 99% of CPU time bz2 fails to compress running endlessly consuming 99% of CPU time
The same works with a 10 MByte string without any problem.
So what? Is there no compression support for large sized strings in
Python? you're probably measuring windows' memory managment rather than the com- pression libraries themselves (Python delegates all memory allocations 256 bytes to the system).
I suggest using incremental (streaming) processing instead; from what I
can tell, all three libraries support that.
</F>
I have solved the problem with bz2 compression the way Fredrik suggested:
fObj = file(r'd:\strSize500MBCompressed.bz2', 'wb')
import bz2
objBZ2Compressor = bz2.BZ2Compressor()
lstCompressBz2 = []
for indx in range(0, len(strSize500MB), 1048576):
    lowerIndx = indx
    upperIndx = indx + 1048576
    if(upperIndx > len(strSize500MB)): upperIndx = len(strSize500MB)
    lstCompressBz2.append(objBZ2Compressor.compress(strSize500MB[lowerIndx:upperIndx]))
#:for
lstCompressBz2.append(objBZ2Compressor.flush())
strSize500MBCompressed = ''.join(lstCompressBz2)
fObj.write(strSize500MBCompressed)
fObj.close()
:-)
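The decompression side can presumably be handled the same way; a rough sketch with bz2.BZ2Decompressor, reading the compressed file back in 1 MByte pieces (untested on the full 500 MByte file; the file name is the one from the snippet above):

fObj = file(r'd:\strSize500MBCompressed.bz2', 'rb')
import bz2
objBZ2Decompressor = bz2.BZ2Decompressor()
lstDecompressBz2 = []
while 1:
    strChunk = fObj.read(1048576)     # 1 MByte of compressed data per pass
    if not strChunk:
        break
    lstDecompressBz2.append(objBZ2Decompressor.decompress(strChunk))
fObj.close()
strSize500MB = ''.join(lstDecompressBz2)

(The final join still builds one 500 MByte string, but the decompressor itself only ever sees small chunks.)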
so I suppose that the decompression problems can also be solved that
way (along the lines of the sketch above), but:
This still doesn't answer for me what the core of the problem
was, how to avoid it, and what memory request limits should be
considered when working with large strings.
Is it actually the case that on systems other than Windows 2000/XP there is no
problem with the original code I have provided?
Maybe a good reason to go for Linux instead of Windows? Do e.g. Suse or
Mandriva Linux also have a limit on the memory a single Python process can use?
Please let me know about your experience.
Claudio
Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.
HTH,
Gerald
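For what it's worth, a rough sketch of what that could look like: map the 500 MByte test file and compress it slice by slice, so the whole file never has to live in one Python string (file names reuse the earlier scripts; note that on 32-bit Windows the mapping itself is still subject to the per-process address-space limit discussed below):

import mmap, os, zlib

fIn = file(r'c:\strSize500MB.dat', 'rb')
objMap = mmap.mmap(fIn.fileno(), os.path.getsize(r'c:\strSize500MB.dat'),
                   access=mmap.ACCESS_READ)
objCompressor = zlib.compressobj()
fOut = file(r'c:\strSize500MBCompressed.zlib', 'wb')
for indx in range(0, len(objMap), 1048576):
    fOut.write(objCompressor.compress(objMap[indx:indx + 1048576]))   # 1 MByte slices
fOut.write(objCompressor.flush())
objMap.close()
fIn.close()
fOut.close()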
Gerald Klix wrote: Did you consider the mmap library? Perhaps it is possible to avoid holding these big strings in memory. BTW: AFAIK it is not possible in 32-bit Windows for an ordinary program to allocate more than 2 GB. That restriction comes from the Jurassic MIPS processors, that reserved the upper 2 GB for the OS.
As a matter of fact, it's Windows which reserved the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.
Christophe wrote: As a matter of fact, it's Windows which reserved the upper 2 GB. There is a simple setting to change that value so that you have 3 GB available, and another setting which can even go as far as 3.5 GB available per process.
random raymond chen link: http://blogs.msdn.com/oldnewthing/ar...05/208908.aspx
</F>
I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).
$ python -c 'print len("m" * 1024*1024*1024)'
1073741824
I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.
Jeff
----------
Here is my experience hunting for the memory limit exactly the way Jeff
did it (Windows 2000, Intel Pentium4, 3GB RAM, Python 2.4.2):
>python -c "print len('m' * 1024*1024*1024)"
1073741824
>python -c "print len('m' * 1136*1024*1024)"
1191182336
>python -c "print len('m' * 1236*1024*1024)"
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError
Is anyone on a big Linux machine able to run e.g.:
>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?
I suppose that on the Dual Intel Xeon, even with 8 GByte RAM, the upper
limit of available memory will be no larger than 4 GByte.
Can someone point me to an Intel-compatible PC which is able to provide more
than 4 GByte of RAM to Python?
Claudio
Claudio Grondi wrote: Anyone on a big Linux machine able to do e.g.: >python -c "print len('m' * 2500*1024*1024)" or even more without a memory error?
I tried on a Sun with 16GB Ram (Python 2.3.2)
seems like 2GB is the limit for string size:
python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
File "<string>", line 1, in ?
OverflowError: repeated string is too long
python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647
Harald Karner wrote:
> I tried on a Sun with 16GB Ram (Python 2.3.2), seems like 2GB is the limit for string size:
> python -c "print len('m' * 2048*1024*1024)"
> Traceback (most recent call last):
>   File "<string>", line 1, in ?
> OverflowError: repeated string is too long
> python -c "print len('m' * ((2048*1024*1024)-1))"
> 2147483647
the string type uses the ob_size field to hold the string length, and
ob_size is an integer:
$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...
anyone out there with an ILP64 system?
</F>
"Harald Karner" <ha***********@a1.net> schrieb im Newsbeitrag
news:ne*********************@inet.ecofinance.com.. . Claudio Grondi wrote: Anyone on a big Linux machine able to do e.g. : \>python -c "print len('m' * 2500*1024*1024)" or even more without a memory error?
I tried on a Sun with 16GB Ram (Python 2.3.2) seems like 2GB is the limit for string size:
> python -c "print len('m' * 2048*1024*1024)" Traceback (most recent call last): File "<string>", line 1, in ? OverflowError: repeated string is too long
> python -c "print len('m' * ((2048*1024*1024)-1))" 2147483647
In this context I am very curious how many such
2 GByte strings it is possible to create within a
single Python process,
i.e. at which of the following lines, executed
as one script, does a memory error occur?
dataStringA = 'A'*((2048*1024*1024)-1) # 2 GByte
dataStringB = 'B'*((2048*1024*1024)-1) # 4 GByte
dataStringC = 'C'*((2048*1024*1024)-1) # 6 GByte
dataStringD = 'D'*((2048*1024*1024)-1) # 8 GByte
dataStringE = 'E'*((2048*1024*1024)-1) # 10 GByte
dataStringF = 'F'*((2048*1024*1024)-1) # 12 GByte
dataStringG = 'G'*((2048*1024*1024)-1) # 14 GByte
leaving 2 GByte for the system on a 16 GByte machine ... ;-)
Claudio
> the string type uses the ob_size field to hold the string length, and ob_size is an integer:
> $ more Include/object.h
> ...
> int ob_size; /* Number of items in variable part */
If this is what you mean,
#define PyObject_VAR_HEAD \
PyObject_HEAD \
int ob_size; /* Number of items in variable part */
and if I understand it properly
(i.e. that all Python types are derived from Python objects),
then also the unlimited-size long integers are limited to values which
fit into 2 GByte of memory, right?
And a list or a dictionary is likewise not designed to hold
more than 2 Giga elements, etc.
So the question which still remains open is: can Python by design
handle an address space larger than 2 GByte?
I can't check it out myself, being on a Windows system which
already limits a single process to this address space.
With lists I hit the memory limit at around:
python -c "print len(280*1024*1024*[None])"
(where the required memory for this list is around 1.15 GByte
or more - on Windows 2000, Pentium4,
with 3 GByte RAM and Python 2.4.2).
Claudio
Claudio Grondi <cl************@freenet.de> wrote:
... In this context I am very curious how many of such 2 GByte strings is it possible to create within a single Python process?
VM (Virtual Memory) may make the issue difficult to answer precisely.
With a Python build for 64-bit addressing (and running, of course, on a
64-bit machine), you could go on for a long time. If your virtual
memory space is large enough (say a nice entire terabyte RAID diskset),
and you don't use resource limiting to throttle the process, you could
be thrashing (with about 1000 GB of VM backed by only 14 GB of physical
RAM, I predict *LOTS AND LOTS* of disk activity!) for a very, very long
time before you finally get an out-of-memory error.
Change the parameters and the answer will change, of course -- Python
has relatively little to do with it, as you can build it for either
64-bit or 32-bit addressing, on suitable CPUs; the OS's VM
implementation (and of course the CPU) essentially dominate this
"problem space".
Alex
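As an aside, a minimal sketch of the kind of resource limiting Alex mentions, for Linux (the resource module and RLIMIT_AS are not available on Windows, and the 1 GByte cap is just an example value):

import resource

# Cap the process's virtual address space at 1 GByte (soft and hard limit);
# allocations beyond that then fail quickly with MemoryError instead of
# driving the machine into heavy swapping.
limit = 1024 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    s = 'm' * (1536 * 1024 * 1024)    # a 1.5 GByte request
except MemoryError:
    print 'allocation beyond the cap fails with MemoryError'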
"Gerald Klix" <Ge*********@klix.ch> schrieb im Newsbeitrag
news:ma***************************************@pyt hon.org... Did you consider the mmap library? Perhaps it is possible to avoid to hold these big stings in memory. BTW: AFAIK it is not possible in 32bit windows for an ordinary programm to allocate more than 2 GB. That restriction comes from the jurrasic MIPS-Processors, that reserved the upper 2 GB for the OS.
HTH, Gerald objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
Traceback (most recent call last):
File "<pyshell#21>", line 1, in -toplevel-
objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
OverflowError: memory mapped size is too large (limited by C int) os.fstat(fHdl)[6]
4498001104L
Max. allowed value is here 256*256*256*128-1
i.e. 2147483647
The 'Jurassic' limit shows up in Python as well.
The only existing 'workaround' seems to be
to go for a 64-bit machine with a 64-bit Python version.
Is there no other known way?
Can the Python code not be adjusted, so that C long long is used instead of
C int?
Claudio
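Short of a 64-bit build, one workaround sketch is to fall back to plain chunked reads, so that neither a mapping size nor any single string ever comes near the C int limit (the file name and the checksum computation are only placeholders for whatever processing is actually needed):

import md5    # hashlib did not exist yet in the Python versions discussed here

fObj = file(r'd:\hugefile.dat', 'rb')    # hypothetical file larger than 2 GByte
objDigest = md5.new()
while 1:
    strChunk = fObj.read(1048576)        # 1 MByte per read, independent of total file size
    if not strChunk:
        break
    objDigest.update(strChunk)
fObj.close()
print objDigest.hexdigest()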
Fredrik Lundh wrote: Harald Karner wrote: python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647
the string type uses the ob_size field to hold the string length, and ob_size is an integer:
$ more Include/object.h ... int ob_size; /* Number of items in variable part */ ...
anyone out there with an ILP64 system?
I have access to an itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit python, though
(any easy way of checking?). Old version of Python, but I'm not the
sysadmin and "I want to play around with python" isn't a good enough
reason for an upgrade. :)
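One quick way to check, for what it's worth (a small sketch; struct's 'P' format and sys.maxint are available in the Python versions mentioned in this thread):

import struct, sys

# A 64-bit Python build reports 8-byte pointers and sys.maxint == 2**63 - 1;
# a 32-bit build reports 4-byte pointers and sys.maxint == 2**31 - 1.
print struct.calcsize('P'), sys.maxint

(The calcsize check further down in the thread reports 4, 8, 8 on this box, i.e. a 32-bit int but 64-bit long and pointer.)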
Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information. str = 'm'*2047*1024*1024 + 'n'*2047*1024*1024 len(str)
-2097152
Yes, that's a negative length. And I don't really care about rebinding
str for this demo. :)
>>> str[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> str[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> str[-1]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
SystemError: error return without exception set
>>> len(str[:])
-2097152
>>> l = list(str)
>>> len(l)
0
>>> l
[]
The string is actually created -- top reports 4.0GB of memory usage.
Christopher Subich wrote: anyone out there with an ILP64 system?
I have access to an itanium system with a metric ton of memory. I -think- that the Python version is still only a 32-bit python
an ILP64 system is a system where int, long, and pointer are all 64 bits,
so a 32-bit python on a 64-bit platform doesn't really qualify.
/... snip examples that show that python's string handling could need
some work for the len(s) > maxint case .../
</F>
Fredrik Lundh wrote: Christopher Subich wrote: I have access to an itanium system with a metric ton of memory. I -think- that the Python version is still only a 32-bit python
an ILP64 system is a system where int, long, and pointer are all 64 bits, so a 32-bit python on a 64-bit platform doesn't really qualify.
Did a quick check, and int is 32 bits, while long and pointer are each 64:
Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information. import struct struct.calcsize('i'),struct.calcsize('l'),struct.c alcsize('P')
(4, 8, 8)
So, as of 2.2.3, there might still be a problem.