
Unable to read large files from zip


I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230

I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
  File "...\zipfile.py", line 491, in read
    bytes = self.fp.read(zinfo.compress_size)
OverflowError: long int too large to convert to int
Note: all the other smaller files up to that point come out just fine.
Here's the code:
------------------
import zipfile
import re

dataObj = zipfile.ZipFile("zip.zip", "r")
for i in dataObj.namelist():
    print i + " --> " + str(dataObj.getinfo(i).compress_size / 1024 / 1024) + "MB"
    if i[-1] == "/":
        print "Directory -- won't extract"
    else:
        fileName = re.split(r".*/", i, 0)[1]
        fileData = dataObj.read(i)
------------------
There have been one or more posts about 2GB limits with the zipfile module, as well as this bug report: http://bugs.python.org/issue1189216. Also, older zip formats have a 4GB limit. However, I can't say for sure what the problem is. Does anyone know if my code is wrong, or if there is a problem with Python itself?
If Python has a bug in it, then is there any other alternative library that I can use? It must be free source (BSD, MIT, Public Domain, Python license; not copyleft/*GPL). If not that, is there any similarly licensed code in another language (like C++, Lisp, etc.) that I can use?
Aug 29 '07 #1
Kevin Ar18 <ke*******@hotmail.com> wrote:
> I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230
>
> I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
>
>   File "...\zipfile.py", line 491, in read
>     bytes = self.fp.read(zinfo.compress_size)
> OverflowError: long int too large to convert to int
That will be a number bigger than 2**31 == 2 GB, which can't
be converted to an int.

It would be explained if zinfo.compress_size is 2GB, eg

>>> f = open("z")
>>> f.read(2**31)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to int

However it would seem nuts that zipfile is trying to read 2GB into
memory at once!
> There have been one or more posts about 2GB limits with the zipfile
> module, as well as this bug report:
> http://bugs.python.org/issue1189216. Also, older zip formats have a
> 4GB limit. However, I can't say for sure what the problem is.
> Does anyone know if my code is wrong

Your code looks OK to me.

> or if there is a problem with Python itself?

Looks likely.

> If Python has a bug in it

...then you have the source and you can have a go at fixing it!

Try editing zipfile.py and getting it to print out some debug info and
see if you can fix the problem. When you have done so, submit the patch
to the Python bug tracker and you'll get that nice glow from helping
others! Remember Python is open source and is made by *us* for *us* :-)
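For example, a minimal check (not from the original posts; it reuses the archive name from the question) that prints the sizes ZipFile.read() will try to pull in at once, to confirm which member trips the 2**31 limit:

import zipfile

# Sketch only: report any member whose size exceeds what a 32-bit int
# (and hence a single fp.read() / Python string) can hold.
dataObj = zipfile.ZipFile("zip.zip", "r")
for name in dataObj.namelist():
    zinfo = dataObj.getinfo(name)
    if zinfo.compress_size >= 2**31 or zinfo.file_size >= 2**31:
        print "%s: compress_size=%d, file_size=%d -- too big for read()" % (
            name, zinfo.compress_size, zinfo.file_size)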

If you need help fixing zipfile.py then you'd probably be better off
asking on python-dev.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Aug 29 '07 #2
Nick Craig-Wood <ni**@craig-wood.com> writes:
> Kevin Ar18 <ke*******@hotmail.com> wrote:
>> I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230
>>
>> I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
>>
>>   File "...\zipfile.py", line 491, in read
>>     bytes = self.fp.read(zinfo.compress_size)
>> OverflowError: long int too large to convert to int
>
> That will be a number bigger than 2**31 == 2 GB, which can't
> be converted to an int.
>
> It would be explained if zinfo.compress_size is 2GB, eg
>
> >>> f = open("z")
> >>> f.read(2**31)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too large to convert to int
>
> However it would seem nuts that zipfile is trying to read 2GB into
> memory at once!
Perhaps, but that's what the read(name) method does - returns a string
containing the contents of the selected file. So I think this runs
into a basic issue of the maximum length of Python strings (at least
in 32-bit builds, not sure about 64-bit) as much as it does an issue
with the zipfile module. Of course, the fact that the only "read"
method zipfile has is to return the entire file as a string might
be considered a design flaw.

For the OP, if you know you are going to be dealing with very large
files, you might want to implement your own individual file
extraction, since I'm guessing you don't actually need all 5+GB of the
problematic file loaded into memory in a single I/O operation,
particularly if you're just going to write it out again, which is what
your original forum code was doing.

I'd probably suggest just using the getinfo(name) method to return the
ZipInfo object for the file in question, then process the appropriate
section of the zip file directly. E.g., just seek to the proper
offset, then read the data incrementally up to the full size from the
ZipInfo compress_size attribute. If the files are compressed, you can
incrementally hand their data to the decompressor prior to other
processing.

E.g., instead of your original:

fileData = dataObj.read(i)
fileHndl = file(fileName, "wb")
fileHndl.write(fileData)
fileHndl.close()

something like (untested):

import struct
import zlib

CHUNK = 65536   # I/O chunk size

fileHndl = file(fileName, "wb")

zinfo = dataObj.getinfo(i)
compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
if compressed:
    dc = zlib.decompressobj(-15)   # raw deflate stream, no zlib header

# Skip the local file header: 30 fixed bytes plus the variable-length
# file name and extra field (their lengths are stored at offsets 26 and 28).
dataObj.fp.seek(zinfo.header_offset + 26)
namelen, extralen = struct.unpack("<HH", dataObj.fp.read(4))
dataObj.fp.seek(zinfo.header_offset + 30 + namelen + extralen)

remain = zinfo.compress_size
while remain:
    bytes = dataObj.fp.read(min(remain, CHUNK))
    remain -= len(bytes)
    if compressed:
        bytes = dc.decompress(bytes)
    fileHndl.write(bytes)

if compressed:
    # Feed a dummy byte to flush the final deflate block, then flush.
    bytes = dc.decompress('Z') + dc.flush()
    if bytes:
        fileHndl.write(bytes)

fileHndl.close()

Note the above assumes you are only reading from the zip file as it
doesn't maintain the current read() method invariant of leaving the
file pointer position unchanged, but you could add that too. You
could also verify the file CRC along the way if you wanted to.
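As a sketch of that CRC check (not in the original post; decompressed_chunks is a placeholder for the chunks produced in the loop above):

import zlib

# Sketch only: accumulate zlib.crc32 over the *decompressed* data and
# compare against the CRC recorded in the ZipInfo entry.
crc = 0
for chunk in decompressed_chunks:    # placeholder for the chunks written above
    crc = zlib.crc32(chunk, crc)

if (crc & 0xffffffff) != (zinfo.CRC & 0xffffffff):
    raise IOError("CRC check failed for %s" % zinfo.filename)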

Might be even better if you turned the above into a generator, perhaps
as a new method on a local ZipFile subclass. Use the above as a
read_gen method with the write() calls replaced with "yield bytes",
and your outer code could look like:

fileHndl = file(fileName, "wb")
for bytes in dataObj.read_gen(i):
    fileHndl.write(bytes)
fileHndl.close()
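
A rough sketch of such a subclass (not from the original post; StreamingZipFile and read_gen are illustrative names, and the header handling follows the same seek logic as the loop above):

import struct
import zlib
import zipfile

# Sketch only: read_gen is not part of the standard zipfile API; it is the
# hypothetical generator method described above.
class StreamingZipFile(zipfile.ZipFile):
    def read_gen(self, name, chunk=65536):
        """Yield the decompressed contents of `name` one chunk at a time."""
        zinfo = self.getinfo(name)
        # Skip the local file header: 30 fixed bytes plus file name and extra field.
        self.fp.seek(zinfo.header_offset + 26)
        namelen, extralen = struct.unpack("<HH", self.fp.read(4))
        self.fp.seek(zinfo.header_offset + 30 + namelen + extralen)
        dc = None
        if zinfo.compress_type == zipfile.ZIP_DEFLATED:
            dc = zlib.decompressobj(-15)   # raw deflate, no zlib header
        remain = zinfo.compress_size
        while remain:
            data = self.fp.read(min(remain, chunk))
            remain -= len(data)
            if dc:
                data = dc.decompress(data)
            yield data
        if dc:
            tail = dc.decompress('Z') + dc.flush()   # flush the last deflate block
            if tail:
                yield tail

With that in place, dataObj would be created as StreamingZipFile("zip.zip", "r"), and the loop above streams the member to disk without ever holding more than one chunk in memory.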

-- David
Aug 29 '07 #3
