
Unable to read large files from zip

I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230

I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
  File "...\zipfile.py", line 491, in read
    bytes = self.fp.read(zinfo.compress_size)
OverflowError: long int too large to convert to int
Note: all the other smaller files up to that point come out just fine.
Here's the code:
------------------
import zipfile
import re

dataObj = zipfile.ZipFile("zip.zip","r")
for i in dataObj.namelist():
    print i+" -- >="+str(dataObj.getinfo(i).compress_size / 1024 / 1024)+"MB"
    if i[-1] == "/":
        print "Directory -- won't extract"
    else:
        fileName = re.split(r".*/",i,0)[1]
        fileData = dataObj.read(i)
There have been one or more posts about 2GB limits with the zipfile module, as well as this bug report: http://bugs.python.org/issue1189216 Also, older zip formats have a 4GB limit. However, I can't say for sure what the problem is. Does anyone know if my code is wrong or if there is a problem with Python itself?
If Python has a bug in it, then is there any other alternative library that I can use (it must be free source: BSD, MIT, Public Domain, Python license; not copyleft/*GPL)? If not that, is there any similarly licensed code in another language (like C++, Lisp, etc...) that I can use?
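One way to check whether the 2GB boundary is what's biting, before ever calling read(): scan the archive's metadata via infolist(). A minimal sketch in modern Python (the tiny in-memory archive here is just a stand-in for the real multi-gigabyte zip):

```python
import io
import zipfile

# Build a small demo archive in memory (a stand-in for the real zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("small.txt", b"hello" * 100)

TWO_GB = 2 ** 31  # the int boundary the traceback above runs into on 32-bit builds

# Flag any entry whose compressed or uncompressed size crosses the boundary.
with zipfile.ZipFile(buf, "r") as zf:
    oversized = [zi.filename for zi in zf.infolist()
                 if zi.compress_size >= TWO_GB or zi.file_size >= TWO_GB]

print(oversized)  # → []
```

Any name that shows up in `oversized` is an entry that read() would try to pull into memory in one go.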
Aug 29 '07 #1
2 Replies


Kevin Ar18 <ke*******@hotmail.com> wrote:
> I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230
>
> I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
>   File "...\zipfile.py", line 491, in read
>     bytes = self.fp.read(zinfo.compress_size)
> OverflowError: long int too large to convert to int
That will be a number bigger than 2**31 == 2 GB, which can't
be converted to an int.

It would be explained if zinfo.compress_size is 2GB, eg

>>> f = open("z")
>>> f.read(2**31)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to int

However it would seem nuts that zipfile is trying to read 2GB into
memory at once!
> There have been one or more posts about 2GB limits with the zipfile
> module, as well as this bug report:
> http://bugs.python.org/issue1189216 Also, older zip formats have a
> 4GB limit. However, I can't say for sure what the problem is.
> Does anyone know if my code is wrong

Your code looks OK to me.

> or if there is a problem with Python itself?

Looks likely.

> If Python has a bug in it

...then you have the source and you can have a go at fixing it!

Try editing zipfile.py and getting it to print out some debug info and
see if you can fix the problem. When you have done submit the patch
to the python bug tracker and you'll get that nice glow from helping
others! Remember python is open source and is made by *us* for *us* :-)

If you need help fixing zipfile.py then you'd probably be better off
asking on python-dev.

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Aug 29 '07 #2

Nick Craig-Wood <ni**@craig-wood.com> writes:

> Kevin Ar18 <ke*******@hotmail.com> wrote:
>> I posted this on the forum, but nobody seems to know the solution: http://python-forum.org/py/viewtopic.php?t=5230
>>
>> I have a zip file that is several GB in size, and one of the files inside of it is several GB in size. When it comes time to read the 5+GB file from inside the zip file, it fails with the following error:
>>   File "...\zipfile.py", line 491, in read
>>     bytes = self.fp.read(zinfo.compress_size)
>> OverflowError: long int too large to convert to int
>
> That will be a number bigger than 2**31 == 2 GB, which can't
> be converted to an int.
>
> It would be explained if zinfo.compress_size is 2GB, eg
>
> >>> f = open("z")
> >>> f.read(2**31)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too large to convert to int
>
> However it would seem nuts that zipfile is trying to read 2GB into
> memory at once!
Perhaps, but that's what the read(name) method does - returns a string
containing the contents of the selected file. So I think this runs
into a basic issue of the maximum length of Python strings (at least
in 32-bit builds, not sure about 64-bit) as much as it does an issue
with the zipfile module. Of course, the fact that the only "read"
method zipfile has is to return the entire file as a string might
be considered a design flaw.
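For what it's worth, the streaming alternative to that all-at-once read() is ZipFile.open(), which hands back a file-like object so the data never has to fit in a single string; it was added to the zipfile module after this thread. A sketch in modern Python (the in-memory archive is just a small stand-in for the real file):

```python
import io
import shutil
import zipfile

# Build a small demo archive in memory (a stand-in for the huge zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.bin", b"x" * 100000)

out = io.BytesIO()  # stands in for the output file on disk
with zipfile.ZipFile(buf, "r") as zf:
    with zf.open("big.bin") as src:          # file-like object; no full read()
        shutil.copyfileobj(src, out, 65536)  # copy in 64 KiB chunks

print(len(out.getvalue()))  # → 100000
```

Peak memory stays at one chunk rather than the whole member, which is the property the 5+GB file needs.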

For the OP, if you know you are going to be dealing with very large
files, you might want to implement your own individual file
extraction, since I'm guessing you don't actually need all 5+GB of the
problematic file loaded into memory in a single I/O operation,
particularly if you're just going to write it out again, which is what
your original forum code was doing.

I'd probably suggest just using the getinfo(name) method to return the
ZipInfo object for the file in question, then process the appropriate
section of the zip file directly. E.g., just seek to the proper
offset, then read the data incrementally up to the full size from the
ZipInfo compress_size attribute. If the files are compressed, you can
incrementally hand their data to the decompressor prior to other
processing.

E.g., instead of your original:

fileData = dataObj.read(i)
fileHndl = file(fileName,"wb")
fileHndl.write(fileData)
fileHndl.close()

something like (untested):

import struct
import zipfile
import zlib

CHUNK = 65536   # I/O chunk size

fileHndl = file(fileName,"wb")

zinfo = dataObj.getinfo(i)
compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
if compressed:
    dc = zlib.decompressobj(-15)   # raw deflate stream (no zlib header)

# Skip the fixed 30-byte local file header plus the variable-length
# filename and extra fields (their lengths are at offsets 26-29).
dataObj.fp.seek(zinfo.header_offset)
fheader = dataObj.fp.read(30)
fname_len, extra_len = struct.unpack("<HH", fheader[26:30])
dataObj.fp.seek(zinfo.header_offset + 30 + fname_len + extra_len)

remain = zinfo.compress_size
while remain:
    bytes = dataObj.fp.read(min(remain, CHUNK))
    remain -= len(bytes)
    if compressed:
        bytes = dc.decompress(bytes)
    fileHndl.write(bytes)

if compressed:
    bytes = dc.decompress('Z') + dc.flush()   # drain the decompressor
    if bytes:
        fileHndl.write(bytes)

fileHndl.close()

Note the above assumes you are only reading from the zip file as it
doesn't maintain the current read() method invariant of leaving the
file pointer position unchanged, but you could add that too. You
could also verify the file CRC along the way if you wanted to.
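That incremental CRC check can ride along with the chunked reads: fold each chunk into zlib.crc32 and compare the running value against the ZipInfo CRC at the end. A sketch in modern Python (demo archive built in memory; the chunk size is arbitrary):

```python
import io
import zipfile
import zlib

# Small demo archive (a stand-in for the real multi-gigabyte zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.bin", b"payload " * 1000)

with zipfile.ZipFile(buf, "r") as zf:
    zinfo = zf.getinfo("data.bin")
    crc = 0
    with zf.open("data.bin") as src:
        while True:
            chunk = src.read(65536)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # fold each chunk into the running CRC

matches = (crc & 0xFFFFFFFF) == zinfo.CRC
print(matches)  # → True
```

A mismatch at the end would indicate a corrupt member without ever holding more than one chunk in memory.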

Might be even better if you turned the above into a generator, perhaps
as a new method on a local ZipFile subclass. Use the above as a
read_gen method with the write() calls replaced with "yield bytes",
and your outer code could look like:

fileHndl = file(fileName,"wb")
for bytes in dataObj.read_gen(i):
    fileHndl.write(bytes)
fileHndl.close()
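A simplified take on that read_gen-on-a-subclass idea, leaning on the later ZipFile.open() rather than raw seeks (modern Python; the class name ChunkedZipFile and the in-memory demo archive are made up for illustration):

```python
import io
import zipfile

class ChunkedZipFile(zipfile.ZipFile):
    """ZipFile with a generator-style reader yielding fixed-size chunks."""
    def read_gen(self, name, chunk=65536):
        with self.open(name) as src:
            while True:
                data = src.read(chunk)
                if not data:
                    break
                yield data

# Demo archive in memory (a stand-in for the real zip on disk).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("file.bin", b"abc" * 50000)

out = bytearray()
with ChunkedZipFile(buf, "r") as dataObj:
    for piece in dataObj.read_gen("file.bin"):
        out.extend(piece)

print(len(out))  # → 150000
```

The caller's loop looks just like the outer code above, but each iteration sees at most one chunk, never the whole member.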

-- David
Aug 29 '07 #3
