itertools.izip brokenness

rurpy

The code below should be pretty self-explanatory.
I want to read two files in parallel, so that I
can print corresponding lines from each, side by
side. itertools.izip() seems the obvious way
to do this.

izip() will stop iterating when it reaches the
end of the shortest file. I don't know how to
tell which file was exhausted so I just try printing
them both. The exhausted one will generate a
StopIteration, the other will continue to be
iterable.

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output, nor in
the subsequent direct file iteration. I would
guess that it was in izip's buffer when izip
terminates due to the exception on the other file.

This behavior seems just plain broken, especially
because it is dependent on the order of izip's
arguments, and not documented anywhere I saw.
It makes using izip() for iterating files in
parallel essentially useless (unless you are
lucky enough to have files of the same length).

Also, it seems to me that this is likely a problem
with any iterables with different lengths.
I am hoping I am missing something...

#---------------------------------------------------------
# Task: print contents of file1 in column 1, and
# contents of file2 in column two. iterators and
# izip() are the "obvious" way to do it.

from itertools import izip
import cStringIO, pdb

def prt_files (file1, file2):

    for line1, line2 in izip (file1, file2):
        print line1.rstrip(), "\t", line2.rstrip()

    try:
        for line1 in file1:
            print line1,
    except StopIteration: pass

    try:
        for line2 in file2:
            print "\t",line2,
    except StopIteration: pass

if __name__ == "__main__":
    # Use StringIO to simulate files. Real files
    # show the same behavior.
    f = cStringIO.StringIO

    print "Two files with same number of lines work ok."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))

    print "\nFirst file shorter is also ok."
    prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))

    print "\nSecond file shorter is a problem."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
    print "What happened to \"fgh\" line that should be in column 1?"

    print "\nBut only a problem for one line."
    prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
    print "The line \"fgh\" is still missing, but following\n" \
          "line(s) are ok! Looks like izip() ate a line."

Jan 3 '06 #1


Paul Rubin

ru***@yahoo.com writes:

The problem is that sometimes, depending on which file is the
shorter, a line ends up missing, appearing neither in the izip()
output, or in the subsequent direct file iteration. I would guess
that it was in izip's buffer when izip terminates due to the
exception on the other file.

Oh man, this is ugly. The problem is there's no way to tell whether
an iterator is empty, other than by reading from it.

http://aspn.activestate.com/ASPN/Coo.../Recipe/413614

has a kludge that you can use inside a function but that's no good
for something like izip.

For a temporary hack you could make a wrapped iterator that allows
pushing items back onto the iterator (sort of like ungetc) and a
version of izip that uses it, or a version of izip that tests the
iterators you pass it using the above recipe.
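
The pushback idea above can be sketched roughly as follows. This is only an illustration, written for Python 3 (where izip is simply zip); the names Pushback, push and careful_zip are made up for this example and are not part of itertools or of the recipe linked above.

```python
class Pushback:
    """Iterator wrapper with an ungetc-style push() (hypothetical helper)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pending = []  # pushed-back items, most recently pushed served first

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            return self._pending.pop()
        return next(self._it)

    def push(self, item):
        self._pending.append(item)


def careful_zip(*wrapped):
    """Like zip(), but when one input runs out, the items already taken
    from the other inputs in that round are pushed back, not dropped."""
    while True:
        taken = []
        for w in wrapped:
            try:
                taken.append((w, next(w)))
            except StopIteration:
                # Undo this round's reads before stopping.
                for prev, item in reversed(taken):
                    prev.push(item)
                return
        yield tuple(item for _, item in taken)


a = Pushback(["abc", "de", "fgh"])
b = Pushback(["xyz", "wv"])
pairs = list(careful_zip(a, b))
rest = list(a)  # the line plain zip() would have swallowed is still here
```

The catch is the one already raised in this post: it only works for inputs that have been wrapped, so it cannot be retrofitted onto arbitrary iterators.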

It's probably not reasonable to ask that an emptiness test be added to
the iterator interface, since the zillion iterator implementations now
existing won't support it.

A different possible long term fix: change StopIteration so that it
takes an optional arg that the program can use to figure out what
happened. Then change izip so that when one of its iterator args runs
out, it wraps up the remaining ones in a new tuple and passes that
to the StopIteration it raises. Untested:

def izip(*iterlist):
    while True:
        z = []
        finished = [] # iterators that have run out
        still_alive = [] # iterators that are still alive
        for i in iterlist:
            try:
                z.append(i.next())
                still_alive.append(i)
            except StopIteration:
                finished.append(i)
        if not finished:
            yield tuple(z)
        else:
            raise StopIteration, (still_alive, finished)

You would want some kind of extended for-loop syntax (maybe involving
the new "with" statement) with a clean way to capture the exception info.
You'd then use it to continue the izip where it left off, with the
new (smaller) list of iterators.

Jan 3 '06 #2

bonono

But that is exactly the behaviour of python iterators, I don't see what
is broken.

izip/zip just read from the respective streams and give back a tuple
if they can get one item from each; otherwise they stop. And because a
python iterator can only go in one direction, the items already consumed
are lost in the zip/izip calls.

I think you need to use map(None,...), which would not drop anything,
just fill with None. Though you don't have a relatively lazy version, as
imap(None,...) doesn't behave like map but a bit like zip.
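
For readers on later Python versions: a lazy, padding variant did eventually appear in itertools (izip_longest in Python 2.6, zip_longest in Python 3), which postdates this thread. A minimal sketch of the side-by-side task with it, assuming Python 3; side_by_side is a made-up name for this example and it returns rows instead of printing them:

```python
from io import StringIO
from itertools import zip_longest


def side_by_side(file1, file2):
    """Return tab-separated rows, padding the shorter input with ''."""
    rows = []
    for line1, line2 in zip_longest(file1, file2, fillvalue=""):
        rows.append(line1.rstrip() + "\t" + line2.rstrip())
    return rows


# StringIO stands in for real files, as in the original post.
rows = side_by_side(StringIO("abc\nde\nfgh\n"), StringIO("xyz\nwv\n"))
```

Nothing is dropped here, because the shorter input is padded rather than used as the stopping condition.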

Jan 3 '06 #3

David Murmann

ru***@yahoo.com wrote:

[izip() eats one line]

as far as i can see the current implementation cannot be changed
to do the Right Thing in your case. python's iterators don't allow
you to "look ahead", so izip can only get the next element. if this
fails for an iterator, everything read in that round is lost.
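
The order dependence is easy to reproduce in isolation. A small sketch, assuming Python 3 (where izip is the built-in zip); zip_then_drain is a made-up helper for this example:

```python
def zip_then_drain(shorter_first):
    """Zip a 3-item and a 2-item iterator, then report what is left
    in the longer one afterwards."""
    long_it = iter(["abc", "de", "fgh"])
    short_it = iter(["xyz", "wv"])
    args = (short_it, long_it) if shorter_first else (long_it, short_it)
    pairs = list(zip(*args))
    return pairs, list(long_it)


# Longer iterator first: zip reads "fgh" before it finds out that the
# second iterator is exhausted, so "fgh" is gone.
lost_case = zip_then_drain(shorter_first=False)

# Shorter iterator first: zip hits the exhausted iterator before
# touching the longer one, so "fgh" survives.
kept_case = zip_then_drain(shorter_first=True)
```

zip pulls from its arguments left to right, which is why swapping the arguments changes whether a line goes missing.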

maybe the documentation for izip should note that the given
iterators are not necessarily in a sane state afterwards.

for your problem you can do something like:

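def izipall(*args):
    iters = [iter(it) for it in args]
    while iters:
        result = []
        # loop over a copy, since exhausted iterators are removed from iters
        for it in iters[:]:
            try:
                x = it.next()
            except StopIteration:
                iters.remove(it)
            else:
                result.append(x)
        # don't yield an empty tuple once every iterator has run out
        if result:
            yield tuple(result)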

note that this does not yield tuples that are always the same
length, so "for x, y in izipall()" won't work. instead, do something
like "for seq in izipall(): print '\t'.join(seq)".

hope i was clear enough, David.

Jan 3 '06 #4

Peter Otten

ru***@yahoo.com wrote:

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output, or in
the subsequent direct file iteration. I would
guess that it was in izip's buffer when izip
terminates due to the exception on the other file.

With the current iterator protocol you cannot feed an item that you've read
from an iterator by calling its next() method back into it; but invoking
next() is the only way to see whether the iterator is exhausted. Therefore
the behaviour that breaks your prt_files() function has nothing to do with
the itertools.
I think of itertools more as a toolbox than as a set of ready-made
solutions, and came up with

from itertools import izip, chain, repeat

def prt_files (file1, file2):
    file1 = chain(file1, repeat(""))
    file2 = chain(file2, repeat(""))
    for line1, line2 in iter(izip(file1, file2).next, ("", "")):
        print line1.rstrip(), "\t", line2.rstrip()

which can easily be generalized for an arbitrary number of files.
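
One possible shape for that generalization, as a sketch only and assuming Python 3 (zip instead of izip, __next__ instead of next); zip_padded and its fill parameter are names invented for this example:

```python
from itertools import chain, repeat


def zip_padded(*iterables, fill=""):
    """Zip any number of iterables, padding exhausted ones with `fill`
    until every one of them has run out."""
    padded = [chain(it, repeat(fill)) for it in iterables]
    all_done = (fill,) * len(padded)
    # Two-argument iter(): keep calling the function until it returns
    # the sentinel, which here is a row made entirely of padding.
    return iter(zip(*padded).__next__, all_done)


rows = list(zip_padded(["abc\n", "de\n", "fgh\n"], ["xyz\n", "wv\n"]))
```

One caveat with the sentinel trick: a genuine row in which every input yields the fill value at the same time also ends the iteration. For lines read from files that does not happen with fill="", since even a blank line still carries its newline.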

Peter

Jan 3 '06 #5

Paul Rubin

bo****@gmail.com writes:

> But that is exactly the behaviour of python iterator, I don't see what
> is broken.

What's broken is the iterator interface is insufficient to deal with
this cleanly.

> And because python iterator can only go in one direction, those
> consumed do lose in the zip/izip calls.

Yes, that's the problem. It's proven useful for i/o streams to support
a pushback operation like ungetc. Maybe something like it can be done
for iterators.

> I think you need to use map(None,...) which would not drop anything,
> just None filled. Though you don't have a relatively lazy version as
> imap(None,...) doesn't behave like map but a bit like zip.

I don't understand what you mean by this? None is not callable.

How about this (untested):

import itertools

def myzip(iterlist):
    """return zip of smaller and smaller list of iterables as the
    individual iterators run out"""
    sentinel = object() # unique sentinel
    def sentinel_append(iterable):
        return itertools.chain(iterable, itertools.repeat(sentinel))
    # unpack the list so izip gets one argument per iterable
    for i in itertools.izip(*map(sentinel_append, iterlist)):
        r = [x for x in i if x is not sentinel]
        if r: yield r
        else: break

Jan 3 '06 #6

bonono

Paul Rubin wrote:

>> I think you need to use map(None,...) which would not drop anything,
>> just None filled. Though you don't have a relatively lazy version as
>> imap(None,...) doesn't behave like map but a bit like zip.
>
> I don't understand what you mean by this? None is not callable.

zip([1,2,3],[4,5]) gives [(1,4),(2,5)]

map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3, None)]

So the result of map() can be filtered out for special processing. Of
course, your empty/sentinel filled version is doing more or less the
same thing.

Jan 3 '06 #7

Paul Rubin

bo****@gmail.com writes:

> map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3, None)]

I didn't know that until checking the docs just now. Oh man, what a
hack! I always thought Python should have a built-in identity
function for situations like that. I guess it does the above instead.
Thanks. Jeez ;-)

Jan 3 '06 #8

bonono

Paul Rubin wrote:

> bo****@gmail.com writes:
>> map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3, None)]
>
> I didn't know that until checking the docs just now. Oh man, what a
> hack! I always thought Python should have a built-in identity
> function for situations like that. I guess it does the above instead.
> Thanks. Jeez ;-)

Of course, for OP's particular case, I think a specialized func() is
even better, as the None are turned into "" in the process which is
needed for string operation.

map(lambda *arg: tuple(map(lambda x: x is not None and x or "", arg)),
    ["a","b","c"],["d","e"])

Jan 3 '06 #9

Diez B. Roggisch

> What's broken is the iterator interface is insufficient to deal with
> this cleanly.

I don't consider it broken. You just think too much in terms of the OP's
problems or probably other fields where the actual data is available for
"rewinding".

But as iterators serve as abstraction for lots of things - especially
generators - you can't enhance the interface.

> Yes, that's the problem. It's proven useful for i/o streams to support
> a pushback operation like ungetc. Maybe something like it can be done
> for iterators.

No. If you want that, use

list(iterable)

Then you have random access. If you _know_ there will be only so much data
needed to "unget", write yourself a buffered iterator like this:

buffered(iterable, size)

Maybe something like that _could_ go in the itertools. But I'm not really
convinced, as it is too tied to special cases - and besides that very
easily done.
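
No such helper ships with itertools, so the following is only a guess at what a buffered(iterable, size) could look like, assuming Python 3. The unget() method and its semantics are assumptions made for this sketch, not something specified in this post.

```python
from collections import deque


class buffered:
    """Iterator that remembers the last `size` items it handed out so
    that they can be un-got and read again (hypothetical interface)."""

    def __init__(self, iterable, size):
        self._it = iter(iterable)
        self._history = deque(maxlen=size)  # most recent items handed out
        self._replay = []                   # un-got items, next one to serve last

    def __iter__(self):
        return self

    def __next__(self):
        item = self._replay.pop() if self._replay else next(self._it)
        self._history.append(item)
        return item

    def unget(self, n=1):
        """Step back over the last n items; n may not exceed what is buffered."""
        if n > len(self._history):
            raise ValueError("cannot unget more items than are buffered")
        for _ in range(n):
            self._replay.append(self._history.pop())


b = buffered([1, 2, 3, 4], 2)
first_three = [next(b), next(b), next(b)]
b.unget(2)           # step back over 3 and 2
replayed = list(b)   # reading resumes at 2
```

The bounded deque keeps memory use fixed, which is the difference from list(iterable): only the last size items can be revisited.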

> How about this (untested):
>
> import itertools
>
> def myzip(iterlist):
>     """return zip of smaller and smaller list of iterables as the
>     individual iterators run out"""
>     sentinel = object() # unique sentinel
>     def sentinel_append(iterable):
>         return itertools.chain(iterable, itertools.repeat(sentinel))
>     for i in itertools.izip(*map(sentinel_append, iterlist)):
>         r = [x for x in i if x is not sentinel]
>         if r: yield r
>         else: break

If that fits your semantics - of course. But the general zip shouldn't
behave that way.

Regards,

Diez

Jan 3 '06 #10
