Canonical way of dealing with null-separated lines?

Douglas Alan

Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated? Sure, I can
implement my own iterator using read() and split(), etc., but
considering that using "find -print0" is so common, it seems like
there should be a more canonical way.

|>oug

Jul 18 '05 #1


Christopher De Vries

On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote:

Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated?

I'm not sure if there is a canonical method, but I would recommend using a
generator to get something like this, where 'f' is a file object:

def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []

    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break

        # Split over nulls
        splitstr = instr.split('\0')

        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)

        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:
            yield element

    # yield anything left over
    yield retain[0]

Chris

Jul 18 '05 #2

Scott David Daniels

Douglas Alan wrote:

Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated? Sure, I can
implement my own iterator using read() and split(), etc., but
considering that using "find -print0" is so common, it seems like
there should be a more canonical way.

You could start with this code and add '\0' as a line terminator:

http://members.dsl-only.net/~daniels/ilines.html
--Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #3

Douglas Alan

Christopher De Vries <de*****@idolstarastronomer.com> writes:

I'm not sure if there is a canonical method, but I would
recommend using a generator to get something like this, where 'f'
is a file object:

Thanks for the generator. It returns an extra blank line at the end
when used with "find -print0", which is probably not ideal, and is
also not how the normal file line iterator behaves. But don't worry
-- I can fix it.
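
The extra blank line comes from split() treating the null as a separator rather than a terminator. A minimal sketch, assuming input shaped like "find -print0" output (the sample paths are made up for illustration):

```python
# Sample data shaped like `find -print0` output: every name is followed by
# a NUL, including the last one. The paths are invented for this sketch.
data = "./a.txt\0./b.txt\0"

# split() treats NUL as a separator, so the NUL after the last name
# produces one trailing empty string.
pieces = data.split("\0")
print(pieces)   # ['./a.txt', './b.txt', '']

# Newline-terminated text read line by line has no such extra item.
lines = "a\nb\n".splitlines()
print(lines)    # ['a', 'b']
```

Any generator built on split() has to drop that last empty piece itself if it is meant to behave like the normal line iterator.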

In any case, as a suggestion to whoever it is that arranges for
stuff to be put into the standard library, there should be something
like this there, so everyone doesn't have to reinvent the wheel (even
if it's an easy wheel to reinvent) for something that any sysadmin
(and many other users) would want to do on practically a daily basis.

|>oug

Jul 18 '05 #4

Scott David Daniels

Douglas Alan wrote:
....

In any case, as a suggestion to whoever it is that arranges for
stuff to be put into the standard library, there should be something
like this there, so everyone doesn't have to reinvent the wheel (even
if it's an easy wheel to reinvent) for something that any sysadmin
(and many other users) would want to do on practically a daily basis.

The general model is that you produce a module, and if it gains an
audience and a stable interface, inclusion might be considered. I'd
suggest you put up a recipe at ActiveState.

--Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #5

Christopher De Vries

On Thu, Feb 24, 2005 at 02:03:52PM -0500, Douglas Alan wrote:

Thanks for the generator. It returns an extra blank line at the end
when used with "find -print0", which is probably not ideal, and is
also not how the normal file line iterator behaves. But don't worry
-- I can fix it.

Sorry... I forgot to try it with a null terminated string. I guess it further
illustrates the power of writing good test cases. Something like this would
help:

    # yield anything left over
    if retain[0]:
        yield retain[0]

The other modification would be an option to ignore multiple nulls in a row,
rather than returning empty strings, which could be done in a similar way.

Chris

Jul 18 '05 #6

John Machin

On Thu, 24 Feb 2005 11:53:32 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:

On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote:

Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated?

I'm not sure if there is a canonical method, but I would recommend using a
generator to get something like this, where 'f' is a file object:

def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []

    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break

        # Split over nulls
        splitstr = instr.split('\0')

        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)

        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:

(1) Inefficient (copies all but the last element of splitstr)

            yield element

    # yield anything left over
    yield retain[0]

(2) Dies when the input file is empty.

(3) As noted by the OP, can return a spurious empty line at the end.

Try this:

!def readweird(f, line_end='\0', bufsiz=8192):
!    retain = ''
!    while True:
!        instr = f.read(bufsiz)
!        if not instr:
!            # End of file
!            break
!        splitstr = instr.split(line_end)
!        if splitstr[-1]:
!            # last piece not terminated
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!            retain = splitstr.pop()
!        else:
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!                retain = ''
!            del splitstr[-1]
!        for element in splitstr:
!            yield element
!    if retain:
!        yield retain

Cheers,
John

Jul 18 '05 #7

John Machin

On Thu, 24 Feb 2005 14:51:07 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:

The other modification would be an option to ignore multiple nulls in a row,
rather than returning empty strings, which could be done in a similar way.

Why not leave this to the caller? Efficiency?? Filtering out empty
lines is the least of your worries.

Try giving the callers options to do things they *can't* do
themselves, like a different line-terminator or a buffer size > 2048
[which could well enhance efficiency] or < 10 [which definitely
enhances testing]
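
To illustrate leaving the filtering to the caller, here is a minimal sketch. The helper name nonempty is hypothetical and not from the thread; it works on any iterable of records, including the generators posted above.

```python
def nonempty(records):
    """Drop empty records; a caller-side filter (hypothetical helper)."""
    return (rec for rec in records if rec)

# Consecutive NULs produce empty records once the data is split.
records = "a\0\0b\0".split("\0")[:-1]   # drop the piece after the final NUL
print(records)                   # ['a', '', 'b']
print(list(nonempty(records)))   # ['a', 'b']
```

Because the filter is a separate one-liner, the reader itself does not need an extra option for it.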

Jul 18 '05 #8

Christopher De Vries

On Fri, Feb 25, 2005 at 07:56:49AM +1100, John Machin wrote:

Try this:
!def readweird(f, line_end='\0', bufsiz=8192):
!    retain = ''
!    while True:
!        instr = f.read(bufsiz)
!        if not instr:
!            # End of file
!            break
!        splitstr = instr.split(line_end)
!        if splitstr[-1]:
!            # last piece not terminated
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!            retain = splitstr.pop()
!        else:
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!                retain = ''
!            del splitstr[-1]
!        for element in splitstr:
!            yield element
!    if retain:
!        yield retain

I think this is a definite improvement... especially putting the buffer size
and line terminators as optional arguments, and handling empty files. I think,
however, that the if splitstr[-1]: ... else: ... clauses aren't necessary, so I
would probably reduce it to this:

!def readweird(f, line_end='\0', bufsiz=8192):
!    retain = ''
!    while True:
!        instr = f.read(bufsiz)
!        if not instr:
!            # End of file
!            break
!        splitstr = instr.split(line_end)
!        if retain:
!            splitstr[0] = retain + splitstr[0]
!        retain = splitstr.pop()
!        for element in splitstr:
!            yield element
!    if retain:
!        yield retain

Popping off that last member and then iterating over the rest of the list as
you suggested is so much more efficient, and it looks a lot better.
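
One way to check the reduced version is to follow the earlier suggestion and run it with very small buffers. The sketch below is only a test harness: the function expected and the sample strings are assumptions made up for illustration, and io.StringIO stands in for a real file.

```python
import io

def readweird(f, line_end='\0', bufsiz=8192):
    # Reduced version from the post above, with indentation restored.
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            # End of file
            break
        splitstr = instr.split(line_end)
        if retain:
            splitstr[0] = retain + splitstr[0]
        retain = splitstr.pop()
        for element in splitstr:
            yield element
    if retain:
        yield retain

def expected(text, line_end='\0'):
    # Reference answer: split everything at once and drop the empty piece
    # that follows a final terminator (hypothetical helper).
    parts = text.split(line_end)
    if parts[-1] == '':
        parts.pop()
    return parts

samples = ['', 'abc', 'abc\0', 'a\0\0b\0', 'one\0two\0three']
for text in samples:
    for bufsiz in range(1, 10):
        got = list(readweird(io.StringIO(text), bufsiz=bufsiz))
        assert got == expected(text), (text, bufsiz, got)
print("all cases agree")
```

Small buffers force every record to straddle a read boundary, which is where the retain logic is exercised. The check only covers a single-character terminator.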

Chris

Jul 18 '05 #9

John Machin

On Thu, 24 Feb 2005 16:51:22 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:

[snip]

I think this is a definite improvement... especially putting the buffer size
and line terminators as optional arguments, and handling empty files. I think,
however, that the if splitstr[-1]: ... else: ... clauses aren't necessary,

Indeed. Any efficiency gain would be negated by the if test and it's
only once per buffer-full anyway. I left all that stuff in to show
that I had actually analyzed the four cases i.e. it wasn't arrived at
by lucky accident.

so I would probably reduce it to this:
[snip]
Popping off that last member and then iterating over the rest of the list as
you suggested is so much more efficient, and it looks a lot better.

Yeah. If it looks like a warthog, it is a warthog. The converse is of
course not true; examples of elegant insufficiency abound.

Cheers,
John

Jul 18 '05 #10

Douglas Alan

Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    You can also set the read size and control whether or not the newline string
    is left on the end of the iterated lines. Setting newline to '\0' is
    particularly good for use with an input file created with something like
    "os.popen('find -print0')".
    """
    partialLine = []
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine = [lines.pop()]
        else:
            partialLine.append(lines.pop())
        for line in lines: yield line + ("", newline)[leaveNewline]
    if partialLine and partialLine[-1] != '': yield "".join(partialLine)

|>oug

Jul 18 '05 #11

Nick Coghlan

Douglas Alan wrote:

Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    You can also set the read size and control whether or not the newline string
    is left on the end of the iterated lines. Setting newline to '\0' is
    particularly good for use with an input file created with something like
    "os.popen('find -print0')".
    """
    partialLine = []
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine = [lines.pop()]
        else:
            partialLine.append(lines.pop())
        for line in lines: yield line + ("", newline)[leaveNewline]
    if partialLine and partialLine[-1] != '': yield "".join(partialLine)

|>oug

Hmm, adding optional arguments to file.readlines() would seem to be the place to
start, as well as undeprecating file.xreadlines() (with the same optional
arguments) for the iterator version.

I've put an RFE (#1152248) on Python's Sourceforge project so the idea doesn't
get completely lost. Actually making it happen needs someone to step up and
offer a patch to the relevant C code and documentation, though.

Cheers,
Nick.

--
Nick Coghlan | nc******@email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net

Jul 18 '05 #12

Douglas Alan

I wrote:

Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:

Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    isNewlineMultiChar = len(newline) > 1
    outputLineEnd = ("", newline)[leaveNewline]

    # 'partialLine' is a list of strings to be concatenated later:
    partialLine = []

    # Because read() might unfortunately split across our newline string, we
    # have to regularly check to see if the newline string appears in what we
    # previously thought was only a partial line. We do so with this generator:
    def linesInPartialLine():
        if isNewlineMultiChar:
            linesInPartialLine = "".join(partialLine).split(newline)
            if len(linesInPartialLine) > 1:
                partialLine[:] = [linesInPartialLine.pop()]
                for line in linesInPartialLine:
                    yield line + outputLineEnd

    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            for line in linesInPartialLine(): yield line
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine[:] = [lines.pop()]
        else:
            partialLine.append(lines.pop())
            for line in linesInPartialLine(): yield line
        for line in lines: yield line + outputLineEnd
    for line in linesInPartialLine(): yield line
    if partialLine and partialLine[-1] != '':
        yield "".join(partialLine)

|>oug

Jul 18 '05 #13

Douglas Alan

I wrote:

Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:

I dunno what I was thinking. That version sucked! Here's a version
that's actually comprehensible, a fraction of the size, and works in
all cases. (I think.)

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine

|>oug

Jul 18 '05 #14

John Machin

Douglas Alan wrote:

I wrote:

Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:

I dunno what I was thinking. That version sucked! Here's a version
that's actually comprehensible, a fraction of the size, and works in
all cases. (I think.)

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)

The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it. Perhaps you
might like to refer back to CdV's solution which was prepending the
residue to the first element of the split() result.

        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd

In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".

    if partialLine: yield partialLine

|>oug

Jul 18 '05 #15

Douglas Alan

"John Machin" <sj******@lexicon.net> writes:

lines = (partialLine + charsJustRead).split(newline)

The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it.

If there is, I'm all ears. In a previous post I provided code that
doesn't concatenate any strings together until the last possible
moment (i.e. when yielding a value). The problem with that code
was that it was complicated and didn't work right in all cases.

One way of solving the string concatenation issue would be to write a
string find routine that will work on lists of strings while ignoring
the boundaries between list elements. (I.e., it will consider the
list of strings to be one long string for its purposes.) Unless it is
written in C, however, I bet it will typically be much slower than the
code I just provided.

Perhaps you might like to refer back to CdV's solution which was
prepending the residue to the first element of the split() result.

The problem with that solution is that it doesn't work in all cases
when the line-separation string is more than one character.
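
The failure can be reproduced with a two-character separator that straddles a read boundary. The sketch below is an illustration only: per_chunk_split and joined_split are hypothetical names for the two approaches discussed in the thread, and the tiny buffer size is chosen to force the split.

```python
import io

def per_chunk_split(f, sep, bufsiz):
    # Splits each chunk on its own, then prepends the leftover to the
    # first piece (the approach that breaks for multi-character separators).
    retain = ''
    while True:
        chunk = f.read(bufsiz)
        if not chunk:
            break
        pieces = chunk.split(sep)
        pieces[0] = retain + pieces[0]
        retain = pieces.pop()
        for piece in pieces:
            yield piece
    if retain:
        yield retain

def joined_split(f, sep, bufsiz):
    # Prepends the leftover to the whole chunk before splitting.
    retain = ''
    while True:
        chunk = f.read(bufsiz)
        if not chunk:
            break
        pieces = (retain + chunk).split(sep)
        retain = pieces.pop()
        for piece in pieces:
            yield piece
    if retain:
        yield retain

text = 'ab\r\ncd\r\n'
# With 3-character reads the chunks are 'ab\r', '\ncd', '\r\n', so the
# first '\r\n' is never seen whole by a per-chunk split.
broken = list(per_chunk_split(io.StringIO(text, newline=''), '\r\n', 3))
fixed = list(joined_split(io.StringIO(text, newline=''), '\r\n', 3))
print(broken)   # ['ab\r\ncd']
print(fixed)    # ['ab', 'cd']
```

The per-chunk version never sees the separator in one piece, so it glues the two records together.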

for line in lines: yield line + outputLineEnd

In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".

In Python,

longString + "" is longString

evaluates to True. I don't know how you can do nothing more
gracefully than that.
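
A small sketch of the two different claims involved here, with made-up variable names. Equality always holds; the identity result is a detail of the interpreter and the language does not promise it.

```python
long_string = "x" * 1000
empty = ""

# Value equality: concatenating an empty string never changes the text.
same_value = (long_string + empty) == long_string

# Object identity: whether the very same object comes back depends on the
# interpreter, so code should not rely on it.
same_object = (long_string + empty) is long_string

print(same_value)
print(same_object)
```

If the second line prints True, the interpreter returned the original string without copying it, which is the special-casing under discussion.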

|>oug

Jul 18 '05 #16

John Machin

Douglas Alan wrote:

"John Machin" <sj******@lexicon.net> writes:

lines = (partialLine + charsJustRead).split(newline)

The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it.

If there is, I'm all ears. In a previous post I provided code that
doesn't concatenate any strings together until the last possible
moment (i.e. when yielding a value). The problem with that code
was that it was complicated and didn't work right in all cases.

One way of solving the string concatenation issue would be to write a
string find routine that will work on lists of strings while ignoring
the boundaries between list elements. (I.e., it will consider the
list of strings to be one long string for its purposes.) Unless it is
written in C, however, I bet it will typically be much slower than the
code I just provided.

Perhaps you might like to refer back to CdV's solution which was
prepending the residue to the first element of the split() result.

The problem with that solution is that it doesn't work in all cases
when the line-separation string is more than one character.

for line in lines: yield line + outputLineEnd

In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".

In Python,

longString + "" is longString

evaluates to True. I don't know how you can do nothing more
gracefully than that.

And also "" + longString is longString

The string + operator provides those graceful *external* results by
ugly special-case testing internally.

It is not graceful IMHO to concatenate a variable which you already
know refers to a null string.

Let's go back to the first point, and indeed further back to the use
cases:

(1) multi-byte separator for lines in test files: never heard of one
apart from '\r\n'; presume this is rare, so test for length of 1 and
use Chris's simplification of my effort in this case.

(2) keep newline: with the standard file reading routines, if one is
going to do anything much with the line other than write it out again,
one does buffer = buffer.rstrip('\n') anyway. In the case of a
non-standard separator, one is likely to want to write the line out
with the standard '\n'. So, specialisation for this is indicated:

!    if keepNewline:
!        for line in lines: yield line + newline
!    else:
!        for line in lines: yield line

Cheers,
John

Jul 18 '05 #17

Douglas Alan

"John Machin" <sj******@lexicon.net> writes:

In Python,

longString + "" is longString

evaluates to True. I don't know how you can do nothing more
gracefully than that.

And also "" + longString is longString

The string + operator provides those graceful *external* results by
ugly special-case testing internally.

I guess I don't know what you are getting at. If Python performs ugly
special-case testing internally so that I can write more simple,
elegant code, then more power to it! Concentrating ugliness in one,
small, reusable place is a good thing.

It is not graceful IMHO to concatenate a variable which you already
know refers to a null string.

It's better than making my code bigger, uglier, and putting in extra
tests for no particularly good reason.

Let's go back to the first point, and indeed further back to the use
cases:

(1) multi-byte separator for lines in test files: never heard of one
apart from '\r\n'; presume this is rare, so test for length of 1 and
use Chris's simplification of my effort in this case.

I want the ability to handle multibyte separators, and so I coded for
it. There are plenty of other uses for an iterator that handles
multi-byte separators. Not all of them would typically be considered
"newline-delimited lines" as opposed to "records delimited by a
separation string", but a rose by any other name....

If one wants to special case for single-byte separators in the name of
efficiency, I provided one back there in the thread that never
degrades to N^2, as the ones you and Chris provided do.

(2) keep newline: with the standard file reading routines, if one is
going to do anything much with the line other than write it out again,
one does buffer = buffer.rstrip('\n') anyway. In the case of a
non-standard separator, one is likely to want to write the line out
with the standard '\n'. So, specialisation for this is indicated:

!    if keepNewline:
!        for line in lines: yield line + newline
!    else:
!        for line in lines: yield line

I would certainly never want the iterator to tack on a standard "\n"
as a replacement for whatever newline string the input used. That
seems like completely gratuitous functionality to me. The standard
(but not the only) reason that I want the line terminator left on the
yielded strings is so that you can tell whether or not there is a
line-separator terminating the very last line of the input. Usually I
want the line-terminator discarded, and it kind of annoys me that the
standard line iterator leaves it on.
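
A minimal sketch of that use of leaveNewline, built on the short fileLineIter from earlier in the thread. The helper last_record_terminated is hypothetical, and io.StringIO stands in for a real input file.

```python
import io

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    # The short version from earlier in the thread, indentation restored.
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine

def last_record_terminated(text, newline='\0'):
    # Hypothetical helper: True if the final record ends with the separator.
    records = list(fileLineIter(io.StringIO(text), newline, leaveNewline=True))
    return bool(records) and records[-1].endswith(newline)

print(last_record_terminated('a\0b\0'))   # True
print(last_record_terminated('a\0b'))     # False
```

With leaveNewline=False the two inputs would yield the same records, so the distinction would be lost.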

|>oug

Jul 18 '05 #18
