Bytes | Software Development & Data Engineering Community

Canonical way of dealing with null-separated lines?

Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated? Sure, I can
implement my own iterator using read() and split(), etc., but
considering that using "find -print0" is so common, it seems like
there should be a more canonical way.
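(For concreteness, a minimal sketch of the one-shot approach, assuming the whole `find -print0` output fits in memory; the sample strings here are hypothetical. The generators in the replies below avoid holding everything in memory at once.)

```python
# Hypothetical stand-in for the raw output of `find -print0`:
data = 'a.txt\0dir/b.txt\0'

records = data.split('\0')
if records and records[-1] == '':
    records.pop()  # discard the empty piece after the trailing NUL

assert records == ['a.txt', 'dir/b.txt']
```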

|>oug
Jul 18 '05 #1
On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote:
Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated?


I'm not sure if there is a canonical method, but I would recommend using a
generator to get something like this, where 'f' is a file object:

def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []

    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break

        # Split over nulls
        splitstr = instr.split('\0')

        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)

        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:
            yield element

    # yield anything left over
    yield retain[0]

Chris
Jul 18 '05 #2
Douglas Alan wrote:
Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated? Sure, I can
implement my own iterator using read() and split(), etc., but
considering that using "find -print0" is so common, it seems like
there should be a more cannonical way.


You could start with this code and add '\0' as a line terminator:

http://members.dsl-only.net/~daniels/ilines.html
--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #3
Christopher De Vries <de*****@idolstarastronomer.com> writes:
I'm not sure if there is a canonical method, but I would
recommending using a generator to get something like this, where 'f'
is a file object:


Thanks for the generator. It returns an extra blank line at the end
when used with "find -print0", which is probably not ideal, and is
also not how the normal file line iterator behaves. But don't worry
-- I can fix it.

In any case, as a suggestion to whomever it is that arranges for
stuff to be put into the standard library, there should be something
like this there, so everyone doesn't have to reinvent the wheel (even
if it's an easy wheel to reinvent) for something that any sysadmin
(and many other users) would want to do on practically a daily basis.

|>oug
Jul 18 '05 #4
Douglas Alan wrote:
....
In any case, as a suggestion to whomever it is that arranges for
stuff to be put into the standard library, there should be something
like this there, so everyone doesn't have to reinvent the wheel (even
if it's an easy wheel to reinvent) for something that any sysadmin
(and many other users) would want to do on practically a daily basis.


The general model is that you produce a module, and if it gains an
audience and a stable interface, inclusion might be considered. I'd
suggest you put up a recipe at ActiveState.

--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #5
On Thu, Feb 24, 2005 at 02:03:52PM -0500, Douglas Alan wrote:
Thanks for the generator. It returns an extra blank line at the end
when used with "find -print0", which is probably not ideal, and is
also not how the normal file line iterator behaves. But don't worry
-- I can fix it.


Sorry... I forgot to try it with a null terminated string. I guess it further
illustrates the power of writing good test cases. Something like this would
help:

# yield anything left over
if retain[0]:
    yield retain[0]

The other modification would be an option to ignore multiple nulls in a row,
rather than returning empty strings, which could be done in a similar way.

Chris
Jul 18 '05 #6
On Thu, 24 Feb 2005 11:53:32 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:
On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote:
Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated?
I'm not sure if there is a canonical method, but I would recommend using a
generator to get something like this, where 'f' is a file object:

def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []

    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break

        # Split over nulls
        splitstr = instr.split('\0')

        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)

        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:

(1) Inefficient (copies all but the last element of splitstr)

            yield element

    # yield anything left over
    yield retain[0]


(2) Dies when the input file is empty.

(3) As noted by the OP, can return a spurious empty line at the end.

Try this:

def readweird(f, line_end='\0', bufsiz=8192):
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            # End of file
            break
        splitstr = instr.split(line_end)
        if splitstr[-1]:
            # last piece not terminated
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = splitstr.pop()
        else:
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = ''
            del splitstr[-1]
        for element in splitstr:
            yield element
    if retain:
        yield retain
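(A quick check of this version, restated here so the snippet is self-contained; the tiny bufsiz forces records to straddle buffer boundaries, and the inputs are made-up examples:)

```python
from io import StringIO

# John's readweird, restated so this snippet runs on its own
def readweird(f, line_end='\0', bufsiz=8192):
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            break  # End of file
        splitstr = instr.split(line_end)
        if splitstr[-1]:
            # last piece not terminated
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = splitstr.pop()
        else:
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = ''
            del splitstr[-1]
        for element in splitstr:
            yield element
    if retain:
        yield retain

# bufsiz=3 forces records to be split across read() boundaries
assert list(readweird(StringIO('ab\0cde\0f\0'), bufsiz=3)) == ['ab', 'cde', 'f']
assert list(readweird(StringIO(''))) == []                  # empty input is fine
assert list(readweird(StringIO('no-terminator'), bufsiz=4)) == ['no-terminator']
```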

Cheers,
John
Jul 18 '05 #7
On Thu, 24 Feb 2005 14:51:07 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:

The other modification would be an option to ignore multiple nulls in a row,
rather than returning empty strings, which could be done in a similar way.


Why not leave this to the caller? Efficiency?? Filtering out empty
lines is the least of your worries.

Try giving the callers options to do things they *can't* do
themselves, like a different line-terminator or a buffer size > 2048
[which could well enhance efficiency] or < 10 [which definitely
enhances testing]

Jul 18 '05 #8
On Fri, Feb 25, 2005 at 07:56:49AM +1100, John Machin wrote:
Try this:
def readweird(f, line_end='\0', bufsiz=8192):
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            # End of file
            break
        splitstr = instr.split(line_end)
        if splitstr[-1]:
            # last piece not terminated
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = splitstr.pop()
        else:
            if retain:
                splitstr[0] = retain + splitstr[0]
            retain = ''
            del splitstr[-1]
        for element in splitstr:
            yield element
    if retain:
        yield retain


I think this is a definite improvement... especially putting the buffer size
and line terminators as optional arguments, and handling empty files. I think,
however, that the if splitstr[-1]: ... else: ... clauses aren't necessary, so I
would probably reduce it to this:

def readweird(f, line_end='\0', bufsiz=8192):
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            # End of file
            break
        splitstr = instr.split(line_end)
        if retain:
            splitstr[0] = retain + splitstr[0]
        retain = splitstr.pop()
        for element in splitstr:
            yield element
    if retain:
        yield retain

Popping off that last member and then iterating over the rest of the list as
you suggested is so much more efficient, and it looks a lot better.
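(As a sketch, here is the reduced version exercised with a small buffer, plus the caller-side filtering of empty records that was suggested upthread; the sample input is made up:)

```python
from io import StringIO

# Chris's reduced version of John's generator, restated so this runs standalone
def readweird(f, line_end='\0', bufsiz=8192):
    retain = ''
    while True:
        instr = f.read(bufsiz)
        if not instr:
            break  # End of file
        splitstr = instr.split(line_end)
        if retain:
            splitstr[0] = retain + splitstr[0]
        retain = splitstr.pop()
        for element in splitstr:
            yield element
    if retain:
        yield retain

# Caller-side filtering of empty records (runs of consecutive NULs),
# rather than building the option into the generator:
f = StringIO('a\0\0\0b\0')
nonempty = [rec for rec in readweird(f, bufsiz=2) if rec]
assert nonempty == ['a', 'b']
```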

Chris
Jul 18 '05 #9
On Thu, 24 Feb 2005 16:51:22 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:

[snip]
I think this is a definite improvement... especially putting the buffer size
and line terminators as optional arguments, and handling empty files. I think,
however, that the if splitstr[-1]: ... else: ... clauses aren't necessary,
Indeed. Any efficiency gain would be negated by the if test and it's
only once per buffer-full anyway. I left all that stuff in to show
that I had actually analyzed the four cases i.e. it wasn't arrived at
by lucky accident.
so I
would probably reduce it to this:
[snip]
Popping off that last member and then iterating over the rest of the list as
you suggested is so much more efficient, and it looks a lot better.


Yeah. If it looks like a warthog, it is a warthog. The converse is of
course not true; examples of elegant insufficiency abound.

Cheers,
John

Jul 18 '05 #10
Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    You can also set the read size and control whether or not the newline string
    is left on the end of the iterated lines. Setting newline to '\0' is
    particularly good for use with an input file created with something like
    "os.popen('find -print0')".
    """
    partialLine = []
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine = [lines.pop()]
        else:
            partialLine.append(lines.pop())
        for line in lines: yield line + ("", newline)[leaveNewline]
    if partialLine and partialLine[-1] != '': yield "".join(partialLine)

|>oug
Jul 18 '05 #11
Douglas Alan wrote:
Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    You can also set the read size and control whether or not the newline string
    is left on the end of the iterated lines. Setting newline to '\0' is
    particularly good for use with an input file created with something like
    "os.popen('find -print0')".
    """
    partialLine = []
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine = [lines.pop()]
        else:
            partialLine.append(lines.pop())
        for line in lines: yield line + ("", newline)[leaveNewline]
    if partialLine and partialLine[-1] != '': yield "".join(partialLine)

|>oug


Hmm, adding optional arguments to file.readlines() would seem to be the place to
start, as well as undeprecating file.xreadlines() (with the same optional
arguments) for the iterator version.

I've put an RFE (#1152248) on Python's Sourceforge project so the idea doesn't
get completely lost. Actually making it happen needs someone to step up and
offer a patch to the relevant C code and documentation, though.

Cheers,
Nick.

--
Nick Coghlan | nc******@email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net
Jul 18 '05 #12
I wrote:
Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:


Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    isNewlineMultiChar = len(newline) > 1
    outputLineEnd = ("", newline)[leaveNewline]

    # 'partialLine' is a list of strings to be concatenated later:
    partialLine = []

    # Because read() might unfortunately split across our newline string, we
    # have to regularly check to see if the newline string appears in what we
    # previously thought was only a partial line. We do so with this generator:
    def linesInPartialLine():
        if isNewlineMultiChar:
            linesInPartialLine = "".join(partialLine).split(newline)
            if len(linesInPartialLine) > 1:
                partialLine[:] = [linesInPartialLine.pop()]
                for line in linesInPartialLine:
                    yield line + outputLineEnd

    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            for line in linesInPartialLine(): yield line
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine[:] = [lines.pop()]
        else:
            partialLine.append(lines.pop())
            for line in linesInPartialLine(): yield line
        for line in lines: yield line + outputLineEnd
    for line in linesInPartialLine(): yield line
    if partialLine and partialLine[-1] != '':
        yield "".join(partialLine)
|>oug
Jul 18 '05 #13
I wrote:
Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:


I dunno what I was thinking. That version sucked! Here's a version
that's actually comprehensible, a fraction of the size, and works in
all cases. (I think.)

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine
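(A couple of sanity checks for this version, restated so the snippet runs on its own; readSize=4 makes the multi-character separator straddle a buffer boundary, and the inputs are invented:)

```python
from io import StringIO

# The version above, restated for a standalone check
def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine

# Multi-character separator, with readSize=4 so the separator itself
# can be split across two read() calls:
f = StringIO('one<->two<->three')
assert list(fileLineIter(f, newline='<->', readSize=4)) == ['one', 'two', 'three']

# Null-separated, as produced by `find -print0`:
f = StringIO('a.txt\0b.txt\0')
assert list(fileLineIter(f, newline='\0')) == ['a.txt', 'b.txt']
```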

|>oug
Jul 18 '05 #14

Douglas Alan wrote:
I wrote:
Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:
I dunno what I was thinking. That version sucked! Here's a version
that's actually comprehensible, a fraction of the size, and works in
all cases. (I think.)

def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)

The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it. Perhaps you
might like to refer back to CdV's solution which was prepending the
residue to the first element of the split() result.

        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd

In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".

    if partialLine: yield partialLine

|>oug


Jul 18 '05 #15
"John Machin" <sj******@lexicon.net> writes:
lines = (partialLine + charsJustRead).split(newline)
The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it.
If there is, I'm all ears. In a previous post I provided code that
doesn't concatenate any strings together until the last possible
moment (i.e. when yielding a value). The problem with that code
was that it was complicated and didn't work right in all cases.

One way of solving the string concatenation issue would be to write a
string find routine that will work on lists of strings while ignoring
the boundaries between list elements. (I.e., it will consider the
list of strings to be one long string for its purposes.) Unless it is
written in C, however, I bet it will typically be much slower than the
code I just provided.
Perhaps you might like to refer back to CdV's solution which was
prepending the residue to the first element of the split() result.
The problem with that solution is that it doesn't work in all cases
when the line-separation string is more than one character.
for line in lines: yield line + outputLineEnd

In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".


In Python,

longString + "" is longString

evaluates to True. I don't know how you can do nothing more
gracefully than that.

|>oug
Jul 18 '05 #16

Douglas Alan wrote:
"John Machin" <sj******@lexicon.net> writes:
lines = (partialLine + charsJustRead).split(newline)
The above line is prepending a short string to what will typically be
a whole buffer full. There's gotta be a better way to do it.
If there is, I'm all ears. In a previous post I provided code that
doesn't concatenate any strings together until the last possible
moment (i.e. when yielding a value). The problem with that code
was that it was complicated and didn't work right in all cases.

One way of solving the string concatenation issue would be to write a
string find routine that will work on lists of strings while ignoring
the boundaries between list elements. (I.e., it will consider the
list of strings to be one long string for its purposes.) Unless it is
written in C, however, I bet it will typically be much slower than
the code I just provided.
Perhaps you might like to refer back to CdV's solution which was
prepending the residue to the first element of the split() result.
The problem with that solution is that it doesn't work in all cases
when the line-separation string is more than one character.
for line in lines: yield line + outputLineEnd
In the case of leaveNewline being false, you are concatenating an
empty string. IMHO, to quote Jon Bentley, one should "do nothing
gracefully".
In Python,

longString + "" is longString

evaluates to True. I don't know how you can do nothing more
gracefully than that.


And also "" + longString is longString

The string + operator provides those graceful *external* results by
ugly special-case testing internally.

It is not graceful IMHO to concatenate a variable which you already
know refers to a null string.

Let's go back to the first point, and indeed further back to the use
cases:

(1) multi-byte separator for lines in text files: never heard of one
apart from '\r\n'; presume this is rare, so test for length of 1 and
use Chris's simplification of my effort in this case.

(2) keep newline: with the standard file reading routines, if one is
going to do anything much with the line other than write it out again,
one does buffer = buffer.rstrip('\n') anyway. In the case of a
non-standard separator, one is likely to want to write the line out
with the standard '\n'. So, specialisation for this is indicated:

if keepNewline:
    for line in lines: yield line + newline
else:
    for line in lines: yield line

Cheers,
John

Jul 18 '05 #17
"John Machin" <sj******@lexicon.net> writes:
In Python, longString + "" is longString evaluates to True. I don't know how you can do nothing more
gracefully than that.
And also "" + longString is longString. The string + operator provides
those graceful *external* results by ugly special-case testing internally.
I guess I don't know what you are getting at. If Python performs ugly
special-case testing internally so that I can write more simple,
elegant code, then more power to it! Concentrating ugliness in one,
small, reusable place is a good thing.

It is not graceful IMHO to concatenate a variable which you already
know refers to a null string.
It's better than making my code bigger, uglier, and putting in extra
tests for no particularly good reason.

Let's go back to the first point, and indeed further back to the use
cases:
(1) multi-byte separator for lines in text files: never heard of one
apart from '\r\n'; presume this is rare, so test for length of 1 and
use Chris's simplification of my effort in this case.
I want the ability to handle multibyte separators, and so I coded for
it. There are plenty of other uses for an iterator that handles
multi-byte separators. Not all of them would typically be considered
"newline-delimited lines" as opposed to "records delimited by a
separation string", but a rose by any other name....

If one wants to special-case for single-byte separators in the name of
efficiency, I provided one back there in the thread that never
degrades to N^2, as the ones you and Chris provided do.

(2) keep newline: with the standard file reading routines, if one is
going to do anything much with the line other than write it out again,
one does buffer = buffer.rstrip('\n') anyway. In the case of a
non-standard separator, one is likely to want to write the line out
with the standard '\n'. So, specialisation for this is indicated:

if keepNewline:
    for line in lines: yield line + newline
else:
    for line in lines: yield line


I would certainly never want the iterator to tack on a standard "\n"
as a replacement for whatever newline string the input used. That
seems like completely gratuitous functionality to me. The standard
(but not the only) reason that I want the line terminator left on the
yielded strings is so that you can tell whether or not there is a
line-separator terminating the very last line of the input. Usually I
want the line-terminator discarded, and it kind of annoys me that the
standard line iterator leaves it on.
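(That use case can be sketched with the final fileLineIter from post #14, restated here; with leaveNewline=True an unterminated final line is detectable because it alone lacks the separator. The input is a made-up example:)

```python
from io import StringIO

# Doug's final version from post #14, restated for a standalone sketch
def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine

# Input whose last record has no trailing NUL:
lines = list(fileLineIter(StringIO('a\0b'), newline='\0', leaveNewline=True))
assert lines == ['a\0', 'b']
assert not lines[-1].endswith('\0')  # the final line was unterminated
```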

|>oug
Jul 18 '05 #18
