Is there a canonical way of iterating over the lines of a file that
are null-separated rather than newline-separated? Sure, I can
implement my own iterator using read() and split(), etc., but
considering that using "find -print0" is so common, it seems like
there should be a more canonical way.
|>oug
On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote: Is there a canonical way of iterating over the lines of a file that are null-separated rather than newline-separated?
I'm not sure if there is a canonical method, but I would recommend using a
generator to get something like this, where 'f' is a file object:
def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []
    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break
        # Split over nulls
        splitstr = instr.split('\0')
        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)
        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:
            yield element
    # yield anything left over
    yield retain[0]
Chris
Douglas Alan wrote: Is there a canonical way of iterating over the lines of a file that are null-separated rather than newline-separated? Sure, I can implement my own iterator using read() and split(), etc., but considering that using "find -print0" is so common, it seems like there should be a more canonical way.
You could start with this code and add '\0' as a line terminator: http://members.dsl-only.net/~daniels/ilines.html
--Scott David Daniels Sc***********@Acm.Org
Christopher De Vries <de*****@idolstarastronomer.com> writes: I'm not sure if there is a canonical method, but I would recommend using a generator to get something like this, where 'f' is a file object:
Thanks for the generator. It returns an extra blank line at the end
when used with "find -print0", which is probably not ideal, and is
also not how the normal file line iterator behaves. But don't worry
-- I can fix it.
In any case, as a suggestion to whoever it is that arranges for
stuff to be put into the standard library, there should be something
like this there, so everyone doesn't have to reinvent the wheel (even
if it's an easy wheel to reinvent) for something that any sysadmin
(and many other users) would want to do on practically a daily basis.
|>oug
Douglas Alan wrote:
.... In any case, as a suggestion to whoever it is that arranges for stuff to be put into the standard library, there should be something like this there, so everyone doesn't have to reinvent the wheel (even if it's an easy wheel to reinvent) for something that any sysadmin (and many other users) would want to do on practically a daily basis.
The general model is that you produce a module, and if it gains an
audience and a stable interface, inclusion might be considered. I'd
suggest you put up a recipe at ActiveState.
--Scott David Daniels Sc***********@Acm.Org
On Thu, Feb 24, 2005 at 02:03:52PM -0500, Douglas Alan wrote: Thanks for the generator. It returns an extra blank line at the end when used with "find -print0", which is probably not ideal, and is also not how the normal file line iterator behaves. But don't worry -- I can fix it.
Sorry... I forgot to try it with a null terminated string. I guess it further
illustrates the power of writing good test cases. Something like this would
help:
    # yield anything left over
    if retain[0]:
        yield retain[0]
The other modification would be an option to ignore multiple nulls in a row,
rather than returning empty strings, which could be done in a similar way.
Chris
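The trailing empty string comes from how split() treats terminated input: "find -print0" puts a null after every name, including the last, so the piece after the final null is empty. A minimal Python 3 sketch of that behaviour (the file names are invented sample data, not taken from the thread):

```python
# Shaped like "find -print0" output: every name is followed by a null,
# including the last one. These names are made up for illustration.
terminated = "a.txt\0b.txt\0"
# The same names with nulls used only as separators between them.
separated = "a.txt\0b.txt"

# split() always returns the piece after the final separator, which is
# an empty string when the input ends with the separator.
pieces_terminated = terminated.split("\0")
pieces_separated = separated.split("\0")

print(pieces_terminated)  # ['a.txt', 'b.txt', '']
print(pieces_separated)   # ['a.txt', 'b.txt']
```

Dropping the last piece only when it is empty, as in the fix above, handles both shapes of input.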
On Thu, 24 Feb 2005 11:53:32 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:
On Wed, Feb 23, 2005 at 10:54:50PM -0500, Douglas Alan wrote: Is there a canonical way of iterating over the lines of a file that are null-separated rather than newline-separated?
I'm not sure if there is a canonical method, but I would recommend using a generator to get something like this, where 'f' is a file object:
def readnullsep(f):
    # Need a place to put potential pieces of a null separated string
    # across buffer boundaries
    retain = []
    while True:
        instr = f.read(2048)
        if len(instr)==0:
            # End of file
            break
        # Split over nulls
        splitstr = instr.split('\0')
        # Combine with anything left over from previous read
        retain.append(splitstr[0])
        splitstr[0] = ''.join(retain)
        # Keep last piece for next loop and yield the rest
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:
(1) Inefficient (copies all but the last element of splitstr)
            yield element
    # yield anything left over
    yield retain[0]
(2) Dies when the input file is empty.
(3) As noted by the OP, can return a spurious empty line at the end.
Try this:
!def readweird(f, line_end='\0', bufsiz=8192):
!    retain = ''
!    while True:
!        instr = f.read(bufsiz)
!        if not instr:
!            # End of file
!            break
!        splitstr = instr.split(line_end)
!        if splitstr[-1]:
!            # last piece not terminated
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!            retain = splitstr.pop()
!        else:
!            if retain:
!                splitstr[0] = retain + splitstr[0]
!                retain = ''
!            del splitstr[-1]
!        for element in splitstr:
!            yield element
!    if retain:
!        yield retain
Cheers,
John
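Points (2) and (3) are easy to reproduce. The sketch below is a Python 3 port of the first readnullsep generator, bugs deliberately left in; the name readnullsep_original and the use of io.StringIO as a stand-in for a file are assumptions made for illustration.

```python
import io

def readnullsep_original(f):
    """Python 3 port of the first generator in the thread, bugs included."""
    retain = []
    while True:
        instr = f.read(2048)
        if len(instr) == 0:
            break
        splitstr = instr.split("\0")
        retain.append(splitstr[0])
        splitstr[0] = "".join(retain)
        retain = [splitstr[-1]]
        for element in splitstr[:-1]:
            yield element
    # Fails here when nothing was ever read: retain is still empty.
    yield retain[0]

# (3) Null-terminated input produces a spurious empty final item.
with_terminator = list(readnullsep_original(io.StringIO("a\0b\0")))
print(with_terminator)  # ['a', 'b', '']

# (2) An empty file raises IndexError on the final yield.
try:
    list(readnullsep_original(io.StringIO("")))
    empty_file_error = None
except IndexError as exc:
    empty_file_error = exc
print(type(empty_file_error).__name__)  # IndexError
```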
On Thu, 24 Feb 2005 14:51:07 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote: The other modification would be an option to ignore multiple nulls in a row, rather than returning empty strings, which could be done in a similar way.
Why not leave this to the caller? Efficiency?? Filtering out empty
lines is the least of your worries.
Try giving the callers options to do things they *can't* do
themselves, like a different line-terminator or a buffer size > 2048
[which could well enhance efficiency] or < 10 [which definitely
enhances testing]
On Fri, Feb 25, 2005 at 07:56:49AM +1100, John Machin wrote:
Try this:
[snip]
I think this is a definite improvement... especially putting the buffer size
and line terminators as optional arguments, and handling empty files. I think,
however that the if splitstr[-1]: ... else: ... clauses aren't necessary, so I
would probably reduce it to this:
!def readweird(f, line_end='\0', bufsiz=8192):
!    retain = ''
!    while True:
!        instr = f.read(bufsiz)
!        if not instr:
!            # End of file
!            break
!        splitstr = instr.split(line_end)
!        if retain:
!            splitstr[0] = retain + splitstr[0]
!        retain = splitstr.pop()
!        for element in splitstr:
!            yield element
!    if retain:
!        yield retain
Popping off that last member and then iterating over the rest of the list as
you suggested is so much more efficient, and it looks a lot better.
Chris
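The earlier remark that small buffer sizes enhance testing can be turned into a quick check. This is only a sketch: it transcribes the reduced readweird to Python 3, uses io.StringIO in place of a real file, and runs every buffer size from 1 up to just past the input length on an invented sample string.

```python
import io

def readweird(f, line_end="\0", bufsiz=8192):
    """Python 3 transcription of the reduced generator above."""
    retain = ""
    while True:
        instr = f.read(bufsiz)
        if not instr:
            # End of file
            break
        splitstr = instr.split(line_end)
        if retain:
            splitstr[0] = retain + splitstr[0]
        retain = splitstr.pop()
        for element in splitstr:
            yield element
    if retain:
        yield retain

# Invented sample; the doubled null produces one empty record.
data = "alpha\0beta\0\0gamma"
expected = ["alpha", "beta", "", "gamma"]

# Every buffer size forces the read boundaries to fall somewhere else.
results = {
    bufsiz: list(readweird(io.StringIO(data), bufsiz=bufsiz))
    for bufsiz in range(1, len(data) + 2)
}
all_match = all(result == expected for result in results.values())
print(all_match)  # True
```

This only exercises a single-character terminator, which is the case this version was written for.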
On Thu, 24 Feb 2005 16:51:22 -0500, Christopher De Vries
<de*****@idolstarastronomer.com> wrote:
[snip] I think this is a definite improvement... especially putting the buffer size and line terminators as optional arguments, and handling empty files. I think, however that the if splitstr[-1]: ... else: ... clauses aren't necessary,
Indeed. Any efficiency gain would be negated by the if test and it's
only once per buffer-full anyway. I left all that stuff in to show
that I had actually analyzed the four cases i.e. it wasn't arrived at
by lucky accident.
so I would probably reduce it to this:
[snip] Popping off that last member and then iterating over the rest of the list as you suggested is so much more efficient, and it looks a lot better.
Yeah. If it looks like a warthog, it is a warthog. The converse is of
course not true; examples of elegant insufficiency abound.
Cheers,
John
Okay, here's the definitive version (or so say I). Some good doobie
please make sure it makes its way into the standard library:
def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.
    You can also set the read size and control whether or not the newline string
    is left on the end of the iterated lines. Setting newline to '\0' is
    particularly good for use with an input file created with something like
    "os.popen('find -print0')".
    """
    partialLine = []
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine = [lines.pop()]
        else:
            partialLine.append(lines.pop())
        for line in lines: yield line + ("", newline)[leaveNewline]
    if partialLine and partialLine[-1] != '': yield "".join(partialLine)
|>oug
Douglas Alan wrote: Okay, here's the definitive version (or so say I). Some good doobie please make sure it makes its way into the standard library:
[snip]
Hmm, adding optional arguments to file.readlines() would seem to be the place to
start, as well as undeprecating file.xreadlines() (with the same optional
arguments) for the iterator version.
I've put an RFE (#1152248) on Python's Sourceforge project so the idea doesn't
get completely lost. Actually making it happen needs someone to step up and
offer a patch to the relevant C code and documentation, though.
Cheers,
Nick.
--
Nick Coghlan | nc******@email.com | Brisbane, Australia
--------------------------------------------------------------- http://boredomandlaziness.skystorm.net
I wrote: Okay, here's the definitive version (or so say I). Some good doobie please make sure it makes its way into the standard library:
Oops, I just realized that my previously definitive version did not
handle multi-character newlines. So here is a new definitive
version. Oog, now my brain hurts:
def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.
    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    isNewlineMultiChar = len(newline) > 1
    outputLineEnd = ("", newline)[leaveNewline]
    # 'partialLine' is a list of strings to be concatenated later:
    partialLine = []
    # Because read() might unfortunately split across our newline string, we
    # have to regularly check to see if the newline string appears in what we
    # previously thought was only a partial line. We do so with this generator:
    def linesInPartialLine():
        if isNewlineMultiChar:
            linesInPartialLine = "".join(partialLine).split(newline)
            if len(linesInPartialLine) > 1:
                partialLine[:] = [linesInPartialLine.pop()]
                for line in linesInPartialLine:
                    yield line + outputLineEnd
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = charsJustRead.split(newline)
        if len(lines) > 1:
            for line in linesInPartialLine(): yield line
            partialLine.append(lines[0])
            lines[0] = "".join(partialLine)
            partialLine[:] = [lines.pop()]
        else:
            partialLine.append(lines.pop())
            for line in linesInPartialLine(): yield line
        for line in lines: yield line + outputLineEnd
    for line in linesInPartialLine(): yield line
    if partialLine and partialLine[-1] != '':
        yield "".join(partialLine)
|>oug
I wrote: Oops, I just realized that my previously definitive version did not handle multi-character newlines. So here is a new definitive version. Oog, now my brain hurts:
I dunno what I was thinking. That version sucked! Here's a version
that's actually comprehensible, a fraction of the size, and works in
all cases. (I think.)
def fileLineIter(inputFile, newline='\n', leaveNewline=False, readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.
    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    outputLineEnd = ("", newline)[leaveNewline]
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        lines = (partialLine + charsJustRead).split(newline)
        partialLine = lines.pop()
        for line in lines: yield line + outputLineEnd
    if partialLine: yield partialLine
|>oug
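For anyone on Python 3, here is one way the version above might be carried over. This is a sketch under assumptions, not a tested drop-in: the snake_case names are respellings, and the newline[:0] trick (to get an empty str or empty bytes matching the separator's type) is an addition so the same code can read binary streams.

```python
import io

def file_line_iter(input_file, newline="\n", leave_newline=False, read_size=8192):
    """Python 3 rendering of the generator above; names are respelled.

    Works on text or binary streams as long as `newline` has the same
    type (str or bytes) as what input_file.read() returns.
    """
    empty = newline[:0]                 # "" for str, b"" for bytes
    output_line_end = newline if leave_newline else empty
    partial_line = empty
    while True:
        chars_just_read = input_file.read(read_size)
        if not chars_just_read:
            break
        lines = (partial_line + chars_just_read).split(newline)
        partial_line = lines.pop()
        for line in lines:
            yield line + output_line_end
    if partial_line:
        yield partial_line

# Binary stream with null terminators, shaped like "find -print0" output.
names = list(file_line_iter(io.BytesIO(b"./a\0./b c\0"), newline=b"\0", read_size=3))
print(names)  # [b'./a', b'./b c']

# Multi-character separator that straddles read boundaries.
records = list(file_line_iter(io.StringIO("one\r\ntwo\r\nthree"), newline="\r\n", read_size=4))
print(records)  # ['one', 'two', 'three']
```

With a real find, the input could be the stdout pipe of subprocess.Popen(['find', '.', '-print0'], stdout=subprocess.PIPE), read as bytes with newline=b'\0'.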
Douglas Alan wrote:
[snip]
lines = (partialLine + charsJustRead).split(newline)
The above line is prepending a short string to what will typically be a
whole buffer full. There's gotta be a better way to do it. Perhaps you
might like to refer back to CdV's solution which was prepending the
residue to the first element of the split() result.
partialLine = lines.pop()
for line in lines: yield line + outputLineEnd
In the case of leaveNewline being false, you are concatenating an empty
string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".
if partialLine: yield partialLine
|>oug
"John Machin" <sj******@lexicon.net> writes: lines = (partialLine + charsJustRead).split(newline)
The above line is prepending a short string to what will typically be a whole buffer full. There's gotta be a better way to do it.
If there is, I'm all ears. In a previous post I provided code that
doesn't concatenate any strings together until the last possible
moment (i.e. when yielding a value). The problem with that code
was that it was complicated and didn't work right in all cases.
One way of solving the string concatenation issue would be to write a
string find routine that will work on lists of strings while ignoring
the boundaries between list elements. (I.e., it will consider the
list of strings to be one long string for its purposes.) Unless it is
written in C, however, I bet it will typically be much slower than the
code I just provided.
Perhaps you might like to refer back to CdV's solution which was prepending the residue to the first element of the split() result.
The problem with that solution is that it doesn't work in all cases
when the line-separation string is more than one character.
for line in lines: yield line + outputLineEnd
In the case of leaveNewline being false, you are concatenating an empty string. IMHO, to quote Jon Bentley, one should "do nothing gracefully".
In Python,
longString + "" is longString
evaluates to True. I don't know how you can do nothing more
gracefully than that.
|>oug
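The multi-character separator problem can be shown with a small case. This sketch puts the two approaches side by side in Python 3; the function names and the sample data are made up for the comparison, and the buffer size is chosen so that a read boundary lands in the middle of a '\r\n'.

```python
import io

def prepend_to_first_piece(f, line_end, bufsiz):
    """Split each buffer, then glue the residue onto the first piece."""
    retain = ""
    while True:
        instr = f.read(bufsiz)
        if not instr:
            break
        splitstr = instr.split(line_end)
        if retain:
            splitstr[0] = retain + splitstr[0]
        retain = splitstr.pop()
        yield from splitstr
    if retain:
        yield retain

def concatenate_then_split(f, line_end, bufsiz):
    """Glue the residue onto the buffer first, then split."""
    retain = ""
    while True:
        instr = f.read(bufsiz)
        if not instr:
            break
        pieces = (retain + instr).split(line_end)
        retain = pieces.pop()
        yield from pieces
    if retain:
        yield retain

data = "abc\r\ndef\r\n"   # invented sample input
# bufsiz=4 puts the first read boundary between "\r" and "\n", so the
# split-first approach never sees that separator whole.
broken = list(prepend_to_first_piece(io.StringIO(data), "\r\n", 4))
correct = list(concatenate_then_split(io.StringIO(data), "\r\n", 4))
print(broken)   # ['abc\r\ndef']
print(correct)  # ['abc', 'def']
```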
Douglas Alan wrote:
[snip]
In Python,
longString + "" is longString
evaluates to True. I don't know how you can do nothing more gracefully than that.
And also "" + longString is longString
The string + operator provides those graceful *external* results by
ugly special-case testing internally.
It is not graceful IMHO to concatenate a variable which you already
know refers to a null string.
Let's go back to the first point, and indeed further back to the use
cases:
(1) multi-byte separator for lines in text files: never heard of one
apart from '\r\n'; presume this is rare, so test for length of 1 and
use Chris's simplification of my effort in this case.
(2) keep newline: with the standard file reading routines, if one is
going to do anything much with the line other than write it out again,
one does buffer = buffer.rstrip('\n') anyway. In the case of a
non-standard separator, one is likely to want to write the line out
with the standard '\n'. So, specialisation for this is indicated:
!    if keepNewline:
!        for line in lines: yield line + newline
!    else:
!        for line in lines: yield line
Cheers,
John
"John Machin" <sj******@lexicon.net> writes: In Python,
longString + "" is longString
evaluates to True. I don't know how you can do nothing more gracefully than that.
And also "" + longString is longString
The string + operator provides those graceful *external* results by ugly special-case testing internally.
I guess I don't know what you are getting at. If Python performs ugly
special-case testing internally so that I can write more simple,
elegant code, then more power to it! Concentrating ugliness in one,
small, reusable place is a good thing.
It is not graceful IMHO to concatenate a variable which you already know refers to a null string.
It's better than making my code bigger, uglier, and putting in extra
tests for no particularly good reason.
Let's go back to the first point, and indeed further back to the use cases:
(1) multi-byte separator for lines in text files: never heard of one apart from '\r\n'; presume this is rare, so test for length of 1 and use Chris's simplification of my effort in this case.
I want the ability to handle multibyte separators, and so I coded for
it. There are plenty of other uses for an iterator that handles
multi-byte separators. Not all of them would typically be considered
"newline-delimited lines" as opposed to "records delimited by a
separation string", but a rose by any other name....
If one wants to special case for single-byte separators in the name of
efficiency, I provided one back there in the thread that never
degrades to N^2, as the ones you and Chris provided do.
(2) keep newline: with the standard file reading routines, if one is going to do anything much with the line other than write it out again, one does buffer = buffer.rstrip('\n') anyway. In the case of a non-standard separator, one is likely to want to write the line out with the standard '\n'. So, specialisation for this is indicated:
!    if keepNewline:
!        for line in lines: yield line + newline
!    else:
!        for line in lines: yield line
I would certainly never want the iterator to tack on a standard "\n"
as a replacement for whatever newline string the input used. That
seems like completely gratuitous functionality to me. The standard
(but not the only) reason that I want the line terminator left on the
yielded strings is so that you can tell whether or not there is a
line-separator terminating the very last line of the input. Usually I
want the line-terminator discarded, and it kind of annoys me that the
standard line iterator leaves it on.
|>oug
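The point about leaving the terminator on can be illustrated briefly. In this Python 3 sketch, records_with_terminator follows the concatenate-then-split approach from the thread, while last_record_terminated is a hypothetical helper added only to show how a caller could tell the two inputs apart.

```python
import io

def records_with_terminator(stream, newline, read_size=8192):
    """Yield records with the terminator left on, when one was present."""
    partial = ""
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        pieces = (partial + chunk).split(newline)
        partial = pieces.pop()
        for piece in pieces:
            yield piece + newline
    if partial:
        yield partial   # no terminator followed the last record

def last_record_terminated(stream, newline):
    """Hypothetical helper: True if the final record ended with newline."""
    last = None
    for last in records_with_terminator(stream, newline):
        pass
    return last is not None and last.endswith(newline)

print(last_record_terminated(io.StringIO("a\0b\0"), "\0"))  # True
print(last_record_terminated(io.StringIO("a\0b"), "\0"))    # False
```

With the terminator stripped, both inputs would yield the same two records and the difference would be lost.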