Bytes IT Community

efficient 'tail' implementation

hi

I have a file which is very large, e.g. over 200 MB, and I am going to use Python to code a "tail" command to get the last few lines of the file. What is a good algorithm for this type of task in Python for very big files? Initially, I thought of reading everything into an array from the file and just getting the last few elements (lines), but since it's a very big file, I don't think that is efficient.

thanks

Dec 8 '05 #1
12 Replies



s99999999s2...@yahoo.com wrote:
[original question snipped]

I don't think this is a Python-specific issue but a generic problem for all "file as byte stream" systems. The problem is that a "line" is not a property of the file but of its content (some big-iron systems use "records" for lines, which can be addressed in O(1)).

So the simplest approach is to just read and drop lines until the one you want:

for x in f:
    if x_is_what_I_want:
        something

If you really want, you can do the reverse lookup like this:

f.seek(0, 2)   # whence=2 means "relative to end of file"
x = f.tell()

then loop byte by byte backward till you find your stuff. This is quite
cumbersome and may not be faster, depending on your content.
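For what it's worth, the byte-by-byte backward scan described above could be sketched roughly like this (the function name, the file handling, and the trailing-newline assumption are mine, not from the post):

```python
import os

def tail_bytewise(path, nlines):
    """Last nlines of a file, scanning backwards one byte at a time.
    Assumes the file ends with a newline, like most log files."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        found = 0
        # Walk backwards counting newlines; we need nlines + 1 of them
        # (the extra one is the newline just before the tail starts).
        while pos > 0 and found <= nlines:
            pos -= 1
            f.seek(pos)
            if f.read(1) == b'\n':
                found += 1
        if found > nlines:
            pos += 1          # skip the newline we stopped on
        f.seek(pos)
        return f.read().splitlines()
```

As the post says, one seek-plus-read per byte makes this far slower than reading blocks, so it is mainly useful as a baseline.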

Dec 8 '05 #2

s9************@yahoo.com writes:
[original question snipped]


Well, 200MB isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[-tailcount:]   # the *last* tailcount lines
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:

# untested code
input = open(filename)
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
    input.seek(-blocksize, 2)   # whence=2: offset relative to end of file
    tail = input.read().split('\n')
    blocksize *= 2
input.close()
tail = tail[-tailcount:]        # again, the *last* tailcount lines

It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 8 '05 #3


Mike Meyer wrote:
It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

That actually is a pretty good idea: just reverse the buffer and do a
split, so the last line becomes the first line and so on. The logic then
would be no different from reading from the beginning of the file. You just
need to keep the last "half line" of the reversed buffer in case the wanted
one happens to be across a buffer boundary.
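A sketch of that block-at-a-time idea (the names and buffer size are mine, and instead of literally reversing the buffer this keeps the first, possibly partial, element of a normal split and glues it onto the next block):

```python
def tail_blockwise(path, nlines, bufsize=8192):
    """Last nlines of a file, reading fixed-size blocks from the end.
    The partial first line of each block is glued onto the next
    (earlier) block, handling lines that straddle a boundary."""
    with open(path, 'rb') as f:
        f.seek(0, 2)                  # 2 == os.SEEK_END
        pos = f.tell()
        leftover = b''
        lines = []
        while pos > 0 and len(lines) <= nlines:
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos)
            parts = (f.read(size) + leftover).split(b'\n')
            leftover = parts[0]       # possibly partial; keep for next block
            lines = parts[1:] + lines
        if pos == 0:                  # whole file read: leftover is a full line
            lines = [leftover] + lines
        if lines and lines[-1] == b'':
            lines.pop()               # artefact of a trailing newline
        return lines[-nlines:]
```

Only whole blocks are read, so the cost is proportional to the size of the tail rather than the size of the file.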

Dec 8 '05 #4

As long as memory-mapped files are available, the fastest
method is to map the whole file into memory and use the
mapping's rfind method to search for an end of line.

The following code snippet may be useful:
reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
    # Don't map zero length files, windows will barf
    continue   # (snippet comes from a loop over several files)
try:
    mapping = mmap.mmap( reportFile.fileno(), length,
                         mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    # Windows' mmap has a different signature
    mapping = mmap.mmap( reportFile.fileno(), 0, None,
                         mmap.ACCESS_READ )

Then you can use

mapping.rfind( os.linesep )

to find the end of the next-to-last line, and so on.

This is very fast, because nearly all the work is done by rfind, which
is implemented in C, and by the OS's paging logic.
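mmap.rfind was only added in Python 2.6 (which is why a reply below has to work around its absence), but in today's Python the approach looks roughly like this (the helper name and the trailing-newline handling are mine):

```python
import mmap
import os

def tail_mmap(path, nlines):
    """Locate the tail by repeated rfind calls on a memory-mapped file."""
    with open(path, 'rb') as f:
        length = os.fstat(f.fileno()).st_size
        if length == 0:
            return []                 # don't map zero-length files
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            end = length
            if m[end - 1:end] == b'\n':
                end -= 1              # ignore the file's trailing newline
            for _ in range(nlines):
                nl = m.rfind(b'\n', 0, end)
                if nl < 0:            # fewer than nlines lines in the file
                    return m[:length].splitlines()
                end = nl
            return m[end + 1:].splitlines()
```

Each rfind call narrows the search window to everything before the last newline found, so after nlines calls `end` sits on the newline just before the tail.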

HTH,
Gerald

bo****@gmail.com wrote:
[reply #4 snipped]

Dec 8 '05 #5

s9************@yahoo.com writes:
[original question snipped]

First of all, what makes you think that tail on your system isn't
already optimized for this? Devil's advocate here... I have no clue
really.

Anyway, if you must roll your own:

Determine some reasonable maximum line size, multiply this by a number
somewhat larger than the number of lines that you want to tail, seek to
EOF and then back to the position in the file where this chunk would
start, get and split that chunk into lines, and output the last 3 or
however many you need.

If the read fails to hold enough lines, seek back a bit further and do the
same, but be prepared to concat the second and Nth-last chunks together.
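Jerry's recipe, sketched (the function name and the 80-byte guess are mine; the doubling is the "seek back a bit further" retry):

```python
def tail_estimate(path, nlines, guess_line_len=80):
    """Last nlines via an estimated window, doubled until it holds enough."""
    with open(path, 'rb') as f:
        f.seek(0, 2)                      # seek to end of file
        length = f.tell()
        window = max(nlines * guess_line_len, 1)
        while True:
            window = min(window, length)
            f.seek(length - window)
            lines = f.read().splitlines()
            # Need one spare line in case the window starts mid-line,
            # unless the window already covers the whole file.
            if len(lines) > nlines or window == length:
                break
            window *= 2
        return lines[-nlines:]
```

When the guess is good this needs a single seek and read; a bad guess just costs a few extra rounds of doubling.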

Have fun
--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)    WWW ECommerce Consultant
                305 321-1144 (mobile)  http://www.JerrySievers.com/
Dec 8 '05 #6

Gerald Klix <Ge*********@klix.ch> wrote:
As long as memory mapped files are available, the fastest
method is to map the whole file into memory and use the
mappings rfind method to search for an end of line.


Excellent idea.

It'll blow up for large >2GB files on a 32bit OS though.
--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Dec 8 '05 #7

Gerald Klix <Ge*********@klix.ch> wrote:
As long as memory mapped files are available, the fastest
method is to map the whole file into memory and use the
mappings rfind method to search for an end of line.


Actually mmap doesn't appear to have an rfind method :-(

Here is a tested solution using mmap, based on your code. It's
inefficient if the number of lines to be tailed is too big.

import os
import sys
import mmap

def main(nlines, filename):
    reportFile = open( filename )
    length = os.fstat( reportFile.fileno() ).st_size
    if length == 0:
        # Don't map zero length files, windows will barf
        return
    try:
        mapping = mmap.mmap( reportFile.fileno(), length,
                             mmap.MAP_PRIVATE, mmap.PROT_READ )
    except AttributeError:
        mapping = mmap.mmap( reportFile.fileno(), 0, None,
                             mmap.ACCESS_READ )
    search = 1024
    lines = []
    while 1:
        if search > length:
            search = length
        tail = mapping[length-search:]
        lines = tail.split(os.linesep)
        if len(lines) >= nlines or search == length:
            break
        search *= 2
    lines = lines[-nlines-1:]
    print "\n".join(lines)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Syntax: %s n file" % sys.argv[0]
    else:
        main(int(sys.argv[1]), sys.argv[2])

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Dec 8 '05 #8

s9************@yahoo.com wrote:
[original question snipped]


You should look at pyinotify. I assume we're talking linux here.

Dec 8 '05 #9

On Thu, 08 Dec 2005 02:09:58 -0500, Mike Meyer <mw*@mired.org> wrote:
[Mike Meyer's reply snipped]

Ok, I'll have a go (only tested slightly ;-)

>>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
...     f = open(fname, 'rb')
...     f.seek(0, 2)
...     bufpos = f.tell() # pos from file beg == file length
...     buf = ['']
...     for nl in xrange(nitems):
...         while len(buf)<2:
...             chunk = min(chunk, bufpos)
...             bufpos = bufpos-chunk
...             f.seek(bufpos)
...             buf = (f.read(chunk)+buf[0]).split(splitter)
...             if buf== ['']: break
...             if bufpos==0: break
...         if len(buf)>1: yield buf.pop(); continue
...         if bufpos==0: yield buf.pop(); break
...

20 lines from the tail of november's python-dev archive:

>>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))

lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.

Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/pyt...attachment.pgp
Might want to throw away the first item returned by frsplit, unless it is != ''
(indicating a last line with no \n). Splitting with os.linesep is a problematic
default, since e.g. it wouldn't work with the above archive, which has Unix line
endings, and I didn't download it in a manner that would convert them.

Regards,
Bengt Richter
Dec 9 '05 #10

s9************@yahoo.com wrote:
[original question snipped]


To read the last x bytes of a file, you could do:

import os
x = 2000 # or whatever...
f = open('my_big_file')
l = os.fstat(f.fileno()).st_size
f.seek(l-x)
f.read()

Maybe that's a start. I didn't try it on anything bigger than 16MB,
but it was more or less instantaneous for 16 megs.

If you want the last X lines and know that lines are no more than N
chars, f.seek(l - X*N); f.readlines()[-X:] should give you what you
need... (I think... I didn't test it.)
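A minimal version of that one-shot recipe (the function name is mine; X and N as in the post, and note that readlines keeps the trailing newlines):

```python
import os

def tail_oneshot(path, x, n):
    """Last x lines of a file, assuming no line is longer than n bytes."""
    with open(path, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        f.seek(max(size - x * n, 0))  # jump straight to the estimated window
        return f.readlines()[-x:]
```

If some line does exceed n bytes, the window can start too late and lines go missing, which is exactly the caveat the post flags.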
Dec 9 '05 #11

s9************@yahoo.com wrote:
[original question snipped]

If your file is seekable you could consider using the file's seek function.

This would appear to give you a way of working with just the last, say, 1 MB.
Colin W.
Dec 9 '05 #12

Magnus Lycka wrote:
To read the last x bytes of a file, you could do:
>>> import os
>>> x = 2000 # or whatever...
>>> f=open('my_big_file')
>>> l=os.fstat(f.fileno()).st_size
>>> f.seek(l-x)
>>> f.read()
You don't need fstat/st_size, you can ask seek to move to an offset
relative to the end of the file:
x = 2000
f = open('my_big_file')
f.seek(-x, 2)
f.read()


Dec 15 '05 #13
