Bytes | Software Development & Data Engineering Community

efficient 'tail' implementation

hi

I have a file which is very large, e.g. over 200 MB, and I am going to use
Python to code a "tail" command to get the last few lines of the file.
What is a good algorithm for this type of task in Python for very big files?
Initially I thought of reading everything from the file into an array and
just getting the last few elements (lines), but since it's a very big file
I don't think that is efficient.
thanks

Dec 8 '05 #1

s99999999s2...@yahoo.com wrote:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?

I don't think this is a Python-specific issue but a generic problem for
all "file as byte stream" systems. The problem is that a "line" is not a
property of the file but of its content (some big-iron systems use
"records" for lines, which can then be addressed in O(1)).

So the simplest approach is just to read and drop lines until the one you want:

for x in f:
    if x_is_what_I_want: something

If you really want, you can do a reverse lookup like this:

f.seek(0, 2)   # 2 means: seek relative to end of file
x = f.tell()

then loop byte by byte backwards until you find your stuff. This is quite
cumbersome and may not be faster, depending on your content.
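For illustration, a minimal sketch of that backward byte-by-byte scan (my addition, not the poster's code; `tail_bytewise` is an invented name, written for a modern Python):

```python
import os

def tail_bytewise(path, nlines):
    """Return the last nlines lines by scanning backwards one byte at a time."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        pos = end
        seen = 0
        while pos > 0:
            pos -= 1
            f.seek(pos)
            byte = f.read(1)
            # A newline terminating the very last line doesn't start a new one.
            if byte == b'\n' and pos != end - 1:
                seen += 1
                if seen == nlines:
                    pos += 1          # exclude the newline itself
                    break
        f.seek(pos)
        return f.read().decode()
```

As the reply says, this does one seek/read pair per byte, so it is far slower than reading in blocks; it is only the simplest correct version of the idea.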

Dec 8 '05 #2
s9************@yahoo.com writes:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?


Well, 200mb isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[-tailcount:]
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:

# untested code
input = open(filename)
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
    input.seek(-blocksize, 2)   # 2 = seek relative to end of file
    tail = input.read().split('\n')
    blocksize *= 2
input.close()
tail = tail[-tailcount:]

It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 8 '05 #3

Mike Meyer wrote:
> It would probably be more efficient to read blocks backwards and paste
> them together, but I'm not going to get into that.

That actually is a pretty good idea. Just reverse the buffer and do a
split; the last line becomes the first line, and so on. The logic then
would be no different from reading from the beginning of the file. You just
need to keep the last "half line" of the reversed buffer in case the wanted
line happens to straddle a buffer boundary.
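A rough sketch of what that could look like (my addition, not the poster's code; `lines_backwards` is an invented name): each block read from the end is reversed in memory, so a plain split yields lines last-first, and the final fragment is carried over in case a line straddles a block boundary.

```python
import os

def lines_backwards(path, blocksize=8192):
    """Yield the file's lines from last to first."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        carry = b''                   # partial reversed line from the previous block
        while pos > 0:
            size = min(blocksize, pos)
            pos -= size
            f.seek(pos)
            rev = f.read(size)[::-1]  # reverse the block
            parts = (carry + rev).split(b'\n')
            carry = parts.pop()       # may continue in the next (earlier) block
            for part in parts:
                yield part[::-1].decode()   # un-reverse each complete line
        yield carry[::-1].decode()    # first line of the file
```

Like str.split, this yields an empty string first when the file ends with a newline.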

Dec 8 '05 #4
As long as memory-mapped files are available, the fastest
method is to map the whole file into memory and use the
mapping's rfind method to search for an end of line.

The following code snippet may be useful:

reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
    # Don't map zero length files, windows will barf
    continue
try:
    mapping = mmap.mmap( reportFile.fileno(), length,
                         mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    mapping = mmap.mmap( reportFile.fileno(), 0, None,
                         mmap.ACCESS_READ )

Then you can use

mapping.rfind( os.linesep )

to find the end of the last-but-one line, and so on.

This is very fast, because nearly all the work is done by rfind, which
is implemented in C, and by the OS's paging logic.
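(A note in passing: as a later reply points out, mmap objects of that era had no rfind; the method only appeared in Python 2.6. On a current Python the loop could look roughly like this; `tail_mmap` is an invented name for this sketch.)

```python
import mmap
import os

def tail_mmap(path, nlines):
    """Return the last nlines lines using mmap.rfind to walk backwards."""
    with open(path, 'rb') as f:
        if os.fstat(f.fileno()).st_size == 0:
            return ''                 # don't map zero-length files
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        length = len(mapping)
        # Ignore a newline that merely terminates the last line.
        end = length - 1 if mapping[length-1:length] == b'\n' else length
        pos = end
        for _ in range(nlines):
            pos = mapping.rfind(b'\n', 0, pos)
            if pos < 0:               # fewer lines than asked for: whole file
                break
        return mapping[pos+1:].decode()
```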

HTH,
Gerald

bo****@gmail.com schrieb:
> Mike Meyer wrote:
> > It would probably be more efficient to read blocks backwards and paste
> > them together, but I'm not going to get into that.
>
> That actually is a pretty good idea. just reverse the buffer and do a
> split, the last line becomes the first line and so on. [...]

Dec 8 '05 #5
s9************@yahoo.com writes:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?

First of all, what makes you think that tail on your system isn't
already optimized for this? Devil's advocate here... I have no clue
really.

Anyway, if you must roll your own:

Determine some reasonable max line size, multiply it by a bit more than
the number of lines that you want to tail, seek to EOF and then back to
the position in the file where this chunk would start, read and split
that chunk into lines, and output the last 3 or however many you need.

If the read fails to hold enough lines, seek back a bit further and do
the same, but be prepared to concat the last and second-to-last chunks
together.
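That recipe might be sketched like so (my addition, not the poster's code; `tail_seek` and `avg_line_length` are invented names), using the doubling retry described above:

```python
import os

def tail_seek(path, nlines, avg_line_length=80):
    """Return the last nlines lines, seeking back a chunk and doubling on a miss."""
    size = os.path.getsize(path)
    chunk = avg_line_length * (nlines + 1)
    with open(path, 'rb') as f:
        while True:
            chunk = min(chunk, size)
            f.seek(size - chunk)
            lines = f.read(chunk).split(b'\n')
            if lines and lines[-1] == b'':
                lines.pop()           # ignore the newline ending the last line
            # The first element may be a partial line unless we read everything,
            # so require one line more than requested before trusting the tail.
            if len(lines) > nlines or chunk == size:
                break
            chunk *= 2
    return [line.decode() for line in lines[-nlines:]]
```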

Have fun
--
-------------------------------------------------------------------------------
Jerry Sievers 305 854-3001 (home) WWW ECommerce Consultant
305 321-1144 (mobile http://www.JerrySievers.com/
Dec 8 '05 #6
Gerald Klix <Ge*********@klix.ch> wrote:
> As long as memory mapped files are available, the fastest
> method is to map the whole file into memory and use the
> mappings rfind method to search for an end of line.

Excellent idea.

It'll blow up for files larger than 2 GB on a 32-bit OS, though.
--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Dec 8 '05 #7
Gerald Klix <Ge*********@klix.ch> wrote:
> As long as memory mapped files are available, the fastest
> method is to map the whole file into memory and use the
> mappings rfind method to search for an end of line.

Actually mmap doesn't appear to have an rfind method :-(

Here is a tested solution using mmap, based on your code. It is
inefficient if the number of lines to be tailed is too big.

import os
import sys
import mmap

def main(nlines, filename):
    reportFile = open( filename )
    length = os.fstat( reportFile.fileno() ).st_size
    if length == 0:
        # Don't map zero length files, windows will barf
        return
    try:
        mapping = mmap.mmap( reportFile.fileno(), length,
                             mmap.MAP_PRIVATE, mmap.PROT_READ )
    except AttributeError:
        mapping = mmap.mmap( reportFile.fileno(), 0, None,
                             mmap.ACCESS_READ )
    search = 1024
    lines = []
    while 1:
        if search > length:
            search = length
        tail = mapping[length-search:]
        lines = tail.split(os.linesep)
        if len(lines) >= nlines or search == length:
            break
        search *= 2
    lines = lines[-nlines-1:]
    print "\n".join(lines)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Syntax: %s n file" % sys.argv[0]
    else:
        main(int(sys.argv[1]), sys.argv[2])

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Dec 8 '05 #8
s9************@yahoo.com wrote:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?


You should look at pyinotify. I assume we're talking linux here.

Dec 8 '05 #9
On Thu, 08 Dec 2005 02:09:58 -0500, Mike Meyer <mw*@mired.org> wrote:
> s9************@yahoo.com writes:
> > I have a file which is very large eg over 200Mb, and i am going to use
> > python to code a "tail" command to get the last few lines of the file. [...]
>
> Well, 200mb isn't all that big these days. But it's easy to code:
> [...]
> It would probably be more efficient to read blocks backwards and paste
> them together, but I'm not going to get into that.

Ok, I'll have a go (only tested slightly ;-)

def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
    f = open(fname, 'rb')
    f.seek(0, 2)
    bufpos = f.tell()   # pos from file beg == file length
    buf = ['']
    for nl in xrange(nitems):
        while len(buf) < 2:
            chunk = min(chunk, bufpos)
            bufpos = bufpos - chunk
            f.seek(bufpos)
            buf = (f.read(chunk) + buf[0]).split(splitter)
            if buf == ['']: break
            if bufpos == 0: break
        if len(buf) > 1: yield buf.pop(); continue
        if bufpos == 0: yield buf.pop(); break

20 lines from the tail of November's python-dev archive:

print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))

which printed:

lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.

Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).

-Barry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/pyt...attachment.pgp
You might want to throw away the first item returned by frsplit, unless
it is != '' (indicating a last line with no \n). Splitting on os.linesep
is a problematic default, since e.g. it wouldn't work with the above
archive, which has Unix line endings, and I didn't download it in a
manner that would convert them.

Regards,
Bengt Richter
Dec 9 '05 #10
s9************@yahoo.com wrote:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?


To read the last x bytes of a file, you could do:

import os
x = 2000 # or whatever...
f = open('my_big_file')
l = os.fstat(f.fileno()).st_size
f.seek(l-x)
f.read()

Maybe that's a start. I didn't try it on anything bigger than 16 MB,
but it was more or less instantaneous for 16 megs.

If you want the last X lines and know that lines are no longer than N
chars, f.seek(l-X*N); f.readlines()[-X:] should give you what you
need... (I think... I didn't test it.)
Dec 9 '05 #11
s9************@yahoo.com wrote:
> I have a file which is very large eg over 200Mb, and i am going to use
> python to code a "tail" command to get the last few lines of the file.
> What is a good algorithm for this type of task in python for very big files?

If your file is seekable, you could consider using the file's seek method.
This would give you a way of working with just the last, say, 1 MB.

Colin W.
Dec 9 '05 #12
Magnus Lycka wrote:
> To read the last x bytes of a file, you could do:
> >>> import os
> >>> x = 2000 # or whatever...
> >>> f=open('my_big_file')
> >>> l=os.fstat(f.fileno()).st_size
> >>> f.seek(l-x)
> >>> f.read()
You don't need fstat/st_size; you can ask seek to move to an offset
relative to the end of the file:

x = 2000
f = open('my_big_file')
f.seek(-x, 2)   # 2 = seek relative to end of file
f.read()
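One caveat worth adding (my note, not the poster's): seeking to a negative position raises an error when x exceeds the file size, and Python 3 only allows end-relative seeks on files opened in binary mode, so a defensive version clamps first (`last_bytes` is an invented name):

```python
import os

def last_bytes(path, x):
    """Read at most the last x bytes of a file."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:                # binary mode for end-relative seek
        f.seek(-min(x, size), os.SEEK_END)     # clamp so we never seek before byte 0
        return f.read()
```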


Dec 15 '05 #13

This thread has been closed and replies have been disabled. Please start a new discussion.
