
# Seek the one billionth line in a file containing 3 billion lines.

I have a huge log file containing 3,453,299,000 lines of varying lengths, so it is not possible to calculate the absolute byte position of the beginning of the one billionth line. Is there an efficient way to seek to the beginning of that line in Python? This program:

```
for i in range(1000000000):
    f.readline()
```

is extremely slow. Thank you so much for the help. Aug 8 '07 #1
15 Replies


Sullivan WxPyQtKinter wrote:

> I have a huge log file which contains 3,453,299,000 lines with different lengths. [...] This program: for i in range(1000000000): f.readline() is extremely slow.

That will be slow regardless of language. However,

```
import sys
import itertools

n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()
```

should do slightly better than your implementation.

Peter Aug 8 '07 #5
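For readers on current interpreters: Python 3 drops `sys.maxint` and the `.next()` method, but the same `islice` idea carries over directly. A minimal sketch (the helper name `nth_line` is mine, not from the thread):

```python
from itertools import islice

def nth_line(path, n):
    """Return line n (0-based) of the file at `path`.

    islice discards the first n lines lazily, so only one line is
    ever held in memory at a time -- no billion-element list.
    """
    with open(path) as f:
        return next(islice(f, n, None))
```

For the one billionth line this still has to read the whole prefix of the file once; it just avoids the range() list and the Python-level loop body.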

Hi!

Create an "index": a file with 3,453,299,000 tuples (line_number + start_byte), stored as fixed-length records. Building it is slow, OK, but you only do it once. Then, for every read of a specific line:

- a direct-access read on the index
- a seek to the first byte of the desired line

@+ Michel Claveau Aug 8 '07 #7

Jay Loden a écrit : (snip)

> If we just want to iterate through the file one line at a time, why not just:
>
>     count = 0
>     handle = open('hugelogfile.txt')
>     for line in handle.xreadlines():
>         count = count + 1
>         if count == '1000000000':
>             #do something

Or simply:

```
for count, line in enumerate(handle):
    if count == '1000000000':
        #do something
```

NB: file objects now (well... since 2.3) support iteration directly: http://www.python.org/doc/2.3/whatsnew/node17.html Aug 8 '07 #8

On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:

> Create an "index" (a file with 3,453,299,000 tuples: line_number + start_byte); this file has fixed-length records. Slow, OK, but only once.

Why store the line number at all? The first start offset is for the first line, the second start offset for the second line, and so on.

Ciao, Marc 'BlackJack' Rintsch Aug 8 '07 #9
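A minimal sketch of the index scheme with Marc's simplification (offsets only, so record k implicitly belongs to line k). The function names and the struct layout are mine, not from the thread:

```python
import struct

RECORD = struct.Struct("<Q")  # one unsigned 64-bit start offset per line

def build_index(log_path, index_path):
    """One slow sequential pass: record the byte offset at which every
    line of the log starts. Each index record is exactly RECORD.size
    bytes, so line k's offset lives at byte k * RECORD.size."""
    with open(log_path, "rb") as log, open(index_path, "wb") as idx:
        offset = 0
        for line in log:
            idx.write(RECORD.pack(offset))
            offset += len(line)

def read_line(log_path, index_path, n):
    """Fetch line n (0-based) with two seeks: one into the index to
    get the offset, one into the log to get the line."""
    with open(index_path, "rb") as idx:
        idx.seek(n * RECORD.size)
        (start,) = RECORD.unpack(idx.read(RECORD.size))
    with open(log_path, "rb") as log:
        log.seek(start)
        return log.readline()
```

At 8 bytes per record, the index for 3,453,299,000 lines weighs roughly 27 GB, but after the one slow build pass every lookup costs just two seeks.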

Sullivan WxPyQtKinter wrote:

> Thank you for pointing out these two problems. I wrote this program just to show how inefficient it is to use a seemingly NATIVE way to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an iterator. Python file objects are iterable, reading each line only as needed and not creating a companion list.

```
logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)
```

This at least avoids the 'range' issue. To know when we've reached a particular line, use 'enumerate' to number each item as it comes out of the iterator.

```
logfile = open("foo.log", 'r')
target_line_num = 10**9
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break
```

As for reading each line: that's unavoidable if you want a specific line from a stream of variable-length lines.

--
"I have never made but one prayer to God, a very short one: 'O Lord, make my enemies ridiculous!' And God granted it." -- Voltaire

Ben Finney Aug 8 '07 #10


Peter Otten wrote:

>     n = 10**9 - 1
>     assert n < sys.maxint
>     f = open(filename)
>     wanted_line = itertools.islice(f, n, None).next()
>
> should do slightly better than your implementation.

It will do vastly better, at least in memory usage terms, because there is no memory-eating range call.

Regards, Björn

-- BOFH excuse #31: cellular telephone interference Aug 8 '07 #12

Ant a écrit :

> On Aug 8, 11:10 am, Bruno Desthuilliers wrote:
>>     for count, line in enumerate(handle):
>>         if count == '1000000000':
>>             #do something
>
> You'd get better results if the test were: if count == 1000000000. Or probably even: if count == 999999999, since the 1 billionth line will have index 999999999.

Doh :( Thanks for the correction. Aug 8 '07 #13

On 8/8/07, Ben Finney wrote:

>>> There are two problems:
>>> 1) range(1000000000) builds a list of a billion elements in memory [...]
>>> 2) f.readline() reads an entire line of input [...]
>>
>> Thank you for pointing out these two problems. I wrote this program just to show how inefficient it is to use a seemingly NATIVE way to seek in such a big file. No other intention........
>
> The native way isn't iterating over 'range(hugenum)', it's to use an iterator. [...] As for reading each line: that's unavoidable if you want a specific line from a stream of variable-length lines.

The minimum bound for a line is at least one byte (the newline) and maybe more, depending on your data. You can seek() forward the minimum number of bytes that (1 billion - 1) lines will consume and save yourself some wasted IO. Aug 8 '07 #14
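Whether a blind seek is safe depends on knowing exactly how many newlines the skipped bytes contain. A speedup that always applies is to scan the stream in large binary chunks and count newline bytes with bytes.count, avoiding the per-call overhead of a billion readline() calls. A sketch along those lines (the helper name and chunk size are mine, not from the thread):

```python
def seek_to_line(f, n, chunk_size=1 << 20):
    """Position binary file object f at the start of line n (0-based)
    by counting newline bytes in large chunks instead of parsing
    individual lines."""
    f.seek(0)
    remaining = n  # newlines still to skip
    while remaining:
        chunk = f.read(chunk_size)
        if not chunk:
            raise EOFError("file has fewer than %d lines" % (n + 1))
        count = chunk.count(b"\n")
        if count < remaining:
            remaining -= count
            continue
        # The newline we need is inside this chunk: locate it, then
        # rewind the file position to just past it.
        pos = -1
        for _ in range(remaining):
            pos = chunk.index(b"\n", pos + 1)
        f.seek(pos + 1 - len(chunk), 1)  # seek relative to current position
        return
```

With a 1 MiB chunk the Python-level loop runs only about a thousand times per gigabyte; the actual newline counting happens in C inside bytes.count.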


### This discussion thread is closed

Replies have been disabled for this discussion.