
Seek the one billionth line in a file containing 3 billion lines.

I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.

Aug 8 '07 #1
15 Replies


Sullivan WxPyQtKinter <su***********@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
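
For illustration, a rough, untested sketch of that fixed-size-unit approach (the file name and chunk size here are made up): read the file in chunks, count newlines, and stop once the 999,999,999th newline has gone by, since the billionth line starts just after it.

target = 10**9
count = 0                          # newlines seen so far
f = open('huge.log', 'rb')
start_of_line = None
while True:
    pos = f.tell()
    chunk = f.read(1024 * 1024)    # 1 MiB at a time
    if not chunk:
        break                      # fewer lines in the file than expected
    n = chunk.count('\n')
    if count + n >= target - 1:
        # the newline that ends line (target - 1) is inside this chunk
        offset = 0
        while count < target - 1:
            offset = chunk.index('\n', offset) + 1
            count += 1
        start_of_line = pos + offset
        break
    count += n
if start_of_line is not None:
    f.seek(start_of_line)
    billionth_line = f.readline()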
Aug 8 '07 #2

On 8/7/07, Sullivan WxPyQtKinter <su***********@gmail.com> wrote:
I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.
There is no fast way to do this, unless the lines are of fixed length
(in which case you can use f.seek() to move to the correct spot). The
reason is that there is no way to find the position of the billionth
line without scanning the whole file. You should split your logs into
smaller files in the future.
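
For what it's worth, the fixed-length case would look roughly like this (record length and file name are invented for illustration):

RECORD_LEN = 80                  # hypothetical: every line is exactly 80 bytes,
                                 # trailing newline included
n = 10**9                        # we want the billionth line (1-based)
f = open('huge.log', 'rb')
f.seek((n - 1) * RECORD_LEN)     # jump straight to the start of line n
line = f.readline()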

You might be able to do this a very tiny bit faster by using the split
utility and having it split the log file into smaller chunks (split can
split by line counts), but since that still has to scan the file it
will be I/O bound.

--
Evan Klitzke <ev**@yelp.com>
Aug 8 '07 #3

On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

Aug 8 '07 #4

Sullivan WxPyQtKinter wrote:
I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.
That will be slow regardless of language. However

import sys
import itertools

n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.

Peter
Aug 8 '07 #5

Paul Rubin wrote:
Sullivan WxPyQtKinter <su***********@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something
My first suggestion would be to split the file into smaller, more manageable
chunks, because any type of manipulation of a multi-billion line log file is
going to be a nightmare. For example, you could try the UNIX 'split' utility to
break the file into individual files of, say, 100,000 lines each. split is likely
to be faster than anything in Python, since it is written in C with no
interpreter overhead etc.

Is there a reason you specifically need to get to line 1 billion, or are you
just trying to trim the file down? Do you need a value that's on that particular
line, or is there some other reason? Perhaps if you can provide the use case the
list can help you solve the problem itself rather than looking for a way to seek
to the one billionth line in a file.

-Jay
Aug 8 '07 #6

Hi!

Create a "index" (a file with 3,453,299,000 tuples :
line_number + start_byte) ; this file has fix-length lines.
slow, OK, but once.

Then, for every consult/read a specific line:
- direct acces read on index
- seek at the fisrt byte of the line desired
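
A rough, untested sketch of that indexing pass (file names invented); because every index record has the same width, the record for line n can later be read directly at byte (n - 1) * 29:

log = open('huge.log', 'rb')
idx = open('huge.log.idx', 'wb')
offset = 0
line_no = 1
for line in log:
    # fixed-width 29-byte record: line_number, a space, start_byte, newline
    idx.write('%012d %015d\n' % (line_no, offset))
    offset += len(line)
    line_no += 1
idx.close()
log.close()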

@+

Michel Claveau
Aug 8 '07 #7

Jay Loden wrote:
(snip)
If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something
NB: files now (well... since 2.3) handle iteration directly
http://www.python.org/doc/2.3/whatsnew/node17.html
Aug 8 '07 #8

On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:
Create an "index" (a file with 3,453,299,000 tuples:
line_number + start_byte); this index file has fixed-length lines.
Slow, OK, but done only once.

Why store the line number? The first start offset is for the first
line, the second start offset is for the second line, and so on.

Ciao,
Marc 'BlackJack' Rintsch
Aug 8 '07 #9

Sullivan WxPyQtKinter <su***********@gmail.com> writes:
On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory
[...]

2) f.readline() reads an entire line of input
[...]
Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, only reading each line as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.

--
\ "I have never made but one prayer to God, a very short one: 'O |
`\ Lord, make my enemies ridiculous!' And God granted it." -- |
_o__) Voltaire |
Ben Finney
Aug 8 '07 #10

Ant
On Aug 8, 11:10 am, Bruno Desthuilliers
<bruno.42.desthuilli...@wtf.websiteburo.oops.com> wrote:
Jay Loden wrote:
(snip)
If we just want to iterate through the file one line at a time, why not just:
count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something
You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.

Cheers,

--
Ant...

http://antroy.blogspot.com/


Aug 8 '07 #11

Peter Otten wrote:
n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.
It will do vastly better, at least in memory usage terms, because
there is no memory-eating range call.

Regards,
Björn

--
BOFH excuse #31:

cellular telephone interference

Aug 8 '07 #12

Ant wrote:
On Aug 8, 11:10 am, Bruno Desthuilliers
<bruno.42.desthuilli...@wtf.websiteburo.oops.com> wrote:
> Jay Loden wrote:
(snip)
>> If we just want to iterate through the file one line at a time, why not just:
count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something

You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.
Doh :(

Thanks for the correction.
Aug 8 '07 #13

On 8/8/07, Ben Finney <bi****************@benfinney.id.au> wrote:
Sullivan WxPyQtKinter <su***********@gmail.com> writes:
On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
>
There are two problems:
>
1) range(1000000000) builds a list of a billion elements in memory
[...]
>
2) f.readline() reads an entire line of input
[...]

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, only reading each line as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.
The minimum bound for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
number of bytes that (1 billion - 1) lines will consume and save
yourself some wasted IO.
Aug 8 '07 #14


"Marc 'BlackJack' Rintsch" <bj****@gmx.netwrote in message
news:5h*************@mid.uni-berlin.de...
| On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI \(MVP\) wrote:
|
| Create a "index" (a file with 3,453,299,000 tuples :
| line_number + start_byte) ; this file has fix-length lines.
| slow, OK, but once.
|
| Why storing the line number? The first start offset is for the first
| line, the second start offset for the second line and so on.

Somewhat ironically, given that the OP's problem stems from variable line
lengths, this requires that the offsets by fixed length. On a true 64-bit
OS (not Win64, apparently) with 64-bit ints that would work great.
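
For instance, a rough, untested sketch along those lines (file and function names invented): store only the byte offsets, each packed as an unsigned 64-bit integer, so the offset of line n sits at byte 8 * (n - 1) of the index and a lookup costs two seeks instead of a scan.

import struct

def build_index(log_path, idx_path):
    # one slow pass over the log, writing an 8-byte offset per line (done once)
    log = open(log_path, 'rb')
    idx = open(idx_path, 'wb')
    offset = 0
    for line in log:
        idx.write(struct.pack('>Q', offset))
        offset += len(line)
    idx.close()
    log.close()

def get_line(log_path, idx_path, n):
    # return line number n (1-based) by way of the index
    idx = open(idx_path, 'rb')
    idx.seek(8 * (n - 1))
    offset = struct.unpack('>Q', idx.read(8))[0]
    idx.close()
    log = open(log_path, 'rb')
    log.seek(offset)
    line = log.readline()
    log.close()
    return line

# e.g. get_line('huge.log', 'huge.log.idx', 10**9)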

Aug 8 '07 #15

"Chris Mellon" <ar*****@gmail.comwrites:
[...]
The minimum bounds for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
amount of bytes that (1 billion -1) lines will consume and save
yourself some wasted IO.
But how do you know which line number you're on, then?
John
Aug 12 '07 #16
