
Seek the one billionth line in a file containing 3 billion lines.

I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.

Aug 8 '07 #1
15 Replies


Sullivan WxPyQtKinter <su***********@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
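
For illustration, a rough, untested sketch of that fixed-size-unit approach (the file name and chunk size here are made up): read the file in chunks, count newlines, and stop once the 999,999,999th newline has gone by, since the billionth line starts just after it.

target = 10**9
count = 0                          # newlines seen so far
f = open('huge.log', 'rb')
start_of_line = None
while True:
    pos = f.tell()
    chunk = f.read(1024 * 1024)    # 1 MiB at a time
    if not chunk:
        break                      # fewer lines in the file than expected
    n = chunk.count('\n')
    if count + n >= target - 1:
        # the newline that ends line (target - 1) is inside this chunk
        offset = 0
        while count < target - 1:
            offset = chunk.index('\n', offset) + 1
            count += 1
        start_of_line = pos + offset
        break
    count += n
if start_of_line is not None:
    f.seek(start_of_line)
    billionth_line = f.readline()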
Aug 8 '07 #2

On 8/7/07, Sullivan WxPyQtKinter <su***********@gmail.com> wrote:
I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.
There is no fast way to do this, unless the lines are of fixed length
(in which case you can use f.seek() to move to the correct spot). The
reason is that there is no way to find the position of the billionth
line without scanning the whole file. You should split your logs into
smaller files in the future.
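
For what it's worth, the fixed-length case would look roughly like this (record length and file name are invented for illustration):

RECORD_LEN = 80                  # hypothetical: every line is exactly 80 bytes,
                                 # trailing newline included
n = 10**9                        # we want the billionth line (1-based)
f = open('huge.log', 'rb')
f.seek((n - 1) * RECORD_LEN)     # jump straight to the start of line n
line = f.readline()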

You might be able to do this a very tiny bit faster by using the split
utility and having it split the log file into smaller chunks (split can
split by line counts), but since that still has to scan the file it
will be I/O bound.

--
Evan Klitzke <ev**@yelp.com>
Aug 8 '07 #3

On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

Aug 8 '07 #4

Sullivan WxPyQtKinter wrote:
I have a huge log file which contains 3,453,299,000 lines with
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

Thank you so much for help.
That will be slow regardless of language. However

import sys
import itertools

n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.

Peter
Aug 8 '07 #5

Paul Rubin wrote:
Sullivan WxPyQtKinter <su***********@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something
My first suggestion would be to split the file into smaller, more manageable
chunks, because any type of manipulation of a multi-billion line log file is
going to be a nightmare. For example, you could try the UNIX 'split' utility to
break the file into individual files of, say, 100,000 lines each. split is likely
to be faster than anything in Python, since it is written in C with no
interpreter overhead etc.

Is there a reason you specifically need to get to line 1 billion, or are you
just trying to trim the file down? Do you need a value that's on that particular
line, or is there some other reason? Perhaps if you can provide the use case the
list can help you solve the problem itself rather than looking for a way to seek
to the one billionth line in a file.

-Jay
Aug 8 '07 #6

Hi!

Create a "index" (a file with 3,453,299,000 tuples :
line_number + start_byte) ; this file has fix-length lines.
slow, OK, but once.

Then, for every consult/read a specific line:
- direct acces read on index
- seek at the fisrt byte of the line desired
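
A rough, untested sketch of that indexing pass (file names invented); because every index record has the same width, the record for line n can later be read directly at byte (n - 1) * 29:

log = open('huge.log', 'rb')
idx = open('huge.log.idx', 'wb')
offset = 0
line_no = 1
for line in log:
    # fixed-width 29-byte record: line_number, a space, start_byte, newline
    idx.write('%012d %015d\n' % (line_no, offset))
    offset += len(line)
    line_no += 1
idx.close()
log.close()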

@+

Michel Claveau
Aug 8 '07 #7

Jay Loden wrote:
(snip)
If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something
NB: files now (well... since 2.3) handle iteration directly
http://www.python.org/doc/2.3/whatsnew/node17.html
Aug 8 '07 #8

On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:
Create an "index" (a file with 3,453,299,000 tuples:
line_number + start_byte); this index file has fixed-length lines.
Slow, OK, but done only once.

Why store the line number? The first start offset is for the first
line, the second start offset is for the second line, and so on.

Ciao,
Marc 'BlackJack' Rintsch
Aug 8 '07 #9

Sullivan WxPyQtKinter <su***********@gmail.com> writes:
On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory
[...]

2) f.readline() reads an entire line of input
[...]
Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, only reading each line as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.

--
\ "I have never made but one prayer to God, a very short one: 'O |
`\ Lord, make my enemies ridiculous!' And God granted it." -- |
_o__) Voltaire |
Ben Finney
Aug 8 '07 #10

Ant
On Aug 8, 11:10 am, Bruno Desthuilliers
<bruno.42.desthuilli...@wtf.websiteburo.oops.com> wrote:
Jay Loden wrote:
(snip)
If we just want to iterate through the file one line at a time, why not just:
count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something
You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.

Cheers,

--
Ant...

http://antroy.blogspot.com/


Aug 8 '07 #11

Peter Otten wrote:
n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.
It will do vastly better, at least in memory usage terms, because
there is no memory-eating range call.

Regards,
Björn

--
BOFH excuse #31:

cellular telephone interference

Aug 8 '07 #12

Ant wrote:
On Aug 8, 11:10 am, Bruno Desthuilliers
<bruno.42.desthuilli...@wtf.websiteburo.oops.com> wrote:
> Jay Loden wrote:
(snip)
>> If we just want to iterate through the file one line at a time, why not just:
count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something

You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.
Doh :(

Thanks for the correction.
Aug 8 '07 #13

On 8/8/07, Ben Finney <bi****************@benfinney.id.au> wrote:
Sullivan WxPyQtKinter <su***********@gmail.com> writes:
On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:
This program:
for i in range(1000000000):
    f.readline()
is absolutely very slow....
>
There are two problems:
>
1) range(1000000000) builds a list of a billion elements in memory
[...]
>
2) f.readline() reads an entire line of input
[...]

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, only reading each line as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.
The minimum bound for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
number of bytes that (1 billion - 1) lines will consume and save
yourself some wasted IO.
Aug 8 '07 #14


"Marc 'BlackJack' Rintsch" <bj****@gmx.netwrote in message
news:5h*************@mid.uni-berlin.de...
| On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI \(MVP\) wrote:
|
| Create a "index" (a file with 3,453,299,000 tuples :
| line_number + start_byte) ; this file has fix-length lines.
| slow, OK, but once.
|
| Why storing the line number? The first start offset is for the first
| line, the second start offset for the second line and so on.

Somewhat ironically, given that the OP's problem stems from variable line
lengths, this requires that the offsets by fixed length. On a true 64-bit
OS (not Win64, apparently) with 64-bit ints that would work great.
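
For instance, a rough, untested sketch along those lines (file and function names invented): store only the byte offsets, each packed as an unsigned 64-bit integer, so the offset of line n sits at byte 8 * (n - 1) of the index and a lookup costs two seeks instead of a scan.

import struct

def build_index(log_path, idx_path):
    # one slow pass over the log, writing an 8-byte offset per line (done once)
    log = open(log_path, 'rb')
    idx = open(idx_path, 'wb')
    offset = 0
    for line in log:
        idx.write(struct.pack('>Q', offset))
        offset += len(line)
    idx.close()
    log.close()

def get_line(log_path, idx_path, n):
    # return line number n (1-based) by way of the index
    idx = open(idx_path, 'rb')
    idx.seek(8 * (n - 1))
    offset = struct.unpack('>Q', idx.read(8))[0]
    idx.close()
    log = open(log_path, 'rb')
    log.seek(offset)
    line = log.readline()
    log.close()
    return line

# e.g. get_line('huge.log', 'huge.log.idx', 10**9)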

Aug 8 '07 #15

"Chris Mellon" <ar*****@gmail.comwrites:
[...]
The minimum bounds for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
amount of bytes that (1 billion -1) lines will consume and save
yourself some wasted IO.
But how do you know which line number you're on, then?
John
Aug 12 '07 #16
