
building an index for large text files for fast access

Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi

Jul 25 '06 #1
6 Replies


Yi Xing wrote:
I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.
Hey Yi,

The standard library module 'libcache' does exactly what you're
considering implementing.

-alex23

Jul 25 '06 #2

Yi Xing:
Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python?
It depends on the size of the files and the amount of memory and
disk you may use. A first suggestion would be an in-memory array.array
of 64-bit integers (each built from two 'I' entries), with each 64-bit
integer pointing to the start of a set of n lines. Then, to find a
particular line number p, you seek to the offset stored at a[p/n] and
read over p%n lines. The factor n is adjusted so your data fits into
memory. If this uses too much memory, or scanning the file to build the
index each time takes too long, you can use an index file with the same
layout instead.
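
A minimal sketch of that scheme (the function names are just
illustrative, and array.array('L') is used for simplicity, so offsets
are limited to 32 bits on most platforms):

import array

def build_block_index(path, n):
    # One offset per block of n lines: offsets[k] is the byte
    # position at which line k*n starts.
    offsets = array.array('L')
    pos = 0
    f = open(path, 'rb')
    try:
        for i, line in enumerate(f):
            if i % n == 0:
                offsets.append(pos)
            pos += len(line)
    finally:
        f.close()
    return offsets

def get_line(path, offsets, p, n):
    # Seek to the start of line p's block, then skip p % n lines.
    f = open(path, 'rb')
    try:
        f.seek(offsets[p // n])
        for _ in range(p % n):
            f.readline()
        return f.readline()
    finally:
        f.close()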

Neil
Jul 25 '06 #3

alex23 wrote:
The standard library module 'libcache' does exactly what you're
considering implementing.
I believe the module you're referring to is `linecache`.
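
For reference, its basic use is just:

import linecache
line = linecache.getline('/path/to/file', 123)  # line numbers are 1-based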

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
It is from numberless diverse acts of courage and belief that human
history is shaped. -- John F. Kennedy
Jul 25 '06 #4


Erik Max Francis wrote:
alex23 wrote:
The standard library module 'libcache' does exactly what you're
considering implementing.

I believe the module you're referring to is `linecache`.
and whatever its name is, it's not a goer: it reads the whole of each
file into memory. It was designed for stack traces. See docs. See
source code. See discussion when somebody with a name within a typo or
0 of the OP's asked an almost identical question very recently.

Here's a thought:

sqlite3 database, one table (line_number, file_offset, line_length),
make line_number the primary key, Bob's yer uncle.

Oh yeah, that battery's not included yet -- you'll need to download the
pysqlite2 module, and mutter strange oaths:
import sys
PY_VERSION = sys.version_info[:2]
if PY_VERSION >= (2, 5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3

It would be a very good idea for the OP to give us a clue as to the
(min/max/average) number of (bytes per line, lines per file, bytes per
file) *and* the desired response time on what hardware & OS ... *and*
how long it takes to do this:

for line in file_handle:
    pass

Alternatively, if you have the disk space, you could have just two
columns in the DB table: (line_number, line).
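
A rough sketch of the three-column variant (the table and helper names
here are just for illustration; the import fallback above applies):

import sqlite3  # or: from pysqlite2 import dbapi2 as sqlite3, as above

def build_index(data_path, db_path):
    # One row per line: (line_number, file_offset, line_length).
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE lines ('
                 'line_number INTEGER PRIMARY KEY, '
                 'file_offset INTEGER, line_length INTEGER)')
    f = open(data_path, 'rb')
    try:
        offset = 0
        for n, line in enumerate(f):
            conn.execute('INSERT INTO lines VALUES (?, ?, ?)',
                         (n, offset, len(line)))
            offset += len(line)
    finally:
        f.close()
    conn.commit()
    return conn

def fetch_line(conn, data_path, n):
    # Look up the offset and length, then read exactly that slice.
    row = conn.execute('SELECT file_offset, line_length FROM lines '
                       'WHERE line_number = ?', (n,)).fetchone()
    f = open(data_path, 'rb')
    try:
        f.seek(row[0])
        return f.read(row[1])
    finally:
        f.close()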

Cheers,
John

Jul 25 '06 #5

Erik Max Francis wrote:
I believe the module you're referring to is `linecache`.
Thanks, Erik. That's what I get for posting on my way out of work :)

-alex23

Jul 26 '06 #6

Yi Xing wrote:
Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi
I had to do this for some large log files. I wrote one simple script
to generate the index file and another that uses the index file to read
lines from the log file. Here are the two scripts, slightly cleaned up
for clarity. (Note that they'll only work with files less than
4,294,967,296 bytes long. If your files are larger than that,
substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py:

#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0

    # Main processing.
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))

        total += len(line)

        # Status output.
        if not n % 1000:
            print >>sys.stderr, '%i lines processed' % n

    print >>sys.stderr, '%i lines processed' % (n + 1)

if __name__ == '__main__':
    main()

You use it (on Linux) like so:

cat large_file | ./genoffsets.py > index.dat

And here's the getline.py script:

#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)

def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    f = open(fn)

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]

def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data

if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    print getline(n, dfn, ifn)
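
Pulling out a single line then looks something like this (using the
file names from the genoffsets example above):

./getline.py large_file index.dat 12345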

Hope this helps,
~Simon

Jul 26 '06 #7
