473,809 Members | 2,775 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

building an index for large text files for fast access

Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi

Jul 25 '06 #1
6 6616
Yi Xing wrote:
I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.
Hey Yi,

The standard library module 'libcache' does exactly what you're
considering implementing.

-alex23

Jul 25 '06 #2
Yi Xing:
Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python?
It depends on the size of the files and the amount of memory and
disk you may use. First suggestion would be an in-memory array.array of
64 bit integers made from 2 'I' entries with each 64 bit integer
pointing to the start of a set of n lines. Then to find a particular
line number p you seek to a[p/n] and then read over p%n lines. The
factor 'n' is adjusted to fit your data into memory. If this uses too
much memory or scanning the file to build the index each time uses too
much time then you can use an index file with the same layout instead.

Neil
Jul 25 '06 #3
alex23 wrote:
The standard library module 'libcache' does exactly what you're
considering implementing.
I believe the module you're referring to is `linecache`.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
It is from numberless diverse acts of courage and belief that human
history is shaped. -- John F. Kennedy
Jul 25 '06 #4

Erik Max Francis wrote:
alex23 wrote:
The standard library module 'libcache' does exactly what you're
considering implementing.

I believe the module you're referring to is `linecache`.
and whatever its name is, it's not a goer: it reads the whole of each
file into memory. It was designed for stack traces. See docs. See
source code. See discussion when somebody with a name within a typo or
0 of the OP's asked an almost identical question very recently.

Here's a thought:

sqlite3 database, one table (line_number, file_offset, line_length),
make line_number the primary key, Bob's yer uncle.

Oh yeah, that battery's not included yet -- you'll need to download the
pysqlite2 module, and mutter strange oaths:
import sys
PY_VERSION = sys.version_inf o[:2]
if PY_VERSION >= (2, 5):
import sqlite3
else:
from pysqlite2 import dbapi2 as sqlite3

It would be a very good idea for the OP to give us a clue as to
(min/max/average) number of (bytes per line, lines per file, bytes per
file) *and* the desired response time on what hardware & OS ... *and*
how long if takes to do this:
for line in file_handle:
pass

Alternatively, if you have the disk space, you could have just two
columns in the DB table: (line_number, line).

Cheers,
John

Jul 25 '06 #5
Erik Max Francis wrote:
I believe the module you're referring to is `linecache`.
Thanks, Erik. That's what I get for posting on my way out of work :)

-alex23

Jul 26 '06 #6
Yi Xing wrote:
Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi
I had to do this for some large log files. I wrote one simple script
to generate the index file and another that used the index file to read
lines from the log file. Here are (slightly cleaned up for clarity)
the two scripts. (Note that they'll only work with files less than
4,294,967,296 bytes long.. If your files are larger than that
substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py
#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L' , n)

def main():
total = 0

# Main processing..
for n, line in enumerate(filei nput.input()):

sys.stdout.writ e(f(total))

total += len(line)

# Status output.
if not n % 1000:
print >sys.stderr, '%i lines processed' % n

print >sys.stderr, '%i lines processed' % (n + 1)
if __name__ == '__main__':
main()
You use it (on linux) like so:
cat large_file | ./genoffsets.py index.dat

And here's the getline.py script:
#!/usr/bin/env python
'''
Usage: "getline.py <datafile<index file<num>"

Prints line num from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize (fmt)
def F(n, fn):
'''
Return the byte offset of line n from index file fn.
'''
f = open(fn)

try:
f.seek(n * fmt_size)
data = f.read(fmt_size )
finally:
f.close()

return struct.unpack(f mt, data)[0]
def getline(n, data_file, index_file):
'''
Return line n from data file using index file.
'''
n = F(n, index_file)
f = open(data_file)

try:
f.seek(n)
data = f.readline()
finally:
f.close()

return data
if __name__ == '__main__':
dfn, ifn, lineno = sys.argv[-3:]
n = int(lineno)
print getline(n, dfn, ifn)

Hope this helps,
~Simon

Jul 26 '06 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
6168
by: shailesh kumar | last post by:
Hi, I need to design data interfaces for accessing files of very large sizes efficiently. The data will be accessed in chunks of fixed size ... My data interface should be able to do a random seek in the file, as well as sequential access block by block.... One aspect of the usage of this interface is that there is quite good chance of accessing same blocks again and again by the application..
9
3182
by: WalterR | last post by:
This is my first time here, so there may be earlier relevant threads of which I am unaware. Though my experience with DB2 is not extensive, such as it is was under OS/390 or equ. My main experience is IMS DB, which leads to my question. In IMS, there is an HDAM access method which can find a record without using an index as such. At initial database load, it first formats the entire space allocation into blocks of the given size. ...
6
2789
by: Dennis | last post by:
I was trying to determine the fastest way to build a byte array from components where the size of the individual components varied depending on the user's input. I tried three classes I built: (1) using redim arrays to add to a normal byte array (2) using an ArrayList and finally (3) using a memorystream. These three classes are listed below the test sub called "TestBuildByteArray". It was interesting that using the memorystream was...
0
793
by: David Helgason | last post by:
I think those best practices threads are a treat to follow (might even consider archiving some of them in a sort of best-practices faq), so here's one more. In coding an game asset server I want to keep a large number of file revisions of varying sizes (1Kb-50Mb) inside the database. Naturally I want to avoid having to allocate whole buffers of 50Mb too often.
5
17330
by: Bas Scheffers | last post by:
Hi, I have a table with about 100K rows, on which I have created a btree index of the type table_name(int, int, int, timestamp). At first postgres was using it for my AND query on all four columns, but after dropping it and creating different ones and testing, it suddenly stopped using it. Vaccuuming, reindexing, recreating the table and even recreating the database all didn't help.
10
10047
by: setar | last post by:
How can I edit an xml file which has 250MB? I tried to use UltraEdit, Visual Studio, Eclipse, Stylus Studio and XMLSpy editors but these programs can't read this file because it is too big. SmEdit reads only the first MB of the file and doesn't support UTF-8 (I need program which supports it). Now I use XVI32 which is hexadecimal editor, but it can be useful only is editing small number of characters - deleting and inserting characters to...
5
2655
by: vd12005 | last post by:
Hello, While playing to write an inverted index (see: http://en.wikipedia.org/wiki/Inverted_index), i run out of memory with a classic dict, (i have thousand of documents and millions of terms, stemming or other filtering are not considered, i wanted to understand how to handle GB of text first). I found ZODB and try to use it a bit, but i think i must be misunderstanding how to use it even after reading...
1
1770
by: Robert Neville | last post by:
Basically, I want to create a table in html, xml, or xslt; with any number of regular expressions; a script (Perl or Python) which reads each table row (regex and replacement); and performs the replacement on any file name, folder, or text file (e.g. css, php, html). For example, I often rename my mp3 (files); the folder holding the mp3 files; and replace these renamed values in a playlist/m3u/xml file. The table should hold clean...
2
1648
by: lawpoop | last post by:
Hello folks - I'm working on a function to map a php project that's gotten a little unwieldly. I'd like to have a reference of all files like the output of the unix tree command, like this: .. |-- index.php | |-- menu.php | |-- content.php | |-- admin_page.php
0
9721
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10637
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10376
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10379
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10115
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9199
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7660
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5687
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
3014
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.