What strategy for random accession of records in massive FASTA file?

Chris Lasher

Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGG
...

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris

Jul 18 '05 #1

Fredrik Lundh

Chris Lasher wrote:

> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. :-) If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)

keeping an index in memory might be reasonable. the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk. if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

    def __init__(self, file):
        self.file = open(file)
        self.checkindex()

    def __getitem__(self, key):
        try:
            pos = self.index[key]
        except KeyError:
            raise IndexError("no such item")
        else:
            f = self.file
            f.seek(pos)
            header = f.readline()
            # sanity check: the offset must point at this record's header
            assert header[1:].strip() == key
            data = []
            while 1:
                line = f.readline()
                if not line or line[0] == ">":
                    break
                data.append(line)
            return data

    def checkindex(self):
        indexfile = self.file.name + ".index"

        try:
            self.index = marshal.load(open(indexfile, "rb"))
        except IOError:
            print "building index..."

            index = {}

            # scan the file
            f = self.file
            f.seek(0)
            while 1:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line[0] == ">":
                    # save offset to header line
                    header = line[1:].strip()
                    index[header] = pos

            # save index to disk
            f = open(indexfile, "wb")
            marshal.dump(index, f)
            f.close()

            self.index = index

db = FASTA("myfastafile.dat")
print db["CW127_A02"]

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

</F>
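
For anyone trying the offset-index idea on Python 3, here is a minimal sketch of the same approach. It is not Fredrik's code: it assumes the file is opened in binary mode (so offsets are plain byte positions), stores the index as JSON instead of marshal, and rebuilds the index when the FASTA file is newer than it. The function names and the ".index.json" suffix are made up for illustration.

```python
import json
import os


def build_index(fasta_path):
    """Map each record name to the byte offset of its '>' header line."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                index[line[1:].strip().decode("ascii")] = pos
    return index


def load_index(fasta_path):
    """Reuse a saved index unless the FASTA file is newer than it."""
    index_path = fasta_path + ".index.json"  # hypothetical naming scheme
    if (os.path.exists(index_path)
            and os.path.getmtime(index_path) >= os.path.getmtime(fasta_path)):
        with open(index_path) as handle:
            return json.load(handle)
    index = build_index(fasta_path)
    with open(index_path, "w") as handle:
        json.dump(index, handle)
    return index


def fetch(fasta_path, index, name):
    """Return the sequence lines of one record, without the header."""
    with open(fasta_path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line
        lines = []
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line.decode("ascii").rstrip("\r\n"))
    return lines
```

Only the index (one small entry per record) stays in memory; each lookup is a single seek followed by reading that record's lines.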

Jul 18 '05 #2

John Lenton

> If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)

You don't need a RDBMS; I'd put it in a DBM or CDB myself.

Jul 18 '05 #3

Larry Bates

You don't say how this will be used, but here goes:

1) Read the records and put them into a dictionary, with the
sequence name (from the header) as the key and the sequence
data as the value. Use shelve to store the dictionary for
subsequent runs (if load time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net). It
provides you with a Python SQL-like database and may be a
better solution if the data is basically static and you
do lots of processing.

All depends on how you use the data.
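
Option 1 might look roughly like the sketch below, written for Python 3. It is only an illustration: the helper names are invented, and it assumes the record name is the first word of the header line. Because shelve pickles its values, each entry can hold a (description, sequence) tuple rather than a bare string.

```python
import shelve


def split_header(header):
    """Split 'NAME free text' into (name, description)."""
    name, _, description = header.partition(" ")
    return name, description.strip()


def parse_fasta(lines):
    """Yield (name, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


def build_shelf(fasta_path, shelf_path):
    """Parse the file once and persist every record in a shelf."""
    with shelve.open(shelf_path) as shelf, open(fasta_path) as handle:
        for name, description, sequence in parse_fasta(handle):
            shelf[name] = (description, sequence)


def lookup(shelf_path, name):
    """Open the shelf read-only and return (description, sequence)."""
    with shelve.open(shelf_path, flag="r") as shelf:
        return shelf[name]
```

The shelf lives on disk, so later runs only pay for the records they actually look up.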

Regards,
Larry Bates
Syscon, Inc.

Jul 18 '05 #4

David E. Konerding DSD staff

In article <11*********************@c13g2000cwb.googlegroups.com>, Chris Lasher wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:

Use biopython. They have dictionary-style classes which wrap FASTA files using indexes.

http://www.biopython.org

Dave

Jul 18 '05 #5

John Machin

Chris Lasher wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
>
> [snip]
>
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. :-) If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)
> Thanks very much in advance,
> Chris

Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on? Will the data be on a
server? If the server is on a WAN or at the other end of a radio link
between buildings, you definitely need an index so that you can access
the data randomly!

By way of example, to read all of a 157MB file into memory from a local
(i.e. not networked) disk using readlines() takes less than 4 seconds
on a 1.4GHz Athlon processor (see below). The average new corporate
desktop box is about twice as fast as that. Note that Windows Task
Manager showed 100% CPU utilisation for both read() and readlines().

My guess is that you don't need anything much fancier than the effbot's
index method -- which by now you have probably found works straight out
of the box and is more than fast enough for your needs.

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good free
software freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop

Jul 18 '05 #6

Bengt Richter

On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <ch**********@gmail.com> wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:

Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may hold
fewer than four 2-bit codes).

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).
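
To make the count-prefixed idea concrete, here is one possible layout, sketched in Python 3. The details are assumptions rather than any existing format: a 4-byte big-endian base count, then four bases per byte with A, C, G, T mapped to 0..3. Anything outside those four letters is rejected.

```python
import struct

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack an ACGT-only string into a 4-byte count plus 2 bits per base."""
    out = bytearray(struct.pack(">I", len(sequence)))
    for i in range(0, len(sequence), 4):
        group = sequence[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]  # KeyError on anything but ACGT
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        out.append(byte)
    return bytes(out)


def unpack(data):
    """Reverse pack(); the count says how many trailing codes to ignore."""
    (count,) = struct.unpack(">I", data[:4])
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])
```

The count prefix is what lets the last byte carry fewer than four codes without ambiguity.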

Regards,
Bengt Richter

Jul 18 '05 #7

Chris Lasher

> Before you get too carried away, how often do you want to do this and
> how grunty is the box you will be running on?

Oops, I should have specified this. The script will only need to be run
once every three or four months, when the sequences are updated. I'll
be running it on boxes that are 3GHz/100GB Ram, but others may not be
so fortunate, and so I'd like to keep that in mind.

> BTW, you need to clarify "don't have access to an RDBMS" ... surely
> this can only be due to someone stopping them from installing good
> free software freely available on the Internet.

I understand your and others' sentiment on this. I agree, the
open-source database systems are wonderful. However, keeping it
Python-only saves me hassle by only having to assist in instances where
others need help downloading and installing Python. I suppose if I keep
it in Python, I can even use Py2exe to generate an executable that
wouldn't even require them to install Python. A solution using
interaction with a database is much sexier, but, for the purposes of
the script, seems unnecessary. However, I certainly appreciate the
suggestions.

> My guess is that you don't need anything much fancier than the
> effbot's index method -- which by now you have probably found works
> straight out of the box and is more than fast enough for your needs.

I think Mr. Lundh's code will work very well for these purposes. Many
thanks to him for posting it; he'll have full credit for that part of
the code. Thanks very much to all who replied!

Chris

Jul 18 '05 #8

Chris Lasher

Thanks for your reply, Larry. I thought about this, but I'm worried the
dictionary will consume a lot of resources. I think my 3GHz/1GB RAM box
could handle the load fine, but I'm not sure about others' systems.
Chris

Jul 18 '05 #9

Chris Lasher

> Others have probably solved your basic problem, or pointed
> the way. I'm just curious.
>
> Given that the information content is 2 bits per character
> that is taking up 8 bits of storage, there must be a good reason
> for storing and/or transmitting them this way? I.e., it is easy
> to think up a count-prefixed compressed format packing 4:1 in
> subsequent data bytes (except for the last byte, which may hold
> fewer than four 2-bit codes).

My guess for the inefficiency in storage size is because it is
human-readable, and because most in-silico molecular biology is just a
bunch of fancy string algorithms. This is my limited view of these
things at least.

> I'm wondering how the data is actually used once records are
> retrieved.

This one I can answer. For my purposes, I'm just organizing the
sequences at hand, but there are all sorts of things one could actually
do with sequences: alignments, BLAST searches, gene annotations, etc.

Jul 18 '05 #10

Jeff Shannon

Chris Lasher wrote:

> > Given that the information content is 2 bits per character
> > that is taking up 8 bits of storage, there must be a good reason
> > for storing and/or transmitting them this way? I.e., it is easy
> > to think up a count-prefixed compressed format packing 4:1 in
> > subsequent data bytes (except for the last byte, which may hold
> > fewer than four 2-bit codes).
>
> My guess for the inefficiency in storage size is because it is
> human-readable, and because most in-silico molecular biology is just a
> bunch of fancy string algorithms. This is my limited view of these
> things at least.

Yeah, that pretty much matches my guess (not that I'm involved in
anything related to computational molecular biology or genetics).
Given the current technology, the cost of the extra storage size is
presumably lower than the cost of translating into/out of a packed
format. Heck, hard drives cost less than $1/GB now.

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #11

Chris Lasher

> And besides, for long-term archiving purposes, I'd expect that zip et
> al on a character-stream would provide significantly better
> compression than a 4:1 packed format, and that zipping the packed
> format wouldn't be all that much more efficient than zipping the
> character stream.

This 105MB FASTA file is 8.3 MB gzip-ed.

Jul 18 '05 #12

Jeff Shannon

Chris Lasher wrote:

> > And besides, for long-term archiving purposes, I'd expect that zip et
> > al on a character-stream would provide significantly better
> > compression than a 4:1 packed format, and that zipping the packed
> > format wouldn't be all that much more efficient than zipping the
> > character stream.
>
> This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26MB. It'd be interesting to
see how that packed-format file would compress, but I don't care
enough to write a script to convert the FASTA file into a
packed-format file to experiment with... ;)
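
For anyone who does care enough, a small harness along these lines would do the experiment (Python 3). It is a sketch only: pack_2bit is an illustrative packer that accepts nothing but A, C, G and T and pads the last byte with A, and the built-in demo data is synthetic, so the printed sizes say nothing about real sequence files.

```python
import random
import zlib


def pack_2bit(sequence):
    """Pack an ACGT-only string four bases to a byte (no length prefix)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    padded = sequence + "A" * (-len(sequence) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for base in padded[i:i + 4]:
            byte = (byte << 2) | codes[base]
        out.append(byte)
    return bytes(out)


def compare(sequence):
    """Return byte counts for the four representations being discussed."""
    text = sequence.encode("ascii")
    packed = pack_2bit(sequence)
    return {
        "text": len(text),
        "text+zlib": len(zlib.compress(text, 9)),
        "packed": len(packed),
        "packed+zlib": len(zlib.compress(packed, 9)),
    }


if __name__ == "__main__":
    # Stand-in data: a repeated motif with random point changes.
    # Real sequence files will give different ratios.
    rng = random.Random(0)
    motif = "".join(rng.choice("ACGT") for _ in range(500))
    bases = list(motif * 200)
    for _ in range(2000):
        bases[rng.randrange(len(bases))] = rng.choice("ACGT")
    print(compare("".join(bases)))
```

Feeding it the concatenated sequence lines of a real file (headers stripped) would give the comparison for that file.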

Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a
huge mess of bit-masking every time a single base needs to be
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #13

Robert Kern

Jeff Shannon wrote:

> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That gets
> ugly real fast.)

Not to mention all the IUPAC symbols for incompletely specified bases
(e.g. R = A or G).

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 18 '05 #14

John Lenton

On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote:

> Chris Lasher wrote:
>
> > Since the file I'm working with contains tens of thousands of these
> > records, I believe I need to find a way to hash this file such that I
> > can retrieve the respective sequence more quickly than I could by
> > parsing through the file request-by-request. However, I'm very new to
> > Python and am still very low on the learning curve for programming and
> > algorithms in general; while I'm certain there are ubiquitous
> > algorithms for this type of problem, I don't know what they are or
> > where to look for them. So I turn to the gurus and accost you for help
> > once again. :-) If you could help me figure out how to code a solution
> > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> > keep it in Python only, even though I know interaction with a
> > relational database would provide the fastest method--the group I'm
> > trying to write this for does not have access to a RDBMS.)
>
> keeping an index in memory might be reasonable. the following class
> creates an index file by scanning the FASTA file, and uses the "marshal"
> module to save it to disk. if the index file already exists, it's used as is.
> to regenerate the index, just remove the index file, and run the program
> again.

the problem caught my interest, and the way you used a non-mmapped file
in a place where using mmap was pretty much obvious (IMVHO) bothered
me enough to overcome my laziness. It didn't overcome it by *much*,
mind you, so the following probably only works on python 2.3, on
Linux, in Argentina, and with FASTA data that looks like the sample I
was able to download.

However, having said that, the sample I downloaded was one 46MiB file,
and reading it in on my notebook was fast enough that I ripped out the
saving/reloading of the index and just reindex it every time. Adding
back the persistent index is trivial. [ time passes ] In fact, I just
found a several-gigabyte fasta file, and I guess you'd want the index
for that one; I put the code back in. And now I should really go to
bed, because this is very interesting but won't pay the bills.

import os, mmap, marshal
from UserDict import DictMixin

class FASTA(DictMixin):
    def __init__(self, filename):
        self.file = f = open(filename)
        fno = f.fileno()
        stat = os.fstat(fno)
        self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY)
        self.checkindex()

    def __getitem__(self, key):
        p0, pf = self.index[key]
        m = self.map
        return m[p0:pf]

    def keys(self):
        return self.index.keys()
    def __contains__(self, key):
        return self.index.__contains__(key)

    def checkindex(self):
        indexfile = self.file.name + ".idx"
        if os.path.exists(indexfile):
            # and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime:
            self.index = marshal.load(open(indexfile, "rb"))
        else:
            print 'generating index...'
            self.genindex()
            marshal.dump(self.index, open(indexfile, "wb"))
            print 'done.'

    def genindex(self):
        index = {}
        m = self.map
        last = None
        while 1:
            pos = m.find('>')
            if last is not None:
                index[last] = (index[last], pos)
            if pos == -1:
                break
            m.seek(pos)
            line = m.readline()
            pos = m.tell()
            # this is the bit that probably only works with FASTA
            # files like I was able to find on the 'net.
            # (find() returns -1 when there is no space; index() would raise)
            sep = line.find(' ')
            if sep == -1:
                name = line[1:].strip()
            else:
                name = line[1:sep].strip()
            index[name] = pos
            last = name
        self.index = index

db = FASTA("/home/john/tmp/uniprot_sprot.fasta")
print db["104K_THEPA"]
--
John Lenton (jo**@grulic.org.ar) -- Random fortune:
"Those who believe in astrology are living in houses with foundations of
Silly Putty."
- Dennis Rawlins, astronomer

Jul 18 '05 #15

Neil Benn

Jeff Shannon wrote:

> Chris Lasher wrote:
>
> > > And besides, for long-term archiving purposes, I'd expect that zip et
> > > al on a character-stream would provide significantly better
> > > compression than a 4:1 packed format, and that zipping the packed
> > > format wouldn't be all that much more efficient than zipping the
> > > character stream.
> >
> > This 105MB FASTA file is 8.3 MB gzip-ed.
>
> And a 4:1 packed-format file would be ~26MB. It'd be interesting to
> see how that packed-format file would compress, but I don't care
> enough to write a script to convert the FASTA file into a
> packed-format file to experiment with... ;)
>
> Short version, then, is that yes, size concerns (such as they may be)
> are outweighed by speed and conceptual simplicity (i.e. avoiding a
> huge mess of bit-masking every time a single base needs to be
> examined, or a human-(semi-)readable display is needed).
>
> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That
> gets ugly real fast.)
>
> Jeff Shannon
> Technician/Programmer
> Credit International

Hello,

Just to clear up a few things on the topic :

If the file denotes DNA sequences, there are five basic identifiers:
A, G, C, T and X (where X means 'dunno!').

If the file denotes RNA sequences, you will still only need five
basic identifiers; the difference is that the T is replaced by a U.

One very good way I have found to parse large files of this nature
(I've done it with many a use case) is to write a sax parser for the
file. You can then register a content handler, receive events from
the sax parser and do whatever you like with them. Basically, you use
the sax framework to read the files - if you write the sax parser
carefully then you stream the files and remove old lines from memory,
so you have a scalable solution (rather than keeping everything in memory).
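
A rough sketch of that event-driven style, in Python 3. It is a plain-Python imitation of the SAX pattern, not the xml.sax framework itself, and every class and method name here is hypothetical. The parser holds only the current line; what to keep is up to the handler.

```python
class FastaHandler:
    """Hypothetical callback interface, loosely modelled on SAX handlers."""

    def start_record(self, header):
        pass

    def sequence_line(self, line):
        pass

    def end_record(self):
        pass


def parse(lines, handler):
    """Stream FASTA lines to a handler, holding only one line at a time."""
    in_record = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                handler.end_record()
            handler.start_record(line[1:])
            in_record = True
        elif line and in_record:
            handler.sequence_line(line)
    if in_record:
        handler.end_record()


class LengthCounter(FastaHandler):
    """Example handler: record each sequence's length, not its contents."""

    def __init__(self):
        self.lengths = {}
        self._name = None

    def start_record(self, header):
        self._name = header.strip()
        self.lengths[self._name] = 0

    def sequence_line(self, line):
        self.lengths[self._name] += len(line)
```

Passing an open file object as `lines` streams the file; a list of strings works just as well for testing.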

As an aside, I would seriously consider parsing your files and
putting this information in a small local db - it's really not much work
to do and the 'pure' python thing is a misnomer, whichever persistence
mechanism you use (file,DB,etching it on the floor with a small robot
accepting logo commands,etc) is unlikely to be pure python.

The advantage of putting it in a DB will show up later when you have
fast and powerful retrieval capability.
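
As one illustration of how little work a small local db can be, here is a sketch using the sqlite3 module that ships with current Python versions (it was not part of the standard library when this thread was written). The table layout and function names are assumptions made for the example.

```python
import sqlite3


def load_records(conn, records):
    """Insert (name, sequence) pairs into a table keyed by name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "name TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?)", records
        )


def fetch_sequence(conn, name):
    """Look one sequence up by name; raise KeyError if it is absent."""
    row = conn.execute(
        "SELECT sequence FROM sequences WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0]
```

With sqlite3.connect("sequences.db") (any file name will do) the data persists between runs, and the primary key gives indexed lookups by name.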

Cheers,

Neil

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany

Tel : +49 (0)351 4173 154
e-mail : be**@cenix-bioscience.com
Cenix Website : http://www.cenix-bioscience.com

Jul 18 '05 #16

Michael Maibaum

On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote:

> Jeff Shannon wrote:
>
> > (Plus, if this format might be used for RNA sequences as well as DNA
> > sequences, you've got at least a fifth base to represent, which means
> > you need at least three bits per base, which means only two bases per
> > byte (or else base-encodings split across byte-boundaries).... That gets
> > ugly real fast.)
>
> Not to mention all the IUPAC symbols for incompletely specified bases
> (e.g. R = A or G).
>
> http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

Or, for those of us working with proteins as well, all the single letter codes for proteins:

http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

lots more bits.

I have a db with approx 3 million proteins in it and would not want to be using a pure python approach :)

Michael

Jul 18 '05 #17

Chris Lasher

Forgive my ignorance, but what does using mmap do for the script? My
guess is that it improves performance, but I'm not sure how. I read the
module documentation and the module appears to be a way to read out
information from memory (RAM maybe?).

Jul 18 '05 #18

Paul Rubin

"Chris Lasher" <ch**********@gmail.com> writes:

> Forgive my ignorance, but what does using mmap do for the script? My
> guess is that it improves performance, but I'm not sure how. I read the
> module documentation and the module appears to be a way to read out
> information from memory (RAM maybe?).

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations.
Just say a[234]='x' and you've changed byte 234 of the file to the
letter x. It works through the OS's virtual memory system and the
computer's MMU hardware, and so it has lower overhead than doing
system calls for every access.
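
A small self-contained demonstration of the above, written for Python 3 (where a mapped file holds bytes, so the single-byte write is spelled as a one-byte slice assignment). The temporary file and its contents are stand-ins for a real FASTA file.

```python
import mmap
import os
import tempfile

# Create a small throwaway file to map (stand-in for a real FASTA file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b">A\nACGT\n>B\nGGCC\n")

with open(path, "r+b") as handle:
    mapped = mmap.mmap(handle.fileno(), 0)  # length 0 maps the whole file
    second = mapped.find(b">", 1)           # offset of the next header
    first_record = mapped[0:second]         # slicing, no seek() or read()
    mapped[3:4] = b"T"                      # overwrite one byte in place
    mapped.flush()                          # push the change to disk
    mapped.close()

with open(path, "rb") as handle:
    changed = handle.read()
os.remove(path)
```

After this runs, `first_record` holds the bytes of the first record and `changed` shows the edited byte as read back from the file.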

Jul 18 '05 #19

Steve Holden

Bengt Richter wrote:

> On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <ch**********@gmail.com> wrote:
> [...]
>
> Others have probably solved your basic problem, or pointed
> the way. I'm just curious.
>
> Given that the information content is 2 bits per character
> that is taking up 8 bits of storage, there must be a good reason
> for storing and/or transmitting them this way? I.e., it is easy
> to think up a count-prefixed compressed format packing 4:1 in
> subsequent data bytes (except for the last byte, which may hold
> fewer than four 2-bit codes).
>
> I'm wondering how the data is actually used once records are
> retrieved. (but I'm too lazy to explore the biopython.org link).

Revealingly honest.

Of course, adopting an encoding that only used two bits per base would
make it impossible to use the re module to search for patterns in them,
for example. So the work of continuously translating between
representations might militate against more efficient representations.
Or, of course, it might not :-)
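
To illustrate the point about the re module: with sequences kept as plain text, a motif search is a few lines (Python 3 sketch). The table below holds the standard IUPAC nucleotide ambiguity codes mentioned earlier in the thread; the function names are invented for this example.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif such as 'GGRAC' into a regular expression."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def find_motif(motif, sequence):
    """Return the start offsets of non-overlapping matches."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]
```

With a 2-bit packed representation, the same search would need either an unpacking step first or a hand-written bit-level matcher.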

it's-only-storage-ly y'rs - steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119

Jul 18 '05 #20

Steve Holden

Jeff Shannon wrote:

> Chris Lasher wrote:
>
> > > And besides, for long-term archiving purposes, I'd expect that zip et
> > > al on a character-stream would provide significantly better
> > > compression than a 4:1 packed format, and that zipping the packed
> > > format wouldn't be all that much more efficient than zipping the
> > > character stream.
> >
> > This 105MB FASTA file is 8.3 MB gzip-ed.
>
> And a 4:1 packed-format file would be ~26MB. It'd be interesting to see
> how that packed-format file would compress, but I don't care enough to
> write a script to convert the FASTA file into a packed-format file to
> experiment with... ;)

If your compression algorithm's any good then both, when compressed,
should be approximately equal in size, since the size should be
determined by the information content rather than the representation.

> Short version, then, is that yes, size concerns (such as they may be)
> are outweighed by speed and conceptual simplicity (i.e. avoiding a huge
> mess of bit-masking every time a single base needs to be examined, or a
> human-(semi-)readable display is needed).
>
> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That gets
> ugly real fast.)

Right!

regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119

Jul 18 '05 #21

Roy Smith

In article <11*********************@c13g2000cwb.googlegroups.com>,
"Chris Lasher" <ch**********@gmail.com> wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request.

First, before embarking on any major project, take a look at
http://www.biopython.org/ to at least familiarize yourself with what
other people have done in the field.

The easiest thing I think would be to use the gdbm module. You can
write a simple parser to parse the FASTA file (or, I would imagine, find
one already written on biopython), and then store the data in a gdbm
map, using the tag lines as the keys and the sequences as the values.

Even for a Python neophyte, this should be a pretty simple project. The
most complex part might be getting the gdbm module built if your copy of
Python doesn't already have it, but gdbm is so convenient, it's worth
the effort.
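
A sketch of that approach for Python 3, where the generic dbm module picks whichever backend is available (GNU dbm if it was built, otherwise a fallback), so nothing has to be compiled. The function names are illustrative, and the whole tag line after the '>' is used as the key.

```python
import dbm


def build_db(fasta_path, db_path):
    """Store each record's sequence under its tag line in a dbm file."""
    with dbm.open(db_path, "n") as db, open(fasta_path) as handle:
        name, chunks = None, []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    db[name] = "".join(chunks)
                name, chunks = line[1:], []
            elif line and name is not None:
                chunks.append(line)
        if name is not None:
            db[name] = "".join(chunks)


def lookup(db_path, name):
    """Fetch one sequence by tag line; dbm hands back bytes."""
    with dbm.open(db_path, "r") as db:
        return db[name].decode("ascii")
```

Building the database is a one-off pass over the file; after that each lookup touches only the dbm file.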

Jul 18 '05 #22

Bulba!

On 14 Jan 2005 12:30:57 -0800, Paul Rubin
<ph****@NOSPAM.invalid> wrote:

> Mmap lets you treat a disk file as an array, so you can randomly
> access the bytes in the file without having to do seek operations

Cool!

> Just say a[234]='x' and you've changed byte 234 of the file to the
> letter x.

However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite half of the
mmapped file just to add that one byte?

flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.
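
On the growing-element case, a quick check of how Python's own mmap objects behave (Python 3 sketch; an anonymous map is used so no file is needed, and the assumption is that file-backed maps follow the same rule). A slice assignment has to keep the same length, so a one-byte slot cannot silently become two bytes; inserting a byte means moving everything after it yourself.

```python
import mmap

# An anonymous 8-byte map stands in for a mapped file.
buf = mmap.mmap(-1, 8)
buf[:] = b"ACGTACGT"

buf[3:4] = b"x"          # same length: one byte changes in place

try:
    buf[3:4] = b"xy"     # two bytes into a one-byte slot
    grew = True
except (IndexError, ValueError):
    grew = False         # the map refuses: sizes must match

contents = buf[:]
buf.close()
```

Here `grew` ends up False and `contents` still has its original length.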

--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.

Jul 18 '05 #23

Michael Hoffman

Chris Lasher wrote:

I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order.

I just came across this thread today and I don't understand why you are
trying to reinvent the wheel instead of using Biopython which already
has a solution to this problem, among others.

But actually I usually use formatdb, which comes with NCBI-BLAST to
create blastdb files that can also be used for BLAST.

mh5@ecs4a /data/blastdb/Users/mh5
$ python
Python 2.3.3 (#1, Jan 20 2004, 17:39:36) [C] on osf1V5
Type "help", "copyright", "credits" or "license" for more information.

>>> import blastdb
>>> from tools2 import LightIterator
>>> temp_file = blastdb.Database("mammals.peptides.faa").fetch_to_tempfile("004/04/m00404.peptide.faa")
>>> LightIterator(temp_file).next()

('lcl|004/04/m00404.peptide.faa ENSMUSG00000022297 peptide', 'MERSPFLLACILLPLVRGHSLFTCEPITVPRCMKMTYNMTFFPNLMGHY DQGIAAVEMGHFLHLANLECSPNIEMFLCQAFIPTCTEQIHVVLPCRKLC EKIVSDCKKLMDTFGIRWPEELECNRLPHCDDTVPVTSHPHTELSGPQKK SDQVPRDIGFWCPKHLRTSGDQGYRFLGIEQCAPPCPNMYFKSDELDFAK SFIGIVSIFCLCATLFTFLTFLIDVRRFRYPERPIIYYSVCYSIVSLMYF VGFLLGNSTACNKADEKLELGDTVVLGSKNKACSVVFMFLYFFTMAGTVW WVILTITWFLAAGRKWSCEAIEQKAVWFHAVAWGAPGFLTVMLLAMNKVE GDNISGVCFVGLYDLDASRYFVLLPLCLCVFVGLSLLLAGIISLNHVRQV IQHDGRNQEKLKKFMIRIGVFSGLYLVPLVTLLGCYVYELVNRITWEMTW FSDHCHQYRIPCPYQANPKARPELALFMIKYLMTLIVGISAVFWVGSKKT CTEWAGFFKRNRKRDPISESRRVLQESCEFFLKHNSKVKHKKKHGAPGPH RLKVISKSMGTSTGATTNHGTSAMAIADHDYLGQETSTEVHTSPEASVKE GRADRANTPSAKDRDCGESAGPSSKLSGNRNGRESRAGGLKERSNGSEGA PSEGRVSPKSSVPETGLIDCSTSQAASSPEPTSLKGSTSLPVHSASRARK EQGAGSHSDA')

tools2 has this in it:

class LightIterator(object):
    def __init__(self, handle):
        self._handle = handle
        self._defline = None

    def __iter__(self):
        return self

    def next(self):
        lines = []
        defline_old = self._defline

        while 1:
            line = self._handle.readline()
            if not line:
                if not defline_old and not lines:
                    raise StopIteration
                if defline_old:
                    self._defline = None
                break
            elif line[0] == '>':
                self._defline = line[1:].rstrip()
                if defline_old or lines:
                    break
                else:
                    defline_old = self._defline
            else:
                lines.append(line.rstrip())

        return defline_old, ''.join(lines)

blastdb.py:

#!/usr/bin/env python
from __future__ import division

__version__ = "$Revision: 1.3 $"

"""
blastdb.py

access blastdb files
Copyright 2005 Michael Hoffman
License: GPL
"""

import os
import sys

try:
    from poly import NamedTemporaryFile # http://www.ebi.ac.uk/~hoffman/software/poly/
except ImportError:
    from tempfile import NamedTemporaryFile

FASTACMD_CMDLINE = "fastacmd -d %s -s %s -o %s"

class Database(object):
    def __init__(self, filename):
        self.filename = filename

    def fetch_to_file(self, query, filename):
        status = os.system(FASTACMD_CMDLINE % (self.filename, query, filename))
        if status:
            raise RuntimeError, "fastacmd returned %d" % os.WEXITSTATUS(status)

    def fetch_to_tempfile(self, query):
        temp_file = NamedTemporaryFile()
        self.fetch_to_file(query, temp_file.name)
        return temp_file
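
For anyone on a current Python 3, the same fastacmd call could probably be made through subprocess with an argument list, which sidesteps shell quoting of the query. This is only a sketch: the -d, -s and -o flags are the ones in FASTACMD_CMDLINE above, while the function names are made up here and fastacmd itself is assumed to be on the PATH.

```python
import subprocess


def fastacmd_args(database, query, output):
    """Build the argument list for the fastacmd call used in blastdb.py."""
    return ["fastacmd", "-d", database, "-s", query, "-o", output]


def fetch_to_file(database, query, output):
    """Run fastacmd; subprocess raises CalledProcessError on a nonzero exit."""
    subprocess.run(fastacmd_args(database, query, output), check=True)
```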
--
Michael Hoffman

Jul 18 '05 #24

Steve Holden

Bulba! wrote:

On 14 Jan 2005 12:30:57 -0800, Paul Rubin
<http://ph****@NOSPAM.invalid> wrote:

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations

Cool!

Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.

However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite the half of
mmaped file just to add that one byte?

Nope. If you try a[234] = 'banana' you'll get an error message. The mmap
protocol doesn't support insertion and deletion, only overwriting.

Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:

>>> import mmap
>>> f = file("/tmp/Xout.txt", "r+")
>>> mm = mmap.mmap(f.fileno(), 200)
>>> mm[1:10]
'elcome to'
>>> mm[1] = "banana"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap assignment must be single-character string
>>> mm[1:10] = 'ishing ::'
>>> mm[1:10]
'ishing ::'
>>> mm[1:10] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap slice assignment is wrong size
>>>

flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.

Some of this depends on whether the mmap is shared or private, of
course, but generally speaking you can ignore the overhead, and the
flush() calls will be automatic as long as you don't mix file and string
operations. The programming convenience is amazing.
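
The same overwrite-only behaviour can be reproduced on a current Python 3, where mmap slices take bytes rather than str. A rough sketch on a scratch file follows; the file contents are made up, and the exact exception type for a wrong-size assignment is treated as an assumption, so both likely candidates are caught.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"Welcome to the machine\n")

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
    try:
        mm[1:10] = b"ishing ::"    # same length: overwrites in place
        try:
            mm[1:10] = b"a"        # different length: rejected
            resize_error = None
        except (IndexError, ValueError) as exc:
            resize_error = exc
        mm.flush()                 # push the change back to the file
    finally:
        mm.close()

with open(path, "rb") as f:
    contents = f.read()
os.remove(path)
```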
--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.

So make it work :-)

regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119

Jul 18 '05 #25

Chris Lasher

Roy, thank you for your reply. I have BioPython installed on my box at
work and have been browsing through the code in there, some of which I
can follow, and most of which will take me more time and experience to
understand. I have considered BioPython and databases, and have chosen to
forgo those routes for the reasons above, summarized here: this script
has a limited scope in what I hope it will achieve, and it should be as
convenient as possible for the end user to execute (even at the
expense of convenience to me as the coder). Again, thanks for your
input, though. It's very helpful to me to be able to learn from
perspectives that I wouldn't have seen myself.

Jul 18 '05 #26

Bengt Richter

On Sat, 15 Jan 2005 15:24:56 -0500, Steve Holden <st***@holdenweb.com> wrote:

Bulba! wrote:
On 14 Jan 2005 12:30:57 -0800, Paul Rubin
<http://ph****@NOSPAM.invalid> wrote:

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations

Cool!

Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.

However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite the half of
mmaped file just to add that one byte?

I would wonder what mm.find('pattern') in the middle of a huge file
would do to the working set vs sequential reads as in my little toy
(which BTW is also happy to expand or contract old vs new replacement string
as it streams buffers file to file).

Nope. If you try a[234] = 'banana' you'll get an error message. The mmap
protocol doesn't support insertion and deletion, only overwriting.

Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:
>>> import mmap
>>> f = file("/tmp/Xout.txt", "r+")
>>> mm = mmap.mmap(f.fileno(), 200)
>>> mm[1:10]
'elcome to'
>>> mm[1] = "banana"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap assignment must be single-character string
>>> mm[1:10] = 'ishing ::'
>>> mm[1:10]
'ishing ::'
>>> mm[1:10] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap slice assignment is wrong size
>>>

flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.

Some of this depends on whether the mmap is shared or private, of
course, but generally speaking you can ignore the overhead, and the
flush() calls will be automatic as long as you don't mix file and string
operations. The programming convenience is amazing.

That part does look good, but will scanning a large file with find
cause massive swapouts, or is there some smart prioritization or
hidden sequential windowing that limits mmap's impact?
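
For concreteness, here is roughly what a find-based scan looks like when applied to the original FASTA problem: one pass with mm.find() records the byte offset of every '>' that starts a line, and later lookups seek straight to a record. It is a sketch in current Python 3 and says nothing about the paging behaviour asked about above, which depends on the OS. The function names and demo data are made up, tags are assumed unique, and the file is assumed non-empty (mapping an empty file fails).

```python
import mmap
import os
import tempfile


def index_fasta(path):
    """Map each record's tag line to the byte offset of its '>'."""
    offsets = {}
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            pos = mm.find(b">")
            while pos != -1:
                # Only a '>' at the start of a line begins a record.
                if pos == 0 or mm[pos - 1:pos] == b"\n":
                    end = mm.find(b"\n", pos)
                    if end == -1:
                        end = len(mm)
                    offsets[mm[pos + 1:end].decode().strip()] = pos
                pos = mm.find(b">", pos + 1)
        finally:
            mm.close()
    return offsets


def fetch(path, offsets, tag):
    """Seek straight to one record and read only its lines."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offsets[tag])
        f.readline()  # skip the tag line
        for line in f:
            if line.startswith(b">"):
                break
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Tiny demonstration on a throwaway file (hypothetical data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b">CW127_A01\nTGCAGTCGAAC\nGAGAACGG\n>CW127_A02\nTATAGTTAATC\n")
demo_offsets = index_fasta(demo_path)
demo_record = fetch(demo_path, demo_offsets, "CW127_A02")
```

The offsets dictionary is small compared with the file, so it could be pickled or kept in a dbm file between runs instead of being rebuilt each time.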
--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.

So make it work :-)

Regards,
Bengt Richter

Jul 18 '05 #27
