Hello,
I need to store data in large lists (~e7 elements) and I often get a memory error in code that looks like:
    f = open('data.txt','r')
    for line in f:
        list1.append(line.split(',')[1])
        list2.append(line.split(',')[2])
        # etc.
I get the error when reading-in the data, but I don't really need all elements to be stored in RAM all the time. I work with chunks of that data.
So, more specifically, I have to read-in ~ 10,000,000 entries (strings and numeric) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations and get summary statistics (means, stdevs etc.) for blocks of say 500,000. Fast access for these blocks would be needed!
I also need to read everything in at once (so no f.seek() etc. to read the data a block at a time).
Any advice on how to achieve this? Platform = windowsXP
Cheers!
bvdet 2,851
Expert Mod 2GB
My thought would be to read in a range of lines at a time, process those lines and move on to the next range, storing the results in a file as needed.
This function reads in a range of lines:
    def fileLineRange(fn, start, end):
        f = open(fn)
        # Skip the lines that come before the start line.
        for i in xrange(start-1):
            try:
                f.next()
            except StopIteration, e:
                f.close()
                return "Start line %s is beyond end of file." % (start)

        outputList = []
        for line in xrange(start, end+1):
            outputList.append(f.next().strip())
        f.close()
        return outputList
fileLineRange(fn, 700, 720) would read in lines 700 through 720.
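The same idea can be done in a single pass, so the file is opened once and never re-read from the top. The following is only a minimal sketch, written for Python 3 (the snippets in this thread are Python 2). The names `iter_blocks` and `block_stats`, the comma separator and the column index are assumptions for illustration, not something from the original post.

```python
import itertools
import math


def iter_blocks(lines, block_size):
    """Yield successive lists of at most block_size lines from an iterable."""
    it = iter(lines)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            return
        yield block


def block_stats(lines, column, block_size, sep=','):
    """Return one (count, mean, sample stdev) tuple per block of lines.

    Only one block of lines is held in memory at a time.
    """
    results = []
    for block in iter_blocks(lines, block_size):
        values = [float(line.split(sep)[column]) for line in block]
        n = len(values)
        mean = sum(values) / n
        # Sample variance; taken as 0.0 when the block has a single value.
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        results.append((n, mean, math.sqrt(var)))
    return results
```

With a real file this might be called as `with open('data.txt') as f: stats = block_stats(f, 1, 500000)`. Only the per-block summaries are kept, so memory use depends on the block size rather than on the file size.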
Yes, I thought of that. Problem is: (i) I need to be able to calculate statistics for different block sizes without having to read the file over and over again and (ii) I need to know some info from the very last line (files have a time-column, started the same time but are not equally long).
Is there any way to store the whole thing in some kind of data structure (e.g. by creating a class "extending" list or something)? Sorry for the Java terminology :)
bvdet 2,851
Expert Mod 2GB
You are only truly reading a group of lines at a time, but I understand that it might not be the most efficient way. You should consider storing all the data in a MySQL database for efficient access. MySQLdb is the Python interface.
An afterthought to the code I posted. In case the end line number is greater than the number of lines, I added a try/except block:
    def fileLineRange(fn, start, end):
        f = open(fn)
        # Skip the lines that come before the start line.
        for i in xrange(start-1):
            try:
                f.next()
            except StopIteration, e:
                f.close()
                return "Start line %s is beyond end of file." % (start)

        outputList = []
        for i in xrange(start, end+1):
            try:
                outputList.append(f.next().strip())
            except StopIteration, e:
                # Ran out of lines before reaching the requested end line.
                print "The last line in the file is line number %s." % (i-1)
                break
        f.close()
        return outputList
Hmm, sorry I wasn't very clear. What I meant is:
(i) The files contain roughly month-long measurements, and once I've read a file in I'd like to be able to get e.g. per-day or per-week means, or, for a specific file, to focus on the first hours. That's what I meant by not wanting to read the whole thing over and over again.
(ii) as for the last line, I guess it's not a big issue. I just need to know the duration of all measurements from the start to do some of the calculations. But I guess I should just read the last line in the beginning and then go back to the start of the file.
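Reading the last line first does not require scanning the whole file. Below is a minimal sketch (Python 3), assuming the file uses ordinary `\n` or `\r\n` line endings and UTF-8 compatible text. The name `read_last_line` and the `chunk_size` parameter are made up for this example.

```python
import os


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a text file without reading it all."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b''
        # Walk backwards in chunks until the tail holds a complete last line.
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            if b'\n' in data.rstrip(b'\r\n'):
                break
        # Drop trailing line endings, then keep whatever follows the last newline.
        last = data.rstrip(b'\r\n').split(b'\n')[-1]
        return last.rstrip(b'\r').decode()
```

After this call the file can simply be opened again and read from the start as usual.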
Also, I'm not very familiar with MySQL. Is there no alternative "large list" implementation that, say, stores the data on disk and loads a chunk ("page") into RAM at a time?
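One way to get a disk-backed "list" with only the standard library is to keep each numeric column in a binary file and load just the requested block. This is a rough sketch in Python 3, not a full list replacement. The class name `DiskColumn` and its methods are hypothetical, and it handles numeric columns only (string columns would need a different layout).

```python
import array
import os


class DiskColumn:
    """A numeric column stored on disk as raw doubles; blocks load on demand."""

    ITEM_SIZE = array.array('d').itemsize  # bytes per stored value

    def __init__(self, path):
        self.path = path

    def append_values(self, values):
        """Append an iterable of numbers to the end of the column file."""
        with open(self.path, 'ab') as f:
            array.array('d', values).tofile(f)

    def __len__(self):
        if not os.path.exists(self.path):
            return 0
        return os.path.getsize(self.path) // self.ITEM_SIZE

    def read_block(self, start, stop):
        """Return elements start..stop-1 as an array, reading only those bytes."""
        stop = min(stop, len(self))
        count = max(0, stop - start)
        block = array.array('d')
        if count:
            with open(self.path, 'rb') as f:
                f.seek(start * self.ITEM_SIZE)
                block.fromfile(f, count)
        return block
```

The text file would be converted once, appending parsed values in batches, after which something like `col.read_block(0, 500000)` fetches a block with a single seek and read. An `array` of doubles also takes far less memory per element than a list of strings, which may already be enough to avoid the memory error.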
You may want to use SQL, but since you do not say what specifically you want to access or how you want to do it, it is difficult to tell whether using a list is the best way. Most of us have code generators for quick and dirty apps, so below is a generated SQL example of what you might want to do, using SQLite which comes with Python (code comments are sparse though). I don't want to waste time on something that may not be used, so post back if you want more info.
    import random
    import sqlite3 as sqlite

    class SQLTest:
        def __init__( self ) :
            self.SQL_filename = './SQLtest.SQL'
            self.open_files()

        ##----------------------------------------------------------------------
        def add_rec( self, val_tuple) :
            self.cur.execute('INSERT INTO example_dbf values (?,?,?,?,?)', val_tuple)
            self.con.commit()

        ##----------------------------------------------------------------------
        def list_all_recs( self ) :
            self.cur.execute("select * from example_dbf")
            recs_list = self.cur.fetchall()
            for rec in recs_list:
                print rec

        ##----------------------------------------------------------------------
        def lookup_date( self, date_in ) :
            self.cur.execute("select * from example_dbf where st_date==:dic_lookup",
                             {"dic_lookup":date_in})
            recs_list = self.cur.fetchall()
            print
            print "lookup_date"
            for rec in recs_list:
                print "%3d %9s %10.6f %3d %s" % (rec[0], rec[1], rec[2], rec[3], rec[4])

        ##----------------------------------------------------------------------
        def lookup_2_fields( self, lookup_dic ) :
            self.cur.execute("select * from example_dbf where st_date==:dic_field_1 and st_int==:dic_field_2", lookup_dic)

            recs_list = self.cur.fetchall()
            print
            print "lookup_2_fields"
            if len(recs_list):
                for rec in recs_list:
                    print rec
            else:
                print "no recs found"

        ##----------------------------------------------------------------------
        def open_files( self ) :
            ## a connection to the database file
            self.con = sqlite.connect(self.SQL_filename)

            # Get a Cursor object that operates in the context of Connection con
            self.cur = self.con.cursor()

            ##--- CREATE FILE ONLY IF IT DOESN'T EXIST
            self.cur.execute("CREATE TABLE IF NOT EXISTS example_dbf(st_rec_num int, st_date varchar, st_float, st_int int, st_lit varchar)")

    ##===================================================================
    if __name__ == "__main__":
        ST = SQLTest()

        """ add some records with the format
            record_number  date  float  int  string
        """
        rec_num = 0
        ccyy = 2010
        for x in range(1, 11):
            rec_num += 1
            mm = x + 1
            dd = x + 2
            date = "%d%02d%02d" % (ccyy, mm, dd)
            add_fl = random.random() * 1000
            add_int = random.randint(1, 21)
            lit = "test lit # %d" % (x)
            ST.add_rec( (rec_num, date, add_fl, add_int, lit) )

        ## add duplicate dates for testing
        for x in range(1, 3):
            for y in range(2):
                rec_num += 1
                mm = x + 1
                dd = x + 2
                date = "%d%02d%02d" % (ccyy, mm, dd)
                add_fl = random.random() * 1000
                add_int = random.randint(1, 21)
                lit = "test lit # %d" % (x)
                ST.add_rec( (rec_num, date, add_fl, add_int, lit) )

        ST.list_all_recs()
        ST.lookup_date("20100203")

        lookup_dict = {"dic_field_1":"20100203",
                       "dic_field_2":10}
        ST.lookup_2_fields(lookup_dict)
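For ~10,000,000 rows, committing after every insert (as `add_rec` does above) is likely to be slow. Here is a hedged sketch, in Python 3, of loading rows in one transaction with `executemany` and letting SQLite compute the per-day means. The table `measurements` and its two columns are hypothetical and are not the poster's actual schema.

```python
import sqlite3


def load_rows(con, rows):
    """Insert many (day, value) rows in a single transaction."""
    con.execute("CREATE TABLE IF NOT EXISTS measurements (day TEXT, value REAL)")
    with con:  # commits once on success, rolls back on error
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)


def daily_means(con):
    """Return a list of (day, mean value, row count) tuples, one per day."""
    cur = con.execute(
        "SELECT day, AVG(value), COUNT(*) FROM measurements "
        "GROUP BY day ORDER BY day")
    return cur.fetchall()
```

`rows` can be a generator that parses the text file line by line, so the whole file never has to sit in a Python list. Per-week or first-hours summaries would then be a matter of changing the `GROUP BY` or adding a `WHERE` clause rather than re-reading the file.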
Thank you, I will look into this tomorrow and I'll post back if I'm in trouble.
Cheers!