Large lists in python

Hello,

I need to store data in large lists (~1e7 elements) and I often get a memory error in code that looks like:

    list1, list2 = [], []
    with open('data.txt', 'r') as f:
        for line in f:
            fields = line.split(',')  # split each line only once
            list1.append(fields[1])
            list2.append(fields[2])
            # etc.

I get the error while reading in the data, but I don't really need all elements to be stored in RAM all the time. I work with chunks of that data.

So, more specifically, I have to read in ~10,000,000 entries (strings and numbers) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations and get summary statistics (means, standard deviations etc.) for blocks of, say, 500,000. Fast access to these blocks is needed!

I also need to read everything in at once (so no f.seek() etc. to read the data a block at a time).

Any advice on how to achieve this? Platform = Windows XP

Cheers!
Aug 14 '10 #1
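One way to reduce the memory footprint of the loop in the question is to keep numeric columns in the standard library's `array` type (8 bytes per double) instead of lists of strings, and to compute statistics per block from slices. The following is only a minimal sketch: the function names, the choice of columns 1 and 2, and the sample data are assumptions for illustration, not part of the original post.

```python
import io
import statistics
from array import array

def load_numeric_columns(lines, column_indexes):
    """Parse comma-separated lines into one compact array of doubles per column."""
    columns = {idx: array('d') for idx in column_indexes}
    for line in lines:
        fields = line.rstrip('\n').split(',')  # split each line only once
        for idx in column_indexes:
            columns[idx].append(float(fields[idx]))
    return columns

def block_stats(values, block_size):
    """Yield (start_index, mean, sample stdev) for consecutive blocks of values."""
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        mean = statistics.fmean(block)
        stdev = statistics.stdev(block) if len(block) > 1 else 0.0
        yield start, mean, stdev

# Tiny in-memory stand-in for data.txt (hypothetical contents).
sample = io.StringIO("a,1.0,10.0\nb,2.0,20.0\nc,3.0,30.0\nd,4.0,40.0\n")
cols = load_numeric_columns(sample, [1, 2])
stats = list(block_stats(cols[1], 2))
```

Because the block size is just an argument to `block_stats`, different block sizes can be tried without reading the file again. String columns would still need a list or another container.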
bvdet
2,851 Expert Mod 2GB
My thought would be to read in a range of lines at a time, process those lines and move on to the next range, storing the results in a file as needed.

This function reads in a range of lines:
    def fileLineRange(fn, start, end):
        """Return lines start..end (1-based, inclusive) of file fn, stripped."""
        f = open(fn)
        # Skip the lines before the requested range.
        for i in range(start - 1):
            try:
                next(f)
            except StopIteration:
                f.close()
                return "Start line %s is beyond end of file." % start

        outputList = []
        for i in range(start, end + 1):
            outputList.append(next(f).strip())
        f.close()
        return outputList

fileLineRange(fn, 700, 720) would read in lines 700 through 720.
Aug 14 '10 #2
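A variation on the same idea, sketched here with `itertools.islice`, walks the file once and hands back one block of lines at a time, so earlier lines are not skipped again for every range. The name `iter_line_blocks` and the in-memory sample are hypothetical, introduced only for this illustration.

```python
import io
from itertools import islice

def iter_line_blocks(fileobj, block_size):
    """Yield successive lists of up to block_size stripped lines from fileobj."""
    while True:
        block = [line.strip() for line in islice(fileobj, block_size)]
        if not block:
            return  # end of file reached
        yield block

# Stand-in for an open text file (hypothetical contents).
demo = io.StringIO("l1\nl2\nl3\nl4\nl5\n")
blocks = list(iter_line_blocks(demo, 2))
```

With a real file, `for block in iter_line_blocks(open(fn), 500000): ...` would process the data block by block while holding only one block in memory.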
fekioh
11
Yes, I thought of that. The problem is: (i) I need to be able to calculate statistics for different block sizes without having to read the file over and over again, and (ii) I need some info from the very last line (the files have a time column and started at the same time, but are not equally long).

Is there any way to store the whole thing in some kind of data structure (e.g. by creating a class "extending" list or something)? Sorry for the Java terminology :)
Aug 14 '10 #3
bvdet
2,851 Expert Mod 2GB
You are only truly reading a group of lines at a time, but I understand that it might not be the most efficient way. You should consider storing all the data in a MySQL database for efficient access. MySQLdb is the Python interface.

An afterthought to the code I posted. In case the end line number is greater than the number of lines, I added a try/except block:
    def fileLineRange(fn, start, end):
        """Return lines start..end (1-based, inclusive) of file fn, stripped."""
        f = open(fn)
        # Skip the lines before the requested range.
        for i in range(start - 1):
            try:
                next(f)
            except StopIteration:
                f.close()
                return "Start line %s is beyond end of file." % start

        outputList = []
        for i in range(start, end + 1):
            try:
                outputList.append(next(f).strip())
            except StopIteration:
                # The requested end line is past the end of the file.
                print("The last line in the file is line number %s." % (i - 1))
                break
        f.close()
        return outputList
Aug 14 '10 #4
fekioh
11
Hmm, sorry, I wasn't very clear. What I meant is:

(i) The files contain roughly month-long measurements, and once I've read a file in I'd like to be able to get e.g. per-day or per-week means, or, for a specific file, to focus on the first hours. That's what I meant by not wanting to read the whole thing over and over again...

(ii) As for the last line, I guess it's not a big issue. I just need to know the total duration of the measurements up front to do some of the calculations. I guess I should just read the last line first and then go back to the start of the file.
Aug 14 '10 #5
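Reading the last line first does not require scanning the whole file: one can seek to the end and read backwards in small chunks until a complete line is in the buffer. The sketch below assumes a UTF-8 (or ASCII) text file whose first column is the elapsed time; `read_last_line`, the chunk size and the sample rows are made up for illustration.

```python
import os
import tempfile

def read_last_line(path, chunk_size=4096, encoding='utf-8'):
    """Return the last non-empty line of a file, reading backwards from the end."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b''
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            # Stop once a newline precedes the final line in the buffer.
            if b'\n' in data.rstrip(b'\r\n'):
                break
        stripped = data.rstrip(b'\r\n')
        return stripped.rsplit(b'\n', 1)[-1].decode(encoding)

# Hypothetical sample file: time in seconds, then a measurement.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("0.0,1.5\n3600.0,2.5\n7200.0,3.5\n")
last = read_last_line(tmp.name, chunk_size=4)
duration = float(last.split(',')[0])  # total duration taken from the time column
os.remove(tmp.name)
```

After this call the file can simply be opened again and read from the start, since `read_last_line` uses its own file handle.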
fekioh
11
Also, I'm not very familiar with MySQL. Is there no alternative "large list" implementation that, say, stores the data on disk and loads a chunk ("page") into RAM at a time?
Aug 14 '10 #6
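One possible stdlib-only answer to the "page at a time" question is to convert each numeric column once into a raw binary file of fixed-size doubles; any block can then be loaded by seeking to `start * itemsize`. This is a rough sketch under that assumption, and `write_column`, `read_block` and the file name are hypothetical names, not an existing library API.

```python
import os
import tempfile
from array import array

ITEM_SIZE = array('d').itemsize  # bytes per stored double

def write_column(path, values, batch=100000):
    """Write an iterable of floats to a raw binary file, a batch at a time."""
    buf = array('d')
    with open(path, 'wb') as f:
        for v in values:
            buf.append(v)
            if len(buf) >= batch:
                buf.tofile(f)
                buf = array('d')
        if buf:
            buf.tofile(f)

def read_block(path, start, count):
    """Load elements [start, start + count) of a column file into memory."""
    total = os.path.getsize(path) // ITEM_SIZE
    count = max(0, min(count, total - start))  # clamp to what the file holds
    block = array('d')
    with open(path, 'rb') as f:
        f.seek(start * ITEM_SIZE)
        block.fromfile(f, count)
    return block

# Demo with ten values in a temporary file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), 'column1.bin')
write_column(path, (float(i) for i in range(10)), batch=4)
middle = read_block(path, 3, 4).tolist()
tail = read_block(path, 8, 5).tolist()
os.remove(path)
```

Only the requested block is ever in RAM, and because every element has the same size, access to an arbitrary block costs one seek plus one read. The files are not portable across machines with different byte order.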
dwblas
626 Expert 512MB
You may want to use SQL, but since you do not say what specifically you want to access or how you want to do it, it is difficult to tell whether using a list is the best way. Most of us have code generators for quick and dirty apps, so below is a generated SQL example of what you might want to do, using SQLite which comes with Python (code comments are sparse though). I don't want to waste time on something that may not be used, so post back if you want more info.
    import random
    import sqlite3 as sqlite

    class SQLTest:
        def __init__(self):
            self.SQL_filename = './SQLtest.SQL'
            self.open_files()

        ##----------------------------------------------------------------------
        def add_rec(self, val_tuple):
            self.cur.execute('INSERT INTO example_dbf values (?,?,?,?,?)', val_tuple)
            self.con.commit()

        ##----------------------------------------------------------------------
        def list_all_recs(self):
            self.cur.execute("select * from example_dbf")
            recs_list = self.cur.fetchall()
            for rec in recs_list:
                print(rec)

        ##----------------------------------------------------------------------
        def lookup_date(self, date_in):
            self.cur.execute("select * from example_dbf where st_date==:dic_lookup",
                             {"dic_lookup": date_in})
            recs_list = self.cur.fetchall()
            print()
            print("lookup_date")
            for rec in recs_list:
                print("%3d %9s %10.6f %3d  %s" % (rec[0], rec[1], rec[2], rec[3], rec[4]))

        ##----------------------------------------------------------------------
        def lookup_2_fields(self, lookup_dic):
            self.cur.execute("select * from example_dbf where st_date==:dic_field_1 and st_int==:dic_field_2", lookup_dic)

            recs_list = self.cur.fetchall()
            print()
            print("lookup_2_fields")
            if len(recs_list):
                for rec in recs_list:
                    print(rec)
            else:
                print("no recs found")

        ##----------------------------------------------------------------------
        def open_files(self):
            ##  a connection to the database file
            self.con = sqlite.connect(self.SQL_filename)

            # Get a Cursor object that operates in the context of Connection con
            self.cur = self.con.cursor()

            ##--- CREATE TABLE ONLY IF IT DOESN'T EXIST
            self.cur.execute("CREATE TABLE IF NOT EXISTS example_dbf(st_rec_num int, st_date varchar, st_float, st_int int, st_lit varchar)")

    ##===================================================================
    if __name__ == "__main__":
        ST = SQLTest()

        # add some records with the format
        #     record_number  date  float  int  string
        rec_num = 0
        ccyy = 2010
        for x in range(1, 11):
            rec_num += 1
            mm = x + 1
            dd = x + 2
            date = "%d%02d%02d" % (ccyy, mm, dd)
            add_fl = random.random() * 1000
            add_int = random.randint(1, 21)
            lit = "test lit # %d" % (x)
            ST.add_rec((rec_num, date, add_fl, add_int, lit))

        ## add duplicate dates for testing
        for x in range(1, 3):
            for y in range(2):
                rec_num += 1
                mm = x + 1
                dd = x + 2
                date = "%d%02d%02d" % (ccyy, mm, dd)
                add_fl = random.random() * 1000
                add_int = random.randint(1, 21)
                lit = "test lit # %d" % (x)
                ST.add_rec((rec_num, date, add_fl, add_int, lit))

        ST.list_all_recs()
        ST.lookup_date("20100203")

        lookup_dict = {"dic_field_1": "20100203",
                       "dic_field_2": 10}
        ST.lookup_2_fields(lookup_dict)
Aug 15 '10 #7
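For the per-day or per-week means asked about earlier in the thread, SQLite can do the grouping itself. The sketch below is separate from the generated example above and makes its own assumptions: a hypothetical `measurements` table with a time column in seconds, an in-memory database, and fabricated hourly sample rows. Rows are inserted with `executemany` inside a single transaction, which avoids one commit per record.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # a file name here would keep the data on disk
con.execute("CREATE TABLE measurements "
            "(row_num INTEGER PRIMARY KEY, t_seconds REAL, value REAL)")

# Hypothetical data: one reading per hour over two days.
rows = [(i, i * 3600.0, float(i)) for i in range(48)]
with con:  # one transaction for the whole batch
    con.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

def block_means(con, seconds_per_block):
    """Return (block_id, row_count, mean_value) per time block of the given length."""
    cur = con.execute(
        "SELECT CAST(t_seconds / ? AS INTEGER) AS block_id, COUNT(*), AVG(value) "
        "FROM measurements GROUP BY block_id ORDER BY block_id",
        (seconds_per_block,))
    return cur.fetchall()

daily = block_means(con, 86400.0)
weekly = block_means(con, 7 * 86400.0)
```

Changing the block length only changes the query argument, so the text file needs to be loaded into the database once. SQLite has no built-in standard deviation aggregate, so that statistic would have to be computed in Python from the selected rows.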
fekioh
11
Thank you, I will look into this tomorrow and post back if I run into trouble.

Cheers!
Aug 15 '10 #8
