Large lists in Python

Hello,

I need to store data in large lists (~1e7 elements), and I often get a memory error in code that looks like:

f = open('data.txt','r')
for line in f:
    list1.append(line.split(',')[1])
    list2.append(line.split(',')[2])
    # etc.

I get the error when reading in the data, but I don't really need all elements to be stored in RAM all the time. I work with chunks of that data.

So, more specifically, I have to read in ~10,000,000 entries (strings and numbers) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations and get summary statistics (means, standard deviations, etc.) for blocks of, say, 500,000. Fast access to these blocks is needed!

I also need to read everything in at once (so no f.seek() etc. to read the data a block at a time).

Any advice on how to achieve this? Platform: Windows XP.

Cheers!

Aug 14 '10 #1


bvdet


My thought would be to read in a range of lines at a time, process those lines and move on to the next range, storing the results in a file as needed.

This function reads in a range of lines:

def fileLineRange(fn, start, end):
    # Return lines start..end (1-based, inclusive) of file fn, stripped.
    f = open(fn)
    # Skip the lines before the requested range.
    for i in xrange(start-1):
        try:
            f.next()
        except StopIteration:
            f.close()
            return "Start line %s is beyond end of file." % (start)

    outputList = []
    for i in xrange(start, end+1):
        outputList.append(f.next().strip())
    f.close()
    return outputList

fileLineRange(fn, 700, 720) would read in lines 700 through 720.
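
A minimal sketch of the "process one range, then move on" idea, written for current Python 3 and not the Python 2 used in this thread. It assumes comma-separated lines with a numeric value in the chosen column; the name block_stats and the sample lines are hypothetical. Any iterable of lines works, including an open file, so the file is read once, front to back, and only one block is held in memory at a time.

```python
import itertools
import math

def block_stats(lines, column, block_size):
    """Yield (count, mean, stdev) of one numeric column, one block of lines at a time."""
    it = iter(lines)
    while True:
        # Pull at most block_size lines; earlier blocks are not kept in memory.
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        values = [float(line.split(',')[column]) for line in block]
        n = len(values)
        mean = sum(values) / n
        # Population standard deviation of this block.
        stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        yield n, mean, stdev

# Hypothetical sample data standing in for lines of data.txt.
sample = ["a,1.0,x", "b,3.0,y", "c,5.0,z"]
results = list(block_stats(sample, 1, 2))
```

With a real file, passing the open file object (for example, block_stats(f, 1, 500000) inside a with open('data.txt') as f: block) would process it in blocks of 500,000 lines.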

Aug 14 '10 #2

fekioh

Yes, I thought of that. Problem is: (i) I need to be able to calculate statistics for different block sizes without having to read the file over and over again and (ii) I need to know some info from the very last line (files have a time-column, started the same time but are not equally long).

Is there any way to store the whole thing in some kind of data structure (e.g. by creating a class "extending" list or something)? Sorry for the Java terminology :)

Aug 14 '10 #3

bvdet


You are only truly reading a group of lines at a time, but I understand that it might not be the most efficient way. You should consider storing all the data in a MySQL database for efficient access. MySQLdb is the Python interface.

An afterthought to the code I posted. In case the end line number is greater than the number of lines, I added a try/except block:

def fileLineRange(fn, start, end):
    # Return lines start..end (1-based, inclusive) of file fn, stripped.
    f = open(fn)
    # Skip the lines before the requested range.
    for i in xrange(start-1):
        try:
            f.next()
        except StopIteration:
            f.close()
            return "Start line %s is beyond end of file." % (start)

    outputList = []
    for i in xrange(start, end+1):
        try:
            outputList.append(f.next().strip())
        except StopIteration:
            # The file ended before line number 'end' was reached.
            print "The last line in the file is line number %s." % (i-1)
            break
    f.close()
    return outputList

Aug 14 '10 #4

fekioh

Hmm, sorry I wasn't very clear. What I meant is:

(i) The files contain roughly month-long measurements, and once I've read a file in I'd like to be able to get e.g. per-day or per-week means, or, for a specific file, to focus on the first hours. That's what I meant by not wanting to read the whole thing over and over again.

(ii) As for the last line, I guess it's not a big issue. I just need to know the total duration of the measurements up front to do some of the calculations. I guess I should just read the last line first and then go back to the start of the file.
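
One way to get at the last line first without reading the whole file is to seek to the end and read backwards in small chunks. The sketch below is Python 3 and rests on assumptions: the helper name read_last_line is hypothetical, the file is assumed to be UTF-8 text, and the demo assumes the time value is the first comma-separated column, which the thread does not state.

```python
import os
import tempfile

def read_last_line(path, chunk_size=4096):
    """Return the last line of a text file (ignoring trailing newlines)
    without reading the whole file."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b''
        # Read backwards in chunks until the tail holds a complete last line.
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            # A newline before the final line's text means that line is complete.
            if b'\n' in data.rstrip(b'\r\n'):
                break
        lines = data.rstrip(b'\r\n').splitlines()
        return lines[-1].decode() if lines else ''

# Demo on a small temporary file; the column layout is an assumption.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("0,1.5\n10,2.5\n20,3.5\n")
last = read_last_line(tmp.name)
duration = float(last.split(',')[0])  # time value taken from the first column
os.remove(tmp.name)
```

After this call the file can simply be opened again and read from the start, so no seeking back is needed in the main loop.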

Aug 14 '10 #5

fekioh

Also, I'm not very familiar with MySQL. Is there no alternative "large list" implementation that, say, stores the data on disk and loads a chunk ("page") into RAM at a time?
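
For purely numeric columns, one standard-library option is the array module: the values are written once to a raw binary file, and any block can later be loaded by seeking to its offset. This is only a sketch in Python 3; the function names write_column and read_block and the file name col1.bin are hypothetical, and string columns would need different handling.

```python
import array
import os
import shutil
import tempfile

ITEM = array.array('d').itemsize  # bytes per stored float

def write_column(values, path):
    """Write an iterable of floats to a raw binary file, buffering in small batches."""
    with open(path, 'wb') as f:
        buf = array.array('d')
        for v in values:
            buf.append(v)
            if len(buf) >= 65536:
                buf.tofile(f)
                buf = array.array('d')
        buf.tofile(f)  # flush whatever is left

def read_block(path, start, count):
    """Load elements [start, start + count) from the binary file into an array."""
    block = array.array('d')
    total = os.path.getsize(path) // ITEM
    # Clamp so fromfile() is never asked for more items than the file holds.
    count = max(0, min(count, total - start))
    with open(path, 'rb') as f:
        f.seek(start * ITEM)
        block.fromfile(f, count)
    return block

# Demo with ten values in a temporary directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'col1.bin')
write_column((float(i) for i in range(10)), path)
block = read_block(path, 4, 3)   # elements 4, 5 and 6
mean = sum(block) / len(block)
tail = read_block(path, 8, 5)    # only two elements remain past index 8
shutil.rmtree(tmpdir)
```

Even kept fully in memory, an array of doubles takes 8 bytes per element, so 1e7 values come to roughly 80 MB per column, far less than a list of Python strings.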

Aug 14 '10 #6

dwblas


You may want to use SQL, but since you do not say what specifically you want to access or how you want to do it, it is difficult to tell whether using a list is the best way. Most of us have code generators for quick and dirty apps, so below is a generated SQL example of what you might want to do, using SQLite which comes with Python (code comments are sparse though). I don't want to waste time on something that may not be used, so post back if you want more info.

import random
import sqlite3 as sqlite

class SQLTest:
   def __init__( self ) :
      self.SQL_filename = './SQLtest.SQL'
      self.open_files()

   ##----------------------------------------------------------------------
   def add_rec( self, val_tuple) :
      self.cur.execute('INSERT INTO example_dbf values (?,?,?,?,?)', val_tuple)
      self.con.commit()

   ##----------------------------------------------------------------------
   def list_all_recs( self ) :
      self.cur.execute("select * from example_dbf")
      recs_list = self.cur.fetchall()
      for rec in recs_list:
         print rec

   ##----------------------------------------------------------------------
   def lookup_date( self, date_in ) :
      self.cur.execute("select * from example_dbf where st_date==:dic_lookup",
              {"dic_lookup":date_in})
      recs_list = self.cur.fetchall()
      print
      print "lookup_date"
      for rec in recs_list:
         print "%3d %9s %10.6f %3d  %s" % (rec[0], rec[1], rec[2], rec[3], rec[4])

   ##----------------------------------------------------------------------
   def lookup_2_fields( self, lookup_dic ) :
      self.cur.execute("select * from example_dbf where st_date==:dic_field_1 and st_int==:dic_field_2", lookup_dic)

      recs_list = self.cur.fetchall()
      print
      print "lookup_2_fields"
      if len(recs_list):
         for rec in recs_list:
            print rec
      else:
         print "no recs found"

   ##----------------------------------------------------------------------
   def open_files( self ) :
      ##  a connection to the database file
      self.con = sqlite.connect(self.SQL_filename)

      # Get a Cursor object that operates in the context of Connection con
      self.cur = self.con.cursor()

      ##--- CREATE FILE ONLY IF IT DOESN'T EXIST
      self.cur.execute("CREATE TABLE IF NOT EXISTS example_dbf(st_rec_num int, st_date varchar, st_float, st_int int, st_lit varchar)")

##===================================================================
if __name__ == "__main__":
   ST = SQLTest()

   """ add some records with the format
       record_number  date  float  int  string
   """
   rec_num = 0
   ccyy = 2010
   for x in range(1, 11):
      rec_num += 1
      mm = x + 1
      dd = x + 2
      date = "%d%02d%02d" % (ccyy, mm, dd)
      add_fl = random.random() * 1000
      add_int = random.randint(1, 21)
      lit = "test lit # %d" % (x)
      ST.add_rec( (rec_num, date, add_fl, add_int, lit) )

   ## add duplicate dates for testing
   for x in range(1, 3):
      for y in range(2):
         rec_num += 1
         mm = x + 1
         dd = x + 2
         date = "%d%02d%02d" % (ccyy, mm, dd)
         add_fl = random.random() * 1000
         add_int = random.randint(1, 21)
         lit = "test lit # %d" % (x)
         ST.add_rec( (rec_num, date, add_fl, add_int, lit) )

   ST.list_all_recs()
   ST.lookup_date("20100203")

   lookup_dict = {"dic_field_1":"20100203",
                  "dic_field_2":10}
   ST.lookup_2_fields(lookup_dict)
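
For the block statistics asked about earlier in the thread, SQLite can do the grouping itself. The sketch below is Python 3 with the standard sqlite3 module; the table name samples, its columns and the generated data are hypothetical stand-ins for the real file. It loads rows with executemany and a single commit (committing after every row is likely to be slow for millions of records) and then groups by record number divided by the block size.

```python
import sqlite3

# ':memory:' keeps the demo self-contained; a file name would persist the data on disk.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE samples (rec_num INTEGER PRIMARY KEY, t REAL, value REAL)")

# Hypothetical data: 10 records with value = 2 * rec_num.
rows = [(i, float(i), 2.0 * i) for i in range(10)]
cur.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
con.commit()  # one commit for the whole batch

# Integer division of rec_num by the block size assigns each row to a block.
block_size = 5
cur.execute(
    "SELECT rec_num / ? AS blk, COUNT(*), AVG(value) "
    "FROM samples GROUP BY blk ORDER BY blk",
    (block_size,),
)
block_means = cur.fetchall()  # rows of (block index, row count, mean)
con.close()
```

Changing block_size and rerunning the query gives statistics for a different block length without reloading the text file. As far as I know, core SQLite has no built-in standard deviation aggregate, so a stdev would have to be derived from AVG(value * value) and AVG(value), or computed in Python.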

Aug 15 '10 #7

fekioh

Thank you, I will look into this tomorrow and post back if I run into trouble.

Cheers!

Aug 15 '10 #8
