Bytes IT Community

Memory management for Large dataset in python

P: 25
Hi Everyone,

I am working on a Latin dataset and I am supposed to read all the data from ~30,000 files. What I did was: I opened and read each file, wrote its contents into one separate combined file, and then closed the individual file. But the combined file is not closed until the last file has been read.

At the end I need one file containing the contents of all ~30,000 files.

My program runs fine for a small dataset (up to about 900 files). Whenever the dataset grows beyond 900 files, my program gets stuck and does not perform the desired processing.

Managing such a large file is difficult in this situation. Please guide me on how I can handle it; I need one file at the end because I have to use it for training.

Please help me in this scenario.
Thanks a lot.
Mar 26 '12 #1
5 Replies

Expert 100+
P: 626
You should open the output file, then open, read, write, and close each of the input files before processing the next one. I don't understand what is meant by
"But the combined file is not closed until the last file has been read."
since you close it after all the files are processed.
output = open(combined_file, "w")
for fname in list_of_30000:
    fp = open(fname, "r")
    for rec in fp:
        output.write(rec)
    fp.close()
output.close()
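If memory is a concern, a chunked binary copy keeps memory use flat no matter how large the individual files are. A minimal sketch using the standard library's shutil.copyfileobj (the two small sample files created here are stand-ins for the real ~30,000 inputs):

```python
import shutil

# Hypothetical sample inputs; in practice this would be the
# list of ~30,000 file names.
list_of_files = ["part1.txt", "part2.txt"]
for i, name in enumerate(list_of_files):
    with open(name, "w") as f:
        f.write("contents of file %d\n" % i)

# Append each input to one combined file in fixed-size chunks,
# so only one chunk is held in memory at a time.
with open("combined.txt", "wb") as output:
    for name in list_of_files:
        with open(name, "rb") as fp:
            shutil.copyfileobj(fp, output, length=1024 * 1024)
```

The with-statements also guarantee every file handle is closed promptly, even if an error occurs partway through.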
Mar 26 '12 #2

P: 25
Actually, I have to read each file and save its contents into a separate netCDF file. That means all the files' contents will be written into one file; at the end I will have one file which holds the contents of all the files (in my case, ~30,000 files).
Mar 26 '12 #3

P: 25
I currently read, write, and close every file, but the output file remains open until all 30,000 have been read, written, and closed.
Mar 26 '12 #4

Expert 100+
P: 626
That is correct. Also, are you sure you are not running out of disk space? The copy may require twice the disk space of the 30,000 files. You will have to post your code for any more detailed assistance.
Mar 27 '12 #5

P: 25
Yes, it seems that I am running out of disk space. Whenever the number of files grows to 900 or more, the program hangs: it shows no error message but makes no further progress. I have already used the garbage collector function gc.collect(), but that did not solve the problem either.
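One way to confirm the disk-space theory is to check free space on the output drive before and during the run. A small sketch using shutil.disk_usage (available since Python 3.3; the path "." and the 2 GB threshold are just illustrative assumptions):

```python
import shutil

# Check free space on the filesystem holding the output file;
# "." is a placeholder for the actual output directory.
usage = shutil.disk_usage(".")
free_gb = usage.free / float(1024 ** 3)
print("free: %.2f GiB" % free_gb)

# Hypothetical guard: warn early instead of hanging mid-copy
# when the combined file is unlikely to fit.
if usage.free < 2 * 10 ** 9:
    print("Warning: less than ~2 GB free; the combined file may not fit.")
```

Note that gc.collect() only frees Python heap memory; it cannot reclaim disk space, so a disk-full condition would be unaffected by it.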
Mar 27 '12 #6
