
Too many open files

AMD
Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm; I do this one line at a
time. The problem I have is that opening a file in append mode, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all the writes and then close them
all at the end.
The problem I have under Windows is that as soon as I get to 500 files I
get the "Too many open files" message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?
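
Roughly, the per-line approach looks like this (a simplified sketch rather
than my actual code; bucket_for() is just a stand-in for the real hash
function):

# Simplified sketch of the slow per-line approach: open in append mode,
# write one line, close again, for every line of the 10 GB input.
# bucket_for() is only an illustrative stand-in for the real hash.

def bucket_for(line, num_buckets):
    return hash(line.split(",", 1)[0]) % num_buckets

def split_per_line(source_path, num_buckets=3000):
    source = open(source_path)
    for line in source:
        bucket = bucket_for(line, num_buckets)
        out = open("bucket_%04d.txt" % bucket, "a")
        out.write(line)
        out.close()
    source.close()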

Thanks in advance for any answers!

Andre M. Descombes
Feb 4 '08 #1
6 Replies


Jeff
Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.
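
Something along these lines, as a rough sketch only (Python 3 module names;
the hash and the worker count are just placeholders, and with several
writers the order of lines within one output file is no longer guaranteed):

import queue
import threading

def writer(work_queue):
    # Each worker pulls (filename, line) pairs off the queue and appends them.
    while True:
        item = work_queue.get()
        if item is None:                      # sentinel: no more work
            break
        filename, line = item
        out = open(filename, "a")
        out.write(line)
        out.close()

def split_with_threads(source_path, num_buckets=3000, num_workers=50):
    work_queue = queue.Queue(maxsize=10000)   # bounded so reading can't run away
    workers = [threading.Thread(target=writer, args=(work_queue,))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    source = open(source_path)
    for line in source:
        bucket = hash(line) % num_buckets     # stand-in for the real hash
        work_queue.put(("bucket_%04d.txt" % bucket, line))
    source.close()
    for _ in workers:
        work_queue.put(None)                  # one sentinel per worker
    for w in workers:
        w.join()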
Feb 4 '08 #2

Steven
On Mon, 04 Feb 2008 13:57:39 +0100, AMD wrote:
> The problem I have under Windows is that as soon as I get to 500 files I
> get the "Too many open files" message. I tried the same thing in Delphi
> and I can get to 3000 files. How can I increase the number of open files
> in Python?

Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

http://forums.devx.com/archive/index.php/t-136946.html

It's almost certainly not a Python problem, because under Linux I can
open 1000+ files without blinking.

I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise if there's a way to increase that limit from 512?
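
One guess, offered as an untested assumption rather than anything verified:
the 512 ceiling looks like the Microsoft C runtime's default limit on stdio
streams, and the CRT exports _setmaxstdio() to raise it, which can be
reached from Python via ctypes. Whether it actually helps depends on
Python's file objects going through that same CRT:

import ctypes

# Untested sketch: ask the system CRT (msvcrt.dll) to allow more stdio
# streams. Note that Python may be linked against a different CRT DLL,
# in which case this call would not affect it.
msvcrt = ctypes.cdll.msvcrt
new_limit = msvcrt._setmaxstdio(2048)   # returns the new maximum, or -1 on failure
print("max stdio streams now: %d" % new_limit)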

--
Steven
Feb 4 '08 #3

Christian Heimes
Jeff wrote:
> Why don't you start around 50 threads at a time to do the file
> writes? Threads are effective for IO. You open the source file,
> start a queue, and start sending data sets to be written to the
> queue. Your source file processing can go on while the writes are
> done in other threads.

I'm sorry, but you are totally wrong. Threads are a very bad idea for
IO-bound operations. Asynchronous event IO is the best answer for any
IO-bound problem. That means select, poll, epoll, kqueue or IOCP.

Christian

Feb 4 '08 #4

Larry
AMD wrote:
> Hello,
>
> I need to split a very big file (10 gigabytes) into several thousand
> smaller files according to a hash algorithm; I do this one line at a
> time. The problem I have is that opening a file in append mode, writing
> the line and closing the file is very time consuming. I'd rather have
> the files all open for the duration, do all the writes and then close them
> all at the end.
> The problem I have under Windows is that as soon as I get to 500 files I
> get the "Too many open files" message. I tried the same thing in Delphi
> and I can get to 3000 files. How can I increase the number of open files
> in Python?
>
> Thanks in advance for any answers!
>
> Andre M. Descombes

Not quite sure what you mean by "a hash algorithm", but if you sort the file
(with an external sort program) on what you want to split on, then you only
need one file open at a time.
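
Something like this, assuming the input has already been sorted on the
split key by an external sort, so all lines for one output file arrive
consecutively (key_for() is just a stand-in for whatever you split on):

# Sketch of the sort-first approach: with the input sorted on the key,
# only one output file ever needs to be open at a time.
def key_for(line):
    return line.split(",", 1)[0]          # stand-in: split on the first field

def split_sorted(sorted_path):
    current_key = None
    out = None
    source = open(sorted_path)
    for line in source:
        key = key_for(line)
        if key != current_key:            # key changed: switch output files
            if out is not None:
                out.close()
            out = open("part_%s.txt" % key, "w")
            current_key = key
        out.write(line)
    if out is not None:
        out.close()
    source.close()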

-Larry
Feb 4 '08 #5

Gabriel Genellina
On Mon, 04 Feb 2008 12:50:15 -0200, Christian Heimes <li***@cheimes.de>
wrote:
> Jeff wrote:
>> Why don't you start around 50 threads at a time to do the file
>> writes? Threads are effective for IO. You open the source file,
>> start a queue, and start sending data sets to be written to the
>> queue. Your source file processing can go on while the writes are
>> done in other threads.
>
> I'm sorry, but you are totally wrong. Threads are a very bad idea for
> IO-bound operations. Asynchronous event IO is the best answer for any
> IO-bound problem. That means select, poll, epoll, kqueue or IOCP.

The OP said that he has this problem on Windows. The available methods
that I am aware of are:
- using synchronous (blocking) I/O with multiple threads
- asynchronous I/O using OVERLAPPED and wait functions
- asynchronous I/O using IO completion ports

Python does not (natively) support any of the latter ones, only the first.
I don't have any evidence proving that it's a very bad idea, as you claim;
although I wouldn't use 50 threads as suggested above, but rather a few more
than the number of CPU cores.

--
Gabriel Genellina

Feb 4 '08 #6

AMD
Thank you every one,

I ended up using a solution similar to what Gary Herron suggested:
caching the output in a list of lists, one per file, and only doing the
IO when a list reaches a certain threshold.
After playing around with the threshold I ended up with faster
execution times than originally, while having a maximum of two files
open at a time! It's only a matter of trading memory for open files.
It could be that using this strategy with asynchronous IO or threads
could yield even faster times, but I haven't tested it.
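
Roughly the shape of it (a simplified sketch rather than the actual code;
the hash is again just a stand-in, and the threshold is whatever worked
best in testing):

# Sketch of the buffering strategy: keep a list of pending lines per
# output file and only open that file, in append mode, once its buffer
# passes the threshold. At most one output file is open at any moment.
from collections import defaultdict

def split_buffered(source_path, num_buckets=3000, threshold=1000):
    buffers = defaultdict(list)               # bucket number -> pending lines
    source = open(source_path)
    for line in source:
        bucket = hash(line) % num_buckets     # stand-in for the real hash
        pending = buffers[bucket]
        pending.append(line)
        if len(pending) >= threshold:
            out = open("bucket_%04d.txt" % bucket, "a")
            out.writelines(pending)
            out.close()
            del pending[:]                    # clear the buffer in place
    source.close()
    for bucket, pending in buffers.items():   # flush whatever is left over
        if pending:
            out = open("bucket_%04d.txt" % bucket, "a")
            out.writelines(pending)
            out.close()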
Again, thanks very much for all your suggestions; they are much appreciated.

Andre M. Descombes
Feb 5 '08 #7
