Code For Five Threads To Process Multiple Files?

All,

I'd appreciate any help. I've got a list of files in a directory, and
I'd like to iterate through that list and process each one. Rather
than do that serially, I was thinking I should start five threads and
process five files at a time.

Is this a good idea? I picked the number five at random... I was
thinking that I might check the number of processors and start a
multiple of that, but then I remembered KISS and it seemed that that
was too complicated.

If it's not a horrible idea, would anyone be able to provide some
quick code as to how to do that? Any and all help would be greatly
appreciated!

Thanks in advance!
Jun 27 '08 #1
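On the question of sizing the pool from the processor count: a minimal sketch, assuming Python 3, where `os.cpu_count()` reports the number of CPUs (or `None` if it cannot tell). The helper name `pick_worker_count` and its `per_cpu` and `fallback` parameters are made up for illustration; the fallback of five simply mirrors the number picked in the post.

```python
import os


def pick_worker_count(per_cpu=1, fallback=5):
    """Return a worker count derived from the CPU count.

    per_cpu and fallback are illustrative knobs, not anything from the post.
    """
    cpus = os.cpu_count()          # may be None if it cannot be determined
    if cpus is None:
        return fallback
    return max(1, cpus * per_cpu)


print(pick_worker_count())
```

Whether a multiple of the CPU count actually helps depends on how much of the work is disk I/O, so it is worth measuring rather than assuming.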
On 2008-05-21, td****@gmail.com <td****@gmail.com> wrote:
> I'd appreciate any help. I've got a list of files in a directory, and
> I'd like to iterate through that list and process each one. Rather
> than do that serially, I was thinking I should start five threads and
> process five files at a time.
>
> Is this a good idea? I picked the number five at random... I was

Depends what you are doing.
If you are mainly reading and writing files, there is not much to gain, since one
process will already push the disk I/O system to its limit. If you do a lot of
processing, then using more threads than there are processors is not much use. If
you have more 'bursty' behavior (first a lot of reading, then a lot of
processing, then reading again, etc.), then the system may be able to do some
scheduling and keep both the processors and the file system busy.

I cannot really give you advice on threading, as I have never done that. You may
want to consider an alternative, namely multi-tasking at OS level. If you can
easily split the files over a number of OS processes (written in Python), you
can make the Python program really simple, and let the OS handle the
task-switching between the programs.

Sincerely,
Albert
Jun 27 '08 #2
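The OS-level alternative Albert describes could look roughly like this in Python 3. This is only a sketch: `partition` and `run_in_processes` are hypothetical helper names, and `worker_cmd` stands for whatever command starts your own worker program (for example `[sys.executable, "worker.py"]`, where `worker.py` is an assumed script that takes file names as arguments).

```python
import subprocess
import sys


def partition(files, nprocs):
    """Deal the files round-robin into at most nprocs non-empty lists."""
    chunks = [files[i::nprocs] for i in range(nprocs)]
    return [chunk for chunk in chunks if chunk]


def run_in_processes(files, nprocs, worker_cmd):
    """Start one OS process per chunk and wait for all of them.

    worker_cmd is the command prefix of a hypothetical worker program;
    the file names of each chunk are appended as arguments.
    Returns the exit codes in chunk order.
    """
    procs = [subprocess.Popen(worker_cmd + chunk)
             for chunk in partition(files, nprocs)]
    return [proc.wait() for proc in procs]


# Demo with a trivial inline worker that just reports its arguments.
demo_worker = [sys.executable, "-c",
               "import sys; print('worker got', sys.argv[1:])"]
print(run_in_processes(["a.txt", "b.txt", "c.txt"], 2, demo_worker))
```

Each worker stays a plain serial program, and the operating system takes care of switching between the processes.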
Albert,

Thanks for your response - I appreciate your time!

I am mainly reading and writing files, so it seems like it might not
be a good idea. What if I read the whole file into memory first, and
operate on it there? They are not large files...

Either way, I'd hope that someone might respond with an example, as
then I could test and see which is faster!

Thanks again.
Jun 27 '08 #3
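For the "test and see which is faster" idea, a rough timing harness might look like the following. It is a sketch under several assumptions: Python 3, `concurrent.futures.ThreadPoolExecutor` standing in for hand-rolled threads, and a made-up `process_file` that only reads the file and counts its lines. The sample files are generated in a temporary directory, so none of the names below come from the original post.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def process_file(path):
    """Stand-in for the real work: read the whole file, count its lines."""
    with open(path) as handle:
        return len(handle.readlines())


def run_serial(paths):
    return [process_file(path) for path in paths]


def run_threaded(paths, workers=5):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))   # keeps input order


def make_sample_files(directory, count=20, lines=1000):
    """Create small throwaway files to measure against."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, "sample%02d.txt" % i)
        with open(path, "w") as handle:
            handle.write("some text\n" * lines)
        paths.append(path)
    return paths


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    paths = make_sample_files(tmp)
    serial, t_serial = timed(run_serial, paths)
    threaded, t_threaded = timed(run_threaded, paths)
    print("serial:   %.4f s" % t_serial)
    print("threaded: %.4f s" % t_threaded)
```

Small, freshly written files are likely to be served from the OS cache, so timings on real data may look quite different.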
Ah, well, I didn't get any other responses, but here's what I've done:

loopCount = 0
for l in range(len(self.filesToProcess)):
    threads = []
    try:
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))
        msg = "Processing file...\n"
        for thread in threads:
            wx.CallAfter(self.textctrl03.write(msg), thread.start())
        for thread in threads:
            thread.join()
        loopCount += 5
    except IndexError:
        pass

It works, and it works well. It starts five threads, and processes
five files at a time. (In the "self.processFiles" I read the whole
file into memory using readlines(), which works well.)

Of course, now the wx.CallAfter function doesn't work... I get
"TypeError: 'NoneType' object is not callable" for every time it is
run...
Jun 27 '08 #4
On May 23, 12:20 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Thu, 22 May 2008 11:03:48 -0700 (PDT), tda...@gmail.com declaimed the
following in comp.lang.python:
> Ah, well, I didn't get any other responses, but here's what I've done:

Apparently the direct email from my work address did not get through
(I don't have group posting ability from work).

> loopCount = 0
> for l in range(len(self.filesToProcess)):
>     threads = []
>     try:
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))

Python lists index from 0... So this will be 0+0, first entry in the
file list

>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))

This is 0+2, THIRD entry in the file list -- you've just skipped
over the second entry...

>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))

Very ugly... Also going to fail for other reasons... Consider:

filestoprocess = [ 'file1', 'file2', 'file3' ]
for jnk in range(len(filestoprocess)):  #this will loop three times!
                                        #jnk = 0, 1, 2

You proceed to create FIVE threads (or try to) when there are only
THREE files... It will fail as soon as it tries loopCount+3 (fourth
entry in a three element list)

>         msg = "Processing file....\n"
>         for thread in threads:
>             wx.CallAfter(self.textctrl03.write(msg), thread.start())

Is this running as the main controller of some GUI? If so....

>         for thread in threads:
>             thread.join()

Your GUI will essentially freeze since it can't process events
(including screen updates) until the entire function you are in returns
to the event handler... But .join() blocks until the specified thread
really finishes...

>         loopCount += 5
>     except IndexError:
>         pass

BAD style -- if you are going to trap an exception, you should do
something with it... But then, the only reason you would GET this
exception is because the preceding code is looping too many times
relative to the number of files...

As shown, with three files, you will create the first thread (0) for
first file, skip the second file creating the second thread (1) for the
third file, and raise an exception on trying to create the third thread
(2) when you try to access a fourth file in the list. The exception
will be raised -- SKIPPING over the thread.start() calls, and skipping
the thread.join() calls. You then ignore the error, and go back to the
start of the loop where the index is now "1"... AND reset the thread
list, so threads 0&1 are forgotten, never started, never joined, garbage
collected...

Again, you now create a thread (0) giving it the second file (since
loopCount was never incremented, and the first thread is using loopCount
+ <loopindex>), create thread (1) giving it the third file, raise the
exception... repeat

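One way to avoid the index arithmetic discussed here is to slice the list into groups of five, so a short final group is handled without any IndexError. A minimal Python 3 sketch; the helper name `batches` and the sample file names are invented for illustration.

```python
def batches(items, size=5):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


files = ["file%d" % n for n in range(1, 8)]   # seven made-up names
for batch in batches(files):
    print(batch)
```

Slicing past the end of a list simply returns the items that exist, so every file appears in exactly one batch.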
> It works, and it works well. It starts five threads, and processes
> five files at a time. (In the "self.processFiles" I read the whole
> file into memory using readlines(), which works well.)

It only works as long as loopCount+5 is less than the number of
files in the list... AND at that, it skips one file and double processes
another...

> Of course, now the wx.CallAfter function doesn't work... I get
> "TypeError: 'NoneType' object is not callable" for every time it is
> run...

Probably because it wants you to supply it with one or two
/callable/ functions... but you are actually calling the functions and
passing it the results of the called functions (and they aren't
returning anything -- None).
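The difference between passing a callable and passing the result of a call can be shown with `threading.Thread` alone. This is a small Python 3 sketch; `process` and the `calls` list are made-up names. Writing `target=process("a")` runs `process` immediately in the calling thread and hands its return value (`None`) to the thread, whereas `target=process, args=("b",)` lets the new thread make the call.

```python
import threading

calls = []   # records (argument, name of the thread that ran the call)


def process(name):
    calls.append((name, threading.current_thread().name))


# Wrong: process("a") is executed right here, and its return value
# (None) becomes the target, so the started thread does nothing.
t_wrong = threading.Thread(target=process("a"), name="worker-wrong")
t_wrong.start()
t_wrong.join()

# Right: hand over the function and its arguments separately.
t_right = threading.Thread(target=process, args=("b",), name="worker-right")
t_right.start()
t_right.join()

print(calls)

# The same pattern for wxPython (not run here, since it needs a GUI app):
#     wx.CallAfter(self.textctrl03.write, msg)
```

`wx.CallAfter` takes the callable first and then its arguments, which is why passing it the result of `write(msg)` ends in the "'NoneType' object is not callable" error.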

Ignoring GUI stuff... here is a simple one-job threadpool algorithm
-- you have to plug in the file list and the actual processing work. It
creates n threads, and those threads pull the work off of a common
queue; the main program only has to fill the queue with the work to be
done, and stuff a sentinel value onto the queue when it wants the
threads to die -- which would be before shutdown of the program (create
the pool at start-up, leave the threads blocked on the .get() until you
need one to process...).

-=-=-=-=-=-=-=-
#
#       Example code for a pooled thread file processor
#       NOT EXECUTABLE as is -- there is no code to obtain
#       the list of files to be processed; and the processor
#       just sleeps...

import threading
import Queue
import time         #just for demo sleep

NUMTHREADS = 5
SENTINAL = object()

workQueue = Queue.Queue()

def fileProc():         #function that handles processing of the files
    while True:
        fname = workQueue.get()
        if fname is SENTINAL:
            workQueue.put(SENTINAL)    #recycle sentinel for next
            break
        print "Processing %s" % fname
        time.sleep(3)   #replace with real file processing

threadList = []
for ti in range(NUMTHREADS):    #create worker threads
    t = threading.Thread(target=fileProc)
    t.start()
    threadList.append(t)

for fn in listOfFiles:  #queue up the file names to be worked
    workQueue.put(fn)   #need to expand to include how names are
                        #obtained

workQueue.put(SENTINAL) #signal that no more files are to be worked

for t in threadList:
    t.join()            #wait for each thread to exit (ensures main
                        #doesn't exit before all threads finish
                        #processing)

--
        Wulfraed        Dennis Lee Bieber               KD6MOG
        wlfr...@ix.netcom.com              wulfr...@bestiaria.com
                HTTP://wlfraed.home.netcom.com/
        (Bestiaria Support Staff:               web-a...@bestiaria.com)
                HTTP://www.bestiaria.com/

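The pooled example in this reply is written for Python 2 (`import Queue`, the `print` statement). A rough Python 3 adaptation might look like the sketch below. It is not the posted code: it uses the `queue` module, puts one sentinel on the queue per worker instead of recycling a single one, and collects return values on a second queue. The names `run_pool`, `worker`, and `handle` are hypothetical, and `str.upper` merely stands in for real file processing.

```python
import queue
import threading

NUM_THREADS = 5
SENTINEL = object()   # unique marker meaning "no more work"


def worker(work_queue, results, handle):
    """Pull items off the queue until the sentinel shows up."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        results.put(handle(item))


def run_pool(file_names, handle, num_threads=NUM_THREADS):
    """Process every name with `handle` on a small pool of threads."""
    work_queue = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker,
                                args=(work_queue, results, handle))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for name in file_names:
        work_queue.put(name)
    for _ in threads:
        work_queue.put(SENTINEL)       # one sentinel per worker
    for thread in threads:
        thread.join()                  # all work is done after this
    collected = []
    while not results.empty():         # safe: no producers are left
        collected.append(results.get())
    return collected


processed = run_pool(["a.txt", "b.txt", "c.txt"], handle=str.upper)
print(sorted(processed))
```

Results arrive in whatever order the workers finish, so the caller sorts them (or carries the file name along with each result) if order matters.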
Thanks for the information! I can definitely see what you're talking
about, and the Exception is only "pass" right now while I am working
on the code.

However, it does process every file (it doesn't skip the second one),
and I'm guessing that this is because it loops so many times? I guess
that means I am successful in spite of myself! ;-) (This wouldn't be
the first time... ;-) )

I REALLY appreciate your insights!!
Jun 27 '08 #5
