Code For Five Threads To Process Multiple Files?

All,

I'd appreciate any help. I've got a list of files in a directory, and
I'd like to iterate through that list and process each one. Rather
than do that serially, I was thinking I should start five threads and
process five files at a time.

Is this a good idea? I picked the number five at random... I was
thinking that I might check the number of processors and start a
multiple of that, but then I remembered KISS and it seemed that that
was too complicated.

If it's not a horrible idea, would anyone be able to provide some
quick code as to how to do that? Any and all help would be greatly
appreciated!

Thanks in advance!
Jun 27 '08 #1
On 2008-05-21, td****@gmail.com <td****@gmail.com> wrote:
> I'd appreciate any help. I've got a list of files in a directory, and
> I'd like to iterate through that list and process each one. Rather
> than do that serially, I was thinking I should start five threads and
> process five files at a time.
>
> Is this a good idea? I picked the number five at random... I was

Depends what you are doing.
If you are mainly reading/writing files, there is not much to gain, since one
process will already push the disk I/O system to its limit. If you do a lot of
processing, then more threads than the number of processors is not much use. If
you have more 'bursty' behavior (first a lot of reading, then a lot of
processing, then reading again, etc.), then the system may be able to do some
scheduling and keep both the processors and the file system busy.

I cannot really give you advice on threading, since I have never done that. You
may want to consider an alternative, namely multi-tasking at OS level. If you can
easily split the files over a number of OS processes (written in Python), you
can make the Python program really simple, and let the OS handle the
task-switching between the programs.
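
A minimal sketch of that OS-level approach, assuming a hypothetical worker script that processes whatever file names it receives on its command line. The names `split_round_robin` and `launch_workers` are made up for illustration; the parent program only deals out the files and waits.

```python
import subprocess
import sys


def split_round_robin(items, n):
    """Deal items into n lists, like dealing cards to n players."""
    return [items[i::n] for i in range(n)]


def launch_workers(worker_script, files, n):
    """Start up to n OS processes, one per non-empty chunk, and wait for all.

    worker_script is a hypothetical Python file that processes the
    file names passed on its command line.
    Returns the list of exit codes.
    """
    procs = [
        subprocess.Popen([sys.executable, worker_script] + chunk)
        for chunk in split_round_robin(files, n)
        if chunk
    ]
    return [p.wait() for p in procs]
```

Something like `launch_workers("process_files.py", file_names, 5)` would then run five copies side by side, with the OS doing the scheduling.
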

Sincerely,
Albert
Jun 27 '08 #2
Albert,

Thanks for your response - I appreciate your time!

I am mainly reading and writing files, so it seems like it might not
be a good idea. What if I read the whole file into memory first, and
operate on it there? They are not large files...

Either way, I'd hope that someone might respond with an example, as
then I could test and see which is faster!
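
One way such a test could look, as a rough sketch rather than a benchmark: `count_lines` is a stand-in for the real per-file work (it reads the whole file into memory first), and both timing helpers are hypothetical names. It uses `concurrent.futures`, which needs a modern Python 3 and was not available when this thread was written.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Stand-in for the real work: read the whole file, then operate on it."""
    with open(path) as fh:
        lines = fh.readlines()  # whole file in memory; fine for small files
    return len(lines)


def time_serial(paths):
    """Process the files one after another; return (results, seconds)."""
    start = time.perf_counter()
    results = [count_lines(p) for p in paths]
    return results, time.perf_counter() - start


def time_threaded(paths, workers=5):
    """Process the files with a pool of threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(count_lines, paths))  # keeps input order
    return results, time.perf_counter() - start
```

Running both on the same list of paths and comparing the two elapsed times (ideally several times, since disk caching skews the first run) shows whether five threads actually help for this workload.
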

Thanks again.
Jun 27 '08 #3
Ah, well, I didn't get any other responses, but here's what I've done:

    loopCount = 0
    for l in range(len(self.filesToProcess)):
        threads = []
        try:
            threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))
            threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))
            threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
            threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
            threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))
            msg = "Processing file...\n"
            for thread in threads:
                wx.CallAfter(self.textctrl03.write(msg),
                             thread.start())
            for thread in threads:
                thread.join()
            loopCount += 5
        except IndexError:
            pass

It works, and it works well. It starts five threads, and processes
five files at a time. (In the "self.processFiles" I read the whole
file into memory using readlines(), which works well.)

Of course, now the wx.CallAfter function doesn't work... I get
"TypeError: 'NoneType' object is not callable" every time it is
run...
Jun 27 '08 #4
On May 23, 12:20 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Thu, 22 May 2008 11:03:48 -0700 (PDT), tda...@gmail.com declaimed the
following in comp.lang.python:

> Ah, well, I didn't get any other responses, but here's what I've done:

Apparently the direct email from my work address did not get through
(I don't have group posting ability from work).

> loopCount = 0
> for l in range(len(self.filesToProcess)):
>     threads = []
>     try:
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))

Python lists index from 0... So this will be 0+0, first entry in the
file list.

>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))

This is 0+2, THIRD entry in the file list -- you've just skipped
over the second entry...

>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
>         threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))

Very ugly... Also going to fail for other reasons... Consider:

    filestoprocess = [ 'file1', 'file2', 'file3' ]
    for jnk in range(len(filestoprocess)):  # this will loop three times!
                                            # jnk = 0, 1, 2

You proceed to create FIVE threads (or try to) when there are only
THREE files... It will fail as soon as it tries loopCount+3 (fourth
entry in a three element list)

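
A small sketch of one way to walk a list in groups of five without skipping, repeating, or running off the end. The helper name `batches` is made up for illustration; the point is that slicing past the end of a list just returns a shorter slice instead of raising IndexError.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items; none skipped, none repeated."""
    for start in range(0, len(items), size):
        yield items[start:start + size]  # slicing never raises IndexError


files = ["file1", "file2", "file3"]
for batch in batches(files, 5):
    print(batch)  # a short final batch is simply shorter
```

With three files and a batch size of five, the loop body runs once and receives all three names.
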
>         msg = "Processing file....\n"
>         for thread in threads:
>             wx.CallAfter(self.textctrl03.write(msg),
>                          thread.start())

Is this running as the main controller of some GUI? If so....

>         for thread in threads:
>             thread.join()

Your GUI will essentially freeze since it can't process events
(including screen updates) until the entire function you are in returns
to the event handler... But .join() blocks until the specified thread
really finishes...
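
One possible way around the blocking, sketched without any GUI code: rather than calling .join(), check which threads are still alive and return immediately. The helper `still_running` is a hypothetical name, and the `while` loop only stands in for a GUI event loop; in wxPython the check could, for example, be driven by a `wx.Timer` event, which is an assumption about how the application is structured.

```python
import threading
import time


def still_running(threads):
    """Return the threads that have not finished yet, without blocking."""
    return [t for t in threads if t.is_alive()]


workers = [threading.Thread(target=time.sleep, args=(0.2,)) for _ in range(3)]
for t in workers:
    t.start()

# Instead of join(), a GUI could call still_running() from a periodic
# timer event and return to its event loop straight away each time.
while still_running(workers):
    time.sleep(0.05)  # stand-in for "go back to the event loop"
```
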
>         loopCount += 5
>     except IndexError:
>         pass

BAD style -- if you are going to trap an exception, you should do
something with it... But then, the only reason you would GET this
exception is because the preceding code is looping too many times
relative to the number of files...

As shown, with three files, you will create the first thread (0) for
first file, skip the second file creating the second thread (1) for the
third file, and raise an exception on trying to create the third thread
(2) when you try to access a fourth file in the list. The exception
will be raised -- SKIPPING over the thread.start() calls, and skipping
the thread.join() calls. You then ignore the error, and go back to the
start of the loop where the index is now "1"... AND reset the thread
list, so threads 0&1 are forgotten, never started, never joined, garbage
collected...

Again, you now create a thread (0) giving it the second file (since
loopCount was never incremented, and the first thread is using loopCount
+ <loopindex>), create thread (1) giving it the third file, raise the
exception... repeat

> It works, and it works well. It starts five threads, and processes
> five files at a time. (In the "self.processFiles" I read the whole
> file into memory using readlines(), which works well.)

It only works as long as loopCount+5 is less than the number of
files in the list... AND at that, it skips one file and double processes
another...

> Of course, now the wx.CallAfter function doesn't work... I get
> "TypeError: 'NoneType' object is not callable" for every time it is
> run...

Probably because it wants you to supply it with one or two
/callable/ functions... but you are actually calling the functions and
passing it the results of the called functions (and they aren't
returning anything -- None).
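
The difference between passing a callable and passing the result of a call can be seen in a few lines. This is an illustrative sketch; `process_file` and the thread names are made up. The same distinction applies to `threading.Thread(target=...)` in the quoted loop, which may also be why that loop appeared to process the files at all: each call ran right away in the calling thread.

```python
import threading

ran_in = []


def process_file(name):
    """Record which thread actually did the work."""
    ran_in.append((name, threading.current_thread().name))


# Calls process_file immediately in the current thread, then hands its
# return value (None) to Thread as the target: start() has nothing to run.
wrong = threading.Thread(target=process_file("a.txt"), name="worker-wrong")
wrong.start()
wrong.join()

# Passes the function itself plus its arguments; the new thread calls it.
right = threading.Thread(target=process_file, args=("b.txt",), name="worker-right")
right.start()
right.join()

print(ran_in)  # the first entry names the calling thread, not "worker-wrong"
```

wx.CallAfter follows the same shape: it takes the callable first and its arguments after it, as in `wx.CallAfter(self.textctrl03.write, msg)`.
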

Ignoring GUI stuff... here is a simple one-job thread pool algorithm
-- you have to plug in the file list and the actual processing work. It
creates n threads, and those threads pull the work off of a common
queue; the main program only has to fill the queue with the work to be
done, and stuff a sentinel value onto the queue when it wants the
threads to die -- which would be before shutdown of the program (create
the pool at start-up, leave the threads blocked on the .get() until you
need one to process...).

-=-=-=-=-=-=-=-
#
#       Example code for a pooled thread file processor
#       NOT EXECUTABLE as is -- there is no code to obtain
#       the list of files to be processed; and the processor
#       just sleeps...

import threading
import Queue
import time         #just for demo sleep

NUMTHREADS = 5
SENTINAL = object()

workQueue = Queue.Queue()

def fileProc():         #function that handles processing of the files
    while True:
        fname = workQueue.get()
        if fname is SENTINAL:
            workQueue.put(SENTINAL)    #recycle sentinel for next
            break
        print "Processing %s" % fname
        time.sleep(3)   #replace with real file processing

threadList = []
for ti in range(NUMTHREADS):    #create worker threads
    t = threading.Thread(target=fileProc)
    t.start()
    threadList.append(t)

for fn in listOfFiles:  #queue up the file names to be worked
    workQueue.put(fn)   #need to expand to include how names are
                        #obtained

workQueue.put(SENTINAL) #signal that no more files are to be worked

for t in threadList:
    t.join()            #wait for each thread to exit (ensures main
                        #doesn't exit before all threads finish
                        #processing)
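
The listing above is Python 2 (`import Queue`, the `print` statement). For readers on Python 3, here is a hedged port of the same queue-and-sentinel idea, wrapped in a function so the file list and the per-file work are passed in. The names `run_pool` and `process` are assumptions made for this sketch, and collecting results in a list is an addition that the original does not have.

```python
import queue
import threading

NUM_THREADS = 5
SENTINEL = object()


def run_pool(file_names, process, num_threads=NUM_THREADS):
    """Process every name in file_names using a fixed pool of worker threads."""
    work_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            fname = work_queue.get()
            if fname is SENTINEL:
                work_queue.put(SENTINEL)  # pass the sentinel on to the next worker
                break
            outcome = process(fname)
            with results_lock:
                results.append((fname, outcome))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for fname in file_names:
        work_queue.put(fname)
    work_queue.put(SENTINEL)  # signal that no more files are coming
    for t in threads:
        t.join()
    return results
```

A call such as `run_pool(list_of_files, my_processing_function)` returns `(name, result)` pairs in completion order, which is not necessarily the input order.
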

--
Wulfraed    Dennis Lee Bieber    KD6MOG
wlfr...@ix.netcom.com    wulfr...@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-a...@bestiaria.com)
HTTP://www.bestiaria.com/

Thanks for the information! I can definitely see what you're talking
about, and the Exception is only "pass" right now while I am working
on the code.

However, it does process every file (it doesn't skip the second one),
and I'm guessing that this is because it loops so many times? I guess
that means I am successful in spite of myself! ;-) (This wouldn't be
the first time... ;-) )

I REALLY appreciate your insights!!
Jun 27 '08 #5