Code For Five Threads To Process Multiple Files?

All,

I'd appreciate any help. I've got a list of files in a directory, and
I'd like to iterate through that list and process each one. Rather
than do that serially, I was thinking I should start five threads and
process five files at a time.

Is this a good idea? I picked the number five at random... I was
thinking that I might check the number of processors and start a
multiple of that, but then I remembered KISS and it seemed that that
was too complicated.

If it's not a horrible idea, would anyone be able to provide some
quick code as to how to do that? Any and all help would be greatly
appreciated!

Thanks in advance!
Jun 27 '08 #1
On 2008-05-21, td****@gmail.com <td****@gmail.com> wrote:
I'd appreciate any help. I've got a list of files in a directory, and
I'd like to iterate through that list and process each one. Rather
than do that serially, I was thinking I should start five threads and
process five files at a time.

Is this a good idea? I picked the number five at random... I was

Depends on what you are doing.
If you are mainly reading and writing files, there is not much to gain, since a
single process will already push the disk I/O system to its limit. If you do a
lot of processing, then running more threads than there are processors is not
much use. If you have more 'bursty' behavior (first a lot of reading, then a lot
of processing, then reading again, etc.), then the system may be able to do some
scheduling and keep both the processors and the file system busy.

I cannot really give you advice on threading; I have never done that. You may
want to consider an alternative, namely multitasking at the OS level. If you can
easily split the files over a number of OS processes (written in Python), you
can keep the Python program really simple and let the OS handle the
task switching between the programs.
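
A minimal sketch of the OS-process idea described above, written for Python 3 as an assumption (the thread itself dates from Python 2 days). The worker script, `split_evenly`, and `run_in_processes` are hypothetical names made up for illustration; the worker only counts lines and stands in for the real per-file processing. Separate processes are also a common choice in CPython because the global interpreter lock keeps threads from running pure-Python, CPU-bound code in parallel.

```python
import subprocess
import sys

# Hypothetical worker script: prints "<path> <line count>" for every file
# named on its command line. Replace the body with the real processing.
WORKER_SOURCE = """
import sys
for path in sys.argv[1:]:
    with open(path) as handle:
        print(path, len(handle.readlines()))
"""


def split_evenly(items, parts):
    """Deal items round-robin into at most `parts` non-empty lists."""
    chunks = [items[i::parts] for i in range(parts)]
    return [chunk for chunk in chunks if chunk]


def run_in_processes(paths, processes=5):
    """Start one Python process per chunk and collect what each one prints."""
    running = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE] + chunk,
            stdout=subprocess.PIPE,
            text=True,
        )
        for chunk in split_evenly(paths, processes)
    ]
    output_lines = []
    for proc in running:
        out, _ = proc.communicate()  # waits for this process to finish
        output_lines.extend(out.splitlines())
    return output_lines
```

All the processes are started before any of them is waited on, so the OS is free to schedule them side by side. The standard library's `multiprocessing` module wraps the same idea in a pool API, if starting the processes by hand feels too low-level.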

Sincerely,
Albert
Jun 27 '08 #2
Albert,

Thanks for your response - I appreciate your time!

I am mainly reading and writing files, so it seems like it might not
be a good idea. What if I read the whole file into memory first, and
operate on it there? They are not large files...

Either way, I'd hope that someone might respond with an example, as
then I could test and see which is faster!

Thanks again.
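
For the "test and see which is faster" part, here is a rough timing sketch in Python 3 (an assumption; the thread predates it). `process_file`, `run_serial`, `run_threaded`, and `timed` are hypothetical names, and `process_file` only reads the whole file into memory as a stand-in for the real work.

```python
import threading
import time


def process_file(path):
    """Hypothetical stand-in for the real work: read the file, count lines."""
    with open(path) as handle:
        return len(handle.readlines())


def run_serial(paths):
    """Process the files one after another; return the total line count."""
    return sum(process_file(path) for path in paths)


def run_threaded(paths, num_threads=5):
    """Process the files with num_threads threads; return the total line count."""
    totals = [0] * num_threads  # one slot per thread, so no lock is needed

    def worker(slot):
        # Each thread takes every num_threads-th file, starting at its slot.
        for path in paths[slot::num_threads]:
            totals[slot] += process_file(path)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(totals)


def timed(func, *args):
    """Return (elapsed seconds, result) for one call of func(*args)."""
    started = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - started, result
```

Calling `timed(run_serial, paths)` and `timed(run_threaded, paths)` on the same list gives two numbers to compare. The operating system caches recently read files, so whichever variant runs second tends to look faster; repeating the measurement a few times in alternating order gives a fairer picture.
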
Jun 27 '08 #3
Ah, well, I didn't get any other responses, but here's what I've done:

loopCount = 0
for l in range(len(self.filesToProcess)):
    threads = []
    try:
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))
        msg = "Processing file...\n"
        for thread in threads:
            wx.CallAfter(self.textctrl03.write(msg), thread.start())
        for thread in threads:
            thread.join()
        loopCount += 5
    except IndexError:
        pass

It works, and it works well. It starts five threads, and processes
five files at a time. (In the "self.processFiles" I read the whole
file into memory using readlines(), which works well.)

Of course, now the wx.CallAfter function doesn't work... I get
"TypeError: 'NoneType' object is not callable" for every time it is
run...
Jun 27 '08 #4
On May 23, 12:20 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Thu, 22 May 2008 11:03:48 -0700 (PDT), tda...@gmail.com declaimed the
following in comp.lang.python:
Ah, well, I didn't get any other responses, but here's what I've done:

Apparently the direct email from my work address did not get through
(I don't have group posting ability from work).

loopCount = 0
for l in range(len(self.filesToProcess)):
    threads = []
    try:
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))

Python lists index from 0, so this will be 0+0, the first entry in the
file list.

        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))

This is 0+2, the THIRD entry in the file list -- you've just skipped
over the second entry...

        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
        threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))

Very ugly... It is also going to fail for other reasons. Consider:

filestoprocess = [ 'file1', 'file2', 'file3' ]
for jnk in range(len(filestoprocess)):  #this will loop three times!
                                        #jnk = 0, 1, 2

You proceed to create FIVE threads (or try to) when there are only
THREE files... It will fail as soon as it tries loopCount+3 (the fourth
entry in a three-element list).

        msg = "Processing file...\n"
        for thread in threads:
            wx.CallAfter(self.textctrl03.write(msg), thread.start())

Is this running as the main controller of some GUI? If so...

        for thread in threads:
            thread.join()

Your GUI will essentially freeze, since it can't process events
(including screen updates) until the entire function you are in returns
to the event handler. But .join() blocks until the specified thread
really finishes...

        loopCount += 5
    except IndexError:
        pass

BAD style -- if you are going to trap an exception, you should do
something with it. But then, the only reason you would GET this
exception is that the preceding code loops too many times
relative to the number of files...

As shown, with three files, you will create the first thread (0) for the
first file, skip the second file and create the second thread (1) for the
third file, and then raise an exception while creating the third thread
(2), because that tries to access a fourth file in the list. The exception
skips over both the thread.start() calls and the thread.join() calls.
You then ignore the error and go back to the start of the loop, where the
index is now 1, AND reset the thread list, so threads 0 and 1 are
forgotten: never started, never joined, and garbage collected...

Again, you now create a thread (0) and give it the second file (since
loopCount was never incremented, and the first thread uses loopCount
+ <loop index>), create thread (1) and give it the third file, raise the
exception... and repeat.

It works, and it works well. It starts five threads, and processes
five files at a time. (In the "self.processFiles" I read the whole
file into memory using readlines(), which works well.)

It only works as long as loopCount+5 is less than the number of
files in the list... AND even then, it skips one file and processes
another twice...
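
One way to get the intended "five at a time" behavior without the index arithmetic is to step through the list in slices. This is only a sketch in Python 3 (an assumption), and `process_in_batches` is a hypothetical name; `process_file` is whatever callable does the per-file work.

```python
import threading


def process_in_batches(paths, process_file, batch_size=5):
    """Run process_file on every path, at most batch_size threads at a time."""
    for start in range(0, len(paths), batch_size):
        # Slicing past the end of a list just yields a shorter list,
        # so the last batch needs no IndexError handling.
        batch = paths[start:start + batch_size]
        threads = [
            threading.Thread(target=process_file, args=(path,))
            for path in batch
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()  # wait for the whole batch before starting the next
```

Every file is visited exactly once, and a list of three files simply produces one batch of three threads. The join() calls still block the caller, so this would need to run outside the GUI event handler to avoid freezing the window.
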

Of course, now the wx.CallAfter function doesn't work... I get
"TypeError: 'NoneType' object is not callable" for every time it is
run...

Probably because it wants you to supply it with one or two
/callable/ functions, but you are actually calling the functions and
passing it the results of those calls (and they aren't
returning anything -- None).
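
The difference between passing a callable and passing the result of a call can be seen without any GUI. The sketch below is Python 3 (an assumption), and `process_file` is a hypothetical worker that only records which thread ran it. The same mistake appears in the `threading.Thread(target=self.processFiles(...))` lines of the posted code.

```python
import threading

calls = []


def process_file(path):
    """Hypothetical worker: record the file name and the thread that ran it."""
    calls.append((path, threading.current_thread().name))


# Wrong: process_file("a.txt") runs right here, in the calling thread,
# and its return value (None) becomes the thread's target.
wrong = threading.Thread(target=process_file("a.txt"), name="worker-wrong")
wrong.start()  # a thread whose target is None starts and exits immediately
wrong.join()

# Right: pass the function itself and supply its arguments separately.
right = threading.Thread(target=process_file, args=("b.txt",), name="worker-right")
right.start()
right.join()

print(calls)
```

The first entry is recorded by the calling thread rather than by "worker-wrong", which means the work ran serially before the thread even existed; the second entry is recorded by "worker-right". wx.CallAfter likewise takes a callable followed by its arguments, so the call would presumably become `wx.CallAfter(self.textctrl03.write, msg)`; that line is untested here because it needs a running wxPython application.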

Ignoring GUI stuff... here is a simple one-job thread pool algorithm
-- you have to plug in the file list and the actual processing work. It
creates n threads, and those threads pull the work off a common
queue. The main program only has to fill the queue with the work to be
done, and put a sentinel value onto the queue when it wants the
threads to die, which would be before shutdown of the program (create
the pool at start-up, and leave the threads blocked on the .get() until you
need one to process something).

-=-=-=-=-=-=-=-
#
#       Example code for a pooled thread file processor
#       NOT EXECUTABLE as is -- there is no code to obtain
#       the list of files to be processed; and the processor
#       just sleeps...

import threading
import Queue
import time         #just for demo sleep

NUMTHREADS = 5
SENTINAL = object()

workQueue = Queue.Queue()

def fileProc():         #function that handles processing of the files
    while True:
        fname = workQueue.get()
        if fname is SENTINAL:
            workQueue.put(SENTINAL)    #recycle sentinal for next
            break
        print "Processing %s" % fname
        time.sleep(3)   #replace with real file processing

threadList = []
for ti in range(NUMTHREADS):    #create worker threads
    t = threading.Thread(target=fileProc)
    t.start()
    threadList.append(t)

for fn in listOfFiles:  #queue up the file names to be worked
    workQueue.put(fn)   #need to expand to include how names are
                        #obtained

workQueue.put(SENTINAL) #signal that no more files are to be worked

for t in threadList:
    t.join()            #wait for each thread to exit (ensures main
                        #doesn't exit before all threads finish
                        #processing)

--
Wulfraed        Dennis Lee Bieber        KD6MOG
wlfr...@ix.netcom.com        wulfr...@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-a...@bestiaria.com)
HTTP://www.bestiaria.com/

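
The pooled example quoted above uses Python 2 idioms (`import Queue`, the `print` statement). For anyone reading this later, here is a hedged Python 3 port of the same queue-and-sentinel technique, wrapped in a function so the file list and the processing callable are passed in. `run_pool` and its parameter names are hypothetical, not part of the original post.

```python
import queue
import threading

NUM_THREADS = 5
SENTINEL = object()  # unique marker meaning "no more work"


def run_pool(file_names, process_file, num_threads=NUM_THREADS):
    """Process every name in file_names using a fixed pool of worker threads."""
    work_queue = queue.Queue()

    def worker():
        while True:
            name = work_queue.get()  # blocks until an item is available
            if name is SENTINEL:
                work_queue.put(SENTINEL)  # recycle the sentinel for the next worker
                break
            process_file(name)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()

    for name in file_names:
        work_queue.put(name)
    work_queue.put(SENTINEL)  # signal that no more files are coming

    for thread in threads:
        thread.join()  # wait until every worker has exited
```

A call such as `run_pool(names, print)` prints each name from whichever worker picks it up, in no guaranteed order. The standard library's `concurrent.futures.ThreadPoolExecutor` offers the same pooling with less code, if the explicit queue is not needed.
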
Thanks for the information! I can definitely see what you're talking
about, and the Exception is only "pass" right now while I am working
on the code.

However, it does process every file (it doesn't skip the second one),
and I'm guessing that this is because it loops so many times? I guess
that means I am successful in spite of myself! ;-) (This wouldn't be
the first time... ;-) )

I REALLY appreciate your insights!!
Jun 27 '08 #5
