Bytes | Software Development & Data Engineering Community

Most efficient way to process thousands of files using multiple threads (dealing with thread handling)

m
Hello,

I have an application that processes thousands of files each day. The
filenames and related file information are retrieved, related filenames are
associated and placed in a linked list within a single object, and that
object is pushed onto a stack (this cuts thread creation and deletion
roughly by a factor of 4). I create up to 12 threads, each of which processes
a single object off the stack. I loop while stack.Count > 0 and check each
thread to see if it is alive. If it is not, I create a new thread with a new
object off the stack, which is passed as the constructor parameter for a new
threaded object. If the thread is alive, the loop merely goes on to check the
status of the next thread in line. This is a big process, and running the CPU
at 100% is not an issue; I would just like to optimize my threading code to
make my application faster and more efficient. The ThreadPool class does not
seem like a good option for my needs, as my threads will be constantly
processing throughout their lifetime. I think that my constant polling of
threads could definitely be replaced with something like a thread callback
upon completion of its processing. How can I further reduce the threading
overhead? Would it be better to just reset all the variables in a thread and
pass it a new stack object, rather than creating a new thread to replace the
dead one? My code, while reliable so far, could easily be simplified and
improved upon.

Thanks for any and all input :)
Nov 15 '05 #1
I wonder whether the threads are the bottleneck. Reading from a file
system with two or more threads simultaneously slows read processing down
terribly, because the hard drive's head has to keep seeking back and forth.

What I would do is set up a queue-like mechanism for file requests, with one
thread that reads all the files in sequential order and hands them
back to the processing threads. This way your disk activity is streamlined.

So you can, for example, start 4 or 5 threads that request files. Once these
are loaded, your loader thread goes to sleep until a thread puts another
request in the queue. 20 or so threads per process on a single CPU is
probably the maximum, since IIS is optimized for 20 threads per CPU.
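
A minimal C# sketch of that loader-thread idea, under stated assumptions: the class and member names, the read-ahead cap of 4 buffers, and the byte-counting "processing" are all hypothetical placeholders, and the loader here reads ahead into a bounded queue rather than answering individual requests. Only the loader thread touches the disk; it sleeps in Monitor.Wait whenever the buffer is full and is woken when a worker takes a file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

static class SequentialLoaderSketch
{
    const int MaxBuffered = 4;   // assumed cap on files held in memory at once
    static readonly Queue<byte[]> loaded = new Queue<byte[]>();
    static bool loaderDone;
    static long totalBytes;

    // Loader thread: the only code that touches the disk, one file at a time.
    static void LoadAll(string[] paths)
    {
        foreach (string path in paths)
        {
            byte[] data = File.ReadAllBytes(path);
            lock (loaded)
            {
                while (loaded.Count >= MaxBuffered)
                    Monitor.Wait(loaded);        // sleep until a worker takes a file
                loaded.Enqueue(data);
                Monitor.PulseAll(loaded);        // wake workers waiting for data
            }
        }
        lock (loaded)
        {
            loaderDone = true;
            Monitor.PulseAll(loaded);            // let idle workers see the end
        }
    }

    // Worker thread: consumes buffers already in memory, never reads the disk.
    static void Work()
    {
        while (true)
        {
            byte[] data;
            lock (loaded)
            {
                while (loaded.Count == 0 && !loaderDone)
                    Monitor.Wait(loaded);
                if (loaded.Count == 0) return;   // loader finished, queue drained
                data = loaded.Dequeue();
                Monitor.PulseAll(loaded);        // wake the loader if it waits for room
            }
            Interlocked.Add(ref totalBytes, data.Length);  // stand-in for real work
        }
    }

    public static long Run(string[] paths, int workerCount)
    {
        lock (loaded) { loaded.Clear(); loaderDone = false; }
        totalBytes = 0;
        Thread loader = new Thread(() => LoadAll(paths));
        Thread[] workers = new Thread[workerCount];
        loader.Start();
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(Work);
            workers[i].Start();
        }
        loader.Join();
        foreach (Thread w in workers) w.Join();
        return totalBytes;
    }

    // Demo input: ten 1000-byte files in a temp directory (made-up test data).
    public static long RunDemo()
    {
        string dir = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), "loader-sketch")).FullName;
        string[] paths = new string[10];
        for (int i = 0; i < paths.Length; i++)
        {
            paths[i] = Path.Combine(dir, "file" + i + ".bin");
            File.WriteAllBytes(paths[i], new byte[1000]);
        }
        return Run(paths, 4);
    }

    static void Main()
    {
        Console.WriteLine(RunDemo());
    }
}
```

Whether this beats several threads reading in parallel depends on the disk and the file sizes, so it is worth timing both variants on the real workload.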

FB

--
Solutions Design : http://www.sd.nl
My open source .NET Software : http://www.sd.nl/software
My .NET Blog : http://weblogs.asp.net/FBouma
-------------------------------------------------------------------------
Nov 15 '05 #2
The first thing that comes to mind: instead of polling to see what the
threads are doing, you should have each thread update a static member that
records its status.

For example, have a static array that keeps track of threads 0 to 11. When a
thread is created, it's given one of the empty slots. The thread executes
(and it knows it's "4", for example). When it's done, it updates the array
to say "4" is done, then exits. That way, all threads clean up
after themselves.
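
One way the status-array idea could look in C#, as a rough sketch only. The names, the Thread.Sleep stand-in for file processing, and the use of Monitor.Pulse to wake the main thread are assumptions added for illustration; the post itself only describes the static array.

```csharp
using System;
using System.Threading;

static class SlotStatusSketch
{
    const int SlotCount = 12;
    static readonly bool[] finished = new bool[SlotCount];
    static readonly object gate = new object();

    // Each worker knows its own slot number and reports there when done.
    static void Worker(int slot)
    {
        Thread.Sleep(10 * slot);       // stand-in for processing files
        lock (gate)
        {
            finished[slot] = true;     // "slot N is done"
            Monitor.Pulse(gate);       // wake the main thread so it need not poll
        }
    }

    public static int RunAll()
    {
        lock (gate) { Array.Clear(finished, 0, SlotCount); }
        for (int slot = 0; slot < SlotCount; slot++)
        {
            int mySlot = slot;         // copy so each lambda captures its own value
            new Thread(() => Worker(mySlot)).Start();
        }

        int finishedCount;
        lock (gate)
        {
            while (true)
            {
                finishedCount = 0;
                foreach (bool done in finished)
                    if (done) finishedCount++;
                if (finishedCount == SlotCount) break;
                Monitor.Wait(gate);    // sleep until some worker reports in
            }
        }
        return finishedCount;
    }

    static void Main()
    {
        Console.WriteLine(RunAll());
    }
}
```

The main thread re-checks the array under the same lock the workers use, so a worker that finishes before the main thread starts waiting is still counted.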
Nov 15 '05 #3
I do something similar, but in a more asynchronous way.

I use a queue, though it could just as well be a list or a collection, as in
your case.

The main thread, which gathers the data to process (files), puts every new item
into the queue and calls Monitor.PulseAll to signal all threads that there
is an item to process.

All worker threads are created at application startup; each locks the queue
and sits in Monitor.Wait. When the pulse arrives, the first thread (I never
know exactly which, but from tests it looks like MS does this in round-robin
fashion) wakes up, locks the queue, takes out an item, releases the queue
and starts processing. Some time later it is interrupted (usually because of
asynchronous calls inside the thread) and the next waiting thread wakes up. It
locks the queue and gets the next item. If there are no items yet, it goes
straight back into Monitor.Wait.

As you can see, there is no polling, no loops and no Thread.Sleep calls. Sure,
I also sit at nearly 100% CPU when several threads are running together.
However, that load is split between my threads, the OS (5%) and MS SQL
(40-60%), which the threads call into. Because the threads are created at
startup, there is no overhead from repeatedly creating threads, and none of
the memory clutter and GC activity that comes with it.

About callbacks: I use callbacks from threads only to update visual feedback
elements in the UI, never to start new threads. And I wouldn't recommend them
to you in your case. In essence, when your thread has finished processing, it
should check for the next work item immediately; there is no sense in going
back to some other thread just to start a new thread.

Btw, I started with a scheme like yours: polling and attempts to create
additional threads for each work item. However, with Monitor.Pulse/Wait and
my own fixed thread pool, the final implementation is much faster and more
stable. My app can process up to around a thousand items per hour on a 1GHz
PC. Every item is a file (1-500K) and causes several MS SQL operations. The
app usually runs for several hours.

The only change apart from Monitor.Pulse/Wait you might need to consider is
how to signal the end of the run to all threads. It could be a special item or
a static flag field, which is less preferable. Of course, each work item
is/should be processed completely independently of any other. Put all the
parameters that control processing into the item object.
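
A rough C# sketch of the scheme described above: workers created once at startup, Monitor.Wait/PulseAll on the queue, and a special item to signal the end of the run. The WorkItem class, the EndOfRun marker, and the counter standing in for real processing are hypothetical names chosen for illustration, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class FixedPoolSketch
{
    // Hypothetical work item; a real one would carry every parameter it needs.
    sealed class WorkItem
    {
        public readonly string FileName;
        public WorkItem(string fileName) { FileName = fileName; }
    }

    static readonly WorkItem EndOfRun = new WorkItem("<end of run>"); // special stop item
    static readonly Queue<WorkItem> queue = new Queue<WorkItem>();
    static int processed;

    static void Worker()
    {
        while (true)
        {
            WorkItem item;
            lock (queue)
            {
                while (queue.Count == 0)
                    Monitor.Wait(queue);          // no polling: sleep until pulsed
                item = queue.Peek();
                if (ReferenceEquals(item, EndOfRun))
                    return;                       // leave the marker for the other workers
                queue.Dequeue();
            }
            // Process outside the lock so other workers can take items meanwhile.
            Interlocked.Increment(ref processed); // stand-in for real file processing
        }
    }

    public static int Run(int workerCount, int itemCount)
    {
        lock (queue) { queue.Clear(); }
        processed = 0;

        // Threads are created once, up front, and reused for every item.
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }

        // Main thread: enqueue each item and pulse the waiting workers.
        for (int i = 0; i < itemCount; i++)
        {
            lock (queue)
            {
                queue.Enqueue(new WorkItem("file" + i));
                Monitor.PulseAll(queue);
            }
        }
        lock (queue)
        {
            queue.Enqueue(EndOfRun);
            Monitor.PulseAll(queue);
        }

        foreach (Thread w in workers) w.Join();
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(Run(4, 100));
    }
}
```

Because the queue is first-in, first-out, every real item is taken before any worker sees the end marker, and leaving the marker in place lets a single item stop all workers.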

Also, check whether you really need 12 threads. On my machine, for example,
there is no sense in trying to run more than 4-5 simultaneously. Beyond that
there is no gain at all, and performance starts to degrade even when all
connections are available. If you have a multiprocessor machine and low disk
activity during processing, then maybe 12. But I would measure before drawing
conclusions.
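
Measuring could be as simple as timing the same batch with different thread counts. The C# sketch below assumes a CPU-bound stand-in (Thread.SpinWait) for the real per-file work, and every name in it is hypothetical; with real disk and database calls the numbers will look quite different, which is the point of measuring.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ThreadCountTimingSketch
{
    static int nextItem;
    static int processed;

    public static int LastProcessed { get { return processed; } }

    // Times one run of itemCount items spread over threadCount threads.
    public static long TimeRun(int threadCount, int itemCount)
    {
        nextItem = 0;
        processed = 0;
        Stopwatch watch = Stopwatch.StartNew();

        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                // Each thread claims the next item number until none remain.
                while (Interlocked.Increment(ref nextItem) <= itemCount)
                {
                    Thread.SpinWait(200000);              // stand-in for real work
                    Interlocked.Increment(ref processed);
                }
            });
            threads[t].Start();
        }
        foreach (Thread th in threads) th.Join();

        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        foreach (int count in new int[] { 1, 2, 4, 8, 12 })
            Console.WriteLine(count + " threads: " + TimeRun(count, 200) + " ms");
    }
}
```

Replace the SpinWait call with the actual processing step before trusting any of the timings.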

One last thing: take a look at how you process your files in the threads.
Asynchronous I/O can help speed up processing too.

HTH
Alex
Nov 15 '05 #4
Hi m, (are you related to q?)

If I understand correctly, the processing is continuous and each thread
is processing a file and then dying - only to be immediately replaced by a
new thread for the next file.

I wonder why you don't just have your threads stay alive and grab a new
file each time they've finished. They don't need to die at all, do they?

And I would certainly investigate Frans' idea of disk-access being a
bottleneck.

Regards,
Fergus
Nov 15 '05 #5
Keep in mind that on a single-CPU machine, multiple threads doing purely
CPU-bound work will be slower than processing everything in one thread, since
the extra threads only add context-switch overhead.
Nov 15 '05 #6
A couple of suggestions...

1. Be careful of running too many threads. You may end up slowing yourself
down because of the time and overhead involved in each context switch
from one thread to another. If your threads do a lot of work with
blocking calls like file I/O, network communication, or database requests,
having extra threads is a definite benefit, but you can go too far. If your
threads are doing steady, intense processing without much blocking I/O, you
may actually be slowing your app down. A lot of people get thread-happy, as if
it's the answer to everything. A single processor can only do one thing at
a time.

2. Consider using a thread-safe queue to hand out the work. Rather than
having a thread terminate and clean up, just to create a new thread that
will perform the same type of work, have your threads grab their work
requests from a queue. This way you won't have to poll. Basically, I
created a thread-safe queue with a Get(object ref) that retrieves an item
from the queue, or waits on an event if the queue is empty. When an item is
added to the queue, the event is signaled. The Get has a timeout: if the
timeout is reached, the Get returns 0; otherwise it returns 1. If the Get
returns 0, I check a global variable to see if the app is closing; if it
isn't, I go back into the Get. When I want to close the app, I signal a
closing event. That's a rough idea of how it works, anyway.
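
A rough C# approximation of the queue in point 2, not the original code. It swaps the event for Monitor.Wait with a timeout, returns a bool instead of 0/1, and uses a volatile flag for "closing"; the class names, the 100 ms timeout, and the summing "work" are assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class TimedQueueSketch
{
    sealed class WorkQueue<T>
    {
        readonly Queue<T> items = new Queue<T>();

        public void Add(T item)
        {
            lock (items)
            {
                items.Enqueue(item);
                Monitor.Pulse(items);      // signal one waiting Get
            }
        }

        // Returns true with an item, or false if none was available in time.
        public bool Get(out T item, int timeoutMs)
        {
            lock (items)
            {
                if (items.Count == 0)
                    Monitor.Wait(items, timeoutMs);
                if (items.Count > 0)
                {
                    item = items.Dequeue();
                    return true;
                }
                item = default(T);
                return false;
            }
        }
    }

    static readonly WorkQueue<int> queue = new WorkQueue<int>();
    static volatile bool closing;          // the "app is closing" flag
    static int sum;

    static void Worker()
    {
        while (true)
        {
            int value;
            if (queue.Get(out value, 100))
                Interlocked.Add(ref sum, value);   // stand-in for real work
            else if (closing)
                return;                            // nothing to do and app is closing
            // otherwise: nothing arrived, go back into Get
        }
    }

    public static int Run(int workerCount, int itemCount)
    {
        closing = false;
        sum = 0;
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }
        for (int i = 1; i <= itemCount; i++)
            queue.Add(i);
        closing = true;                    // workers exit once the queue runs dry
        foreach (Thread w in workers) w.Join();
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(Run(4, 100));
    }
}
```

Workers only look at the closing flag after a Get comes back empty, so items still in the queue when the flag is set are processed before the threads exit.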

What Frans said about hard drive access makes sense too and is something to
check, but I would definitely try reusing the worker threads. If you are
processing thousands of files and creating and destroying thousands of
threads, eliminating the cost of that constant allocation and destruction
should make a noticeable difference. You can also play with the number of
threads in your pool to see what performs best.

Good luck!

jim
Nov 15 '05 #7
