
File processor

Hi all

This is a bit vague I suppose :-) Tomorrow I need to write a service which
monitors two folders for new files and performs tasks appropriately. Some
of these tasks are not too intensive and some are. Here's a scenario

Event: \Incoming\SomeFile.txt
Action: Copy to a backup folder. Move it elsewhere

Event: \Incoming\SomeFile.zip
Action: Copy to a backup folder. Unzip a file within it elsewhere

Event: \Outgoing\SomeFile.txt
Action: Copy to a backup folder. Move it elsewhere

Event: \Outgoing\SomeFile.xml
Action: Parse the XML, generate a binary file, zip the binary file, back up
the zip file, copy the zip elsewhere.

In most of these cases the task is quick; in the final case the task could
take up to a couple of minutes. I really need to look into this in great
detail in the morning, but I am hoping to get a bit of a head start :-)

01: Is there a class for monitoring new files in a folder and triggering an
event or something with the name of the new file?

02: I expect that once the event triggers I will stuff the filename into a
thread-safe queue. If I have a thread pool for the quick tasks and queue
tasks for it to perform, I presume the thread automatically sleeps again
once the task is complete. Is that right?

Thanks

Pete

Sep 24 '08 #1
On Wed, 24 Sep 2008 15:22:11 -0700, Peter Morris
<mr*********@spamgmail.com> wrote:

> [...]
> 01: Is there a class for monitoring new files in a folder and triggering
> an event or something with the name of the new file?

FileSystemWatcher
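
For anyone who has not used it, here is a minimal sketch of that approach. It assumes .NET 4 or later (for ConcurrentQueue) and C# 7 syntax, neither of which existed when this thread was written. The FolderMonitor class and its NewFiles property are hypothetical names for illustration; only FileSystemWatcher and its members come from the framework.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

// Hypothetical helper: watches one folder and queues the full path of
// every newly created file for a worker to pick up later.
public sealed class FolderMonitor : IDisposable
{
    private readonly FileSystemWatcher _watcher;

    // Thread-safe queue of full paths; consumers dequeue from here.
    public ConcurrentQueue<string> NewFiles { get; } = new ConcurrentQueue<string>();

    public FolderMonitor(string folder)
    {
        _watcher = new FileSystemWatcher(folder)
        {
            NotifyFilter = NotifyFilters.FileName, // report file name changes (create/rename/delete)
            IncludeSubdirectories = false
        };
        // By default the event arrives on a thread pool thread, so the
        // handler only records the path and returns.
        _watcher.Created += (sender, e) => NewFiles.Enqueue(e.FullPath);
        _watcher.EnableRaisingEvents = true;       // start watching
    }

    public void Dispose() => _watcher.Dispose();
}
```

One instance per watched folder (\Incoming and \Outgoing in the scenario above) would be enough. Note that Created fires when the file appears, which can be before the writer has finished with it.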

> 02: I expect that once the event triggers I will stuff the filename into
> a thread-safe queue. If I have a thread pool for the quick tasks and
> queue tasks to perform I presume the thread automatically sleeps again
> once the task is complete, is that right?

Which thread? The thread pool thread? Yes, if there are no more thread
pool tasks queued, a thread pool thread will simply enter a wait state
until a new task is queued.

Note that if your tasks are not i/o bound, and you expect there to be a
large number of them queued in a short time, using the built-in thread
pool is probably not a great idea, as your tasks will all wind up fighting
each other for the CPU, wasting lots of time in the process.

Pete
Sep 25 '08 #2
> FileSystemWatcher

Shame there wasn't a way of receiving notifications after the file is
created and the file handle closed. I had to write something to handle this
situation.
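
The post does not say how this was handled. One common workaround, shown here only as a sketch, is to retry opening the file with no sharing until the open succeeds. It assumes Windows file-sharing semantics, where such an open fails with an IOException while another process still holds the file. FileReadiness, WaitUntilClosed and the 250 ms retry interval are hypothetical choices.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileReadiness
{
    // Returns true once the file can be opened with no sharing, which
    // suggests the writer has closed its handle. Returns false on timeout.
    public static bool WaitUntilClosed(string path, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            try
            {
                // FileShare.None asks for exclusive access, so the open
                // throws while someone else still has the file open.
                using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
                {
                    return true;
                }
            }
            catch (IOException)
            {
                // Also covers a file that has vanished (FileNotFoundException).
                if (DateTime.UtcNow >= deadline) return false;
                Thread.Sleep(250); // hypothetical retry interval
            }
        }
    }
}
```

Calling this from the worker that processes the queued filename, rather than from the watcher's event handler, keeps the handler from blocking.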

> Which thread? The thread pool thread? Yes, if there are no more thread
> pool tasks queued, a thread pool thread will simply enter a wait state
> until a new task is queued.

I decided against the pool thread. I have tasks which are immediate, short,
or long in duration. I didn't want them all in the same thread pool because
a few long tasks would hog it. I'm going to have 3 threads, each with their
own queue, and give them jobs to do. Adding to the queue will resume a
thread, running out of jobs will suspend it.
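
A rough sketch of one such thread-plus-queue pair follows. It is not the poster's code: it assumes .NET 4 or later, where BlockingCollection does the suspend-when-empty and resume-on-add bookkeeping, and the QueueWorker name is made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical worker: one dedicated thread draining its own job queue.
public sealed class QueueWorker : IDisposable
{
    private readonly BlockingCollection<Action> _jobs = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public QueueWorker(string name)
    {
        _thread = new Thread(Run) { Name = name, IsBackground = true };
        _thread.Start();
    }

    // Adding a job wakes the worker if it is waiting on an empty queue.
    public void Enqueue(Action job) => _jobs.Add(job);

    private void Run()
    {
        // GetConsumingEnumerable blocks while the queue is empty and ends
        // once CompleteAdding has been called and the queue has drained.
        foreach (Action job in _jobs.GetConsumingEnumerable())
        {
            try { job(); }
            catch (Exception ex) { Console.Error.WriteLine(ex); } // keep the worker alive
        }
    }

    // Stops accepting jobs, finishes the queued ones, then joins the thread.
    public void Dispose()
    {
        _jobs.CompleteAdding();
        _thread.Join();
        _jobs.Dispose();
    }
}
```

Three instances (for example "immediate", "short" and "long") would give the layout described, with jobs in each queue running strictly one at a time in arrival order.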

Thanks for the info!

Pete

Sep 25 '08 #3
On Thu, 25 Sep 2008 11:52:36 -0700, Peter Morris
<mr*********@spamgmail.com> wrote:

>> FileSystemWatcher
>
> Shame there wasn't a way of receiving notifications after the file is
> created and the file handle closed. I had to write something to handle
> this situation.

Yes, FileSystemWatcher is not a panacea. That said, it does provide you
with enough basic information that the manual labor is reduced (presumably
that's what you had to do in your own situation).

>> Which thread? The thread pool thread? Yes, if there are no more
>> thread pool tasks queued, a thread pool thread will simply enter a wait
>> state until a new task is queued.
>
> I decided against the pool thread. I have tasks which are immediate,
> short, or long in duration. I didn't want them all in the same thread
> pool because a few long tasks would hog it.

The built-in thread pool allows thousands of threads. Not necessarily the
best approach for performance, but if the tasks are not CPU-bound, it
would probably be fine. Even if the tasks are CPU-bound, as long as you
never have a huge number of them running simultaneously, it would probably
be fine. A few long-running tasks would not block other tasks.

> I'm going to have 3 threads, each with their own queue, and give them
> jobs to do. Adding to the queue will resume a thread, running out of
> jobs will suspend it.

Note that the number of threads should probably also relate to the number
of CPUs on the system, not just the types of jobs you have. And of
course, it depends on whether the tasks are CPU-bound versus i/o-bound.

You haven't posted those details, so it's hard to provide specifics. But
so far, I haven't seen anything that would suggest that using the built-in
thread pool wouldn't be appropriate here.

Pete
Sep 25 '08 #4
I'm sorry to bother you, but I'm a little confused about this statement and
I was hoping you could clarify it for me.

"Peter Duniho" <Np*********@nnowslpianmk.com> wrote in message
news:op***************@petes-computer.local...

> Note that if your tasks are not i/o bound, and you expect there to be a
> large number of them queued in a short time, using the built-in thread
> pool is probably not a great idea, as your tasks will all wind up fighting
> each other for the CPU, wasting lots of time in the process.

Are you saying that queuing a lot of CPU-bound tasks on the thread pool is
a bad idea?

That's not generally my understanding. Unless the tasks are long running,
the thread pool is well suited for this kind of task, and it is designed to
provide good performance based on the number of available CPUs. Fact of the
matter is that if you have many CPU-bound tasks, they will compete for CPU
time no matter what kind of threading strategy you use.

--
Regards,
Brian Rasmussen [C# MVP]
http://kodehoved.dk
Sep 26 '08 #5
On Thu, 25 Sep 2008 22:05:18 -0700, Brian Rasmussen [C# MVP]
<br***@kodehoved.dk> wrote:

> I'm sorry to bother you, but I'm a little confused about this statement
> and I was hoping you could clarify it for me.
>
> "Peter Duniho" <Np*********@nnowslpianmk.com> wrote in message
> news:op***************@petes-computer.local...
>> Note that if your tasks are not i/o bound, and you expect there to be a
>> large number of them queued in a short time, using the built-in thread
>> pool is probably not a great idea, as your tasks will all wind up
>> fighting each other for the CPU, wasting lots of time in the process.
>
> Are you saying that queuing a lot of CPU-bound tasks on the thread pool
> is a bad idea?

It certainly can be.

> That's not generally my understanding.

Your understanding may be wrong, or simply incomplete. I'm not sure which.

> Unless the tasks are long running, the thread pool is well suited for
> this kind of task, and it is designed to provide good performance based
> on the number of available CPUs.

No, not really. The thread pool doesn't do anything in particular to
match active threads with the CPU count. If the tasks are so short-lived,
and queued so infrequently that one just naturally has relatively few
threads competing with each other for the CPU, then that's fine. There's
probably no need to go to the extra effort to limit the number of active
threads at once.

But unless you can guarantee that the tasks are all short-lived _and_ that
there are only a few running at any given time, it pays to be more careful.

> Fact of the matter is that if you have many CPU-bound tasks, they will
> compete for CPU time no matter what kind of threading strategy you use.

Define "compete". The fact is, there's a good way to compete and a bad
way.

First, let's ignore the system threads and assume that you have _only_
your interesting CPU-bound threads. Now, as long as you only have at most
one of these active for each CPU, then _none_ of the threads need ever
yield the CPU. But as soon as you have more of these threads than you
have CPUs, at least some will have to be round-robin-ed by the thread
scheduler (and in practice, all probably will be).

Interrupting a thread is very costly. Not only is there the immediate
cost of the context switch, in which all of the state for one thread is
saved and all of the state for another thread is restored, there is also a
serious risk of completely invalidating the CPU's warmed-up state
(pipelines, L1 and L2 caches, branch prediction, etc.).

If you limit the number of active threads to the number of CPUs you have,
then you maximize the probability that any given thread can run for an
extended period of time without interruption, which in turn significantly
improves the overall throughput of that thread. Conversely, when you
create a situation in which it's assured that you have more active threads
than CPUs, you ensure that you inject non-productive CPU cycles and
disrupt the caching mechanisms in the CPU, all of which can significantly
hurt performance.

Even when you have exactly as many threads as CPUs, there are of course
other threads in the system that may need to run from time to time. It's
not a perfect system. But, those other threads are generally not going to
be CPU-bound, and thus aren't going to present the same kind of constant
competition for the CPU that your own CPU-bound worker threads would.

The issue of being CPU-bound is important. An i/o-bound thread will in
fact spend a lot of time in a non-runnable state and thus won't compete
for the CPU (as much). You can have an awful lot of i/o-bound threads
sitting idle without any significant cost, and in fact having multiple
i/o operations all pending is a way to take advantage of some inherent
parallelism that exists elsewhere in the hardware. This could go either
way, though: in some cases, having multiple i/o operations pending allows
a particular device to retrieve data most efficiently, as in the case of
a hard disk, and in other cases it just causes contention for the i/o
device, which would be as counter-productive as over-competing for the
CPU.

There's no single right way to do threading. It does depend on
your specific task. But for CPU-bound tasks, it is _definitely_
counter-productive to simply queue a large number of tasks and let the
thread pool sort it out. You can get much more efficient throughput by
making sure you never have more runnable threads than you have CPUs.
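
To make the idea concrete, here is a sketch of one way to keep the number of runnable worker threads at or below the CPU count: a fixed set of threads pulling from a shared queue. It assumes .NET 4 or later and C# 7; BoundedRunner and RunAll are hypothetical names, and the jobs are assumed not to throw.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class BoundedRunner
{
    // Runs every job, using at most Environment.ProcessorCount threads,
    // and returns when all jobs have finished.
    public static void RunAll(Action[] jobs)
    {
        var pending = new ConcurrentQueue<Action>(jobs);
        int workerCount = Math.Min(Environment.ProcessorCount, Math.Max(jobs.Length, 1));
        var workers = new Thread[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Each worker keeps pulling jobs until the queue is empty,
                // so no more than workerCount jobs ever run at once.
                while (pending.TryDequeue(out Action job))
                {
                    job();
                }
            });
            workers[i].Start();
        }

        foreach (Thread worker in workers)
        {
            worker.Join();
        }
    }
}
```

Whether this beats simply queuing everything on the built-in pool depends on the workload and the runtime version, so it is worth measuring both.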

Pete

Sep 26 '08 #6
On Thu, 25 Sep 2008 19:52:36 +0100, "Peter Morris"
<mr*********@SPAMgmail.com> wrote:

>> FileSystemWatcher
>
> Shame there wasn't a way of receiving notifications after the file is
> created and the file handle closed. I had to write something to handle this
> situation.
>
>> Which thread? The thread pool thread? Yes, if there are no more thread
>> pool tasks queued, a thread pool thread will simply enter a wait state
>> until a new task is queued.
>
> I decided against the pool thread. I have tasks which are immediate, short,
> or long in duration. I didn't want them all in the same thread pool because
> a few long tasks would hog it. I'm going to have 3 threads, each with their
> own queue, and give them jobs to do. Adding to the queue will resume a
> thread, running out of jobs will suspend it.

You might consider giving each task its own thread pool at the
outgoing end of each task queue. I did something kinda similar a few
years back where a worker thread would read its request queue, perform
the task, and finally write to a parallel response queue.

regards
A.G.
Sep 26 '08 #7
Thanks for the reply - please see my comments below.

>> Unless the tasks are long running, the thread pool is well suited for
>> this kind of task, and it is designed to provide good performance based
>> on the number of available CPUs.
>
> No, not really. The thread pool doesn't do anything in particular to
> match active threads with the CPU count. If the tasks are so short-lived,
> and queued so infrequently that one just naturally has relatively few
> threads competing with each other for the CPU, then that's fine. There's
> probably no need to go to the extra effort to limit the number of active
> threads at once.

According to the documentation
(http://msdn.microsoft.com/en-us/libr...hreadpool.aspx)
that's not entirely correct. The documentation says, "The thread pool
maintains a minimum number of idle threads. For worker threads, the default
value of this minimum is the number of processors." In other words: The
thread pool tries to avoid creating redundant threads based on the number of
CPUs. As you point out having more threads than CPUs is wasteful.
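
The numbers under discussion can be read from the running process. A small sketch (C# 7 syntax assumed; PoolInfo and Describe are hypothetical names) that reports the pool's minimum and maximum alongside the CPU count:

```csharp
using System;
using System.Threading;

public static class PoolInfo
{
    // Returns a one-line summary of the pool's thread limits.
    public static string Describe()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        return $"CPUs={Environment.ProcessorCount}, " +
               $"worker min={minWorker} max={maxWorker}, " +
               $"I/O min={minIo} max={maxIo}";
    }
}
```

The minimum quoted from the documentation is a floor, not a cap: the pool may add worker threads beyond it, up to the maximum, while queued work is waiting.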
>Fact of the matter is, that if you have many CPU bound tasks, they will
compete for CPU time no matter what kind of threading strategy you use.

> Define "compete". The fact is, there's a good way to compete and a bad
> way.

By competing I mean, that the scheduler will switch between all runnable
threads with the highest priority. As the switching is expensive it should
be minimized.

Anyway, I'm aware of all the stuff you go through about CPU threads vs. I/O
threads and as far as I can tell, we have the same understanding of those
issues. Given that, I'm confused that you end your post with the following:

> There's no single right way to do threading. It does depend on
> your specific task. But for CPU-bound tasks, it is _definitely_
> counter-productive to simply queue a large number of tasks and let the
> thread pool sort it out. You can get much more efficient throughput by
> making sure you never have more runnable threads than you have CPUs.

I agree that threading is hard and I certainly won't claim to be a master in
the field. However, I cannot see why you would gain an advantage by doing
what you describe here.

Assume we have 10 CPU bound tasks (non-blocking and short running) and 2
available CPUs. In this case the thread pool will schedule the tasks to run
on 2 CPUs and thus not create additional threads thereby reducing the cost
of switching between threads. On the other hand if you create 10 threads and
let each of them run one of the tasks each, you not only pay the price of
creating additional threads, you will also end up with a lot of context
switches which is pure overhead (assuming of course that the tasks cannot be
completed within a single time slice).
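
One way to check this scenario rather than argue it is to queue the tasks and count how many distinct pool threads actually ran them. The sketch below assumes .NET 4 or later; PoolExperiment and CountThreadsUsed are hypothetical names, and the spin loop is only a stand-in for real CPU-bound work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class PoolExperiment
{
    // Queues taskCount short CPU-bound work items on the built-in pool and
    // returns how many distinct pool threads ended up running them.
    public static int CountThreadsUsed(int taskCount)
    {
        var threadIds = new ConcurrentDictionary<int, bool>();
        using (var remaining = new CountdownEvent(taskCount))
        {
            for (int i = 0; i < taskCount; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
                    Thread.SpinWait(1000000); // stand-in for short CPU-bound work
                    remaining.Signal();
                });
            }
            remaining.Wait(); // block until every work item has signalled
        }
        return threadIds.Count;
    }
}
```

The result is not fixed: it depends on the CPU count, the runtime version and what else the pool is doing, which is why no expected value is given here.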

If the goal is to complete all tasks as fast as possible, it seems to me
that the thread pool offers a pretty good deal.

--
Regards,
Brian Rasmussen [C# MVP]
http://kodehoved.dk
Sep 26 '08 #8
