Parallelising code

I have some file processing code that has to deal with quite a lot of
data. I have a quad core machine, so I wondered whether I could take
advantage of some parallelism.

Essentially, I have a number of CSV files, let's say 100, each
containing about 8000 data points. For each point, I need to look up
some other data structures (generated in advance) and append the point
to a relevant list. I wondered whether I could get each core to handle
a few files each. I have a few questions:

- Am I actually going to get any speed up from parallelism, or is it
likely that most of my processing time is spent reading files? I guess
I can profile for this?

- Is list.append() thread safe? (not sure if this is the right term)
what I mean is, can two separate processors file a point in the same
list at the same time without anything horrible happening? Do I need
to do anything special (mutex or whatever) to make this happen, or
will it happen automatically?

Thanks in advance for any guidance,

Peter
Sep 15 '08 #1
> Essentially, I have a number of CSV files, let's say 100, each
> containing about 8000 data points. For each point, I need to look up
> some other data structures (generated in advance) and append the point
> to a relevant list. I wondered whether I could get each core to handle
> a few files each. I have a few questions:
>
> - Am I actually going to get any speed up from parallelism, or is it
> likely that most of my processing time is spent reading files? I guess
> I can profile for this?
That probably depends on both how much data is involved in a "data
point" (i.e., whether it is just one value or several fields parsed
from the CSV per record) and how much processing each point involves.
Profiling should enlighten you, yes. You may also have issues with
I/O contention if you have lots of threads trying to read from disk at
once, although I'm not sure how much of an impact that will have.
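
One way to run that profile is with the standard-library cProfile module. The following is only a minimal sketch, not the original poster's code: `load_points`, `file_points`, the bucketing rule and the in-memory sample data are all hypothetical stand-ins for the real file reading and list filing.

```python
import cProfile
import csv
import io
import pstats


def load_points(text):
    """Parse CSV text into rows of floats (stand-in for reading one file)."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text))]


def file_points(points, buckets):
    """Append each point to a list chosen by its first field (hypothetical rule)."""
    for point in points:
        buckets.setdefault(int(point[0]) % 10, []).append(point)


def run(text):
    buckets = {}
    file_points(load_points(text), buckets)
    return buckets


# Hypothetical sample: 8000 two-field records, as in the question.
sample = "\n".join(f"{i},{i * 0.5}" for i in range(8000))

profiler = cProfile.Profile()
buckets = profiler.runcall(run, sample)

# Render the five most expensive calls by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If the parsing function dominates the report, extra cores are more likely to help than if most of the time goes into waiting on the disk, which this kind of profile shows as time inside the read calls.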
> - Is list.append() thread safe? (not sure if this is the right term)
> what I mean is, can two separate processors file a point in the same
> list at the same time without anything horrible happening? Do I need
> to do anything special (mutex or whatever) to make this happen, or
> will it happen automatically?
http://mail.python.org/pipermail/pyt...ch/430017.html

http://effbot.org/pyfaq/what-kinds-o...hread-safe.htm
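
As a rough illustration for CPython: a single list.append() call is atomic, so several threads can append to the same list without corrupting it, whereas a compound read-modify-write update should be guarded by a lock. The sketch below assumes CPython; the names `shared`, `counts` and `worker` are made up for the example.

```python
import threading

shared = []          # one list appended to by every thread
counts = {"n": 0}    # a read-modify-write counter, which does need a lock
lock = threading.Lock()


def worker(worker_id, n_points):
    for i in range(n_points):
        shared.append((worker_id, i))   # single append: atomic in CPython
        with lock:                      # compound update: guard it
            counts["n"] += 1


threads = [threading.Thread(target=worker, args=(w, 8000)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared), counts["n"])
```

Being safe is not the same as being faster: the order of the appended points is not deterministic, and threads alone do not spread CPU-bound work over several cores in CPython.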
Sep 15 '08 #2
Put the data into a database first to see if it is actually too slow.
If it is, take a look at an in-memory database, or perhaps something as
simple as memcached could help.

-Larry
Sep 15 '08 #3
You won't take advantage of your cores with a single pure-Python
script. In CPython the global interpreter lock lets only one thread
run Python bytecode at a time, so Python threads are useful for UIs
and file operations but not for CPU-bound concurrent processing. The
simplest way to do concurrent processing is to use Popen from
subprocess, which creates new processes.

Note that you can call Python scripts from another one, e.g. a manager
and as many workers as you want. IMO it's the simplest design and the
least work for running concurrent processes.
Ideally, make your workers need no feedback through shared variables,
or anything more complex than a return value. Also, make sure they do
not write to the same file. They can read the same file without problems.
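
A manager of that kind might look roughly like this. It is a sketch under assumptions, not a finished design: the worker is passed inline with `-c` rather than kept in its own script, its task (counting the records in one CSV file) is hypothetical, and `run_workers` is a made-up name.

```python
import subprocess
import sys

# Worker source, run in a separate interpreter. It counts the records in
# one CSV file and reports the result on stdout (a hypothetical task).
WORKER = """
import csv, sys
with open(sys.argv[1], newline="") as f:
    print(sum(1 for _ in csv.reader(f)))
"""


def run_workers(paths):
    """Start one worker process per path and collect each worker's count."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, path],
            stdout=subprocess.PIPE,
            text=True,
        )
        for path in paths
    ]
    results = {}
    for path, proc in zip(paths, procs):
        out, _ = proc.communicate()   # waits for the worker to exit
        if proc.returncode != 0:
            raise RuntimeError(f"worker failed for {path}")
        results[path] = int(out)
    return results
```

This starts every worker at once, so with 100 files it would be sensible to hand each worker a batch of files, or to start only as many workers as there are cores.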

Note that you can manage locks etc. from the manager script.

I'm not sure whether Python semaphores allow interprocess communication
like C semaphores do [1]; check this. A workaround is to have the
workers write tuples to stderr.

HTH,
Mathieu

[1] Programming with POSIX Threads, David R. Butenhof, http://tinyurl.com/6hpkol
Sep 15 '08 #4
Look at http://pypi.python.org/pypi/processing
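
The processing package was later added to the standard library as multiprocessing (from Python 2.6). A minimal sketch of how it could be applied to the CSV problem follows; grouping rows by their first field is an assumed stand-in for the real lookup, and `process_file`, `merge` and `process_all` are made-up names.

```python
import csv
from collections import defaultdict
from multiprocessing import Pool


def process_file(path):
    """Group the rows of one CSV file by their first field (hypothetical key)."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:                      # skip blank lines
                groups[row[0]].append(row)
    return dict(groups)


def merge(partials):
    """Combine per-file results in the parent process."""
    merged = defaultdict(list)
    for groups in partials:
        for key, rows in groups.items():
            merged[key].extend(rows)
    return dict(merged)


def process_all(paths, workers=4):
    """Spread the files over a pool of worker processes, then merge."""
    with Pool(processes=workers) as pool:
        return merge(pool.map(process_file, paths))


if __name__ == "__main__":
    import sys
    if sys.argv[1:]:
        result = process_all(sys.argv[1:])
        print({key: len(rows) for key, rows in result.items()})
```

Each worker process builds its own lists and only the parent merges them, so no list is ever shared between processes and no locking is needed.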
Sep 15 '08 #5
On 15 Sep, 18:46, "psaff...@googlemail.com" <psaff...@googlemail.com>
wrote:
> I have some file processing code that has to deal with quite a lot of
> data. I have a quad core machine, so I wondered whether I could take
> advantage of some parallelism.
Take a look at this page for some solutions:

http://wiki.python.org/moin/ParallelProcessing

In addition, Jython and IronPython provide the ability to use threads
more effectively.
> Essentially, I have a number of CSV files, let's say 100, each
> containing about 8000 data points. For each point, I need to look up
> some other data structures (generated in advance) and append the point
> to a relevant list. I wondered whether I could get each core to handle
> a few files each. I have a few questions:
>
> - Am I actually going to get any speed up from parallelism, or is it
> likely that most of my processing time is spent reading files? I guess
> I can profile for this?
There are a few things to consider, and it is useful to see where most
of the time is being spent. One interesting exercise called "Wide
Finder 2", run by Tim Bray (see [1] for more details), investigated
the benefits of log file processing using many concurrent processes,
but it was often argued that the greatest speed-up over a naive serial
implementation could be achieved by optimising the input and output
and by choosing the right parsing strategy.

Paul

[1] http://www.tbray.org/ongoing/When/20.../Wide-Finder-2
Sep 16 '08 #6
Many very helpful replies, which I will now mull over.

Thanks,

Peter
Sep 16 '08 #7
