
parallel csv-file processing

Currently I am faced with a large computation task which works on a
huge CSV file. As a test I am working on a very small subset which
already contains 2E6 records. The task itself allows the file to be
split, however, as each computation only involves one line. The
application performing the computation exists already, but it was
never meant to run on such a big dataset.

One thing that is clear is that it will take a while to compute all
this, so a distributed approach is probably a good idea. There are a
couple of options for this:

Scenario A (file is split manually into smaller parts):
1) Fire up an openmosix/kerrighed cluster, and run one process for
each file part.

Scenario B (file is "split" using the application itself):
2) Again with an openmosix/kerrighed cluster, but only one instance of
the application is run, using parallelpython.
3) Using parallelpython without a cluster, but running ppserver.py on
each node.

The second case looks most interesting as it is quite flexible. In
this case, however, I would need to address subsets of the CSV file,
and the default csv.reader class does not allow random access to the
file (or jumping to a specific line).

What would be the most efficient way to subset a CSV file? For
example:

f1 = job_server.submit(calc_scores, datafile[0:1000])
f2 = job_server.submit(calc_scores, datafile[1000:2000])
f3 = job_server.submit(calc_scores, datafile[2000:3000])
....

and so on

Obviously this won't work as you cannot access a slice of a csv-file.
Would it be possible to subclass the csv.reader class in a way that
you can somewhat efficiently access a slice? Jumping backwards is not
really necessary, so it's not really random access.
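
Something like this forward-only helper is what I am picturing
(untested sketch; "data.csv" is a placeholder, and the catch is that
every call still reads and throws away all the rows before `start`,
which is exactly the inefficiency I would like to avoid):

import csv
import itertools

def csv_slice(path, start, stop):
    # Forward-only "slice": skip the first `start` rows and return rows
    # start..stop-1.  islice still has to read everything before `start`.
    f = open(path, "rb")    # binary mode for the csv module
    try:
        reader = csv.reader(f)
        return list(itertools.islice(reader, start, stop))
    finally:
        f.close()

# e.g. rows = csv_slice("data.csv", 1000, 2000)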

The obvious way is to do the following:

buffer = []
for line in reader:
    buffer.append(line)
    if len(buffer) == 1000:
        f = job_server.submit(calc_scores, buffer)
        buffer = []

f = job_server.submit(calc_scores, buffer)
buffer = []

but would this not kill my memory if I start loading bigger slices
into the "buffer" variable?

Nov 9 '07 #1
3 Replies


Michel Albert <ex****@gmail.com> writes:
> buffer = []
> for line in reader:
>     buffer.append(line)
>     if len(buffer) == 1000:
>         f = job_server.submit(calc_scores, buffer)
>         buffer = []
>
> f = job_server.submit(calc_scores, buffer)
> buffer = []
>
> but would this not kill my memory if I start loading bigger slices
> into the "buffer" variable?
Why not pass the disk offsets to the job server (untested):

n = 1000
for i, _ in enumerate(reader):
    if i % n == 0:
        job_server.submit(calc_scores, reader.tell(), n)

the remote process seeks to the appropriate place and processes n lines
starting from there.
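
Spelled out a little further, that could look roughly like this (still
untested; it assumes every node can open the same file path, that no
quoted field contains an embedded newline, and it collects the offsets
by summing line lengths since csv.reader has no tell() of its own):

import csv
from itertools import islice

def line_offsets(path, every=1000):
    # Byte offset of every `every`-th line start, found by summing line
    # lengths instead of mixing file iteration with tell().
    offsets, offset = [], 0
    f = open(path, "rb")
    try:
        for i, line in enumerate(f):
            if i % every == 0:
                offsets.append(offset)
            offset += len(line)
    finally:
        f.close()
    return offsets

def calc_scores_at(path, offset, n):
    # Worker side: seek to a known line start and process the next n rows.
    f = open(path, "rb")
    try:
        f.seek(offset)
        return [len(row) for row in islice(csv.reader(f), n)]  # placeholder scoring
    finally:
        f.close()

# submitting side, roughly:
# for off in line_offsets("data.csv", 1000):
#     job_server.submit(calc_scores_at, ("data.csv", off, 1000))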
Nov 9 '07 #2

On Fri, 09 Nov 2007 02:51:10 -0800, Michel Albert wrote:
> Obviously this won't work as you cannot access a slice of a csv-file.
> Would it be possible to subclass the csv.reader class in a way that
> you can somewhat efficiently access a slice?
An arbitrary slice? I guess not, as all the records before it must
have been read first, because the lines are not all equally long.
> The obvious way is to do the following:
>
> buffer = []
> for line in reader:
>     buffer.append(line)
>     if len(buffer) == 1000:
>         f = job_server.submit(calc_scores, buffer)
>         buffer = []
With `itertools.islice()` this can be written as:

while True:
    buffer = list(itertools.islice(reader, 1000))
    if not buffer:
        break
    f = job_server.submit(calc_scores, buffer)
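
Building on that, the jobs could be collected so the results come back
in the original order (untested sketch; `job_server` and `calc_scores`
are meant to be the ones from the original post, and I am assuming pp
jobs are callables that block until their result is ready):

import csv
import itertools

def submit_in_chunks(job_server, calc_scores, path, chunk_size=1000):
    # Stream the file in fixed-size chunks so only one chunk is held
    # locally at a time, and keep the job handles for later collection.
    jobs = []
    f = open(path, "rb")
    try:
        reader = csv.reader(f)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            jobs.append(job_server.submit(calc_scores, (chunk,)))
    finally:
        f.close()
    return [job() for job in jobs]  # each pp job blocks until its result is ready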
Nov 9 '07 #3

On 9 Nov, 12:02, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Why not pass the disk offsets to the job server (untested):
>
> n = 1000
> for i, _ in enumerate(reader):
>     if i % n == 0:
>         job_server.submit(calc_scores, reader.tell(), n)
>
> the remote process seeks to the appropriate place and processes n lines
> starting from there.
This is similar to a lot of the smarter solutions for Tim Bray's "Wide
Finder" - a problem apparently in the same domain. See here for more
details:

http://www.tbray.org/ongoing/When/20...20/Wide-Finder

Lots of discussion about more than just parallel processing/
programming, too.

Paul

Nov 9 '07 #4
