regd efficient methods to manipulate *large* files

Hi:

This question is not directed "entirely" at python only. But since
I want to know how to do it in python, I am posting here.
I am constructing a huge matrix (m x n), whose columns n are stored in
smaller files. Once I read m such files, my matrix is complete. I want to
pass this matrix as an input to another script of mine (I just have the
binary.) Currently, the script reads a file (which is nothing but the
matrix) and processes it. Is there any way of doing this in memory,
without writing the matrix onto the disk?

Since I have to repeat my experimentation for multiple iterations, it
becomes expensive to write the matrix onto the disk.

Thanks in advance. Help appreciated.

-Madhu
May 1 '06 #1
Madhusudhanan Chandrasekaran wrote:
> Hi:
>
> This question is not directed "entirely" at python only. But since
> I want to know how to do it in python, I am posting here.
> I am constructing a huge matrix (m x n), whose columns n are stored in
> smaller files. Once I read m such files, my matrix is complete. I
> want to pass this matrix as an input to another script of mine (I
> just have the binary.) Currently, the script reads a file (which is
> nothing but the matrix) and processes it. Is there any way of doing
> this in memory, without writing the matrix onto the disk?
>
> Since I have to repeat my experimentation for multiple iterations, it
> becomes expensive to write the matrix onto the disk.
>
> Thanks in advance. Help appreciated.
>
> -Madhu


Basically, you're asking about inter-process communication (IPC), for
which Python provides several interfaces to mechanisms provided by the
operating system (whatever that may be). Here are a couple of commonly
used methods:

Redirected I/O

Have a look at the popen functions in the os module, or better still
the subprocess module (which is a higher level interface to the same
functionality). Specifically, the "Replacing the shell pipe line"
example in the subprocess module's documentation should be interesting:

output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

Here, the stdout of the "dmesg" process has been redirected to the
stdin of the "grep" process. You could do something similar with your
two scripts: e.g., the first script simply writes the content of the
matrix in some format to stdout (e.g. print, sys.stdout.write), while
the second script reads the content of the matrix from stdin (e.g.
raw_input, sys.stdin.read). Here are some brutally simplistic scripts
that demonstrate the method:

in.py
=====
#!/usr/bin/env python
#
# I read integers from stdin until I encounter 0

import sys

while True:
    i = int(sys.stdin.readline())
    print("Read %d from stdin" % i)
    if i == 0:
        break

out.py
======
#!/usr/bin/env python
#
# I write some numbers to stdout

for i in [1, 2, 3, 4, 5, 0]:
    print(i)

run.py
======
#!/usr/bin/env python
#
# I run out.py and in.py with a pipe between them, capture the
# output of in.py and print it

from subprocess import Popen, PIPE

process1 = Popen(["./out.py"], stdout=PIPE)
process2 = Popen(["./in.py"], stdin=process1.stdout, stdout=PIPE,
                 universal_newlines=True)
# Close our copy of the pipe so out.py sees a broken pipe if in.py exits early
process1.stdout.close()
output = process2.communicate()[0]

print(output)
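
Applied to the matrix problem, the same idea might look like the sketch below (Python 3.7 or later). The helper names matrix_to_text and pipe_matrix are made up for illustration, and the text format (one row per line, space-separated) is an assumption; the real binary defines what it expects, and it has to be willing to read from stdin. A child Python process stands in for the binary so that the sketch runs as is.

```python
import subprocess
import sys


def matrix_to_text(matrix):
    """Serialise a list of rows as text: one row per line, space-separated.

    The format is an assumption made for this sketch.
    """
    return "".join(" ".join(str(v) for v in row) + "\n" for row in matrix)


def pipe_matrix(matrix, command):
    """Run `command`, feed the matrix to its stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=matrix_to_text(matrix),
        stdout=subprocess.PIPE,
        text=True,    # exchange str rather than bytes with the child
        check=True,   # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Hypothetical stand-in for the real binary: prints the sum of each row.
CONSUMER = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(sum(int(x) for x in line.split()))\n",
]

if __name__ == "__main__":
    print(pipe_matrix([[1, 2, 3], [4, 5, 6]], CONSUMER))
```

Note that the matrix never touches the disk here, but it is still copied through the pipe once per run.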

Sockets

Another form of IPC uses sockets to communicate between two processes
(see the socket module or one of the higher level modules like
SocketServer). Hence, the second process would listen on a port
(presumably on the localhost interface, although there's no reason it
couldn't listen on a LAN interface for example), and the first process
connects to that port and sends the matrix data across it to the second
process.
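
A minimal sketch of the sockets approach, assuming Python 3 and a plain text format (one row per line). The names receive_all, send_matrix and demo are invented for this example, and a thread stands in for the second process so that everything runs in one file; a real receiver would be a separate program that knows how to read from a socket.

```python
import socket
import threading


def receive_all(server_sock, result):
    """Accept one connection and read until the sender closes it."""
    conn, _ = server_sock.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # empty read: the peer closed the connection
                break
            chunks.append(data)
    result.append(b"".join(chunks).decode("ascii"))


def send_matrix(matrix, address):
    """Connect to `address` and send the matrix as text (assumed format)."""
    payload = "".join(" ".join(str(v) for v in row) + "\n" for row in matrix)
    with socket.create_connection(address) as sock:
        sock.sendall(payload.encode("ascii"))


def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    received = []
    # The thread plays the role of the second (listening) process.
    listener = threading.Thread(target=receive_all, args=(server, received))
    listener.start()
    send_matrix([[1, 2], [3, 4]], server.getsockname())
    listener.join()
    server.close()
    return received[0]


if __name__ == "__main__":
    print(demo())
```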

Summary

Given that your second script currently reads a file containing the
complete matrix (if I understand your post correctly), it's probably
easiest for you to use the Redirected I/O method (as it's very similar
to reading a file, although there are some differences, and sometimes
one must be careful about closing pipe ends to avoid deadlocks).
However, the sockets method has the advantage that you can easily move
one of the processes onto a different machine.

There are other methods of IPC (for example, shared memory: see the
mmap module); however, the two mentioned above are available on most
platforms, whereas others may be specific to a given platform, or have
platform-specific subtleties (for example, mmap is only available on
Windows and UNIX, and has a slightly different constructor on each).
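
For what it's worth, here is a rough sketch of the mmap idea in Python 3, using a file-backed mapping because that form of the constructor works on both Windows and UNIX. The binary layout (two unsigned ints for the dimensions, then the values as doubles) and the names write_matrix, read_matrix and demo are assumptions made for the example. In real use the writer and the reader would be two processes mapping the same file; here both run in one script.

```python
import mmap
import os
import struct
import tempfile


def write_matrix(path, matrix):
    """Store rows, cols and the values (as doubles) through a memory map."""
    rows, cols = len(matrix), len(matrix[0])
    flat = [float(v) for row in matrix for v in row]
    data = struct.pack("<2I%dd" % len(flat), rows, cols, *flat)
    with open(path, "wb") as f:
        f.write(b"\0" * len(data))  # an empty file cannot be mapped
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            m[:] = data


def read_matrix(path):
    """Map the file read-only and rebuild the matrix as a list of rows."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            rows, cols = struct.unpack_from("<2I", m, 0)
            flat = struct.unpack_from("<%dd" % (rows * cols), m, 8)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "matrix.bin")
        write_matrix(path, [[1, 2], [3, 4.5]])
        return read_matrix(path)


if __name__ == "__main__":
    print(demo())
```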

HTH,

Dave.

--

May 1 '06 #2
I take it that you have a binary that takes a file name and
processes the file contents.
Sometimes Unix binaries are written so that a file name of '-' (just a
dash) causes them to take input from stdin, so that the piping mentioned
in a previous reply could work.
On some of our Unix systems /tmp is set up as a 'virtual disk'. It
quacks like a normal disk filesystem but is actually implemented in
RAM/virtual memory, and is faster than normal disk access.
(Unfortunately we are not allowed to save multi-gigabyte files there as
it affects other aspects of the OS.)

Maybe you can mount a similar filesystem if you have the RAM.
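
If such a filesystem is available, the Python side only needs to pick the right directory. The sketch below (Python 3) assumes a RAM-backed mount at /dev/shm, which is common on Linux but not guaranteed anywhere; it falls back to the default temporary directory otherwise. The name write_matrix_file and the text format are made up for the example.

```python
import os
import tempfile

# Assumption: a RAM-backed (tmpfs) mount lives here, as on many Linux systems.
RAM_DIR = "/dev/shm"


def write_matrix_file(matrix, ram_dir=RAM_DIR):
    """Write the matrix to a temp file, preferring the RAM-backed directory.

    Returns the path, which can be passed to the binary as its file name.
    The caller is responsible for removing the file afterwards.
    """
    usable = os.path.isdir(ram_dir) and os.access(ram_dir, os.W_OK)
    directory = ram_dir if usable else None  # None: default temp directory
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory)
    with os.fdopen(fd, "w") as f:
        for row in matrix:
            f.write(" ".join(str(v) for v in row) + "\n")
    return path


if __name__ == "__main__":
    matrix_path = write_matrix_file([[1, 2], [3, 4]])
    print(matrix_path)
    os.remove(matrix_path)
```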

-- Pad.

May 1 '06 #3
