Bytes | Software Development & Data Engineering Community

"Do this, and come back when you're done"

I have the following function which generates MD5 hashes for files on a
local and remote server. The remote server has a little applet that
runs from inetd and generates an MD5 hash given the file name.

The problem is that it takes 2+ minutes to generate the MD5 hash, so
this function takes about 5 minutes every time it is called. Since the
first MD5 hash is generated on a remote machine, the local machine does
nothing but wait for half that time.

Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results? This would cut execution time in half more or less.

# checkMD5
def checkMD5(fileName, localDir):
    # get remote hash
    Socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    Socket.connect((MD5server, 888))
    # throw away the ID string
    Socket.recv(256)
    Socket.send(fileName + '\n')
    remoteMD5hash = Socket.recv(256)
    Socket.close()

    # get local hash
    try:
        # binary mode, so line-ending translation can't corrupt the hash
        file = open(makeMovieName(localDir, fileName), 'rb')
    except IOError:
        localMD5hash = '0'
    else:
        hasher = md5.new()
        while True:
            chunk = file.read(1024)
            if not chunk:
                break
            hasher.update(chunk)
        file.close()
        localMD5hash = hasher.hexdigest()
    if Debug: print "local:", localMD5hash, "remote:", remoteMD5hash
    return localMD5hash.strip() == remoteMD5hash.strip()

-Kamus

--
o__ | If you're old, eat right and ride a decent bike.
,>/'_ | Q.
(_)\(_) | Usenet posting

Jul 18 '05 #1
11 replies, 2260 views
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results? This would cut execution time in half more or less.


Sure, use the threading module. Think about another aspect of what
you're doing though. You're comparing the md5's of a local and remote
copy of the same file, to see if they're the same. Are you trying to
detect malicious tampering? If someone tampered with one of the
files, how do you know that person can't also intercept your network
connection and send you the "correct" md5, so you won't detect the
tampering? Or for that matter, do you know that the remote copy of
the program itself hasn't been tampered with?
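A minimal sketch of that approach, using the threading module to run the remote half in the background while the local half proceeds. Here hashlib stands in for the old md5 module, and md5_hex is a placeholder for the real socket round-trip and file hashing:

```python
import hashlib
import threading

def md5_hex(data):
    # Placeholder for the slow work: the real code would hash a file
    # locally, or do the socket round-trip to the remote applet.
    return hashlib.md5(data).hexdigest()

def check_md5_parallel(local_data, remote_data):
    results = {}

    def remote_worker():
        # Runs in the background thread while the main thread hashes locally.
        results['remote'] = md5_hex(remote_data)

    worker = threading.Thread(target=remote_worker)
    worker.start()                          # kick off the "remote" half
    results['local'] = md5_hex(local_data)  # do the "local" half meanwhile
    worker.join()                           # wait for the background half
    return results['local'] == results['remote']
```

With the two halves overlapped, total time is roughly the slower of the two instead of their sum.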
Jul 18 '05 #2
Paul Rubin wrote:
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results? This would cut execution time in half more or less.

Sure, use the threading module.


OK, I'll read up on that. I've written gobs of scientific type code,
but this OS stuff is new to me.
Think about another aspect of what
you're doing though. You're comparing the md5's of a local and remote
copy of the same file, to see if they're the same. Are you trying to
detect malicious tampering?


No, actually, both machines are under my control (and in my house). I'm
slinging large (1GB MOL) files around on an unreliable, slow wireless
network. I am trying to detect an incomplete copy across the network.
The local machine is the video player and the remote machine is the
archive server. My kids have a habit of just shutting down the video
server, resulting in incomplete transfers to the archives.

If it's appropriate for this newsgroup, I'd like to post the entire
effort for comments (it's my first bit of Python code.) So far, Python
has been the easiest language to learn I've ever come across. I tried
learning Perl, and it was a disaster... too convoluted. Python is a
breath of fresh air. Also, the docs and support here are excellent.
:-) My thanks to all the volunteers who put in time to build Python.

-Kamus
--
o__ | If you're old, eat right and ride a decent bike.
,>/'_ | Q.
(_)\(_) | Usenet posting

Jul 18 '05 #3
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
No, actually, both machines are under my control (and in my house).
I'm slinging large (1GB MOL) files around on an unreliable, slow
wireless network. I am trying to detect an incomplete copy across the
network. The local machine is the video player and the remote machine
is the archive server. My kids have a habit of just shutting down the
video server, resulting in incomplete transfers to the archives. If
it's appropriate for this newsgroup, I'd like to post the entire
effort for comments (it's my first bit of Python code.) So far, Python
has been the easiest language to learn I've ever come across. I tried
learning Perl, and it was a disaster... too convoluted. Python is a
breath of fresh air. Also, the docs and support here are
excellent. :-) My thanks to all the volunteers who put in time to
build Python.


Why don't you look at the rsync program? It brings two machines into
sync with each other by automatically detecting differences between
files and sending only the deltas over the network.
Jul 18 '05 #4
Paul Rubin wrote:
Why don't you look at the rsync program? It brings two machines into
sync with each other by automatically detecting differences between
files and sending only the deltas over the network.


Well, the purpose of this whole project was to learn python. I did look
at the pysync modules (rsync written in python), but it's too
complicated for me at the moment.

-Kamus

--
o__ | If you're old, eat right and ride a decent bike.
,>/'_ | Q.
(_)\(_) | Usenet posting

Jul 18 '05 #5
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> wrote:
Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results?


Yup. Two ways in fact.

The traditional way would be to fork another process to do the work and
have the parent process wait for the child to finish. You'll need to
use the fork() and exec() functions that can be found in the os module.

The other way would be to do something similar, but with threads instead
of processes. The basic flow is the same; you create a thread, have
that thread do the stuff that takes a long time, and then rejoin with
the primary thread. Of course (just like with child processes), you
could have multiple of these running at the same time doing different
parts of a parallelizable job. Take a look at the threading module.

I'm intentionally not including any sample code here, because the
possibilities are numerous. Exactly how you do it depends on many
factors. I'm guessing that doing it with threads is what you really
want to do, so my suggestion would be to start by reading up on the
threading module and playing with some examples to get a feel for how
it works. Working with threads is becoming more and more mainstream as
more operating systems and languages provide support for it and the
programming community at large becomes more familiar and comfortable
with the issues involved.
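For reference, the fork-and-wait flow can be sketched in a few lines. This is a hedged illustration, assuming the slow work can be expressed as a Python callable rather than a separate program to exec (os.fork and os.waitpid are Unix-only):

```python
import os

def run_in_child(work):
    # Fork a child process to run work(); the parent waits for it to finish.
    pid = os.fork()
    if pid == 0:
        # Child: do the slow work, then exit with its (small integer) result.
        code = work()
        os._exit(code)
    # Parent: block until the child exits, then recover its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

A real version would fork once per slow half and waitpid for both, or pass richer results (like a hash string) back through a pipe instead of an exit code.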
Jul 18 '05 #6
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results? This would cut execution time in half more or less.


Why don't you use Twisted? It's a networking framework with a lot of
protocols (and you can define your own), and it's based on async
sockets, which lets you write programs that avoid threads most of the
time.

www.twistedmatrix.com

I'm sure you will find out that's the best thing ever done for python
:)

--
Valentino Volonghi, Regia SpA, Milan
Linux User #310274, Gentoo Proud User
Jul 18 '05 #7
"Donn Cave" <do**@drizzle.com> writes:
Yes. I may be missing something here, because the followups
I have seen strike me as somewhat misguided, if they're not
just fooling with you. You already have two independent threads
or processes here, one on each machine. All you need to do is
take the results from the remote machine AFTER the local computation.
Move the line that says "remoteMD5hash = Socket.recv(256)" to
after the block that ends with "localMD5hash = hasher.hexdigest()".
No?


Can the remote process time out if the local side takes too long to
read from the socket? That could happen if the two machines aren't
the same speed.
Jul 18 '05 #8
Quoth Paul Rubin <http://ph****@NOSPAM.invalid>:
....
| Can the remote process time out if the local side takes too long to
| read from the socket? That could happen if the two machines aren't
| the same speed.

I wouldn't expect so. I'm no expert in such things, but I would
expect the remote process to return from send(), and exit; the
data would be waiting in a kernel mbuf on the local side.

Donn Cave, do**@drizzle.com
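The reordering Donn describes can be demonstrated end to end with a socketpair standing in for the inetd applet. Everything here is illustrative (hashlib replaces the old md5 module): the request goes out first, the local hash is computed while the "remote" side works, and only then is the reply read.

```python
import hashlib
import socket
import threading

def fake_applet(conn, data):
    # Stands in for the remote MD5 applet: read the request, reply with a hash.
    conn.recv(256)                                    # the file-name request
    conn.sendall(hashlib.md5(data).hexdigest().encode())

def check_md5_overlapped(data):
    local_end, remote_end = socket.socketpair()
    server = threading.Thread(target=fake_applet, args=(remote_end, data))
    server.start()
    local_end.sendall(b'movie.avi\n')                 # ask for the remote hash first
    local_hash = hashlib.md5(data).hexdigest()        # hash locally while it works
    remote_hash = local_end.recv(256).decode()        # only now collect the reply
    server.join()
    local_end.close()
    remote_end.close()
    return local_hash == remote_hash
```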
Jul 18 '05 #9
[Kamus of Kadizhar]
So far, python
has been the easiest language to learn I've ever come across. I tried
learning Perl, and it was a disaster... too convoluted. Python is a
breath of fresh air. Also, the docs and support here are excellent.
:-) My thanks to all the volunteers who put in time to build python.


+1 QOTW.

regards,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #10
Valentino Volonghi aka Dialtone wrote:

Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
Is there any way to rewrite each half of the function to run in the
background, so to speak, and then have a master process that waits on
the results? This would cut execution time in half more or less.


Why don't you use Twisted? It's a networking framework with a lot of
protocols (and you can define your own), and it's based on async
sockets, which lets you write programs that avoid threads most of the
time.

www.twistedmatrix.com

I'm sure you will find out that's the best thing ever done for python
:)


I second that advice, and will also mention that it would avoid the sort
of bug that I pointed out in your first post, involving the simplistic
.recv(256) calls you are doing. Twisted would make the code much more
readable *and* reliable. Well worth learning. If you're doing this
just to learn Python, you could do worse than get it working with Twisted,
then go poking into the Twisted internals to see how *it* works instead.
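The bug being referred to is that TCP has no message boundaries: a single recv(256) may return only part of the reply, or fragments of two replies glued together. A hedged sketch of the usual fix, reading until a delimiter (assuming the applet terminates its reply with a newline):

```python
def recv_line(sock, limit=4096):
    # Accumulate recv() chunks until a newline arrives, or until the
    # peer closes the connection; a single recv() is not enough.
    buf = b''
    while not buf.endswith(b'\n') and len(buf) < limit:
        chunk = sock.recv(256)
        if not chunk:        # connection closed by the peer
            break
        buf += chunk
    return buf.rstrip(b'\r\n')
```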
-Peter
Jul 18 '05 #11
Kamus of Kadizhar <ya*@NsOeSiPnAeMr.com> writes:
No, actually, both machines are under my control (and in my house).
I'm slinging large (1GB MOL) files around on an unreliable, slow
wireless network. I am trying to detect an incomplete copy across the
network.


If you're checking for incomplete copies, then md5 is overkill. Just
make sure the file sizes match.

If you're checking for corruption, then maybe doing an md5 sum would
help, but again, you only need to do that if the files are the same
size.
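The size check is indeed the cheap first pass. A minimal sketch, assuming the remote side can report its byte count (how it does so is a hypothetical protocol detail):

```python
import os

def sizes_match(local_path, remote_size):
    # An incomplete transfer shows up as a short file, so compare byte
    # counts before paying for a full checksum.
    try:
        return os.path.getsize(local_path) == remote_size
    except OSError:          # local copy missing entirely
        return False
```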

The Python Cookbook site has a recipe that lets you farm out "jobs" to
"worker threads", which might help you if you do go with checksumming
every file:

http://aspn.activestate.com/ASPN/Coo.../Recipe/203871

Nick

--
# sigmask || 0.2 || 20030107 || public domain || feed this to a python
print reduce(lambda x,y:x+chr(ord(y)-1),' Ojdl!Wbshjti!=obwAcboefstobudi/psh?')
Jul 18 '05 #12

This discussion thread is closed

Replies have been disabled for this discussion.
