
comparing huge files

hi
I wrote some code to compare two files. One is the base file; the other I got from somewhere else, and I need to compare it against the base.

e.g. base file:
abc
def
ghi

e.g. the other file:
abc
def
ghi
jkl

After the compare, the base file should be overwritten with "jkl". Also, both files tend to grow beyond 20 MB.

Here is my code...using difflib.

import os
import re
import difflib

pat = re.compile(r'^\+')   ## matches the lines difflib marks as added ('+ ...')

def difference(filename, basename):
    a = open(basename).readlines()
    b = open(filename).readlines()
    d = difflib.Differ()
    diff = list(d.compare(a, b))
    if diff:
        os.remove(basename)
        o = open(basename, "w")   ## "aU" mixed append with the read-only 'U' flag; plain "w" is what's wanted
        for i in diff:
            if pat.search(i):
                o.write(i[2:])    ## drop the '+ ' prefix; lstrip("\+ ") would also eat '+' or ' ' belonging to the data
        o.close()                 ## the new base is written
    return open(basename).readlines()

Whenever the two files get very large, the comparison is very slow. Any good advice to speed things up? I thought of dropping the readlines() calls and comparing line by line instead. Would that be better?
thanks

Mar 16 '06 #1
3 Replies


s9************@yahoo.com wrote:
> i wrote some code to compare 2 files. One is the base file, the other
> file i got from somewhere. I need to compare this file against the
> base [...]
> Whenever the 2 files get very large, i find that it's very slow
> comparing...any good advice to speed things up?

It seems like you want a new base that contains only those lines
contained in 'filename' that are not contained in 'basename', where
'basename' is an ordered subset of 'filename'. In other words, the
'filename' file has all of the lines of 'basename', in order, somewhere,
but also has some additional lines. Is that correct? difflib looks
to be overkill for this. Here is a suggestion:
basefile = open(basename)
newfile = open(filename)

newbase = open('tmp.txt', 'w')

for baseline in basefile:
    for newline in newfile:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# once the base lines are exhausted, everything left in newfile is new
for newline in newfile:
    newbase.write(newline)

for afile in (basefile, newfile, newbase):
    afile.close()
If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet, because then you have a computationally intensive problem.
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Mar 16 '06 #2

thanks for the reply,
I have used another method to solve my problem, i.e.:
1) get the total line count of the first file
2) write this total count out, e.g. to basecnt
3) get the other file and take its total line count, e.g. filecnt
4) if filecnt > basecnt, read in the lines file[basecnt:filecnt]
5) if filecnt < basecnt, overwrite the original basecnt and start over
again.

basically, the problem domain is that I want to get the most recent records
from a log file for review every 3 hours, so this log file will keep
increasing or accumulating. (A sketch of this approach follows below.)
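
For what it's worth, here is a minimal sketch of that counting approach; the function name new_records and the counter file name basecnt are just illustrative (not from the thread), and it assumes the whole log fits in memory:

def new_records(logname, cntname="basecnt"):
    # read the line count saved by the previous run (0 on the first run)
    try:
        basecnt = int(open(cntname).read())
    except (IOError, ValueError):
        basecnt = 0

    lines = open(logname).readlines()
    filecnt = len(lines)
    if filecnt < basecnt:
        basecnt = 0            # log was truncated or rotated: start over

    # remember the current count for the next run
    open(cntname, "w").write(str(filecnt))
    return lines[basecnt:]     # the records added since the last run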

Mar 16 '06 #3


<s9************@yahoo.com> wrote in message
news:11**********************@e56g2000cwe.googlegroups.com...
> thanks for the reply,
> I have used another method to solve my problem [...]
> basically, the problem domain is i want to get the most current records
> from a log file to review after every 3 hours. so this log file will
> increase or accumulate.


I did this:

import os

fp = os.popen('/usr/sbin/logtail /var/log/syslog')
loglines = fp.readlines()

.... pyparsing ... stuff .... from loglines
;-)

Python is maybe overkill too - have "cron" call "logtail" and pipe the
output wherever, for example:
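
A hypothetical crontab entry along those lines (the review-file path is made up):

# every 3 hours, append whatever syslog lines are new to a review file
0 */3 * * * /usr/sbin/logtail /var/log/syslog >> /var/tmp/syslog-review.txt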

PS:

"logtail" is very simple: it works by maintaining a "bookmark" from the
last read, updated each time the file is read (i.e. on each call). It
would be a very easy thing to implement in Python. On Linux/UNIX,
syslog + logutils can do a lot of this work through configuration alone
(but you did not say whether you are on unix).
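
For instance, a rough sketch of that bookmark idea in Python; the function and bookmark-file names here are made up, and it tracks a byte offset rather than logtail's exact on-disk format:

def tail_since_last(logname, bookmark="logtail.offset"):
    # read the byte offset saved by the previous run (0 on the first run)
    try:
        offset = int(open(bookmark).read())
    except (IOError, ValueError):
        offset = 0

    log = open(logname)
    log.seek(0, 2)             # seek to the end to learn the current size
    if log.tell() < offset:
        offset = 0             # log was rotated or truncated: start over
    log.seek(offset)
    lines = log.readlines()    # everything appended since the bookmark

    # save the new bookmark for the next run
    open(bookmark, "w").write(str(log.tell()))
    log.close()
    return lines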
Mar 17 '06 #4
