  #1  
March 16th, 2006, 02:25 AM
s99999999s2003@yahoo.com
Guest
comparing huge files

Hi,
I wrote some code to compare two files. One is the base file; the other
is a file I got from somewhere else. I need to compare this file against
the base, e.g.

base file:
abc
def
ghi

the other file:
abc
def
ghi
jkl

After the compare, the base file will be overwritten with "jkl". Both
files also tend to grow beyond 20MB.

Here is my code, using difflib:

import difflib
import os
import re

pat = re.compile(r'^\+')  ## I want to get rid of the '+' from the difflib output...

def difference(filename, basename):
    base = open(basename)
    a = base.readlines()
    base.close()
    input = open(filename)
    b = input.readlines()
    input.close()
    d = difflib.Differ()
    diff = list(d.compare(a, b))
    if len(diff) > 0:
        os.remove(basename)
        o = open(basename, "w")
        for i in diff:
            if pat.search(i):
                i = i[2:]  ## drop the leading "+ " marker
                o.write(i)  ## write a new base file...
        o.close()
    g = open(basename)
    return g.readlines()

Whenever the two files get very large, the comparison is very slow.
Any good advice on speeding things up? I thought of dropping
readlines() and comparing line by line instead. Is that a better way?
Thanks

  #2  
March 16th, 2006, 05:05 AM
James Stroud
Guest
Re: comparing huge files


It seems like you want a new base that contains only those lines
contained in 'filename' that are not contained in 'basename', where
'basename' is an ordered subset of 'filename'. In other words, the
'filename' file has all of the lines of 'basename' in order somewhere,
but 'filename' has some additional lines. Is that correct? difflib looks
to be overkill for this. Here is a suggestion:


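basefile = open(basename)
newfile = open(filename)
baseiter = iter(basefile)  # file objects iterate line by line
newiter = iter(newfile)

newbase = open('tmp.txt', 'w')

for baseline in baseiter:
    # copy lines from 'filename' until the current base line turns up
    for newline in newiter:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# anything left in 'filename' after the last base line is new as well
newbase.writelines(newiter)

for afile in (basefile, newfile, newbase): afile.close()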


If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet, because you have a computationally intensive problem.


James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
  #3  
March 16th, 2006, 10:45 AM
s99999999s2003@yahoo.com
Guest
Re: comparing huge files

Thanks for the reply.
I have used another method to solve my problem, i.e.
1) get the total line count of the first file
2) save this count, e.g. as basecnt
3) get the other file and its total line count, e.g. filecnt
4) if filecnt > basecnt, read in the values from file[basecnt:filecnt]
5) if filecnt < basecnt, overwrite the original basecnt and start over
again.

Basically, the problem domain is that I want to review the most recent
records from a log file every 3 hours, and this log file keeps growing.
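
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name read_new_lines and the side file that stores the previous count are hypothetical, and "start over" in step 5 is read here as "treat the whole file as new". It still loads the whole log with readlines(), so it saves the comparison work rather than the memory.

```python
def read_new_lines(logname, countname):
    """Return the lines added to logname since the previous call.

    countname is a hypothetical side file holding the last line count.
    """
    # Load the line count saved by the previous run (0 on the first run).
    try:
        with open(countname) as f:
            basecnt = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        basecnt = 0

    with open(logname) as f:
        lines = f.readlines()
    filecnt = len(lines)

    # Assumption: a shorter file means the log was replaced, so start over.
    if filecnt < basecnt:
        basecnt = 0

    # Remember the new count for the next run.
    with open(countname, "w") as f:
        f.write(str(filecnt))

    return lines[basecnt:filecnt]
```

Called every 3 hours with the same two paths, this would return only the records appended since the last call.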

  #4  
March 17th, 2006, 09:55 AM
Frithiof Andreas Jensen
Guest
Re: comparing huge files



I did this:

import os

# run logtail and collect whatever was added since the last call
fp = os.popen('/usr/sbin/logtail /var/log/syslog')
loglines = fp.readlines()

.... pyparsing ... stuff .... from loglines
;-)

Python is maybe overkill too - have "cron" call "logtail" and pipe the
output wherever?

PS:

"logtail" is very simple: it maintains a "bookmark" from the last read,
which is updated each time the file is read (i.e. on each call). It is
probably a very easy thing to implement in Python. On Linux/UNIX,
syslog+logutils can do a lot of work just by configuration (but you did
not say you are on Unix).
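
A minimal Python sketch of that bookmark idea might look like the one below. It is not how logtail itself is implemented; the function name tail_since_bookmark and the bookmark file are hypothetical, and the bookmark here is simply a byte offset stored as text. The log is opened in binary mode so that seek() and tell() work on plain byte offsets.

```python
import os


def tail_since_bookmark(logname, bookmarkname):
    """Return the lines added to logname since the bookmark was last updated."""
    # Read the saved byte offset (0 if there is no usable bookmark yet).
    try:
        with open(bookmarkname) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0

    # Assumption: a file smaller than the bookmark was rotated or truncated.
    if os.path.getsize(logname) < offset:
        offset = 0

    # Jump straight to the bookmark and read only what follows it.
    with open(logname, "rb") as log:
        log.seek(offset)
        data = log.readlines()
        offset = log.tell()

    # Update the bookmark for the next call.
    with open(bookmarkname, "w") as f:
        f.write(str(offset))

    return [line.decode("utf-8", errors="replace") for line in data]
```

Unlike counting lines, this never re-reads the part of the file that was already seen, which matters once the log grows past tens of megabytes.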


 
