
comparing huge files

hi
I wrote some code to compare 2 files. One is the base file; the other
file I got from somewhere else. I need to compare this file against the
base,
eg base file
abc
def
ghi

eg another file
abc
def
ghi
jkl

After the compare, the base file will be overwritten with "jkl". Also,
both files tend to grow beyond 20MB...

Here is my code...using difflib.

import difflib
import os
import re

pat = re.compile(r'^\+') ## i want to get rid of the '+' from the difflib output...

def difference(filename, basename):
    base = open(basename)
    a = base.readlines()
    input = open(filename)
    b = input.readlines()
    d = difflib.Differ()
    diff = list(d.compare(a, b))
    if len(diff) > 0:
        os.remove(basename)
        o = open(basename, "w") ## "w", not "aU" -- "U" only applies to reading
        for i in diff:
            if pat.search(i):
                o.write(i[2:]) ## drop the "+ " prefix; lstrip("\+ ") would also
                               ## eat leading '+' and spaces from the data itself
        o.close()
    g = open(basename) ## return the new base file's contents
    return g.readlines()

Whenever the 2 files get very large, I find that the comparison is very
slow... any good advice to speed things up? I thought of dropping the
readlines() method and comparing line by line instead. Is that a better
way? thanks

Mar 16 '06 #1
s9************@yahoo.com wrote:
[snipped quote of the original post]


It seems like you want a new base that contains only those lines of
'filename' that are not contained in 'basename', where 'basename' is an
ordered subset of 'filename'. In other words, the 'filename' file has all
of the lines of 'basename', in order, somewhere, but 'filename' has some
additional lines. Is that correct? difflib looks to be overkill for this.
Here is a suggestion:
basefile = open(basename)
newfile = open(filename)
baseiter = basefile.xreadlines()
newiter = newfile.xreadlines()

newbase = open('tmp.txt', 'w')

for baseline in baseiter:
    for newline in newiter:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# drain whatever is left in the new file once the base is exhausted
# (e.g. the trailing "jkl" in your example)
for newline in newiter:
    newbase.write(newline)

for afile in (basefile, newfile, newbase): afile.close()
If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet, because then you have a computationally intensive problem.
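Even then, here is a sketch of a lighter difflib approach (not tested on
your data; extract_added is a made-up helper name) that reads the opcodes
of SequenceMatcher directly instead of going through Differ, so it skips
Differ's per-line refinement and there are no "+ " prefixes to strip
afterwards:

import difflib

def extract_added(base_lines, new_lines):
    # collect only the lines the matcher marks as inserted or replaced
    # in the new file
    sm = difflib.SequenceMatcher(None, base_lines, new_lines)
    added = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ('insert', 'replace'):
            added.extend(new_lines[j1:j2])
    return added

added = extract_added(open(basename).readlines(), open(filename).readlines())
open(basename, 'w').writelines(added) # rewrite the base with just the new lines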
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Mar 16 '06 #2
thanks for the reply,
I have used another method to solve my problem, i.e.
1) get the total line count of the first file
2) write this total count out as the base count, e.g. basecnt
3) get the other file and take its total line count, e.g. filecnt
4) if filecnt > basecnt, read in the new lines from file[basecnt:filecnt]
5) if filecnt < basecnt, overwrite the saved basecnt and start over
again (a rough sketch follows below).

basically, the problem domain is: I want to get the most recent records
from a log file to review every 3 hours, so this log file will keep
growing.
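A minimal sketch of that counting approach (basecnt.txt and new_records
are made-up names; it assumes the log only grows between runs, or shrinks
when it is truncated/rotated):

import os

COUNT_FILE = 'basecnt.txt' # assumed bookmark file holding the saved line count

def new_records(logname):
    # read the line count saved by the previous run (0 if none yet)
    basecnt = 0
    if os.path.exists(COUNT_FILE):
        basecnt = int(open(COUNT_FILE).read() or 0)
    lines = open(logname).readlines()
    filecnt = len(lines)
    if filecnt < basecnt:
        basecnt = 0 # log shrank, so it was rotated: start over
    fresh = lines[basecnt:filecnt] # only the records added since the last run
    open(COUNT_FILE, 'w').write(str(filecnt)) # save the new count
    return fresh

Something like cron could then call new_records() on the log every 3 hours.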

Mar 16 '06 #3

<s9************@yahoo.com> wrote in message
news:11**********************@e56g2000cwe.googlegroups.com...
[snipped quote of the previous post]


I did this:

import os

fp = os.popen('/usr/sbin/logtail /var/log/syslog')
loglines = fp.readlines()

.... pyparsing ... stuff .... from loglines
;-)

Python is maybe overkill too - have "cron" call "logtail" and pipe the
output wherever?

PS:

"logtail" is very simple, it works simply by maintaining a "bookmark" from
the last read that is updated after each time the file is read (i.e. on each
call). It is probably a very easy thing to implement in Python. On
Linux/UNIX syslog+logutils can do a lot of work just by configuration (but
you did not say you are on unix)
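A sketch of that bookmark idea (file names are made up): it saves a byte
offset with tell()/seek() rather than a line count, so it never has to
re-read the part of the file it has already seen:

import os

BOOKMARK = 'syslog.offset' # assumed file holding the saved byte offset

def tail_since_last_call(logname):
    # read the offset saved by the previous call (0 if none yet)
    offset = 0
    if os.path.exists(BOOKMARK):
        offset = int(open(BOOKMARK).read() or 0)
    f = open(logname)
    f.seek(0, 2) # jump to the end to learn the current size
    if f.tell() < offset:
        offset = 0 # file shrank: it was rotated, start over
    f.seek(offset)
    fresh = f.readlines() # everything appended since the last call
    open(BOOKMARK, 'w').write(str(f.tell())) # update the bookmark
    f.close()
    return fresh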
Mar 17 '06 #4
