comparing huge files

Hi,
I wrote some code to compare two files. One is the base file; the other
is a file I got from somewhere else, which I need to compare against the
base.
e.g. base file:
abc
def
ghi

e.g. the other file:
abc
def
ghi
jkl

After the compare, the base file is overwritten with just "jkl". Both
files also tend to grow beyond 20MB.

Here is my code, using difflib:

import difflib
import os
import re

pat = re.compile(r'^\+ ')  ## matches the '+ ' prefix difflib puts on added lines

def difference(filename, basename):
    base = open(basename)
    a = base.readlines()
    base.close()
    input = open(filename)
    b = input.readlines()
    input.close()
    d = difflib.Differ()
    diff = list(d.compare(a, b))
    if len(diff) > 0:
        os.remove(basename)
        o = open(basename, "a")
        for i in diff:
            if pat.search(i):
                i = i[2:]  ## drop only the '+ ' prefix, keep the rest of the line
                o.write(i)  ## write a new base file...
        o.close()
    g = open(basename)
    return g.readlines()

Whenever the two files get very large, the comparison is very slow.
Any good advice on speeding things up? I thought of dropping
readlines() and comparing line by line. Is that a better way?
Thanks

Mar 16 '06 #1

It seems like you want a new base that contains only those lines
contained in 'filename' that are not contained in 'basename', where
'basename' is an ordered subset of 'filename'. In other words, the
'filename' file has all of the lines of 'basename' in order somewhere,
plus some additional lines. Is that correct? difflib looks
to be overkill for this. Here is a suggestion:
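basefile = open(basename)
newfile = open(filename)
baseiter = iter(basefile)  # file objects iterate lazily, one line at a time
newiter = iter(newfile)

newbase = open('tmp.txt', 'w')

for baseline in baseiter:
    # copy lines from the new file until it catches up with this base line
    for newline in newiter:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# anything left after the last base line is new as well
newbase.writelines(newiter)

for afile in (basefile, newfile, newbase):
    afile.close()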
If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet, because you then have a computationally intensive problem.
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Mar 16 '06 #2
Thanks for the reply.
I have used another method to solve my problem, i.e.
1) get the total count of the first file
2) save this total count, e.g. as basecnt
3) get another file and get its total count, e.g. filecnt
4) if filecnt > basecnt, read in the values from file[basecnt:filecnt]
5) if filecnt < basecnt, overwrite the original basecnt and start over
again.

Basically, the problem domain is that I want to review the most recent
records from a log file every 3 hours, so this log file will
keep growing.
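
The counting steps above could be sketched roughly like this. It is only an illustration, not the code actually used: it assumes "total count" means the number of lines, and the function name new_records and the side file that stores basecnt are made up for the example.

```python
def new_records(log_path, count_path):
    """Return the lines added to log_path since the previous call.

    count_path is a hypothetical side file that stores basecnt, the
    line count seen on the previous call.
    """
    # Steps 1-2: load the stored count; treat a missing or bad file as 0.
    try:
        with open(count_path) as f:
            basecnt = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        basecnt = 0

    # Step 3: count the lines of the current file.
    with open(log_path) as f:
        lines = f.readlines()
    filecnt = len(lines)

    # Step 5: the file shrank, so forget the old count and start over.
    if filecnt < basecnt:
        basecnt = 0

    # Remember the new count for the next run.
    with open(count_path, "w") as f:
        f.write(str(filecnt))

    # Step 4: only the lines past the old count are new.
    return lines[basecnt:]
```

Calling new_records('app.log', 'basecnt.txt') every 3 hours (both file names are placeholders) would return only the records written since the last call. It still reads the whole file each time, so it saves the comparison but not the I/O.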

Mar 16 '06 #3


I did this:

fp = os.popen('/usr/sbin/logtail /var/log/syslog')
loglines = fp.readlines()

.... pyparsing ... stuff .... from loglines
;-)

Python is maybe overkill too - have "cron" call "logtail" and pipe the
output wherever?

PS:

"logtail" is very simple: it works by maintaining a "bookmark" from
the last read, which is updated each time the file is read (i.e. on each
call). It is probably a very easy thing to implement in Python. On
Linux/UNIX, syslog+logutils can do a lot of work just by configuration (but
you did not say you are on Unix).
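
For what it's worth, the bookmark idea might look something like this in Python. This is a rough sketch and not how logtail itself is written: storing the bookmark as a byte offset in a separate file is an assumption, and the names read_new and bookmark_path are invented for the example.

```python
import os

def read_new(log_path, bookmark_path):
    """Return the lines appended to log_path since the previous call.

    bookmark_path is a hypothetical side file holding the byte offset
    where the previous read stopped.
    """
    # Load the bookmark; a missing or unreadable one means "start at 0".
    try:
        with open(bookmark_path) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0

    # A log shorter than the bookmark was probably rotated or truncated.
    if os.path.getsize(log_path) < offset:
        offset = 0

    # Seek straight to the bookmark so old data is never read again.
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
        offset = log.tell()

    # Update the bookmark for the next call.
    with open(bookmark_path, "w") as f:
        f.write(str(offset))

    return data.decode("utf-8", errors="replace").splitlines(keepends=True)
```

Because it seeks to the saved offset, each call only reads what was added since the last one, however big the log has grown. One known gap: a log that is rotated and then grows past the old offset before the next call would not be noticed by the size check alone.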
Mar 17 '06 #4
