DiffLib Question

Hi Guys,
I'm a bit confused by difflib. In most cases the differences it
finds look right, but I have come across the following pair of
texts:
>>> d1 = '''In addition, the considered problem does not have a meaningful traditional type of adjoint
... problem even for the simple forms of the differential equation and
the nonlocal conditions. Due to these facts, some serious difficulties
arise in the application of the classical methods to such a
problem.'''
>>> d2 = '''In addition, the considered problem does not have a meaningful traditional type of
... adjoint problem even for the simple forms of the differential
equation and the nonlocal conditions. Due
... to these facts, some serious difficulties arise in the application
of the classical methods to such a
... problem. '''

Using this code:
>>> import difflib
>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

I get the following output:

' I n a d d i t i o n , t h e c o n s i
d e r e d p r o b l e m d o e s n o t
h a v e a m e a n i n g f u l t r a d i
t i o n a l t y p e o f- + \n a d j o i n t+
+ p+ r+ o+ b+ l+ e+ m+ + e+ v+ e+ n+ + f+ o+ r+ + t+ h+ e+ + s+ i+
m+ p+ l+ e+ + f+ o+ r+ m+ s+ + o+ f+ + t+ h+ e+ + d+ i+ f+ f+ e+ r
+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+ i+ o+ n+ + a+ n+ d+ + t+ h+ e
+ + n+ o+ n+ l+ o+ c+ a+ l+ + c+ o+ n+ d+ i+ t+ i+ o+ n+ s+ .+ + D+
u+ e \n+ t+ o+ + t+ h+ e+ s+ e+ + f+ a+ c+ t+ s+ ,+ + s+ o+ m+ e+
+ s+ e+ r+ i+ o+ u+ s+ + d+ i+ f+ f+ i+ c+ u+ l+ t+ i+ e+ s+ + a+ r+
i+ s+ e+ + i+ n+ + t+ h+ e+ + a+ p+ p+ l+ i+ c+ a+ t+ i+ o+ n+ + o
+ f+ + t+ h+ e+ + c+ l+ a+ s+ s+ i+ c+ a+ l+ + m+ e+ t+ h+ o+ d+ s
+ + t+ o+ + s+ u+ c+ h+ + a+ \n p r o b l e m- - e- v- e-
n- - f- o- r- - t- h- e- - s- i- m- p- l- e- - f- o- r- m- s- -
o- f- - t- h- e- - d- i- f- f- e- r- e- n- t- i- a- l- - e- q- u-
a- t- i- o- n- - a- n- d- - t- h- e- - n- o- n- l- o- c- a- l- -
c- o- n- d- i- t- i- o- n- s . - D- u- e- - t- o- - t- h- e- s-
e- - f- a- c- t- s- ,- - s- o- m- e- - s- e- r- i- o- u- s- - d-
i- f- f- i- c- u- l- t- i- e- s- - a- r- i- s- e- - i- n- - t- h-
e- - a- p- p- l- i- c- a- t- i- o- n- - o- f- - t- h- e- - c- l-
a- s- s- i- c- a- l- - m- e- t- h- o- d- s- - t- o- - s- u- c- h-
- a- - p- r- o- b- l- e- m- .'

Why is the rest of the text after the word "adjoint" marked as added
(and the same text later marked as deleted) when in fact it is
contained in both d1 and d2? The only difference is where the
newlines fall. Am I missing something? Is there a way for me to
disregard the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen

May 2 '07 #1
On May 2, 10:46 am, whitewave <fru...@gmail.com> wrote:
> Is there a way for me to disregard
> the newlines and spaces?
>
> Python 2.3
> WINXP
>
> Thanks.
> Jen

HTH:
>>> help(difflib.Differ.__init__)
Help on method __init__ in module difflib:

__init__(self, linejunk=None, charjunk=None) unbound difflib.Differ method
    Construct a text differencer, with optional filters.

    The two optional keyword parameters are for filter functions:

    - `linejunk`: A function that should accept a single string argument,
      and return true iff the string is junk. The module-level function
      `IS_LINE_JUNK` may be used to filter out lines without visible
      characters, except for at most one splat ('#'). It is recommended
      to leave linejunk None; as of Python 2.3, the underlying
      SequenceMatcher class has grown an adaptive notion of "noise" lines
      that's better than any static definition the author has ever been
      able to craft.

    - `charjunk`: A function that should accept a string of length 1. The
      module-level function `IS_CHARACTER_JUNK` may be used to filter out
      whitespace characters (a blank or tab; **note**: bad idea to include
      newline in this!). Use of IS_CHARACTER_JUNK is recommended.

Michele Simionato
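
To make the two filters in that docstring concrete, here is a minimal sketch of what the module-level predicates return for a few inputs. The sample strings and the names line_results and char_results are made up for illustration; only IS_LINE_JUNK and IS_CHARACTER_JUNK come from difflib itself.

```python
import difflib

# IS_LINE_JUNK is true for lines with no visible characters,
# optionally containing a single '#'.
line_samples = ["\n", "  #  \n", "hello\n"]
line_results = [difflib.IS_LINE_JUNK(s) for s in line_samples]

# IS_CHARACTER_JUNK is true for a blank or a tab, but not for a newline.
char_samples = [" ", "\t", "\n", "x"]
char_results = [difflib.IS_CHARACTER_JUNK(c) for c in char_samples]

print(line_results)
print(char_results)
```

Either predicate, or any function with the same one-argument shape, can be passed to the Differ constructor as linejunk or charjunk.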

May 2 '07 #2
Hi,
Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm fairly new to Python and
to difflib. Am I using the right code here? How would I apply
linejunk and charjunk to the following code?
>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

Thanks
Jen

May 2 '07 #3
On Wed, 02 May 2007 06:26:13 -0300, whitewave <fr****@gmail.com> wrote:
> Thank you for your reply. But I don't fully understand what
> charjunk and linejunk are all about. I'm fairly new to Python and
> to difflib. Am I using the right code here? How would I apply
> linejunk and charjunk to the following code?

Usually, Differ receives two sequences of lines, each line being a
sequence of characters (a string). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare
characters inside the lines; the charjunk argument is used to ignore
characters.
As you are feeding Differ a single string (not a list of text lines),
the "lines" it sees are just characters. To ignore whitespace and
newlines in this case, one should use the linejunk argument:

import difflib

def ignore_ws_nl(c):
    # True for the characters the matcher should treat as junk.
    return c in " \t\n\r"

a = difflib.Differ(linejunk=ignore_ws_nl).compare(d1, d2)
dif = list(a)
print(''.join(dif))

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+

I hope this is what you were looking for.
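
The difference between feeding Differ a string and feeding it a list of lines shows up even on a tiny input. This is only an illustrative sketch: the sample texts old and new and the names char_items and line_items are assumptions, not taken from the posts above.

```python
import difflib

old = "one\ntwo\nthree\n"
new = "one\n2\nthree\n"

# A plain string is a sequence of characters, so every item Differ
# yields is a two-character code ("  ", "- " or "+ ") plus one character.
char_items = list(difflib.Differ().compare(old, new))

# A list of lines makes Differ report whole lines instead.
line_items = list(
    difflib.Differ().compare(old.splitlines(True), new.splitlines(True))
)

print(char_items[:4])
print(line_items)
```

Joining char_items with '' gives the spread-out, letter-by-letter output seen earlier, while line_items reads like a conventional line diff.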

--
Gabriel Genellina
May 2 '07 #4
Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents, each with several
paragraphs. Some parts of a paragraph diff perfectly, but
others don't. I am confused.

Thanks.
Jen

May 4 '07 #5
On Fri, 04 May 2007 06:46:44 -0300, whitewave <fr****@gmail.com> wrote:
> Thanks! It works fine, but I was wondering why the result isn't
> consistent. I am comparing two huge documents, each with several
> paragraphs. Some parts of a paragraph diff perfectly, but
> others don't. I am confused.

Differ objects do a two-level diff; depending on what kind of differences
you are interested in, you feed them different things.
If the "line" concept is important to you (that is, you want to see which
"lines" were added, removed or modified), then feed the Differ a
sequence of lines (file.readlines() would be fine).
Fed this way, if someone inserts a few words inside a paragraph and the
remaining lines have to be reflowed, you'll see many changes caused by words
that were at the end of one line moving to the start of the next.
If you are more concerned about "paragraphs" and words, feed the Differ
a sequence of "paragraphs". Maybe your editor can handle it; assuming
a paragraph ends with two linefeeds, you can get a list of paragraphs in
Python using file.read().split("\n\n").
A third alternative would be to consider the text as absolutely plain, and
just feed Differ with file.read(), as mentioned in an earlier post.
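
The three alternatives above might be sketched as follows. The sample texts, the helper diff_units and the result names are hypothetical, and in-memory strings stand in for the file.read() and file.readlines() calls.

```python
import difflib

# Same words in both texts; only the wrap point in the first paragraph moved.
text_a = "First paragraph,\nwrapped here.\n\nSecond paragraph.\n"
text_b = "First paragraph, wrapped\nhere.\n\nSecond paragraph.\n"

def diff_units(a_units, b_units):
    """Run Differ over two sequences of 'units' and collect the output."""
    return list(difflib.Differ().compare(a_units, b_units))

# 1. Line by line (what file.readlines() would give).
by_line = diff_units(text_a.splitlines(True), text_b.splitlines(True))

# 2. Paragraph by paragraph, assuming a blank line separates paragraphs.
by_paragraph = diff_units(text_a.split("\n\n"), text_b.split("\n\n"))

# 3. Absolutely plain text: every character is its own unit.
by_char = diff_units(text_a, text_b)

for name, result in [("line", by_line), ("paragraph", by_paragraph)]:
    print(name)
    print("".join(result))
```

In each case the unchanged second paragraph comes out unmarked; what differs is how the rewrapped first paragraph is reported.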

--
Gabriel Genellina
May 6 '07 #6
Hello,

I am currently using the third option: calling file.read() on both files
to be compared, then feeding the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code:
>>> import difflib
>>> diffline = []
>>> fileDiff = difflib.Differ().compare(f1, f2)
>>> diffline = list(fileDiff)
>>> finStr = ''.join(diffline)
with the following strings for comparison:
>>> f1 = ''' The solvable conditions and the Green's functions of linear boundary value
... problems for ordinary differential equations with sufficiently
smooth coefficients have been
... investigated in detail by other authors
\cite{CR1,CR2,CR3,CR4,CR5}.'''
>>> f2 = '''The solvability conditions and the Green's functions of linear boundary value problems for ordinary
... differential equations with sufficiently smooth coefficients have
been investigated in detail by many
... authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

I get this result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e+ + p+ r+ o+ b+ l+ e+ m+ s+ + f+ o+ r+ + o
+ r+ d+ i+ n+ a+ r+ y
+ d+ i+ f+ f- p- r- o- b- l e+ r+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+
i+ o+ n+ s+ + w+ i+ t+ h+ + s+ u+ f+ f+ i+ c+ i+ e+ n+ t+ l+ y+ +
s m- s- - f o- r- o- r- d- i- n- a- r- y- - d- i- f- f- e- r-
e- n t- i- a- l- - e- q- u- a- t- i- o- n- s- - w- i- t h - s-
u- f- f- i c- i+ o e+ f+ f+ i+ c+ i+ e n t- l- y- s+ + h+ a+ v
+ e+ + b+ e+ e+ n+ + i+ n+ v+ e+ s+ t+ i+ g+ a+ t+ e+ d+ + i+ n+ +
d+ e+ t+ a+ i+ l+ + b+ y+ m- o- o- t- h- - c- o- e- f- f- i- c-
i- e- n- t- s- - h a- v- e- - b- e- e n+ y
- i- n- v- e- s- t- i- g- a- t- e- d- - i- n- - d- e- t- a- i- l- -
b- y- - o- t- h- e- r- a u t h o r s \ c i t e
{ C R 1 , C R 2 , C R 3 , C R 4 , C R 5 } .

This is the result I expected:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s
w i t h s u f f i c i e n t l y s m o o t
h c o e f f i c i e n t s h a v e b e e
n-
+ i n v e s t i g a t e d i n d e t a i
l b y + m- o- t- h- e- r- a+ n+ y+
+ a u t h o r s \ c i t e { C R 1 , C R 2 , C
R 3 , C R 4 , C R 5 } .
Thanks,
Jen

May 7 '07 #7
On Mon, 07 May 2007 00:52:18 -0300, whitewave <fr****@gmail.com> wrote:
> I am currently using the third option: calling file.read() on both files
> to be compared, then feeding the results to the compare function.
>
> Let me give you a brief sample of what I want to achieve.
>
> Using this code
> >>> diffline = []
> >>> fileDiff = difflib.Differ().compare(f1, f2)
> >>> diffline = list(fileDiff)
> >>> finStr = ''.join(diffline)

So you are concerned with character differences, ignoring higher-order
structures. Pass a linejunk filter function to the Differ constructor (as
shown in my post last Wednesday) to ignore "\n" characters when matching.
That is:

def ignore_eol(c):
    return c in "\r\n"

fileDiff = difflib.Differ(linejunk=ignore_eol).compare(f1, f2)
print(''.join(fileDiff))

you get:

- T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o
n s a n d t h e G r e e n ' s f u n c t i
o n
s o f l i n e a r b o u n d a r y v a l
u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s w
i t
h s u f f i c i e n t l y s m o o t h c o e
f f
i c i e n t s h a v e b e e n-
+ i n v e s t i g a t e d i n d e t a i l
b y
+ m+ a+ n+ y+
- o- t- h- e- r- a u t h o r s \ c i t e { C R 1 ,
C R
2 , C R 3 , C R 4 , C R 5 } .

--
Gabriel Genellina

May 7 '07 #8
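
A different approach, not suggested by anyone in the thread, is to take whitespace out of the comparison altogether by diffing words instead of characters. The sketch below assumes that word-level changes are what matters; the function word_diff and the shortened sample texts are hypothetical, and only SequenceMatcher and get_opcodes are real difflib API.

```python
import difflib

def word_diff(text_a, text_b):
    """Return (tag, old_words, new_words) for each changed run of words."""
    # str.split() with no argument splits on any run of whitespace,
    # so spaces, tabs and newlines never take part in the comparison.
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, words_a[i1:i2], words_b[j1:j2]))
    return changes

# Shortened stand-ins for f1 and f2, wrapped at different points.
sample_a = """ The solvable conditions have been
investigated in detail by other authors."""
sample_b = """The solvability conditions have
been investigated in detail by many
authors."""

changes = word_diff(sample_a, sample_b)
for change in changes:
    print(change)
```

Because the texts are reduced to word lists first, moving a line break or adding a leading space produces no change at all; only the replaced words are reported.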
