
DiffLib Question

P: n/a
Hi Guys,
I'm a bit confused by difflib. In most cases the differences it
finds look right, but not when I compare the following two pieces
of text:
>>> d1 = '''In addition, the considered problem does not have a meaningful traditional type of adjoint
... problem even for the simple forms of the differential equation and the nonlocal conditions. Due to these facts, some serious difficulties arise in the application of the classical methods to such a
... problem.'''
>>> d2 = '''In addition, the considered problem does not have a meaningful traditional type of
... adjoint problem even for the simple forms of the differential equation and the nonlocal conditions. Due
... to these facts, some serious difficulties arise in the application of the classical methods to such a
... problem. '''

Using this code:
>>> import difflib
>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

I get the following output:

' I n a d d i t i o n , t h e c o n s i
d e r e d p r o b l e m d o e s n o t
h a v e a m e a n i n g f u l t r a d i
t i o n a l t y p e o f- + \n a d j o i n t+
+ p+ r+ o+ b+ l+ e+ m+ + e+ v+ e+ n+ + f+ o+ r+ + t+ h+ e+ + s+ i+
m+ p+ l+ e+ + f+ o+ r+ m+ s+ + o+ f+ + t+ h+ e+ + d+ i+ f+ f+ e+ r
+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+ i+ o+ n+ + a+ n+ d+ + t+ h+ e
+ + n+ o+ n+ l+ o+ c+ a+ l+ + c+ o+ n+ d+ i+ t+ i+ o+ n+ s+ .+ + D+
u+ e \n+ t+ o+ + t+ h+ e+ s+ e+ + f+ a+ c+ t+ s+ ,+ + s+ o+ m+ e+
+ s+ e+ r+ i+ o+ u+ s+ + d+ i+ f+ f+ i+ c+ u+ l+ t+ i+ e+ s+ + a+ r+
i+ s+ e+ + i+ n+ + t+ h+ e+ + a+ p+ p+ l+ i+ c+ a+ t+ i+ o+ n+ + o
+ f+ + t+ h+ e+ + c+ l+ a+ s+ s+ i+ c+ a+ l+ + m+ e+ t+ h+ o+ d+ s
+ + t+ o+ + s+ u+ c+ h+ + a+ \n p r o b l e m- - e- v- e-
n- - f- o- r- - t- h- e- - s- i- m- p- l- e- - f- o- r- m- s- -
o- f- - t- h- e- - d- i- f- f- e- r- e- n- t- i- a- l- - e- q- u-
a- t- i- o- n- - a- n- d- - t- h- e- - n- o- n- l- o- c- a- l- -
c- o- n- d- i- t- i- o- n- s . - D- u- e- - t- o- - t- h- e- s-
e- - f- a- c- t- s- ,- - s- o- m- e- - s- e- r- i- o- u- s- - d-
i- f- f- i- c- u- l- t- i- e- s- - a- r- i- s- e- - i- n- - t- h-
e- - a- p- p- l- i- c- a- t- i- o- n- - o- f- - t- h- e- - c- l-
a- s- s- i- c- a- l- - m- e- t- h- o- d- s- - t- o- - s- u- c- h-
- a- - p- r- o- b- l- e- m- .'

Why is the rest of the text after the word "adjoint" marked as added
(and the same text later marked as deleted) when in fact that text is
contained in both d1 and d2? The only difference is the position of
the newlines. Am I missing something? Is there a way for me to
disregard the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen

May 2 '07 #1
7 Replies


P: n/a
On May 2, 10:46 am, whitewave <fru...@gmail.com> wrote:
> Is there a way for me to disregard
> the newlines and spaces?
>
> Python 2.3
> WINXP
>
> Thanks.
> Jen

HTH:
>>> help(difflib.Differ.__init__)
Help on method __init__ in module difflib:

__init__(self, linejunk=None, charjunk=None) unbound difflib.Differ method
    Construct a text differencer, with optional filters.

    The two optional keyword parameters are for filter functions:

    - `linejunk`: A function that should accept a single string argument,
      and return true iff the string is junk. The module-level function
      `IS_LINE_JUNK` may be used to filter out lines without visible
      characters, except for at most one splat ('#'). It is recommended
      to leave linejunk None; as of Python 2.3, the underlying
      SequenceMatcher class has grown an adaptive notion of "noise" lines
      that's better than any static definition the author has ever been
      able to craft.

    - `charjunk`: A function that should accept a string of length 1. The
      module-level function `IS_CHARACTER_JUNK` may be used to filter out
      whitespace characters (a blank or tab; **note**: bad idea to include
      newline in this!). Use of IS_CHARACTER_JUNK is recommended.
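
[Editor's sketch, not part of the original reply.] To make the help text above concrete, here is a small example of what the two module-level predicates return and how one of them is passed to the constructor. The sample lines `old` and `new` are made up for illustration, and the code is written for a current Python 3 rather than the Python 2.3 used in this thread.

```python
import difflib

# What the two predicates named in the help text return.
checks = {
    "blank": difflib.IS_CHARACTER_JUNK(" "),
    "tab": difflib.IS_CHARACTER_JUNK("\t"),
    "newline": difflib.IS_CHARACTER_JUNK("\n"),    # deliberately not junk
    "hash_line": difflib.IS_LINE_JUNK("   #  \n"),  # no visible text except one '#'
    "code_line": difflib.IS_LINE_JUNK("x = 1\n"),
}
for name, value in checks.items():
    print(name, value)

# Passing a predicate to the constructor. Differ is given lists of lines
# here, which is the input shape the help text has in mind.
old = ["alpha  beta\n", "gamma\n"]   # hypothetical sample data
new = ["alpha beta\n", "gamma\n"]
differ = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
result = list(differ.compare(old, new))
print("".join(result))
```

Note that both filters are hints about what the matcher should avoid synchronising on; neither removes text from the output.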
Michele Simionato

May 2 '07 #2

P: n/a
Hi,
Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm a bit of a newbie in Python
and with difflib. Am I using the right code here? How would I add
linejunk and charjunk to the following code?
>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

Thanks
Jen

May 2 '07 #3

P: n/a
On Wed, 02 May 2007 06:26:13 -0300, whitewave <fr****@gmail.com> wrote:
> Thank you for your reply. But I don't fully understand what
> charjunk and linejunk are all about. I'm a bit of a newbie in Python
> and with difflib. Am I using the right code here? How would I add
> linejunk and charjunk to the following code?

Usually, Differ receives two sequences of lines, each line being a
sequence of characters (a string). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare the
characters inside those lines; the charjunk argument is used to ignore
characters.
Since you are feeding Differ a single string (not a list of text lines),
the "lines" it sees are just single characters. To ignore whitespace and
newlines in this case, use the linejunk argument:

import difflib

def ignore_ws_nl(c):
    # Treat blanks, tabs and line breaks as junk.
    return c in " \t\n\r"

a = difflib.Differ(linejunk=ignore_ws_nl).compare(d1, d2)
dif = list(a)
print(''.join(dif))

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+

I hope this is what you were looking for.
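
[Editor's sketch, not part of the original reply.] The point about what Differ "sees" can be checked directly. The strings `a` and `b` below are invented sample data, and the code targets Python 3. It compares the same two texts once as plain strings and once as lists of lines.

```python
import difflib

a = "one two\nthree\n"   # hypothetical sample text
b = "one\ntwo three\n"

# A plain string is iterated one character at a time, so every "line"
# Differ sees is a single character: a 2-character tag plus 1 character.
char_items = list(difflib.Differ().compare(a, b))

# splitlines(True) keeps the line endings and gives Differ real lines.
line_items = list(difflib.Differ().compare(a.splitlines(True),
                                           b.splitlines(True)))

print(len(char_items), "items when comparing strings")
print(len(line_items), "items when comparing lists of lines")
print("".join(line_items))
```

In both cases `difflib.restore(items, 1)` gives back the pieces of the first input, which is a quick way to confirm that no text is lost whichever input shape is chosen.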

--
Gabriel Genellina
May 2 '07 #4

P: n/a
Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in them. For some parts of a paragraph the diff comes out perfectly,
but for others it doesn't. I am confused.

Thanks.
Jen

May 4 '07 #5

P: n/a
On Fri, 04 May 2007 06:46:44 -0300, whitewave <fr****@gmail.com> wrote:
> Thanks! It works fine, but I was wondering why the result isn't
> consistent. I am comparing two huge documents with several paragraphs
> in them. For some parts of a paragraph the diff comes out perfectly,
> but for others it doesn't. I am confused.

Differ objects do a two-level diff; depending on what kind of differences
you are interested in, you feed them different things.
If the "line" concept is important to you (that is, you want to see which
"lines" were added, removed or modified), then feed the Differ a
sequence of lines (file.readlines() would be fine).
With that approach, if someone inserts a few words inside a paragraph and
the remaining lines have to be reflowed, you'll see many changes caused by
words that were at the end of one line moving to the start of the next.
If you are more concerned about "paragraphs" and words, feed the Differ
a sequence of "paragraphs". Maybe your editor can handle it; assuming
a paragraph ends with two linefeeds, you can get a list of paragraphs in
Python using file.read().split("\n\n").
A third alternative would be to consider the text as absolutely plain, and
just feed Differ with file.read(), as mentioned in an earlier post.
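
[Editor's sketch, not part of the original reply.] The three alternatives above can be put side by side. To keep the example self-contained it uses in-memory strings (`text1`, `text2`, both invented) in place of file.readlines() and file.read(); `diff_units` is a hypothetical helper name. Written for Python 3.

```python
import difflib

text1 = "First paragraph, line one.\nLine two.\n\nSecond paragraph.\n"
text2 = "First paragraph, line one.\nLine 2.\n\nSecond paragraph.\n"

def diff_units(units1, units2):
    """Return the Differ output for two sequences of comparable units."""
    return list(difflib.Differ().compare(units1, units2))

# 1. Line level: what file.readlines() would give.
by_line = diff_units(text1.splitlines(True), text2.splitlines(True))

# 2. Paragraph level: what file.read().split("\n\n") would give.
by_paragraph = diff_units(text1.split("\n\n"), text2.split("\n\n"))

# 3. Character level: the raw strings, as file.read() would give.
by_char = diff_units(text1, text2)

for name, items in [("lines", by_line),
                    ("paragraphs", by_paragraph),
                    ("characters", by_char)]:
    changed = [item for item in items if item[:2] in ("- ", "+ ")]
    print(name, "->", len(changed), "changed units")
```

Which granularity is right depends on what should count as one unit of change; the code only shows that the same Differ call serves all three.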

--
Gabriel Genellina
May 6 '07 #6

P: n/a
Hello,

I am currently using the third option: I call file.read() on both files
to be compared and then feed the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code:
>>> diffline = []
>>> fileDiff = difflib.Differ().compare(f1, f2)
>>> diffline = list(fileDiff)
>>> finStr = ''.join(diffline)

With the following strings for comparison:
>>> f1 = ''' The solvable conditions and the Green's functions of linear boundary value
... problems for ordinary differential equations with sufficiently smooth coefficients have been
... investigated in detail by other authors \cite{CR1,CR2,CR3,CR4,CR5}.'''
>>> f2 = '''The solvability conditions and the Green's functions of linear boundary value problems for ordinary
... differential equations with sufficiently smooth coefficients have been investigated in detail by many
... authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

I get this result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e+ + p+ r+ o+ b+ l+ e+ m+ s+ + f+ o+ r+ + o
+ r+ d+ i+ n+ a+ r+ y
+ d+ i+ f+ f- p- r- o- b- l e+ r+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+
i+ o+ n+ s+ + w+ i+ t+ h+ + s+ u+ f+ f+ i+ c+ i+ e+ n+ t+ l+ y+ +
s m- s- - f o- r- o- r- d- i- n- a- r- y- - d- i- f- f- e- r-
e- n t- i- a- l- - e- q- u- a- t- i- o- n- s- - w- i- t h - s-
u- f- f- i c- i+ o e+ f+ f+ i+ c+ i+ e n t- l- y- s+ + h+ a+ v
+ e+ + b+ e+ e+ n+ + i+ n+ v+ e+ s+ t+ i+ g+ a+ t+ e+ d+ + i+ n+ +
d+ e+ t+ a+ i+ l+ + b+ y+ m- o- o- t- h- - c- o- e- f- f- i- c-
i- e- n- t- s- - h a- v- e- - b- e- e n+ y
- i- n- v- e- s- t- i- g- a- t- e- d- - i- n- - d- e- t- a- i- l- -
b- y- - o- t- h- e- r- a u t h o r s \ c i t e
{ C R 1 , C R 2 , C R 3 , C R 4 , C R 5 } .

Whereas, this is my expected result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s
w i t h s u f f i c i e n t l y s m o o t
h c o e f f i c i e n t s h a v e b e e
n-
+ i n v e s t i g a t e d i n d e t a i
l b y + m- o- t- h- e- r- a+ n+ y+
+ a u t h o r s \ c i t e { C R 1 , C R 2 , C
R 3 , C R 4 , C R 5 } .
Thanks,
Jen

May 7 '07 #7

P: n/a
On Mon, 07 May 2007 00:52:18 -0300, whitewave <fr****@gmail.com> wrote:
> I am currently using the third option: I call file.read() on both files
> to be compared and then feed the results to the compare function.
>
> Let me give you a brief sample of what I want to achieve.
>
> Using this code:
> >>> diffline = []
> >>> fileDiff = difflib.Differ().compare(f1, f2)
> >>> diffline = list(fileDiff)
> >>> finStr = ''.join(diffline)

So you are concerned with character differences, ignoring higher-order
structures. Pass a linejunk filter function to the Differ constructor,
as shown in my post last Wednesday, to ignore "\n" characters when
matching. That is:

def ignore_eol(c): return c in "\r\n"
fileDiff = difflib.Differ(linejunk=ignore_eol).compare(f1, f2)
print(''.join(fileDiff))

you get:

- T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o
n s a n d t h e G r e e n ' s f u n c t i
o n
s o f l i n e a r b o u n d a r y v a l
u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s w
i t
h s u f f i c i e n t l y s m o o t h c o e
f f
i c i e n t s h a v e b e e n-
+ i n v e s t i g a t e d i n d e t a i l
b y
+ m+ a+ n+ y+
- o- t- h- e- r- a u t h o r s \ c i t e { C R 1 ,
C R
2 , C R 3 , C R 4 , C R 5 } .
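
[Editor's sketch, not part of the original reply.] If the goal is to disregard newlines and spaces entirely, a different technique from the linejunk filter used above is to split both texts into words and compare the word lists with SequenceMatcher. The helper `word_diff` is a hypothetical name, and the two strings are shortened versions of the f1 and f2 posted earlier. Written for Python 3.

```python
import difflib

def word_diff(text1, text2):
    """Diff two texts word by word, ignoring how whitespace is laid out."""
    # split() with no argument discards every run of blanks, tabs and newlines.
    words1 = text1.split()
    words2 = text2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, words1[i1:i2], words2[j1:j2]))
    return changes

# Shortened versions of the strings from the thread, with the line
# breaks in different places.
f1 = " The solvable conditions of linear boundary value\nproblems investigated by other authors."
f2 = "The solvability conditions of linear boundary value problems\ninvestigated by many authors."

for change in word_diff(f1, f2):
    print(change)
```

The trade-off is that the original line layout is thrown away, so this reports which words changed but cannot reproduce either input.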

--
Gabriel Genellina

May 7 '07 #8
