DiffLib Question

whitewave

Hi Guys,
I'm a bit confused by difflib. In most cases the differences
found using difflib look right, but I have come across the
following set of text:

>>> d1 = '''In addition, the considered problem does not have a meaningful traditional type of adjoint
... problem even for the simple forms of the differential equation and the nonlocal conditions. Due to these facts, some serious difficulties arise in the application of the classical methods to such a
... problem.'''

>>> d2 = '''In addition, the considered problem does not have a meaningful traditional type of
... adjoint problem even for the simple forms of the differential equation and the nonlocal conditions. Due
... to these facts, some serious difficulties arise in the application of the classical methods to such a
... problem. '''

Using this line of code:

>>> import difflib
>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

I get the following output:

' I n a d d i t i o n , t h e c o n s i
d e r e d p r o b l e m d o e s n o t
h a v e a m e a n i n g f u l t r a d i
t i o n a l t y p e o f- + \n a d j o i n t+
+ p+ r+ o+ b+ l+ e+ m+ + e+ v+ e+ n+ + f+ o+ r+ + t+ h+ e+ + s+ i+
m+ p+ l+ e+ + f+ o+ r+ m+ s+ + o+ f+ + t+ h+ e+ + d+ i+ f+ f+ e+ r
+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+ i+ o+ n+ + a+ n+ d+ + t+ h+ e
+ + n+ o+ n+ l+ o+ c+ a+ l+ + c+ o+ n+ d+ i+ t+ i+ o+ n+ s+ .+ + D+
u+ e \n+ t+ o+ + t+ h+ e+ s+ e+ + f+ a+ c+ t+ s+ ,+ + s+ o+ m+ e+
+ s+ e+ r+ i+ o+ u+ s+ + d+ i+ f+ f+ i+ c+ u+ l+ t+ i+ e+ s+ + a+ r+
i+ s+ e+ + i+ n+ + t+ h+ e+ + a+ p+ p+ l+ i+ c+ a+ t+ i+ o+ n+ + o
+ f+ + t+ h+ e+ + c+ l+ a+ s+ s+ i+ c+ a+ l+ + m+ e+ t+ h+ o+ d+ s
+ + t+ o+ + s+ u+ c+ h+ + a+ \n p r o b l e m- - e- v- e-
n- - f- o- r- - t- h- e- - s- i- m- p- l- e- - f- o- r- m- s- -
o- f- - t- h- e- - d- i- f- f- e- r- e- n- t- i- a- l- - e- q- u-
a- t- i- o- n- - a- n- d- - t- h- e- - n- o- n- l- o- c- a- l- -
c- o- n- d- i- t- i- o- n- s . - D- u- e- - t- o- - t- h- e- s-
e- - f- a- c- t- s- ,- - s- o- m- e- - s- e- r- i- o- u- s- - d-
i- f- f- i- c- u- l- t- i- e- s- - a- r- i- s- e- - i- n- - t- h-
e- - a- p- p- l- i- c- a- t- i- o- n- - o- f- - t- h- e- - c- l-
a- s- s- i- c- a- l- - m- e- t- h- o- d- s- - t- o- - s- u- c- h-
- a- - p- r- o- b- l- e- m- .'

How come the rest of the text after the word "adjoint" is marked as
added (and then as deleted) when in fact that text is
contained in both d1 and d2? The only difference is where the
newlines fall. Am I missing something? Is there a way for me to disregard
the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen

May 2 '07 #1

Michele Simionato

On May 2, 10:46 am, whitewave <fru...@gmail.com> wrote:

Is there a way for me to disregard
the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen

HTH:

>>> help(difflib.Differ.__init__)

Help on method __init__ in module difflib:

__init__(self, linejunk=None, charjunk=None) unbound difflib.Differ method
    Construct a text differencer, with optional filters.

    The two optional keyword parameters are for filter functions:

    - `linejunk`: A function that should accept a single string argument,
      and return true iff the string is junk. The module-level function
      `IS_LINE_JUNK` may be used to filter out lines without visible
      characters, except for at most one splat ('#'). It is recommended
      to leave linejunk None; as of Python 2.3, the underlying
      SequenceMatcher class has grown an adaptive notion of "noise" lines
      that's better than any static definition the author has ever been
      able to craft.

    - `charjunk`: A function that should accept a string of length 1. The
      module-level function `IS_CHARACTER_JUNK` may be used to filter out
      whitespace characters (a blank or tab; **note**: bad idea to include
      newline in this!). Use of IS_CHARACTER_JUNK is recommended.

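
As a quick illustration of the two module-level helpers named in that help text, here is a minimal sketch (written for Python 3, not part of the original post; the sample inputs are made up for the example). It only calls difflib.IS_CHARACTER_JUNK and difflib.IS_LINE_JUNK on a few values to show what each one treats as junk:

```python
import difflib

# IS_CHARACTER_JUNK treats only a blank or a tab as junk; newline is not junk.
char_results = {c: difflib.IS_CHARACTER_JUNK(c) for c in " \t\nx"}

# IS_LINE_JUNK treats blank lines, or lines holding at most one '#', as junk.
line_samples = ("\n", "  #   \n", "hello\n")
line_results = {s: difflib.IS_LINE_JUNK(s) for s in line_samples}

for key, is_junk in list(char_results.items()) + list(line_results.items()):
    print(repr(key), is_junk)
```

Note that both helpers are plain predicates: they can be passed as linejunk or charjunk, or replaced by any function with the same shape.
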
Michele Simionato

May 2 '07 #2

whitewave

Hi,
Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm a bit of a newbie in Python and in
using difflib. Am I using the right code here? How would I implement
linejunk and charjunk with the following code?

>>> a = difflib.Differ().compare(d1, d2)
>>> dif = []
>>> for i in a:
...     dif.append(i)
...
>>> s = ''.join(dif)

Thanks
Jen

May 2 '07 #3

Gabriel Genellina

On Wed, 02 May 2007 06:26:13 -0300, whitewave <fr****@gmail.com> wrote:

Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm a bit of a newbie in Python and in
using difflib. Am I using the right code here? How would I implement
linejunk and charjunk with the following code?

Usually, Differ receives two sequences of lines, each line being a
sequence of characters (a string). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare the
characters inside the lines; the charjunk argument is used to ignore
certain characters.
Because you are feeding Differ a single string (not a list of text lines),
the "lines" it sees are just single characters. To ignore whitespace and
newlines in this case, use the linejunk argument:

def ignore_ws_nl(c):
    return c in " \t\n\r"

a = difflib.Differ(linejunk=ignore_ws_nl).compare(d1, d2)
dif = list(a)
print ''.join(dif)

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+

I hope this is what you were looking for.
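
To make the point about single strings concrete, here is a minimal sketch (written for Python 3; the sample strings a and b are made up for the example and are not from this thread). It feeds the same text to Differ once as a plain string and once as a list of lines, so you can see what Differ treats as a "line" in each case:

```python
import difflib

a = "one\ntwo\n"
b = "one\ntwo!\n"

# A plain string is iterated character by character, so every "line"
# that Differ sees here is a single character.
char_level = list(difflib.Differ().compare(a, b))

# splitlines(True) keeps the line endings, so Differ sees real lines.
line_level = list(difflib.Differ().compare(a.splitlines(True), b.splitlines(True)))

print(char_level)
print(line_level)
```

In the first result every entry is a two-character prefix followed by one character; in the second, each entry carries a whole line.
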

--
Gabriel Genellina

May 2 '07 #4

whitewave

Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in them. Some parts of a paragraph are diffed perfectly but
others aren't. I am confused.

Thanks.
Jen

May 4 '07 #5

Gabriel Genellina

On Fri, 04 May 2007 06:46:44 -0300, whitewave <fr****@gmail.com> wrote:

Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in them. Some parts of a paragraph are diffed perfectly but
others aren't. I am confused.

Differ objects do a two-level diff; depending on what kind of differences
you are interested in, you feed them different things.
If the "line" concept is important to you (that is, you want to see which
"lines" were added, removed or modified), then feed the Differ a
sequence of lines (file.readlines() would be fine).
With that approach, if someone inserts a few words inside a paragraph and
the remaining lines have to be reflowed, you'll see many changes from words
that were at the end of one line moving to the start of the next.
If you are more concerned about "paragraphs" and words, feed the Differ
a sequence of "paragraphs". Maybe your editor can handle it; assuming
a paragraph ends with two linefeeds, you can get a list of paragraphs in
Python using file.read().split("\n\n").
A third alternative is to treat the text as absolutely plain and
just feed Differ with file.read(), as mentioned in an earlier post.
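
The three alternatives above can be sketched as follows. This is only an illustration written for Python 3: the strings old and new stand in for the contents of two files, and the helper diff_units is a hypothetical name introduced for the example, not something from difflib.

```python
import difflib

# Stand-ins for file.read() on two versions of a document (made-up text).
old = "First paragraph,\nline two.\n\nSecond paragraph.\n"
new = "First paragraph,\nline 2.\n\nSecond paragraph.\n"

def diff_units(a_units, b_units):
    """Run Differ over whatever units (lines, paragraphs, characters) are given."""
    return list(difflib.Differ().compare(a_units, b_units))

# 1. Line by line: equivalent to feeding file.readlines().
by_line = diff_units(old.splitlines(True), new.splitlines(True))

# 2. Paragraph by paragraph: assumes paragraphs are separated by a blank line.
by_paragraph = diff_units(old.split("\n\n"), new.split("\n\n"))

# 3. Plain text: a string is a sequence of characters.
by_char = diff_units(old, new)

for name, result in (("lines", by_line), ("paragraphs", by_paragraph), ("chars", by_char)):
    print(name, len(result))
```

Which unit to choose depends on what should count as "the same": the finer the unit, the more the result is affected by where line breaks happen to fall.
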

--
Gabriel Genellina

May 6 '07 #6

whitewave

Hello,

I am currently doing the third option: calling file.read() on both files
to be compared, then feeding the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code

>>> diffline = []
>>> fileDiff = difflib.Differ().compare(f1, f2)
>>> diffline = list(fileDiff)
>>> finStr = ''.join(diffline)

With the following strings for comparison:

>>> f1 = ''' The solvable conditions and the Green's functions of linear boundary value
... problems for ordinary differential equations with sufficiently smooth coefficients have been
... investigated in detail by other authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

>>> f2 = '''The solvability conditions and the Green's functions of linear boundary value problems for ordinary
... differential equations with sufficiently smooth coefficients have been investigated in detail by many
... authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

I get this result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e+ + p+ r+ o+ b+ l+ e+ m+ s+ + f+ o+ r+ + o
+ r+ d+ i+ n+ a+ r+ y
+ d+ i+ f+ f- p- r- o- b- l e+ r+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+
i+ o+ n+ s+ + w+ i+ t+ h+ + s+ u+ f+ f+ i+ c+ i+ e+ n+ t+ l+ y+ +
s m- s- - f o- r- o- r- d- i- n- a- r- y- - d- i- f- f- e- r-
e- n t- i- a- l- - e- q- u- a- t- i- o- n- s- - w- i- t h - s-
u- f- f- i c- i+ o e+ f+ f+ i+ c+ i+ e n t- l- y- s+ + h+ a+ v
+ e+ + b+ e+ e+ n+ + i+ n+ v+ e+ s+ t+ i+ g+ a+ t+ e+ d+ + i+ n+ +
d+ e+ t+ a+ i+ l+ + b+ y+ m- o- o- t- h- - c- o- e- f- f- i- c-
i- e- n- t- s- - h a- v- e- - b- e- e n+ y
- i- n- v- e- s- t- i- g- a- t- e- d- - i- n- - d- e- t- a- i- l- -
b- y- - o- t- h- e- r- a u t h o r s \ c i t e
{ C R 1 , C R 2 , C R 3 , C R 4 , C R 5 } .

Whereas, this is my expected result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s
w i t h s u f f i c i e n t l y s m o o t
h c o e f f i c i e n t s h a v e b e e
n-
+ i n v e s t i g a t e d i n d e t a i
l b y + m- o- t- h- e- r- a+ n+ y+
+ a u t h o r s \ c i t e { C R 1 , C R 2 , C
R 3 , C R 4 , C R 5 } .
Thanks,
Jen

May 7 '07 #7

Gabriel Genellina

On Mon, 07 May 2007 00:52:18 -0300, whitewave <fr****@gmail.com> wrote:

I am currently doing the third option: calling file.read() on both files
to be compared, then feeding the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code

>>> diffline = []
>>> fileDiff = difflib.Differ().compare(f1, f2)
>>> diffline = list(fileDiff)
>>> finStr = ''.join(diffline)

So you are concerned with character differences, ignoring higher-order
structure. Pass a linejunk filter function to the Differ constructor, as
shown in my post last Wednesday, to ignore "\n" characters when matching.
That is:

def ignore_eol(c): return c in "\r\n"
fileDiff = difflib.Differ(linejunk=ignore_eol).compare(f1, f2)
print ''.join(fileDiff)

you get:

- T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o
n s a n d t h e G r e e n ' s f u n c t i
o n
s o f l i n e a r b o u n d a r y v a l
u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s w
i t
h s u f f i c i e n t l y s m o o t h c o e
f f
i c i e n t s h a v e b e e n-
+ i n v e s t i g a t e d i n d e t a i l
b y
+ m+ a+ n+ y+
- o- t- h- e- r- a u t h o r s \ c i t e { C R 1 ,
C R
2 , C R 3 , C R 4 , C R 5 } .
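
If the goal is to disregard spaces and newlines altogether, a different technique from the linejunk filter above is to split both texts into words first and diff the word lists. The following is a minimal sketch of that alternative, written for Python 3; it is not what the code above does, the helper word_diff is a hypothetical name, and f1 and f2 are shortened versions of the strings in this thread.

```python
import difflib

def word_diff(a, b):
    """Diff two texts word by word, so line breaks and spacing do not matter."""
    # str.split() with no argument splits on any run of whitespace.
    result = difflib.ndiff(a.split(), b.split())
    # Drop the '?' hint lines that ndiff adds for similar-looking items.
    return [entry for entry in result if not entry.startswith("?")]

f1 = "The solvable conditions\nhave been investigated by other authors."
f2 = "The solvability conditions have been\ninvestigated by many authors."

# Keep only the entries marked as removed ('-') or added ('+').
changes = [entry for entry in word_diff(f1, f2) if entry[0] in "+-"]
print(changes)
```

With this approach a word that merely moved to the next line is reported as unchanged; the trade-off is that the diff no longer tells you anything about the original line layout.
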

--
Gabriel Genellina

May 7 '07 #8
