difflib.ndiff broken?

Can anyone try the following in their python interpreter?

These give correct output:
print list(ndiff(['saving2 <<A'],['saving <<a>>']))
['- saving2 <<A', '?       -   ^\n', '+ saving <<a>>', '?          ^^^\n']
print list(ndiff(['saving2 <<AA'],['saving <<a>>']))
['- saving2 <<AA', '?       -   ^^\n', '+ saving <<a>>', '?          ^^^\n']
print list(ndiff(['saving2 <<A'],['saving <<aa>>']))
['- saving2 <<A', '?       -   ^\n', '+ saving <<aa>>', '?          ^^^^\n']
print list(ndiff(['saving <<A'],['saving <<aa>>']))
['- saving <<A', '?          ^\n', '+ saving <<aa>>', '?          ^^^^\n']

Now try the very slight variations:
print list(ndiff(['saving2 <<AA'],['saving <<aa>>']))
['- saving2 <<AA', '+ saving <<aa>>']
print list(ndiff(['saving2 <<AA'],['saving <<aa>>']))
['- saving2 <<AA', '+ saving <<aa>>']

This can't be right... or is it? Where are the '? ...' lines? It does this
for both Python 2.3.2 on Windows 2000 and Python 2.3.3 on SGI. If it's
correct, how come???

Oliver
Jul 18 '05 #1
[Humpdydum]
> Can anyone try the following in their python interpreter?
>
> These give correct output:
> print list(ndiff(['saving2 <<A'],['saving <<a>>']))
> ['- saving2 <<A', '?       -   ^\n', '+ saving <<a>>', '?          ^^^\n']
> print list(ndiff(['saving2 <<AA'],['saving <<a>>']))
> ['- saving2 <<AA', '?       -   ^^\n', '+ saving <<a>>', '?          ^^^\n']
> print list(ndiff(['saving2 <<A'],['saving <<aa>>']))
> ['- saving2 <<A', '?       -   ^\n', '+ saving <<aa>>', '?          ^^^^\n']
> print list(ndiff(['saving <<A'],['saving <<aa>>']))
> ['- saving <<A', '?          ^\n', '+ saving <<aa>>', '?          ^^^^\n']
>
> Now try the very slight variations:
> print list(ndiff(['saving2 <<AA'],['saving <<aa>>']))
> ['- saving2 <<AA', '+ saving <<aa>>']
> print list(ndiff(['saving2 <<AA'],['saving <<aa>>']))
> ['- saving2 <<AA', '+ saving <<aa>>']
>
> This can't be right... or is it? Where are the '? ...' lines? It does this
> for both Python 2.3.2 on Windows 2000 and Python 2.3.3 on SGI. If it's
> correct, how come???


ndiff produces intraline difference marking if and only if it thinks
the inputs are "reasonably close". The cutoff between "reasonably
close" and "not reasonably close" is necessarily heuristic. '?' lines
are more irritating than helpful when they have a lot of markup in
them, so it certainly wasn't intended that '?' lines *always* be
produced. The '+' and '-' lines contain all the information about how
to change one sequence into another; the '?' lines are fluff (albeit
sometimes helpful fluff -- that's why they're (sometimes) there).

Concretely, ndiff produces intraline marking iff two lines have a
similarity ratio of at least 0.75. In your first examples, the lines
do:
import difflib
m = difflib.SequenceMatcher()
m.set_seqs('saving2 <<A', 'saving <<a>>')
print m.ratio()
0.782608695652

In your last examples, the lines don't:
m.set_seqs('saving2 <<AA', 'saving <<aa>>')
print m.ratio()
0.72


Internally, 0.75 is the default value of FancyReplacer's optional
minimal_cutoff argument.
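
For anyone who wants to check a pair of lines before diffing them, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2.3). The helper names `similarity` and `has_guide_lines` and the constant `CUTOFF` are made up for illustration and are not part of difflib. The 0.75 figure is the one quoted above and should be treated as an implementation detail; ndiff also applies its own junk filtering internally, so a plain SequenceMatcher ratio is only an approximation of what it computes.

```python
from difflib import SequenceMatcher, ndiff

# Cutoff quoted in this thread; an implementation detail, not a documented API.
CUTOFF = 0.75


def similarity(old, new):
    """Character-level similarity ratio between two lines (hypothetical helper)."""
    return SequenceMatcher(None, old, new).ratio()


def has_guide_lines(old, new):
    """True if ndiff emits '?' guide lines for this pair (hypothetical helper)."""
    return any(line.startswith("? ") for line in ndiff([old], [new]))


pairs = [
    ("saving2 <<A", "saving <<a>>"),
    ("saving2 <<AA", "saving <<aa>>"),
]
for old, new in pairs:
    ratio = similarity(old, new)
    print(round(ratio, 3), ratio >= CUTOFF, has_guide_lines(old, new))
```

For the two pairs from this thread, the first ratio is about 0.783 and the second is 0.72, so only the first pair should come back with '?' lines.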
Jul 18 '05 #2
OK, forget it, sorry it was my mistake: it wasn't obvious from the difflib
docs, but it appears that ndiff points out the sub-line differences (lines
that start with ?) only if it was able to figure out operations that could
be applied to substrings on the line. Though often such operations are
obvious by looking at the strings being compared, ndiff doesn't always find
them, and so marks the whole line as + or -.

Anyone know of a web site that explains ndiff output? I couldn't figure out a
good set of search terms in google, didn't get anything useful. Thanks,

Oliver
Jul 18 '05 #3
[Humpdydum]
> OK, forget it, sorry it was my mistake:

I didn't see a mistake, just a question.

> it wasn't obvious from the difflib docs, but it appears that ndiff points out the
> sub-line differences (lines that start with ?) only if it was able to figure out
> operations that could be applied to substrings on the line. Though often such
> operations are obvious by looking at the strings being compared,

They can be for a program but often aren't for people. That's why
ndiff produces '?' lines when it thinks they might help. This is a
heuristic -- a guess. Sometimes it's not the same guess you'd make.
There's always a sequence of operations that can be applied to change
any line into any other line, but *usually* they're uninteresting.
'?' lines attempt to point out "minor edits".

> ndiff doesn't always find them, and so marks the whole line as + or -.

It marks two input lines that differ with - and + regardless of
whether it produces two ? lines too.

> Anyone know of a web site that explains ndiff output? I couldn't figure out a
> good set of search terms in google, didn't get anything useful. Thanks,


ndiff is unique to Python, and you have the source code for it.
Because '?' lines are fluff, precise docs for them would be
counterproductive. They're meant to guide the eye to minor intraline
differences, and that's all.

If a ? line appears, there are always two of them, interleaved between
a -+ pair, in this pattern:

-
?
+
?

Each ? line implicitly refers to the line immediately above it. Four
meaningful characters appear in ? lines. A caret (^) means the
character immediately above it was replaced, in going from the - to
the + line. "-" means the character immediately above it was deleted;
'+' means it was inserted; and a blank means the character immediately
above it is the same in both (- and +) lines. A '-' can appear only
in the ? line following a - line, and a '+' can appear only in the ?
line following a + line, because we're picturing the edits needed to
change the - line into the + line.
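
The reading rules above can be turned into a small decoder. This is only a sketch in Python 3 syntax: the function `explain` and the table `MEANING` are hypothetical names, not part of difflib, and the code assumes each '?' line shares its two-character prefix width with the line directly above it, as in the pattern shown.

```python
from difflib import ndiff

# Marker characters that can appear in a '?' guide line, per the rules above.
MEANING = {"^": "replaced", "-": "deleted", "+": "inserted"}


def explain(old, new):
    """Return (side, character, action) for every marker in the '?' lines.

    side is '-' or '+', taken from the line the '?' line refers to.
    """
    notes = []
    previous = None
    for line in ndiff([old], [new]):
        if line.startswith("? "):
            # Each marker describes the character in the same column
            # of the line immediately above.
            for col, mark in enumerate(line.rstrip("\n")):
                if col >= 2 and mark in MEANING:
                    notes.append((previous[0], previous[col], MEANING[mark]))
        else:
            previous = line
    return notes


for note in explain("saving2 <<A", "saving <<a>>"):
    print(note)
```

For the first pair in this thread, the decoder should report the '2' as deleted and the 'A' as replaced on the - side, and the three characters 'a>>' as replaced on the + side. For a pair below the cutoff there are no '?' lines, so the list comes back empty.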
Jul 18 '05 #4
