Fuzzy Lookups

I have some CDs and have been archiving them on a PC. I wrote a Python
script that scans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example, the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways: jolie, joli, blon, blond,
etc. In addition, the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match, and it turns out that
there is a very good one: the Levenshtein distance, which is the number
of insertions, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the versions of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
    c = {}
    n = len(a); m = len(b)

    for i in range(0,n+1):
        c[i,0] = i
    for j in range(0,m+1):
        c[0,j] = j

    for i in range(1,n+1):
        for j in range(1,m+1):
            x = c[i-1,j]+1
            y = c[i,j-1]+1
            if a[i-1] == b[j-1]:
                z = c[i-1,j-1]
            else:
                z = c[i-1,j-1]+1
            c[i,j] = min(x,y,z)
    return c[n,m]
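
The lookup itself then just ranks by that distance; roughly (untested, and the
names and result count are only for illustration):

def find_song(query, contents, n=5):
    # contents is the [[genre, artist, album, song], ...] list described above.
    # Rank every entry by the Levenshtein distance between its song title and
    # the query, and return the n closest entries.
    scored = [(distance(query.lower(), entry[3].lower()), entry)
              for entry in contents]
    scored.sort(key=lambda pair: pair[0])
    return [entry for d, entry in scored[:n]]

so that find_song("Jolie Blond", contents) should surface "Joli Blon",
"Jolie Blon" and the other spellings near the top.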

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Jan 30 '06 #1
> As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?


I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
    """
    Computes a relative distance between two strings. It's in the range
    (0-1] where 1 means total equality.
    @type a: string
    @param a: arg one
    @type b: string
    @param b: arg two
    @rtype: float
    @return: the distance
    """
    d = distance(a,b)
    longer = float(max((len(a), len(b))))
    shorter = float(min((len(a), len(b))))
    r = ((longer - d) / longer) * (shorter / longer)
    return r
distance is a run-of-the-mill Levenshtein distance, like yours.
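
As a quick illustration (the candidate list is made up), the normalized score
can be used to rank candidates directly:

candidates = ["Jolie Blon", "Joli Blond", "Jolie Blonde", "Blue Moon"]
query = "Jolie Blond"
# Higher scores mean closer matches, so the best candidate ends up first.
ranked = sorted(candidates, key=lambda s: relative(query, s), reverse=True)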

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3, but the normalized distance is
0.729591836735

and

0.015306122449

respectively - making you choose the better match :)

regards,

Diez

Jan 30 '06 #2
Diez B. Roggisch wrote:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3

distance("Angelina Jolie", "AngelinaJolei") 3 distance("Angelina Jolie", "Bob")

13

what did I miss ?

</F>

Jan 30 '06 #3
Fredrik Lundh wrote:
Diez B. Roggisch wrote:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3

distance("Angelina Jolie", "AngelinaJolei") 3 distance("Angelina Jolie", "Bob")

13

what did I miss ?


Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki...htein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And my implementation I got from there (or rather from Magnus Lie Hetland,
whose Python version is referenced there).

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results far better.

Sorry to cause confusion.

Diez
Jan 30 '06 #4
Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:


Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.
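
Something quick and dirty along these lines ought to be enough to eyeball it
(the bucket size is arbitrary, and relative() is the function quoted above):

def score_histogram(query, titles, bucket=0.1):
    # Count how many titles land in each score bucket; a gap between the
    # populated buckets would suggest a natural cutoff.
    counts = {}
    for title in titles:
        key = round(relative(query, title) / bucket) * bucket
        counts[key] = counts.get(key, 0) + 1
    for score in sorted(counts):
        print("%.1f  %s" % (score, "*" * counts[score]))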

jab

Jan 30 '06 #5

BBands wrote:
Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:


Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab


I noticed this guy, who's quite a good Ruby developer, spent some time
on distances:

http://ruby.brian-schroeder.de/editierdistanz/

Also have a look at Soundex and other algorithms (Double Metaphone, NYSIIS,
Phonex; I have notes to investigate them but haven't looked at them
myself).
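
For what it's worth, a bare-bones Soundex is only a few lines of Python
(an untested sketch that skips some of the official edge cases, such as the
H/W separator rule):

def soundex(word, length=4):
    # Map consonants to Soundex digit classes; vowels, h, w and y map to nothing.
    table = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    digit = {}
    for group, d in table.items():
        for ch in group:
            digit[ch] = d
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = digit.get(letters[0], "")
    for ch in letters[1:]:
        d = digit.get(ch, "")
        if d and d != prev:   # skip uncoded letters and collapse repeated digits
            code += d
        prev = d
    return (code + "000")[:length]

soundex("Robert") and soundex("Rupert") both come out as "R163", which is the
kind of grouping a phonetic match is after.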

Jan 31 '06 #6

BBands wrote:
<snip>
As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab


Here is my stab at it; I didn't fully test it, so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed [1].

def distance(a, b):
    """
    Computes the levenshtein distance between two strings
    """
    m, n = (len(a),a), (len(b),b)
    if m[0] < n[0]:    # ensure that the 'm' tuple holds the longest string
        m, n = n, m
    dist = m[0]        # assume distance = length of longest string (worst case)
    for i in range(0, n[0]):    # reduce the distance for each char match in shorter string
        if m[1][i] == n[1][i]:
            dist = dist - 1
    return dist

[1] I have no idea if this is true or not. I can see arguments for both.

Jan 31 '06 #7
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similar algorithm to your functions.
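
On the song-title problem from the start of the thread it would look
something like this (the title list is made up):

import difflib

titles = ["Jolie Blon", "Joli Blond", "Jolie Blonde", "Folsom Prison Blues"]
# At most 10 matches, with a similarity cutoff of 0.6 - both numbers are tunable.
matches = difflib.get_close_matches("Jolie Blond", titles, 10, 0.6)
# matches should hold the three "Jolie" spellings, best first.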

-Greg
On 30 Jan 2006 17:28:47 -0800, ajones <aj*****@gmail.com> wrote:

Here is my stab at it; I didn't fully test it, so it may not work
correctly.
<snip>

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #8
> Thanks for that, I'll have a look. (So many packages, so little
time...)
Yes, there's a standard library for everything it seems! Except for a
MySQL api :-(
I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.


I would be very interested in seeing that.


Done, see:
http://www.blendedtechnologies.com/r...zy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking to find exact matches regardless of name
- Put each set of duplicates in its own subfolder.

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #9
Gregory Piñero wrote:
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similar algorithm to your functions.


A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.
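
If you're curious what difflib's similarity score looks like,
SequenceMatcher exposes it directly; a quick sketch (the sample strings are
arbitrary):

import difflib

def difflib_ratio(a, b):
    # Ratcliff/Obershelp-style similarity in [0, 1]; 1.0 means identical.
    return difflib.SequenceMatcher(None, a, b).ratio()

for a, b in [("Jolie Blon", "Joli Blond"), ("Jolie Blon", "Folsom Prison Blues")]:
    print("%s / %s: %.3f" % (a, b, difflib_ratio(a, b)))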

Kent
Jan 31 '06 #10
I wonder which algorithm determines the similarity between two strings better?

On 1/31/06, Kent Johnson <ke**@kentsjohnson.com> wrote:
<snip>
So no, it doesn't seem to be using Levenshtein distance.

Kent

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Jan 31 '06 #11
On Tue, 31 Jan 2006 10:51:44 -0500, Gregory Piñero wrote:
http://www.blendedtechnologies.com/r...zy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking to find exact matches regardless of name
- Put each set of duplicates in its own subfolder.


This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?
--
Steven.

Jan 31 '06 #12
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.
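
In sketch form (md5 here, and a dict takes the place of the explicit sort;
f.read() slurps whole files, so you'd want to chunk it for anything big):

import hashlib

def find_duplicates(paths):
    # Group files by the MD5 of their contents; any group holding more than
    # one path is a set of byte-identical files.
    groups = {}
    for path in paths:
        f = open(path, "rb")
        digest = hashlib.md5(f.read()).hexdigest()
        f.close()
        groups.setdefault(digest, []).append(path)
    return [g for g in groups.values() if len(g) > 1]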
Jan 31 '06 #13
On Tue, 31 Jan 2006, it was written:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare?
I often wonder that!
Is it purely a caching thing (once you have the checksum, you don't
need to read the file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates (i.e.
if files 637 and 2945 have the same contents, you want to discover
that). The obvious way is make a list of hashes, and sort the list.


Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, they all differ within the first 20 bytes - probably due to the ID3
tags, so this should work for these.

Better yet, note that if two files are identical, they must have the same
length, and that finding the length can be done very cheaply, so a quicker
yet approach is to make a list of lengths, sort that, and look for
duplicates; when you find one, do a byte-by-byte comparison of the files
(probably terminating in the first block) to see if they really are the
same.
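
In code, that plan is only a few lines (the function name is invented;
filecmp does the byte-by-byte part and gives up at the first differing block):

import os, filecmp

def duplicate_candidates(paths):
    # Group by size first - only files of equal size can possibly be identical.
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)
    duplicates = []
    for group in by_size.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):  # full content compare
                    duplicates.append((a, b))
    return duplicates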

By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within the
first 18 bytes (mostly, within the first 9 bytes). Therefore, i could rule
out duplication with just 22 data blocks read from disk (plus rather more
blocks of directory information and inodes, of course). A hash-based
approach would have had to wade through a touch over 13 GB of data before
it could even get started.

Of course, there are situations where this is the wrong approach - if you
have a collection of serialised sparse matrices, for example, which
consist of identically-sized blocks of zeroes with a scattering of ones
throughout, then lengths and prefixes will be useless, whereas hashes will
work perfectly. However, here, we're looking at MP3s, where lengths and
prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one is a song from that and The Doors' 'Horse Latitudes', and
one is between two Calexico songs ('The Ride (Pt II)' and 'Minas De
Cobre'). Why on earth are eleven of the twelve pairs pairs of songs from
the same artist? Is it really that they're pairs of songs from the same
compressor (those tracks aren't from CD), i wonder?

--
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme. -- Select Committee
on Transport
Feb 1 '06 #14
On Tue, 31 Jan 2006 13:38:50 -0800, Paul Rubin wrote:
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.


Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)

I can't see any, but I thought maybe that's because I'm not thinking
outside the box.
--
Steven.

Feb 1 '06 #15
Tom Anderson <tw**@urchin.earth.li> writes:
The obvious way is make a list of hashes, and sort the list.
Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, ...


The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for. Hashing the first
few hundred bytes would be more reliable, since it would include some
actual audio in the hash. But then you're back to a list of hashes.
Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so
a quicker yet approach is to make a list of lengths, sort that, and
look for duplicates; when you find one, do a byte-by-byte comparison
of the files (probably terminating in the first block) to see if they
really are the same.
Yes, checking the file lengths first is an obvious heuristic, but if
you find you have a bunch of files with the same length, what do you
do? You're back to a list of hashes.
By way of example, of the 2690 music files in my iTunes library, i
have twelve pairs of same-sized files [1], and all of these differ
within the first 18 bytes (mostly, within the first 9 bytes).


That's a small enough set of matches that you don't need a general
purpose algorithm.
Feb 2 '06 #16
Steven D'Aprano <st***@REMOVETHIScyber.com.au> writes:
Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)


No of course not, except in special situations, like some problem
opening and reading both files simultaneously. E.g.: the files are on
two different DVD-R's, they are too big to fit in ram, and you only
have one DVD drive. If you want to compare byte by byte, you have to
either copy one of the DVD's to your hard disk (if you have the space
available) or else swap DVD's back and forth in the DVD drive reading
and comparing a bufferload at a time. But you can easily read in the
first DVD and compute its hash on the fly, then read and hash the
second DVD and compare the hashes.

If it's a normal situation with two files on HD, just open both files
simultaneously, and use large buffers to keep the amount of seeking
reasonable. That will be faster than big md5 computations, and more
reliable (there are known ways to construct pairs of distinct files
that have the same md5 hash.)
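
That buffered comparison is short enough to sketch (the buffer size is an
arbitrary choice):

def files_equal(path_a, path_b, bufsize=1 << 20):
    # Read both files in large chunks and stop at the first difference.
    fa, fb = open(path_a, "rb"), open(path_b, "rb")
    try:
        while True:
            block_a, block_b = fa.read(bufsize), fb.read(bufsize)
            if block_a != block_b:
                return False
            if not block_a:  # both files exhausted at the same point
                return True
    finally:
        fa.close()
        fb.close()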
Feb 2 '06 #17
Steven D'Aprano wrote:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


Because if you store a hash, then you can keep that around even when the
original file is archived, moved elsewhere, or deleted. It's awfully
helpful for building databases of files you've seen before.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
Everyone wants to look good at his own funeral.
-- Louis Wu
Feb 2 '06 #18
On Thu, 1 Feb 2006, it was written:
Tom Anderson <tw**@urchin.earth.li> writes:
The obvious way is make a list of hashes, and sort the list.


Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, ...


The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for.


So read from the end of the file, rather than the beginning.
Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so a
quicker yet approach is to make a list of lengths, sort that, and look
for duplicates; when you find one, do a byte-by-byte comparison of the
files (probably terminating in the first block) to see if they really
are the same.


Yes, checking the file lengths first is an obvious heuristic, but if you
find you have a bunch of files with the same length, what do you do?
You're back to a list of hashes.


Or prefixes or suffixes.
By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within
the first 18 bytes (mostly, within the first 9 bytes).


That's a small enough set of matches that you don't need a general
purpose algorithm.


True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe it is representative of
most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases. Could you suggest a common kind of file with degenerate
lengths, prefixes and suffixes?

The only one that springs to mind is a set of same-sized image files in
some noncompressed format, recording similar images (frames in a movie,
say), where the differences might be buried deep in the pixel data. As it
happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.

tom

--
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt
Feb 2 '06 #19
Diez B. Roggisch wrote:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
    """
    Computes a relative distance between two strings. It's in the range
    (0-1] where 1 means total equality.
    @type a: string
    @param a: arg one
<snip>

Hello,

I adapted your approach to my needs and thought you might like to see
the result

def LevenshteinRelative(a, b):
    """
    Returns the Levenshtein distance between two strings
    as a relative quantity in the range 1 to 0 where
    1.0 is a perfect match.
    """
    # Calculate the Levenshtein distance. (returns an integer)
    dist = LevenshteinDistance(a, b)
    # dist is at most the length of the longer string.
    max_dist = float(max(len(a), len(b)))
    # dist is always at least the difference of the sizes of the two strings.
    min_dist = max_dist - float(min(len(a), len(b)))
    try:
        relative = 1.0 - (dist - min_dist) / (max_dist - min_dist)
    except ZeroDivisionError:
        # If max_dist and min_dist are equal, fall back to the simple form.
        relative = 1.0 - dist / max_dist
    return relative
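
In the search itself this then just becomes a cutoff (the 0.75 and the
titles list are placeholders):

query = "Jolie Blond"
matches = [t for t in titles if LevenshteinRelative(query, t) >= 0.75]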

Thanks,

jab

Feb 3 '06 #20
Gregory Piñero wrote:
:
If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking to find exact matches regardless of name
- Put each set of duplicates in its own subfolder.


Done? http://pyfdupes.sourceforge.net/

Bye,
luca
Feb 9 '06 #21
Wow, that looks excellent. I'll definitely try it out. I'm assuming
this is an existing project, i.e. you didn't write it after reading
this thread?

-Greg
On 2/9/06, name <na**@example.com> wrote:
<snip>
Done? http://pyfdupes.sourceforge.net/

Bye,
luca

--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)
Feb 9 '06 #23
Gregory Piñero wrote:
Wow, that looks excellent. I'll definitely try it out. I'm assuming
this is an existing project, i.e. you didn't write it after reading
this thread?


Yes, it is an existing project of course ;)
Right now I have no time to improve it.
I hope that later this summer I will find
the time to make file management much easier on the
GUI side and to split all the different algorithms into
external plugins, so anyone can add their own and compare
with the others already there. The main goal of my project
was primarily an interest in "similarity search".

bye,
luca
Feb 10 '06 #24