Bytes IT Community

Fuzzy matching of postal addresses

I have a problem that I suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.

Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.

I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases
could be a pain to deal with!

It's probably not relevant, but I'm using ZODB to store the databases.

The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically less than 50). The database has a mapping that lets me do
this efficiently. Then I look for 'good' matches. If there is exactly
one I declare a success. This isn't as efficient as it could be, it's
O(n^2) for each postcode, because I end up comparing all possible pairs.
But it's fast enough for my application.
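In outline, the matching loop is something like this (a sketch with hypothetical helper names; the is_good_match test is of course the open question):

```python
from collections import defaultdict

def normalise(addr):
    """Crude normalisation: uppercase, punctuation to spaces."""
    return " ".join(addr.upper().replace(",", " ").split())

def match_addresses(db_a, db_b, postcode_of, is_good_match):
    # Block on postcode so each database-A address is compared only
    # with the (typically fewer than 50) B addresses sharing its code.
    by_postcode = defaultdict(list)
    for b in db_b:
        by_postcode[postcode_of(b)].append(b)
    matches = {}
    for a in db_a:
        good = [b for b in by_postcode[postcode_of(a)]
                if is_good_match(a, b)]
        if len(good) == 1:          # exactly one good match: success
            matches[a] = good[0]
    return matches
```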

The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation,
but there are other issues.

Here are just some examples where the software didn't declare a match:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

The challenge is to fix some of the false negatives above without
introducing false positives!

Any pointers gratefully received.

--
Andrew McLean
Jul 18 '05 #1
17 Replies



On 2005-01-18, Andrew McLean <sp***********@at-andros.demon.co.uk> wrote:
I have a problem that I suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.
I have done something very similar (well, near as dammit identical,
actually), unfortunately, I did this in Perl[1].

I can give you some general pointers of the way to go about this
type of thing, but can't actually provide any code, as it is at work.
Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.
In my implementation this is the exact same setup, database A
(OS/RM addresspoint data) contained a metric shitload of addresses,
with addresses to be matched against them coming from client supplied
data.
I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases
could be a pain to deal with!
There is no way to guarantee 100% accuracy if one (or both) of the
datasets is collated from a non-authoritative source, i.e. the actual
homeowners. ;)
It's probably not relevant, but I'm using ZODB to store the databases.

The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically less than 50). The database has a mapping that lets me do
this efficiently. Then I look for 'good' matches. If there is exactly
one I declare a success. This isn't as efficient as it could be, it's
O(n^2) for each postcode, because I end up comparing all possible pairs.
But it's fast enough for my application.
OK, this is a good start, the first thing to do is to clean the data,
especially the postcodes.

something along the lines of (a runnable Python version of the
pseudocode I originally posted):

def clean_postcode(postcode):
    """Reduce a raw postcode to alphanumerics, keeping the single
    space before the inward code (the fourth character)."""
    clean = ""
    for i, ch in enumerate(postcode):
        # if the fourth character is a digit or a space, append it
        if i == 3 and (ch.isdigit() or ch == " "):
            clean += ch
        # otherwise append only alphanumeric characters
        elif ch.isalnum():
            clean += ch
    return clean

That puts all the postcodes into the format that the Royal Mail uses.

Then search on postcode ;)

The next thing I did was to split each individual word out from
its field, and matched that in a specific order (I found starting
with elements such as house name, number, and street name
was the best approach).

If a word matched it was assigned a score (1 for an exact match, 0.7 for
a metaphone match and 0.6 for a soundex match IIRC), and when the searching
was finished I took the resulting score and divided it by the number of
word elements.
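In Python, that scoring step might look something like this (a sketch: I hand-roll a standard Soundex here and leave out the 0.7 metaphone step, which would normally come from a library such as jellyfish; treat the weights as the ones from my Perl version):

```python
def soundex(word):
    """Standard 4-character American Soundex code."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":          # H and W do not separate equal codes
            prev = code
    return (out + "000")[:4]

def score(query_words, cand_words):
    """1.0 per exact word match, 0.6 per soundex match, normalised
    by the number of query words (0.7 metaphone step omitted)."""
    cand_codes = {soundex(w) for w in cand_words}
    total = 0.0
    for w in query_words:
        if w in cand_words:
            total += 1.0
        elif soundex(w) in cand_codes:
            total += 0.6
    return total / len(query_words)
```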

If that score was higher than any of the previous scores then it was put
in a given variable. If there were a number of equally good (bad?) matches
then they were appended onto an array, and if there was no clear winner
by the time the last of the records for that postcode had been processed
it spat out a multiple-choice list.

The trick is picking a threshold level below which no matches are
put into the DB, even if they are the best scoring. (I used a threshold
of 0.3.)

This can be refined; the current, extremely baroque, perl script
drops certain values from the data array if there is an exact match
with certain fields, such as house name. It doesn't reduce the value
of the integer that the result is divided by, though, thus favouring
results that return an exact match on a couple of given fields.
The challenge is to fix some of the false negatives above without
introducing false positives!

Any pointers gratefully received.


Hope this is a) understandable, and b) useful ;)

FWIW, the perl script (and I would expect a similarly implemented python
script to perform about as well) running a somewhat flaky set of
user-collated data against the Royal Mail Addresspoint data managed
a 75% hit rate, with an additional 5% requiring user intervention,
and, as near as I have been able to ascertain, a <1% false positive
count, from a dataset of nearly 17,000 addresses.

With cleaner and more up-to-date data I would expect the results
to be noticeably better.

[1] It is still my main language, I don't use python enough to
think in it as easily as I think in perl ;)

--
James jamesk[at]homeric[dot]co[dot]uk

"Consistency is the last resort of the unimaginative."
-- Bob Hope
Jul 18 '05 #2

Andrew McLean wrote:
The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation,
but there are other issues.

I'd do a bit more extensive normalization. First, strip off the city
through postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your
examples). In the remaining string, remove any punctuation and words
like "the", "flat", etc.
Here are just some examples where the software didn't declare a match:
And how they'd look after the transformation I suggest above:
1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
1 Brantwood
BEECHES 1 BRANTWOOD
Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
2 Bethany House Broadwindsor Road
2 BETHANY HOUSE
Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE
St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR
The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF


Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL
Obviously, this is not perfect, but it's closer. At this point, you
could perhaps say that if either string is a substring of the other,
you have a match. That should work with all of these examples except
the last one. You could either do this munging for all address
lookups, or you could do it only for those that don't find a match in
the simplistic way. Either way, you can store the Database B's
pre-munged address so that you don't need to constantly recompute
those. I can't say for certain how this will perform in the false
positives department, but I'd expect that it wouldn't be too bad.
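A quick sketch of that munging (the tail-stripping assumes you can pass in the known city-through-postcode suffix; the noise-word list here is just illustrative):

```python
import re

NOISE = {"THE", "FLAT"}

def munge(address, tail):
    """Drop everything from the known city/county/postcode tail onward,
    strip punctuation, and remove noise words like "the" and "flat"."""
    head = address.upper()
    cut = head.find(tail.upper())
    if cut != -1:
        head = head[:cut]
    words = re.sub(r"[^A-Z0-9 ]", " ", head).split()
    return " ".join(w for w in words if w not in NOISE)

def substring_match(a, b, tail):
    """Declare a match if either munged string contains the other."""
    ma, mb = munge(a, tail), munge(b, tail)
    return ma in mb or mb in ma
```

As noted above, this catches the Brantwood pair but not the White Sheet Hill one, where the spelling differs.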

For a more-detailed matching, you might look into finding an algorithm
to determine the "distance" between two strings and using that to
score possible matches.

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #3


"Andrew McLean" <sp***********@at-andros.demon.co.uk> wrote in message
news:96**************@at-andros.demon.co.uk...
I have a problem that I suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.
There isn't any publicly available code that I'm aware of.
Companies that do a good job of address matching regard
that code as a competitive advantage on a par with the
crown jewels.
Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More precisely,
for each address in database A I want to find a single matching address in
database B.

I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases could
be a pain to deal with!
From a purely pragmatic viewpoint, is this a one-off, and how many
non-matches do you have to deal with? If the answers are yes,
and not all that many, I'd do the rest by hand.
It's probably not relevant, but I'm using ZODB to store the databases.
I doubt if it's relevant.
The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically less than 50). The database has a mapping that lets me do this
efficiently. Then I look for 'good' matches. If there is exactly one I
declare a success. This isn't as efficient as it could be, it's O(n^2) for
each postcode, because I end up comparing all possible pairs. But it's
fast enough for my application.

The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation, but
there are other issues.
I used to work on a system that had a reasonably decent address
matching routine. The critical issue is, as you suspected, normalization.
You're not going far enough. You've also got an issue here that doesn't
exist in the States - named buildings.

Here are just some examples where the software didn't declare a match:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
The first line is a street address, the second is a named building and a
street without a house number. There's no way of matching this unless you
know that The Beeches doesn't have flat (or room, etc.) numbers and can
move the 1 to being the street address. On the other hand, this seems to be a
consistent problem in your database - in the US, the street address must
be associated with the street name. No comma is allowed between the two.
Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
The first is a flat, house name and street name, the second is a number
and a house name. Assuming that UK postal standards don't allow
more than one named building in a postal code, this is easily matchable
if you do a good job of normalization.
Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
The issue here is to use the words "flat" and "the" to split the flat
name and the house name. Then the house number is in the wrong
part - it should go with the street name. See the comment above.
St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
This one may not be resolvable, unless there is only one house name
with "presbytery" in it within the postal code. Notice that "the" should
probably be dropped when normalizing.
The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
Spelling correction needed.
The challenge is to fix some of the false negatives above without
introducing false positives!

Any pointers gratefully received.
If, on the other hand, this is a repeating problem that's simply going
to be an ongoing headache, I'd look into commercial address correction
software. Here in the US, there are a number of vendors that have
such software to correct addresses to the standards of the USPS.
They also have data bases of all the legitimate addresses in each
postal code. They're adjuncts of mass mailers, and they exist
because the USPS gives a mass mailing discount based on the
number of "good" addresses you give them.

I don't know what the situation is in the UK, but I'd be surprised
if there wasn't some available address database, either commercial
or free, possibly as an adjunct of the postal service.

The latter, by the way, is probably the first place I'd look. The
postal service has a major interest in having addresses that they
can deliver without a lot of hassle.

Another place is google. The first two pages using "Address
Matching software" gave two UK references, and several
Australian references.

John Roth


Jul 18 '05 #4


Andrew> I'm 90% of the way there, in the sense that I have a simplistic
Andrew> approach that matches 90% of the addresses in database A. But
Andrew> the extra cases could be a pain to deal with!

Based upon the examples you gave, here are a couple things you might try to
reduce the size of the difficult comparisons:

* Remove "the" and commas as part of your normalization process

* Split each address on white space and convert the resulting list to a
set, then consider the size of the intersection with other addresses
with the same postal code:
>>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
>>> a1
"ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
>>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
>>> a2
'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
>>> a1 == a2
False
>>> sa1 = set(a1.split())
>>> sa2 = set(a2.split())
>>> len(sa1)
8
>>> len(sa2)
6
>>> len(sa1 & sa2)
6

Skip
Jul 18 '05 #5

Ermmm ... only remove "the" when you are sure it is a whole word. Even
then it's a dodgy idea. In the first 1000 lines of the nearest address
file I had to hand, I found these: Catherine, Matthew, Rotherwood,
Weatherall, and "The Avenue".

Ermmm... don't rip out commas (or other punctuation); replace them with
spaces. That way "SHORTMOOR,BEAMINSTER" doesn't become one word
"SHORTMOORBEAMINSTER".
A not-unreasonable similarity metric would be float(len(sa1 & sa2)) /
len(sa1 | sa2). Even more reasonable would be to use trigrams instead
of words -- more robust in the presence of erroneous insertion or
deletion of spaces (e.g. Short Moor and Bea Minster are plausible
variations) and spelling errors and typos. BTW, the OP's samples look
astonishingly clean to me, so unlike real world data.
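For example (a sketch; each string is padded with a space on either side so word boundaries contribute their own trigrams):

```python
def trigrams(s):
    """Set of overlapping 3-character substrings of the uppercased,
    whitespace-collapsed, space-padded string."""
    s = " " + " ".join(s.upper().split()) + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard similarity len(sa & sb) / len(sa | sb) over trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

On "White Sheet Hill" vs "WHITESHEET HILL" this scores around 0.7, where whole-word Jaccard drops to 0.25 because of the missing space.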

Two general comments addressed to the OP:
(1) Your solution doesn't handle the case where the postal code has
been butchered. e.g. "DT8 BEL" or "OT8 3EL".
(2) I endorse John Roth's comments. Validation against an address data
base that is provided by the postal authority, using either an
out-sourced bureau service, or buying a licence to use
validation/standardisation/repair software, is IMHO the way to go. In
Australia the postal authority assigns a unique ID to each delivery
point. This "DPID" has to be barcoded onto the mail article to get bulk
postage discounts. Storing the DPID on your database makes duplicate
detection a snap. You can license s/w (from several vendors) that is
certified by the postal authority and has batch and/or online APIs. I
believe the situation in the UK is similar. At least one of the vendors
in Australia is a British company. Google "address deduplication
site:.uk"
Actually (3): If you are constrained by budget, pointy-haired boss or
hubris to write your own software (a) lots of luck (b) you need to do a
bit more research -- look at the links on the febrl website, also
Google for "Monge Elkan", read their initial paper, look at the papers
referencing that on citeseer; also google for "merge purge"; also
google for "record linkage" (what the statistical and medical
fraternity call the problem) (c) and have a damn good look at your data
[like I said, it looks too clean to be true] and (d) when you add a
nice new wrinkle like "strip out 'the'", do make sure to run your
regression tests :-)
Would you believe (4): you are talking about cross-matching two
databases -- don't forget the possibility of duplicates _within_ each
database.
HTH,
John

Jul 18 '05 #6

You can't even get anywhere near 100% accuracy when comparing
"authoritative sources" e.g. postal authority and the body charged with
maintaining a database of which streets are in which electoral district
-- no, not AUS, but close :-)

Jul 18 '05 #7

Andrew McLean wrote:
I have a problem that is suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.

Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.


I had a similar problem to solve a while ago. I can't give you my code,
but I used this paper as the basis for my solution (BibTeX entry from
http://citeseer.ist.psu.edu/monge00adaptive.html):

@misc{ monge-adaptive,
author = "Alvaro E. Monge",
title = "An Adaptive and Efficient Algorithm for Detecting
Approximately Duplicate
Database Records",
url = "citeseer.ist.psu.edu/monge00adaptive.html" }

There is a lot of literature--try a google search for "approximate
string match"--but very little publicly available code in this area,
from what I could gather. Removing punctuation, etc., as others have
suggested in this thread, is _not_sufficient_. Presumably you want to
be able to match typos or phonetic errors as well. This paper's
algorithm deals with those problems quite nicely.

--
--------------------------------------------------------------------
Aaron Bingham
Application Developer
Cenix BioScience GmbH
--------------------------------------------------------------------

Jul 18 '05 #8

You might find these at least peripherally useful:
<http://www.brunningonline.net/simon/blog/archives/001291.html>
<http://www.brunningonline.net/simon/blog/archives/001292.html>

They refer to address formatting rather than de-duping - but
normalising sounds like a useful first step to me.

--
Cheers,
Simon B,
si***@brunningonline.net,
http://www.brunningonline.net/simon/blog/
Jul 18 '05 #9

I think you guys are missing the point. All you would need to add to
get a 'probable match' is another search that goes through the 10%
that didn't get matched and does an "endswith" search on the data. From the
example data you showed me, that would match a good 90% of the 10%,
leaving you with a 1% that must be hand matched. You would have to
combine this idea with Jeff Shannon's idea to make it work more
efficiently.

Jul 18 '05 #10


John Machin wrote:
Ermmm ... only remove "the" when you are sure it is a whole word. Even
then it's a dodgy idea. In the first 1000 lines of the nearest address
file I had to hand, I found these: Catherine, Matthew, Rotherwood,
Weatherall, and "The Avenue".


Partial apologies: I wasn't reading Skip's snippet correctly -- he had
"THE ", I read "THE". Only "The Avenue" is a problem in the above list.
However Skip's snippet _does_ do damage in cases where the word ends in
"the". Grepping lists of placenames found 25 distinct names in UK,
including "The Mythe" and "The Wrythe".

Addendum: Given examples in the UK like "Barton in the Beans" (no
kiddin') and "Barton-on-the-Heath", replacing "-" by space seems
indicated.

Jul 18 '05 #11

Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option, this is a 'volunteer' project. I'll
try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. The
data sources are both ones that you might expect to be authoritative. If
you use having a correctly formatted and valid postcode as a metric, in
one database 100% of the records pass, and in the other 99.96% do.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my
first approach to the house name issue. The problem is you end up
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".

I am tempted to try an approach based on splitting the address into a
sequence of normalised tokens. Then work with a metric based on the
differences between the sequences. The simple case would look at
deleting tokens and perhaps concatenating tokens to make a match.

--
Andrew McLean
Jul 18 '05 #12


"Andrew McLean" <sp***********@at-andros.demon.co.uk> wrote in message
news:$a**************@at-andros.demon.co.uk...
Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option, this is a 'volunteer' project. I'll
try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. The data
sources are both ones that you might expect to be authoritative. If you
use having a correctly formatted and valid postcode as a metric, in one
database 100% of the records pass, and in the other 99.96% do.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my
first approach to the house name issue. The problem is you end up matching
"12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".

I am tempted to try an approach based on splitting the address into a
sequence of normalised tokens. Then work with a metric based on the
differences between the sequences. The simple case would look at deleting
tokens and perhaps concatenating tokens to make a match.
It's been a while since I did this stuff. The trick in dealing
with address normalization is to parse the string backwards,
and try to slot the pieces into the general pattern.

In your case, the postal code, district (is that what it's called
in the UK?) and city seem to be fine, it's when you get to the
street (with or without house number), building name and flat
or room number that there's a difficulty.

We always had a list of keywords that could be trusted
to be delimiters. In your examples, "the" should be pretty
reliable in indicating a building name. Of course, that might
have some trouble with Tottering on the Brink.

John Roth


Jul 18 '05 #13

Andrew McLean wrote:
Thanks for all the suggestions. There were some really useful pointers.

A few random points: [snip] 4. You need to be careful doing an endswith search. It was actually my
first approach to the house name issue. The problem is you end up
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".


Is that really a problem? That looks like a likely typo to me. I guess
it depends on your data set. In my case, the addresses were scattered
all over the place, with relatively few in a given city, so the
likelihood of two addresses on the same street in the same town was very
low. We /wanted/ to check for this kind of 'duplication'.

Note that endswith will not deal with 'Avenue' vs. 'Ave.', but I suppose
a normalization phase could take care of this for you. The Monge
algorithm I pointed you to takes care of this pretty nicely.

--
--------------------------------------------------------------------
Aaron Bingham
Application Developer
Cenix BioScience GmbH
--------------------------------------------------------------------

Jul 18 '05 #14

In article <ma**************************************@python.org>, Aaron
Bingham <bi*****@cenix-bioscience.com> writes
Andrew McLean wrote:
Thanks for all the suggestions. There were some really useful pointers.
A few random points:[snip]
4. You need to be careful doing an endswith search. It was actually
my first approach to the house name issue. The problem is you end up
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".


Is that really a problem? That looks like a likely typo to me. I
guess it depends on your data set. In my case, the addresses were
scattered all over the place, with relatively few in a given city, so
the likelihood of two addresses on the same street in the same town was
very low. We /wanted/ to check for this kind of 'duplication'.


There is something I didn't tell you ;-). My data sets are exactly the
opposite of being scattered. Both databases are for a particular
geographical location and should contain the vast majority of the
households within that location.

So I have to worry about:

Chantry Farmhouse, Newtown, BEAMINSTER, DORSET, DT8 3SB

Potentially matching to any of:

CHANTRY COTTAGE, BEAMINSTER, DORSET DT8 3SB
CHANTRY COTTAGE ANNEXE, BEAMINSTER, DORSET DT8 3SB
CHANTRY FARM COTTAGE, BEAMINSTER, DORSET DT8 3SB
CHANTRY FARM HOUSE, BEAMINSTER, DORSET DT8 3SB

Or looking for:

The Mews House, 20 The Square, BEAMINSTER, DORSET, DT8 3AU

Among a set including:

20A, THE SQUARE, BEAMINSTER, DORSET DT8 3AU
THE MEWS 20, THE SQUARE, BEAMINSTER, DORSET DT8 3AU
THE MEWS 20B, THE SQUARE, BEAMINSTER, DORSET DT8 3AU
20, THE SQUARE, BEAMINSTER, DORSET DT8 3AU

I think the token sequence based approach shows promise for these
examples. In the first case I get an exact match by concatenating two
tokens (farm and house), which I think should be given a low penalty.
And in the second by deleting one token (house) which should have a
higher penalty.

So I think a variant on the Smith-Waterman algorithm operating on tokens
looks promising.
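A first cut at that alignment might look like this (a sketch only: global token-level edit distance rather than true local Smith-Waterman, with uniform placeholder costs plus a cheap "concatenate two adjacent tokens" move):

```python
def token_align_cost(seq_a, seq_b, del_cost=1.0, sub_cost=1.0,
                     concat_cost=0.2):
    """Edit distance over token sequences, extended with a move that
    concatenates two adjacent tokens, so FARM + HOUSE can match
    FARMHOUSE cheaply.  Forward DP over an (n+1) x (m+1) grid."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = d[i][j]
            if here == INF:
                continue
            if i < n:                        # drop a token from A
                d[i + 1][j] = min(d[i + 1][j], here + del_cost)
            if j < m:                        # drop a token from B
                d[i][j + 1] = min(d[i][j + 1], here + del_cost)
            if i < n and j < m:              # match or substitute
                step = 0.0 if seq_a[i] == seq_b[j] else sub_cost
                d[i + 1][j + 1] = min(d[i + 1][j + 1], here + step)
            if i + 1 < n and j < m and seq_a[i] + seq_a[i + 1] == seq_b[j]:
                d[i + 2][j + 1] = min(d[i + 2][j + 1], here + concat_cost)
            if i < n and j + 1 < m and seq_b[j] + seq_b[j + 1] == seq_a[i]:
                d[i + 1][j + 2] = min(d[i + 1][j + 2], here + concat_cost)
    return d[n][m]
```

With these costs, concatenating FARM and HOUSE scores a low 0.2 while deleting HOUSE scores a higher 1.0, matching the penalty split described above.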
Note that endswith will not deal with 'Avenue' vs. 'Ave.', but I suppose
a normalization phase could take care of this for you. The Monge
algorithm I pointed you to takes care of this pretty nicely.


--
Andrew McLean
Jul 18 '05 #15

Andrew McLean wrote:
In case anyone is interested, here is the latest. def insCost(tokenList, indx, pos):
"""The cost of inserting a specific token at a specific normalised position along the sequence.""" if containsNumber(tokenList[indx]):
return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
elif indx > 0 and containsNumber(tokenList[indx-1]):
return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos) elif tokenList[indx][0] in minorTokenList:
return INSERT_MINOR_TOKEN
else:
return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)

def delCost(tokenList, indx, pos):
"""The cost of deleting a specific token at a specific normalised position along the sequence. This is exactly the same cost as inserting a token."""
return insCost(tokenList, indx, pos)


Functions are first-class citizens of Pythonia -- so just do this:

delCost = insCost

Re speed generally: (1) How many addresses in each list and how long is
it taking? On what sort of configuration? (2) Have you considered using
psyco -- if not running on x86 architecture, consider exporting your
files to a grunty PC and doing the match there. (3) Have you considered
some relatively fast filter to pre-qualify pairs of addresses before
you pass the pair to your relatively slow routine?

Soundex?? To put it bluntly, the _only_ problem to which soundex is the
preferred solution is genealogy searching in the US census records, and
even then one needs to know what varieties of the algorithm were in use
at what times. I thought you said your addresses came from
authoritative sources. You have phonetic errors? Can you give some
examples of pairs of tokens that illustrate the problem you are trying
to overcome with soundex?

Back to speed again: When you look carefully at the dynamic programming
algorithm for edit distance, you will note that it is _not_ necessary
to instantiate the whole NxM matrix -- it only ever refers to the
current row and the previous row. What does space saving have to do
with speed, you ask? Well, Python is not FORTRAN; it takes considerable
effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
and swap (the pointers to) them each time around the outer loop. At the
expense of a little more complexity, one can reduce this to one row and
3 variables (north, northwest, and west) corresponding to d[i-1][j],
d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
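In code, the two-row version looks like this (a sketch of plain character-level Levenshtein distance; the current row becomes the previous row each time around the outer loop, so the full NxM matrix is never built):

```python
def edit_distance(a, b):
    """Levenshtein distance keeping only the previous and current
    rows of the dynamic-programming matrix."""
    prev = list(range(len(b) + 1))        # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                        # d[i][0]: i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # north: delete
                            curr[j - 1] + 1,      # west: insert
                            prev[j - 1] + cost))  # northwest: substitute
        prev = curr                       # swap rows
    return prev[-1]
```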
Hope some of this helps,

John

Jul 18 '05 #16

In article <11*********************@f14g2000cwb.googlegroups.com>, John
Machin <sj******@lexicon.net> writes
Andrew McLean wrote:
In case anyone is interested, here is the latest.
def insCost(tokenList, indx, pos):
    """The cost of inserting a specific token at a specific
    normalised position along the sequence."""
    if containsNumber(tokenList[indx]):
        return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif indx > 0 and containsNumber(tokenList[indx-1]):
        return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif tokenList[indx][0] in minorTokenList:
        return INSERT_MINOR_TOKEN
    else:
        return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)

def delCost(tokenList, indx, pos):
    """The cost of deleting a specific token at a specific normalised
    position along the sequence.
    This is exactly the same cost as inserting a token."""
    return insCost(tokenList, indx, pos)


Functions are first-class citizens of Pythonia -- so just do this:

delCost = insCost


Actually, the code used to look like that. I think I changed it so that
it would look clearer. But perhaps that was a bad idea.
Re speed generally: (1) How many addresses in each list and how long is
it taking? On what sort of configuration? (2) Have you considered using
psyco -- if not running on x86 architecture, consider exporting your
files to a grunty PC and doing the match there. (3) Have you considered
some relatively fast filter to pre-qualify pairs of addresses before
you pass the pair to your relatively slow routine?
There are approx. 50,000 addresses in each list.

At the moment the processing assumes all the postcodes are correct, and
only compares addresses with matching postcodes. This makes it a lot
faster. But may miss some cases of mismatched postcodes.

Also it does two passes. One looking for exact matches of token
sequences. This deals with about half the cases. Only then do I employ
the, more expensive, edit distance technique.

Overall, the program runs in less than half an hour. Specifically it
takes about 60s per thousand addresses, which requires an average of
about 8 calls to editDistance per address. Psyco.full() reduced the 60s
to 45s.

I'll only try optimisation if I need to use it much more.
Soundex?? To put it bluntly, the _only_ problem to which soundex is the
preferred solution is genealogy searching in the US census records, and
even then one needs to know what varieties of the algorithm were in use
at what times. I thought you said your addresses came from
authoritative sources. You have phonetic errors? Can you give some
examples of pairs of tokens that illustrate the problem you are trying
to overcome with soundex?
I'm sure that in retrospect Soundex might not be a good choice. The
misspellings tend to be minor, e.g.
Kitwhistle and KITTWHISTLE
Tythe and TITHE

I was tempted by an edit distance technique on the tokens, but would
prefer a hash based method for efficiency reasons.
Back to speed again: When you look carefully at the dynamic programming
algorithm for edit distance, you will note that it is _not_ necessary
to instantiate the whole NxM matrix -- it only ever refers to the
current row and the previous row. What does space saving have to do
with speed, you ask? Well, Python is not FORTRAN; it takes considerable
effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
and swap (the pointers to) them each time around the outer loop. At the
expense of a little more complexity, one can reduce this to one row and
3 variables (north, northwest, and west) corresponding to d[i-1][j],
d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
Hope some of this helps,


Thanks for that. The first edit distance algorithm I looked at did it
that way. But I based my code on a version of the algorithm I could
actually follow ;-). Getting the concatenation bit right was non-trivial
and it was useful to store all of "d" for debugging purposes.

As to Python not being Fortran. You've found me out. The three languages
I am most comfortable with are Fortran, Matlab and Python. It did occur
to me that numarray might be a more efficient way of dealing with a
4-dimensional array, but the arrays aren't very big, so the overhead in
setting them up might be significant.

The simplest optimisation would be to replace the two indices used to
deal with concatenation by four explicit variables. And then, as you
said, I could just store the three last rows, and avoid any multiple
indexing.

As with all these potential optimisations, you don't know until you try
them.

--
Andrew McLean
Jul 18 '05 #17

Andrew,
Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.


What percent of addresses in A have a unique corresponding address in
B? (i.e. how many addresses will have some match in B?)

This is a standard document retrieval task. Whole books could be
written about the topic. (In fact, many have been).

I suggest you don't waste your time trying to solve this problem from
scratch, and instead capitalize on the effort of others. Hence, my
proposal is pretty simple:
1. Regularize the punctuation of the text (e.g. convert it all to
uppercase), since it is uninformative and---at best---a confounding
variable.
2. Use a free information retrieval package to find matches.
e.g. LEMUR: http://www-2.cs.cmu.edu/~lemur/

In this case, a "document" is an address in Database B. A "query" is an
address in Database A. (Alternately, you could switch A and B to see if
that affects accuracy.)

Good luck.

Joseph

Jul 18 '05 #18
