
Address regex help

Hello all

I have a regex question... I want to split an address into discrete parts

so

709 S Milton Ave is split into
number = 709
Direction = S
Name = Milton
Type = Ave

So I have the following regex

(?<number>^\d*(\s\w|\w|\-\w|\s\d/\d))\s(?<direction>(n\.|N\.|s\.|S\.|E\.|e\.|W\.|w\.|NE\.|ne\.|SE\.|se\.|NW\.|nw\.|SW\.|sw\.|n|N|s|S|E|e|W|w|NE|ne|SE|se|NW|nw|SW|sw|North|East|West|South|north|south|west|east)*)(?<street>(.*[^street|place|drive|st|pl|dr|ave|av])*)(?<type>.*)

This works for the following address

709 S S Milton ave (as in 709 S South Milton ave)

since there the first S is part of the number,

but it does not work for

709 S Milton ave
because it treats the S as part of the number and not as the
direction....

Any ideas?
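For anyone who wants to reproduce the behaviour described above, here is a minimal sketch in Python. It is a port, not the original code: the thread's pattern uses .NET-style `(?<name>...)` groups, which Python's `re` module spells `(?P<name>...)`. The names `DIRECTIONS`, `ORIGINAL` and `split_with_original` are made up for this illustration.

```python
import re

# The direction alternatives from the posted pattern, unchanged.
DIRECTIONS = (
    r"n\.|N\.|s\.|S\.|E\.|e\.|W\.|w\.|NE\.|ne\.|SE\.|se\.|NW\.|nw\.|SW\.|sw\."
    r"|n|N|s|S|E|e|W|w|NE|ne|SE|se|NW|nw|SW|sw"
    r"|North|East|West|South|north|south|west|east"
)

# Same pattern as posted, with (?<name>...) rewritten as (?P<name>...).
ORIGINAL = re.compile(
    r"(?P<number>^\d*(\s\w|\w|\-\w|\s\d/\d))\s"
    r"(?P<direction>(" + DIRECTIONS + r")*)"
    r"(?P<street>(.*[^street|place|drive|st|pl|dr|ave|av])*)"
    r"(?P<type>.*)"
)


def split_with_original(address):
    """Return the four named groups as a dict, or None if nothing matches."""
    match = ORIGINAL.match(address)
    return match.groupdict() if match else None


result = split_with_original("709 S Milton ave")
print(result["number"])     # the digits plus " S"
print(result["direction"])  # empty string
```

The suffix group inside `number` is not optional, so after the digits the engine has to take one of its alternatives, and `\s\w` (tried first) happily consumes " S". That leaves nothing for `direction`, which is allowed to match zero times. Note also that `[^street|place|...]` is a character class: it excludes single characters, not whole words.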

Jun 14 '06 #1


Without a database to tell you whether the city has a "South Milton
Avenue", it's ambiguous. Why isn't number "709 S" on "Milton Ave" as valid
as number "709" on "S Milton Ave"?

Moreover, your regex is going to go crazy over
P.O. Box 6000
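The ambiguity can be made concrete with a small sketch. It assumes a toy grammar (digits, an optional one-letter suffix, an optional direction, one or more name words, a street type), and the names `DIRECTIONS`, `STREET_TYPES` and `candidate_parses` are hypothetical, invented only for this example.

```python
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"ST", "STREET", "AVE", "AV", "DR", "DRIVE", "PL", "PLACE"}


def candidate_parses(address):
    """Return every (number, direction, name, type) reading of a street line."""
    tokens = address.split()
    # Toy grammar: digits first, a known street type last, something between.
    if len(tokens) < 3 or not tokens[0].isdigit():
        return []
    if tokens[-1].upper().rstrip(".") not in STREET_TYPES:
        return []
    middle, street_type = tokens[1:-1], tokens[-1]
    parses = []
    for take_suffix in (False, True):
        rest = list(middle)
        number = tokens[0]
        if take_suffix:
            # A lone letter right after the digits may be a house-number suffix.
            if not (len(rest[0]) == 1 and rest[0].isalpha()):
                continue
            number += " " + rest.pop(0)
        for take_direction in (False, True):
            name = list(rest)
            direction = ""
            if take_direction:
                if not (name and name[0].upper().rstrip(".") in DIRECTIONS):
                    continue
                direction = name.pop(0)
            if name:  # every reading still needs a street name
                parses.append((number, direction, " ".join(name), street_type))
    return parses


for reading in candidate_parses("709 S Milton Ave"):
    print(reading)
print(candidate_parses("P.O. Box 6000"))
```

Under this grammar "709 S Milton Ave" has three readings (street "S Milton", direction "S", or number "709 S"), and nothing in the string itself says which one is right. "P.O. Box 6000" yields no reading at all, because it does not start with a number.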


Jun 14 '06 #2

The first thing you've got to do is figure out all of the possible
combinations of tokens that may make up an "address." You apparently
have noticed only one or two. In fact, an "address" can take many
forms and include many kinds of abbreviations. In addition, the
elements (tokens) of an address can appear in any number of orders,
particularly if these addresses come from different countries, and
especially if they have been provided by human beings rather than machines.

IOW, you've opened up a huge can of worms for yourself. What you need is not
just a regular expression, but a bit of AI to solve this problem. I have
seen it done, but I'm not sure *how* it's done. MapPoint and Google Maps can
do it fairly well, but Microsoft and Google have a lot of money to throw at
this sort of problem.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.


Jun 14 '06 #3

Thanks guys... a couple of responses....

1) 709 S | Milton Ave is not as valid as 709 | S | Milton Ave because
they want the direction separate... 709 S is not the street number, 709
is; and S Milton is not the street, Milton is.

2) Kevin, yeah, that's what I was suspecting but not wanting to think about.
The alternative for the client is to have 4 separate fields on the UI,
[number] [direction] [street] [type] .... but I hate this because it's
not intuitive.... or web standard.

Thanks for your input, guys.

mike
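If the rule in point 1 is accepted (a lone S after the digits is always the direction), a single-field input can still be split for the simple US-style case. The following is only a sketch under that assumption, written in Python rather than the .NET syntax of the original pattern; `STREET_LINE` and `split_street_line` are hypothetical names, and the short type list is copied from the original regex.

```python
import re

STREET_LINE = re.compile(
    r"""^(?P<number>\d+(?:-\w+|\s\d/\d|[a-z])?)   # 709, 709-B, 709 1/2, 709A
        \s+
        (?:(?P<direction>north|south|east|west|ne|nw|se|sw|n|s|e|w)\.?\s+)?
        (?P<street>.+?)                           # lazy, so the last word is left for type
        \s+
        (?P<type>street|place|drive|st|pl|dr|ave|av)\.?$""",
    re.IGNORECASE | re.VERBOSE,
)


def split_street_line(address):
    """Split one street line into four parts, or return None if it does not fit."""
    match = STREET_LINE.match(address.strip())
    # Groups that did not take part in the match come back as "".
    return match.groupdict(default="") if match else None


print(split_street_line("709 S Milton ave"))
print(split_street_line("709 South Street"))
print(split_street_line("P.O. Box 6000"))
```

The number group only accepts a suffix that is attached to the digits, hyphenated, or a fraction, so a spaced single letter falls through to `direction`. The direction group is optional, which lets the engine back out of it for a line such as "709 South Street", where "South" has to be the street name. The trade-off is that "709 S S Milton ave" now comes back with number 709, direction S and street "S Milton". Anything that is not a plain street line returns None and would still need separate handling.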



Jun 15 '06 #4

Keep in mind that addresses don't always follow that (or any similar)
format. Here are a few examples:

John Smith
Smith Enterprises
P.O. Box 12345
Anytown, Nebraska
00000

Jack and Jill Hill
RR 5 Box 909
Podunk, WI 12345-7890

MR S HOLMES
2978 W MAIN ST # 12
MINNEAPOLIS MN 23976-4542

May December
Bowers Holiday Village
Bldg 91 Apt. 2-A
12 31st Street
Baltimore, Maryland
79797
USA

Herrn
Günther Meyer
Goethestraße 25
20002 HAMBURG
Federal Republic of Germany

SGT NICK FURY
HEADQUARTERS COMPANY
7TH ARMY TRAINING CENTER
ATTN: AETT-AG
UNIT 28130
APO AE 09114

CUSTOMS ATTACHE
AMERICAN EMBASSY CARACAS
UNIT 4964
APO AA 34037

MS HELEN SAUNDERS
1010 CLEAR STREET
OTTAWA ON K1A 0B1
CANADA

MS JOYCE BROWNING
2045 ROYAL ROAD
06570 ST PAUL
FRANCE

MS JOYCE BROWNING
2045 ROYAL ROAD
LONDON WIP 6HQ
ENGLAND

RUFUS LANGDON
LAW DEPARTMENT
US POSTAL SERVICE
475 L'ENFANT PLZ SW RM 6627
WASHINGTON DC 202360-1120
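
To get a feel for how few of these lines a street-style pattern covers, here is a rough check in Python. The pattern `NAIVE_STREET` is a deliberately naive, made-up example (house number, optional one-letter direction, name words, street type at the end); the sample lines are taken from the addresses above.

```python
import re

# Deliberately naive: number, optional N/S/E/W, name words, type at the very end.
NAIVE_STREET = re.compile(
    r"^\d+\s+(?:[NSEW]\s+)?(?:\w+\s+)+(?:STREET|ST|ROAD|RD|AVE)$",
    re.IGNORECASE,
)

SAMPLE_LINES = [
    "P.O. Box 12345",
    "RR 5 Box 909",
    "2978 W MAIN ST # 12",
    "12 31st Street",
    "Goethestraße 25",
    "UNIT 28130",
    "1010 CLEAR STREET",
    "475 L'ENFANT PLZ SW RM 6627",
]


def matching_lines(lines):
    """Return only the lines that the naive street pattern accepts."""
    return [line for line in lines if NAIVE_STREET.match(line)]


for line in matching_lines(SAMPLE_LINES):
    print(line)
```

Only "12 31st Street" and "1010 CLEAR STREET" get through. The unit designator after "ST", the house number after the German street name, the apostrophe in "L'ENFANT" and the box and military lines all fall outside the pattern, and each would need its own rule.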

I have found a few references for you. However, again, this is a huge task.
There is commercial software out there that you can buy to do this sort of
parsing. Just Google for it. Here are some links to references:

http://www.columbia.edu/kermit/postal.html
http://pe.usps.com/text/pub28/welcome.htm
http://www.grcdi.nl/whitepapers.htm
http://aurora.regenstrief.org/v3dt/PAS.html
http://www.cicc.or.jp/english/hyoujy...tabook/219.htm

Good luck!

--
HTH,

Kevin Spencer

Jun 15 '06 #5
