How to parse address string using any language

For example, passing "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" to the parseAddress function returns:

2299 Lewes-Georgetown Hwy
A. P. Croll & Son
Georgetown
DE
19947

I have thought of the following algorithm, but I am having trouble implementing it. Can anyone help me write the code? I know C and a bit of C++, so I would prefer the code to be in one of these languages.

My algorithm:

1) Work backward. Start from the zip code, which will be near the end, and in one of two known formats: XXXXX or XXXXX-XXXX. If this doesn't appear, you can assume you're in the city, state portion, below.

The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format, or as words. You know what these will be, too, since there are only 50 of them. Also, you could soundex the words to help compensate for spelling errors.
Before that is the city, and it's probably on the same line as the state.
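
As a rough sketch of the state step, a lookup table can map both the two-letter code and the spelled-out name to one canonical code. The function name normalizeState and the helper upper are made up for illustration, and the table is deliberately truncated to a few entries; soundex matching for misspellings is not shown.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Upper-case a copy of the string (ASCII only).
static std::string upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Returns the two-letter code for a state token, or "" if unrecognized.
// Hypothetical, truncated table: a real one would list every state.
std::string normalizeState(const std::string& token) {
    static const std::map<std::string, std::string> states = {
        {"DE", "DE"}, {"DELAWARE", "DE"},
        {"MD", "MD"}, {"MARYLAND", "MD"},
        {"NY", "NY"}, {"NEW YORK", "NY"},
    };
    const auto it = states.find(upper(token));
    return it == states.end() ? std::string() : it->second;
}
```

An empty return value is the signal that the token before the zip is probably not a state, so the parser can fall back or reject the input.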

You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
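
A small sketch of that sanity check, assuming the zip-code database has been loaded into an in-memory map. The table kZipTable, the ZipCheck enum, and checkZip are hypothetical names; the only entry is the example from this thread, and the comparison is case-sensitive.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a zip-code database: zip -> (city, state).
// Only the thread's example is included; real data would be loaded from elsewhere.
static const std::map<std::string, std::pair<std::string, std::string>> kZipTable = {
    {"19947", {"Georgetown", "DE"}},
};

enum class ZipCheck { Unknown, Match, Mismatch };

// Compare a parsed city/state against the table entry for the zip.
// XXXXX-XXXX inputs are reduced to their first five digits.
ZipCheck checkZip(const std::string& zip, const std::string& city,
                  const std::string& state) {
    const auto it = kZipTable.find(zip.substr(0, 5));
    if (it == kZipTable.end()) return ZipCheck::Unknown;
    const bool same = it->second.first == city && it->second.second == state;
    return same ? ZipCheck::Match : ZipCheck::Mismatch;
}
```

Treating Unknown separately from Mismatch keeps a gap in the table from being reported as a bad address.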
The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.

It's going to be near-impossible to detect a name on the first or second line, though if it's not prefixed with a number (or if it's prefixed with "attn:" or "attention to:"), that could give you a hint as to whether it's a name or an address line.
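
The steps above could be sketched in C++ roughly as follows. This assumes a single-line input shaped like "name, house number and street, comma, city, comma, two-letter state, zip", as in the example. The ParsedAddress struct, the trim helper, and the bool-plus-out-parameter signature are assumptions for illustration; multi-line input, PO boxes, suites, and spelled-out states are not handled.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Hypothetical result type; field names are illustrative.
struct ParsedAddress {
    std::string name, street, city, state, zip;
};

// Trim leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Returns false when the input does not fit the assumed shape.
bool parseAddress(const std::string& input, ParsedAddress& out) {
    // Work backward: anchor on ", ST XXXXX" or ", ST XXXXX-XXXX" at the end.
    static const std::regex tail(
        R"(^(.*),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$)");
    std::smatch m;
    if (!std::regex_match(input, m, tail)) return false;

    // Before the state comes the city: the text after the last remaining comma.
    const std::string head = m[1].str();
    const auto comma = head.rfind(',');
    if (comma == std::string::npos) return false;
    const std::string nameAndStreet = trim(head.substr(0, comma));

    // The street is taken to start at the first word that begins with a digit;
    // everything before it is treated as the name.
    std::size_t pos = std::string::npos;
    for (std::size_t i = 0; i < nameAndStreet.size(); ++i) {
        const bool wordStart = (i == 0) || nameAndStreet[i - 1] == ' ';
        if (wordStart && std::isdigit(static_cast<unsigned char>(nameAndStreet[i]))) {
            pos = i;
            break;
        }
    }
    if (pos == std::string::npos) return false;

    out.name = trim(nameAndStreet.substr(0, pos));
    out.street = nameAndStreet.substr(pos);
    out.city = trim(head.substr(comma + 1));
    out.state = m[2].str();
    out.zip = m[3].str();
    return true;
}
```

Calling parseAddress on the example string should fill the five fields listed at the top of the post. A name that itself contains a word starting with a digit would be split in the wrong place, which is the kind of case the "attn:" hint is meant to help with.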

any help would be appreciated
Jul 7 '10 #1
weaknessforcats
Unless you know the format of the input, it will be difficult to parse it.

Can you insist on a) specific field widths, b) CSV format, or c) token identifiers?

Token identifiers are things like NAME=, ADDRESS=, etc.
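
For illustration, here is a minimal sketch of reading token-identified input in C++. The ';' separator between fields and any key other than NAME and ADDRESS are assumptions, and parseTagged is a made-up name.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse "KEY=value;KEY=value" into a map.
// Assumption: fields are separated by ';' and values contain no ';'.
std::map<std::string, std::string> parseTagged(const std::string& input) {
    std::map<std::string, std::string> fields;
    std::istringstream in(input);
    std::string item;
    while (std::getline(in, item, ';')) {
        const auto eq = item.find('=');
        if (eq == std::string::npos) continue;  // skip items without '='
        fields[item.substr(0, eq)] = item.substr(eq + 1);
    }
    return fields;
}
```

With explicit identifiers, field order no longer matters and a missing field is easy to detect, which is why this format is so much simpler than free text.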
Jul 7 '10 #2
donbock
Your example has comma separators between street address and city; and between city and state.
  • Can you count on all these commas being present?
  • Are you sure there isn't a separator between name and street address?
  • Is the comma between city and state optional?
  • What other variant formats do you have to support (post office box, rural route, suite number, c/o, department, military APO/FPO, etc)? [and those are just some US variants]
You might find Frank's Compulsive Guide to Postal Addresses interesting.
Jul 7 '10 #3
Well, yes, I understand it's almost impossible to write code that would be 100% accurate. Luckily, though, we've been told that we can make any assumptions we like, so maybe the code could be restricted to some standard format and written just for that. How would one write the code then?
Jul 8 '10 #4
public class Address
{
    public string Street { get; set; }      // Lunkad Tower, 6th floor
    public string Locality { get; set; }    // Viman Nagar
    public string City { get; set; }        // Pune
    public string State { get; set; }       // MH, Maharashtra
    public string PostalCode { get; set; }  // 60611
    public string Country { get; set; }     // e.g. India, IN
}

Can anyone help me write the code?
Jul 8 '10 #5
Oralloy
pranyht,

There are a huge number of ways an address can be expressed. Have you got any limits on what you are going to have to process? U.S.A.-only addresses? German addresses? International mail to Italy?

Constrain the problem, so you can start to generate a viable solution.

Once you have the problem space worked out, then you can start analysis and formulate a solution.

That said, I would recommend that you write a form of pattern match engine to scan the addresses. Then, take the first or the best hit, depending on how you implement it. After you get your hits, you can re-check the processed address to make sure what you found was valid.
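
One possible shape for such an engine, sketched in C++ with std::regex. The two layouts, their labels, and the names AddressPattern, Hit, and matchAll are all illustrative assumptions; each pattern's groups are street, city, state, and zip, in that order.

```cpp
#include <regex>
#include <string>
#include <vector>

// One candidate layout: a label plus a regex with four capture groups.
struct AddressPattern {
    const char* label;
    std::regex re;
};

struct Hit {
    std::string label, street, city, state, zip;
};

// Try every pattern and collect all that match; the caller can take the
// first hit or score them and keep the best one.
std::vector<Hit> matchAll(const std::string& input) {
    static const std::vector<AddressPattern> patterns = {
        {"comma-separated",
         std::regex(R"(^(.+),\s*([^,]+),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$)")},
        {"no comma before state",
         std::regex(R"(^(.+),\s*([^,]+?)\s+([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$)")},
    };
    std::vector<Hit> hits;
    std::smatch m;
    for (const auto& p : patterns) {
        if (std::regex_match(input, m, p.re)) {
            hits.push_back({p.label, m[1].str(), m[2].str(), m[3].str(), m[4].str()});
        }
    }
    return hits;
}
```

Each hit would still need the re-check described above, for example confirming that the state code is a real one, before it is accepted.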

By way of comparison, I spent several days of my life writing a general date/time/timestamp parser for a commercial web-site. There were about 15 general patterns that I expected (e.g. "MMDDYY", "YYYY-MM-DD", RFC, etc.). I checked all of them using pattern matching, validated them, and when there were multiple valid hits, I selected the "correct" one (or I compared and verified that the result was the same). Invalid inputs were kicked out with exceptions.
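
To make the match, validate, compare flow concrete, here is a cut-down C++ sketch with just two of the patterns named above. It is not the parser described in this post; the Date struct, the helper names, and the choice to read two-digit years as 20YY are assumptions.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

struct Date { int year, month, day; };

static bool sameDate(const Date& a, const Date& b) {
    return a.year == b.year && a.month == b.month && a.day == b.day;
}

// Range check, including leap years.
static bool isValid(const Date& d) {
    if (d.month < 1 || d.month > 12 || d.day < 1) return false;
    static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    const bool leap = (d.year % 4 == 0 && d.year % 100 != 0) || d.year % 400 == 0;
    return d.day <= days[d.month - 1] + ((d.month == 2 && leap) ? 1 : 0);
}

// Match every pattern, validate each hit, and require the valid hits to agree.
Date parseDate(const std::string& s) {
    static const std::regex iso(R"(^(\d{4})-(\d{2})-(\d{2})$)");
    static const std::regex mmddyy(R"(^(\d{2})(\d{2})(\d{2})$)");
    std::vector<Date> hits;
    std::smatch m;
    if (std::regex_match(s, m, iso)) {
        hits.push_back({std::stoi(m[1].str()), std::stoi(m[2].str()),
                        std::stoi(m[3].str())});
    }
    if (std::regex_match(s, m, mmddyy)) {
        // Assumption: two-digit years mean 20YY.
        hits.push_back({2000 + std::stoi(m[3].str()), std::stoi(m[1].str()),
                        std::stoi(m[2].str())});
    }
    std::vector<Date> valid;
    for (const Date& d : hits) {
        if (isValid(d)) valid.push_back(d);
    }
    if (valid.empty()) throw std::invalid_argument("unrecognized or invalid date: " + s);
    for (const Date& d : valid) {
        if (!sameDate(d, valid.front())) throw std::invalid_argument("ambiguous date: " + s);
    }
    return valid.front();
}
```

These two patterns can never match the same string, so the agreement check only starts to matter once overlapping patterns are added. The same structure carries over to address patterns.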

Days' worth of analysis and false starts. The final code only took a few hours, once I'd worked out how I was going to solve the problem.

That said, be sure to code self-defensively. Any algorithm you may come up with will surely fail at some point, especially when dealing with input from real people. Expect Failures!
Jul 8 '10 #6
