Regular expression

Newbie
 
Join Date: Aug 2005
Posts: 2
#1: Aug 11 '05
Hi,

I'm trying to write a regular expression to find addresses or locations (such as cities, states, and countries) in an HTML file.
Does anyone have an idea of the general format of an address?

Thanks

Amit :)

Administrator
 
Join Date: Jul 2005
Location: Portland, OR
Posts: 970
#2: Aug 15 '05

re: Regular expression


Addresses come in so many different formats that it would be hard to write a single ereg pattern that captures them all.

Will the data you are gathering follow one standard format, for example addresses from a single region such as the UK or the US? And are all the pages similar?

Most addresses follow this general format:

Name of Person/Company
Number StreetName
City State Zipcode
Country

Keep in mind that the format above is for the US; each country can have its own format. Maybe you can create a database of country formats with their equivalent ereg patterns and then parse the HTML files.
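
To make the "database of country formats" idea concrete, here is a rough sketch in Python rather than ereg. The table name `ADDRESS_PATTERNS`, the function `find_addresses`, and both patterns are my own assumptions for illustration; they only cover the simplest "Number StreetName / City State Zipcode" layout and will miss many real addresses.

```python
import re

# Hypothetical lookup table: one deliberately simplified pattern per country.
ADDRESS_PATTERNS = {
    "US": re.compile(
        r"(?P<number>\d+)\s+(?P<street>[A-Za-z0-9 .]+?)\s*\n"   # Number StreetName
        r"(?P<city>[A-Za-z .]+?),?\s+(?P<state>[A-Z]{2})\s+"    # City State
        r"(?P<zip>\d{5}(?:-\d{4})?)"                            # Zipcode (ZIP or ZIP+4)
    ),
    "UK": re.compile(
        r"(?P<number>\d+)\s+(?P<street>[A-Za-z0-9 .]+?)\s*\n"   # Number StreetName
        r"(?P<city>[A-Za-z .]+?)\s*\n"                          # Town on its own line
        r"(?P<postcode>[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})"      # Rough postcode shape
    ),
}

def find_addresses(text):
    """Return (country, fields) pairs for every pattern that matches the text."""
    results = []
    for country, pattern in ADDRESS_PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((country, match.groupdict()))
    return results

sample = "Contact us:\n123 Main Street\nPortland, OR 97201\nUSA\n"
for country, fields in find_addresses(sample):
    print(country, fields)
```

Adding a country then means adding one entry to the table, and the parsing loop stays the same.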

I hope this is not for spidering and collecting junk data (bad, bad, bad).

Enjoy,
KUBSTER
Newbie
 
Join Date: Aug 2005
Posts: 2
#3: Aug 21 '05

re: Regular expression


Don't worry, it's not for spidering or collecting junk data (it's only an exercise for a data mining course).

But I have to work with software/hardware sites (in English only), and I'm not restricted to particular countries.

I managed, more or less, to strip all the tags from the HTML and extract addresses in the format you described:

Number StreetName
City State Zipcode
Country
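
For anyone trying the same thing, this is roughly what the tag-cleaning step can look like. It is only a sketch using Python's standard `html.parser` module, not the exact code from my exercise; the class name `TextExtractor`, the function `html_to_text`, and the choice of which tags count as line breaks are assumptions.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    # Assumed set of tags that should start a new line in the output.
    BLOCK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3", "address"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)

page = "<body><p>123 Main Street<br>Portland, OR 97201</p><script>var x = 1;</script></body>"
print(html_to_text(page))
```

Keeping the line breaks matters here, because the address format above depends on which fields share a line.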

I used some dictionaries I created for filtering garbage.
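
The dictionary filtering works along these lines. This is a simplified sketch: the tiny `KNOWN_STATES` and `KNOWN_COUNTRIES` sets and the function names are made up for illustration, and it assumes each candidate address is a dict with `state` and optional `country` fields.

```python
# Hypothetical, deliberately tiny dictionaries; real ones would be far larger.
KNOWN_STATES = {"OR", "CA", "NY", "WA"}
KNOWN_COUNTRIES = {"usa", "united states", "canada", "israel"}

def is_plausible(candidate):
    """Keep a candidate only if its state is known and its country, if any, is known."""
    if candidate.get("state") not in KNOWN_STATES:
        return False
    country = candidate.get("country")
    return country is None or country.strip().lower() in KNOWN_COUNTRIES

def filter_candidates(candidates):
    """Drop candidates that the dictionaries cannot confirm."""
    return [c for c in candidates if is_plausible(c)]

candidates = [
    {"city": "Portland", "state": "OR", "country": "USA"},
    {"city": "Version", "state": "XP", "country": None},   # garbage: "XP" is not a state
]
print(filter_candidates(candidates))
```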

But I have a harder task: extracting each company's products. I couldn't find any rules or dictionaries that would help with that.

I'd appreciate any help.

Thanks
Amit :)