Rachel,
If your extraction is a one-time effort, designed to gather the basic
content for your new version of the website, it's easiest to use a tool
like the one Juan recommended, or even just extract the details by hand.
Real-estate listings can be fairly complex, with a couple of hundred
fields per property, so you might consider whipping up some small tools
of your own to pull the data out of the page. Regular expressions are
very useful for this purpose.
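For example, here's a rough C# sketch of that approach. The URL and the
"Price:" pattern are placeholders; adjust them to whatever the agent's
page actually contains:

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class ListingScraper
{
    static void Main()
    {
        // Placeholder URL; point this at the real listing page.
        string url = "http://example.com/listings.html";

        WebRequest request = WebRequest.Create(url);
        string html;
        using (StreamReader reader =
               new StreamReader(request.GetResponse().GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // Grab the dollar figure that follows a "Price:" label.
        // The label and pattern are guesses; match them to the actual markup.
        Regex priceRegex = new Regex(@"Price:\s*\$?([\d,]+)",
                                     RegexOptions.IgnoreCase);
        foreach (Match m in priceRegex.Matches(html))
        {
            Console.WriteLine("Found price: " + m.Groups[1].Value);
        }
    }
}
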
If your content-extraction need is recurring, I would avoid screen
scraping at all costs. That's akin to using their existing website as a
database for your new site. Among other things, it means they have to
keep their old site running somewhere and in good working order.
Instead, do some digging to find out where the content originates.
If they're taking the photographs and entering the content directly into
their website themselves, you'll probably have to mimic that functionality
with a set of web-based administrative tools. In that case you may be
able to skip the listing-content extraction entirely, build the tools, and
have your client re-enter all of the listings. Sell the idea as
"training"... =)
There's a good chance that they're using a third-party provider to supply
the listings, or feeding the data in directly from their local MLS. In
the US, most multiple listing services (MLSs) now comply with the national
IDX and VOW standards for publishing listings. Assuming your client's MLS
does, you can get a developer license, pull the content yourself from the
MLS, store it in a database, and then embed the data in the website as
desired.
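That last step is ordinary ASP.NET data binding. As a minimal sketch (the
connection string, table, and column names below are made up), the
code-behind for a page holding a DataGrid might look something like this:

using System;
using System.Data;
using System.Data.SqlClient;

public class ListingsPage : System.Web.UI.Page
{
    // Declared in the .aspx markup as
    // <asp:DataGrid id="ListingsGrid" runat="server" />
    protected System.Web.UI.WebControls.DataGrid ListingsGrid;

    // Assumes AutoEventWireup="true" in the page directive.
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!IsPostBack)
        {
            // Placeholder connection string and schema.
            using (SqlConnection conn = new SqlConnection(
                "server=(local);database=Realty;Integrated Security=SSPI"))
            {
                SqlDataAdapter adapter = new SqlDataAdapter(
                    "SELECT MlsNumber, Address, ListPrice " +
                    "FROM Listings WHERE Status = 'Active'", conn);
                DataTable listings = new DataTable();
                adapter.Fill(listings);

                ListingsGrid.DataSource = listings;
                ListingsGrid.DataBind();
            }
        }
    }
}
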
We do this for the Chicago region, and I should note that the effort is
fairly significant. The raw data is often published daily in large CSV
files (100 MB+ in size) retrieved from an FTP server. It's fully
denormalized, so you'll probably want to do a ton of scrubbing and
normalization to make it useful. You'll also need to decode all of the
fields into plain English text so that the general public can make sense
of the listing content. Images are also often FTP'd, although some MLSs
offer URL access to the photos for active listings (meaning you'd have to
cache some yourself if you want to display sold listings for your client).
Under the VOW ("Virtual Office Website") program, the regulations also
require an enrollment process before visitors are permitted to see the
listings, email-address verification via an account activation message,
and so on.
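To give you a feel for the scrubbing/decoding step mentioned above, here's
a bare-bones sketch. The file name, field positions, and status codes are
all hypothetical; the real layout comes from your MLS's data dictionary:

using System;
using System.Collections;
using System.IO;

class MlsImport
{
    static void Main()
    {
        // Hypothetical lookup: raw MLS status codes to plain-English text.
        Hashtable statusText = new Hashtable();
        statusText["ACT"] = "Active";
        statusText["CTG"] = "Contingent";
        statusText["SLD"] = "Sold";

        // Assume the nightly CSV has already been pulled down from FTP.
        using (StreamReader reader = new StreamReader("listings.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive split; a real feed needs a CSV parser that
                // handles quoted commas.
                string[] fields = line.Split(',');
                string mlsNumber = fields[0];   // made-up column positions
                string statusCode = fields[1];

                string status = statusText.Contains(statusCode)
                    ? (string)statusText[statusCode]
                    : statusCode;

                // In practice you'd scrub/normalize here and insert into
                // your database rather than writing to the console.
                Console.WriteLine(mlsNumber + ": " + status);
            }
        }
    }
}
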
Nothing insurmountable, but expect to grind out some code if you go this
route.
Alternatively, you may be able to find a third-party service that handles
the listing display entirely; if your client likes the appearance (you
rarely get much choice...), you can just focus on the rest of the website.
/// M
"rachel" <ra****@hotmail.com> wrote in message
news:05****************************@phx.gbl...
Hello,
I am currently contracted out by a real estate agent. He
has a page that he has created himself that has a list of
homes.. their images and data in html format.
He wants me to take this page and reformat it so that it
looks different.
Do I use screen scraping to do this?
Could someone please point me to a good screen scraping
article... I am using ASP.NET and C#
Thanks,
Rachel