Bytes IT Community

reading the web with a winform: urls

I'm trying to write a WinForms application that can spider a web site.

In essence, I'm doing it by calling:
string htmlData = enc.GetString(browser.DownloadData(url));

Then I use a regex to parse the links out of the HTML data in that string.

This works fine so far. The links are added to an ArrayList to be used in my application.
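For reference, here is a minimal sketch of the extraction step described above. The class name and regex are my own (hypothetical) choices, the pattern deliberately naive: it assumes quoted attribute values and will miss unquoted hrefs, `javascript:` links, and the like.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Matches href="..." or href='...' and captures the value.
    static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Pull every href value out of a chunk of HTML.
    public static List<string> ExtractLinks(string htmlData)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(htmlData))
            links.Add(m.Groups[1].Value);
        return links;
    }
}
```

This slots in right after the `DownloadData` call: `var links = LinkExtractor.ExtractLinks(htmlData);`. I've used a `List<string>` rather than an `ArrayList`; either works, but the generic list avoids casting.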

Now that I have the links, I've got a problem: not all links are created equal.

There are links within the site, links to external sites,
fully qualified and non-fully qualified, absolute, relative and application relative links, querystrings, etc.

My particular issue is: how do I properly build a fully qualified link (perhaps with querystring) from any type of non-fully qualified link?

A few example href values will probably make the complexities clearer:

/file.html
or
file.html
or
../../file.html
or
~/directory/page.aspx
or
~/directory/page.aspx?returnurl=http://servertoredirect/resource.aspx

While a root-relative link (/file.html) is pretty straightforward -- just prepend the scheme and host -- the rest give me issues.

I'm trying to return the links as fully qualified links.
e.g.,
http://www.somedomain.com/applicatio...olve/page.aspx

There are tools for this in an ASP.NET application (e.g., HttpContext.Current.Request.Url or VirtualPathUtility), but I don't know how to make them work in a WinForms app.
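One approach that works outside ASP.NET is the `System.Uri` class, whose `Uri(Uri, string)` constructor resolves relative references against a base URI. A sketch, assuming hypothetical names: `baseUri` is the page the link was found on, and `appRoot` is the site's application root, which only matters for `~/` links (a crawler that doesn't know the app root can fall back to the domain root, but that is a guess on its part):

```csharp
using System;

static class UrlResolver
{
    // Build a fully qualified Uri from any href value found on a page.
    public static Uri ResolveLink(Uri baseUri, Uri appRoot, string href)
    {
        if (href.StartsWith("~/"))
        {
            // ASP.NET application-relative path: System.Uri does not
            // understand '~', so expand it against the app root by hand.
            return new Uri(appRoot, href.Substring(2));
        }

        // The Uri(Uri, string) constructor resolves absolute links,
        // root-relative (/file.html), document-relative (file.html,
        // ../../file.html) and querystrings per the usual URI rules.
        return new Uri(baseUri, href);
    }
}
```

So with a page of `http://www.somedomain.com/app/dir/page.aspx`, an href of `../../file.html` resolves to `http://www.somedomain.com/file.html`, and the resulting `Uri.AbsoluteUri` gives the fully qualified string to store.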

Any help/suggestions would be greatly appreciated.
Feb 27 '09 #1