
crawling the net...

ask josephsen
Guest
 
Posts: n/a
#1: Jul 22 '05
Hi NG

I'm making a program to crawl the internet. It works by retrieving all links
in a page, downloading the page of each link and again retrieving all the
links. (If there are better ways, I'd like to hear them.)
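
A minimal sketch of the loop described above, assuming a breadth-first queue plus a "visited" set so the same page is never fetched twice. The names CrawlSketch, Crawl, getLinks and maxPages are hypothetical; getLinks stands in for the download-and-parse step, which is not shown.

```csharp
using System;
using System.Collections.Generic;

static class CrawlSketch
{
    // getLinks is a hypothetical callback: given a page URL, it returns the
    // absolute URLs linked from that page (download and parsing omitted).
    public static List<Uri> Crawl(Uri start,
                                  Func<Uri, IEnumerable<Uri>> getLinks,
                                  int maxPages)
    {
        var visited = new HashSet<Uri>();   // URLs already queued, to avoid loops
        var frontier = new Queue<Uri>();    // URLs waiting to be fetched
        var order = new List<Uri>();        // pages in the order they were fetched

        visited.Add(start);
        frontier.Enqueue(start);

        while (frontier.Count > 0 && order.Count < maxPages)
        {
            Uri page = frontier.Dequeue();
            order.Add(page);

            foreach (Uri link in getLinks(page))
            {
                // HashSet.Add returns false when the URL was seen before.
                if (visited.Add(link))
                    frontier.Enqueue(link);
            }
        }
        return order;
    }
}
```

The maxPages cap is only there to keep the sketch from running forever on a real site; a real crawler would also want politeness delays and robots.txt handling.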

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full url (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the url where the relative link was found
and then concatenate it? Does anyone know how other search engines/crawlers
walk the net?


Thanks :)

../ask



JKop
Guest
 
Posts: n/a
#2: Jul 22 '05

re: crawling the net...


ask josephsen posted:
> Hi NG
>
> I'm making a program to crawl the internet. It works by retrieving all
> links in a page, downloading the page of each link and again retrieving
> all the links. (If there are better ways, I'd like to hear them.)
>
> My problem is relative links (like "../../wohoo.asp"). What is the
> smartest way to get the full url (http://www.xyz.com/wohoo.asp)? Do I
> have to parse the relative link in relation to the url where the
> relative link was found and then concatenate it? Does anyone know how
> other search engines/crawlers walk the net?
>
>
> Thanks :)
>
> ./ask

You should have posted this on:

alt.sports.gymnastics


It would've been more on-topic _there_.

-JKop
Morten Wennevik
Guest
 
Posts: n/a
#3: Jul 22 '05

re: crawling the net...


Hi Ask,

You could try using Path.GetFullPath, which collapses /../ and /./ and
returns the proper path. However, it insists on prepending the application
path, so you will need to do something like

string newUrl =
    Path.GetFullPath(url).Substring(Application.StartupPath.Length + 1);

It will switch the / to \ though. Oh, and remove the http:// from the url
first.
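
For comparison, here is a minimal sketch of a different approach from the one above: letting System.Uri combine the page URL with the relative link, which keeps the http:// prefix and the forward slashes. The class name LinkResolver and the page URL in the usage line are hypothetical; only the Uri(Uri, string) constructor and the AbsoluteUri property come from the framework.

```csharp
using System;

static class LinkResolver
{
    // Resolves a link found on pageUrl into an absolute URL.
    // pageUrl must itself be absolute, otherwise new Uri(pageUrl) throws.
    public static string Resolve(string pageUrl, string link)
    {
        var baseUri = new Uri(pageUrl);
        // Combines base and relative parts and collapses ../ and ./ segments.
        var absolute = new Uri(baseUri, link);
        return absolute.AbsoluteUri;
    }
}
```

Assuming the link was found on a page such as http://www.xyz.com/a/b/page.asp, LinkResolver.Resolve("http://www.xyz.com/a/b/page.asp", "../../wohoo.asp") returns "http://www.xyz.com/wohoo.asp".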

There are plenty of web crawlers; just do a web search on "web crawler" and
"web bot".


Happy coding!
Morten Wennevik [C# MVP]
mortb
Guest
 
Posts: n/a
#4: Jul 22 '05

re: crawling the net...


I'm not developing webcrawlers, but a quick thought of mine is

string link = "../../wohoo.asp";
string thisPageURL = "http://www.xyz.com/wohoo.asp";
string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
    @"\.\./"); // split on ../
string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
    "/");

linkParts.Length - 1 is now the number of "../" steps (directory levels to
go up), and the last element of linkParts is the wanted page. The URL of
the new page is then concatenated from the URLParts array, excluding that
many elements from the end (after also dropping the current page's own file
name), followed by the last element of linkParts.

Just a quick shot at a solution...
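
A minimal sketch of the split-and-drop idea above, using string.Split rather than Regex.Split. The names ManualResolver and Resolve are hypothetical, and the sketch assumes the page URL is absolute and the link is a plain relative path: it does not handle query strings, absolute links, or links ending in "/".

```csharp
using System;
using System.Collections.Generic;

static class ManualResolver
{
    // pageUrl: absolute URL of the page where the link was found.
    // link: a relative path such as "../../wohoo.asp".
    public static string Resolve(string pageUrl, string link)
    {
        // Separate "http://host" from the path part of the page URL.
        int schemeEnd = pageUrl.IndexOf("://", StringComparison.Ordinal) + 3;
        int pathStart = pageUrl.IndexOf('/', schemeEnd);
        string root = pathStart < 0 ? pageUrl : pageUrl.Substring(0, pathStart);
        string path = pathStart < 0 ? "/" : pageUrl.Substring(pathStart);

        // "/a/b/page.asp" splits into "", "a", "b", "page.asp".
        var segments = new List<string>(path.Split('/'));
        segments.RemoveAt(segments.Count - 1);   // drop the page's own file name

        foreach (string part in link.Split('/'))
        {
            if (part == "..")
            {
                // Go up one directory, but never above the site root.
                if (segments.Count > 1)
                    segments.RemoveAt(segments.Count - 1);
            }
            else if (part != "." && part != "")
            {
                segments.Add(part);
            }
        }
        return root + string.Join("/", segments);
    }
}
```

The leading empty segment is kept on purpose so that the joined path starts with "/".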

/mortb


"ask josephsen" <jaj(((a)))oticon.dk> wrote in message
news:4090c8a4$0$1118$4d4eb98e@news.dk.uu.net...
> Hi NG
>
> I'm making a program to crawl the internet. It works by retrieving all links
> in a page, downloading the page of each link and again retrieving all the
> links. (If there are better ways, I'd like to hear them.)
>
> My problem is relative links (like "../../wohoo.asp"). What is the smartest
> way to get the full url (http://www.xyz.com/wohoo.asp)? Do I have to parse
> the relative link in relation to the url where the relative link was found
> and then concatenate it? Does anyone know how other search engines/crawlers
> walk the net?
>
>
> Thanks :)
>
> ./ask


Christopher Benson-Manica
Guest
 
Posts: n/a
#5: Jul 22 '05

re: crawling the net...


ask josephsen <jaj(((a)))oticon.dk> spoke thus:
> I'm making a program to crawl the internet. It works by retrieving all links
> in a page, downloading the page of each link and again retrieving all the
> links. (If there are better ways, I'd like to hear them.)

(You could look at how wget is implemented. Or, better, just USE wget.)

Your post is off-topic for comp.lang.c++. Please visit

http://www.slack.net/~shiva/welcome.txt
http://www.parashift.com/c++-faq-lite/

for posting guidelines and frequently asked questions. Thank you.

--
Christopher Benson-Manica | I *should* know what I'm talking about - if I
ataru(at)cyberspace.org | don't, I need to know. Flames welcome.