How to web-crawl disconnected components

Hi~!

I am web crawling from a few URL links (given a closed set of HTML pages), but I can't manage to reach the disconnected components. Could anyone give some advice on it? Thanks for any suggestions!
Sep 9 '08 #1
9 Replies
KevinADC
Expert 2GB
No idea what a disconnected component is.
Sep 9 '08 #2
Most webpages have links to other webpages and are themselves linked from other pages. But there exist some webpages that have no links at all, and no links point to them either. So it's hard to find them within a given closed set of HTML pages.
Sep 10 '08 #3
KevinADC
Expert 2GB
Can you give an example?
Sep 10 '08 #4
For instance: http://xxxx//00.html through http://xxxx//99.html, 100 HTML pages in total, with links connecting most of them. But nothing links to http://xxxx//45.html or http://xxxx//78.html, and neither of those contains any outgoing links. So we can crawl from http://xxxx//00.html, following link after link, but we can never arrive at those two disconnected pages.
Sep 10 '08 #5
numberwhun
Expert Mod 2GB
If I follow what you are saying correctly, the only way to get to these pages is to type in their addresses directly, because there are no links to them.

If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
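To make that concrete, here is a minimal sketch of a link-following crawler in Perl using LWP::UserAgent and HTML::LinkExtor (the seed URL is just the placeholder from your example). It can only ever queue URLs it has actually seen in an href, which is exactly why pages nothing links to are never visited:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;

my $seed = 'http://xxxx//00.html';   # placeholder seed from the example above
my $ua   = LWP::UserAgent->new( timeout => 10 );

my %seen  = ( $seed => 1 );
my @queue = ($seed);

while ( my $url = shift @queue ) {
    my $res = $ua->get($url);
    next unless $res->is_success;

    # Collect every <a href> on the page; passing a base URL makes
    # HTML::LinkExtor return absolute URLs.
    my $extor = HTML::LinkExtor->new( undef, $res->base );
    $extor->parse( $res->decoded_content );

    for my $link ( $extor->links ) {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        my $abs = "$attr{href}";   # stringify the URI object
        next if $seen{$abs}++;     # skip anything already visited or queued
        push @queue, $abs;
    }
}

# %seen now holds every page reachable from the seed; 45.html and 78.html
# from your example can never end up in it.
print "$_\n" for sort keys %seen;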

Regards,

Jeff
Sep 10 '08 #6
Yeah, it puzzles me. If my program is going to request an address directly, it needs some hint before doing so. I mean, there are millions of HTML pages (000000000.html through 999999999.html, though not all of them exist) in my given closed set, so it's far too time-consuming to test whether every possible address exists. If you depict the web as a bow-tie structure, what I have to find are those tendrils.
Sep 10 '08 #7
KevinADC
Expert 2GB
Personally, I don't see how your problem is even related to Perl. Perl certainly can't follow links that don't exist. Your problem is beyond the scope of Perl.
Sep 10 '08 #8

Umm... since I'm new to using Perl for this problem, I'm unsure what the scope of Perl is.
Sep 11 '08 #9
Icecrack
Expert 100+
Your best bet is to generate candidate names for the HTML files and check each one for a 404 (or any other error); if no error comes back, save that URL for review. Note: this method will take a lot of processing power, but it will solve your problem.
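Something like this rough sketch, for instance (the host and the two-digit name pattern are just the placeholders from the earlier example; for millions of candidate names you would want to parallelize the requests rather than probe them one at a time):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 5 );

# Probe every candidate name in the known pattern with a cheap HEAD
# request; a success status means the page exists even if nothing
# links to it.
for my $n ( 0 .. 99 ) {   # widen this range to cover your real set
    my $url = sprintf 'http://xxxx//%02d.html', $n;
    my $res = $ua->head($url);
    print "$url\n" if $res->is_success;   # found one; save it for review
}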
Sep 15 '08 #10
