Bytes IT Community

how to webcrawl disconnected components

P: 30
Hi~!

I am crawling the web starting from a few URLs (taken from a closed set of HTML pages), but I can't manage to reach the disconnected components. Could anyone give me some advice? Thanks for any suggestions!
Sep 9 '08 #1
9 Replies


KevinADC
Expert 2.5K+
P: 4,059
No idea what a disconnected component is.
Sep 9 '08 #2

P: 30
Most web pages link to other pages and are in turn linked to by other pages. But some pages contain no links at all and have no links pointing to them either, so they are hard to find within a given closed set of HTML pages.
Sep 10 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
Can you give an example?
Sep 10 '08 #4

P: 30
For instance, take http://xxxx//00.html through http://xxxx//99.html, 100 HTML pages in total. Links connect most of them, but no page links to http://xxxx//45.html or http://xxxx//78.html, and neither of those two contains any outgoing links. So we can start crawling from http://xxxx//00.html and keep crawling, but we can never reach those two disconnected pages.
Sep 10 '08 #5

numberwhun
Expert Mod 2.5K+
P: 3,503
If I follow what you are saying correctly, the only way to get to these pages is to type in their addresses directly, because there are no links to them.

If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there unless another page links to them. That's just the way it is. Unless you specifically tell your script to visit those pages in addition to the links it is following, how will it ever know they are there?

Regards,

Jeff
Sep 10 '08 #6
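
The point above can be illustrated with a minimal sketch. It is written in Python rather than Perl, and it crawls an in-memory link table instead of fetching pages, so the `links` dictionary and the chain-shaped site are assumptions made up for illustration, loosely following the 00.html to 99.html example earlier in the thread. A breadth-first crawl only ever visits what is reachable from its seeds:

```python
from collections import deque


def crawl(seeds, links):
    """Breadth-first crawl: return every page reachable from the seeds.

    `links` maps a page name to the list of pages it links to. In a real
    crawler this lookup would be an HTTP fetch plus link extraction.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: 00.html .. 99.html. Each connected page links to the
# next one; nothing links to 45.html or 78.html, and they link to nothing.
pages = [f"{i:02d}.html" for i in range(100)]
isolated = {"45.html", "78.html"}
connected = [p for p in pages if p not in isolated]
links = {p: [q] for p, q in zip(connected, connected[1:])}

reached = crawl(["00.html"], links)
missed = sorted(set(pages) - reached)
print(missed)  # ['45.html', '78.html']
```

The two isolated pages only show up in `missed` because the full list of page names was known in advance. A crawler that starts with nothing but the seed has no such list to compare against.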

P: 30
Yes, that is what puzzles me. Before my program requests an address directly, it needs some hint that the page is there. There are millions of HTML pages (000000000.html to 999999999.html, though not all of them exist) in my closed set, so testing every address for existence is far too time-consuming. If the web's structure is depicted as a bow-tie, what I have to find are the tendrils.
Sep 10 '08 #7
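
To put a rough number on "too time-consuming": the nine-digit range in the post above spans 10^9 possible names. The request rate below is purely an assumption chosen for illustration, not a figure from the thread, and the sketch is in Python rather than Perl.

```python
# Back-of-the-envelope cost of testing every nine-digit name.
candidates = 10 ** 9          # 000000000.html .. 999999999.html
requests_per_second = 100     # assumed rate, not taken from the thread
seconds = candidates / requests_per_second
days = seconds / 86_400       # seconds per day
print(f"{days:.0f} days")     # 116 days
```

At that assumed rate a full sweep takes about four months, so exhaustive testing is only practical if the candidate space can be narrowed first.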

KevinADC
Expert 2.5K+
P: 4,059
Personally, I don't see how your problem is even related to Perl. Perl certainly can't find links to web pages when those links don't exist. Your problem is beyond the scope of Perl.
Sep 10 '08 #8

P: 30
Ummm... I'm new to using Perl for this kind of problem, so I'm unsure what is within its scope.
Sep 11 '08 #9

Icecrack
Expert 100+
P: 174
Your best bet is to generate random names for HTML files and request each one, checking whether the response is a 404 or some other error. If there is no error, save the page for review. Note: this method will take a lot of processing power.

*But it will solve your problem.*
Sep 15 '08 #10
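
A minimal sketch of the probing idea from the last reply, in Python rather than Perl. It differs from the suggestion in one respect: it walks a list of candidate names in order instead of picking them at random, and it skips names the crawl has already found. The names `url_exists` and `find_unlinked` and the `http://example.invalid/` base URL are hypothetical and used only for illustration.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to `url` completes without an error.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection problems; both are subclasses of OSError.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except OSError:
        return False


def find_unlinked(base_url, candidates, already_crawled, exists=url_exists):
    """Probe candidate names the crawl missed and return the ones that exist."""
    found = []
    for name in candidates:
        if name in already_crawled:
            continue  # the crawl already reached this page
        if exists(base_url + name):
            found.append(name)
    return found


# Offline demonstration: a stand-in replaces the network check.
live = {"http://example.invalid/45.html", "http://example.invalid/78.html"}
names = [f"{i:02d}.html" for i in range(100)]
crawled = set(names) - {"13.html", "45.html", "78.html"}
unlinked = find_unlinked(
    "http://example.invalid/", names, crawled, exists=lambda url: url in live
)
print(unlinked)  # ['45.html', '78.html']
```

Passing the check in as `exists` keeps the demonstration runnable without a network. Against a real server the default `url_exists` would be used, with the caveat that some servers answer missing pages with a normal page instead of a 404, in which case a status check alone is not enough.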
