how to webcrawl disconnected components

anklos
30
Hi~!

I am crawling the web starting from a few URLs (taken from a closed set of HTML pages), but I can't manage to reach the disconnected components. Could anyone give me some advice? Thanks for any suggestions!
Sep 9 '08 #1
KevinADC
4,059 Expert 2GB
No idea what a disconnected component is.
Sep 9 '08 #2
anklos
30
Most web pages link to other pages and are in turn linked to by other pages. But some pages contain no links at all and have no links pointing to them either, so they are hard to find within a given closed set of HTML pages.
Sep 10 '08 #3
KevinADC
4,059 Expert 2GB
Can you give an example?
Sep 10 '08 #4
anklos
30
For instance, take http://xxxx//00.html through http://xxxx//99.html, 100 HTML pages in total. Links connect most of them, but nothing links to http://xxxx//45.html or http://xxxx//78.html, and neither of those two contains any outgoing links. So we can start crawling from http://xxxx//00.html and keep crawling, but we never reach those two disconnected pages.
Sep 10 '08 #5
numberwhun
3,509 Expert Mod 2GB
If I follow what you are saying correctly, the only way to get to these pages is to type in their addresses directly, because there are no links to them.

If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
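To make the point concrete, here is a minimal sketch (in Python rather than Perl, purely for brevity) of a breadth-first crawl over an in-memory link graph. The `links` dictionary and the chain-shaped site are assumptions made up for illustration; in a real crawler, looking up `links[page]` would be replaced by fetching the page and extracting its anchors.

```python
from collections import deque


def crawl(links, seeds):
    """Breadth-first crawl; returns every page reachable from the seeds.

    `links` maps a page to the pages it links to. It is a stand-in for
    downloading a page and parsing out its link targets.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: 00.html .. 99.html, each connected page linking to the
# next one, except that nothing links to 45 or 78 and they link nowhere.
pages = [f"{n:02d}.html" for n in range(100)]
isolated = {"45.html", "78.html"}
connected = [p for p in pages if p not in isolated]
links = {a: [b] for a, b in zip(connected, connected[1:])}

reached = crawl(links, ["00.html"])
missed = sorted(set(pages) - reached)
print(missed)  # ['45.html', '78.html']
```

The crawl can only report the two isolated pages as missing because the full page list is known in advance here. Without that list, the crawler has nothing to compare against.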

Regards,

Jeff
Sep 10 '08 #6
anklos
30
Yeah, it puzzles me. If my program is going to request addresses directly, it needs some hints first. I mean, there are millions of HTML pages (000000000.html through 999999999.html, though not all of them exist) in my closed set, and testing whether every address exists is far too time-consuming. If you picture the web's structure as a bow tie, what I have to find are the tendrils.
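A rough back-of-the-envelope estimate shows why an exhaustive check is impractical. The sketch below (Python, for illustration only) assumes the full nine-digit name space and a few made-up request rates; the rates are assumptions, not measurements.

```python
def full_scan_days(candidates, requests_per_second):
    """Days needed to request every candidate once at a steady rate."""
    seconds = candidates / requests_per_second
    return seconds / 86_400  # seconds per day


total = 10 ** 9  # 000000000.html .. 999999999.html
for rate in (10, 100, 1000):  # assumed request rates, not measured ones
    print(f"{rate:>5} req/s -> {full_scan_days(total, rate):8.1f} days")
```

At an assumed 100 requests per second, a full scan would take roughly 116 days, which is why some hint about which names are likely to exist matters so much.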
Sep 10 '08 #7
KevinADC
4,059 Expert 2GB
Personally, I don't see how your problem is even related to Perl. Perl certainly can't find links to web pages that don't exist. Your problem is beyond the scope of Perl.
Sep 10 '08 #8
anklos
30
Ummm... since I'm new to using Perl for this problem, I'm unsure about the scope of Perl.
Sep 11 '08 #9
Icecrack
174 Expert 100+
Your best bet is to generate random names for the HTML files and request each one, checking for a 404 or any other error. If no error comes back, save the URL for review. Note: this method will take a lot of processing power.

*But it will solve your problem.*
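A minimal sketch of that idea, written in Python for illustration (the same logic would port to Perl). The function names, the `base_url` parameter, and the zero-padded `NNNNNNNNN.html` naming scheme are assumptions based on the example earlier in the thread. It treats only a 200 response as a hit, and the status check is passed in as a function so the logic can be tried without touching a real server.

```python
import random
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None on failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a page that does not exist
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def probe(base_url, already_crawled, sample_size,
          get_status=http_status, digits=9, rng=random):
    """Request randomly chosen candidate names and collect the ones that exist.

    Names already found by the link-following crawl are skipped, so only
    pages the crawler could not reach are reported.
    """
    found = []
    for n in rng.sample(range(10 ** digits), sample_size):
        name = f"{n:0{digits}d}.html"
        if name in already_crawled:
            continue
        if get_status(base_url + name) == 200:
            found.append(name)  # save for review
    return found
```

For example, `probe("http://xxxx//", crawled_names, 10000)` would test 10,000 random candidates against the placeholder host used earlier in the thread, where `crawled_names` is a hypothetical set of file names the crawler has already visited. Throttling the request rate and respecting the site's robots.txt are left out of the sketch.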
Sep 15 '08 #10
