Hi~!
I am doing webcrawl from a few url links(given by a enclosed set of htmls), but I can't succeed to crawl disconnected components. Could anyone give some advice on it? Thanks for any suggestion!
9 1415
No idea what a disconnected component is.
No idea what a disconnected component is.
Most webpages have links to other webpages, meanwhile, they are also linked by other pages too. But there exsit some webpages having not any link at all, and no link to them either. So it's hard to find them out in a given enclosed set of htmls.
Most webpages have links to other webpages, meanwhile, they are also linked by other pages too. But there exsit some webpages having not any link at all, and no link to them either. So it's hard to find them out in a given enclosed set of htmls.
Can you give an example?
Can you give an example?
for instance, http://xxxx//00.html----http://xxxx//99.html, total 100 htmls, there are links to connect most of them. But there is no link to http://xxxx//45.html and tp://xxxx//78.html, both of them dont contain any links out either. So we can webcrawl from http://xxxx//00html, crawl and crawl, but can't arrive those two disconnected htmls
for instance, http://xxxx//00.html----http://xxxx//99.html, total 100 htmls, there are links to connect most of them. But there is no link to http://xxxx//45.html and tp://xxxx//78.html, both of them dont contain any links out either. So we can webcrawl from http://xxxx//00html, crawl and crawl, but can't arrive those two disconnected htmls
If I follow what you are saying correctly, the only way to get to these pages is to type in their address directly, this being because their are no links to them.
If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
Regards,
Jeff
If I follow what you are saying correctly, the only way to get to these pages is to type in their address directly, this being because their are no links to them.
If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
Regards,
Jeff
Yeah, it puzzles me. If my program tries to type address directly, it should get some hints before I do it. I mean, there are millions of htmls(000000000.html-------999999999.html,but not all of them exsit) in my given enclosed set, it's too much time-consuming to test every address exsiting or not. If depict web structure as a bow-tie strcuture, what I have to find are those tendrils
Yeah, it puzzles me. If my program tries to type address directly, it should get some hints before I do it. I mean, there are millions of htmls(000000000.html-------999999999.html,but not all of them exsit) in my given enclosed set, it's too much time-consuming to test every address exsiting or not. If depict web structure as a bow-tie strcuture, what I have to find are those tendrils
Personally, I don't see how your problem is even related to perl. Perl certainly can't find links to web pages that don't exist. Your problem is beyond the scope of perl.
Personally, I don't see how your problem is even related to perl. Perl certainly can't find links to web pages that don't exist. Your problem is beyond the scope of perl.
Ummm..since I'm new to using perl to solve this problem, unsure about the scope of perl.
Your Best bet is to generate a random name for html files and finding links by checking for Non 404 errors or any other error and if there is no error found, then saving it for review, Note: this method will take a lot of processing power.
*but it will solve your problem*,
Sign in to post your reply or Sign up for a free account.
Similar topics
by: RMG |
last post by:
I use VS.net 2003 and VSS 6a.
Whilst working on the network the integration between
VS.net and VSS works perfectly. But when I open a project
whilst disconnected from the network I get some...
|
by: elcc1958 |
last post by:
I need to support a VB6 application that will be receiving
disconnected ADODB.Recordset from out DotNet solution. Our dotnet
solution deals with System.Data.DataTable. I need to populate a...
|
by: Steve Jorgensen |
last post by:
I keep having problems in which ADO disconnected recordset work under some
circumstances, but lose all their data at other times, having no rows or
fields, though the recordset object still exists....
|
by: Andrew |
last post by:
I'm a long time VB6/ADO and Java developer new to ADO.NET. I'm trying
to decide on best practices and I'd appreciate any assistance. I have
one specific question and another more general...
|
by: Steve Le Monnier |
last post by:
The ADO.NET DataSet is idea for application development, especially if you
need disconnected data. DataReader objects are great in the connected
environment but are forward only. What do you do...
|
by: AC |
last post by:
Running VS.NET 2003 Enterprise Arch on WinXP Pro SP1 with a P4-2.4Ghz, 760MB+ RAM, and 10+GB free disk space. Laptop is part of a domain.
When working on a web project connected at the office,...
|
by: SunSmile |
last post by:
Hi,
I am getting the following error when i try to throw an exception. The
exception i throw is caught up in the hierarchy and is rethrown and so on.
This error is only coming up when unit...
|
by: Steven Nagy |
last post by:
I know that .NET is based on a disconnected architecture, but I can't
conceive of why continually opening and closing a connection would be
faster than leaving a connection open.
So I ran a test...
|
by: =?Utf-8?B?ZGdjb29wZXI=?= |
last post by:
When I get a list of drives using the Directory.GetLogicalDrives(), it gives
me all drives including disconnected network drives. When I attempt to use
Directory.GetDirectories() on a disconnected...
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers,...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
| |