How to webcrawl disconnected components

Hi~!

I am crawling the web starting from a few URLs (within a given closed set of HTML pages), but I can't manage to crawl the disconnected components. Could anyone give me some advice? Thanks for any suggestions!

Sep 9 '08 #1

KevinADC

4,059

Expert 2GB

No idea what a disconnected component is.

Sep 9 '08 #2

anklos

Most web pages link to other pages and are linked from other pages in turn. But there exist some pages that contain no links at all and have no links pointing to them either. So it's hard to find them within a given closed set of HTML pages.

Sep 10 '08 #3

KevinADC

4,059

Expert 2GB

Can you give an example?

Sep 10 '08 #4

anklos

For instance, take http://xxxx//00.html through http://xxxx//99.html, 100 HTML pages in total. Links connect most of them, but no page links to http://xxxx//45.html or http://xxxx//78.html, and neither of those two contains any outgoing links. So we can start crawling from http://xxxx//00.html and keep crawling, but we can never reach those two disconnected pages.

Sep 10 '08 #5

numberwhun

3,509

Expert Mod 2GB

If I follow what you are saying correctly, the only way to get to these pages is to type in their address directly, because there are no links to them.

If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
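
To make the point concrete, here is a minimal sketch of a link-following crawl over an in-memory graph. It is written in Python rather than Perl, and the toy site (100 pages chained one to the next, with 45.html and 78.html isolated) is an assumption built from the example earlier in the thread, not a real site. No network access is involved.

```python
from collections import deque

def crawl(links, seeds):
    """Breadth-first crawl over an in-memory link graph.

    links maps each page to the list of pages it links to;
    seeds are the start pages. Returns the set of pages reached.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy site: 00.html .. 99.html, each linking to the next,
# except that nothing links to 45.html or 78.html and they link nowhere.
pages = [f"{n:02d}.html" for n in range(100)]
isolated = {"45.html", "78.html"}
connected = [p for p in pages if p not in isolated]
links = {p: [q] for p, q in zip(connected, connected[1:])}

reached = crawl(links, ["00.html"])
missing = sorted(set(pages) - reached)
print(missing)  # ['45.html', '78.html']

# Seeding the isolated pages explicitly is the only way the crawl reaches them.
reached_all = crawl(links, ["00.html", "45.html", "78.html"])
print(len(reached_all))  # 100
```

The crawl can only ever return pages reachable from its seeds, so the isolated pages show up only when they are passed in as seeds themselves.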

Regards,

Jeff

Sep 10 '08 #6

anklos

Yeah, it puzzles me. If my program is going to request addresses directly, it needs some hints first. I mean, there are millions of possible pages (000000000.html through 999999999.html, but not all of them exist) in my given closed set, so it's far too time-consuming to test whether every address exists. If you picture the web's structure as a bow-tie, what I have to find are the tendrils.
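
A rough back-of-the-envelope check of how long testing every address would take, as a small Python sketch. Nine-digit names give 10^9 candidates; the request rates are purely assumed figures for illustration, not measurements.

```python
def scan_days(candidates, requests_per_second):
    """Days needed to probe every candidate URL once at a fixed rate."""
    seconds = candidates / requests_per_second
    return seconds / 86_400  # seconds per day

candidates = 10 ** 9          # 000000000.html .. 999999999.html
for rate in (10, 100, 1000):  # assumed request rates, not measured
    print(f"{rate:>5} req/s -> {scan_days(candidates, rate):,.1f} days")
```

At an assumed 100 requests per second, a full scan works out to roughly 116 days, which is why some hint that narrows the candidate range matters so much.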

Sep 10 '08 #7

KevinADC

4,059

Expert 2GB

Personally, I don't see how your problem is even related to Perl. Perl certainly can't find links to web pages that don't exist. Your problem is beyond the scope of Perl.

Sep 10 '08 #8

anklos

Ummm... since I'm new to using Perl to solve this problem, I'm unsure about the scope of Perl.

Sep 11 '08 #9

Icecrack

174

Expert 100+

Your best bet is to generate random names for the HTML files and request each one, checking for a 404 or any other error. If no error comes back, save the address for review. Note: this method will take a lot of processing power,

*but it will solve your problem*.
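
A minimal sketch of that probing idea, in Python rather than Perl. It swaps random names for sequential ones, since the file names in this thread are numeric, and it skips anything the crawl already found. The names `url_exists` and `find_orphans`, the `width` parameter, and the `example.invalid` host are all hypothetical, introduced only for illustration. The demonstration at the end uses a stand-in checker, so it makes no real requests.

```python
import urllib.error
import urllib.request

def url_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds with a non-error status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTPError (404, 500, ...) is a subclass of URLError.
        return False

def find_orphans(base_url, numbers, crawled, exists=url_exists, width=2):
    """Probe numbered pages the crawl did not reach; return those that exist.

    crawled is the set of file names already found by following links,
    exists is the checker to call on each remaining candidate URL.
    """
    found = []
    for n in numbers:
        name = f"{n:0{width}d}.html"
        if name in crawled:
            continue  # already known, no request needed
        if exists(base_url + name):
            found.append(name)
    return found

# Offline demonstration with a stand-in checker instead of real requests.
crawled = {f"{n:02d}.html" for n in range(100)} - {"45.html", "78.html"}
fake_site = {f"http://example.invalid/{n:02d}.html" for n in range(100)}
orphans = find_orphans("http://example.invalid/", range(100), crawled,
                       exists=lambda url: url in fake_site)
print(orphans)  # ['45.html', '78.html']
```

Against a real server you would leave `exists` at its default and probably add a delay between requests. Some servers answer missing pages with a 200 status and an error page, in which case a status check alone would not be enough.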

Sep 15 '08 #10
