Need a spider library


Hi All,

I'm writing a spider program. I need to visit several URLs and extract
information from the HTML source, including links.
I was using FancyURLopener and my own function that extracts the links
from an HTML page. The problem is that I constantly need to change it,
because some sites use lower-case tag names and others use upper-case
ones. Some of them write href="page.html", others omit the quotation
marks (href=page.html), and I have even found unclosed quotations
(<a href="page.html >), doubly opened and unclosed <a> tags, and so on.
There are many kinds of malformed HTML pages out there, and it seems I'm
not capable of handling all of them. The question: is there a good
library for Python for extracting links and images from (possibly
malformed) HTML source code? Something like the "references" function in
Lynx. I need to handle relative and absolute references, and I need to
know the anchor text and the position of the anchor inside the HTML
source file.

For example this malformed link:

<a href="page.html >Sample link</a>

could be converted to:

['page.html','http://samplesite.current_location/page.html','Sample link']
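
To make the desired output concrete, here is a minimal sketch of the kind of
extractor described above. It assumes Python 3's standard library
(html.parser.HTMLParser and urllib.parse.urljoin); the names LinkExtractor and
extract_links are hypothetical, chosen only for this illustration. The parser
lowercases tag and attribute names and accepts unquoted attribute values, which
covers the case and quoting differences. It is not a guaranteed fix for every
malformed page: an unclosed quotation, for instance, may still be parsed
differently from what a browser would do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect [href, absolute_url, anchor_text, (line, column)] per <a href>.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current = None  # link whose anchor text is still being collected

    def _finish(self):
        # Close the link being collected, trimming surrounding whitespace.
        if self._current is not None:
            self._current[2] = self._current[2].strip()
            self._current = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag != "a":
            return
        self._finish()  # tolerate a new <a> opened before the previous one closed
        href = dict(attrs).get("href")
        if href is None:
            return
        href = href.strip()
        # getpos() gives the (line, column) where the parser currently is.
        self._current = [href, urljoin(self.base_url, href), "", self.getpos()]
        self.links.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish()


def extract_links(html, base_url):
    """Return the list of links found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    parser._finish()  # in case the last <a> was never closed
    return parser.links
```

Calling extract_links(page_source, 'http://samplesite.current_location/') yields
one entry per link, with the raw href, the resolved absolute URL, the anchor
text, and the position in the source. Image references could be collected the
same way by also watching for the "img" tag and its "src" attribute.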

Thanks in advance

Les

Oct 12 '05 #1
