php to spider a website

I am looking for a script that I can use to spider a website, and then pull
the images... I know how to do it for a single page, but, I would like to be
able to do this for the entire site. Any suggestions?

Thanks,
Kyle Mizell
http://www.pimpinonline.com
Jul 17 '05 #1
jn
"Kyle Mizell" <ky**@pimpinonline.comNOSPAM> wrote in message
news:qewyb.174752$Dw6.686810@attbi_s02...
I am looking for a script that I can use to spider a website, and then pull the images... I know how to do it for a single page, but, I would like to be able to do this for the entire site. Any suggestions?

Thanks,
Kyle Mizell
http://www.pimpinonline.com


I don't know about your question, but pimpinonline.com is awesome.
Jul 17 '05 #2
Kyle Mizell wrote:
I am looking for a script that I can use to spider a website, and then pull
the images... I know how to do it for a single page, but, I would like to be
able to do this for the entire site. Any suggestions?


Why PHP? Use wget if all you want is a simple spider job.

Jul 17 '05 #3
On Mon, 01 Dec 2003 00:49:26 GMT, "Kyle Mizell" <ky**@pimpinonline.comNOSPAM>
wrote:
I am looking for a script that I can use to spider a website, and then pull
the images... I know how to do it for a single page, but, I would like to be
able to do this for the entire site. Any suggestions?


PHP has HTTP client functions; you can simply use file() with a URL.

However, to extract information from the HTML, you need an HTML parser
(regular expressions alone are not sufficient). PHP doesn't have one built in
or as one of the standard extensions. Personally I'd use Perl for this (e.g.
HTML::Parser). I think there is an HTML parser for PHP called HTML-Sax, have a
search for that.
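
For readers on a newer PHP, here is a minimal sketch of the "fetch, then parse" step described above. It assumes PHP 7 or later with the bundled DOM extension (DOMDocument), which postdates this post; the function name extract_image_urls and the example URL are illustrative only, and the sketch uses file_get_contents() where the post mentions file().

```php
<?php
// Sketch only: assumes PHP 7+ with the bundled DOM extension enabled.

/**
 * Return the src attribute of every <img> tag in an HTML string,
 * without duplicates and in document order.
 */
function extract_image_urls(string $html): array
{
    if (trim($html) === '') {
        return [];                      // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;             // skip <img> tags without a src
        }
    }
    return array_values(array_unique($urls));
}

// Usage (hypothetical URL; fetching by URL needs allow_url_fopen enabled):
// $html = file_get_contents('http://www.example.com/');
// print_r(extract_image_urls($html));
```

The returned values are whatever the page author wrote in src, so relative paths still have to be resolved against the page URL before downloading.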

--
Andy Hassall (an**@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)
Jul 17 '05 #4
"Kyle Mizell" <ky**@pimpinonline.comNOSPAM> wrote in message news:<qewyb.174752$Dw6.686810@attbi_s02>...
I am looking for a script that I can use to spider a website, and then pull
the images... I know how to do it for a single page, but, I would like to be
able to do this for the entire site. Any suggestions?

Thanks,
Kyle Mizell
http://www.pimpinonline.com


Do for every page what you already do for one page.
Store all the links found on the first page in an array (eliminating
duplicates), then process each of those pages the same way as the first.
I think the best approach is to write a function that saves one page and
returns the links it found, then call that function for every URL.
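
The approach above could be sketched roughly as follows, assuming PHP 7.1+ with the DOM extension. Every name here (extract_links, resolve_url, crawl_site, the $fetch callable, the $maxPages limit) is made up for illustration, and the URL resolution is deliberately simplified.

```php
<?php
// Sketch only. $fetch is a hypothetical callable: URL in, HTML string (or false) out.

/** Return the raw href of every <a> tag in an HTML string. */
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

/**
 * Turn a link into an absolute http(s) URL, or null if it should be skipped.
 * Simplified: "../" segments are not normalised.
 */
function resolve_url(string $base, string $link): ?string
{
    $link = explode('#', trim($link), 2)[0];            // drop any fragment
    if ($link === '') {
        return null;
    }
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $link)) {  // link has its own scheme
        return preg_match('~^https?://~i', $link) ? $link : null;
    }
    $b = parse_url($base);
    if (!isset($b['scheme'], $b['host'])) {
        return null;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $path = $b['path'] ?? '/';
    if (substr($link, 0, 2) === '//') {
        return $b['scheme'] . ':' . $link;              // protocol-relative link
    }
    if ($link[0] === '/') {
        return $root . $link;                           // root-relative link
    }
    if ($link[0] === '?') {
        return $root . $path . $link;                   // same page, new query
    }
    $dir = substr($path, 0, strrpos($path, '/') + 1);   // directory of the base page
    return $root . $dir . $link;
}

/** Breadth-first crawl of one host. Returns [url => html] for each page fetched. */
function crawl_site(string $start, callable $fetch, int $maxPages = 100): array
{
    $host  = parse_url($start, PHP_URL_HOST);
    $queue = [$start];
    $seen  = [$start => true];                          // eliminates duplicates
    $pages = [];

    while ($queue && count($pages) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === false) {
            continue;                                   // fetch failed, skip it
        }
        $pages[$url] = $html;
        foreach (extract_links($html) as $link) {
            $next = resolve_url($url, $link);
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue;                               // stay on the starting site
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }
    return $pages;
}

// Usage (hypothetical URL; needs allow_url_fopen enabled):
// $pages = crawl_site('http://www.example.com/', function ($url) {
//     return @file_get_contents($url);
// });
```

Passing the fetcher in as a callable keeps the loop testable without network access, and the page limit is a guard against sites that generate endless links.
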
While you are saving a page you have to rewrite its links, because the
static file names will be different,
e.g.
me*************************************@pimpinonline.com&unset_search=true
replace with
members_php_search_sex_Male_search_kyle_pimpinonline_com_unset_search_true.HTML

and so name all stored pages.
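
A naming scheme like the one in the example above might be sketched as below. The function name url_to_filename, the lowercase .html suffix, and the "index" fallback are assumptions for illustration, not something taken from the post.

```php
<?php
// Sketch only: maps a page URL to a flat, filesystem-safe file name.

function url_to_filename(string $url): string
{
    $parts = parse_url($url);
    $name  = ltrim($parts['path'] ?? '', '/');
    if (isset($parts['query'])) {
        $name .= '?' . $parts['query'];     // keep the query so each result page gets its own file
    }
    if ($name === '') {
        $name = 'index';                    // assumed name for the site root
    }
    // Collapse every run of non-alphanumeric characters into one underscore.
    $name = preg_replace('/[^A-Za-z0-9]+/', '_', $name);
    return trim($name, '_') . '.html';
}

// Usage (hypothetical URL):
// echo url_to_filename('http://www.example.com/members.php?search=kyle&unset_search=true');
```

The same function can supply the new href values when rewriting links in a saved page. Two different URLs can collapse to the same name (for instance a/b and a_b), so a real script may want to detect collisions.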

enjoy
Jul 17 '05 #5
