Bytes IT Community

Scraping

P: n/a
I was wondering where I would start to try and recreate something like
http://goohackle.com/scripts/google_parser.php, where it just lists the
URLs. I would be using it to check which pages of my site are listed
and then report that to my db. I can already do this for single pages,
but I need to do it for my entire domain.

Any help is much appreciated
Thanks in advance!
Sep 14 '08 #1
4 Replies


P: n/a
On Sep 13, 8:48 pm, boxoft <box...@gmail.com> wrote:
> I tried to input a URL like www.veturi.com and got the list as
> follows: http://www.veturi.com/ http://www.vet....php?id=480531
>
> I guess all the pages that start with http://www.veturi.com/ are what
> you want. Right?

Well, I was mainly wanting info on how I would make a script like that
one, where it outputs the URLs.
Sep 14 '08 #3

P: n/a
The scraping process is as follows:
1. The script sends an HTTP GET request to Google with the search terms
you want.
2. The script parses the useful information from the returned content.
3. The script outputs the result in the format you want.
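
The three steps above can be sketched roughly as follows in Python. This is only a minimal illustration, not the code behind the goohackle tool: the names `build_search_url`, `fetch`, `extract_links` and `format_links` are made up for this example, and the search endpoint, the `q` parameter and the User-Agent header are assumptions about what a plain search request looks like. Google may also refuse or throttle automated requests.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_search_url(terms, base="https://www.google.com/search"):
    # Step 1a: encode the search terms into a GET query string.
    # The base URL and the "q" parameter are assumptions for this sketch.
    return base + "?" + urlencode({"q": terms})


def fetch(url):
    # Step 1b: send the HTTP GET request and return the body as text.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    # Step 2: parse the useful information (here, every link target).
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def format_links(links):
    # Step 3: output the result in the format you want (one URL per line).
    return "\n".join(links)
```

To list the pages of one domain you would call something like `print(format_links(extract_links(fetch(build_search_url("site:example.com")))))`, then repeat the request for each further page of results.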

I recommend "Webbots, Spiders, and Screen Scrapers" at
http://www.schrenk.com/nostarch/webbots/.
This book explains the basic techniques used to scrape data from the web.

I actually used its techniques when working on some projects at
GetAFreelancer.com and RentACoder.com.
Sep 14 '08 #4

P: n/a
On Sep 13, 8:48 pm, boxoft <box...@gmail.com> wrote:
> I tried to input a URL like www.veturi.com and got the list as
> follows: http://www.veturi.com/ http://www.vet....php?id=480531
>
> I guess all the pages that start with http://www.veturi.com/ are what
> you want. Right?
>
> Well, I was mainly wanting info on how I would make a script like that
> one, where it outputs the URLs.
OP:
Actually, the small print at the bottom of that page suggests that the
source code is available through the tools link, no? Have you checked
it? It sounds like all he's doing is stripping out the URLs from a
regular Google search.
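
If that is what it does, the stripping part might look something like this sketch in Python. It assumes you already have the href values from a results page, and that result links are either direct URLs or wrapped as `/url?q=<target>&...`, which is an assumption about the markup and may not match what Google returns today. The function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_result_link(href):
    """Return the target of a '/url?q=...' style link, else href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


def urls_for_domain(hrefs, domain):
    """Keep unique URLs hosted on `domain` or a subdomain, in first-seen order."""
    seen = []
    for href in hrefs:
        url = unwrap_result_link(href)
        host = urlparse(url).hostname or ""
        if (host == domain or host.endswith("." + domain)) and url not in seen:
            seen.append(url)
    return seen
```

Calling `urls_for_domain(hrefs, "veturi.com")` would drop the search page's own navigation links and leave only the site's URLs, which you could then write to your db.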
Sep 14 '08 #5
