
Cyclic Google requests to provide SERP scrape data for each user

P: 4
I would like to know how to scrape Google SERPs in a bigger project.

What I have:

- Website application written in PHP with simple user management
- cURL script which, for 9 phrases, scrapes the SERP every hour and fetches the top 100 domains for each phrase, saving them to a database. It makes 9 (phrases) * 10 (pages) * 24 (hours) = 2,160 requests per 24 hours. More accurately: at 10:00 it sends at most 1 request every 6 seconds (a 3-6 second pause between requests), finishes the cron process, then waits for 11:00 and repeats every hour. It is driven by cron. It has worked well for the last month, I didn't get banned, and I believe it's the very maximum I can get before Google sends me to hell.
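For reference, a minimal sketch of that kind of hourly cron job, assuming a PDO connection and a parse_domains() helper standing in for whatever HTML parsing the existing script already does; the table and column names are made up for illustration:

<?php
// Hourly cron entry point (illustrative sketch).
// For each stored phrase, fetch result pages 1-10 with cURL, extract the domains,
// and save the top 100 per phrase, pausing 3-6 seconds between requests.

$pdo = new PDO('mysql:host=localhost;dbname=serp', 'dbuser', 'dbpass');
$phrases = $pdo->query('SELECT id, phrase FROM phrases')->fetchAll(PDO::FETCH_ASSOC);

$insert = $pdo->prepare(
    'INSERT INTO serp_results (phrase_id, position, domain, fetched_at)
     VALUES (?, ?, ?, NOW())'
);

foreach ($phrases as $row) {
    for ($page = 0; $page < 10; $page++) {
        $url = 'https://www.google.com/search?q=' . urlencode($row['phrase'])
             . '&start=' . ($page * 10);

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64)',
            CURLOPT_TIMEOUT        => 20,
        ]);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html !== false) {
            // parse_domains() is a placeholder for the existing result parsing.
            foreach (parse_domains($html) as $i => $domain) {
                $insert->execute([$row['id'], $page * 10 + $i + 1, $domain]);
            }
        }

        sleep(random_int(3, 6)); // the 3-6 second freeze between requests
    }
}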


What I want:

Expand the concept of my cURL script to multiple users.

Scenario: each user can get the top 100 domains for each phrase they want, once per hour. For example: user A sets 3 phrases for analysis, which means 30 requests per hour for that user, and user B sets 8 phrases, which brings the total to 110 requests per hour. That amount of requests is impossible to handle from 1 IP without being punished by Google. I would probably need to set up 1 proxy server per user to get 1 unique IP, on which I can handle at most 9 phrases (90 requests) per hour. BUT EVEN THEN it's just bad to offer users analysis of only 9 phrases per hour.

The only desperate idea I see for now is to buy a lot of hostings with cron and somehow "assign", for example, 10 proxy servers (each with a unique IP) to each hosting. Let's say I have 1,000 users: then I need 100 hostings with cron and 10 proxies per hosting, which gives me 1,000 unique IPs for 1,000 users. Each hosting would use the same database where the user settings are kept. For each cron process, this DB would provide the SERP scraping settings per user (UserID, PhraseToGoogleSearch, DomainToFilterInSERP, SERPLanguage). Basically: 10 users = 10 proxy servers = 1 hosting. How about that?
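A rough sketch of what one such per-hosting cron process could look like under that layout, reading the shared settings table and routing each user's requests through that user's proxy via CURLOPT_PROXY. The user_proxies table and every identifier beyond the four columns named above are assumptions:

<?php
// Per-hosting cron sketch: this hosting serves its assigned users only,
// and each user is pinned to one of the hosting's proxies (one unique IP per user).

$hostingId = 7; // identifies this particular hosting; set per deployment

$pdo = new PDO('mysql:host=shared-db.example.com;dbname=serp', 'dbuser', 'dbpass');

// user_settings: (UserID, PhraseToGoogleSearch, DomainToFilterInSERP, SERPLanguage)
// user_proxies:  (UserID, HostingID, ProxyHost, ProxyPort)  -- assumed mapping table
$stmt = $pdo->prepare(
    'SELECT s.UserID, s.PhraseToGoogleSearch, s.DomainToFilterInSERP, s.SERPLanguage,
            p.ProxyHost, p.ProxyPort
       FROM user_settings s
       JOIN user_proxies  p ON p.UserID = s.UserID
      WHERE p.HostingID = ?'
);
$stmt->execute([$hostingId]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $task) {
    for ($page = 0; $page < 10; $page++) {
        $url = 'https://www.google.com/search?q=' . urlencode($task['PhraseToGoogleSearch'])
             . '&hl=' . urlencode($task['SERPLanguage'])
             . '&start=' . ($page * 10);

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_PROXY          => $task['ProxyHost'],
            CURLOPT_PROXYPORT      => (int) $task['ProxyPort'],
            CURLOPT_TIMEOUT        => 20,
        ]);
        $html = curl_exec($ch);
        curl_close($ch);

        // ... parse the page, check whether DomainToFilterInSERP appears, store the result ...

        sleep(random_int(3, 6)); // keep the per-IP pacing that has worked so far
    }
}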

It would be nice to give every user the possibility to expand their 9 analyses per hour (9 phrase|domain SERP scrapes per cron process) to, for example, 80 or even 800, regardless of cost. I want to know how to implement my suggested user <=> IP idea, or any solution that would work better. I have never had any experience with proxies or managing IPs.
2 Weeks Ago #1
9 Replies


gits
Expert Mod 5K+
P: 5,204
well - I think the most suitable way would be to use a Google API like this: https://serpapi.com/ instead of working around their Terms of Service. That way you wouldn't even need to think about the consequences of violating the ToS with such workarounds - since you would simply use what Google intends to be used for such purposes - of course it will cost something, though.
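For illustration, a rough sketch of what calling such a hosted service could look like from PHP; the endpoint, parameter names and organic_results field follow SerpApi's public documentation at the time of writing, so verify them against the current docs before building on this:

<?php
// Query a hosted SERP API for up to 100 results for one phrase,
// then reduce each result to its domain.

$apiKey = 'YOUR_API_KEY'; // issued by the service
$query  = 'example phrase';

$url = 'https://serpapi.com/search.json?' . http_build_query([
    'engine'  => 'google',
    'q'       => $query,
    'num'     => 100,
    'api_key' => $apiKey,
]);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);
foreach ($data['organic_results'] ?? [] as $result) {
    echo parse_url($result['link'], PHP_URL_HOST), PHP_EOL;
}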
2 Weeks Ago #2

P: 4
"COMPANY
SerpApi, LLC
5540 N Lamar Blvd
Austin, TX 78756
".
This is just another popular scraping service. I don't see any relation between this company and Google, sorry. They scrape SERPs just the way I want to, only more globally. Honestly, my project won't be commercial, and what I want to achieve is a scraping system for 2 or more users. Really just not one user, because I have to implement a login/register system, so I need to expand my current scraping script and show that it works for multiple users.
By the way, I don't have much money, but free proxy-like services aren't safe and I don't want to use them.
2 Weeks Ago #3

gits
Expert Mod 5K+
P: 5,204
yes - it seems they are not directly related to Google - but obviously they can offer a service that Google doesn't provide anymore (however they do it). Google shut down the Web Search API some time ago - and the Custom Search API is not meant to be used for web searches in bigger quantities either - so the point still stands that you are most likely already violating the ToS by doing automated searches - which, because of the low volume, might not interest anyone. If you expand that volume - and start trying to fog it - things might change and Google might start to become grumpy.

If it's not commercial - why not contact Google and ask about their limits/options? If it's just educational/exemplary, there wouldn't be a need to go to the full limits anyway. To present something working for 2 users instead of 1 - simply use 50% of the resources per user - that would show it's working. Google can't know what you want to do if you don't tell them - so the best bet is either to use a service like the one mentioned or to talk to Google and see what can be done. Everything else is just working around the ToS and might lead to failure without notice when Google blocks your requests.
2 Weeks Ago #4

P: 4
As far as I know, scraping Google isn't against the law.
Are you going to say that scraping services aren't breaking Google's ToS because they have Google's blessing or permission?

As I mentioned in the first post: the maximum number of Google SERP requests per hour is 90 (actually 80, i.e. 8 phrases, as John said: https://stackoverflow.com/questions/...nswer-22703153).

"use 50% resources..."
So for 2 users it would be 4 phrases available for each user. 3 users = 2 or 3 phrases. As I said it's really not enough.

My plan assumed that 1 user could manage 40 different phrases (which is 400 requests per hour).
If 1 IP = 80 requests/hour, then I need 5 proxy servers for 1 user. Given my cost limitations and the purely presentational purpose of this sad project, I could accept that 24 phrases per user would be enough, and hopefully 3 users would be enough for my presentation. So overall I would need 9 unique IPs for the project.
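A quick back-of-the-envelope check of those numbers (10 result pages per phrase, at most roughly 80 page requests per IP per hour):

<?php
// How many dedicated IPs one user needs for a given number of phrases per hour.
function proxies_needed(int $phrases, int $pagesPerPhrase = 10, int $maxPerIp = 80): int
{
    return (int) ceil(($phrases * $pagesPerPhrase) / $maxPerIp);
}

echo proxies_needed(40), PHP_EOL; // 400 requests/hour -> 5 IPs per user
echo proxies_needed(24), PHP_EOL; // 240 requests/hour -> 3 IPs per user; 3 users -> 9 IPs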

Now the technical aspect of implementation is still a problem.

What about IP sharing...
Sharing a proxy IP with some random John Doe who Googles things would add to my requests per hour and make the whole application very random and unprofessional.
How can I be sure, or almost sure, that purchased proxies don't share their IP with someone else?
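You generally cannot prove exclusivity from your own side; that guarantee has to come from the provider (a "dedicated" or "private" proxy plan). What you can do is verify which exit IP a purchased proxy actually presents, for example by routing a request to an IP-echo service through it and logging the answer over time. A small sketch, with a placeholder proxy address:

<?php
// Report the public IP seen by the outside world when going through a given proxy.
// This shows which address is in use; it cannot prove that nobody else shares it.
function exit_ip_via_proxy(string $proxyHost, int $proxyPort): ?string
{
    $ch = curl_init('https://api.ipify.org'); // returns the caller's public IP as plain text
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxyHost,
        CURLOPT_PROXYPORT      => $proxyPort,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $ip = curl_exec($ch);
    curl_close($ch);

    return $ip === false ? null : trim($ip);
}

var_dump(exit_ip_via_proxy('203.0.113.10', 8080)); // placeholder proxy address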
2 Weeks Ago #5

gits
Expert Mod 5K+
P: 5,204
How such a service does it wasn't the question here - but I assume they have contracts/permissions of some kind. If they didn't - then the same issue of blocking without warning could appear - and why would Google not be interested in protecting its searches? They offer all the infrastructure and search APIs, and if someone could simply use them without limits - it would be more than simple to just copy their main service without having to make much investment - so Google protects its own investment by not allowing such things out of the box for free - anything else certainly wouldn't be in their interest. To find out how to do it correctly, I repeat - ask Google itself. Using search in an automated way is against the ToS as far as I am aware; I think this comes into effect here:

https://support.google.com/webmaster...er/66357?hl=en

so obviously simply ask google for permission.
2 Weeks Ago #6

Rabbit
Expert Mod 10K+
P: 12,303
You're in murky water here. While merely violating a terms of service isn't a violation of the law, using technological means to bypass restrictions might carry criminal liability. In the case of Facebook v. Power Ventures:
The Court finds that a distinction can be made between access that violates a term of use and access that circumvents technical or code-based barriers that a computer network or website administrator erects to restrict the user’s privileges within the system, or to bar the user from the system altogether. Limiting criminal liability to circumstances in which a user gains access to a computer, computer network, or website to which access was restricted through technological means eliminates any constitutional notice concerns, since a person applying the technical skill necessary to overcome such a barrier will almost always understand that any access gained through such action is unauthorized.
Source: https://www.eff.org/deeplinks/2010/0...rime-bypassing
EFF also links to the court decision if you prefer to read it in full detail.

Power Ventures attempted to come to an arrangement with Facebook to access their data, but decided it would be too expensive and tried to scrape Facebook's data another way. That led to Facebook v. Vachani (Power Ventures' CEO), in which he was found liable for unauthorized access.
We affirm the district court’s holding that Vachani is personally liable for Power’s actions. A “corporate officer or director is, in general, personally liable for all torts which he authorizes or directs or in which he participates
2 Weeks Ago #7

P: 4
Thanks for getting involved in my problem.
Rabbit, thank you for the interesting article.

One last question: where can I get help after obtaining permission from Google? (As if anybody could help me :/)
2 Weeks Ago #8

Rabbit
Expert Mod 10K+
P: 12,303
Are you asking for help with getting permission from Google? I can't help with that; you could try their help forum via the link that gits posted above.

Or are you asking for help with the scraping after getting permission? In that case, you won't have to deal with masking your IP: if you come to a data sharing arrangement with Google, they won't be rate limiting you, and you can send however many requests you like at whatever speed you like.
2 Weeks Ago #9

gits
Expert Mod 5K+
P: 5,204
... you won't have to deal with masking your IP ...
Exactly - that was the point I was trying to lead this to - because once such fogging is done, it starts to become fishy, and even more so, it is intentionally working around the laissez-faire limits of Google. That falls, to use nicer words for it, at least into the 'cheating' category - and the cheated service provider - in this case Google - would certainly not be amused if they found out.
1 Week Ago #10
