
Making a search on another site, getting the data, and writing it as XML

P: n/a
Hi,
is it possible to run searches on a site, for example Google, without an API,
using a list of words?
1- there is a word list
2- the script takes the words from the list one at a time
3- it runs the search
4- it gets the results
5- it writes the results to an XML file.

I don't mean only Google; other sites as well.

I hope we get a result.

Sep 25 '06 #1
25 Replies


P: n/a
al**********@gmail.com wrote:
is it possible to run searches on a site, for example Google, without an API,
using a list of words?
1- there is a word list
2- the script takes the words from the list one at a time
3- it runs the search
4- it gets the results
5- it writes the results to an XML file.
http://www.google.com/terms_of_service.html

"You may not send automated queries of any sort to Google's system without express
permission in advance from Google."

</F>

Sep 25 '06 #2

P: n/a

I don't mean only Google, but other sites as well.

Sep 25 '06 #3

P: n/a

al**********@gmail.com wrote:
I don't mean only Google, but other sites as well.
Google expressly forbids doing any form of automated search outside of
their api. If you want to write a script that will run Google searches,
you have to use the api to do so. As far as I know most of the other
search sites have the same requirement.

Yes, it is possible to query a bunch of search sites and dump the
results into an xml file. It is not even all that hard. In fact, I bet
running a search on the relevant terms will probably produce something
that almost does what you want.

-Adam

Sep 25 '06 #4

P: n/a
Thank you very much for your explanations. I don't mean a search engine;
I mean, for example, a dictionary site for looking up words.

Sep 25 '06 #5

P: n/a
For example, here is how I search one site and collect the results
(Python 2):

#!/usr/bin/python
# -*- coding: windows-1254 -*-

import urllib

dictionary = {}  # wow, it's actually a dictionary
words = ['apple', 'banana', 'cheese']
for word in words:
    # fetch the lookup page for this word and keep the raw response body
    dictionary[word] = urllib.urlopen("http://www.example.com/look.php?w=" + word).read()

print dictionary

I don't know how to read the words from a txt file so that they are
searched one at a time.

Sep 26 '06 #6

P: n/a

And also how to write the results to an HTML or XML file.
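
For the XML half of the question, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is written for Python 3, which is an assumption (the code earlier in the thread is Python 2). The element names `results` and `entry`, the `word` attribute, and both function names are invented for illustration; nothing in the thread fixes a particular XML layout.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Build an ElementTree from a {word: result_text} mapping.

    The layout (<results><entry word="...">text</entry></results>) is
    a hypothetical choice made for this sketch.
    """
    root = ET.Element("results")
    for word, text in results.items():
        entry = ET.SubElement(root, "entry", word=word)
        # ElementTree escapes &, < and > when the tree is serialized
        entry.text = text
    return ET.ElementTree(root)


def write_results(results, path):
    """Write the mapping to `path` as UTF-8 XML with an XML declaration."""
    results_to_xml(results).write(path, encoding="utf-8", xml_declaration=True)
```

Calling `write_results(dictionary, "results.xml")` on a word-to-text mapping like the one built in the loop above would produce one `entry` element per word. Raw HTML stored as text is escaped rather than parsed, so the output stays well-formed even if the fetched page is not.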

Sep 26 '06 #7

P: n/a
On Mon, 25 Sep 2006 13:51:55 +0200, Fredrik Lundh wrote:
http://www.google.com/terms_of_service.html

"You may not send automated queries of any sort to Google's system without express
permission in advance from Google."
I'm not just being a pedantic weasel here, but what's an automated query?
Google's ToS is a legal document (maybe), and if both parties don't agree
on the meanings of terms, well, then it is a lousy legal document and a
recipe for trouble.

Google don't define "automated query", and I don't think they can. In
fact, the closest they come to defining it is to list three things they
want to prevent, NONE of which have anything to do with the distinction
between automated and non-automated.

(What on earth is "meta-searching"? If you're going to use terms which
don't have a commonly understood meaning, define what they mean.)

If I want to search for "foo", and I type "foo" into the Firefox search
box, is that an automated query?

What if I type "gg: foo" into Konqueror's address bar, which expands to
"http://www.google.com/search?q=foo"? Is it okay if I type the URL by hand
myself?

Can I use the browser to save the search page to a local HTML file? If
Google says no, how can they possibly hope to stop me?

What if I type this command into my shell?

elinks --dump "http://www.google.com/search?q=foo" > output.html

What if I type

wget "http://www.google.com/search?q=foo"

into the shell? Surely that's no more automated than typing "foo"
into Google's search box. (wget doesn't in fact work, as Google recognises
its user-agent string and blocks it, EVEN in cases where I am using wget
manually. What, can't Google themselves tell the difference between
automatic and non-automatic searching?)

Where is the line I must not cross?

The thing is, Google doesn't want people "reselling" their services, and I
respect Google's intention. But trying to draw a distinction between
"automated" and "non-automated" requests is difficult if not impossible,
as can be seen by the heavy-handed way Google blocks the manual use of
wget. I don't condone the gross abuse of Google's service, but I don't
think an artificial distinction between automated and non-automated is a
useful way to go about it.

Of course, what I think isn't important. If Google wants to write legal
contracts that won't stand up in court (speaking as somebody who isn't a
lawyer and whose legal advice is worthless), they can. But the point is, I
see no ethical nor legal reason why a user can't create a script which is
called MANUALLY by the user and does what a browser does, namely send and
receive data from websites (which may or may not include Google).

And that, it seems to me, is what the Original Poster wanted.

--
Steven D'Aprano

Sep 26 '06 #8

P: n/a
al**********@gmail.com wrote:
i dont know how i can get the words from a txt file for searching by
turn
checking the "reading and writing files" section in the tutorial might
be somewhat helpful:

http://docs.python.org/tut/node9.htm...00000000000000
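
As a rough illustration of what that tutorial section covers, a minimal Python 3 sketch that reads one word per line from a text file. The function names, the one-word-per-line layout, and the UTF-8 default are assumptions made for this example; the original poster's file format and encoding are not stated.

```python
def parse_words(lines):
    """Return the non-empty lines, stripped of surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]


def read_words(path, encoding="utf-8"):
    """Read one search word per line from the text file at `path`.

    The encoding is a guess; pass e.g. encoding="windows-1254" if the
    file was saved that way.
    """
    with open(path, encoding=encoding) as handle:
        # a file object iterates over its lines, newline included
        return parse_words(handle)
```

The resulting list can replace the hard-coded `words` list in a loop such as `for word in read_words("words.txt"):`, where `words.txt` is a hypothetical file name.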

</F>

Sep 26 '06 #9

P: n/a
Steven D'Aprano wrote:
On Mon, 25 Sep 2006 13:51:55 +0200, Fredrik Lundh wrote:

http://www.google.com/terms_of_service.html

"You may not send automated queries of any sort to Google's system without express
permission in advance from Google."


I'm not just being a pedantic weasel here, but what's an automated query?
Google's ToS is a legal document (maybe), and if both parties don't agree
on the meanings of terms, well, then it is a lousy legal document and a
recipe for trouble.

Google don't define "automated query", and I don't think they can. In
fact, the closest they come to defining it is to list three things they
want to prevent, NONE of which have anything to do with the distinction
between automated and non-automated.
The fact remains that Google can chop your searching ability off at the
knees if *they* determine that you have broken the terms of service, so
whether you agree or not becomes slightly academic.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 26 '06 #10

P: n/a
Steven D'Aprano wrote:
Google don't define "automated query", and I don't think they can.
the phrases they use are well understood in the SE business. that's
good enough for everyone involved (including courts; see below).
(What on earth is "meta-searching"? If you're going to use terms which
don't have a commonly understood meaning, define what they mean.)
http://en.wikipedia.org/wiki/Metasearch_engine
If I want to search for "foo", and I type "foo" into the Firefox search
box, is that an automated query?
nope. unless you're a robot.
What if I type "gg: foo" into Konqueror's address bar, which expands to
"http://www.google.com/search?q=foo"? Is it okay if I type the URL by hand
myself?
nope. unless you're a robot.
Can I use the browser to save the search page to a local HTML file? If
Google says no, how can they possibly hope to stop me?
what you do with the search results once you've gotten them is outside
the scope of that clause.
What if I type this command into my shell?

elinks --dump "http://www.google.com/search?q=foo" > output.html

What if I type

wget "http://www.google.com/search?q=foo"

into the shell? Surely that's no more automated than typing "foo"
into Google's search box.
neither is automated, unless you're a robot.
Where is the line I must not cross?
letting a program generate search requests based on something other than
"human wants to find something and types some keywords into a prompt
somewhere".
And that, it seems to me, is what the Original Poster wanted.
the OP wanted to read keywords from a text file generated in some
unknown fashion. that's bot behaviour, not human behaviour.
Of course, what I think isn't important. If Google wants to write legal
contracts that won't stand up in court (speaking as somebody who isn't a
lawyer and whose legal advice is worthless)
well, "here's some random guy who didn't understand the terms used in
the contract" isn't a valid defense in court; courts are more interested
in whether people with experience from the relevant field can reasonably
be expected to understand the contract. but this isn't about court
cases, of course; it's about getting banned by Google for abusing their
services.

</F>

Sep 26 '06 #11

P: n/a
GOOGLE IS NOT OUR SUBJECT ANY MORE.

MY GOAL IS NOT MAKING SEARCH ON GOOGLE:
MY GOAL IS MAKING A SEARCH ON
www.onelook.com, for example

Sep 26 '06 #12

P: n/a
al**********@gmail.com wrote:
GOOGLE IS NOT OUR SUBJECT ANY MORE.

MY GOAL IS NOT MAKING SEARCH ON GOOGLE:
MY GOAL IS MAKING A SEARCH ON
www.onelook.com, for example

"""
Can you send me the list of words in the index? May I extract it from your
site?
No, sorry. If you're thinking about writing a script to systematically copy
OneLook.com's word list, please don't. It's not yours to copy, for one
thing. But also, it wastes tremendous bandwidth and slows things down for
other users. We have software in place to detect the abuse of our service
and we'll alert your ISP if you violate our trust in you. If you're looking
for a decent-sized downloadable word list, try WordNet, which offers that
and much more. If you're working on a project for school or academic
research, let us know and we might be able to help steer you in the right
direction.
"""

Consider this: if you offered your neighbours the courtesy of an occasional
lemonade, would that mean you like them stomping around in your
kitchen?

Nearly all sites that offer a service like this will have policies of
that kind. So get a grip, stop shouting, and start thinking about whether
what you are trying to do is legal or socially acceptable. If it isn't, and
you don't care, be my guest, but don't ask for help here!

Diez
Sep 26 '06 #13

P: n/a
al**********@gmail.com wrote:
GOOGLE IS NOT OUR SUBJECT ANY MORE.

MY GOAL IS NOT MAKING SEARCH ON GOOGLE:
MY GOAL IS MAKING A SEARCH ON
www.onelook.com, for example
this is usenet; you don't "own" the threads you start. if there's a
subthread that you don't find relevant to your original question, just
ignore it.

</F>

Sep 26 '06 #14

P: n/a
I don't mean Google.
I don't mean onelook.com.

These are only examples.

I hope you understand what I mean.

Sep 26 '06 #15

P: n/a
al**********@gmail.com wrote:
I don't mean Google.
I don't mean onelook.com.

These are only examples.

I hope you understand what I mean.
Apparently, *you* don't understand what they're trying to tell you. It
roughly boils down to the following:

- All (except perhaps the most trivial small) sites disallow in their
Terms of Service the unregulated harvesting of their content by
webbots, both for legal and technical reasons. It's not just Google or
Onelook that does this.
- Yes, it is technically possible to attempt to violate their ToS,
running the risk of being caught (with whatever consequences this
implies).
- Yes, you *might* be able to get away with it (at least for some time)
running in stealth mode.
- No, people here are not willing to help you go down this road, you're
on your own.
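
One machine-readable signal related to these points is a site's robots.txt file. It is not the same thing as the Terms of Service, and passing this check does not mean a site permits automated access, but a script can at least consult it before fetching anything. A minimal Python 3 sketch with the standard library's urllib.robotparser follows; the function name, the `wordlist-bot` agent string, and the sample rules are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Report whether the given robots.txt text lets `user_agent` fetch `url`.

    `robots_txt` is the file's content as a string, so the check itself
    needs no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: every robot is asked to stay out of /search
SAMPLE_RULES = "User-agent: *\nDisallow: /search\n"

if not allowed_by_robots(SAMPLE_RULES, "wordlist-bot",
                         "http://www.example.com/search?q=foo"):
    print("robots.txt asks robots not to fetch this URL")
```

In a real script the rules would come from the site itself, for example via `RobotFileParser.set_url()` followed by `read()`, and the site's written terms would still need to be read by a human.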

Hope this helps,
George

Sep 27 '06 #16

P: n/a
In message <pa****************************@REMOVEME.cybersource.com.au>,
Steven D'Aprano wrote:
If Google wants to write legal
contracts that won't stand up in court (speaking as somebody who isn't a
lawyer and whose legal advice is worthless), they can.
What they define as their terms of service doesn't have to stand up in
court. They're not a public service, after all. If you do something that
they don't like, they are free to try to block you from their servers, they
don't need to appeal to any other authority.

wget --user-agent="I'm not Microsoft Internet Explorer, I'm Wget" -O - \
http://www.google.co.nz/search\?q=test
Sep 27 '06 #17

P: n/a
In message <ma**************************************@python.org>, Steve
Holden wrote:
The fact remains that Google can chop your searching ability off at the
knees ...
No they can't. They can only chop off your ability to use Google.

Sep 27 '06 #18

P: n/a
Lawrence D'Oliveiro wrote:
In message <ma**************************************@python.org>, Steve
Holden wrote:

The fact remains that Google can chop your searching ability off at the
knees ...


No they can't. They can only chop off your ability to use Google.
[sigh]. Right, Lawrence, sorry I wasn't quite explicit enough for you.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 27 '06 #19

P: n/a
Steve Holden <st***@holdenweb.com> writes:
Lawrence D'Oliveiro wrote:
Steve Holden wrote:
The fact remains that Google can chop your searching ability off
at the knees ...
No they can't. They can only chop off your ability to use Google.
[sigh]. Right, Lawrence, sorry I wasn't quite explicit enough for you.
Seems like a fairly important distinction. Google has the power to
"chop your searching ability off at the knees" only to the extent that
you grant them that power.

--
\ "[...] a Microsoft Certified System Engineer is to information |
`\ technology as a McDonalds Certified Food Specialist is to the |
_o__) culinary arts." -- Michael Bacarella |
Ben Finney

Sep 27 '06 #20

P: n/a
In message <ma**************************************@python.org>, Ben Finney
wrote:
Steve Holden <st***@holdenweb.com> writes:
Lawrence D'Oliveiro wrote:
Steve Holden wrote:
The fact remains that Google can chop your searching ability off
at the knees ...
No they can't. They can only chop off your ability to use Google.
[sigh]. Right, Lawrence, sorry I wasn't quite explicit enough for you.

Seems like a fairly important distinction. Google has the power to
"chop your searching ability off at the knees" only to the extent that
you grant them that power.
Saying "search" when you mean "Google" is like saying "using a PC" when you
mean "using Microsoft Windows".
Sep 27 '06 #21

P: n/a
Lawrence D'Oliveiro wrote:
In message <ma**************************************@python.org>, Ben Finney
wrote:

Steve Holden <st***@holdenweb.com> writes:

Lawrence D'Oliveiro wrote:

Steve Holden wrote:

The fact remains that Google can chop your searching ability off
at the knees ...

No they can't. They can only chop off your ability to use Google.
[sigh]. Right, Lawrence, sorry I wasn't quite explicit enough for you.

Seems like a fairly important distinction. Google has the power to
"chop your searching ability off at the knees" only to the extent that
you grant them that power.


Saying "search" when you mean "Google" is like saying "using a PC" when you
mean "using Microsoft Windows".
Well, I thought it was self-evident that since I was referring to Google
I wasn't talking about Alta Vista searching. If I said "Microsoft have
the ability to terminate your license" presumably you'd chastise me by
pointing out that they wouldn't be able to revoke my *Linux* license.
Whatever.

"There's none as thick as them that wants to be."

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Sep 27 '06 #22

P: n/a
OK, I close this discussion.
I understand everybody, no problem.

Sep 27 '06 #23

P: n/a
al**********@gmail.com wrote:
OK, I close this discussion.
No, you don't.

Stefan
Sep 27 '06 #24

P: n/a
George Sakkis wrote:
al**********@gmail.com wrote:
I don't mean Google.
I don't mean onelook.com.

These are only examples.

I hope you understand what I mean.

Apparently, *you* don't understand what they're trying to tell you. It
roughly boils down to the following:
If we just step back from the brink for a moment and give the
questioner the benefit of the doubt - that the exercise merely involves
automating some kind of interactions that would otherwise require lots
of manual messing around piloting a browser, rather than performing
some kind of bulk "suck down" of an entire site's information - then it
is obviously possible to use the following techniques:

* Use a well-known mirroring or archiving tool such as wget.
* Use various testing tools, some of which are written in Python.
* Use urllib, urllib2 or httplib plus an HTML or XML parser in your
own program.
* Automate a Web browser using some off-the-shelf program.
* Use various automation mechanisms provided by your environment
(eg. COM, DCOP), possibly with Python libraries (eg. PAMIE [1],
KPart Plugins [2]).
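
The third option in the list (urllib plus an HTML parser) might look roughly like the sketch below. It assumes Python 3, where the modules named above live under urllib.request and html.parser; the class and function names are invented for this example. Which pages a script may fetch remains subject to each site's terms.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url, timeout=10):
    """Fetch one page and return its visible text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        # fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))
```

Keeping `html_to_text` separate from `fetch_text` means the parsing step can be tried on a saved page without contacting any site.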

Various sites forbid wget and friends as a rule, understandably, but
there are sometimes reasons why you might want to use various tools to
automate a procedure involving lots of data which would waste a huge
amount of time if done manually. Perhaps you might have mail residing
in a Webmail system which can't be extracted via any process other than
reading all the messages in a browser, for example, or perhaps your
favourite Internet applications don't provide decent shortcuts to the
information you need, instead believing that it's all about the
"experience": surfing around watching all the animated adverts.
Automation and related technologies can legitimately help users regain
control of their Internet-resident data and make better use of the
services around it.

Paul

[1] http://pamie.sourceforge.net/
[2] http://www.boddie.org.uk/python/kpartplugins.html

Sep 27 '06 #25

P: n/a
In message <11**********************@e3g2000cwe.googlegroups.com>, Paul
Boddie wrote:
Various sites forbid wget and friends as a rule, understandably ...
No, that is not understandable.

Oct 6 '06 #26
