Merlin, a fun little program

I posted to my web site a fun little program called merlin.py today.
Please keep in mind that I am a hobbyist and this is just a little hack.
If you look at the code you will see that it is still possible to write
spaghetti code, even with Python. I apologize, and I do intend to clean
up the code, but it may take a while. For now it works, with some bugs.

It is a composite of a few scripts. The first, based on a script Max M
uploaded to this newsgroup a while ago (2 years?), is a web scraper
based multiple choice guesser. I re-wrote the web scraper to use Yahoo
rather than Google, as Google somehow recognizes it as a script now and
so has disabled the ability to use Google, as they say it violates their
terms of service. I certainly do not want to violate anyone's terms of
service; this is just a fun little script. I also used string
functions instead of regexes and an algorithm of my own. Kudos to David
Mertz' Text Processing in Python for helping me figure out how to do
this, indirectly. (BTW, I also posted a review of his new book on my web
site...and submitted it to Slashdot, but one never knows if they will
run it).
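
A minimal sketch of the general idea, not the actual merlin.py code: score each candidate answer by how often it appears in the text of a results page, using only string methods. The function names and the sample text are hypothetical, and the counting rule is an assumption; the post does not describe the real algorithm.

```python
def score_options(page_text, options):
    """Count case-insensitive occurrences of each option in page_text."""
    text = page_text.lower()
    # str.count counts non-overlapping substring matches, so an option
    # can also match inside a longer word.
    return {opt: text.count(opt.lower()) for opt in options}


def guess(page_text, options):
    """Return the option that appears most often, or None if none appear."""
    if not options:
        # Nothing to choose from: report that instead of raising.
        return None
    scores = score_options(page_text, options)
    best = max(options, key=lambda opt: scores[opt])
    return best if scores[best] > 0 else None


# Stand-in for text fetched from a search results page.
sample = "Paris is the capital of France. Paris sits on the Seine, not in Rome."
print(guess(sample, ["Rome", "Paris", "Berlin"]))
```

Returning None for an empty option list is one way to avoid a crash when the user supplies no options.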

The stand alone version of the web scraper (askMerlin.py) uses NLQ, a
natural query language class found on the web at
http://gurno.com/adam/nlq/ to identify possible answers to a user's
questions, to then be submitted to the main algorithm to choose amongst
the possible answers, which I call options. Of course, the program is
much more likely to be accurate when you give it a correct "option" to
be picked out from amongst several incorrect options that you also give
it; and in fact a bug in the composite program I call Merlin
(merlin.py) crashes completely if you do not give it any options; but
this can be fixed. askMerlin.py doesn't crash and uses NLQ, but gives
poor answers. However, I have a much better algorithm in mind for this
part of the program; instead of giving NLQ the main response page from a
query, I will give it the first "link" page from a query, which I
reckon to be much more likely to contain keywords that represent good
possible answers. Alas, this may have to wait until the next long
weekend, unless someone else takes up the task ;-)))
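
The "first link page" idea could start from something like the following sketch, which pulls the first absolute URL out of a results page using only string functions. The name first_link is hypothetical, and the sketch assumes double-quoted href attributes; a real results page would also need the search engine's own navigation links filtered out, which is not attempted here.

```python
def first_link(html):
    """Return the first absolute http(s) URL in a double-quoted href, or None."""
    marker = 'href="'
    pos = 0
    while True:
        start = html.find(marker, pos)
        if start == -1:
            return None          # no more href attributes
        start += len(marker)
        end = html.find('"', start)
        if end == -1:
            return None          # unterminated attribute; give up
        url = html[start:end]
        if url.startswith(("http://", "https://")):
            return url
        pos = end + 1            # relative link; keep scanning


page = '<a href="/help">Help</a> <a href="http://example.com/a">A</a>'
print(first_link(page))
```

The page at the returned URL would then be fetched and handed to NLQ in place of the main response page.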

In the long run, the program is much more interesting using NLQ to find
answers to questions where the user offers no possible answers to choose
amongst or other clues; I think this has potential.

For now, please give Merlin options to choose amongst. Then, I include a
slightly improved Decision Analysis script, and two fun variations or
specific applications of it. This script has the virtue of being my own
creation, although I did receive help from Paul Winkler and others on
this list.

Then I also include a script shamelessly stolen off the web that will be
instantly recognizable to most of you on this newsgroup, but perhaps not
to some newbies.

I have in mind more such fun stuff to be added.

Also, I intend to do a full GUI version, with a much better user
interface, and then to create executable installers for Windows, Linux,
and Mac OS X. For now though, the command line interface has the
advantage of working anywhere one can get a Python command prompt; I have
tested it on Windows, Linux, Mac OS X and the Sharp Zaurus PDA. The
addition of a GUI and creation of executable files should keep this
hobbyist busy for a while ;-)))

A GUI version of Decision Analysis, that I wrote using PythonCard, is
available already.

All of the above can wait until I add more fun stuff to it, make it
better, fix bugs, move it from the deprecated regex to the re module,
and clean up the code!
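
For the move to the re module, a small sketch of the usual pattern may help. The pattern and the function name page_title are hypothetical and are not taken from merlin.py. The point to watch when porting is that re's search returns a match object (or None when nothing matches), so the result has to be checked before calling group.

```python
import re

# Hypothetical pattern: pull the text out of an HTML <title> element.
# Compiled once at import time and reused for every page.
TITLE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html):
    """Return the stripped <title> text, or None if there is no title."""
    match = TITLE.search(html)
    if match is None:
        return None
    return match.group(1).strip()


print(page_title("<html><TITLE> Merlin </TITLE></html>"))
```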

OK, so this hack may not be worth all the words I've given it, but, in
the spirit of computer programming for everybody, I am pleased that I am
producing something. I think it might be something other newbies might
be able to understand and hack on also, since it is so simple.

If not, so be it. I am having fun.

All of this is on my web site, right at the top, at
http://www.awaretek.com/plf.html

Ron Stephens
Jul 18 '05 #1
On Mon, Jul 07, 2003 at 01:37:15AM +0000, Ron Stephens wrote:
> based multiple choice guesser. I re-wrote the web scraper to use Yahoo
> rather than Google, as Google somehow recognizes it as a script now and
> so has disabled the ability to use Google, as they say it violates their
> terms of service. I certainly do not want to violate anyone's terms of
> service


You can still use Google for this - just sign up for the Google API.

http://www.google.com/apis/
http://diveintomark.org/projects/pygoogle/

Oren

Jul 18 '05 #2
Oren wrote """You can still use Google for this """

Yes indeed. I have played with the Google APIs, registered, and also
use pygoogle. They make this kind of thing easier, no doubt about it.
The reason I used my hand-rolled web scraper on Yahoo is that using
the Google APIs means that other potential users, like those who
download from my web site, can't run my code if it uses the Google APIs
unless they download and register also, which might be a pain for them.

At any rate, doing my own was fun and informative for me. A big
disadvantage to web scraping is that the code tends to break over
time, though. This happened to me with Max M's original, two-year-old
algorithm. Google broke it, and I didn't realize it until I came back and
retried the code.

The two links you gave are good ones and I studied both in my efforts;
I recommend them. Thanks for the inputs.

I guess the bigger question is: is there anything wrong with web
scraping? I surely never meant any harm in it, and certainly no money
is involved. But maybe I should give it up and do other things?

Ron Stephens
Jul 18 '05 #3
On Mon, Jul 07, 2003 at 02:05:28PM -0700, Ron Stephens wrote:
> I guess the bigger question is; is there anything wrong with web
> scraping? I surely never meant any harm in it, and certainly no money
> is involved. But maybe I should give it up and do other things?


I don't think there's anything fundamentally wrong with web scraping but
you have to consider the fact that a single script can easily consume
resources that cost real money and would otherwise serve thousands of
human users. If a provider installs mechanisms to detect scripts and block
them, this can quickly become a cat-and-mouse game where the scrapers try
to fool these mechanisms. It starts to get ugly and everybody suffers.
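
One way a script can keep its resource use modest is to space out its requests. The following is only a sketch under assumptions: the Throttle class and its parameters are hypothetical, and a suitable interval depends on the site and its terms of service.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one Throttle, for example Throttle(2.0), and call its wait() method before each page fetch.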

I think Google handled this very well - defusing most of the problem by
letting people have what they want while keeping things under control.
I generally find the way Google handles the issues that come with their
dominant market position quite "Pythonic".

Oren

Jul 18 '05 #4
