Bytes IT Community

Python good for data mining?

P: n/a
I'm starting a project in data mining, and I'm considering Python and
Java as possible platforms.

I'm concerned about performance. Most benchmarks report that Java is
about 10-15 times faster than Python, and my own experiments confirm
this. I could imagine this becoming a problem for very large
datasets.

How good is the integration with MySQL in Python?

What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)

What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?

I really like Python for a number of reasons, and would like to avoid
Java.

Sorry - lot of questions here - but I look forward to your replies!

Nov 4 '07 #1
18 Replies


P: n/a
Jens wrote:
I'm starting a project in data mining, and I'm considering Python and
Java as possible platforms.

I'm concerned about performance. Most benchmarks report that Java is
about 10-15 times faster than Python, and my own experiments confirm
this. I could imagine this becoming a problem for very large
datasets.
If most of the processing is done with SQL calls, this shouldn't be an
issue. I've known a couple of people at Sydney University who were
using Python for data mining. I think they were using sqlite3 and MySQL.
>
How good is the integration with MySQL in Python?
Never tried it, but a quick google reveals a number of approaches you
could try - the MySQLdb module, MySQL for Python, etc.
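
For illustration, here is a minimal sketch of the DB-API 2.0 pattern (connect, cursor, execute, fetch) that MySQLdb follows. To keep it runnable without a MySQL server it uses the standard-library sqlite3 module as a stand-in, and the table and column names are made up. With MySQLdb the connect() arguments would differ and the parameter placeholder is %s rather than ?.

```python
import sqlite3


def load_and_query(rows, min_amount=10.0):
    """Insert (customer, amount) rows, then fetch those above a threshold."""
    # sqlite3 is a stand-in here; MySQLdb.connect(...) would return a
    # connection object with the same cursor/execute/fetch interface.
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
        # Parameterized statements: the driver does the quoting.
        cur.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        conn.commit()
        cur.execute(
            "SELECT customer, amount FROM purchases "
            "WHERE amount > ? ORDER BY amount",
            (min_amount,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

Because both modules follow the same DB-API conventions, code written this way should mostly carry over when the backend changes.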
>
What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)
wxPython was recommended to me when I was learning how to create a GUI.
It has more features than Tkinter and a more native look and feel across
platforms. With wxPython it was fairly easy to create a multi-pane,
tabbed interface for a couple of programs, without using an IDE. The
demos/tutorials were fantastic.
>
What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?
Possibly the simplest way would be Python .cgi files. The cgi and cgitb
modules allow form data to be read fairly easily. Cookies are also
fairly simple. For a more complicated but more customisable approach,
you could look into the BaseHTTPServer module or a socket listener of
some sort, running that alongside the webserver publicly or privately.
Publicly, you'd have links from the rest of your PHP/whatever pages to
the Python server. Privately, the PHP/Perl/Java backend would request
data from the local Python server before feeding the results back
through the main server (Apache?) to the client.
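
As a rough sketch of the "private" setup described above: a small Python HTTP service that a PHP/Perl/Java backend could query on localhost. It assumes Python 3, where BaseHTTPServer lives on as http.server. The word_count function, the "text" query parameter and the JSON reply format are invented for the example; the demo function plays the role of the other backend by making one request and shutting the server down again.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import ProxyHandler, build_opener


def word_count(text):
    """Hypothetical stand-in for a call into your own Python library."""
    return len(text.split())


class MiningHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read ?text=... from the URL and answer with a small JSON document.
        query = parse_qs(urlparse(self.path).query)
        text = query.get("text", [""])[0]
        body = json.dumps({"words": word_count(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo_round_trip():
    """Start the server on a free local port, query it once, stop it."""
    server = HTTPServer(("127.0.0.1", 0), MiningHandler)  # port 0: OS picks
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
        url = "http://127.0.0.1:%d/?text=python+is+fun" % port
        with opener.open(url, timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
```

The PHP side would then only need an ordinary HTTP request to that local URL and a JSON decode of the reply.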
Nov 4 '07 #2

P: n/a
What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?

Possibly the simplest way would be Python .cgi files. The cgi and cgitb
modules allow form data to be read fairly easily. Cookies are also
fairly simple. For a more complicated but more customisable approach,
you could look into the BaseHTTPServer module or a socket listener of
some sort, running that alongside the webserver publicly or privately.
Publicly, you'd have links from the rest of your PHP/whatever pages to
the Python server. Privately, the PHP/Perl/Java backend would request
data from the local Python server before feeding the results back
through the main server (Apache?) to the client.
Thanks a lot! I'm not sure I completely understand your description of
how to integrate Python with, say PHP. Could you please give a small
example? I have no experience with Python web development using CGI.
How easy is it compared to web development in PHP?

I still haven't made up my mind about the choice of programming
language for my data mining project. I think it's a difficult
decision. My heart tells me "Python" and my head tells me "Java" :-)

Nov 4 '07 #3

P: n/a
Jens wrote:
>
Thanks a lot! I'm not sure I completely understand your description of
how to integrate Python with, say PHP. Could you please give a small
example? I have no experience with Python web development using CGI.
How easy is it compared to web development in PHP?

I still haven't made up my mind about the choice of programming
language for my data mining project. I think it's a difficult
decision. My heart tells me "Python" and my head tells me "Java" :-)
My C++ lecturer used to tell us "'C++ or Java?' is never the question.
For that matter, Java is never the answer."

As for python and cgi, it's pretty simple. Instead of a .php file to be
handled by the php-handler, you have a .cgi file which is handled by the
cgi-handler. Set the action of your html form to the .cgi file. At the
top of the .cgi file, you'll need a line like:

#!/usr/bin/env python

Which tells it to use python as the interpreter. You'll need a few imports:

import cgi
import cgitb; cgitb.enable()  # for debugging - it renders your
                              # exceptions and error messages as HTML.
print("Content-Type: text/html; charset=iso-8859-1\n")
# You need that line or something similar so the browser knows what to
# do with the output of the script. The blank line after the header
# (the "\n" plus print's own newline) ends the header section.

Everything that's printed by the Python script goes straight to the
client's browser, so the script will have to print HTML. The cgi module
handles form data: typically formdata = cgi.FieldStorage() will be
filled when a form is sent to the script. Print it and see what's in it.
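
A minimal sketch of a complete CGI-style script along these lines. Note that the cgi and cgitb modules were removed from the standard library in Python 3.13, so this version reads the form data itself with urllib.parse.parse_qs, which covers the simple GET case. The "name" field and the page content are made up for the example.

```python
#!/usr/bin/env python3
import html
import os
from urllib.parse import parse_qs


def render_page(query_string):
    """Build the full CGI response (header, blank line, HTML body)."""
    fields = parse_qs(query_string)
    name = fields.get("name", ["world"])[0]  # "name" is a hypothetical field
    header = "Content-Type: text/html; charset=utf-8\n\n"
    # Escape user input before putting it into HTML.
    body = "<html><body><p>Hello, %s!</p></body></html>" % html.escape(name)
    return header + body


if __name__ == "__main__":
    # For a GET request the web server passes the form data in QUERY_STRING.
    print(render_page(os.environ.get("QUERY_STRING", "")), end="")
```

An HTML form with method="get" and its action set to this script would end up calling render_page with something like "name=Jens".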

From here, there's a huge number of tutorials on python and cgi on the
web and I'm tired.

Best of luck,

Cameron.
Nov 4 '07 #4

P: n/a
Jens wrote:
What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)
Tkinter is easy but looks ugly (yeah folks, I know it doesn't matter in
your mission-critical flight control system). Apart from ActiveState's
Komodo I'm not aware of any GUI builders. Very likely you don't need one.
>
What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?
How do you "integrate" Perl and PHP? The usual methods are calling
external programs (slow) or using some IPC method (socket, xmlrpc, corba).
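
To make the IPC option concrete, here is a small XML-RPC sketch using only the Python 3 standard library. The cluster_sizes function is a made-up stand-in for your own library code; the demo starts a server on a free local port, calls it the way a client in another language would through its own XML-RPC library, and shuts it down.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def cluster_sizes(labels):
    """Hypothetical library function: count items per cluster label."""
    counts = {}
    for label in labels:
        key = str(label)  # XML-RPC struct keys must be strings
        counts[key] = counts.get(key, 0) + 1
    return counts


def demo_call():
    """Expose cluster_sizes over XML-RPC and call it once from a client."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(cluster_sizes)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        with ServerProxy("http://127.0.0.1:%d/" % port) as proxy:
            return proxy.cluster_sizes([0, 1, 1, 2, 1])
    finally:
        server.shutdown()
        server.server_close()
```

Only basic types (numbers, strings, lists, string-keyed dicts) cross the wire, which is what keeps the interface usable from PHP, Perl or Java.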
>
I really like Python for a number of reasons, and would like to avoid
Java.
Have you looked at Jython?

cheers
Paul

Nov 4 '07 #5

P: n/a
Jens wrote:
I'm starting a project in data mining, and I'm considering Python and
Java as possible platforms.

I'm concerned about performance. Most benchmarks report that Java is
about 10-15 times faster than Python,
Benchmarking is difficult, and most benchmarks are easily 'oriented'.
(pure) Python is slower than Java for some tasks, and as fast as C for
some others. In the first case, it's quite possible that a C-based
package exists.
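
A quick way to see this for yourself is to time a hand-written loop against a built-in that is implemented in C. This is only a sketch: the function names are made up, and the actual numbers depend entirely on your machine and Python version, so none are quoted here.

```python
import timeit


def total_loop(values):
    """Sum a list with an explicit Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total


def total_builtin(values):
    """Sum a list with the built-in sum(), which runs in C in CPython."""
    return sum(values)


def compare(n=100_000, repeats=20):
    """Return the total time in seconds for each variant."""
    values = list(range(n))
    return {
        "loop": timeit.timeit(lambda: total_loop(values), number=repeats),
        "builtin": timeit.timeit(lambda: total_builtin(values), number=repeats),
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("%-8s %.4f s" % (name, seconds))
```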
and my own experiments confirm
this.
<bis mode="Benchmarking is difficult">
If you go that way, Java is way slower than C++ - and let's not talk
about resources...
</bis>
I could imagine this to become a problem for very large
datasets.
If you have very large datasets, you're probably using a serious RDBMS,
that will do most of the job.
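
As a small illustration of letting the database do the work, the sketch below pushes a per-customer aggregation into SQL instead of looping over rows in Python. It uses sqlite3 only so that it runs anywhere; the orders table and its columns are invented for the example.

```python
import sqlite3


def totals_per_customer(rows):
    """Return (customer, order_count, total_amount) computed by the database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # GROUP BY runs inside the database engine, so Python only sees
        # one row per customer rather than every order.
        cur = conn.execute(
            "SELECT customer, COUNT(*), SUM(amount) "
            "FROM orders GROUP BY customer ORDER BY customer"
        )
        return cur.fetchall()
    finally:
        conn.close()
```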
How good is the integration with MySQL in Python?
Pretty good - but I wouldn't call MySQL a serious RDBMS.
What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)
If your GUI is complex and important enough to need a GUI builder (which
I guess is what you mean by IDE), then forget about Tkinter, and go for
either PyGTK, PyQt or wxPython.
What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?
HTTP is language-agnostic.
Nov 4 '07 #6

P: n/a
On Nov 3, 9:02 pm, Jens <j3n...@gmail.com> wrote:
I'm starting a project in data mining, and I'm considering Python and
Java as possible platforms.

I'm concerned about performance. Most benchmarks report that Java is
about 10-15 times faster than Python, and my own experiments confirm
this. I could imagine this becoming a problem for very large
datasets.

How good is the integration with MySQL in Python?

What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)

What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?

I really like Python for a number of reasons, and would like to avoid
Java.

Sorry - lot of questions here - but I look forward to your replies!

All of my programming is data-centric. Data mining is foundational
therein. I started learning computer science via Python in 2003. I
too was concerned about its performance, especially considering my
need for literally trillions of iterations over financial data tables
with mathematical algorithms.

I then learned C and then C++. I am now coming home to Python, realizing
after my self-education that programming in Python is truly a pleasure
and that performance is not the concern I first considered it to be.
Here's why:

Python is very easily extended to near-C speed. The idea that FINALLY
sank in was that I should first program my ideas in Python WITHOUT
CONCERN FOR PERFORMANCE. Then, profile the application to find the
"bottlenecks" and move those blocks of code to C or C++. Cython,
Pyrex and SIP are my preferred Python extension frameworks.
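
The "profile first" step can be sketched with the standard-library cProfile and pstats modules. The two workload functions below are made up purely to have something to measure; in practice you would pass in your own entry point.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    """Hypothetical application entry point."""
    return slow_sum_of_squares(200_000)


def profile_report(func, top=5):
    """Run func under the profiler and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_report(main))
```

The functions at the top of the cumulative-time listing are the candidates for a rewrite in Cython or C; everything else can usually stay in plain Python.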

NumPy/SciPy are excellent libraries for optimized mathematical
operations. PyTables is my preferred Python database because of
its excellent API to the acclaimed HDF5 database (used by very many
scientists and government organizations).

As for a GUI framework, I have studied Qt intensely and would therefore
very highly recommend PyQt.

After four years of intense study, I can say without a doubt that
Python is most certainly the way to go. I personally don't understand
why, generally, there is any attraction to Java, though I have yet to
study it further.

Nov 5 '07 #7

P: n/a
On 5 Nov., 04:42, "D.Hering" <vel.ac...@gmail.com> wrote:
On Nov 3, 9:02 pm, Jens <j3n...@gmail.com> wrote:
I'm starting a project in data mining, and I'm considering Python and
Java as possible platforms.
I'm concerned about performance. Most benchmarks report that Java is
about 10-15 times faster than Python, and my own experiments confirm
this. I could imagine this becoming a problem for very large
datasets.
How good is the integration with MySQL in Python?
What about user interfaces? How easy is it to use Tkinter for
developing a user interface without an IDE? And with an IDE? (which
IDE?)
What if I were to use my Python libraries with a web site written in
PHP, Perl or Java - how do I integrate with Python?
I really like Python for a number of reasons, and would like to avoid
Java.
Sorry - lot of questions here - but I look forward to your replies!

All of my programming is data-centric. Data mining is foundational
therein. I started learning computer science via Python in 2003. I
too was concerned about its performance, especially considering my
need for literally trillions of iterations over financial data tables
with mathematical algorithms.

I then learned C and then C++. I am now coming home to Python, realizing
after my self-education that programming in Python is truly a pleasure
and that performance is not the concern I first considered it to be.
Here's why:

Python is very easily extended to near-C speed. The idea that FINALLY
sank in was that I should first program my ideas in Python WITHOUT
CONCERN FOR PERFORMANCE. Then, profile the application to find the
"bottlenecks" and move those blocks of code to C or C++. Cython,
Pyrex and SIP are my preferred Python extension frameworks.

NumPy/SciPy are excellent libraries for optimized mathematical
operations. PyTables is my preferred Python database because of
its excellent API to the acclaimed HDF5 database (used by very many
scientists and government organizations).

As for a GUI framework, I have studied Qt intensely and would therefore
very highly recommend PyQt.

After four years of intense study, I can say without a doubt that
Python is most certainly the way to go. I personally don't understand
why, generally, there is any attraction to Java, though I have yet to
study it further.
Thanks a lot! I agree, Python is a pleasure to program in.

So what you're saying is, don't worry about performance when you start
coding, but use profiling and optimization in C/C++. Sounds
reasonable. It's been 10 years since I've done any programming in
C++, so I'll have to pick that up again soon, I guess.

I've used NumPy for improving my K-Means algorithm, and it now runs
33% faster than "pure" Python. I guess it could be improved upon
further.
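
As a rough idea of where further gains usually come from, here is a sketch of the two K-means steps written with NumPy array operations instead of Python loops over points. This is not the code discussed above, just an illustration; the function names are made up and NumPy must be installed.

```python
import numpy as np


def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Broadcasting builds an (n_points, k, n_dims) array of differences,
    # so all squared distances are computed without a Python-level loop.
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


def update_centroids(points, labels, k):
    """Move each centroid to the mean of the points assigned to it."""
    # Assumes every cluster has at least one point; an empty cluster
    # would produce NaNs here and needs separate handling.
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])
```

Alternating these two calls until the labels stop changing gives the basic algorithm; the remaining Python loop runs over the k clusters only, not over the data points.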

I will have a look at PyQt!

Nov 5 '07 #8

P: n/a
On Nov 3, 9:02 pm, Jens <j3n...@gmail.com> wrote:
How good is the integration with MySQL in Python?
I think it's very good. However, I'm not sure
how good SQL really is for data mining, depending
on what you mean by that.

Please have a look at nucular for this kind of thing
-- I've advertised it as a "full text search engine" but
the intended use is close to some varieties of
data mining.

http://nucular.sourceforge.net/

The mondial demo is the closest thing to data mining
currently live:

http://www.xfeedme.com/nucular/mondi...o?FREETEXT=hol

If you find that you need some functionality that is
missing, it may be easy to add. Let me know.
For example the underlying indexing methodology can
be manipulated in many clever ways to get different
performance characteristics.

-- Aaron Watters

===
When at first you don't succeed
give up and blame your parents.
-- seen on a tee shirt

Nov 5 '07 #9

P: n/a
On Nov 5, 1:51 pm, Jens <j3n...@gmail.com> wrote:
On 5 Nov., 04:42, "D.Hering" <vel.ac...@gmail.com> wrote:
On Nov 3, 9:02 pm, Jens <j3n...@gmail.com> wrote:
I then learned C and then C++. I am now coming home to Python, realizing
after my self-education that programming in Python is truly a pleasure
and that performance is not the concern I first considered it to be.
Here's why:
Python is very easily extended to near-C speed. The idea that FINALLY
sank in was that I should first program my ideas in Python WITHOUT
CONCERN FOR PERFORMANCE. Then, profile the application to find the
"bottlenecks" and move those blocks of code to C or C++. Cython,
Pyrex and SIP are my preferred Python extension frameworks.
NumPy/SciPy are excellent libraries for optimized mathematical
operations. PyTables is my preferred Python database because of
its excellent API to the acclaimed HDF5 database (used by very many
scientists and government organizations).

So what you're saying is, don't worry about performance when you start
coding, but use profiling and optimization in C/C++. Sounds
reasonable. It's been 10 years since I've done any programming in
C++, so I'll have to pick that up again soon, I guess.
"Premature optimization is the root of all evil", to quote a famous
person. And he's right, as most people working on larger codebases will
confirm.

As for PyTables: it is the most elegant programming interface for HDF
on any platform that I've encountered so far. Most other platforms
stay close to the HDF5 library's C interface, which is low-level and quite
complex. PyTables was written with the end user in mind, and it shows.
One correction though: PyTables is not a database; it is storage for
(large) arrays, data blocks that you don't want in a database. Use a
database for the metadata to find the right file and field within that
file. Keep in mind though that I mostly work with externally created
HDF5 files, not with files created in PyTables. PyTables Pro has an
indexing feature which may be helpful for data mining (if you write the
HDF5 files from Python).
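
The "database for the metadata" idea can be sketched as a small catalog that maps search keys to a file and a path inside that file. Everything here is hypothetical: the datasets table, its columns, and the example file and node names are made up, and sqlite3 is used only to keep the sketch self-contained. Opening the HDF5 file itself would then be a separate step with PyTables.

```python
import sqlite3


def build_catalog(entries):
    """Create an in-memory catalog of (station, year, file_path, node_path)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets ("
        "station TEXT, year INTEGER, file_path TEXT, node_path TEXT)"
    )
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", entries)
    conn.commit()
    return conn


def locate(conn, station, year):
    """Return the (file_path, node_path) pairs holding the requested data."""
    cur = conn.execute(
        "SELECT file_path, node_path FROM datasets "
        "WHERE station = ? AND year = ?",
        (station, year),
    )
    return cur.fetchall()
```

The large arrays stay in the HDF5 files; the relational side only answers "which file, which node?", which is the kind of query it is good at.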

Maarten

Nov 5 '07 #10

P: n/a
On 5 Nov., 16:29, Maarten <maarten.sn...@knmi.nl> wrote:
On Nov 5, 1:51 pm, Jens <j3n...@gmail.com> wrote:
"Premature optimization is the root of all evil", to quote a famous
person. And he's right, as most people working on larger codebases will
confirm.
I guess I'll have to agree with that. Still, I would like to get some
kind of indication of whether it's a good idea to use NumPy from the start
of the project - for example. It depends on the situation of course.

Nov 5 '07 #11

P: n/a
Jens wrote:
On 5 Nov., 16:29, Maarten <maarten.sn...@knmi.nl> wrote:
On Nov 5, 1:51 pm, Jens <j3n...@gmail.com> wrote:
"Premature optimization is the root of all evil", to quote a famous
person. And he's right, as most people working on larger codebases will
confirm.

I guess I'll have to agree with that. Still, I would like to get some
kind of indication of whether it's a good idea to use NumPy from the start
of the project - for example.
Yes.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 5 '07 #12

P: n/a
On 5 Nov., 16:29, Maarten <maarten.sn...@knmi.nl> wrote:
As for PyTables: it is the most elegant programming interface for HDF
on any platform that I've encountered so far. Most other platforms
stay close to the HDF5 library's C interface, which is low-level and quite
complex. PyTables was written with the end user in mind, and it shows.
One correction though: PyTables is not a database; it is storage for
(large) arrays, data blocks that you don't want in a database. Use a
database for the metadata to find the right file and field within that
file. Keep in mind though that I mostly work with externally created
HDF5 files, not with files created in PyTables. PyTables Pro has an
indexing feature which may be helpful for data mining (if you write the
HDF5 files from Python).
PyTables? Wow! Looks amazing - I'll have to try that out soon. Thanks!
Nov 5 '07 #13

P: n/a
On Nov 4, 4:36 pm, Bruno Desthuilliers
<bdesth.quelquech...@free.quelquepart.fr> wrote:
>
How good is the integration with MySQL in Python?

Pretty good - but I wouldn't call MySQL a serious RDBMS.
I would disagree with this, for this particular case.
I think it's probably better than most other RDBMSs for apps
like data mining where you don't need transactional support,
especially if you use the table implementations that don't
support ACID transactions.

-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydis...tation%20fault
Nov 5 '07 #14

P: n/a
Aaron Watters wrote:
On Nov 4, 4:36 pm, Bruno Desthuilliers
<bdesth.quelquech...@free.quelquepart.fr> wrote:
How good is the integration with MySQL in Python?

Pretty good - but I wouldn't call MySQL a serious RDBMS.


I would disagree with this, for this particular case.
I think it's probably better than most other RDBMSs for apps
like data mining where you don't need transactional support,
Mmm... Yes, probably right.
Nov 5 '07 #15

P: n/a
Maarten <ma***********@knmi.nl> writes:
>
"Premature optimization is the root of all evil", to quote a famous
person. And he's right, as most people working on larger codebases will
confirm.
But note that it's "premature optimization...", not "optimization..." :)

Nov 5 '07 #16

P: n/a
On Nov 5, 10:29 am, Maarten <maarten.sn...@knmi.nl> wrote:
On Nov 5, 1:51 pm, Jens <j3n...@gmail.com> wrote:
On 5 Nov., 04:42, "D.Hering" <vel.ac...@gmail.com> wrote:
On Nov 3, 9:02 pm, Jens <j3n...@gmail.com> wrote:
I then learned C and then C++. I am now coming home to Python, realizing
after my self-education that programming in Python is truly a pleasure
and that performance is not the concern I first considered it to be.
Here's why:
Python is very easily extended to near-C speed. The idea that FINALLY
sank in was that I should first program my ideas in Python WITHOUT
CONCERN FOR PERFORMANCE. Then, profile the application to find the
"bottlenecks" and move those blocks of code to C or C++. Cython,
Pyrex and SIP are my preferred Python extension frameworks.
NumPy/SciPy are excellent libraries for optimized mathematical
operations. PyTables is my preferred Python database because of
its excellent API to the acclaimed HDF5 database (used by very many
scientists and government organizations).
So what you're saying is, don't worry about performance when you start
coding, but use profiling and optimization in C/C++. Sounds
reasonable. It's been 10 years since I've done any programming in
C++, so I'll have to pick that up again soon, I guess.

"Premature optimization is the root of all evil", to quote a famous
person. And he's right, as most people working on larger codebases will
confirm.

As for PyTables: it is the most elegant programming interface for HDF
on any platform that I've encountered so far. Most other platforms
stay close to the HDF5 library's C interface, which is low-level and quite
complex. PyTables was written with the end user in mind, and it shows.
One correction though: PyTables is not a database; it is storage for
(large) arrays, data blocks that you don't want in a database. Use a
database for the metadata to find the right file and field within that
file. Keep in mind though that I mostly work with externally created
HDF5 files, not with files created in PyTables. PyTables Pro has an
indexing feature which may be helpful for data mining (if you write the
HDF5 files from Python).

Maarten
Hi Maarten,

I respectfully disagree that HDF5 is not a DB. It's true that HDF5,
prima facie, is not relational but rather hierarchical.

Hierarchical is truly a much more natural/elegant[1] design from my
perspective. HDF has always had metadata capabilities, and with the
new 1.8 beta version available, it is extending them with
'references/links' that allow for pure/partial relational datasets,
groups, and files, as well as storing self-implemented indexing.

The C API is obviously much lower-level, and PyTables does not yet
support these new features.

[1] Anything/everything that is physical/virtual, or can be conceived,
is hierarchical... if the system itself is not random/chaotic. That's a
lovely revelation I've had... EVERYTHING is hierarchical. If it has
context it has hierarchy.

Nov 6 '07 #17

P: n/a
"D.Hering" <vel.a..mail.com> wrote:
>
[1] Anything/everything that is physical/virtual, or can be conceived,
is hierarchical... if the system itself is not random/chaotic. That's a
lovely revelation I've had... EVERYTHING is hierarchical. If it has
context it has hierarchy.
Do I hear Echoes of What Was Said by a chappie
who rejoiced in the name of Aristotle?

;-)

- Hendrik

Nov 6 '07 #18

P: n/a
On Nov 6, 4:19 am, "Hendrik van Rooyen" <m...@microcorp.co.za> wrote:
"D.Hering" <vel.a..mail.com> wrote:
[1] Anything/everything that is physical/virtual, or can be conceived,
is hierarchical... if the system itself is not random/chaotic. That's a
lovely revelation I've had... EVERYTHING is hierarchical. If it has
context it has hierarchy.

Do I hear Echoes of What Was Said by a chappie
who rejoiced in the name of Aristotle?
The 20th century perspective found it more flexible to base
everything on set theory (or category theory or similar),
which is fundamentally relational. Historically,
hierarchical/network databases preceded RDBMSs because they
are fundamentally more efficient. Unfortunately, they are
also fundamentally more inflexible (it is generally agreed).

-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydis...scii+christmas
Nov 6 '07 #19
