
Simple Bayesian classifier?

Hi all,

I am trying to build an application to classify texts from a number of
sources. I am programming it in PHP and I go "by the book" - i.e.
calculating probabilities according to the formula etc.
It works, but it's very slow (due to slow PHP mathematical
implementation, I guess).
Is there some variation of the Naive Bayes classifier which is not so
demanding in the way of computing power used?

Best
Pavel
Jun 8 '07 #1


On Jun 8, 11:52 am, Pavel Kalinov <pavk...@gmail.com> wrote:
> Hi all,
>
> I am trying to build an application to classify texts from a number of
> sources. I am programming it in PHP and I go "by the book" - i.e.
> calculating probabilities according to the formula etc.
> It works, but it's very slow (due to slow PHP mathematical
> implementation, I guess).
> Is there some variation of the Naive Bayes classifier which is not so
> demanding in the way of computing power used?
>
> Best
> Pavel

SpamAssassin's code is open source; have you checked that out?
http://svn.apache.org/viewvc/spamass...pm?view=markup
AFAIK PHP offloads its maths to C libraries, so the arithmetic itself is
unlikely to be the bottleneck. More likely, working strictly by the book,
with no code optimisation techniques (hash tables and so on), is simply
much more computationally intensive.
(A mathematician and C programmer I know got their code to run in 2 days
rather than 2 weeks after some optimisation.)
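
As an illustration of the hash-table idea, here is a minimal sketch. It is
not taken from SpamAssassin; it is written in Python for brevity, and the
same structure should map onto PHP associative arrays. It assumes a
multinomial Naive Bayes model with Laplace smoothing, and the names
NaiveBayes, alpha, train and classify are hypothetical. The point is that
all logarithms are taken once at training time and stored per category, so
classifying a text costs one table lookup and one addition per word per
category, with no multiplication of tiny probabilities.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with log-probabilities precomputed in dicts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha        # Laplace smoothing constant (assumed value)
        self.vocab = set()        # every word seen during training
        self.log_prior = {}       # category -> log P(category)
        self.log_likelihood = {}  # category -> {word: log P(word | category)}
        self.log_unseen = {}      # category -> log P(word | category) for a
                                  # vocabulary word absent from that category

    def train(self, labelled_docs):
        """labelled_docs: iterable of (category, list_of_words) pairs."""
        doc_counts = Counter()
        word_counts = defaultdict(Counter)
        for category, words in labelled_docs:
            doc_counts[category] += 1
            word_counts[category].update(words)
            self.vocab.update(words)
        total_docs = sum(doc_counts.values())
        for category, counts in word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            self.log_prior[category] = math.log(doc_counts[category] / total_docs)
            self.log_likelihood[category] = {
                w: math.log((c + self.alpha) / denom) for w, c in counts.items()
            }
            self.log_unseen[category] = math.log(self.alpha / denom)

    def classify(self, words):
        """Return the category with the highest log-posterior score."""
        known = [w for w in words if w in self.vocab]  # skip never-seen words
        best, best_score = None, -math.inf
        for category, prior in self.log_prior.items():
            table = self.log_likelihood[category]
            unseen = self.log_unseen[category]
            score = prior + sum(table.get(w, unseen) for w in known)
            if score > best_score:
                best, best_score = category, score
        return best
```

Training is the slow part and can be done offline; if the precomputed
tables are stored (serialised, or in a database keyed by word), each
request only performs the lookups in classify.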

Jun 8 '07 #2

At Fri, 08 Jun 2007 20:52:39 +1000, Pavel Kalinov let h(is|er) monkeys
type:
> Hi all,
>
> I am trying to build an application to classify texts from a number of
> sources. I am programming it in PHP and I go "by the book" - i.e.
> calculating probabilities according to the formula etc.
> It works, but it's very slow (due to slow PHP mathematical
> implementation, I guess).
> Is there some variation of the Naive Bayes classifier which is not so
> demanding in the way of computing power used?
>
> Best
> Pavel

You may like http://xhtml.net/php/PHPNaiveBayesianFilter
I am a bit surprised you get such slow responses; the typical algorithms
don't seem to be extremely taxing.

As part of an author authenticity scoring app, Naive Bayesian filtering
proved quite useful. For spam filtering, its use (by itself) proves rather
limited: quite a few spam creators (scripts) are well equipped these days
to lower their scores substantially, allowing their messages to leak through.

hth

--
Schraalhans Keukenmeester - sc*********@the.Spamtrapexample.nl
[Remove the lowercase part of Spamtrap to send me a message]

"strcmp('apples','oranges') < 0"

Jun 8 '07 #3

Thanks, I didn't know this - will look into it.
BTW, I am not trying to make a spam filter, but to sort news articles into
a number of categories (16 at present, as a test). And I need
milliseconds, not days :-(

Best
Pavel

shimmyshack wrote:
> On Jun 8, 11:52 am, Pavel Kalinov <pavk...@gmail.com> wrote:
>> Hi all,
>>
>> I am trying to build an application to classify texts from a number of
>> sources. I am programming it in PHP and I go "by the book" - i.e.
>> calculating probabilities according to the formula etc.
>> It works, but it's very slow (due to slow PHP mathematical
>> implementation, I guess).
>> Is there some variation of the Naive Bayes classifier which is not so
>> demanding in the way of computing power used?
>>
>> Best
>> Pavel
>
> SpamAssassin's code is open source; have you checked that out?
> http://svn.apache.org/viewvc/spamass...pm?view=markup
> AFAIK PHP offloads its maths to C libraries; so your problem is that
> it can be much more computationally intensive to work by the book,
> with no code optimisation techniques etc... (hash tables and so on).
> (A mathematician C programmer I know got their code to run in 2 days
> rather than 2 weeks after some optimisation)
Jun 11 '07 #4

Pavel Kalinov wrote:
> BTW, I am not trying to make a spam filter, but to sort news articles in
> a number of categories (16 at present, as test). And I need
> milliseconds, not days :-(

Still, SpamAssassin might be what you're looking for.

Turn off all SA's non-Bayes scoring, and then feed SA a corpus of, say, 500
sports articles, telling it that they're "spam"; then 500 non-sports
articles, telling it that they're "ham". After this preparation, your SA
configuration should be primed to detect sports articles.

Repeat this for another 15 SA configurations, one per remaining category,
and your setup should be complete.

With SA, one user can have multiple configurations using the "--configpath"
command-line option.
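
The scheme above (one "this category vs. everything else" filter per
category) is usually called one-vs-rest classification. The following is a
minimal sketch of that idea only; it does not call SpamAssassin, it is
written in Python rather than PHP for brevity, and the function names
train_binary, train_one_vs_rest and classify are hypothetical. Each binary
filter is assumed to be a smoothed Naive Bayes log-odds scorer, and the
article goes to whichever category's filter scores it highest.

```python
import math
from collections import Counter


def train_binary(positive_docs, negative_docs, alpha=1.0):
    """Train one 'in category' vs. 'rest' filter.

    Returns ({word: log-odds weight}, bias). alpha is an assumed
    Laplace smoothing constant.
    """
    pos, neg = Counter(), Counter()
    for words in positive_docs:
        pos.update(words)
    for words in negative_docs:
        neg.update(words)
    vocab = set(pos) | set(neg)
    pos_denom = sum(pos.values()) + alpha * len(vocab)
    neg_denom = sum(neg.values()) + alpha * len(vocab)
    weights = {
        w: math.log((pos[w] + alpha) / pos_denom)
        - math.log((neg[w] + alpha) / neg_denom)
        for w in vocab
    }
    # Prior log-odds from the document counts on each side.
    bias = math.log(len(positive_docs) / len(negative_docs))
    return weights, bias


def train_one_vs_rest(docs_by_category):
    """docs_by_category: {category: [list_of_words, ...]}, two or more categories."""
    models = {}
    for category, docs in docs_by_category.items():
        rest = [d for other, ds in docs_by_category.items()
                if other != category for d in ds]
        models[category] = train_binary(docs, rest)
    return models


def classify(models, words):
    """Return the category whose filter gives the highest log-odds score."""
    def score(model):
        weights, bias = model
        # Words unknown to a filter contribute nothing to its score.
        return bias + sum(weights.get(w, 0.0) for w in words)
    return max(models, key=lambda category: score(models[category]))
```

With 16 categories this means 16 small weight tables and 16 sums per
article, which is a modest amount of work once the tables are precomputed.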

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.12-12mdksmp, up 108 days, 16 min.]

URLs in demiblog
http://tobyinkster.co.uk/blog/2007/05/31/demiblog-urls/
Jun 11 '07 #5
