Joho the Blog » Group Bayesianism

Group Bayesianism

I blogged the other day about a Bayesian spam filter that runs as a service to which you can hook up your local POP email client. Bayesian filters learn from their analysis of the word usage patterns of content that a user has initially sorted into spam and non-spam piles. This new service, Death2Spam, pools the sortings of all of its users. Since I am a very happy user of a Bayesian filter (PopFile) that runs on my own machine, I asked the service’s creator, Richard Jowsey, why we should prefer a service like his that pools all subscribers’ spam sortings.
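
For readers who want to see what "learning from word usage patterns" can look like, here is a minimal naive Bayes sketch in Python. It is an illustration only, not how PopFile or Death2Spam actually work: the function names `train` and `spam_probability`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count words per pile. `messages` is a list of (text, is_spam) pairs.

    Hypothetical helper for illustration; assumes both piles are non-empty.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Estimate P(spam | words) with naive Bayes and add-one smoothing."""
    vocab_size = len(set(counts[True]) | set(counts[False]))
    spam_words = sum(counts[True].values())
    ham_words = sum(counts[False].values())
    # Start from the prior odds implied by the sizes of the two piles.
    log_odds = math.log(totals[True]) - math.log(totals[False])
    for word in text.lower().split():
        p_word_spam = (counts[True][word] + 1) / (spam_words + vocab_size)
        p_word_ham = (counts[False][word] + 1) / (ham_words + vocab_size)
        log_odds += math.log(p_word_spam) - math.log(p_word_ham)
    # Clamp so that math.exp cannot overflow on very long messages.
    log_odds = max(-700.0, min(700.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Toy training data: two spam messages and two non-spam messages.
sorted_mail = [
    ("cheap pills online", True),
    ("cheap watches now", True),
    ("lunch meeting tomorrow", False),
    ("project meeting notes", False),
]
counts, totals = train(sorted_mail)
```

A pooled service would, in effect, feed everyone's sorted piles into one such table, while a personal filter builds the table from a single user's mail.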

Q: What do you do about the “my spam is your important msg” problem? Shouldn’t a well-trained personal collection be more useful than a large multi-person collection?

A: Damn fine question, sir! Yes, and no. Our generic database is currently 99% accurate for the average user. It contains word-probability analysis from about 100k messages. And does the basic job really well, for most people. BTW, all our server stats are freely available to worthy causes like anti-spam research. Interesting reading.

We’re presently adding a “premium” (aka geek) version, which will allow you to combine personal database probs with the generic data, weighted per your preferences. Plus support for adjusting upper and lower unsure limits. Plus a URL crawler for nailing micro-spams. Plus the mandatory (but largely vestigial) black/white lists. That ought to keep everyone happy… :)

Micro-spam? I don’t know what that is but I have a bunch of offers for extending it, thickening it and making it last longer if that’ll help any.

FWIW, after a few months of using PopFile, occasionally correcting its mistakes, it’s now about 99.5% accurate. That is, about 1 in 200 messages that it’s classified as spam is in fact non-spam.
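
The "premium" features Richard describes, blending personal and generic probabilities by a user-chosen weight and adjusting the upper and lower unsure limits, might be sketched roughly as follows. This is a guess at the general idea, not Death2Spam's implementation: the linear blend at the message level, the names `combined_probability` and `classify`, and the default limits of 0.2 and 0.8 are all assumptions.

```python
def combined_probability(p_personal, p_generic, personal_weight=0.5):
    """Blend two spam probabilities; personal_weight must lie in [0, 1].

    Hypothetical: a simple weighted average, chosen only for illustration.
    """
    if not 0.0 <= personal_weight <= 1.0:
        raise ValueError("personal_weight must be between 0 and 1")
    return personal_weight * p_personal + (1.0 - personal_weight) * p_generic

def classify(p_spam, lower_unsure=0.2, upper_unsure=0.8):
    """Map a spam probability to 'ham', 'unsure', or 'spam'.

    The two limits are assumed defaults; anything strictly between them
    is left for the user to sort by hand.
    """
    if p_spam >= upper_unsure:
        return "spam"
    if p_spam <= lower_unsure:
        return "ham"
    return "unsure"

# A user who trusts their own database three times as much as the pool.
blended = combined_probability(p_personal=0.9, p_generic=0.3, personal_weight=0.75)
verdict = classify(blended)
```

Widening the gap between the two limits sends more mail to the unsure pile, trading convenience for fewer outright misclassifications.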
