Group Bayesianism
I blogged the other day about a Bayesian spam filter that runs as a service to which you can hook up your local POP email client. Bayesian filters learn by analyzing word-usage patterns in messages that a user has already sorted into spam and non-spam piles. This new service, Death2Spam, pools the sortings of all of its users. Since I am a very happy user of a Bayesian filter (POPFile) that runs on my own machine, I asked the service’s creator, Richard Jowsey, why we should prefer a service like his that pools all subscribers’ spam sortings.
Q: What do you do about the “my spam is your important msg” problem? Shouldn’t a well-trained personal collection be more useful than a large multi-person collection?
A: Damn fine question, sir! Yes, and no. Our generic database is currently 99% accurate for the average user. It contains word-probability analysis from about 100k messages. And does the basic job really well, for most people. BTW, all our server stats are freely available to worthy causes like anti-spam research. Interesting reading.
We’re presently adding a “premium” (aka geek) version, which will allow you to combine personal database probs with the generic data, weighted per your preferences. Plus support for adjusting upper and lower unsure limits. Plus a URL crawler for nailing micro-spams. Plus the mandatory (but largely vestigial) black/white lists. That ought to keep everyone happy… :)
Micro-spam? I don’t know what that is but I have a bunch of offers for extending it, thickening it and making it last longer if that’ll help any.
FWIW, after a few months of using POPFile and occasionally correcting its mistakes, it’s now about 99.5% accurate. That is, about 1 in 200 messages that it’s classified as spam is in fact non-spam.
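For the curious, here is a rough sketch of how a pooled database and a personal one might be blended, with upper and lower "unsure" limits along the lines Richard describes. Everything in it is my assumption, not Death2Spam's or POPFile's actual code: the WordModel class, the add-one smoothing, the simple linear blend, and the 0.2/0.8 default limits are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude tokenizer: lowercase, split on whitespace, one vote per word per message.
    return set(text.lower().split())


class WordModel:
    """Toy naive Bayes model built from messages sorted into spam and non-spam."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.msg_counts = {True: 0, False: 0}

    def train(self, text, is_spam):
        self.msg_counts[is_spam] += 1
        self.word_counts[is_spam].update(tokenize(text))

    def spam_probability(self, text):
        # Work in log-odds; add-one smoothing (an assumption) avoids zero counts.
        log_odds = math.log((self.msg_counts[True] + 1) / (self.msg_counts[False] + 1))
        for word in tokenize(text):
            p_spam = (self.word_counts[True][word] + 1) / (self.msg_counts[True] + 2)
            p_ham = (self.word_counts[False][word] + 1) / (self.msg_counts[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def blend(personal_p, generic_p, personal_weight):
    # Hypothetical linear blend: 1.0 trusts only the personal data, 0.0 only the pool.
    if not 0.0 <= personal_weight <= 1.0:
        raise ValueError("personal_weight must be between 0 and 1")
    return personal_weight * personal_p + (1 - personal_weight) * generic_p


def classify(p, lower=0.2, upper=0.8):
    # Scores between the two limits are left for the user to sort by hand.
    if p >= upper:
        return "spam"
    if p <= lower:
        return "ham"
    return "unsure"
```

Train one WordModel on everybody's piles and another on your own, score a message with both, and pass the two probabilities through blend and then classify. A mortgage broker's personal model should pull "mortgage rates" mail back toward non-spam even if the pooled model leans the other way, which is the "my spam is your important msg" problem in miniature.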