October 10, 2004
Bayesian spellchecker?
In the ’90s, IBM had a machine translation project that bested rule-based translators simply by using probabilities derived from the word-usage patterns in a large corpus of manually translated material. (They used the French and English versions of the proceedings of the Canadian Parliament.) Now Bayesian spam filters are all the rage, using word-frequency analyses of known spam and non-spam to decide which folder to put a particular message in.
So why not use similar analyses to guide spellchecker alternatives? An analysis of my corpus of documents would reveal that the putative word “cheast” is more likely to be “cheats” if used near the words “game” and “stuck,” but more likely to be “chaste” if used near “Britney” and “supposedly.” Given how well Bayesian spam filters work – they work really well – I might even want to say that if the spellchecker is, say, 95% confident, it should make the change without asking me, while letting me review all the auto-changed words, of course.
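Roughly, I imagine something like the sketch below: count how often words show up near one another in my own documents, then rank the correction candidates by those counts. This is only an illustration of the idea, not a real spellchecker; the function names, the toy corpus, and the smoothing are all made up.

```python
from collections import Counter, defaultdict

# Rough sketch (illustration only): rank correction candidates for a
# misspelling by how often each candidate appears near the misspelling's
# context words in the user's own documents.

def train(sentences, window=3):
    """Count, for every word, how often each other word appears within `window` positions."""
    freq = Counter()
    cooc = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            freq[w] += 1
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for c in words[lo:i] + words[i + 1:hi]:
                cooc[w][c] += 1
    return freq, cooc

def score(candidate, context, freq, cooc, alpha=1.0):
    """Naive-Bayes-style score: P(candidate) times the product of
    P(context word | candidate), with add-alpha smoothing so an unseen
    context word doesn't zero out the whole product."""
    total = sum(freq.values()) or 1
    p = freq[candidate] / total
    seen = sum(cooc[candidate].values())
    vocab = len(freq) or 1
    for c in context:
        p *= (cooc[candidate][c] + alpha) / (seen + alpha * vocab)
    return p

def suggest(candidates, context, freq, cooc, threshold=0.95):
    """Rank the candidates; if the winner holds more than `threshold` of the
    probability mass, it could be applied automatically (and logged for review)."""
    scored = {cand: score(cand, context, freq, cooc) for cand in candidates}
    total = sum(scored.values()) or 1.0
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    return ranked, (best_score / total) >= threshold

# Tiny made-up corpus; the real thing would index the user's own documents.
corpus = [
    "i got stuck in the game so i looked up the cheats".split(),
    "the game has cheats for every level".split(),
    "britney was supposedly chaste in that interview".split(),
]
freq, cooc = train(corpus)
# "cheats" should rank first near "game" and "stuck".
print(suggest(["cheats", "chaste"], ["game", "stuck"], freq, cooc))
```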
I am a genuine admirer of Microsoft Word’s spellchecker; in fact, it’s one of the things keeping me from switching to Open Office. Not only does Word’s UI let me correct errors the way I want, jumping from clicking on a list to editing in context, but its first suggestion is almost always the right one. So, I assume Microsoft has analyzed some generic corpus to get the probabilities right. Why not analyze my corpus, too? And do it folder by folder, across time, and by document type. Why not?
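Just to make “folder by folder” concrete: the simplest version I can imagine keeps a word-frequency table per folder and blends it with a generic corpus when ranking suggestions. Again, this is only a sketch of the shape of the thing; the directory walk and the 70/30 mix are made up for illustration.

```python
import os
from collections import Counter

# "Folder by folder" could just mean: one word-frequency table per folder,
# blended with a generic corpus. All names and numbers here are illustrative.

def folder_counts(root):
    """Word counts per folder, from the plain-text files under root."""
    counts = {}
    for dirpath, _dirs, files in os.walk(root):
        c = Counter()
        for name in files:
            if name.endswith(".txt"):
                with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                    c.update(f.read().lower().split())
        if c:
            counts[dirpath] = c
    return counts

def blended_prob(word, folder, per_folder, generic, weight=0.7):
    """P(word): weighted toward the current folder's usage, backed off to a generic corpus."""
    local = per_folder.get(folder, Counter())
    p_local = local[word] / (sum(local.values()) or 1)
    p_generic = generic[word] / (sum(generic.values()) or 1)
    return weight * p_local + (1 - weight) * p_generic
```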