Joho the Blog » Latent Semantic Search

Latent Semantic Search

The always-provocative Arnold Kling suggests in an email that we take a look at the Semant-o-Matic site, which uses latent semantic indexing (LSI) to search blogs. The current site is an open source test bed, indexing only 11 blog sites, but the idea is intriguing. Here’s how I understand it, from the site’s readable and informative explanation of searching and LSI.

When you click on the “Find more like this one” button on a search site (= “Similar pages” at Google), the site analyzes the word usage pattern on that page and runs a query to find other pages with similar patterns. LSI does this analysis not when a user presses the button but while it is indexing the page, so it always knows which other pages are similar to that one. So, when you do an LSI search for, say, “French Impressionism,” it finds not only pages that contain that phrase but also pages that are similar to ones that contain it. Thus, an LSI search might turn up a page that talks about 19th-century painters concerned with the play of light in paintings of haystacks even if it never uses the phrase “French Impressionism.” (Of course, it may also turn up a page about Haystack Calhoun, the old professional wrestler, playing with the lights in the arena.)

One of the very cool things about this approach, whether pre-computed or done on the fly, is that it lets a computer find two pages that are about the same thing simply by analyzing the way words are arrayed on the page, without making any attempt to understand what those words mean.
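
For the curious, here is a minimal sketch of the idea in Python with NumPy. It is a toy illustration, not the Semant-o-Matic code: the four "pages," the function names (`build_index`, `search`, `most_similar`), and the choice of two latent dimensions are all assumptions made up for the example. It uses the textbook form of LSI, a truncated singular value decomposition of a word-count matrix, which may differ in detail from what the site actually does.

```python
import numpy as np

# Toy corpus: four hypothetical "pages", invented for this illustration.
DOCS = [
    "french impressionism monet painting light",
    "monet painting haystacks light",   # never says "french impressionism"
    "wrestler arena match champion",
    "wrestler champion belt match",
]

def build_index(docs, k=2):
    """Build a rank-k LSI index from whitespace-tokenized documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    term_id = {w: i for i, w in enumerate(vocab)}
    # Term-document count matrix: rows are words, columns are pages.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[term_id[w], j] += 1.0
    # Truncated SVD: keep only the k strongest word-usage patterns.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    doc_vecs = Vt[:k, :].T * s_k        # one k-dimensional row per page
    return term_id, U_k, doc_vecs

def search(query, term_id, U_k, doc_vecs):
    """Cosine similarity of each page to the query, in the latent space."""
    q = np.zeros(U_k.shape[0])
    for w in query.split():
        if w in term_id:                # words outside the corpus are ignored
            q[term_id[w]] += 1.0
    q_k = q @ U_k                       # fold the query into the latent space
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k)
    return (doc_vecs @ q_k) / np.maximum(denom, 1e-12)

def most_similar(doc_vecs):
    """Index-time step: for each page, the index of its closest other page."""
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.maximum(norms, 1e-12)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)     # a page is not its own neighbor
    return sims.argmax(axis=1)

term_id, U_k, doc_vecs = build_index(DOCS)
scores = search("french impressionism", term_id, U_k, doc_vecs)
neighbors = most_similar(doc_vecs)      # computed once, before any query
```

In this tiny corpus, the haystacks page scores about as high as the page that actually contains the phrase, because both share the "monet painting light" pattern, while the wrestling pages score near zero. The `neighbors` array is the "find more like this one" answer worked out ahead of time. The toy pages were chosen so the two topics share no words. On real blogs the patterns overlap, which is exactly how Haystack Calhoun could sneak in.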
