Latent semantic indexing explained
In response to my blogging about pages not saying what they’re about, Hanan Cohen points us to an exceptionally well-written article by Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne about latent semantic indexing (not to be confused with latex cement and indenting).
Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant… Although the LSI algorithm doesn’t understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the keyword at all.
For example: “In an AP news wire database, a search for Saddam Hussein returns articles on the Gulf War, UN sanctions, the oil embargo, and documents on Iraq that do not contain the Iraqi president’s name at all.”
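The article walks through the underlying math, but the gist is easy to play with. Here's a rough sketch of the idea in Python (my own toy example, not from the article): build a term-document matrix, reduce it with a truncated SVD, and rank documents against a query in the reduced "concept" space.

```python
# A minimal LSI sketch, assuming scikit-learn is available. The document
# set and query are made up to echo the article's Saddam Hussein example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "gulf war coverage and un sanctions on iraq",
    "the oil embargo and its effect on iraq",
    "saddam hussein addresses the iraqi parliament",
    "new pasta recipes for spring",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)       # term-document weight matrix

svd = TruncatedSVD(n_components=2)       # k latent "concept" dimensions
doc_vecs = svd.fit_transform(X)          # documents in concept space

# Fold the query into the same reduced space and score every document.
query_vec = svd.transform(vectorizer.transform(["saddam hussein"]))
scores = cosine_similarity(query_vec, doc_vecs)[0]

for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.2f}  {doc}")
```

The embargo and sanctions documents can score well even though they never mention the query terms, because they share vocabulary ("iraq") with a document that does. That's the whole trick: similarity flows through co-occurring words rather than exact matches.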
It's a very well-done article all around. It even links to an application of LSI: an automatic essay grader (temporarily down because a class is actually using it).