Latent Semantic Indexing

(GEEK STUFF) One of the largest problems many search engines run into is that after they get to a few hundred million documents their algorithms and hardware hit a wall.

For those companies that can afford the investment to get past this point they still run into the problem that each additional resource makes their job a bit harder.

One of the major ways around this problem is to take advantage of the natural patterns in human language. Using Latent Semantic Indexing allows indexing search results based on the pairing of like words within documents.

Many complex searches may lack exact matches in the results as well. Being able to find near matches will allow search engines to provide more comprehensive results.

Its hard to get computers to understand anything human, but the process of latent semantic indexing delivers conceptual results while being entirely mathematically driven.

There are two main ways to do this, single variable decomposition and multi dimentional scaling.

Some of the steps of the single variable decomposition process are to:

  • create a database of all words in relevant documents
  • remove common stop words
  • stemming
  • remove words appearing in all results
  • remove words only appearing in one result
  • create a database of relavent keywords
  • weight the pages based on the frequency of keyword distribution
  • increasing the relevance of terms which appear in a small number of pages (as they are more likely to be on topic than words that appear in most all documents)
  • normalize the page to remove the pagelength as a factor
  • create relevancy vectors for the keywords

The single variable decomposition process is not scalable enough to work on large scale search engines though as it requires too much processor time. Multi dimentional scaling allows us to take snapshots of the topicology of different documents. "Instead of deriving the best possible projection through matrix decomposition, the MDS algorithm starts with a random arrangement of data, and then incrementally moves it around, calculating a stress function after each perturbation to see if the projection has grown more or less accurate. The algorithm keeps nudging the data points until it can no longer find lower values for the stress function."

This does not provide exact results, but only a rough approximation. When combined with other factors this approximation improves scalability and quality of search.

Good Reading on latent semantic indexing

This technology is so amazing that it may eventually help lead to a cure for cancer. Already the technology is being refined for cognitive improvements and test grading!

Published: January 18, 2004 by Aaron Wall in technology

Comments

Add new comment

(If you're a human, don't change the following field)
Your first name.
(If you're a human, don't change the following field)
Your first name.
(If you're a human, don't change the following field)
Your first name.

New to the site? Join for Free and get over $300 of free SEO software.

Once you set up your free account you can comment on our blog, and you are eligible to receive our search engine success SEO newsletter.

Already have an account? Login to share your opinions.