(GEEK STUFF) One of the largest problems many search engines run into is that after they get to a few hundred million documents their algorithms and hardware hit a wall.

For those companies that can afford the investment to get past this point they still run into the problem that each additional resource makes their job a bit harder.

One of the major ways around this problem is to take advantage of the natural patterns in human language. Using Latent Semantic Indexing allows indexing search results based on the pairing of like words within documents.

Many complex searches may lack exact matches in the results as well. Being able to find near matches will allow search engines to provide more comprehensive results.

Its hard to get computers to understand anything human, but the process of latent semantic indexing delivers conceptual results while being entirely mathematically driven.

There are two main ways to do this, single variable decomposition and multi dimentional scaling.

Some of the steps of the single variable decomposition process are to:

create a database of all words in relevant documents
remove common stop words
stemming
remove words appearing in all results
remove words only appearing in one result
create a database of relavent keywords
weight the pages based on the frequency of keyword distribution
increasing the relevance of terms which appear in a small number of pages (as they are more likely to be on topic than words that appear in most all documents)
normalize the page to remove the pagelength as a factor
create relevancy vectors for the keywords

The single variable decomposition process is not scalable enough to work on large scale search engines though as it requires too much processor time. Multi dimentional scaling allows us to take snapshots of the topicology of different documents. "Instead of deriving the best possible projection through matrix decomposition, the MDS algorithm starts with a random arrangement of data, and then incrementally moves it around, calculating a stress function after each perturbation to see if the projection has grown more or less accurate. The algorithm keeps nudging the data points until it can no longer find lower values for the stress function."

This does not provide exact results, but only a rough approximation. When combined with other factors this approximation improves scalability and quality of search.

Good Reading on latent semantic indexing

This technology is so amazing that it may eventually help lead to a cure for cancer. Already the technology is being refined for cognitive improvements and test grading!

Published: January 18, 2004 by Aaron Wall in technology

Top Search Keywords List

Online Branding through the Eyes of a Schmuck