Latent Semantic Indexing

Jan 18th

(GEEK STUFF) One of the largest problems many search engines run into is that once they reach a few hundred million documents, their algorithms and hardware hit a wall.

Even the companies that can afford the investment to get past this point still find that each additional resource they index makes the job a bit harder.

One of the major ways around this problem is to take advantage of the natural patterns in human language. Latent semantic indexing does this by indexing documents according to which words tend to appear together within them.

Many complex searches also have no exact matches among the indexed documents. Being able to find near matches lets search engines provide more comprehensive results.

It's hard to get computers to understand anything human, but the process of latent semantic indexing delivers conceptual results while being entirely mathematically driven.

There are two main ways to do this: singular value decomposition and multidimensional scaling.

Some of the steps of the singular value decomposition process are to:

  • create a database of all words in relevant documents
  • remove common stop words
  • stem the remaining words
  • remove words appearing in all results
  • remove words appearing in only one result
  • create a database of relevant keywords
  • weight the pages based on the frequency of keyword distribution
  • increase the relevance of terms which appear in a small number of pages (as they are more likely to be on topic than words that appear in almost all documents)
  • normalize the page to remove the page length as a factor
  • create relevancy vectors for the keywords
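The preparation steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real search engine does it: the stop word list, the crude `stem` function, and the names `tokenize` and `build_vectors` are all assumptions made for the example, and the weighting used is ordinary term frequency times the log of inverse document frequency. The decomposition itself is not shown; the sketch stops at the normalized keyword vectors that would be fed into it.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def stem(word):
    """Very crude suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_vectors(docs):
    """Return (keywords, one length-normalized weight vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    # Drop words found in every document or in only one document.
    keywords = sorted(t for t, k in df.items() if 1 < k < n)

    vectors = []
    for c in counts:
        # Frequency in this page, boosted for words that appear in few pages.
        vec = [c[t] * math.log(n / df[t]) for t in keywords]
        # Normalize to unit length so page length is not a factor.
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return keywords, vectors


docs = [
    "The cat chased the mouse.",
    "A cat and a dog playing.",
    "Dogs chasing cars on the road.",
    "The search engine indexes documents.",
]
keywords, vectors = build_vectors(docs)
print(keywords)
for v in vectors:
    print([round(x, 3) for x in v])
```

In this toy set only "cat" and "dog" survive the filtering, since every other word appears in just one document. A page that contains none of the surviving keywords ends up with an all-zero vector.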

The singular value decomposition process is not scalable enough to work on large scale search engines, though, as it requires too much processor time.

Multidimensional scaling allows us to take snapshots of the topology of different documents. "Instead of deriving the best possible projection through matrix decomposition, the MDS algorithm starts with a random arrangement of data, and then incrementally moves it around, calculating a stress function after each perturbation to see if the projection has grown more or less accurate. The algorithm keeps nudging the data points until it can no longer find lower values for the stress function."
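The nudging loop described in that quote can be sketched roughly as follows. This is a toy version under several assumptions: the stress function is taken to be the sum of squared differences between the projected distances and the target distances, points are perturbed one at a time with Gaussian noise, and the loop simply runs for a fixed number of steps rather than detecting when no lower stress can be found. The names `stress` and `mds` and all parameter defaults are made up for the example.

```python
import math
import random


def stress(points, target):
    """Sum of squared gaps between projected and target pairwise distances."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def mds(target, dims=2, steps=5000, step_size=0.5, seed=0):
    """Return (points, history) where history lists each accepted stress value."""
    rng = random.Random(seed)
    n = len(target)

    # Start with a random arrangement of the data.
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    history = [stress(points, target)]

    for _ in range(steps):
        # Nudge one randomly chosen point.
        i = rng.randrange(n)
        old = points[i]
        points[i] = [x + rng.gauss(0, step_size) for x in old]

        # Keep the move only if the projection grew more accurate.
        s = stress(points, target)
        if s < history[-1]:
            history.append(s)
        else:
            points[i] = old
    return points, history


# Target distances between four documents (here, the corners of a unit square).
d = math.sqrt(2)
target = [
    [0, 1, d, 1],
    [1, 0, 1, d],
    [d, 1, 0, 1],
    [1, d, 1, 0],
]
points, history = mds(target)
print("initial stress:", round(history[0], 4))
print("final stress:", round(history[-1], 4))
```

Because a move is kept only when it lowers the stress, the recorded values never go up. The price is that the result is an approximation that depends on the random starting arrangement.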

This does not provide exact results, but only a rough approximation. When combined with other factors this approximation improves scalability and quality of search.

Good Reading on latent semantic indexing

This technology is so amazing that it may eventually help lead to a cure for cancer. Already the technology is being refined for cognitive improvements and test grading!

Published: January 18, 2004
