Google Semantically Related Words & Latent Semantic Indexing Technology

Many people have been noticing a wide shuffle in search relevancy scores recently. Some of those well in the know attribute this to latent semantic indexing. Even if Google is not using LSI, it has likely been using other word relationship technologies for a while and has recently increased their weighting.

How Does Latent Semantic Indexing Work?
Latent semantic indexing allows a search engine to determine what a page is about outside of specifically matching search query text.

A page about Apple computers will likely naturally have terms such as iMac or iPod on it.

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent. source
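To make the "many words in common" idea concrete, here is a minimal sketch that compares documents by their shared words using cosine similarity. It is not full LSI (which, as I understand it, also applies a matrix decomposition to the whole collection), and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample documents, for illustration only.
docs = {
    "apple_store": "apple imac ipod mac computers store",
    "apple_review": "review of the new apple imac and ipod",
    "fruit_pie": "apple pie recipe with cinnamon and sugar",
}

vectors = {name: term_counts(text) for name, text in docs.items()}
close = cosine_similarity(vectors["apple_store"], vectors["apple_review"])
distant = cosine_similarity(vectors["apple_store"], vectors["fruit_pie"])
print(f"store vs review: {close:.2f}")
print(f"store vs pie:    {distant:.2f}")
```

The two computer pages share several words and score as closer to each other than either does to the pie recipe, even though all three contain the word "apple".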

By placing additional weight on related words in content, or words in similar positions in other related documents, LSI has a net effect of lowering the value of pages which only match the specific term and do not back it up with related terms.
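A toy scoring function shows the net effect. This is only a sketch under my own assumptions: the related-term list and the 0.5 weight are invented for the example and are not anything a search engine has published.

```python
def relevancy_score(page_words, target, related, related_weight=0.5):
    """Toy score: 1 point for the target term plus partial credit for related terms.

    related_weight is an arbitrary illustrative value, not a known engine setting.
    """
    words = set(page_words)
    exact = 1.0 if target in words else 0.0
    if not related:
        return exact
    # Fraction of the related terms that appear somewhere on the page.
    coverage = len(words & set(related)) / len(related)
    return exact + related_weight * coverage


related_terms = ["imac", "ipod", "mac", "macbook"]  # hypothetical related-term list
narrow_page = "apple apple apple buy apple cheap apple".split()
broad_page = "apple imac ipod mac reviews and prices".split()

narrow = relevancy_score(narrow_page, "apple", related_terms)
broad = relevancy_score(broad_page, "apple", related_terms)
print(narrow, broad)
```

Repeating the target term earns the narrow page nothing extra, while the page that backs the term up with related words comes out ahead.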

LSI vs Semantically Related Words:
After being roasted by a few IR students and scientists I realized that many SEOs (like me) blended the concepts of semantically related words with latent semantic indexing, and due to constraints of the web it is highly unlikely that large scale search engines are using LSI on their main search indexes.

Nonetheless, it is overtly obvious to anyone who studies search relevancy algorithms by watching the results and ranking pages that the following are true for Google:

  • search engines such as Google do try to figure out phrase relationships when processing queries, improving the rankings of pages with related phrases even if those pages are not focused on the target term

  • pages that are too focused on one phrase tend to rank worse than one would expect (sometimes even being filtered out for what some SEOs call being over-optimized)
  • pages that are focused on a wider net of related keywords tend to have more stable rankings for the core keyword and rank for a wider net of keywords

Given the above, here are tips to help increase your page relevancy scores and make your rankings far more stable...

Mix Your Anchor Text!
Latent semantic indexing (or similar technologies) can also be used to look at the link profile of your website. If all your links are heavy in a few particular phrases and light on other similar phrases then your site may not rank as well.

Example Related Terms:
Many of my links to this site say "SEO Book" but I also used various other anchor text combinations to make the linkage data appear less manipulative.

Instead of using SEO in all the links some of them may use phrases like
search engine optimization
search engine marketing
search engine placement
search engine positioning
search engine promotion
search engine ranking
etc.

Instead of using book in all the links some other good common words might be
ebook
manual
guide
tips
report
tutorial
etc.
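If you keep a list of the anchor text pointing at your site, a quick way to see how lopsided it is might look like the sketch below. The link profile here is hypothetical, and I am not suggesting any particular percentage is safe or unsafe.

```python
from collections import Counter


def anchor_text_shares(anchors):
    """Return (anchor text, share of all links) pairs, most common first."""
    counts = Counter(a.strip().lower() for a in anchors)
    total = sum(counts.values())
    return [(text, n / total) for text, n in counts.most_common()]


# Hypothetical link profile, not real data for any site.
anchors = (
    ["SEO Book"] * 6
    + ["search engine optimization guide"] * 2
    + ["search engine marketing ebook", "SEO tips"]
)

shares = anchor_text_shares(anchors)
top_text, top_share = shares[0]
print(f"{top_text!r} is used in {top_share:.0%} of links")
```

If one phrase dominates the list, that is a hint to mix in more of the variations above.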

How do I Know What Words are Related?
There are a variety of options to know what words are related to one another.

  • Search Google for search results with related terms using a ~. For example, Google Search: ~seo will return pages with terms matching or related to seo and will highlight some of the related words in the search results.

  • Use a lexical database
  • Look at variations of keywords suggested by various keyword suggestion tools.
  • Write a page and use the Google AdSense sandbox to see what type of ads they would try to deliver to that page.
  • Read the page copy and analyze the backlinks of high ranking pages.
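The last option, reading the copy of high ranking pages, can be roughed out in code by counting which words tend to show up on the same pages as your target term. This is a simple co-occurrence count, assuming you have already saved the page text yourself; the sample pages and the stopword list are made up.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with", "your"}


def cooccurring_terms(pages, target, top_n=5):
    """Count, per word, how many pages mentioning `target` also contain it."""
    counts = Counter()
    for text in pages:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if target in words:
            counts.update(words - STOPWORDS - {target})
    return counts.most_common(top_n)


# Made-up page copy standing in for the text of high ranking pages.
pages = [
    "SEO tips for search engine ranking",
    "A guide to SEO and search engine marketing",
    "Search engine optimization tutorial with SEO tips",
    "Apple pie recipe",
]

related = cooccurring_terms(pages, "seo")
for word, page_count in related:
    print(word, page_count)
```

Words that appear on most of the pages mentioning the target term are candidates for your own copy and anchor text.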

Google Sandbox and Semantic Relationships:
The concept of "Google Sandbox" has become synonymous with "the damn thing won't rank" or whatever. The Sandbox idea is based upon sites with inadequate perceived trust taking longer to rank well.

Understanding the semantic relationships of words is just another piece of the relevancy algorithms, though many sites will significantly shift in rankings due to it. The Google sandbox theory typically has more to do with people getting the wrong kinds of links or not getting enough links than it does with semantic relationships. Some sites and pages are hurt though by being too focused on a particular keyword or phrase.

Where do I learn more about Latent Semantic Indexing?
A while ago I read Patterns in Unstructured Data and found it was written in plain, easy to understand English.

Brian Turner also listed a good number of research papers in this thread.

Forum Coverage:

Selected Forum Quotes:
BakedJake

I'm not about to go post my research and examples on a public forum. But, I'll warn you now - if you're not varying your anchor text, and you're not writing pages synonymous with your term that don't contain the term you're targetting, you're going to be in a world of hurt within the next 90 days.

We've been tracking this update for the last 6 months. I was surprised to see it happen now - I honestly didn't expect it until next month or March, but it's here.

BakedJake

I have a page about "baby clothes". I link to my site 100 times with the anchor text "baby clothes"

I now pull out the words "baby clothes" and all the links pointing to my site with the words "baby clothes"

Do I still have footing to rank for that term "baby clothes" after you've run some sort of semantic analysis on it?

That's my simplistic explanation. I think they're doing something very similar, but taking links into account like that and maybe even devaluing some links on the "main" term...

valeyard

Well, if it hasn't changed by Monday I'm going out to buy a black hat.

If irrelevant junk is what Google wants then irrelevant junk is what it's gonna get. :-(

dataguy

Man I'm glad I diversified my sites. I think I will work on diverifying some more...

andy_boyd

Google Inc. is all about money. And IMHO ... so are Yahoo Inc. and Microsft Corp.. As webmasters we are the people who build sites and depend on these money hungry companies, who at the heels of the hunt, put their interests miles ahead of ours.

Chico_Loco

My main concern with this new update is that if you search for my brand name (and there are quite a few that do based on referrals), then right now my site does not even rank. Our brand name is perhaps the best in my industry, and Google are, in my opinion, diluting my brand name and causing my company money. The first result for my brand name is a spammy page which is a "scraper site" which is actually SERP's page from somewhere - so that's basically useless.

The Hidden or Not so Hidden Messages:

  • If you are entirely dependent on any single network and a single site for the bulk of your income then you are taking a big risk. Most webmasters would be best off to have at least a couple of income streams to shield themselves from algorithm changes.

  • If you are new to SEO you are best off optimizing your site for MSN and Yahoo! from the start and then hoping to later rank well in Google.
  • Make sure you mix your anchor text to minimize your risk profile. Even if you are generally just using your site name as your anchor text eventually that too can hurt you.
  • Search algorithms and SEO will continue to get more complicated. But that makes for many fun posts ;)

Update: a few additional tools recommended in our comments and the comments at ThreadWatch

Google AdWords / AdSense Shakeup, Free Link Renting Guide, Ask Jeeves Blog

AdSense and AdWords shakeup:

found on ThreadWatch

SearchGuild birthday awards:
fun stuff

I was nominated but was beaten out by Orion. A real shame that I do not know more about fractal spam and semantic co-occurrence...

Free Link Renting Guide:
Patrick Gavin offers free link renting tips (PDF link)

Complacency:
Tim Converse (from Yahoo!) calls out Marissa Mayer (from Google). I am sure there are lots of fun dialogs between the various engines' employees.

Ask Jeeves:
creates their obligatory blog.

Yahoo! Launches Yahoo! Q

Yahoo! launches Yahoo! Q, which shows contextually relevant news and links in a small pop up box next to content.

"The thinking is that if you can read an article, you can be inspired to search," said Ken Norton, senior director of product management at Yahoo search. "We're bringing search to the moment of inspiration... We'll save them time and energy, and the most relevant search." source: MarketWatch

A couple weeks ago Jakob Nielsen talked about using fat links (or smart links which offered multiple options or opened multiple windows when clicked). This is the first implementation of the concept I have seen by any major web player.

There are two major modes for Yahoo! Q:

  1. Webmasters can add the Yahoo! Q code to their pages... I will be doing that shortly just to test it out.

  2. Users can download the Yahoo! Q DemoBar or add extensions to FireFox.

Eventually Yahoo! may integrate ads into their Q boxes, but at the start they are primarily hoping to improve search usage. The fact that FireFox is part of the beta release means that Yahoo! is really starting to create products which the web community will help market for them.

I have not tested it much, but it sure sounds like cool stuff.

PostScript: I installed Yahoo! Q on all my individual post pages. It was easy to install, but I am kinda tired.

A few things I do not like about it...

  • the Yahoo! Search blog has not yet installed Yahoo! Q. What is up with that? ;)

  • It slightly messed up my template. Not sure if I am at fault or it is at fault.
  • It requires me to pop the form element up within the content tags when I would have preferred to have it lower...like near all the other search engine links. Currently if I do that it might place too much weight on the post title.
  • Since many of the highlights will be at the bottom of the screen it will require the user to scroll down to see the Q box. Perhaps they could find a way to ensure a large portion of it fits on the screen?
  • Jeremy stated that they are working on the clunkiness problem.

The technology is a fairly cool idea and should be amazingly useful for community driven sites.

When You Can't Rank for Your Own Name...

So I was looking for a site of a well known SEO in Google and he does not show up for his site name.

I remembered a few others that this happened to recently and spoke to a friend who has seen a bunch of this. It appears that this is a rather common occurrence now, where sites that are aggressively improving their rankings stop showing up for their keyword and sometimes their site name.

I looked at some of the keywords for this site and some of the deep pages are ranking poorly for his primary terms in Google, but they are outranking the home page (which heavily targets those same terms and is absolutely buried). None of his pages rank for his site name.

I suppose this is a good way for Google to attack people selling competing advertising systems that manipulate their index. Rank them lowly for their keywords AND remove them from the index for their site name.

If people do not show up for their own name it hurts their brand. On the web AND off the web their entire brand is diminished by not showing up for their own name.

Then the only way these people can show up for their own name and brand is by buying in on AdWords, and if you have a strong brand that can become a competitive landscape and those costs can add up quick.

Pretty damn cool self regulating system if you are Google, but kinda sucky for joe average SEO company. :(

Amazon Margins Drop, Stock Tanks, Amazon offers Cheaper Express Shipping?, John Battelle Interviewed, AIRWEB Conference

Amazon:
Amazon reported their quarterly results, which fell well below expectations due to lower margins.

They then announced a new program by the name of Amazon Prime, where you get unlimited express shipping for the whole year for a one time $79 fee.

John Battelle:
Interviewed

Interesting Sounding New Conference?
First International Workshop on Adversarial Information Retrieval on the Web

from an email I got

The attraction of hundreds of millions of web searches per day provides significant incentive to content providers to do whatever necessary to rank highly in search engine results. The use of techniques that push rankings higher than they belong is often called spamming a search engine. Such methods typically include textual as well as link-based techniques. Like e-mail spam, search engine spam is a form of adversarial information retrieval; the conflicting goals of accurate results of search providers and high positioning by content providers provides an interesting and real-world environment to study techniques in optimization, obfuscation, and reverse engineering, in addition to the application of information retrieval and classification.

The workshop solicits technical papers and synopses of research in progress on any aspect of adversarial information retrieval on the Web. Particular areas of interest include, but are not limited to:

- search engine spam and optimization,
- crawling the web without detection,
- link-bombing,
- reverse engineering of ranking algorithms,
- advertisement blocking, and
- web content filtering.

Papers addressing higher-level concerns (e.g., whether 'open' algorithms can succeed in an adversarial environment, whether permanent solutions are possible, etc.) are also welcome.

IMPORTANT DATES

11 February 2005 E-mail intention to submit (optional, but helpful)
25 February 2005 Deadline for submissions
25 March 2005 Notification of acceptance
8 April 2005 Camera-ready copy due
10 May 2005 Date of workshop

The real question of course, is why would you give away spam white papers to a conference where many current search engineers are part of the program committee?

Google Referral Program & Google Karma

Google has launched an affiliate program.

  1. One of your visitors clicks on the graphic or link and is taken to a Google sign-up page.

  2. That person signs up for an AdWords account or applies to become an AdSense publisher.
  3. Advertisers qualify as completed referrals after they spend $20 with AdWords. Publishers qualify after they earn $75 in AdSense revenue.
  4. Every month, Google calculates the number of completed referrals you have directed to Google.
  5. Once you accrue $100 (five completed referrals) or more, you will receive a check. Google only issues checks once a month (for details, see the FAQ).

I already applied, but am uncertain as to how long it will take to be accepted. In 8 days I may also get to be a Google Advertising Professional.

Their stock is well over $200, and I got 50 free Gmail invites. If anyone wants one just shoot me an email.

Google karma continues to brew... ;)

Google Strong 4th Quarter, Yahoo! Japan Blogs, Australian SEM Conference

Google 4th Quarter:

Yahoo! Japan Blogs:
Yahoo Japan has beaten Yahoo to the blogosphere

Down Under:
Australian SEM Conference

DMOZ Post:
NickW finds Black Knight's what is wrong with DMOZ post ;)

MSN Officially Launches, Comment Spammer Chat

The Search Wars:
MSN makes the official switch announcement and is to spend big.

To appreciate the financial power of Microsoft you need only look at the various 4th quarter "US Personal Income Soars" news stories, which were primarily caused by Microsoft's $32,000,000,000 dividend.

And while that is a lot of Zeros it certainly is not a Google's worth of them, but Yahoo! apparently is also digging into Google's market share.

The Registrar Wars:
Google is now a registrar

The Blog Comment Wars:

SEO Contest Wars:
Loquine Glupe
I am thinking about running a buy viagra online contest soon. more on that later...

The PPC Wars:

The Oil Wars:
free Exxon Mobil gas

The Echoing Wars:

Lots of Random Stuff...

Searchtextual Ads:
AlmondNet launches an ad network based on search behavior, and apparently they have a patent for it too.

Future of writing:
Steven Berlin Johnson writes about how technology will forever change writing.

Branding:
Apple replaces Google as brand of the year.

ClickTracks Optimizer:
new mid level analytics software

Digital Identity:
MP3 streams from Future Salon on Digital Identity

Hyperlinkage:
new Bloglines competitor

VLIB Update:
A friend of mine is one of the maintainers of the VLIB. I just got off the phone with him and he stated that they are cleaning up the VLIB using technologies such as XML.

Survey Says:
take a Google Survey and read some survey results from another recent survey.

New Wiki Based Search Engine:
loots data from WhoIs database.

Iraq Election:
UN pays bloggers to shill

Merger:
SBC to buy AT&T.

Excrement:
Man peed way out of avalanche
2,000-ton pile of burning cow manure
hat tip to Frankie from TP on the excrement links.

Google AdWords API Beta

Google have launched their Google AdWords API. From their introduction page:

Google's free AdWords API service lets developers engineer computer programs that interact directly with the AdWords server. With the applications created, advertisers and third parties can more efficiently - and creatively - manage their large AdWords accounts and campaigns.

Flexible and Functional
What can you do with the AdWords API? This all depends on your programming genius and clients' advertising needs. Some possibilities might include:

  • Generating automatic keyword, ad text, URL, and custom reports

  • Integrating AdWords data with databases, such as inventory systems
  • Developing additional tools and applications to help you manage accounts

It works in many languages and its quota limits will be based on the size and spend of your account. You need a My Client Center account to sign up. Here are some of their support questions.

coverage at
