Peter Norvig - Google Does Not Directly Use Search Usage Data in Relevancy Algorithms

Anand Rajaraman recently spoke with Peter Norvig, who revealed that:

  • their best machine learning algorithm is already as good as, and sometimes better than, their current hand-rolled relevancy algorithms
  • but they still prefer to use their hand-rolled algorithms because of hubris, and because they feel that machine learning algorithms may be more inclined to make catastrophic errors on searches that do not look much like those in the training set

I think a third piece (that you will never hear Google employees admit to) is that as the web's structure changes Google feels they have to use FUD to police the web and help ensure Google has revenue entry points into important markets. In their 2007 search quality rater guidelines they used a typical Commission Junction link as an example of a sneaky redirect. It is doubtful that Google would ever do that with AdSense code or a Performics link (since they own those).

In the follow-up post about his chat with Peter Norvig, Anand highlighted how Google measures relevancy. In the post he stated why Google prefers internal review data over direct usage data:

Peter confirmed that Google does collect such [usage] data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact Google does not use such real usage data to tune their search ranking algorithm.

Exposure from top rankings already creates a self-reinforcing effect because of the power of defaults. Tying search usage data directly into relevancy might not add much benefit for searchers, especially as more people click on the first search result. Anand further explained why direct usage data is not used to refine Google's relevancy algorithms:

The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.
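To make the first factor concrete, here is a rough sketch of a position-based click model. Everything in it is an assumption made up for illustration: the examination probabilities, the `trust` weight, the relevance scores, and the `expected_clicks` helper are hypothetical, and none of it reflects how Google actually measures or models usage data.

```python
# Hypothetical position bias: chance a searcher even looks at each rank.
EXAMINE_PROB = [0.90, 0.30, 0.15, 0.08, 0.05]


def expected_clicks(ranking, relevance, trust=0.8):
    """Expected clicks per search under a simple position-based click model.

    A click at a rank requires the searcher to examine that rank and then
    decide to click. `trust` (an assumed value) blends blind trust in the
    engine with the true relevance of the result.
    """
    total = 0.0
    for rank, doc in enumerate(ranking):
        click_given_examined = trust + (1 - trust) * relevance[doc]
        total += EXAMINE_PROB[rank] * click_given_examined
    return total


# Made-up relevance scores for five results.
relevance = {"a": 0.9, "b": 0.6, "c": 0.8, "d": 0.3, "e": 0.2}

old_model = ["a", "b", "c", "d", "e"]
new_model = ["a", "c", "b", "d", "e"]  # promotes the more relevant "c" above "b"

old_ctr = expected_clicks(old_model, relevance)
new_ctr = expected_clicks(new_model, relevance)
relative_change = (new_ctr - old_ctr) / old_ctr

print(f"old: {old_ctr:.4f}  new: {new_ctr:.4f}  change: {relative_change:.2%}")
```

With these invented numbers the new ranking is better (it moves a more relevant result up), yet the aggregate click metric shifts by less than one percent, because most of the clicks come from position and trust instead of relevance. Lower the `trust` value and the metric becomes more sensitive to ranking changes.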

Published: June 16, 2008 by Aaron Wall in google


June 17, 2008 - 11:10pm

Thanks for sharing that, Aaron. Extremely interesting to hear what that guy has to say about them not really using direct usage data.

I'm not sure if I completely understand what he says. Does he sort of equate "usage data" with "click-through rate in the SERPs"?

How about length-of-stay on the site (relative to other metrics like length of the site), number of pages, repeat visits, conversion rates (esp. for PPC), etc.?

Why are they running around the web buying sources from which they can get such clickstream data (as you mentioned once or twice in a post) if they don't really care about it?

Why is he telling this to the public (if it's the truth)? Do you think Google has to reveal certain things that are true every now and then so that people still buy their FUD/wrong statements (if everything they said wasn't true nobody would listen to their FUD)?

Maybe it is not even true and they're just trying to keep us webmasters and SEOs from understanding the benefit of optimizing for usage data?;-)

June 18, 2008 - 3:58pm

I don't think he meant for that conversation to go public and end up on SEO blogs. That sorta just happened ;)

If they buy up a lot of top web properties they get a second bite at monetization after people leave the search results. It makes their ad business model more diverse, which is somewhat important because display ads can drive search volume, and with YouTube they now have a wedge that they can use to try to control TV ads.

