Peter Norvig - Google Does Not Directly Use Search Usage Data in Relevancy Algorithms

Anand Rajaraman recently spoke with Peter Norvig, who revealed that:

  • their best machine learning algorithm is already as good as, and sometimes better than, their current hand-rolled relevancy algorithms
  • but they still prefer to use their hand-rolled algorithms because of hubris, and because they feel that machine learning algorithms may be more prone to catastrophic errors on searches that do not look much like those in the training set

I think a third piece (that you will never hear Google employees admit to) is that as the web's structure changes Google feels they have to use FUD to police the web and help ensure Google has revenue entry points into important markets. In their 2007 Google search quality rater guidelines they used a typical Commission Junction link as an example of a sneaky redirect. It is doubtful that Google would ever do that with AdSense code or a Performics link (since they own those).

In the follow-up post about his chat with Peter Norvig, Anand highlighted how Google measures relevancy. In the post he stated why Google prefers internal review data to direct usage data:

Peter confirmed that Google does collect such [usage] data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact Google does not use such real usage data to tune their search ranking algorithm.

Exposure from top rankings already creates a self-reinforcing effect because of the power of defaults. Tying search usage data directly into relevancy might not add much benefit to searchers, especially as more people click on the first search result. Anand further explained why direct usage data is not used to refine Google's relevancy algorithms:

The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.

Google AdWords Price Fixing

In Google's commentary about their ad deal with Yahoo! they wrote:

This does not let Google raise prices for advertisers. Google does not set the prices manually for ads; rather, advertisers themselves determine prices through an ongoing competitive auction. We have found over years of research that an auction is by far the most efficient way to price search advertising and have no intention of changing that.

Aspects of that statement are categorically untrue, perhaps even lies. In many competitive markets with lots of participants the ad market may set ad price minimums, but Google...

  1. publicly talks about how they tweak the number of ads they display to maximize revenues
  2. uses quality scores that allow them to give friendly businesses discounts
  3. not only favors its own ads, but also creates custom ad units that only it can buy

Arbitrary Pricing Floors

Articles in the media have described how Google tweaks dials to optimize revenues. While many competitors have increased the number of ads they show, Google has been showing ads across a smaller portion of their search queries, as shown via this comScore data.

If you do not pay Google enough they simply will not show your ads, even if there are no competitors. I have ads where I am the only bidder and I get a 17% clickthrough rate - and yet there is a 17 cent price on those clicks, rather than a true market floor. Bid too low and your ads simply do not show up - even if you are bidding against nobody.
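The effect of a minimum bid on a lone advertiser can be sketched with a toy single-slot second-price auction. This is a simplified illustration and not Google's actual pricing formula: the function name `clearing_price`, the `reserve_cents` parameter, and the one cent increment are all assumptions made for the example, and quality score is ignored entirely.

```python
from typing import List, Optional


def clearing_price(
    bids_cents: List[int], reserve_cents: int, increment_cents: int = 1
) -> Optional[int]:
    """Price paid by the top bidder in a toy second-price auction with a floor.

    Returns None when no bid meets the reserve, meaning no ad is shown.
    All names and the pricing rule are hypothetical simplifications.
    """
    # Bids below the floor are dropped before the auction even runs.
    eligible = sorted((b for b in bids_cents if b >= reserve_cents), reverse=True)
    if not eligible:
        return None  # bid too low: the ad does not run, even with no competitors
    if len(eligible) == 1:
        return reserve_cents  # sole bidder: the floor, not the market, sets the price
    # With competition, pay just above the runner-up, capped at the winner's own bid.
    return min(eligible[0], max(reserve_cents, eligible[1] + increment_cents))
```

Under these assumptions, a lone 50 cent bid against a 17 cent floor still pays 17 cents per click, and a lone 10 cent bid is never shown.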

Preferential Pricing

Getting your account Google slapped is a well known phrase amongst many affiliate marketers. One day your ads are going great, and then the next day every keyword has a minimum bid of $5 or $10 per click.

On the flip side of that, many click arbitrage based business models are only profitable *because* a publisher gained access to a high authority trusted Google partner which allowed cheaper ad prices for the same keywords & ad units.

Google has gone as far as publishing information about the types of business models that they do not like. Unlike acceptable business models like reverse billing fraud and infidelity, selling ebooks on sites with ads might merit a low landing page quality score.

Google Only Ad Units

Google products are advertised aggressively across Google's content network. Given that internal Google products benefit from brand awareness, bidding with funny money, and cheaper ad prices (since they don't have to give Google a cut), others with similar business models cannot compete.

When Google recently entered the mortgage lead market they gave themselves an ad title of 49 characters, and a dropdown that is not available to other advertisers.

Google Does Not Like Flash Welcome Screens, Will the Google Browser?

Philipp Lenssen pointed out that Google now offers a link in the search results to skip the intro on Flash intro pages.

Philipp also mentioned rumors of a Google Browser from February.

Google also launched a major offline advertising campaign in Moscow, hoping to gain market share on Russian search leader Yandex.

Google SERP-rot, Paid Links, & Spam Classification

I talked to a search engineer a few months back and he mentioned that he thought one of my sites and one of the promotions associated with it were both spammy. This month I came across a random blog comment where a person talked about how great that search company was for showing them that same site! The only problem was that since that site was new and we still needed more links, we had to pay Google for those clicks.

Meanwhile a network of older, established, poorly designed sites written in English as a third language dominates Google's organic search results, and keeps getting self-reinforcing links that make it virtually impossible to compete with them without buying links. But our AdWords ads and the viral marketing we did led to some exposure where editors from other companies got to evaluate our site.

  • A number of mainstream media companies (newspapers and radio shows) mentioned us on their sites.
  • A leading search company featured a link to our site aggressively in their portal (sorry I can't say more than that or a partner would kill me for doing so).
  • Mahalo listed our site with a cool rating and listed many deep links from our site on their overview page.
  • The Yahoo! Directory listed our site for free.

Had we not paid Google thousands of dollars, the organic links we got never would have existed, and our site might never rank. Most other search-related companies generally love our site. But because I am associated with the site and I am an aggressive marketer, the site is seen in a different light by search engineers at Google, in spite of providing a better user experience than the outdated garbage Google currently ranks (as indicated by searchers and editorial judgment from human reviewers at other search companies).

I am not complaining here, as we are on page 2 and getting close to page 1, but most content producers are not as aggressive at marketing as we have been, and some of the best content might take many years to rank - if ever. The bigger issues at hand are:

  • most English speaking webmasters with trusted sites use Google, thus if something is not in Google it is hard for it to get the quality links needed to rank unless the webmaster buys AdWords or spends a lot on public relations
  • many employees of other search companies are likely using Google search
  • any warp in Google's view of the web (like SERP staleness & bias toward huge media companies) creates opportunity for another search company to be born, and to some extent validates arbitrage plays by companies like Mahalo.

By relying on old websites to clog up the search results Google virtually guarantees that you need to buy links to rank a new site. The only question is who is getting paid!

Google to Police 'The Truth'

Recently a fake story was highlighted in the mainstream media, and the SEO behind it also mentioned it on their site. The SEO space as a whole began debating the legitimacy of such tactics, and Matt Cutts even commented on the issue:

My quick take is that Google’s webmaster guidelines allow for cases such as this:

“Google may respond negatively to other misleading practices not listed here (e.g. tricking users by registering misspellings of well-known websites). It’s not safe to assume that just because a specific deceptive technique isn’t included on this page, Google approves of it.”

There’s not much more deceptive or misleading than a fake story without any disclosure that the story is hoax.

The irony of this statement, as Nick Wilsdon pointed out, was that not only did Fox News syndicate the fake story, but they got in trouble in the past for attributing fake quotes to John Kerry. A person coming up with a clever story to get a few inbound links is nowhere near as sleazy as lying to try to sway the public vote for presidency...but it is much easier for Matt to police the small and weak webmasters while turning a blind eye to similar (but worse) offenses from larger players.

Morals of the story:

  • If you talk about exceptionally effective SEO strategies expect them to lose their effectiveness (search engineers are active in public discourse because it is easier to control people through fear than it is to write a better relevancy algorithm).
  • If your technique works so well that it is featured on many SEO blogs and/or draws a specific public comment from Matt Cutts you have gone too far (sheep must be slaughtered to control the herd).
  • If you are going to lie do it in a way that builds a fan base. If you have such a large fan base that most of your traffic comes from channels other than Google it is virtually impossible for Google to block you (unless you use hate speech that extends beyond the lies and spin that are typical on networks like Fox News).

If you want to understand how the mainstream media works I highly recommend investing 5 hours and $50 into the following 3 DVDs. As more time passes Google's ad fueled business model will lead to them essentially replicating the flaws and biases of the mainstream media.

  • Manufacturing Consent - Noam Chomsky talks about how the media operates to shape public opinion and policy.
  • Outfoxed - how Fox News spins the news to fuel their desired political agendas.
  • The Fog of War - in this DVD Robert S. McNamara talks about how he used spin and media control to try to minimize blowback from the Vietnam War.

A New Kind of Duplicate Content - GoogleBot Random Form Crawl

Michael VanDeMar highlights how a website lost an important page because it was treated as a duplicate of a new, less important page, which was added to the Google index by Googlebot filling out forms.

If you have limited PageRank and a Google accessible form or search box you may want to block them from indexing output URLs via a robots noindex meta tag or your robots.txt file.
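As a rough sketch, the robots.txt approach can be checked locally with Python's standard `urllib.robotparser` before you deploy it. The `/search` path and the example.com URLs below are hypothetical; substitute whatever path your form or search box outputs to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of internal search result URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

With these rules, `is_crawlable("http://example.com/search?q=widgets")` is false while ordinary pages such as `/about` stay crawlable. The meta tag alternative is `<meta name="robots" content="noindex">` placed in the head of the output pages.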

Why is Google Buying Links From SEMPO?

Google, which has arbitrarily imposed nofollow on the web (and declared link buyers and sellers who do not use the tag to be spammers), is buying a PageRank 7 link from SEMPO.org.

You would think that if Google wants to set new proprietary standards they would follow them as well. And what better spot to start following them than with a trade organization promoting search engine marketing?

How Much is a #1 Google Ranking Worth?

I just wrote a ~15 page article aimed at helping SEOs estimate how much a top rank in Google is worth.

I would appreciate any feedback you have on making it better. If you like it please hook me up with a Del.icio.us or Stumble. Any and all mentions are appreciated. :)

Will Your Website Pass a Google Review?

Welcome to GoogleNet!

Hitwise recently mentioned that Google controls over 1/3 of UK web traffic.
[Chart: upstream UK internet traffic from Google properties to other websites in the UK, 2007-2008]
With that much usage data, if you were Google, would you use usage data in your relevancy algorithms?

An Army of Google Search Editors

They could easily use algorithms to detect

  • sites that they send a lot of traffic to relative to their total traffic (comparing ratios between toolbar data and search traffic)
  • sites which have seen a rapid spike in traffic from Google
  • sites which people quickly bounce away from (and do not later return to)
  • sites which get a lot of traffic from Google but get few navigational queries

and flag anything out of the ordinary for human review. Marissa Mayer stated they have 10,000 reviewers.
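The checks listed above could be combined into a simple flagging pass. This is only an illustrative sketch: the `SiteStats` fields, the flag names, and every threshold are hypothetical, not known Google signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteStats:
    search_visits: int         # visits referred by the search engine this period
    total_visits: int          # all visits this period (e.g. from toolbar data)
    prev_search_visits: int    # search-referred visits in the prior period
    bounce_rate: float         # share of visitors who leave quickly and do not return
    navigational_queries: int  # searches for the site by name


def review_flags(s: SiteStats) -> List[str]:
    """Return reasons a site might be queued for human review (assumed thresholds)."""
    flags = []
    # Most of the site's traffic comes from search.
    if s.total_visits and s.search_visits / s.total_visits > 0.8:
        flags.append("search-dependent")
    # Search traffic more than tripled since the prior period.
    if s.prev_search_visits and s.search_visits / s.prev_search_visits > 3:
        flags.append("traffic-spike")
    # Visitors bounce away quickly.
    if s.bounce_rate > 0.7:
        flags.append("high-bounce")
    # Lots of search traffic, but almost nobody searches for the site by name.
    if s.search_visits > 0 and s.navigational_queries / s.search_visits < 0.01:
        flags.append("few-navigational-queries")
    return flags
```

A site that trips any flag would go to a reviewer rather than being demoted automatically, which is where a large pool of human raters comes in.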

Does Your Site Look Good to Google's Relevancy Algorithm?

As the web keeps getting richer and deeper, and Google increasingly uses human review for demoting spam, all the aesthetic things matter:

  • domain name
  • site design
  • content formatting
  • branding and public relations

As search evolves so too will spam. Some spam sites will LOOK and FEEL better than most non-spam sites. And so the remote quality raters will be given more data to look at - perhaps eventually even a sample of backlinks or other related data.

False positives will occur - sites and careers built around Google without proper support stilts will crumble. Unless your site is of social significance (you are a big corporation, a non-profit organization, a government institution, an educational institution, a top blogger, an official Google partner, or Youtube/Google house content) then part of the optimization process revolves around not only creating sites that pass a hand review, but also trying to create sites that do not get flagged for review - especially if you are a thin affiliate site.

How do you not get flagged for review?

  • Build enough quality signals and direct traffic that your site looks like a real part of the web.
  • Build something people keep coming back to.
  • Do not make drastic changes to your site unless you are comfortable with it going under review.

How do you pass a review?

Short term I think the aesthetic things matter a lot. Longer term it is best if your site satisfies a few criteria:

  • exclusive content that people value and keep coming back to (Google loses if they remove the best content from their index)
  • a brand that people care about and search for (Google looks dumb if they do not rank your site)
  • a meaningful and reliable traffic stream outside of Google (many quality signals may stem from this exposure, which will help keep your overall profile more organic)
  • you could cause public relations harm to Google and diminish their brand value in the eyes of thousands of people (removing your site has real opportunity cost)

Usage Data for Algorithmic Site Promotion

Creating Fake User Accounts is Harder Than it Sounds

If usage data were ever used to promote sites, they could look at regional data and help promote sites based on what is popular locally. Searchers reveal their location by IP address and the queries they search for.

The Trusted Few

Google could use a subset of their users when using usage data to affect relevancy (perhaps users with 6 months account history, credit card on file via Google Checkout, and a normal email profile).
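A filter along those lines might look something like the following. It is purely a sketch of the idea: the `Account` fields and the 180 day cutoff are assumptions standing in for "6 months account history, credit card on file, and a normal email profile".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Account:
    created: date               # when the account was opened
    has_card_on_file: bool      # e.g. a card stored via a checkout product
    normal_email_profile: bool  # email usage that looks like a real person


def is_trusted(a: Account, today: date, min_age_days: int = 180) -> bool:
    """Only accounts passing every check would contribute usage data."""
    old_enough = (today - a.created) >= timedelta(days=min_age_days)
    return old_enough and a.has_card_on_file and a.normal_email_profile
```

Requiring all three checks at once is what makes fake accounts expensive: each one needs months of history and a payment method before its clicks count for anything.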

Why Usage Data is Tricky

Much of the signal from usage data is likely mirrored by PageRank, so the lift might not be that great until they really refine the technology.

Some tricky parts with promoting sites based on usage data are:

  • usage data is quite noisy, and
  • it may not favor informational sites over commercial intent the way that PageRank does. That informational bias to the organic search results is a large part of why AdWords is so profitable.

Microsoft recently presented a paper on finding authority pages based on browsing habits.

Google is Quietly Consuming the Internet

TechRepublic asks "Will the Google revolution engulf IT departments?" Each time I write a newsletter, about 80% of the items are about Google. They keep innovating faster than other companies their size. Here are some examples of things they have done over the last ~ 2 months.

  • Changed organic search results based on the prior search query.
  • Added a search box for site search inside the search results, giving Google a second chance at displaying ads even on navigational queries for a specific website.
  • Started crawling site search forms on trusted sites, which (along with sitelinks, universal search, YouTube, and branded video ads) distributes more traffic to large trusted sites and business partners, with less traffic going to smaller websites (search keeps getting more editorial).
  • Offered App Engine, which provides free hosting to developers (in exchange for being stuck on their network and letting them spy on your usage data and growth).
  • Created a marketplace for people building on the Google network.
  • Began policing widgets not on their network, a topic that deserves its own post.

Not only are dumb companies buying into the everything Google strategy, but even some semi-intelligent ones are. After logging into Dreamhost recently I was shocked to see them integrating Google apps and email on all customer domains. What happens if/when Google buys GoDaddy? How does Dreamhost compete when Google gives away hosting as a loss leader?

There is big risk to Google consuming the web. The issue is not only information diversity and innovation, but what happens when your Google account gets hacked? I regret my reliance on Gmail, but am unsure how to fix it.
