Defending Your Site Against a Google Proxy Hack

Dan Thies published a post about how people have been hacking Google's search results, using proxies to get the original sites nuked as duplicate content. He also explained how to defend sites against the problem using free PHP scripts developed by Jaimie Sirovich, and stated that he thought many of these proxy hijack "accidents" were not accidents at all:

Of course, not all proxies are being run by innocent people for innocent reasons. Some of them are actually designed to hijack content - to deliver ads, etc. Some people want to steal your content, and they want the search engines to index it. In fact, I would not be surprised if a large part of the overall problem isn't caused by such people firing links at their own proxies.

I have seen numerous sites killed by proxy hacking, and this is an issue Google has known about for over a year.
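As I understand it, the defense boils down to verifying that anything claiming to be a search engine spider really is one, and withholding indexable content (or serving a noindex response) to impostors, so a proxy's copy of your page never gets indexed as the "original." Jaimie's scripts are PHP; the lines below are my own rough Python sketch of the reverse-then-forward DNS check used to verify Googlebot, not his code, and the policy at the bottom uses made-up helper names:

import socket

def is_real_googlebot(ip):
    # reverse DNS: the IP should resolve to a googlebot.com / google.com host
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # forward DNS: that host name should resolve back to the same IP
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

# Hypothetical policy: if the user agent claims to be Googlebot but the IP
# fails the check, block the request or answer with a noindex directive.
# if 'Googlebot' in user_agent and not is_real_googlebot(remote_ip):
#     block_or_noindex()

The same check generalizes to other engines' verified crawler host names.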

This is yet another reason why hand edits at Google, coupled with Google paying AdSense scrapers to steal your content, make Google look like a pretty dirty company, especially when you consider their unofficial stance on copyright:

Your name cannot be stripped and no one else can claim credit for it. That is, credit/reputation is a non-renewable resource. It cannot be replicated. It cannot be copied. To the degree that someone takes credit for your stuff, that's the degree to which you lose credit. It is always proportional.

When Google goes so far as to try to police link exchanges and link buying, why don't they do a better job policing AdSense? If they want to clean up their search index, the easiest, most scalable, and most robust way to do so would be to worry about their own ad network and stop paying content thieves via AdSense.

How to Turn Content Into a Valuable Keyword List

One of the comments on the article I wrote for Wordtracker mentioned WordsFinder, which allows you to create a list of keywords from a piece of content. Their tool uses the Yahoo! Term Extraction Tool, and also provides a few additional keywords next to the results. Three other easy ways to get similar information are:

If you find some of the leading keywords for a competing site via tools like Compete.com (or SpyFu, SeoDigger, etc.), or via site-targeted AdSense ads, you can see which keywords and pages are most worth emulating. If your keyword list is too long to make sense of, consider running it through Google's Traffic Estimator tool to find which keywords are the most valuable.
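If you want to roll your own quick-and-dirty version of the same idea, a simple frequency count over your copy gets you a starting keyword list to feed into the Traffic Estimator. This is a crude Python sketch, nowhere near as smart as the Yahoo! Term Extraction Tool, and the input file name is made up:

import re
from collections import Counter

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'for', 'is', 'on', 'that', 'with', 'your', 'you'}

def candidate_keywords(text, top_n=20):
    # lowercase words, drop obvious stopwords
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]
    # count single words plus adjacent two word phrases
    phrases = [' '.join(pair) for pair in zip(words, words[1:])]
    return Counter(words + phrases).most_common(top_n)

copy = open('page-copy.txt').read()   # made-up input file holding the page copy
for term, count in candidate_keywords(copy):
    print(count, term)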

If you have more authority and target more valuable keywords and traffic streams you win. :)

Using Junk Mail to Find New Keywords

I recently was asked to write an article for Wordtracker about finding sources of keyword inspiration. I tried making it fun. Let me know what you think of it.

The $10,000 Robots.txt File...Ouch!

I recently changed one of my robots.txt files to prune duplicate content pages and help more of the internal PageRank flow to the higher quality, better earning pages. In the process of doing that, I forgot that one of the most well linked to pages on the site had a URL similar to the noisy pages. About a week ago the site's search traffic halved (right after Google became unable to crawl and index that powerful URL). I fixed the error pretty quickly, but the site now has hundreds of pages stuck in Google's supplemental index, and I am out about $10,000 in profit for that one line of code! Both Google and Yahoo! support wildcards, but you really have to be careful when changing a robots.txt file, because a line like this
Disallow: /*page
also blocks a file like this from being indexed in Google
beauty-pageants.php

Unless you think of that in advance, it is easy to make a mistake.
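One way to catch that kind of mistake before you upload the new file is to emulate the wildcard matching against a list of your URLs. Here is a small Python sketch, a simplified stand-in for Google's actual matcher, where * matches any run of characters and a trailing $ anchors the end of the path:

import re

def blocked(pattern, path):
    anchored = pattern.endswith('$')
    body = pattern[:-1] if anchored else pattern
    regex = '^' + '.*'.join(re.escape(part) for part in body.split('*'))
    if anchored:
        regex += '$'
    return re.match(regex, path) is not None

print(blocked('/*page', '/beauty-pageants.php'))   # True - the accident above
print(blocked('/*page=', '/beauty-pageants.php'))  # False - a narrower pattern
print(blocked('/*page=', '/category.php?page=2'))  # True - still prunes pagination-style URLs

In a case like mine, something narrower such as Disallow: /*page= might have pruned the noisy pagination-style URLs without touching beauty-pageants.php.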

If you are trying to prune duplicate content for Google and are fine with it ranking in other search engines, you may want to make those directives specific to Googlebot. If you create a directive block for a specific robot, that bot will ignore your general robots directives in favor of the more specific directives you created for it.
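For example, a robots.txt along these lines (the /cgi-bin/ path is just a placeholder) blocks the noisy pattern for Googlebot only, while other engines keep following the general section. Because Googlebot will obey only its own section, you have to repeat any general rules you still want it to honor:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /*page=

User-agent: *
Disallow: /cgi-bin/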

Google's webmaster guidelines and Yahoo!'s Search Blog both offer tips on how to format your robots.txt file.

Google also offers a free robots.txt test tool, which allows you to see how robots will respond to your robots.txt file, notifying you of any files that are blocked.

You can use Xenu Link Sleuth to generate a list of URLs from your site. Upload that URL list to the Google robots.txt test tool (currently in 5,000 character chunks...an arbitrary limit I am sure they will eventually lift).
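If your Xenu export is long, a few lines of Python (the file names here are made up) will split it into pastes that stay under that 5,000 character ceiling:

def chunk_urls(urls, limit=5000):
    chunk, size = [], 0
    for url in urls:
        line_len = len(url) + 1            # +1 for the newline
        if chunk and size + line_len > limit:
            yield '\n'.join(chunk)
            chunk, size = [], 0
        chunk.append(url)
        size += line_len
    if chunk:
        yield '\n'.join(chunk)

urls = open('xenu-urls.txt').read().split()    # made-up export file
for i, chunk in enumerate(chunk_urls(urls), 1):
    open('paste-%d.txt' % i, 'w').write(chunk)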

Inside the webmaster console, Google will also show you which pages are currently blocked by your robots.txt file, and let you view when Google tried to crawl a page and noticed it was blocked. Google also shows you which pages return 404 errors, which is a good way to see if you have any internal broken links or external links pointing at pages that no longer exist.

Internal Link Architecture Made Easy

Jim Boykin recently offered tips to help webmasters audit a site to see which pages are the most link rich, understand how internal link equity flows around a website, and optimize their internal link architecture. In addition to Jim's advice, you can also improve your internal link structure with some of the following ideas.

Create Promotional Content Sections

The following ideas signal social acceptance (which helps improve conversions) while also funneling PageRank to important pages without looking spammy.

  • heavily promote seasonal stuff in advance (internally and externally)

  • use sales data or other metrics to create "what's hot in this category" and "what's hot on our site" sections that flow more link equity to best sellers (these sections can be called almost anything: what's hot, top rated, etc.)
  • create pages high in the site structure to support high value keywords that were only tangentially covered on lower level pages
  • over-represent new content in your link structure to help it get indexed quickly, see how well it will rank, and learn how profitable it is

Internal to External Link Ratio

Doing these sorts of things will still give you all the good karma and benefits that linking out brings, while minimizing any downside from funneling a significant portion of your PageRank out of your site (a rough link-counting sketch follows this list).

  • if you have a blog, cross-reference old posts where and when it makes sense

  • if you link out heavily on a page, ensure you also place numerous internal links on the page
  • use breadcrumb navigation or other navigational schemes to help structure the site and improve the internal to external link ratio
  • if you have a ton of outbound site-wide links, change some of them so they are only listed on a single page or section of your site
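Here is the rough link-counting sketch mentioned above: it fetches a page and tallies internal versus external links so you can spot templates where the outbound links swamp your own. The URL is a placeholder, and the same-host matching is deliberately crude (www vs. non-www counts as external):

from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

class LinkCounter(HTMLParser):
    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host
        self.internal = 0
        self.external = 0
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        host = urlparse(href).netloc
        if host and host != self.site_host:
            self.external += 1
        else:
            self.internal += 1   # relative and same-host links

page = 'http://www.example.com/some-page.html'   # placeholder URL
counter = LinkCounter(urlparse(page).netloc)
counter.feed(urlopen(page).read().decode('utf-8', 'ignore'))
print('internal:', counter.internal, 'external:', counter.external)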

Keep the Noise Out of the Index

  • demote an entire section of the site in the link structure if it has a lower ROI than other sections

  • use robots.txt and meta robots exclusion tags to prevent duplicate content and other low information or noisy pages from getting indexed
  • instead of using pagination try to display more content on each page
  • check your server logs for 404 errors, fix any broken links, and redirect old linked-to pages to their new locations (see the log-scanning sketch after this list)
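The log-scanning sketch mentioned in the last bullet: it pulls the most requested 404 URLs out of a common/combined format access log (the log path is a placeholder) so you know what to fix or 301 redirect:

from collections import Counter

missing = Counter()
for line in open('access.log'):                  # placeholder log path
    parts = line.split('"')
    if len(parts) < 3:
        continue                                 # not a common/combined format line
    request = parts[1].split()                   # e.g. ['GET', '/old-page.php', 'HTTP/1.1']
    status = parts[2].split()
    if status and status[0] == '404' and len(request) > 1:
        missing[request[1]] += 1

for path, hits in missing.most_common(20):
    print(hits, path)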

Bonus Idea: Create Cross Referencing Navigational Structures

This idea may sound a bit complex until you visualize it as a keyword chart with an x and y axis.

  • Imagine that a, b, c, ... z are all good keywords.

  • Imagine that 1, 2, 3, ... 10 are all good keywords.
  • If you have a page on each subject, consider placing the navigation for a through z in the sidebar while using links and brief descriptions for 1 through 10 as the content of the page. If people search for "a 7" or "b 9" that cross-referencing page will be relevant for it, and if it is done well it does not look too spammy.

Since these types of pages can spread link equity across so many pages in different categories, make sure they are linked to well from high up in the site's structure. These pages work especially well for categorized content cross-referenced by locations.
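To make the x/y idea concrete, here is a toy Python sketch with placeholder keywords: every category page carries the full category navigation in its sidebar and links (with short descriptions) to every location page in its body, so each page can answer the combined "category + location" searches:

categories = ['beauty pageants', 'talent shows', 'scholarships']   # the a..z axis
locations = ['texas', 'ohio', 'florida']                           # the 1..10 axis

def slug(phrase):
    return phrase.replace(' ', '-') + '.php'

for cat in categories:
    print('page:', slug(cat))
    print('  sidebar ->', ', '.join(slug(c) for c in categories))
    print('  body    ->', ', '.join(slug(l) for l in locations))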

Synergy Between Domain Names & Keyword Based Search Engine Optimization Strategies

SEO Question: Do domain names play a role in SEO? Do search engines understand the words in the URL even if they are run together without hyphens between them? What techniques are best for registering a domain name that search engines like Google will like?

Answer: Over time the role of the domain name as an SEO tool has changed, but currently I think domain names carry a lot of weight for the associated exact match search. Depending on how they are leveraged going forward, they may or may not continue to be a strong signal of quality to search engines.

Domain Names & Link Anchor Text

When I first got into the SEO game, a good domain name was valuable because having the exact keywords you wanted to rank for in your name made it easier to get anchor text related to what you wanted to rank for. For example, being seobook.com made it easier for me to rank for seo book and seo.

That relationship still exists, but it is nowhere near as strong or broad as it once was.

The Fall of Anchor Text & the Rise of Filters

Anchor text as an SEO technique is no secret. To make up for the long, ongoing abuse of it, Google started placing less weight on anchor text AND creating more aggressive filters that catch sites whose link profiles look too spammy because too many inbound links contain the exact same anchor text. If everyone who links to me uses seo book as the anchor text, it is much harder to consistently rank for that term than it would be if there were a more natural mixture. A natural mix would include some of the following:

  • Aaron Wall

  • Aaron Wall's blog
  • SEO Book blog
  • book about SEO
  • the SEO Book
  • seobook.com
  • www.seobook.com
  • Aaron Wall's Seo Book
  • etc

Natural link profiles also contain deep links to internal pages, whereas spammy sites tend to point almost all of their links at their home page.
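If you want a quick read on how natural your own profile looks, tally a backlink export. The Python sketch below assumes a made-up CSV of anchor,landing_page rows and reports how much one anchor dominates and how many links hit deep pages:

import csv
from collections import Counter

anchors, targets = Counter(), Counter()
with open('backlinks.csv') as f:                       # made-up export file
    for row in csv.reader(f):
        if len(row) < 2:
            continue
        anchor, target = row[0].strip().lower(), row[1]
        anchors[anchor] += 1
        # treat URLs with no path beyond the host as home page links
        targets['home page' if target.rstrip('/').count('/') <= 2 else 'deep page'] += 1

total = sum(anchors.values())
if total:
    top_anchor, top_count = anchors.most_common(1)[0]
    print('top anchor: %r at %.0f%% of links' % (top_anchor, 100.0 * top_count / total))
    print(dict(targets))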

Domain Names in Action

As Google started getting more aggressive at filtering anchor text, they started placing more weight on the domain name if the domain name exactly matched the keyword search query. They had to do this because they were filtering too many brands out of the results for searches on their own brand names. Some examples of how this works:

  • At one point, about 2 years back, SeoBook.com stopped ranking for seo book due to a wonky filter that also prevented PayPal from ranking for their own name for a little while.

  • A friend recently 301 redirected an education site on a bad URL to a stronger domain name. The site's ranking for the exact phrase went from 100+ to the top 20 in Google. But it is still a long way from #1, and it is still at 100+ for the singular version. In competitive industries you need a lot of links to compete, and the redirect also caused the site to slip a bit for some of the other target keyword phrases it used to rank for.
  • When you launch a new site on a domain name like mykeywordphrase.com and get it a few trusted links, it should almost immediately rank for mykeywordphrase. A friend launched a 3-word education site about a week ago. That site ranks #1 in Google right now for those keywords run together. It also just ranked #118 in Google for the phrase with the words spread apart. As the site ages and gets more links it should be easier to rank for that exact phrase (but that domain probably won't help its rankings much for stuff like the root sub-phrase).
  • My domain name Search Engine History.com ranked better than it should have for the query search engine history when its only real signs of trust were age and domain name. It was nowhere in the rankings for just about any other query.

Things Will Change Over Time

A few other caveats worth noting:

  • In my experience this exact match domain bonus works with all domain extensions (even .info), but that could change over time. And if the content isn't any good, it is still going to be hard to get traction in any market worth developing content for. The exact match bonus also works well in local markets for regional domains like .ca.

  • This post is about the current market, and is highly focused on Google's relevancy algorithms (rather than other search engines'). I expect the weight on domain names to be lowered significantly (especially for competitive queries) as Google moves toward incorporating more usage data into their relevancy algorithms. This is especially true if many domainers put up low to average quality websites on premium domain names. Moves like creating 100,000 keyword-laden sites in one massive push (as Marchex recently did) don't bode well for the future of domain names as a signal of quality.
  • Search traffic trends are moving toward consolidating traffic onto the largest high authority sites, so it probably is not a good idea to have 100 deep niche domain names like OnlineHealthcareDegrees.org, OnlineNurseDegrees.net, OnlineNursingSchools.com, OnlineLawDegrees.com, OnlineParalegalDegrees.net, etc. when you can cover a lot of those topics with a single broader domain like OnlineDegrees.org.
  • Any advantage exact match domains seem to have for ranking is much smaller for related phrases that do not exactly match the keyword string or phrases within the anchor text of most of the inbound links.
  • For local businesses, a keyword-matching domain might be a way around paying to be listed in all the regional directories and other related arbitrage plays.
  • Domains that use familiar language and sound credible also have a resonance that helps build trust: they make the information seem more credible, easier to link at, easier to syndicate, and easier to do business with. It is hard to estimate the value of that since much of it is indirect, and few have measured the effect of a domain name on the linkability or clickability of a listing outside of paid search arbitrage.

Website Sustainability: What Percent of Your Traffic Comes From Search Engines?

As SEOs, one of our primary goals is to get more search traffic for targeted search terms. Search traffic is typically far more valuable than other traffic sources because it is so targeted. But non-search traffic is perhaps the single most reliable sign of quality. As Google controls a larger portion of the overall traffic flow across the web, they risk creating self-fulfilling prophecies where low quality sites continue to rank only because they already rank.

If you were Google, and discovered that 98% of a site's traffic comes from Google.com, might you want to give that site a bit less exposure? I would. Maybe those algorithms do not exist now, but eventually they could.

If you have a site that earns far beyond your living costs, and it is almost entirely reliant on search for income, then one of the best moves you can make for the sustainability of that site is to lower the percentage of traffic that comes from search by creating other traffic sources. The other traffic sources may not be as profitable on a CPM basis, but as you diversify you lower your risks. It doesn't matter how the algorithms shift if your site is strong in every signal of trust they could possibly measure.
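A rough way to watch that percentage is to scan your logs for search engine referrers. The Python sketch below counts raw requests (not visits) from a combined-format access log, the log path is a placeholder, and the host list is just a starting point:

search_hosts = ('google.', 'yahoo.', 'msn.', 'ask.')   # crude substring checks

search, other = 0, 0
for line in open('access.log'):                        # placeholder log path
    parts = line.split('"')
    if len(parts) < 5:
        continue                                       # no referrer field on this line
    referrer = parts[3].lower()
    if referrer in ('-', '') or not any(h in referrer for h in search_hosts):
        other += 1                                     # direct or non-search referred request
    else:
        search += 1

if search + other:
    print('search share of requests: %.0f%%' % (100.0 * search / (search + other)))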

Buying Sites for Search Engine Ranking Domination

Frank mentioned this NYP article about how some companies are buying sites outright rather than increasing their AdWords bid prices. I expect this to be a large and growing trend for at least a couple of years. As Google gets more efficient at pricing the ads, they increase the value of the top ranked sites that sit alongside those ads. Internet Search Metrics, referred to in the NYP article as Internet Search Management, is providing audits of the competitive landscape of search:

ISM's audits track the top 4.5 million search phrases on Google and Yahoo!, a total of 7.3 billion searches a month, to determine which companies across 50 business sectors pop up most frequently in the top three or four positions in natural search. ...

The ISM audits, to be released in London, break down which of 50 business sectors are locked up - that is, have large chunks of natural search dominated by a handful of companies - and which are wide open.

I have not yet seen any of the reports, but the network is still young. If you love marketing, are in tune with web trends, and are well funded, I am guessing that many of the markets that appear locked up are still wide open.

Large Brands Double Dipping in Google's Organic Search Results

Subdomain Spam

Since Google has been over-representing site authority in their relevancy algorithms, many sites like eBay have begun abusing the hole through the use of infinite subdomains. These techniques not only affect branded search results, but also carry over to many other competitive keywords.

Creating Shadow Brands & Buying Top Ranked Competing Sites

While small businesses are worried about the risks of buying or renting a few links, some large corporations are launching shadow brands or buying out competing domains en masse. There are thousands if not millions of other examples, so it is unfair for me to point any out, but here are a few for the sake of argument.

  • Monster.com has a near unlimited number of education related domains, with a near identical user experience at almost all of them.

  • Bankrate has a double listing at #2 and #3 for mortgage calculator. They also own the #1 and #4 ranked sites, another listing further down the page, and some entries on page 2 as well.
  • Sallie Mae offers around 100 student loan brands.
  • How many different verticals does Yahoo! cover the Nintendo Wii in? Off the top of my head, at least 9: their brand universe, yahoo tech, yahoo shopping, yahoo news, yahoo directory, ask yahoo, yahoo answers, videogames.yahoo, games.yahoo, etc. (and that doesn't even count geolocal subdomains for answers, shopping, etc.)

What happened to result diversity? When and why did Google stop caring about that?

Is Buying Links Ethical?

Some people may report paid links, but the fact that there is a mechanism to do so shows how effective link buying is.

Why is buying links bad when using infinite subdomains or buying a bunch of sites is legitimate? Why is it OK for the WSJ to publish this type of content, but wrong for me to do whatever is necessary to compete in a marketplace cluttered with that information pollution?

The point here is not to say that big businesses are bad or doing anything wrong, but to show the stupidity Google relies on when scaremongering newer and smaller webmasters about the risks of buying a link here or there. The big businesses do all of the above, gain more organic links by being well known, and still buy links because the technique works. Whatever Google ranks is what people will create more of, so long as it is profitable to do so.

If you create a real brand, you can buy more links and be far spammier with your optimization at a lower risk, because Google has to rank your site or they lose market share. Create something that is best of breed and then market the hell out of it. If marketing requires buying a few links, then open up the wallet and get ready to rank.

Answer Engines

A friend named Brent sent me a link to the Cyc project page on Wikipedia and a background video on Google Video. Cyc is an AI project which aims to enable AI applications to perform human-like reasoning.

What happens to the value of your content when search engines get better at providing answers directly in the search results? Is your site the type of site they would like to cite, or does it fall further down the list on another category of queries? What can you do to make them more likely to want to source your site? Does your site have enough perceived trust and value to draw clicks after they put your content directly in the search results?

As search engines work harder at things like universal search, search personalization, and Cyc-like answer engines, any sites that are only facts and filler won't get much exposure.
