The $10,000 Robots.txt File...Ouch!

I recently changed one of my robots.txt files, pruning duplicate content pages to help more of the internal PageRank flow to the higher quality and better earning pages. In the process I forgot that one of the most well linked to pages on the site had a URL similar to the noisy pages. About a week ago the site's search traffic halved (right after Google was unable to crawl and index the powerful URL). I fixed the error pretty quickly, but the site now has hundreds of pages stuck in Google's supplemental index, and I am out about $10,000 in profit for that one line of code! Both Google and Yahoo! support wildcards, but you really have to be careful when changing a robots.txt file, because a line like this
Disallow: /*page
also blocks a file like this from being indexed in Google
beauty-pageants.php

Unless you are thinking of that in advance, it is easy to make a mistake.
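To see why, here is a minimal sketch (assuming Python 3) that approximates Google-style wildcard matching with a regular expression, using the pattern and file name from the example above. It is only an approximation of how the engines match, not their actual parsers:

import re

def google_style_match(disallow_pattern, path):
    # Approximate the extended syntax: '*' matches any run of
    # characters, and the rule applies from the start of the path.
    regex = '^' + re.escape(disallow_pattern).replace(r'\*', '.*')
    return re.match(regex, path) is not None

print(google_style_match('/*page', '/beauty-pageants.php'))  # True - blocked
print(google_style_match('/*page', '/contact.php'))          # False - still crawlable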

If you are trying to prune duplicate content for Google and are fine with it ranking in other search engines, you may want to make those directives specific to Googlebot. If you create a directive for a specific robot, that bot will ignore your general robots directives in favor of the more specific directives you created for it.
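For instance, a robots.txt along these lines (the /duplicate/ path is just a placeholder) would keep those pages out of Google while leaving them open to other engines, since Googlebot follows its own section once one exists:

User-agent: Googlebot
Disallow: /duplicate/

User-agent: *
Disallow: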

Google's webmaster guidelines and Yahoo!'s Search Blog both offer tips on how to format your robots.txt file.

Google also offers a free robots.txt test tool, which allows you to see how robots will respond to your robots.txt file, notifying you of any files that are blocked.

You can use Xenu Link Sleuth to generate a list of URLs from your site, then upload that URL list to the Google robots.txt test tool (currently in 5,000 character chunks...an arbitrary limit I am sure they will eventually lift).
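If you want a rough local pre-check before pasting URLs into the tool, something like the following sketch works. It uses Python's standard robots.txt parser, which only understands plain prefix rules (not the wildcard extensions discussed above), and the file names urls.txt and robots.txt are just placeholders for your Xenu export and a local copy of your robots.txt file:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
with open('robots.txt') as f:          # local copy of your robots.txt
    rp.parse(f.read().splitlines())

with open('urls.txt') as f:            # URL list exported from Xenu
    for url in f:
        url = url.strip()
        if url and not rp.can_fetch('Googlebot', url):
            print('Blocked:', url)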

Inside the webmaster console Google will also show you which pages are currently blocked by your robots.txt file, and let you see when Google tried to crawl a page and found it was blocked. Google also shows you which pages return 404 errors, which might be a good way to see if you have any internal broken links or external links pointing at pages that no longer exist.

Published: July 20, 2007 by Aaron Wall in seo tips

Comments

Jason W
July 20, 2007 - 9:41am

That totally sucks! I'm going to be careful when making changes to that file. I always have to triple check when I'm doing anything there or in .htaccess.

Glad you caught that before it was more like a $50k mistake, cause that would reeeeeaallly make you mad.

On the other hand, I wouldn't mind having a single site that made $10k in such a short period of time.

July 20, 2007 - 12:13pm

You also need to be careful because Google's and Yahoo!'s support for wildcards is not identical. For example, different bots can handle ? in different ways.

Also keep in mind that Google, at least, has a robots.txt length limitation (around 5,000 bytes).

July 20, 2007 - 12:39pm

Shit happens, but the most important thing is that now you've got one more lesson! This is priceless!

Regards,
William

July 20, 2007 - 1:09pm

Wow, sorry to hear that.

Thanks for putting out such a good description of what you can (and should) do to filter out the less valuable pages of a website.

SEOesa
July 20, 2007 - 1:12pm

Hi,

Maybe I'm wrong, but I think the 5,000 character limitation only applies to the robots.txt validation tool. The White House's robots.txt, for instance, has more characters than that.

And thanks for your blog!

July 20, 2007 - 1:26pm

I did the same on my blog and didn't have any problems... Of course, there's no such page giving me that $10k...

Blogs And Bucks
July 20, 2007 - 3:23pm

I know this is tricky. That's why we always have to check Google Webmaster Tools to see if the crawler has been restricted from any pages that we wanted indexed. Sometimes if you put a restriction like

Disallow: /search

just to tell the bot not to crawl your search pages, it will also restrict any page whose URL path begins with "search", such as http://www.yoursite.com/search-domain-name.html
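A quick check with Python's standard robots.txt parser (just a rough sketch, using the example URL above) shows that prefix match in action:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /search'])

# The rule is a prefix match, so the article-style page is blocked too.
print(rp.can_fetch('*', 'http://www.yoursite.com/search-domain-name.html'))  # False
print(rp.can_fetch('*', 'http://www.yoursite.com/search/'))                  # False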

I wrote a similar article a few days ago, in case you guys are interested:

http://blogsandbucks.com/use-robotstxt-file-correctly-for-your-blog/

Sam I Am
July 30, 2007 - 4:31pm

If you are disallowing a directory, wouldn't it be smarter to do this:

Disallow: /*directory/

That wouldn't affect pages named directory-something.html, would it?

Rakhi
July 20, 2007 - 3:25pm

Really good!! Good source of information.

July 20, 2007 - 3:26pm

Yeah, another reason why I keep a very simple robots.txt and never change it!

July 20, 2007 - 4:14pm

Well, I can see the funny side of that. Have you noticed that Shoemoney has ditched your "improvements" on his robots.txt?

July 20, 2007 - 4:41pm

Or chalk it up as $10k spent on a good topic to blog about. :)

July 20, 2007 - 6:30pm

Once you have a robots.txt directive to block certain pages - in your experience - how long does it take for that to be noticed and effective?

I've had pages disallowed for months that are still in the SERPS and I'm wondering what that means.

July 20, 2007 - 7:23pm

Wow! Great info to know.

July 20, 2007 - 8:21pm

All I can say is ouch! One page earning $10k...

Two questions:

How much do you earn from this website alone?

And which page is the one that earns you $10K?

July 20, 2007 - 9:40pm

Funny you mention that. I recently did something very similar and was kicking myself for being such an idiot. That will teach me to edit code at 3AM :-)

Yi Lu
July 20, 2007 - 10:55pm

Is the robots.txt file's definition and purpose covered in your ebook, Aaron? If so, I must be blind x_x

July 21, 2007 - 4:10am

Hi Yi Lu
I briefly mention robots.txt in my ebook, but I don't go too deep into using it aggressively because it is so easy to mess up (as I accidentally did above).

How much do you earn from this website alone? And which page is the one that earns you $10K?

I can't disclose the specific earnings of that site. Keep in mind that I never said the one page made $10K...just that it had lots of link equity. That link equity helped power the crawl depth of the site and helped other pages on the same site rank better.

Once you have a robots.txt directive to block certain pages - in your experience - how long does it take for that to be noticed and effective?

It depends on the crawl priority of that site and that page in question, as well as where they are in their crawl cycle when you do it.

I think I made this error about 3 weeks ago and Google started reacting to it about a week ago.

As far as how long it will take to correct goes, that depends on the same factors mentioned above, plus how long it takes Google to discover and trust the link equity pointing at the rest of the site, and reassign those pages to the primary index rather than the supplemental index.

August 21, 2007 - 11:59pm

This is really "Ouch"! Thank you for the great tips btw.

July 22, 2007 - 7:14am

I just put up a file named robots.txt without putting any code in it... This is a very helpful tip. Thanks a lot.

July 23, 2007 - 2:18am

Aaron:

You should consider a reinclusion request for that page via Webmaster Central. I heard Matt Cutts speak the other day and he mentioned that Google recrawls the robots.txt file only every couple hundred visits to a site.

Jonah

July 23, 2007 - 4:20pm

I don't understand why the robots.txt error described in this post would cause a lot of the pages in the site to go supplemental.

Are you saying that the page was so important that when you accidentally blocked it from being crawled, the internal links from that page caused other pages to go supplemental? Is that the rationale?

Vijay Teach Me
July 23, 2007 - 7:27pm

Ouch Ouch $$...
Aaron learned that lesson at a cost of $10,000 and is gracious enough to warn us to watch out for it...

Thanks Aaron
Vijay

July 23, 2007 - 8:40pm

Hi Philip
Yes...much of the site's link equity went into that one page. And when it went away, so did much of the site's link equity, so many of the pages went supplemental. It is quite a large site too, so that link equity was important.

Mariano
July 25, 2007 - 5:11pm

I forgot to take out a noindex tag. It took about two months to get back to normal traffic levels. I filed a reinclusion request too as a precaution (I haven't gotten a response).

Drupalzilla
October 7, 2007 - 8:04am

Google, Yahoo!, and MSN all support the end-of-string character ($), so for those engines you could use:

Disallow: /*page$

And that would only match example.com/my-page and not example.com/my-pagerank or example.com/my-page/
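A quick regex sketch (assuming Python 3, and only an approximation of the engines' actual matching) shows the difference the anchor makes:

import re

def matches(disallow_pattern, path):
    # '*' matches any run of characters; '$' anchors the rule to the
    # end of the path.
    regex = '^' + re.escape(disallow_pattern).replace(r'\*', '.*').replace(r'\$', '$')
    return re.match(regex, path) is not None

print(matches('/*page$', '/my-page'))      # True - blocked
print(matches('/*page$', '/my-pagerank'))  # False
print(matches('/*page$', '/my-page/'))     # False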

There are extensive Drupal-specific robots.txt examples on Drupalzilla.com...

NotYourMamma
November 16, 2010 - 4:42pm

Still as important today as it was a few years ago?

November 17, 2010 - 12:05am

If you use it incorrectly or have an error in it, then yes, it can still be super harmful :)

macfab
December 12, 2010 - 6:38pm

Unless you have a friend at Big "G", everything seems to react slowly when it comes to subtle changes. It can take five minutes or five months to get a link indexed, and then it can be gone in just a second. If you depend on organic traffic for your livelihood, these can be some severe learning curves.

HonestCarpetCleaning
June 1, 2012 - 8:56am

This is a great post. I just wish I had read it 6 months ago.
