I recently changed one of my robots.txt files to prune duplicate content pages, so that more of the internal PageRank would flow to the higher quality and better earning pages. In the process, I forgot that one of the most well linked to pages on the site had a URL similar to those of the noisy pages. About a week ago the site's search traffic halved (right after Google was unable to crawl and index the powerful URL). I fixed the error pretty quickly, but the site now has hundreds of pages stuck in Google's supplemental index, and I am out about $10,000 in profit for that one line of code!
Both Google and Yahoo support wildcards, but you really have to be careful when changing a robots.txt file, because a line like this
Disallow: /*page
also blocks a file like this from being indexed in Google
beauty-pageants.php
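The * wildcard matches any sequence of characters, so that rule blocks every URL whose path contains "page", and beauty-pageants.php does. Unless you think of that in advance, it is easy to make a mistake.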
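The over-matching is easy to reproduce offline. Here is a minimal Python sketch, assuming the commonly documented wildcard semantics (* matches any run of characters, a trailing $ anchors the end of the URL, and rules match from the start of the path). The helper names rule_to_regex and is_blocked, the narrower rule, and every test path other than beauty-pageants.php are made up for illustration; this is not how any search engine actually implements matching.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule using * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with ".*" where each * was.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule, path):
    """Return True if the Disallow rule matches the URL path."""
    # re.match anchors at the start of the path, as robots.txt rules do.
    return rule_to_regex(rule).match(path) is not None


# The rule from the post catches the page it was never meant to catch.
print(is_blocked("/*page", "/beauty-pageants.php"))    # True
print(is_blocked("/*page", "/products?page=2"))        # True
print(is_blocked("/*page", "/contact.php"))            # False

# A hypothetical narrower rule that targets only a paging query parameter.
print(is_blocked("/*?page=", "/beauty-pageants.php"))  # False
print(is_blocked("/*?page=", "/products?page=2"))      # True
```

Running your most important URLs through a check like this before uploading a new robots.txt is a cheap way to catch a rule that matches more than you intended.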
If you are trying to prune duplicate content for Google and are fine with it ranking in other search engines, you may want to make those directives specific to Googlebot. If you make a directive for a specific robot, that bot will ignore your general robots directives in favor of following the more specific directives you created for it.
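The "specific group replaces the general group" behavior can be seen with Python's standard urllib.robotparser module. This is only a sketch: the /print/ and /private/ paths and the SomeOtherBot name are hypothetical, and the standard library parser does simple prefix matching without wildcard support, so it illustrates group selection rather than Google's full matching rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one general group, one group just for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows only its own group, so the general /private/ rule
# no longer applies to it.
print(parser.can_fetch("Googlebot", "/print/page.html"))       # False
print(parser.can_fetch("Googlebot", "/private/page.html"))     # True

# Any other bot falls back to the general group.
print(parser.can_fetch("SomeOtherBot", "/print/page.html"))    # True
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # False
```

The practical consequence is that anything you still want blocked for Googlebot has to be repeated inside the Googlebot group.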
Google also offers a free robots.txt test tool, which allows you to see how robots will respond to your robots.txt file, notifying you of any files that are blocked.
You can use Xenu's Link Sleuth to generate a list of URLs from your site. Upload that URL list to the Google robots.txt test tool (currently in 5,000 character chunks...an arbitrary limit I am sure they will eventually lift).
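Splitting a long URL list by hand gets tedious, so here is a small Python sketch that groups URLs into chunks that fit under the limit. It assumes the limit counts every character including the newlines between URLs, and the function name chunk_urls and the example.com URLs are hypothetical.

```python
def chunk_urls(urls, max_chars=5000):
    """Group URLs into newline-joined chunks of at most max_chars characters.

    A single URL longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for url in urls:
        # The extra 1 is the newline joining this URL to the previous one.
        extra = len(url) + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(url)
        current.append(url)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


# Example with made-up URLs standing in for a crawler export.
urls = ["http://www.example.com/page-%d.html" % i for i in range(500)]
chunks = chunk_urls(urls)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk can then be pasted into the test tool in turn, and no URL is split across two chunks.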
Inside the webmaster console Google will also show you what pages are currently blocked by your robots.txt file, and let you view when Google tried to crawl the page and noticed it was blocked. Google also shows you what pages are 404 errors, which might be a good way to see if you have any internal broken links or external links pointing at pages that no longer exist.
Thanks for putting out such a good description of what you can (and should) do to filter out the less valuable pages of a website.
SEOesa
July 20, 2007 - 1:12pm
Hi,
Maybe I'm wrong, but I think the 5,000 character limit applies only to the robots.txt validation tool. The robots.txt of The White House, for instance, has more characters than that.
I did the same on my blog and didn't have any problem... Of course, there's no page giving me that 10k...
Blogs And Bucks
July 20, 2007 - 3:23pm
I know this is tricky. That's why we always have to check Google Webmaster Tools to see whether the crawler has been restricted from any pages that we wanted indexed. Sometimes if you add a restriction like
Disallow: /search
just to tell the bot not to crawl your search pages, it will also restrict any page whose URL path begins with "/search". For example: http://www.yoursite.com/search-domain-name.html
I wrote a similar article a few days ago, in case you guys are interested.
Funny you mention that. I recently did something very similar and was kicking myself for being such an idiot. That will teach me to edit code at 3AM :-)
Yi Lu
July 20, 2007 - 10:55pm
Is this robots.txt file definition and purpose located in your ebook, Aaron? If so, I must be blind x_x
Hi Yi Lu
I briefly mention robots.txt in my ebook, but I don't go too deep into using it too aggressively because it is so easy to mess it up (as I accidentally did above).
How much do you earn from this website alone? And which page is the one that earns you 10K?
I can't disclose the specific earnings of that site. Keep in mind that I never said that the one page made 10K...just that it had lots of link equity. That link equity helped to power the crawl depth of the site and help other pages on the same site rank better.
Once you have a robots.txt directive to block certain pages - in your experience - how long does it take for that to be noticed and effective?
It depends on the crawl priority of that site and that page in question, as well as where they are in their crawl cycle when you do it.
I think I made this error about 3 weeks ago and Google started reacting to it about a week ago.
As far as how long it will take to correct goes, that depends on the same factors mentioned above, plus how long it takes Google to discover and trust the link equity pointing at the rest of the site, and reassign those pages to the primary index rather than the supplemental index.
You should consider a reinclusion request for that page via webmaster central. I heard Matt Cutts speak the other day and he mentioned that Google recrawls the robots.txt file only every couple hundred visits to a site.
I don't understand why the robots.txt error described in this post would cause a lot of the pages in the site to go supplemental.
Are you saying that the page was so important that when you accidentally blocked it from being crawled, losing the internal links from that page caused other pages to go supplemental? Is that the rationale?
Vijay Teach Me
July 23, 2007 - 7:27pm
Ouch Ouch $$...
Aaron learned that the hard way, at a cost of $10,000, and he is gracious enough to warn us to watch out.....
Hi Philip
Yes...much of the site's link equity went into that one page. And when it went away so did much of the site's link equity, thus many of the pages went supplemental. It is quite a large site too, so that link equity was important.
Mariano
July 25, 2007 - 5:11pm
I forgot to take out a noindex header tag. It took about two months to get back to normal traffic levels. I filed a reinclusion request too as a precaution (I haven't gotten a response).
Unless you have a friend at Big "G" - everything seems to react slowly when it comes to subtle changes. It can take five minutes or five months to get a link indexed, and then it can be gone in just a second. If you depend on organic traffic for your livelihood, these can be some severe learning curves.
This is a great post. I just wish I had read it 6 months ago.