Tony Spencer here doing a guest spot on SEOBook. Aaron was asking me some 301 redirect questions a while back and recently asked me if I would drop in for some
tips on common scenarios so here goes. Feel free to drop me any questions in the comments box.
301 non-www to www
From what I can tell Google has yet to clean up the canonicalization problem that arises when the www version of your site gets indexed along with the non-www version (i.e. http://www.seobook.com & http://seobook.com).
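A standard way to handle this in .htaccess, assuming Apache with mod_rewrite enabled (seobook.com stands in for your own domain):

```apache
RewriteEngine On
# If the host is the bare, non-www domain...
RewriteCond %{HTTP_HOST} ^seobook\.com$ [NC]
# ...301 the request to the same path on the www host.
RewriteRule ^(.*)$ http://www.seobook.com/$1 [R=301,L]
```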
The '(.*)$' says that we'll take anything that comes after http://seobook.com and append it to the end of 'http://www.seobook.com' (that's the '$1' part), then redirect to that URL. For more grit on how this works, check out a good regular expressions resource or two.
Note: You only have to enter 'RewriteEngine On' once at the top of your .htaccess file.
Alternately, you may choose to do this 301 redirect in the Apache config file, httpd.conf.
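A sketch of the httpd.conf approach, with one VirtualHost per hostname (the port and DocumentRoot path are illustrative):

```apache
<VirtualHost *:80>
    ServerName www.seobook.com
    DocumentRoot /home/seobook/public_html
</VirtualHost>

# Catch requests for the bare domain and 301 them to the www host.
<VirtualHost *:80>
    ServerName seobook.com
    Redirect permanent / http://www.seobook.com/
</VirtualHost>
```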
Note that web host control panels like cPanel often place a 'ServerAlias seobook.com' directive in the first VirtualHost entry, which would negate the following VirtualHost, so be sure to remove the non-www ServerAlias.
301 www to non-www
Finally, the 301 redirect from the www to the non-www version would look like:
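Again in .htaccess, assuming mod_rewrite:

```apache
RewriteEngine On
# If the host is the www domain...
RewriteCond %{HTTP_HOST} ^www\.seobook\.com$ [NC]
# ...301 the request to the same path on the bare domain.
RewriteRule ^(.*)$ http://seobook.com/$1 [R=301,L]
```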
Let's say you no longer carry 'Super Hot Product' and hence want to redirect all requests to the folder /superhotproduct to a single page called /new-hot-stuff.php. This redirect can be accomplished easily by adding the following to your .htaccess file:
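One way to write that rule (folder and file names from the example above):

```apache
RewriteEngine On
# Send anything under /superhotproduct to the single replacement page.
RewriteRule ^superhotproduct(/.*)?$ /new-hot-stuff.php [R=301,L]
```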
But what if you want to do the same as the above example EXCEPT for one file? In the next example, all files from the /superhotproduct/ folder will redirect to the /new-hot-stuff.php file EXCEPT /superhotproduct/tony.html, which will redirect to /imakemoney.html:
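A sketch of the rule pair; the ordering is what makes the exception work:

```apache
RewriteEngine On
# The exception must come first; [L] stops processing when it matches.
RewriteRule ^superhotproduct/tony\.html$ /imakemoney.html [R=301,L]
# Everything else under the folder goes to the generic page.
RewriteRule ^superhotproduct(/.*)?$ /new-hot-stuff.php [R=301,L]
```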
This one is more difficult, but I have experienced serious canonicalization problems when the secure https version of my site was fully indexed alongside my http version. I have yet to find a way to redirect https for the bots only, so the only solution I have for now is to attempt to tell the bots not to index the https version. There are only two ways I know to do this, and neither is pretty.
1. Create the following PHP file and include it at the top of each page:
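A minimal sketch of such an include (the exact markup is my own; the idea is just to emit a noindex meta tag whenever the page is served over https):

```php
<?php
// Include this inside the <head> of every page.
// When the request arrives over https, tell robots not to index this copy.
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    echo '<meta name="robots" content="noindex,nofollow" />';
}
?>
```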
2. Cloak your robots.txt file.
If a visitor comes from https and happens to be one of the known bots such as googlebot, you will display:
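That is, the standard block-everything rules:

```
User-agent: *
Disallow: /
```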
Otherwise display your normal robots.txt. To do this you'll need to alter your .htaccess file to treat .txt files as PHP (or some other dynamic language) and then proceed to write the cloaking code.
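A sketch of that cloaked robots.txt in PHP, assuming .htaccess is already mapping robots.txt through the PHP handler; the bot list here is illustrative, not exhaustive:

```php
<?php
// robots.txt served dynamically: block known bots on https,
// serve the normal permissive rules to everyone else.
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$is_bot   = (strpos($ua, 'googlebot') !== false ||
             strpos($ua, 'slurp') !== false ||
             strpos($ua, 'msnbot') !== false);
$is_https = isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on';

if ($is_https && $is_bot) {
    // Known bot on the https version: tell it to stay out entirely.
    echo "User-agent: *\nDisallow: /\n";
} else {
    // Normal robots.txt for everyone else.
    echo "User-agent: *\nDisallow:\n";
}
?>
```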
I really wish the search engines would get together and add a new attribute to robots.txt
that would allow us to stop them from indexing https URLs.
Getting Spammy With it!!!
Ok, maybe you aren't getting spammy with it, but you just need to redirect a shit ton of pages. First of all, it'll take you a long time to type them into .htaccess; secondly, too many entries in .htaccess tend to slow Apache down; and third, it's too prone to human error. So hire a programmer and do some dynamic redirecting from code.
The following example is in PHP but is easy to do in any language. Let's say you switched to a new system and all files that ended in the old id need to be redirected. First, create a database table that will hold the old id and the new URL to redirect to:
CREATE TABLE redirects (
  old_id INT NOT NULL,
  new_url VARCHAR(255)
);
Next, write code to populate it with your old ids and your new URLs.
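Then, at the top of the old entry script, look up the new URL and 301 there. A sketch, assuming the old URLs carry the id in a query string and using the table from the example above (the connection details are placeholders):

```php
<?php
// Look up the new URL for this old id and 301 there if we find one.
$old_id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$db = new mysqli('localhost', 'db_user', 'db_pass', 'db_name'); // placeholders
$stmt = $db->prepare('SELECT new_url FROM redirects WHERE old_id = ?');
$stmt->bind_param('i', $old_id);
$stmt->execute();
$stmt->bind_result($new_url);

if ($stmt->fetch()) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $new_url);
    exit;
}
// No match: fall through to a 404 or your normal handling.
?>
```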
Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax (which is very hard to do). WRONG! It turns out that syntax was a waste of time; all that matters is semantics -- the actual words used in the query and the documents -- not how they relate to each other in a sentence. Sometimes it was (and still is) useful to search for phrases as if they were words. But you get that just by observing word order, or how close words are to each other -- not by trying to parse sentences.
Modern search engines may use quite a large amount of user tracking and heavily emphasize linkage data, but if you want to see the roots of search I highly recommend reading Salton's A Theory of Indexing.
The paper also describes a personalization vector -- the probabilities of jumping to an unconnected page in the graph rather than following a link -- and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
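For reference, the personalized PageRank recurrence being discussed can be written as follows, where d is the damping factor, v is the personalization vector, and the sum runs over pages q that link to p:

```latex
PR(p) = (1 - d)\,v_p + d \sum_{q \to p} \frac{PR(q)}{\mathrm{outdeg}(q)}
```

Basing v (and possibly d) on observed usage is what the paper hints at; with v uniform you recover the classic random-surfer PageRank.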
But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly?
Not sure if I have seen this mentioned before. Dan Thies noticed Googlebot's wildcard robots.txt support:
Google's URL removal page contains a little bit of handy information that's not found on their webmaster info pages where it should be.
Google supports the use of 'wildcards' in robots.txt files. This isn't part of the original 1994 robots.txt protocol, and as far as I know, is not supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:
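Something along these lines (the pattern itself is illustrative):

```
User-agent: Googlebot
Disallow: /*&sort=
```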
This would stop Googlebot from reading any URL that included the string &sort= no matter where that string occurs in the URL.
Good information to know if your site has recently suffered in Google due to duplicate content issues.
This GenericScore may not appropriately reflect the site's importance to a particular user if the user's interests or preferences are dramatically different from those of the random surfer. The relevance of a site to a user can be accurately characterized by a set of profile ranks, based on: the correlation between a site's content and the user's term-based profile, herein called the TermScore; the correlation between one or more categories associated with a site and the user's category-based profile, herein called the CategoryScore; and the correlation between the URL and/or host of the site and the user's link-based profile, herein called the LinkScore. Therefore, the site may be assigned a personalized rank that is a function of both the document's generic score and the user profile scores. This personalized score can be expressed as: PersonalizedScore = GenericScore * (TermScore + CategoryScore + LinkScore).
For those big into patents: Stephen Arnold has a $50 CD for sale containing over 120 Google patent related documents.
I think he could sell that as a subscription service, so long as people didn't know all the great stuff Gary Price compiles for free. (Link from News.com)
If you look at the SEO Bytes monthly toplist you will see that in spite of a recent major Google update many of the most popular threads are about how to monetize Google AdSense ad space.
A year or two ago few of the threads covered monetizing content. It seemed like everyone just wanted to rank, or assumed nobody would share that how-to-profit info. AdSense and similar programs work well for quality and automated sites alike.
While Google monetizes crap sites, they usually deny their connection to them -- keeping the shadiness at a distance while funding much of it.
Ask Jeeves is a bit closer in some of their relationships. A few days ago I noticed my mom's computer had some Ask MySearch type spyware activity on it. Sure, some of it may be uninstallable, but sometimes when you enter a URL in the address bar it says no site found, just to redirect you to ads. Shady.