The Future Of Search

Interesting news item about the future of search.

Analyst Sue Feldman presented her views to the Enterprise Search Summit West.

Key points:

  • A convergence of tools in search.
  • Move away from today's transaction based platform towards a knowledge platform.
  • Improved capabilities in terms of concepts, relationships, and modes of communication, including speech.
  • One problem that needs solving is selection: which information do you trust?
  • Getting the right information to the right people at the right time.
  • Move from transactional computing to user-centric interaction models. See my earlier post about relationships.
  • More automation of knowledge work across multiple devices.
  • Search will eventually be embedded in the platforms and applications, as opposed to a separate function.
  • Search will be at the center of interactive computing as search is language based - the human mode of communication.
  • Full post here.

History of Modern Search Technology - 1945 to Google

I recently updated my article about search engine history.

Any and all feedback is appreciated.

Making Work a Game

In Human Computation, Luis von Ahn talks about how the Google Image Labeler turns work into a game, and how you can enhance that information further with a game like Peekaboom.

How many cool things will people do on the web for arbitrary points? And are the points actually arbitrary if they make people happy :)

.htaccess, 301 Redirects & SEO: Guest Post by NotSleepy

Tony Spencer here, doing a guest spot on SEOBook. Aaron was asking me some 301 redirect questions a while back and recently asked if I would drop in with some tips on common scenarios, so here goes. Feel free to drop me any questions in the comments box.

301 non-www to www

From what I can tell, Google has yet to clean up the canonicalization problem that arises when the www version of your site gets indexed along with the non-www version (i.e. http://www.seobook.com and http://seobook.com).

<code>
RewriteEngine On

RewriteCond %{HTTP_HOST} ^seobook\.com$ [NC]
RewriteRule ^(.*)$ http://www.seobook.com/$1 [L,R=301]
</code>

The '(.*)$' says that we'll take anything that comes after http://seobook.com and append it to the end of 'http://www.seobook.com' (that's the '$1' part) and redirect to that URL. For more grit on how this works, check out a good regular expressions resource or two.

Note: You only have to enter 'RewriteEngine On' once at the top of your .htaccess file.

Alternately you may choose to do this 301 redirect in the Apache config file httpd.conf:

<code>
<VirtualHost 67.xx.xx.xx>
ServerName www.seobook.com
ServerAdmin webmaster@seobook.com
DocumentRoot /home/seobook/public_html
</VirtualHost>

<VirtualHost 67.xx.xx.xx>
ServerName seobook.com
RedirectMatch permanent ^/(.*) http://www.seobook.com/$1
</VirtualHost>
</code>

Note that web host control panels like cPanel often place a 'ServerAlias seobook.com' in the first VirtualHost entry, which would negate the second VirtualHost, so be sure to remove the non-www ServerAlias.

301 www to non-www

Finally, the 301 redirect from the www version to the non-www version would look like:

<code>
RewriteCond %{HTTP_HOST} ^www\.seobook\.com$ [NC]
RewriteRule ^(.*)$ http://seobook.com/$1 [L,R=301]
</code>

Redirect All Files in a Folder to One File

Let's say you no longer carry 'Super Hot Product' and hence want to redirect all requests for the folder /superhotproduct to a single page called /new-hot-stuff.php. This redirect can be accomplished easily by adding the following to your .htaccess file:

<code>
RewriteRule ^superhotproduct(.*)$ /new-hot-stuff.php [L,R=301]
</code>

But what if you want to do the same as the above example EXCEPT for one file? In the next example, all files in the /superhotproduct/ folder will redirect to the /new-hot-stuff.php file EXCEPT /superhotproduct/tony.html, which will redirect to /imakemoney.html:

<code>
RewriteRule ^superhotproduct/tony.html /imakemoney.html [L,R=301]
RewriteRule ^superhotproduct(.*)$ /new-hot-stuff.php [L,R=301]
</code>

Redirect a Dynamic URL to a New Single File

It's common that you'll need to redirect dynamic URLs with parameters to a single static file:

<code>
# RewriteRule patterns never see the query string, so match it with a RewriteCond
RewriteCond %{QUERY_STRING} ^id=
RewriteRule ^article\.jsp$ /latestnews.htm? [L,R=301]
</code>

In the above example, a request to a dynamic URL such as http://www.seobook.com/article.jsp?id=8932
will be redirected to http://www.seobook.com/latestnews.htm (the trailing '?' in the rule drops the old query string from the new URL).

SSL https to http

This one is more difficult, but I have experienced serious canonicalization problems
when the secure https version of my site was fully indexed alongside my http version. I have yet
to find a way to redirect https for the bots only, so the only solution I have for now is
to attempt to tell the bots not to index the https version. There are only two ways I know to do this, and neither is pretty.

1. Create the following PHP file and include it at the top of each page:

<code>
<?php
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    // Ask the engines not to index or follow the https copy of the page
    echo '<meta name="robots" content="noindex,nofollow">' . "\n";
}
?>
</code>

2. Cloak your robots.txt file.
If a request for robots.txt comes in over https from one of the known bots such as Googlebot, you will display:

User-agent: *
Disallow: /

Otherwise display your normal robots.txt. To do this you'll need to alter your .htaccess
file to treat .txt files as PHP or some other dynamic language, and then write
the cloaking code.
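
A rough sketch of that cloaking code might look like the following. This is just an illustration of the approach, not a drop-in solution: the bot list is only an example, and it assumes your .htaccess is already set up to run robots.txt through PHP.

<code>
<?php
// Hypothetical robots.txt handler (illustrative bot list only).
// Assumes .htaccess is already configured to hand robots.txt off to PHP.
header('Content-Type: text/plain');

$isSecure = isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$isKnownBot = strpos($ua, 'googlebot') !== false
    || strpos($ua, 'slurp') !== false
    || strpos($ua, 'msnbot') !== false;

if ($isSecure && $isKnownBot) {
    // Block the entire https version for the known bots
    echo "User-agent: *\n";
    echo "Disallow: /\n";
} else {
    // Serve your normal robots.txt rules
    echo "User-agent: *\n";
    echo "Disallow:\n";
}
?>
</code>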

I really wish the search engines would get together and add a new attribute to robots.txt
that would allow us to stop them from indexing https URLs.

Getting Spammy With it!!!

Ok, maybe you aren't getting spammy with it, but you just need to redirect a shit ton of pages. First of all, it'll take you a long time to type them into .htaccess; secondly, too many entries in .htaccess tend to slow Apache down; and third, it's too prone to human error. So hire a programmer and do some dynamic redirecting from code.

The following example is in PHP, but it's easy to do in any language. Let's say you switched to a new system and all URLs that end in the old product id need to be redirected. First create a database table that will hold the old id and the new URL to redirect to:

<code>
old_id INT
new_url VARCHAR(255)
</code>

Next, write code to populate it with your old ids and your new URLs.
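
A minimal sketch of that population step might look something like this (the mapping data here is purely hypothetical, and where you pull the old ids and new URLs from - a CSV export, the old database, whatever - depends on your own system):

<code>
<?php
// Hypothetical one-off script: load old_id => new_url pairs into the redirects table.
$mappings = array(
    8932 => 'products/new-hot-stuff.php',
    8933 => 'products/another-page.php',
);

$s = mysql_connect("localhost", "mydb_user", "password")
    or die("Couldn't connect to database server");
mysql_select_db("mydbname", $s)
    or die("Couldn't connect to database");

foreach ($mappings as $oldId => $newUrl) {
    $query = sprintf("INSERT INTO redirects (old_id, new_url) VALUES (%d, '%s')",
        (int) $oldId, mysql_real_escape_string($newUrl, $s));
    mysql_query($query, $s) or die(mysql_error($s));
}
mysql_close($s);
?>
</code>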

Next, add the following line to .htaccess:

<code>
RewriteRule ^product-(.*)_([0-9]+)\.php$ /redirectold.php?productid=$2 [L]
</code>

Then create the PHP file redirectold.php which will handle the 301:

<code>
<?php
function getRedirectUrl($productid) {
    // Connect to the database
    $dServer = "localhost";
    $dDb = "mydbname";
    $dUser = "mydb_user";
    $dPass = "password";

    $s = @mysql_connect($dServer, $dUser, $dPass)
        or die("Couldn't connect to database server");

    @mysql_select_db($dDb, $s)
        or die("Couldn't connect to database");

    // Cast to int so a doctored productid can't inject SQL
    $query = "SELECT new_url FROM redirects WHERE old_id = " . (int) $productid;
    $result = mysql_query($query);

    if (mysql_num_rows($result) == 0) {
        // No mapping found, so fall back to the home page
        $ret = 'http://www.yoursite.com/';
    } else {
        while ($row = mysql_fetch_array($result)) {
            $ret = 'http://www.yoursite.com/' . $row["new_url"];
        }
    }
    mysql_close($s);
    return $ret;
}

$productid = $_GET["productid"];
$url = getRedirectUrl($productid);

header("HTTP/1.1 301 Moved Permanently");
header("Location: $url");
exit();
?>
</code>

Now, all requests to your old URLs will call redirectold.php, which will look up the new URL and return an HTTP 301 redirect to your new URL.

Questions?

Ask them here and I'll do what I can.

Gerard Salton and Early Search Engine Algorithms

Tom Evslin posted about his experiences working with Gerard Salton in the early 1960's.

Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax (which is very hard to do). WRONG! Turns out that syntax was a waste of time; all that matters is semantics - the actual words used in the query and the documents - not how they relate to each other in a sentence. Sometimes it was (and still is) useful to search for phrases as if they were words. But you get that just by observing word order or how close words are to each other - not trying to parse sentences.

Modern search engines may use quite a large amount of user tracking and heavily emphasize linkage data, but if you want to see the roots of search I highly recommend reading Salton's A Theory of Indexing.

How to Create a Search Engine Tips

When I interviewed Matt Cutts, he stated that some people who want to know how search engines work might do well to create one. Here are some tips on how to do that.

Reading List on PageRank and Search Algorithms

A cool post with links to a variety of search research. I hope to have time to read all the referenced papers.

Greg Linden asks about the personalization vector - the probabilities of jumping to an unconnected page in the graph rather than following a link - which the paper briefly suggests could be determined from actual usage data:

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.

But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly?

Ho John Lee answers Greg's question here.
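
For anyone who hasn't looked at the math, here is a rough sketch (my own illustration, not code from either post) of where the personalization vector enters PageRank: the random surfer follows a link with probability d, and otherwise jumps to a page drawn from the personalization vector, which is where usage data could come in. The graph and jump probabilities below are purely hypothetical.

<code>
<?php
// Toy power-iteration PageRank with a personalization (jump) vector.
function personalizedPageRank(array $links, array $jumpProb, $d = 0.85, $iterations = 50) {
    $pages = array_keys($jumpProb);
    $rank = $jumpProb; // start from the personalization vector

    for ($i = 0; $i < $iterations; $i++) {
        $next = array();
        foreach ($pages as $p) {
            // "teleport" term: jump to $p with its usage-based probability
            $next[$p] = (1 - $d) * $jumpProb[$p];
        }
        foreach ($links as $from => $outlinks) {
            // each page passes its rank evenly along its outgoing links
            $share = $rank[$from] / count($outlinks);
            foreach ($outlinks as $to) {
                $next[$to] += $d * $share;
            }
        }
        $rank = $next;
    }
    return $rank;
}

$links    = array('a' => array('b'), 'b' => array('a', 'c'), 'c' => array('a'));
$jumpProb = array('a' => 0.6, 'b' => 0.2, 'c' => 0.2); // skewed toward page 'a'
print_r(personalizedPageRank($links, $jumpProb));
?>
</code>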

Google Robots.txt Wildcard

Not sure if I have seen this mentioned before. Dan Thies noticed Googlebot's wildcard robots.txt support:

Google's URL removal page contains a little bit of handy information that's not found on their webmaster info pages where it should be.

Google supports the use of 'wildcards' in robots.txt files. This isn't part of the original 1994 robots.txt protocol, and as far as I know, is not supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:

User-agent: Googlebot
Disallow: /*sort=

This would stop Googlebot from reading any URL that included the string &sort= no matter where that string occurs in the URL.

Good information to know if your site has recently suffered in Google due to duplicate content issues.

Dan also recently started an SEO coaching blog on his SEO Research Labs site.

New Google User Profiling Patent

Loren does a good rundown of a new Google patent, Personalization of placed content ordering in search results, in his Organic Results Ranked by User Profiling post. Some of the things in the patent may be a bit ahead of their time, but the thesis is...

GenericScore=QueryScore*PageRank.

This GenericScore may not appropriately reflect the site's importance to a particular user if the user's interests or preferences are dramatically different from that of the random surfer. The relevance of a site to a user can be accurately characterized by a set of profile ranks, based on the correlation between a site's content and the user's term-based profile, herein called the TermScore, the correlation between one or more categories associated with a site and the user's category-based profile, herein called the CategoryScore, and the correlation between the URL and/or host of the site and the user's link-based profile, herein called the LinkScore. Therefore, the site may be assigned a personalized rank that is a function of both the document's generic score and the user profile scores. This personalized score can be expressed as: PersonalizedScore=GenericScore*(TermScore+CategoryScore+LinkScore).
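
Restated as code (purely illustrative, using the patent's own score names as variables):

<code>
<?php
// Illustrative only: the patent's two scoring equations in code form
function personalizedScore($queryScore, $pageRank, $termScore, $categoryScore, $linkScore) {
    $genericScore = $queryScore * $pageRank;
    return $genericScore * ($termScore + $categoryScore + $linkScore);
}
?>
</code>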

For those big into patents: Stephen Arnold has a $50 CD for sale containing over 120 Google patent related documents.

I think he could sell that as a subscription service, so long as people didn't know all the great stuff Gary Price compiles for free. (Link from News.com)

Google vs Madison Avenue

Google looks like it wants to own Madison Avenue. The Journal also has a free article on Google vs Madison Ave., and John Battelle recently interviewed Google's Omid Kordestani and Sergey Brin.

If you look at the SEO Bytes monthly toplist you will see that in spite of a recent major Google update many of the most popular threads are about how to monetize Google AdSense ad space.

A year or two ago few of the threads covered monetizing content. It seemed like everyone just wanted to rank, or assumed nobody would share that kind of how-to-profit info. AdSense and similar programs work well for quality and automated sites alike.

While Google monetizes crap sites, they usually deny their connection to it, keeping the shadiness at arm's length while funding much of it.

Ask Jeeves is a bit closer in some of their relationships. A few days ago I noticed my mom's computer had some Ask MySearch type spyware activity on it. Sure, some of it may be uninstallable, but sometimes when you enter a URL in the address bar it says no site found, just to redirect you to ads. Shady.

While some say one bad AdSense site may bring down the whole Google AdSense program, it only took around an hour to approve my mom's new site for AdSense, so Google is not putting up much of a barrier to entry.

The more I read and learn about communities and click pimping, the less value I see in my current business model, especially when SEO is usually framed in a negative light and I have to deal with this sort of garbage. After all, even as Case is out, AOL is suddenly hot again, and some said Steve was just another spammer. :)
