Gerard Salton and Early Search Engine Algorithms

Tom Evslin posted about his experiences working with Gerard Salton in the early 1960s:

Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax (which is very hard to do). WRONG! Turns out that syntax was a waste of time; all that matters is semantics - the actual words used in the query and the documents - not how they relate to each other in a sentence. Sometimes it was (and still is) useful to search for phrases as if they were words. But you get that just by observing word order or how close words are to each other - not trying to parse sentences.

Modern search engines may use quite a large amount of user tracking and heavily emphasize linkage data, but if you want to see the roots of search I highly recommend reading Salton's A Theory of Indexing.
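To make the point in the quote concrete, here is a minimal sketch of word-based matching with no sentence parsing at all. It is not Salton's actual system: the function names (tokenize, term_score, phrase_match) and the max_gap parameter are my own assumptions, and the scoring is a plain word count rather than any real weighting scheme.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_score(query, document):
    """Sum how often each distinct query word occurs in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

def phrase_match(phrase, document, max_gap=1):
    """True if the phrase words occur in order, each one at most
    max_gap positions after the previous word (1 = adjacent)."""
    words = tokenize(phrase)
    tokens = tokenize(document)
    if not words:
        return False
    # Positions where a match of the phrase so far could end.
    ends = {i for i, tok in enumerate(tokens) if tok == words[0]}
    for word in words[1:]:
        ends = {
            i for i, tok in enumerate(tokens)
            if tok == word and any(0 < i - e <= max_gap for e in ends)
        }
        if not ends:
            return False
    return bool(ends)

# Hypothetical sample document, only for illustration.
doc = "Salton built early information retrieval systems for retrieval of documents"
print(term_score("retrieval systems", doc))
print(phrase_match("information retrieval", doc))
print(phrase_match("early retrieval", doc, max_gap=2))
```

Phrases fall out of word order and distance alone: raising max_gap turns an exact phrase search into a proximity search without any grammar involved.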

Privacy, User Data, Trust and Marketing

I wish I could add more to Danny's excellent coverage of the government's bogus overarching power grab for data from search engines, but I can't, so I just want to parrot it. :)

The US government requested non-personally identifiable search data from AOL, Google, MSN and Yahoo! in an effort to evaluate how often children might find porn on the web. Everyone but Google handed it over. The US government is now suing Google.

The stock market punished Google heavily on this and other news, with the stock dropping from about $470 to $399 a share last week. While Google may have wanted to keep the data for trade secret reasons, they also win a ton of user trust by being the only company that said no to the request.

Compare their position to Microsoft's. Only after Google made this request an issue by denying it did news come out that other search companies, like MSN, handed over data last summer.

How did MSN's recent post make them look?

A prime opportunity was missed last summer. Back then there was a chance to come out at a time when Google was being pounded over privacy concerns and stand up to the government instead of folding like a cheap lawn chair and working out some technical response that we would only learn about months later when the heat was on and they had to say something. Shameful, really.

As a person who likes search this lawsuit makes me wish I was a bit smarter so I could work at Google.

As a marketer I think Google being the only one doing what they are doing is a great thing for them.

  • This heavily undermines the "Google can't be trusted with data" meme.
  • By being the content in the news they raise their brand exposure. If you ARE the content that people are talking about, advertising is not needed to gain market share.
  • By standing up against the government they gain user trust. It is going to be hard for a competitor to build an ad demand network of Google's scale while also trying to build that much trust at the same time.

I think this incident enhances Google's implied value, as it will surely increase their market share.

The Dartmouth - The New Spamford Daily

Ho hum...anyone looking for poker links? It looks like The Dartmouth can still fit a few more below the fold.

Leveraging Google Homepage Extensions

I recently took a gander at the Google Modules site and saw a few great extensions. Some of them are a bit random and don't apply to me, but many of them were cool, like the to do list or the Technorati Mini extension which searches for SeoBook.com citations once a minute. (Please note that to track your own blog you have to view source, copy it, change the s= URL to your own URL, upload it to your server, then add it to your customized Google home page. Niall probably should have made the URL to track something you could enter after you uploaded it.)
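If you would rather not edit the s= value by hand, the swap can be scripted. This is only a sketch: I am assuming the tracked site sits in a query parameter named s inside some URL in the module source, and the example URL below is made up, not the real Technorati Mini address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def retarget(url, tracked_site, param="s"):
    """Return url with the value of the query parameter `param`
    replaced by tracked_site; other parameters are left alone."""
    parts = urlsplit(url)
    query = [
        (key, tracked_site if key == param else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Hypothetical module URL, only for illustration.
original = "http://example.com/mini?s=seobook.com&n=5"
print(retarget(original, "myblog.example"))
```

You would still need to paste the rewritten URL back into your copy of the module before uploading it.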

Google is going to use many vertical databases to structure information. They also are going to allow users to create their own home pages as they see fit.

I believe one of the extensions was for horse racing. Getting links or visitors into a horse racing site is probably not a cheap and easy task, but imagine the lead value of a customer who loves horse racing so much that they have to be able to access the latest odds from their home page.

If your extension is cool enough it may provide direct traffic AND link popularity. Those who care about something enough to customize their home page for it are likely the same type of people who would also have websites and tell friends what they put on their home page.

I have not looked through all the extensions yet, but creating free extensions is perfect for concert ticket brokers, exotic travel sites, currency exchange sites, or other sites that provide a free, useful service.

Even if you provide a boring service you may get a few additional citations by spending 10 minutes creating a free Google Modules XML extension. The same can be said for browser extensions (think Mozdev) or other similar free distribution channels.

Retail Only Matters if You Have Reach and People are Buying...

Some people who make software are convinced that they are giving it away and losing money if they let anyone try it out. But the retail price only matters if people see it and think it is worth spending money on. The shadier your software is, the more of a viral buzz you need to make the marketing work.

A guy contacted me wanting me to promote his blog spam software for free. When I suggested advertising on Threadwatch and giving the software out to members for a day or a week he trumped up the value of his software, which makes me wonder why he had to ask me for free viral marketing if his software was actually worth $197 and already selling well.

If your software / information product / etc. has little to no incremental cost per user and is brand new, you are not losing money by giving it away in exchange for market exposure. Two years ago I gave away the first version of SEO Book. The first version really was not all that good, but I realized that feedback had value and I should spread it far and wide to get whatever feedback I could get.

PageRank Search Engine

While search has in many ways moved past raw PageRank scores, there is a newish SEO tool called PRASE which grabs the top search results from Google, Yahoo! and MSN and then sorts them in order of PageRank. You can also set PageRank limits. One thing that sucks with the current tool is that it does not allow you to expand the depth to get any more than 10 results from any engine at one time.

Personally I would find the tool more interesting for hunting down high ranking low PageRank sites than for finding high PageRank sites.
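The idea is simple enough to sketch. This is not how PRASE itself is built, which I have not seen; the Result class, both function names, and the sample data are assumptions for illustration, and the PageRank values are typed in by hand rather than fetched from anywhere.

```python
from dataclasses import dataclass

@dataclass
class Result:
    engine: str    # which engine returned the page
    position: int  # 1 = top result on that engine
    url: str
    pagerank: int  # toolbar-style 0-10 score, supplied by the caller

def sort_by_pagerank(results, min_pr=0, max_pr=10):
    """Keep results inside the PageRank limits, highest PageRank first."""
    kept = [r for r in results if min_pr <= r.pagerank <= max_pr]
    return sorted(kept, key=lambda r: (-r.pagerank, r.position))

def high_rank_low_pagerank(results, max_position=3, max_pr=3):
    """Pages that rank near the top despite a low PageRank."""
    hits = [
        r for r in results
        if r.position <= max_position and r.pagerank <= max_pr
    ]
    return sorted(hits, key=lambda r: (r.position, r.pagerank))

# Made-up sample data, only for illustration.
sample = [
    Result("google", 1, "http://example.com/a", 4),
    Result("google", 2, "http://example.com/b", 7),
    Result("yahoo", 1, "http://example.com/c", 2),
    Result("msn", 5, "http://example.com/d", 1),
]

for r in sort_by_pagerank(sample):
    print(r.pagerank, r.engine, r.position, r.url)
for r in high_rank_low_pagerank(sample):
    print("low PR, high rank:", r.url)
```

The second function is the view I would actually want: pages that rank well without much PageRank are the ones worth studying.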

Via Text Link Brokers

Buying and Selling Domains

Not something I know much about yet, but recently there have been a number of good posts on buying and selling domains.

Email Spammers Killing Default Language

Oilman recently reported on email spammers making his happy new year less happy. I have recently received a few penny stock emails with subjects like Delivery Status Notification (Failure).

What happens when email spam gets more targeted and starts looking more personal? Will we change the default error messages and common words we use, or will we have to add anti-spam phrases, or how will we automate blocking it?

Becoming a Scientific SEO

One of the best things I ever learned in the Navy was troubleshooting and half-splitting problems into smaller possible problems. I recently did a bit of microspamming stuff to see what would get nailed and what would not, although I have only tested like 0.0000001% of the market. I want to start focusing more of my efforts on learning how to become a scientific SEO.

I have not built a ton of for profit sites yet, but the likes of Andy Hagans and a few of my other friends have been wearing me down into becoming more of a blog overlord / many for profit site owner.

There are really two main ways to do SEO:

  • Manually: Create world class content that is published frequently.
  • Automated: Buy and/or build sites that look good to search algorithms and search reviewers even if they are a bit automated.

People tend to dismiss the word automation as meaning it has to be spam, but I can't tell you how many times I have heard people like Mikkel talk about how many of the most popular websites are heavily automated.

Google, Google News, Digg, and Memeorandum are a few of the many automated sites I use on a daily basis. I also think it is pretty hypocritical for those creating automated websites to push the image of automation as being associated with spam.

I guess the ultimate goal to create a money printing machine would be to create content that is:

  • useful and value added (needs to pass the Turing Test and be citation, bookmark, and subscription worthy)
  • unique (so duplicate content filters do not catch it)
  • profitable
  • nearly 100% automated

I am pretty much starting from scratch on the above autogen idea, but friends have left me tips here and there. I hired a cool programmer who is working away at creating value added websites. It should be fun.

In some areas I am partnered with friends who are all about making money, but as much as anything I want to watch and understand how search evolves on many levels. You really can't be a true scientific SEO unless you have some automated content you are working with.

Recent SEO Interviews, Etc.

Lee Odden interviews Stuntdubl and Lee gets interviewed here

John Battelle speaks to Google NYC

While I have grown to hate SEO contests, I am currently the only advertiser for the phrase on Google AdWords, and here is a free link for my brother: v7ndotcom elursrebmem. He is going to have to be a bit more innovative than that if he actually wants to win, though. I really would like to see Graywolf win the v7ndotcom elursrebmem contest.
