AI & Search Technology

Xan has a cool post:
if you are really interested in AI or search technology you should go read it.

Recently two different friends told me that if you want to be a good SEO you should think more like a search scientist than a webmaster, and Xan is surely trying to help us out with that ;)

TrustRank Algorithm

A buddy of mine pointed me to a white paper by Zoltan Gyongyi, Hector Garcia-Molina, & Jan Pedersen about a concept called TrustRank (PDF).

Human editors help search engines combat search engine spam, but reviewing all content is impractical. TrustRank places a core vote of trust on a seed set of reviewed sites to help search engines separate pages that would be considered useful from pages that would be considered spam. This trust is attenuated as it passes to other sites through links from the seed sites.

TrustRank can be used to

  • automatically boost pages that have a high probability of being good, as well as demote the rankings of pages that have a high probability of being bad.

  • help search engines identify which pages are good candidates for quality review

Some common ideas that TrustRank is based upon:

  • Good pages rarely link to bad ones. Bad pages often link to good ones in an attempt to improve hub scores.

  • The care with which people add links to a page is often inversely proportional to the number of links on the page.
  • Trust score is attenuated as it passes from site to site.

To select seed sites they looked for sites which link to many other sites. DMOZ clones and other similar sites turned up as many non-useful seed candidates.

Sites which were not listed in any of the major directories were removed from the seed set. Of the remaining sites, only those backed by government, educational, or corporate bodies were accepted as seed sites.

When deciding which sites to review, it is most important to identify high PR spam sites, since they are more likely to show up in the results and because it would be too expensive to closely monitor the tail.

TrustRank can be bolted onto PageRank to significantly improve search relevancy.
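
As a rough illustration of the trust propagation described above, here is a minimal Python sketch. It assumes TrustRank behaves like a PageRank variant whose random jump always lands on a seed page. The toy link graph, the page names, the trustrank function, and the 0.85 damping factor are all made up for illustration and are not taken from the white paper.

```python
def trustrank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outbound links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of hand-reviewed pages that receive the initial trust.
    Both the data layout and the damping value are assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # All of the "core vote of trust" is spread evenly over the seed set.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # The non-damped part of the trust always returns to the seeds.
        nxt = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Trust is attenuated by the damping factor and split across
            # outbound links, so pages with many links pass less per link.
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Hypothetical toy web: one reviewed seed, two good pages, two spam pages.
links = {
    "seed": ["good1", "good2"],
    "good1": ["good2"],
    "good2": ["seed"],
    "spam1": ["good1", "spam2"],  # bad pages often link to good ones
    "spam2": ["spam1"],
}
scores = trustrank(links, seeds={"seed"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

On this toy graph the spam pages end up with no trust at all, because no trusted page links to them, even though spam1 links out to a good page.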

Writing Tips, How to be a Consultant, IR Books, Ask Buys Bloglines

Writing:
Everything You Need to Know About Writing Successfully: in Ten Minutes

How to Be a Consultant:
Create The Warm Fuzzy Feeling™. Reading it certainly takes much longer than 10 minutes, but it is well worth it if you are considering becoming a consultant.

The list is great, but on the web / marketing front I would also add: create affiliate and content sites to help build a stable income stream for when down periods occur.

Even when you have few clients you can shore up your technical understanding by creating things. If you create great sites then they will make money and you will be able to better filter what work you are willing to take on. If you create lousy sites then they will make for great research and will help you identify symptoms of a lousy site when prospective customers contact you.

As stated in that article, it can't be stressed enough

  • how important it is to be easily available; &

  • how amazingly well syndicated articles act as sophisticated salesmen

found on SearchEngineBlog

Information Retrieval Books:
A while ago I read A Theory of Indexing by Gerard Salton. I have also heard good things about Information Retrieval by C. J. "Keith" van Rijsbergen, and Modern Information Retrieval by Ricardo Baeza-Yates & Berthier Ribeiro-Neto. What information retrieval technology books have you read and liked? I wonder if guys like GoogleGuy have a favorite IR book :)

Ask Jeeves Buys Bloglines?
it is what people are saying...

Trademark Laws:
Deregulating Relevancy in Internet Trademark Law

Would You Name Your PPC?
RipUsOff.com...just randomly came across it, and after seeing so many articles about click fraud it would appear as though that name could be taken the wrong way.

My Favorite Muppet:
Flying Gonzo, though the Cookie Monster is also cool.

Google Semantically Related Words & Latent Semantic Indexing Technology

Many people have been noticing a wide shuffle in search relevancy scores recently. Some of those well in the know attribute this to latent semantic indexing. Even if they are not using LSI, Google has likely been using other word relationship technologies for a while, but recently increased its weighting.

How Does Latent Semantic Indexing Work?
Latent semantic indexing allows a search engine to determine what a page is about outside of specifically matching search query text.

A page about Apple computers will likely naturally have terms such as iMac or iPod on it.

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent. source

By placing additional weight on related words in content, or words in similar positions in other related documents, LSI has a net effect of lowering the value of pages which only match the specific term and do not back it up with related terms.
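
To make the "many words in common" idea from the quote concrete, here is a small Python sketch. It is not LSI proper (real LSI factors a large term-document matrix); it only scores plain word overlap between documents using cosine similarity. The three sample documents and the cosine helper are invented for illustration.

```python
import math
from collections import Counter


def cosine(doc_a, doc_b):
    """Cosine similarity between two documents based on raw word counts."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only words that appear in both documents contribute to the score.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about Apple computers, one about apple pie.
apple_store = "apple imac ipod mac computers"
apple_review = "imac ipod review apple laptop"
fruit_pie = "apple pie recipe with cinnamon"

close = cosine(apple_store, apple_review)   # shares apple, imac, ipod
distant = cosine(apple_store, fruit_pie)    # shares only apple
print(round(close, 2), round(distant, 2))
```

The page that only matches the word "apple" scores much lower against the computer page than the page that backs the term up with related words such as iMac and iPod.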

LSI vs Semantically Related Words:
After being roasted by a few IR students and scientists I realized that many SEOs (like me) blended the concepts of semantically related words with latent semantic indexing, and that due to the constraints of the web it is highly unlikely that large scale search engines are using LSI on their main search indexes.

Nonetheless, it is overtly obvious to anyone who studies search relevancy algorithms by watching the results and ranking pages that the following are true for Google:

  • search engines such as Google do try to figure out phrase relationships when processing queries, improving the rankings of pages with related phrases even if those pages are not focused on the target term

  • pages that are too focused on one phrase tend to rank worse than one would expect (sometimes even being filtered out for what some SEOs call being over-optimized)
  • pages that are focused on a wider net of related keywords tend to have more stable rankings for the core keyword and rank for a wider net of keywords

Given the above, here are tips to help increase your page relevancy scores and make your rankings far more stable...

Mix Your Anchor Text!
Latent semantic indexing (or similar technologies) can also be used to look at the link profile of your website. If all your links are heavy in a few particular phrases and light on other similar phrases then your site may not rank as well.

Example Related Terms:
Many of my links to this site say "SEO Book" but I also used various other anchor text combinations to make the linkage data appear less manipulative.

Instead of using SEO in all the links some of them may use phrases like
search engine optimization
search engine marketing
search engine placement
search engine positioning
search engine promotion
search engine ranking
etc.

Instead of using book in all the links some other good common words might be
ebook
manual
guide
tips
report
tutorial
etc.

How do I Know What Words are Related?
There are a variety of ways to find out which words are related to one another.

  • Search Google for search results with related terms using a ~. For example, Google Search: ~seo will return pages with terms matching or related to seo and will highlight some of the related words in the search results.

  • Use a lexical database
  • Look at variations of keywords suggested by various keyword suggestion tools.
  • Write a page and use the Google AdSense sandbox to see what type of ads they would try to deliver to that page.
  • Read the page copy and analyze the backlinks of high ranking pages.

Google Sandbox and Semantic Relationships:
The concept of "Google Sandbox" has become synonymous with "the damn thing won't rank" or whatever. The Sandbox idea is based upon sites with inadequate perceived trust taking longer to rank well.

Understanding the semantic relationships of words is just another piece of the relevancy algorithms, though many sites will significantly shift in rankings due to it. The Google sandbox theory typically has more to do with people getting the wrong kinds of links or not getting enough links than it does with semantic relationships. Some sites and pages are hurt though by being too focused on a particular keyword or phrase.

Where do I learn more about Latent Semantic Indexing?
A while ago I read Patterns in Unstructured Data and found it was written in a rather plain-English, easy-to-understand manner.

Brian Turner also listed a good number of research papers in this thread.

Forum Coverage:

Selected Forum Quotes:
BakedJake

I'm not about to go post my research and examples on a public forum. But, I'll warn you now - if you're not varying your anchor text, and you're not writing pages synonymous with your term that don't contain the term you're targetting, you're going to be in a world of hurt within the next 90 days.

We've been tracking this update for the last 6 months. I was surprised to see it happen now - I honestly didn't expect it until next month or March, but it's here.

BakedJake

I have a page about "baby clothes". I link to my site 100 times with the anchor text "baby clothes"

I now pull out the words "baby clothes" and all the links pointing to my site with the words "baby clothes"

Do I still have footing to rank for that term "baby clothes" after you've run some sort of semantic analysis on it?

That's my simplistic explanation. I think they're doing something very similar, but taking links into account like that and maybe even devaluing some links on the "main" term...

valeyard

Well, if it hasn't changed by Monday I'm going out to buy a black hat.

If irrelevant junk is what Google wants then irrelevant junk is what it's gonna get. :-(

dataguy

Man I'm glad I diversified my sites. I think I will work on diverifying some more...

andy_boyd

Google Inc. is all about money. And IMHO ... so are Yahoo Inc. and Microsft Corp.. As webmasters we are the people who build sites and depend on these money hungry companies, who at the heels of the hunt, put their interests miles ahead of ours.

Chico_Loco

My main concern with this new update is that if you search for my brand name (and there are quite a few that do based on referrals), then right now my site does not even rank. Our brand name is perhaps the best in my industry, and Google are, in my opinion, diluting my brand name and causing my company money. The first result for my brand name is a spammy page which is a "scraper site" which is actually SERP's page from somewhere - so that's basically useless.

The Hidden or Not so Hidden Messages:

  • If you are entirely dependent on any single network and a single site for the bulk of your income then you are taking a big risk. Most webmasters would be best off having at least a couple of income streams to shield themselves from algorithm changes.

  • If you are new to SEO you are best off optimizing your site for MSN and Yahoo! from the start and then hoping to later rank well in Google.
  • Make sure you mix your anchor text to minimize your risk profile. Even if you are generally just using your site name as your anchor text eventually that too can hurt you.
  • Search algorithms and SEO will continue to get more complicated. But that makes for many fun posts ;)

Update: a few additional tools recommended in our comments and the comments at ThreadWatch

Fractal Spam, Overture Direct Traffic Center Rant, SEM Cares (or do they?), Google Subdomain Chatter

Eating Your Own Crap:
Fractal Spam - search engines may be known to like their own search results...at least for a while.

Overture Direct Traffic Center:
Some big advertisers are not too impressed with the reporting delays and clunky interface.

SEM Cares? SEMPO Cares? or is it Nobody Cares?
SEM Cares: perhaps too little, too late for Barbara and others to put out the good word? The domain name sounds a bit Orwellian, which almost makes it sound like maybe nobody cares.

Free Culture Stuff:
A few good links from ThreadWatch's thread about Big Blue open sourcing 500 patents.

Patented European webshop
Software patents – Obstacles to software development by Richard Stallman

Chatter:
There is also chatter that Google may be dropping some spammed out subdomains from some competitive keywords in some of their data centers.

Google Site Flavored Search

ChrisG mentions that Google's site flavored search automatically suggests categories for websites, and that generally it has spot on results.

I am sure it is only a small sample of what Google's technologies do, but it is interesting nonetheless, and it may tell you what Google thinks of your site as well as help you think of related categorical sites to get links from. Maybe it would also be a good way for a small new directory owner to grab a unique category structure for their site?

On a side note, apparently Google has no idea what Black Hat SEO is...

Jon Kleinberg, Title Attribute Test, Making Friends

Home Page of the Day:
Jon Kleinberg - he worked on lots of the underlying theory that created the hubs and authority ranking system which eventually led to Teoma.

He has all kinds of cool PDFs on his site such as Maximizing the Spread of Influence through a Social Network - cool stuff. If I were better at math and network theory stuff his home page would be a virtual candy store.

Interesting & Awaiting Results:
fathom is conducting a link title attribute test

Undersold ad space
Anna Kournikova on advertising...er, advertising on Anna Kournikova

Illegitimate ad space:
Bush Administration Invents 'News' and Pays Journalist

Hosed Ad Space:
Kraft WHITE American Cheese - AdWords ad targeting problems :(

Really, I am not a Slimeball Ads:
Ken Lay starts advertising on AdWords. Interesting what the other AdWords ads say about him too.

Meta "ignore this part of the page" tag:
I can't really see it coming anytime soon, but some want to push the idea.

MSN Beta to ramp up testing:
MSN Beta to ramp up testing

Developing a Directory?
The Don'ts of Directory Development offers tips to help you get your directory off the ground.

ESearch Online E Search Online ApexSearch Apex Search (look out):
another SEO firm out of Vegas that is allegedly cold calling people.

I did not find any legitimate backlinks into the apexesearch site. The only one I found in Google was from a forum solicitation by a guy by the name of Sincity

Sincity would like to offer you...

In that forum post it states:

real results refferences provided in business since 1996 no cusomer complaints EVER!!!!

and yet its registration details state

Registered through: GoDaddy.com (http://www.godaddy.com)
Domain Name: APEXESEARCH.COM
Created on: 20-Apr-04

Domain Name: E-SEARCHONLINE.COM
Created on: 22-Dec-04

I did not see any meaningful company information on their company information page either http://www.apexesearch.com/info.htm. Some people are wondering if this firm has anything to do with Traffic Power. If any SEO calls you up out of the blue trying to tell you that you MUST buy something TODAY then odds are they are NOT worth buying from. Cold calls = crap. Traffic

How Not to Make Friends:
Promote your services in others' forums while trashing their business model in your own forum.

How can a person wanting to set up an automated link network say that people should not be able to buy links by PageRank?

How Not to Make Friends...Part 2:
For a while the name of the SEO firm that wanted RustyBrick to link to them was posted in this rant thread.

One time some guy with a big mouth emailed me about how great his firm was and felt that for that reason he deserved a link from my site. I also had a hunch that when another well known firm told me to add them to my SEO forums page they were spamming me. Not too long ago I got an email from an express link building firm which used "stuff" as the email title. I wonder how many people use these same shoddy techniques to "promote" (or otherwise destroy the brand of) their clients' sites?

Google Financial Stats, Mobile Search, RSS Advertising

Google Finance:
John Battelle has lots of yummy stats about Google's finances...

  • nearly 17% of visitors click on ads.

  • Google makes an average of 54 cents a click.
  • Google makes on average nearly a dime from the average US search

Danny Sullivan makes a guest appearance in the comments, though, to say the figures may be off (if they did not take contextual ads into account).
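
The "nearly a dime" figure roughly follows from the first two stats. The quick Python check below treats the 17% figure as a click rate per search, which is an assumption on my part (the stat is quoted per visitor), and the variable names are made up for illustration.

```python
# Figures as quoted above; reading 17% as "ad clicks per search" is an assumption.
click_rate = 0.17          # share of searches that lead to an ad click
revenue_per_click = 0.54   # dollars earned per click

revenue_per_search = click_rate * revenue_per_click
print(f"${revenue_per_search:.4f} per search")
```

That works out to a little over 9 cents per search, which lines up with the "nearly a dime" stat.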

Rob Frankel:
My favorite branding guru has a great rant blog. His view of Paxil and Prozac for children...

Trellian Seasonal Keyword Research:
Out of touch with the season?

Malcolm Gladwell:
One of my favorite authors gives a speech (about a month old, but his stuff is always good)

Contextual Ads:
Chitika is a new contextual ad network (their parent company has also been powering eBay's keyword driven banners)...rumor has it they might be writing some quality PR stuff too.

Laptops & Porn:
always a bad idea...

Mobile Search:
How it will change everything...or will it? I think there is a ton more to the world than just registering a name. Sure people will easily be able to link up regular publications and products to web locations, but the reason Amazon is successful is not just its product offering or customer service, but the rich feedback past consumers have left in their system. I think our social interactions and the trails we leave on the web are worth a ton more than this article seems to believe.

Mobile People Search:
US to use electronic passports.

Eventual RSS Doom:
Will its popularity destroy it?
Should People Run RSS Ads?

I think the links and attention you get from RSS subscribers will have more long-term value than their cost. If hosting costs are killing you, go with Blogger or find a host who wants some cheap marketing (a hosted by link on your site).

It's not uncommon for businesses to have loss leaders. If many of your readers / RSS subscribers also provide you tons of links then maybe you should look at the bandwidth as an advertising expense.

Those Random Late Night Purchases:
Internet Accelerator may help you download pages...er, rack up credit card bills quicker.

SEO Tips & Search Engine Tips

SEO Old Timer Tips:
An Old Timers Perspective...from SEGuru

Search Engine Old Timer Tips:
Recently a friend of mine bought me a copy of A Theory of Indexing by Gerard Salton. It is a 50 page book from 1975 with lots of charts and math, but in those few pages it has a ton of information about many of the ideas which current search technologies have been built upon.

I am probably going to have to read it again because it was so dense with information and had lots of math that was a wee bit above me the first time around, but to anyone interested in learning about search technology it is a great book...much like Mike Grehan's.

A Theory of Indexing talks about a ton of interesting things like:

  • signal to noise

  • inverse document frequency
  • discrimination value
  • and lots of other stuff

Here is a small bit I learned from the last few pages...

If words exist in a high % of the total documents in a document collection then they are not usually going to be good at discriminating which documents are relevant for a particular query (since they appear in too many documents).

If words exist in a low % of the total documents then they are not usually going to be good at discriminating which documents are relevant for a particular query (since they appear in so few documents).

Words with a mid range document frequency are better discriminators.

To make better use of words that appear in a high % of the total documents you can combine the words into word pairs or triples - which will have a lower frequency and may be better at discriminating document relevancy.

To make better use of words that appear in a low % of the total documents you can cluster the words into groups via the use of a thesaurus - which will have the net effect of creating higher frequency word classes / clusters - which may be better at discriminating document relevancy.
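
Here is a small Python sketch of the two tricks above, using a made-up six document collection. The helper names (words, pairs, doc_frequency, class_frequency) and the sample word class are hypothetical and are not from Salton's book; the sketch only shows how pairing lowers document frequency and how thesaurus-style grouping raises it.

```python
def words(doc):
    """Single words in a document."""
    return doc.split()


def pairs(doc):
    """Adjacent word pairs in a document."""
    tokens = doc.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def doc_frequency(docs, feature, extract):
    """Fraction of documents that contain the given word or word pair."""
    return sum(feature in extract(doc) for doc in docs) / len(docs)


def class_frequency(docs, word_class):
    """Fraction of documents that contain any word from a thesaurus class."""
    return sum(any(w in word_class for w in words(doc)) for doc in docs) / len(docs)


# Hypothetical document collection.
docs = [
    "search engine ranking tips",
    "search engine marketing guide",
    "search for cheap flights",
    "search engine history lexicon",
    "glossary of search jargon",
    "thesaurus of search terms",
]

common_word = doc_frequency(docs, "search", words)          # in every document
word_pair = doc_frequency(docs, "search engine", pairs)     # in half of them
rare_word = doc_frequency(docs, "lexicon", words)           # in one document
word_class = class_frequency(docs, {"lexicon", "glossary", "thesaurus"})

print(common_word, word_pair, rare_word, word_class)
```

In this toy collection "search" appears everywhere and "lexicon" almost nowhere, while both the word pair and the word class land in the mid range that makes for better discriminators.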

Understanding User Goals in Web Search

Daniel E. Rose and Danny Levinson (of Yahoo!) recently created a whitepaper titled Understanding User Goals in Web Search, which aimed to figure out "why are people searching?"

The underlying relevance-ranking algorithms that determine which results are presented to a user might differ depending on the search goal. For example, queries that express a need for advice might rely more on usage- or connectivity-based relevance factors, while those involving open-ended research might weight traditional information retrieval measures (such as term frequency) more highly.

They broke the searches down into three broad groups (and subgroups of these groups).

  • Resource - being entertained or interacting, not just finding info on the page - free online video games, pornography

  • Informational - read or learn something - lists, advice, locate
  • Navigational - going to a single topical hub - Amazon, eBay

These findings came from 3 sets of approximately 500 AltaVista searches each.
The study found a few interesting things about web search:

  • Nearly 40% of searches in each of their search sessions were non informational.

  • A large percent of informational searches were aiming to locate a product or service rather than find information about it.
  • "Just over 35% of all queries appear to have the kind of general research goals (questions, undirected requests for information, and advice-seeking) for what traditional information retrieval systems were designed."
  • Navigational searches were much less common than expected (~13% of total searches). Incidentally, 62% of searches were informational and 25% were for resources.

They stated the lack of distribution of AltaVista and its reputation for having powerful search capabilities might have thrown their research off and they hoped to eventually be testing Yahoo! results.

Originally found on Cre8tive Flow blog.
