Link Spam Detection Based on Mass Estimation

Not sure how many people believe TrustRank is in effect in the current Google algorithm, but I would be willing to bet it is. Recently another link quality research paper came out by the name of Link Spam Detection Based on Mass Estimation [PDF].

It was authored by Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.

The proposed method for determining Spam Mass works to detect spam, so it complements TrustRank nicely (TrustRank is primarily aimed at detecting quality pages and demoting spam).

The paper starts off by defining what spam mass is.

Spam Mass - an estimate of how much PageRank a page accumulates by being linked to from spam pages.

I covered a bunch of the how-it-works theory in the extended area of this post, but the general take-home tips from the article are:

  • .edu and .gov love is the real deal, and then some

  • Don't be scared of getting a few spammy links (everyone has some).
  • TrustRank may deweight the effects of some spammy links. Since most spammy links have a low authority score they do not comprise a high percentage of your PageRank weighted link popularity if you have some good quality links. A few bad inbound links are not going to put your site over the edge to where it is algorithmically tagged as spam unless you were already near the limit prior to picking them up.
  • If you can get a few well known trusted links you can get away with having a large number of spammy links.
  • These types of algorithms work on a relative basis. If you can get more traditional media coverage than the competition you can get away with having a bunch more junk links as well.
  • Following up on that last point, some sites may be doing well in spite of some of the things they are doing. If you aim to replicate the linkage profile of a competitor make sure you spend some time building up some serious quality linkage data before going after too many spammy or semi spammy links.
  • Human review is here to stay in search algorithms. Humans are only going to get more important. Inside workers, remote quality raters, and user feedback and tagging give search engines another layer to build upon beyond link analysis.
  • Only a few quality links are needed to rank in Google in many fields.
  • If you can get the right resources to be interested in linking your way (directly or indirectly) a quality on topic high PageRank .edu link can be worth some serious cash.
  • Sometimes the cheapest way to get those kinds of links will be creating causes or linkbait, which may be external to your main site.

On to the review...

  • To determine the effect of spam mass they compute PageRank twice: once normally, and again with more weight given to known trusted sites that would be deemed to have a low spam mass.

  • Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two.
  • While quality authoritative links to spam sites are rarer, they are often obtained through the following:
    • blog / comment / forum / guestbook spam

    • honey pots (creating something useful to gather link popularity to send to spam)
    • buying recently expired domain names
  • If the majority of inlinks are from spam nodes the host is assumed to be spam; otherwise it is labeled good. Rather than looking at the raw link count, this can be further refined by looking at the percentage of total PageRank that comes from spam nodes
  • To further determine the percentage of PageRank due to spam nodes you can also look at the link structure of indirect nodes and how they pass PageRank toward the end node
  • The presumption of knowing whether something is good or bad is not feasible, so it must be estimated from a subset of the index
  • For this to be practical search engines must have whitelists and / or blacklists to compare other nodes against. These can be compiled automatically or manually
  • It is easier to assemble a good core since it is fairly reliable and does not change as often as spam techniques and spam sites (Aaron speculation: perhaps this is part of the reason some uber spammy older sites are getting away with murder...having many links from the good core from back when links were easier to obtain)
  • Since the reviewed core will be a much smaller sample than the number of good pages on the web, you must also review a small uniform random sample of the web to determine the approximate percentage of the web that is spam, which is used to normalize the estimated spam mass
  • Due to the sampling methods some nodes may have a negative spam mass; these are likely to be nodes that were either assumed to be good in advance or nodes which are linked closely and heavily to other good nodes
  • It was too hard to manually create a large human reviewed set, so
    • they placed all sites listed in a small directory they considered to be virtually devoid of spam in the good core (they chose not to disclose the URL...anyone want to guess which one it was?). This group consisted of 16,776 hosts.

    • .gov and .edu hosts (and a few international organizations) also got placed in the good core
    • those sources gave them 504,150 unique trusted hosts
  • Of the 73.3 million hosts in their test set, 91.1% have a PageRank less than 2 (less than double the minimum PageRank value)
  • Only about 64,000 hosts had a PageRank of 100 times the minimum or more
  • They selected an arbitrary minimum PageRank limit for reviewing the final results (since you are only concerned with the higher PageRank results that would appear atop search results). This gave a group of 883,328 sites, of which they hand reviewed 892 hosts:

    • 564 (63.2%) were quality

    • 229 (25.7%) were spam
    • 54 (6.1%) uncertain (like beauty, spam is in the eye of the beholder)
    • 45 (5%) hosts down
  • ALL high spam mass anomalies on good sites were categorized into the following three groups
    • some Alibaba sites (the Chinese web was far removed from the core group),
    • Blogger.com.br (relatively isolated from the core group),
    • .pl URLs (there were only 12 Polish educational institutions in the core group)
  • Calculating relative spam mass works better than absolute spam mass (which is only logical if you want the system to scale, so I don't know why they put it in the paper). Example of why absolute spam mass does not work:
    • Adobe had lowest absolute spam mass (Aaron speculation: those taking the time to create a PDF are probably more concerned with content quality than the average website)

    • Macromedia had third highest absolute spam mass (Aaron speculation: lots of adult and casino type sites have links to Flash)
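The two-PageRank idea described above can be sketched in a few lines of Python. This is a toy illustration rather than the paper's exact formulation: the graph, the node names, and the "good core" membership are all made up, and the relative spam mass shown is just the intuitive version (ordinary PageRank minus core-biased PageRank, as a fraction of ordinary PageRank).

```python
# Toy spam mass sketch: compute PageRank twice, once with a uniform
# random-jump vector and once jumping only to a trusted "good core".
# The graph, node names, and core membership are all hypothetical.
graph = {
    "trusted": ["honest"],
    "honest": ["trusted"],
    "spam1": ["target"],
    "spam2": ["target"],
    "spam3": ["target"],
    "target": [],          # the page whose spam mass we want to estimate
}

def pagerank(graph, teleport, damping=0.85, iters=50):
    """Power-iteration PageRank with a custom teleport (random-jump) vector."""
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {n: (1 - damping) * teleport[n] for n in graph}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    nxt[m] += share
            else:  # dangling node: hand its rank back out via the teleport vector
                for m in graph:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank

nodes = list(graph)
uniform = {n: 1.0 / len(nodes) for n in nodes}
good_core = {"trusted", "honest"}
core_bias = {n: (1.0 / len(good_core) if n in good_core else 0.0) for n in nodes}

p = pagerank(graph, uniform)         # ordinary PageRank
p_core = pagerank(graph, core_bias)  # PageRank flowing from the good core only

# Relative spam mass: the fraction of a page's PageRank that does not
# flow (directly or indirectly) from the trusted core.
spam_mass = {n: (p[n] - p_core[n]) / p[n] for n in nodes}
```

In this toy run the spam-supported target ends up with a relative spam mass near 1, while the well-interlinked honest host actually comes out negative, matching the point above about closely linked good nodes showing negative spam mass.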

[update: Orion also mentioned something useful about the paper on SEW forums.

"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, 17 search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with "unnatural" link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content. "

So if you are using an off the shelf spam generator script you bought from a hyped up sales letter and a few thousand other people are using it that might set some flags off, as search engines look at the various systematic footprints most spam generators leave to remove the bulk of them from the index.]
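The Fetterly-style check quoted above can also be sketched. This is my guess at the mechanics, not their actual code: fit a power law to the degree histogram with a log-log least-squares line, then flag any exact degree value whose observed frequency sits some multiple above the fitted curve (the 10x threshold here is arbitrary).

```python
import math
from collections import Counter

def degree_outliers(degrees, threshold=10.0):
    """Flag in/outdegree values whose frequency far exceeds a power-law fit.

    degrees: one entry per page (its in- or outdegree, assumed >= 1).
    Returns the degree values occurring 'threshold' times more often
    than the fitted power law predicts.
    """
    counts = Counter(degrees)
    xs = [math.log(d) for d in counts]
    ys = [math.log(counts[d]) for d in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares line in log-log space: log(count) ~ intercept + slope * log(d)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    outliers = []
    for d, c in counts.items():
        expected = math.exp(intercept + slope * math.log(d))
        if c / expected > threshold:
            outliers.append(d)
    return outliers
```

A link farm that stamps out thousands of pages with the exact same number of outlinks would show up as exactly this kind of spike, which is why off-the-shelf spam generators leave such an obvious footprint.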

Link from Gary

Today is the Right Time to Buy Old Sites...

I work by myself, and am always a bit scared of spreading myself too thin, so I have not been too active on the old domain buying front.

Having said that, now would probably be a good time to buy old domains. Jim Boykin again mentioned his new love for oldies and Graywolf said

Came to the same conclusion myself, emailed about 150 people picked up 2 domains from 2000 for under $1K.

Think of how cheap those site purchases are. Decent links can cost $50 to $300 or more each, so buying whole sites for $500 is cheap cheap cheap! How cheap is it? Even the most well known link broker is recommending buying a few old domains.

Why now is the perfect time to buy old domains:

  • It is right before the Christmas shopping season and many people not monetizing their sites might be able to use a bit of spare cash.

  • Many older domains are doing better than one would expect in Google's search results, which means they may recoup their costs quickly.
  • As Andy Hagans said, "Some older sites seem to be able to get away with murder in Google's search results."
  • Link popularity flowed much more naturally to commercial sites in the past than it does now. This means buying something with a natural link profile may be far cheaper than trying to reproduce similar linkage data.
  • At different times search algorithms show you different things. Before the Christmas shopping season each of the last few years, it seems Google rolled out a new algorithm that whacked many sites which SEO'ed their way to the top (IMHO via link trading and low quality linkage data). Most of the algorithm changes are related to looking at linkage quality, communities, and ways to trust sites. The most recent update seems to have (at least temporarily) dialed up the weighting on TrustRank or a similar technology, which has had the net effect of highly ranking many old/trusted/authoritative sites that may lack some query specific authority. If you shop for sites that fit the current Google criteria well and then add some good SEO, you should be sitting pretty no matter which way the algorithms slide.

Before MSN Search launched, GoogleGuy recommended everyone take a look at the MSN search results:

I recommend that everyone spend their full attention coming up to speed on beta.search.msn.com.

It's very rare to get to see a search engine in transition, because that's the best time to see what the different criteria are for ranking.

Now that Google is in a state of flux it might be a good time to perform many searches to look for some underpriced ad inventory. If you know what you are looking for you are more likely to find it in the organic search results than in the AdWords system.

The search vs SEO cat fight:

and going forward...

  • creating causes

  • social networking
  • buzz marketing

I think there is way more competition and SEO is way more complex than when I started learning about it, but that is offset in part by:

  • more searches

  • greater consumer trust of online shopping
  • many channels discussing the topic of SEO
  • many free tools (SEO and content management)
  • lower hosting costs
  • the speed at which viral stories spread if you can create one
  • the vastly expanding pool of options to use to monetize your sites

Why Off Topic Link Exchanges Suck

So I recently got this email:

Dear Webmaster,

As you are probably aware, Google has changed its algorythm and now removes sites from its search results when they have exchanged links with sites that are not in EXACTLY the same category.

To prevent being from blacklisted in Google, it is imperative that we remove our link to you and that your remove your link to us!

Our url is www.SITE.com.

We are removing our link to you now. PLEASE return the courtesy and remove your link to us!

Note that Google is updating its results this week and failure to remove these links immediately will likely mean not showing up in Google for AT LEAST the next 4 months!

Thank you for understanding,
Site.com Partners

The email is bogus, and I don't think I ever traded links with the site mentioned, but that is the exact reason why this kind of email is extra crappy.

If you trade links highly off topic you increase your risk profile, and if it helps you rank:

  • Whenever there is an update your competitors can send these "remove my link" reminders out for you.

  • There are only so many relationships you can maintain. If you link out to a million sites, junky sites will make up a higher percentage of your outbound links than they do for most sites, you will have more dead links than most quality sites, and many of those people will remove their links to you.
  • Your competitors could pay people from Jakarta $3 a day to go through your link trades and trade the same links.
  • Quality on topic sites may be less likely to link to you if your site frequently links off to low quality resources.

I think most sites which recently went south in Google probably did so because they lacked quality linkage data, not because they had too many links.

Link Monkeys and Links Within Content

Jim Boykin continues his ongoing rants about links:

Since most people are still thinking "the numbers game" when it comes to obtaining links, most people are buying "numbers" from "monkeys" on crappy link pages.

When will the world wake up that the numbers game has passed the tipping point in Google? Engines are trying to get smarter with how they analyze sites. My overall thought is that they are working to identify, simply, "Links within Content and Linking to Content"

Not too long ago when I interviewed NFFC he stated:

This is what I think, SEO is all about emotions, all about human interaction. People, search engineers even, try and force it into a numbers box.

Numbers, math and formulas are for people not smart enough to think in concepts.

Best Way to Deal With SEO Clients that Don't Pay?

Dan Thies SEO Training Course Coupon

My buddy Dan Thies is doing another one of his SEO training courses. If you do well with audio training & want to learn SEO I highly recommend it.

Dan gave me a coupon code for SeoBook.com readers to save $100 off his course fees. After you create your login the next screen lets you enter the coupon code seobook. If you would like my ebook to go along with the course just ping me after you sign up.

Hidden AdSense Ads & Hidden Search Results

Ads as content...works well for some. You know AdSense is out of hand when premade sites are selling on eBay.

Debt Consolidation:
Fun to hear a guy whine about his 2 sites in his sig file not outranking Forbes for debt consolidation
http://forums.seochat.com/t54393/s.html
because Forbes has an ad page. :(

I bet his debt consolidation lead generation sites are informational in nature :)

Some are discussing creating links wars
http://forums.seochat.com/t54378/s.html
as if that will help them rank. Good luck knocking out Forbes.com.

If Google dials up their weighting on large authority sites before Christmas maybe the solution is to buy ad pages on some of them. I bet there are some great underpriced ad links and advertisement pages if people would look hard enough.

Coolness:
Link to Jim Boykin's new tool...still a bad tool name though, IMHO.

Will Unfollowed Redirect Links Ever Count as Votes?

I have not put much effort into pursuing most directory type sites that use redirect links (especially if they are not ranked well in the related search results), but will engines eventually count many weird links as votes if they notice that users click on a link and like what is on the other end of the series of redirects? Will those links ever count as much as, or more than, static links that never get clicked on?

Jim Boykin On History of SEO & the Top Results

The History of SEO:

2005 SEO - Yahoo and MSN, pound with lots of links at once and keep pounding with anything you can get for backlinks with a focused backlink text campaign. With Google, the older the site the better, slow and steady link building with a large variety of backlink text wins (notice it’s the opposite of yahoo and msn).

Google’s Top 10 Choices for Search Results:

I think that for most searches, the top 10 will consist mostly of these types of pages. I think Google does this on purpose to show a variety of Types of pages to the user.

If you’re targeting a phrase, you should start by figuring out what type of result your site will be, what its role is in the top 10, who your "real" and "direct" competitor is, and what it will take to replace them.

Great posts Jim.

also good comments here Reputable SEO Companies

Dir Sirs, Would You Like More PageRank? Link Exchange...

Jim Boykin talks link exchange emails, and why over 99% of them are rubbbbiiiiissssssshhhhhhhh!

LOL! It's spam, and you know it is. LOL ;)
