How Do I Get Large Websites Indexed by Google & Other Search Engines?

SEO Question: I have a 100,000+ page website. Is there any easy way to ensure all major search engines completely index my website?

SEO Answer: Search engines are constantly changing their crawl priorities. If they crawl too deeply they pick up many low quality pages while increasing their indexing time and costs; if they crawl too shallowly they never reach the relevant pages buried deeper in a site. Crawl depth is a balancing act.

There is no way to ensure all pages get and stay indexed, because search engines change their crawl priorities constantly. Having said that, you can set your site up to be as crawler friendly as possible.

Five big things to look at are

  • content duplication - are your page titles or meta description tags near duplicates (for example, thin content pages that are cross referenced by topic and location)? do other sites publish the same content (for example, an affiliate feed or a Wikipedia article)? are search engines indexing many pages with essentially the same content (for example, a separate page for each model color, or feedback for one item split across many pages)?

  • link authority - does your site have genuinely high quality links? how does your link profile compare with leading competing sites? what features or interactive elements on your site would make people want to link to you instead of to an older and more established competitor?
  • site growth rate - does your site grow at a rate consistent with its own history? how does your growth rate compare with the growth rate of competing sites in the same vertical?
  • internal link structure - is every valuable page on your site linked to from other pages on your site? do you force search engines to go through long loops rather than providing parallel navigation to pages of similar priority? do you link to low value noisy pages (sometimes having a search engine index fewer pages is better than more)?
  • technical issues - don't feed the search engines cookies or session IDs in your URLs, and try to use clean descriptive URLs (see the sketch below for one way to strip session parameters)
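
As a rough illustration of the clean URL point above (this sketch is not from the original article), here is a minimal Python example that strips session-style query parameters from a URL before it gets linked; the parameter names in SESSION_PARAMS are assumptions and would need to match whatever your platform actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session/tracking parameter names - adjust to your platform.
SESSION_PARAMS = {"sessionid", "phpsessid", "sid", "jsessionid"}

def clean_url(url):
    """Return the URL with session-style query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(clean_url("http://example.com/widgets/blue-widget?PHPSESSID=abc123&page=2"))
# prints: http://example.com/widgets/blue-widget?page=2
```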

Some signs of health are

  • pages you don't want indexed are staying out of the index - wasting link equity on low quality pages means you have less authority to spread across your higher quality pages

  • most of the pages you want indexed are getting indexed, are actively crawled, and are not stuck in Google's supplemental index - supplemental problems and / or reduced indexing or crawl priority are common on sites with heavy content duplication, wonky link profiles, or many dead URLs (one quick way to check crawl activity is sketched just after this list)
  • your site is building natural link equity over time and people are actively talking about your brand - if you have to request every link you get then you are losing market share to competitors who get free high quality editorial links
  • you see a growing traffic trend from search engines for relevant search queries - this is really what matters. this includes getting more traffic, higher quality traffic, and searchers landing on the appropriate page for their query.
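
One low-tech way to check the "actively crawled" point above is to count search engine bot hits per URL in your raw access logs. The Python below is a rough sketch, not part of the original post; it assumes an Apache-style combined log format and a few user agent substrings, so the regex and bot names would need adjusting for your own server.

```python
import re
from collections import Counter

# Assumes the Apache "combined" log format, where the request line is quoted
# and the user agent is the last quoted field. Adjust the regex for your server.
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<agent>[^"]*)"$')

BOTS = ("Googlebot", "Slurp", "msnbot")  # user agent substrings to count

def crawl_counts(log_path):
    """Count bot requests per URL path from a raw access log."""
    counts = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = LINE.search(line)
            if match and any(bot in match.group("agent") for bot in BOTS):
                counts[match.group("path")] += 1
    return counts

if __name__ == "__main__":
    for path, hits in crawl_counts("access.log").most_common(20):
        print(hits, path)
```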

Things you can do if conditions are less than ideal

  • focus internal link equity on important high value pages (for example, feature new product categories and new or seasonal items on your internal sitemap, or link to your most important categories sitewide)

  • trim the site depth (by placing multiple options on a single page instead of offering many near duplicate pages) or come up with ways to make the page level content more unique (such as user feedback)
  • cut out the fat - if many low value pages are getting indexed, block their indexing by removing them, no longer linking to them, or integrating their information into other higher value pages
  • use descriptive page relevant URLs / page titles / meta descriptions - this helps ensure the right page ranks for the right query and that search engines will be more inclined to deeply crawl and index your site
  • restructure the site to shift weight between the top, middle, and bottom - if a certain section of your site is overrepresented in the search results, consider changing your internal link structure to place more weight on other sections. in addition, you can add features or ideas which make the under-represented pages more attractive to link at
  • use Sitemaps - while you should link to all the quality pages of your site from within your site, and use your internal link structure to help search engines understand which pages are important, you can also help them understand page relationships using the open Sitemaps standard (a minimal generator sketch follows below)
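
As a companion to the Sitemaps point above, here is a minimal Python sketch (not from the original article) that writes a sitemap.xml file following the sitemaps.org protocol from a list of URLs. A real 100,000+ page site would pull its URL list from the database and split the output to stay under the protocol's 50,000 URLs per file limit; the example URLs are placeholders.

```python
from xml.sax.saxutils import escape

def write_sitemap(urls, out_path="sitemap.xml"):
    """Write a minimal sitemaps.org-style sitemap for the given URLs."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write("  <url><loc>%s</loc></url>\n" % escape(url))
        f.write("</urlset>\n")

# Placeholder URLs - in practice these would come from your database.
write_sitemap([
    "http://www.example.com/",
    "http://www.example.com/category/widgets",
    "http://www.example.com/widgets/blue-widget",
])
```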
Published: November 19, 2006 by Aaron Wall in Q & A

Comments

November 19, 2006 - 11:13pm

Are there high quality web sites, aside from the usual suspects, that are over 100,000 pages? Do you start to dilute your content when your site has too many pages or topics?

We've worked on projects in the past where the client received a significant increase in traffic as a result of streamlining their site. It became easier for the user to find information.

Which basket (quality versus quantity) do you put your eggs in when 90% of your page views come from a small number of pages?

November 19, 2006 - 11:37pm

Great article; it definitely covers a lot of the basics that many webmasters overlook.

I do feel, however, that there's basically a "magic bullet" to getting out of the supplemental results. I responded to your post and gave my additional advice here: http://www.revenuegirl.com/getting-out-of-supplemental-results/

Eugene Loj: In my experience, you definitely dilute your site when you have too many pages relative to your trust and link popularity in Google.

November 20, 2006 - 12:30am

I see no mention of the Deep Web issue here. I would have thought that was the key problem with a site that big. Are most of these 100,000 pages generated from a database? If so, spiders don't crawl them.
You might want to try having a static html sitemap that lists all the major categories that are in the database. Then spiders could at least index that list.
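
To make that suggestion concrete, a rough sketch of generating such a static HTML category sitemap from a database might look like the Python below (this is not from the original comment, which included no code); the table and column names are made up for illustration.

```python
import sqlite3
from xml.sax.saxutils import escape

# Hypothetical schema: a "categories" table with name and url columns.
def write_category_sitemap(db_path="site.db", out_path="sitemap.html"):
    """Dump every category in the database into one static, crawlable HTML page."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, url FROM categories ORDER BY name").fetchall()
    conn.close()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Site Map</title></head><body>\n<ul>\n")
        for name, url in rows:
            f.write('<li><a href="%s">%s</a></li>\n'
                    % (escape(url, {'"': "&quot;"}), escape(name)))
        f.write("</ul>\n</body></html>\n")
```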

November 20, 2006 - 12:38am

Hi Tim
I don't think databases are the issue so long as each logical page has sufficient unique content and is linked to in the site navigational scheme.

November 20, 2006 - 3:27am

Nice rundown there (as always!).

I'm actually dealing with this very issue with a client just now and it is problematic simply because there aren't many large sites out there in this position (most of them are scraper sites or otherwise auto generated crap).

We're currently sitting at about 80k pages indexed out of about 300k (yes it is all genuine content! We have around 400 individuals writing for the site).

Having resolved some URL and other technical issues (legacy from old SEO agency), we're starting to see an improvement, but still a long way to go, although we have covered the points you mentioned here.

TBH, the site is a casualty of Google's war on spam - no other sites in our vertical have the sheer volume of content we do or the rate of growth (5-10k pages per month or so). The "Get more links" mantra from Cutts really doesn't sit well with me as an excuse for excluding unique, rich content from the SERPs, but I do see the need for such measures.

Plus sitemaps are a problem - regular sitemap generators have been dying at the scale of the site, but I'm having some luck with Gsite just now (it's crashed a few times, but I'm 1/3 of the way indexed so far). Slightly off topic, but I got a hat tip from someone inside Google that their commercial search appliance will come with a sitemap generator soon. ;)

I haven't really used Google sitemaps all that much (just for a few test sites, etc., but they were already fully indexed). In your experience, would you say full sitemap inclusion would boost the number of pages indexed? I'd suspect that link weight / crawl depth would still be the overriding factor.

MG

SEO Buzz Box
November 20, 2006 - 4:39am

remove pages

November 20, 2006 - 8:10am

I'd imagine you can get links from trusted websites pointing at your more or less important pages (categories, site sections) to get them indexed faster.

Then again, internal links to content at the deepest level should help.

It'd also help to get links only from trusted sources from the start (if it is a new site).

November 20, 2006 - 9:14am

Regarding the database comment: databases are server side. You can use database data to make a website that looks identical to completely static, regular plain old HTML pages if you want to.

A database is simply a method of storing information in an organized fashion.

Sufyan
November 20, 2006 - 2:10pm

- URL rewriting
- Unique, descriptive titles and meta tags
- Strong internal linking
- Presence of a well-structured site map
- And a couple of quality backlinks, and you are done!

No more worries about indexing or falling into the supplemental index!

P.S. That has worked for me pretty well.

November 20, 2006 - 8:38pm

I wanted to comment on one particular statement from Tim:

"Are most of these 100,000 pages generated from a database? If so, spiders don't crawl them."

That isn't accurate, actually. I have done SEO on dynamic websites before and it's not much different from working with static websites and pages.

The example I'll share is my little pet project - the Madtown Lounge, at http://www.madtownlounge.com. I have a database of venues, bands, and events, and they all get picked up by the search engines. It helps to have a dynamic sitemap - an example is at http://www.madtownlounge.com/sitemap_venues.asp - which helps the crawlers index dynamic content. It works quite well, as I'm ranked higher in the SERPs than some of the websites of the venues themselves.

I have shared the code for building a dynamic sitemap with Active Server Pages on my other blog at http://madisonseo.blogspot.com/.

Cheers,
Allen

November 20, 2006 - 9:48pm

Another helpful tip:

Don't use a lot of co-op or reciprocal linking schemes. Google has co-op stuff fingerprinted and it will surely get you de-indexed or supplemental. If you have some quality backlinks then your site might just end up supplemental, but either way it does seem that link quality will make a difference in the supplemental indexing of your site.

seopractices
November 21, 2006 - 4:47pm

Good information, thanks Aaron. Something else you can add to the list, and that has worked for me, is to place a few related links at the bottom of the pages you have in the supplemental index, so that when you write new related articles you can link to them from those supplemental pages:

For example, if you have a page about Aaron Wall stuck in the supplemental index, then the next time you write a new article about Aaron Wall you go to that supplemental page and add a link to the new article at the bottom, like a list of related articles.

Paul
June 15, 2007 - 2:03pm

I am currently using content specific blogs to improve links to my site, and this seems to be working for me. Not only has my site's PR risen, but I see a steady increase in unique visitors each month.
Very interesting post, with a few good points I will consider myself in improving my site's PR.

Keep it up, will be sure to check for updates soon...

mjzivko
May 6, 2009 - 2:29pm

I am currently trying to get two sites indexed, residential-security-services.com and 4garage-doors.com. The combined page count of both sites is around 50,000.

I have been trying to get my links on blogs that have page ranks. Not just the main URL of the blog, but the actual post where my link is going to be.

For example, this post has a PR of 3. I hope the links I just dropped are hot links so I get some link love from this post.

May 7, 2009 - 1:34am

Sorry, but the point of our comments box is not to promote spammy sites. Learning a bit of tact couldn't hurt.
