This post is a bit on the geek-centric side of SEO. You’ve been warned. Also, this is strictly an opinion rant. I still love Google and believe every client should optimize their site for the best rankings in Google. <hugs></hugs>
Ever wonder why the number of pages indexed in Google (as shown in the Sitemaps section of Webmaster Tools) is different from the number of pages indexed in Google (as shown in a site: search of Google itself)?
<rant>Google forums and blogs, as well as other industry sources, say that it could be because of the way Google stores data. You see, Google uses what are called data centers. According to the Webmaster Central Blog, “Occasionally, fluctuation in search results is the result of differences in our data centers. When you perform a Google search, your query is sent to a Google data center in order to retrieve search results. There are numerous data centers, and many factors (such as geographic location and search traffic) determine where a query is sent. Because not all of our data centers are updated simultaneously, it’s possible to see slightly different search results depending on which data center handles your query.”

I understand “slightly different” results depending on which data center you hit when you perform the search. However, I don’t consider a gap of 1,000+ URLs a “slight” difference.
Here’s a real example for demonstration purposes. A national professional services firm has over 6,000 pages on its website. According to Webmaster Tools, 2,659 of them are indexed. This seems to be the more trustworthy number, as many of the URLs are intentionally blocked from the index using robots.txt. Yet when doing a search in Google for “site:www.sampledomain.com”, Google reports that only 1,560 pages are indexed.
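For readers unfamiliar with how pages get blocked from the index, here’s a minimal robots.txt sketch. The paths are hypothetical (the real firm’s site isn’t named), but the pattern is the standard one: disallow the sections you don’t want crawled, and those URLs drop out of the index count over time.

```
# Hypothetical robots.txt — example paths only
User-agent: *
Disallow: /internal/
Disallow: /print-versions/

Sitemap: http://www.sampledomain.com/sitemap.xml
```

Note that Webmaster Tools knows about these blocked URLs (you submitted them in your sitemap, or it crawled links to them), which is part of why its numbers are the more trustworthy baseline here.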
Why would Google show two completely different numbers for pages in their index? I’ve spoken to other designers and SEOs about this issue, and a common answer is that maybe the site has duplicate content that Google’s site: search filters out. Valid hypothesis, that. OK, so I perform another site: search and add the filter=0 parameter, which turns off the duplicate-content filtering and shows the results Google would normally suppress. This search shows 1,590 results. So the 30 pages of duplicate content aren’t the issue.
Another common hypothesis is that personalized search is impacting the results I’m seeing on the site: search. OK, so I add the pws=0 parameter to the URL and still come up with 1,590. Where are the additional 1,069 URLs Webmaster Tools says Google has in its index? If they’re there, why can’t we see them?
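To make the test above easy to repeat, here’s a short Python sketch that builds the site: search URL with both parameters from this post (filter=0 to disable duplicate filtering, pws=0 to disable personalization) and does the arithmetic on the gap. The domain is the placeholder from the example above, and the counts are the ones reported in this post.

```python
from urllib.parse import urlencode

# Build a site: query with duplicate filtering and personalization off.
# www.sampledomain.com is the placeholder domain used in this post.
base = "https://www.google.com/search"
params = {
    "q": "site:www.sampledomain.com",
    "filter": "0",  # show results normally filtered as duplicates
    "pws": "0",     # turn off personalized search
}
url = f"{base}?{urlencode(params)}"
print(url)

# The discrepancy described in this post:
wmt_indexed = 2659       # indexed pages per Webmaster Tools
site_results = 1590      # results from site: search with filter=0
gap = wmt_indexed - site_results
print(f"Unaccounted-for URLs: {gap:,}")  # 1,069
```

Running the query both with and without these parameters is how the duplicate-content and personalization hypotheses were ruled out above: the count barely moves, so neither explains a four-digit gap.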
According to this post, it’s crazy to even ask why the numbers don’t match! The Top Contributor’s explanations for the mismatch: different data centers (addressed above), being logged in or out of iGoogle (addressed above), the exact search query (accounted for above), filtering (addressed above), and finally (not in the post) – estimating…
“The SERPs – the figures you see when you do a search… those are NOT accurate in many cases.
In fact – Google does tell you this… it clearly says ‘of about’ – it does not say ‘exactly’. To the point – as you click through the pages (using the pager at the bottom), the chances are damned good that the number you see quoted will change. It may start off on page 1 saying ‘of about 1270’… and yet on page 2 it may say ‘of about 421’. So Click Through the Pager and watch the figures.”
Really? So, basically, Google’s numbers are “estimated” and not exact. I could live with that if they were even CLOSE to what Webmaster Tools shows. It’s just a little hard to believe that Google could be that far off when metrics and numbers are such a key part of their algorithm (and the real figures are already available to them via Webmaster Tools).
OK, so what do I think? I think that Google doesn’t want webmasters to be able to accurately research how their competitors’ sites are doing. One of the measures of SEO is the number of pages indexed. If we don’t really know how many pages our competition has, that’s one less metric we can use to compete. Google did something similar with PageRank several years ago. They started releasing updates to their PageRank numbers only about once every 6 months. Why? Probably so that webmasters couldn’t use that information to figure out what works and what doesn’t when it comes to SEO techniques. They want to make it hard for black hats and spammers to play the system. A very noble cause that I happen to agree with. However, Google should also acknowledge that removing this data, or showing incorrect “estimates,” makes it harder for legitimate sites to know how they’re doing and improve what needs to be improved.</rant>
I propose that, in the interim, when you’re logged into a Google account connected to a Webmaster Tools account (or accounts), Google show you the “real” number of indexed pages. This would be an improvement even if you still couldn’t see true numbers for your competitors.