Search results pages – should we “noindex” them in robots.txt files?
I often get questions from SEOs and people working on SEO for sites. One that comes up time and time again is: “Should we noindex pages we no longer want search engines to show in the results?”
If this were a few years back, my standard response would have been to add a “Disallow” line in robots.txt. You can see that recommendation from Matt Cutts way back in 2007 here. However, search engines, and Google in particular, have changed the way they index content. It isn't enough to tell them at the root level that you DON'T want a page in the SERPs; you have to specify that on the page itself. According to Google's own support page on the subject:
“It's important to note that even if you use a robots.txt file to block spiders from crawling content on your site, Google could discover it in other ways and add it to our index. For example, other sites may still link to it. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, can appear in Google search results.”
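For context, the old-style approach I used to recommend was a disallow rule in robots.txt. A minimal sketch looks like this (the `/search/` path is just an illustrative example; use whatever directory your results pages live under):

```
User-agent: *
Disallow: /search/
```

This blocks crawling of the path, but as the quote above explains, it does not stop the URL itself from appearing in the index if other pages link to it.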
So how do I keep my search results pages out of the index?
My advice is to use the page-level meta tag “noindex” instead. The disallow rule stops Google from crawling the directory or page, but if someone links to a results page, the disallow is pretty much useless for keeping that content out of the index. (You can see an example of this in the content strategy post as well.)
The directive on page would look like this:
<meta name="robots" content="noindex">
However, I have seen instances where even the page level meta isn’t respected, and I prefer to use:
<meta name="robots" content="noindex, follow">
This tells the bot to crawl the page but NOT index it – if someone is linking to the page, it may have equity, and that equity still gets passed on through the links on the page. I have seen this work far better than robots.txt-level disallows, and it would be my first choice for:
- Search results pages
- Paginated results
- Tag pages
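If you want to check that the directive is actually in place on a page, a quick sketch like the following works. This is just an illustrative helper, not part of any SEO tool – the function names and sample markup are mine:

```python
# Sketch: detect a page-level "noindex" robots meta tag using only
# the Python standard library. Names here are illustrative.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def is_noindexed(html):
    """Return True if the HTML carries a robots "noindex" directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindexed(page))  # prints True
```

Running this against your search, paginated, and tag pages is an easy sanity check that the tag survived templating changes.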
It is amazing how many sites I still see with pages in the index showing the snippet:
“A description for this result is not available because of this site's robots.txt”
You see that when a site hasn't used the right method of keeping pages out of the search results. I believe this behaviour came into play midway through 2012. As you can see, Barry's example is still valid to this day:
What about virtual folders, such as redirect URLs?
Now that is an interesting question, isn't it? Typically, large sites that link out but don't want to pass equity, want to track outbound clicks, or want to cloak the URLs use redirect URLs, similar to URL shorteners. A good example is Twitter, which uses the “t.co” domain to shorten any links you post on the site (in addition to using nofollows on those links). The same issue exists with cloaked affiliate links. Read the next post, which covers this area.