Many sites today let users sign in to see personalized content, whether it's a forum, a news reader, or an e-commerce application. To make life easier for their users, they usually allow logging in from any page of the site. Similarly, to keep navigation simple, sites usually generate dynamic links so the user can get back to the page they were on before visiting the login page, something like: <a href="/login?returnUrl=/currentUrl">Sign in</a>.
If your site has a login page, you should definitely consider adding it to the Robots Exclusion list, since it is a good example of content you do not want a search engine crawler to spend time on. Remember that crawlers give your site a limited amount of time, and you really want them to focus on what is important.
Out of curiosity I searched for login.php and login.aspx and found over 14 million login pages… that is a lot of useless content in a search engine.
Another big reason is that URLs which vary per page mean there will be hundreds of variations for crawlers to follow, like /login?returnUrl=page1.htm, /login?returnUrl=page2.htm, etc., which basically doubles the crawler's workload. Even worse, if you are not careful you can easily cause an infinite loop for crawlers by including the same "login link" on the login page itself: the link becomes /login?returnUrl=login, and following that produces /login?returnUrl=login?returnUrl=login… and so on, with an ever-changing URL at every step. Note that this is not hypothetical; it is a real example from a few famous Web sites (which I will not disclose). Of course crawlers will not crawl your site infinitely; they are not that silly and will stop after seeing the same resource /login a few hundred times, but this still reduces the time they spend on what really matters to your users.
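To see why the URL never repeats, here is a small sketch (hypothetical code, not from any of the sites mentioned) of a template that blindly embeds the current URL as returnUrl, including on the login page itself. Each "visit" produces a brand-new, longer URL, so a naive crawler never sees a duplicate:

```python
from urllib.parse import quote

def login_link(current_url):
    # The site template blindly embeds the current URL as returnUrl,
    # percent-encoding it so it fits in a query string.
    return "/login?returnUrl=" + quote(current_url, safe="")

# Start on an ordinary page and keep following the login link,
# exactly as a crawler would.
url = "/page1.htm"
for _ in range(4):
    url = login_link(url)
    print(url)
# First hop: /login?returnUrl=%2Fpage1.htm
# Every subsequent hop re-encodes the previous URL, so each one
# is longer than the last and distinct from all earlier ones.
```

Because every URL is unique, deduplication by exact URL never kicks in; the crawler only gives up once its own heuristics notice the same /login resource repeating.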
IIS SEO Toolkit
If you use the IIS SEO Toolkit, it will detect when the same resource (like login.aspx) is being used too many times with only the query string varying, and will flag a violation such as "Resource is used too many times."
So how do I fix this?
There are a few fixes, but by far the best thing to do is simply add the login page to the Robots Exclusion protocol.
- Add the URL to /robots.txt. You can use the IIS Search Engine Optimization Toolkit to edit the robots file, or just drop a file with something like:
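A minimal robots.txt along these lines, assuming your login page lives at /login (adjust the path to match your site):

```
User-agent: *
Disallow: /login
```

The Disallow rule is a path prefix, so it also covers every /login?returnUrl=… variation.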
- Alternatively (or additionally), you can add a rel attribute with the nofollow value to tell crawlers not to even try. Something like:
<a href="/login?returnUrl=page" rel="nofollow">Log in</a>
- Finally, run the Site Analysis feature in the IIS SEO Toolkit to verify you don't have this kind of behavior. It will automatically flag a violation when it identifies that the same "page" (with a different query string) has already been visited over 500 times.
To summarize: always add the login page to the robots exclusion protocol file, otherwise you will end up:
- sacrificing valuable "search engine crawling time" on your site.
- spending unnecessary bandwidth and server resources.
- potentially even blocking crawlers from your content.