Recently I was working on a very interesting case. My customer deployed his website on a Cloud service. He deployed the same codebase to both Production and staging environment. He has also configured a 3rd party monitoring service to keep an eye on these environments.
After sometime he started seeing the alerts about his staging Environment that his website was unreachable externally. Here are few symptoms he noticed about the issue:
- It was happening only with Staging environment
- It was very intermittent and it seems that whenever customer is accessing the website, it works well at least for that request.
We confirmed that there were no issues from Azure backend side and it looks like an application specific issue. We logged into the VM and investigated the following logs:
This log contains status updates and heartbeat notifications and is updated every 2-3 seconds. This log will contain a historic view of the status of the instance and will tell you when the instance was not in the Ready state. We found that VM was in Good health on all those reported times.
IIS logs shows very few activities but we noticed entries with 200 response during the reported time with very high response time for small static files.
8/9/2013 at 4:33 AM GMT - 2013-08-09 04:33:22 W3SVC839472807 RDXXXXXXXXXXXX IP.x.x.x GET / - 80 - IP.x.x.x HTTP/1.1 - - - xxx.cloudapp.net 200 0 64 0 79 17224
Application & System Event Logs
We looked into Application and System event logs and they looked clean with few generic logs. But we saw a log (type Information) which was a bit interesting. This logs suggests that the AppPool was Idle and was shutdown due to no activity. And it was happening after 20 minutes for each reported time for incident.
Log Name: System
Event ID: 5186
Task Category: None
A worker process with process id of '12168' serving application pool 'DefaultAppPool' was shutdown due to inactivity. Application Pool timeout configuration was set to 20 minutes. A new worker process will be started when needed.
We looked at these logs individually which makes no sense and could not tell us more about the issue. But when we combined the IIS logs and the Event ID 5186, we found the following pattern:
- AppPool Idle timeout was 20 minutes
- We don’t see any entries in IIS logs before the 20 minutes time of each reported incident.
- We see Event ID 5186 mostly after 20 minutes of reported incident.
- We see Event ID 5186 20 minutes or earlier than the reported incident.
- We see entries in IIS logs with 200 response but with huge response time.
So all these symptoms suggest the following:
- Cloud Service was probably Idle and no one is using it expect the 3rd party monitoring service sending GET request to check the health once in a while.
- Since there was no one accessing the website, AppPool was shutdown after 20 minutes of idle period.
- Whenever a request comes in and AppPool is shutdown, IIS took a bit longer (since it was the first request and it got to launch the worker process) to responds. This is why we see a high response time in IIS logs.
- Since no one was using it, AppPool will probably shutdown again due to inactivity.
So it seems that the 3rd party monitoring service is probably having a smaller timeout and declaring the CLoudService unreachable while it was responding.