Why did my Azure VM restart?


 

An unexpected restart of an Azure VM commonly leads customers to open a support incident to determine the cause. The explanation below describes why an Azure VM might have been restarted.

 

Windows Azure updates the host environment approximately once every 2-3 months to keep the environment secure for all applications and virtual machines running on the platform. This update process may restart your VM, causing downtime for the applications and services hosted on Virtual Machines. There is no option or configuration to avoid these host updates. In addition to platform updates, Windows Azure service healing occurs automatically when a problem with a host server is detected: the VMs running on that server are moved to a different host. When this occurs, you lose connectivity to the VM during the service healing process. After service healing completes and you reconnect to the VM, you will likely find an event log entry indicating a VM restart (either graceful or unexpected). Because of this, it is important to configure your VMs to handle these situations in order to avoid downtime for your applications and services.
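
To make the "graceful or unexpected" distinction concrete, here is a minimal sketch that classifies a restart from the System event log IDs found on a Windows VM. The event IDs are well-known Windows ones; the function name `classify_restart` and the idea of passing in a plain list of IDs are assumptions made for illustration, and the sketch does not read the event log itself.

```python
# Well-known Windows System log event IDs related to shutdowns and restarts.
RESTART_EVENT_KINDS = {
    1074: "graceful",    # User32: a process or user initiated the shutdown/restart
    6006: "graceful",    # EventLog: the Event Log service was stopped (clean shutdown)
    41: "unexpected",    # Kernel-Power: system rebooted without cleanly shutting down
    6008: "unexpected",  # EventLog: the previous system shutdown was unexpected
}


def classify_restart(event_ids):
    """Return 'unexpected', 'graceful', or 'unknown' for a collection of event IDs.

    Hypothetical helper: any unexpected-shutdown event wins over graceful ones.
    """
    kinds = {RESTART_EVENT_KINDS[e] for e in event_ids if e in RESTART_EVENT_KINDS}
    if "unexpected" in kinds:
        return "unexpected"
    if "graceful" in kinds:
        return "graceful"
    return "unknown"


if __name__ == "__main__":
    # Example: IDs copied by hand from the System log after reconnecting to the VM.
    print(classify_restart([6008, 41]))
```

In practice you would collect the IDs from the System log around the time connectivity was lost and pass them in.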

 

To ensure high availability of your applications/services hosted in Windows Azure Virtual Machines, we recommend using multiple VMs with availability sets. VMs in the same availability set are placed in different fault domains and update domains so that planned updates or unexpected failures will not impact all the VMs in that availability set. For example, if you have two VMs and configure them to be part of an availability set, only one VM is brought down at a time when a host is being updated. This provides high availability, since one VM remains available to serve user requests during the host update process. Mark Russinovich has published a great blog post that explains Windows Azure host updates in detail. Managing high availability is detailed here.
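
The two-VM example above can be sketched as a toy simulation. This is not Azure's actual placement or update logic: the round-robin placement rule and the names `assign_update_domains` and `simulate_host_update` are assumptions made purely for illustration.

```python
from collections import defaultdict


def assign_update_domains(vm_names, update_domain_count):
    """Spread VMs round-robin across update domains (toy placement rule)."""
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % update_domain_count].append(name)
    return dict(domains)


def simulate_host_update(domains):
    """Update one domain at a time; yield (domain, down VMs, available VMs) per step."""
    all_vms = [vm for vms in domains.values() for vm in vms]
    for domain_id in sorted(domains):
        down = domains[domain_id]
        available = [vm for vm in all_vms if vm not in down]
        yield domain_id, down, available


if __name__ == "__main__":
    domains = assign_update_domains(["web-1", "web-2"], update_domain_count=2)
    for domain_id, down, available in simulate_host_update(domains):
        print(f"update domain {domain_id}: restarting {down}, still serving {available}")
```

With two VMs in two update domains, every step leaves one VM serving requests. With a single VM, the only step leaves none available, which is the single-instance situation discussed in the next paragraph.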

 

While availability sets help provide high availability for your VMs, we recognize that proactive notification of planned maintenance is a much-requested feature, particularly to help prepare in a situation where you have a workload that is running on a single VM and is not configured for high availability. While this type of proactive notification of planned maintenance is not currently provided, we encourage you to provide comments on this topic so we can take the feedback to the product teams.

[Update] Planned maintenance notifications are being sent for single-instance VMs. However, they may only be reaching account administrators.

 

Keywords: VM, Restart, Shutdown, Unexpected reboot, Windows Azure


Comments (53)

  1. Wei JIN says:

    I would like to have an email notification 48 hours before my VM restarts.

    Thanks.

  2. dskrtic says:

    Also, by default, VMs are set to download and install Windows Update patches automatically. Post-update reboots can also cause downtime. My suggestion is to adjust Windows Update settings to download but not install patches; then when you have a maintenance window, manually install the patches and allow the VM to reboot.

  3. Shaun Fang says:

    For Linux, will Role.StateConsumer in waagent.conf be triggered when such a reboot will be performed? How long will the fabric wait for the script to return?

  4. Ryan says:

    It would be great if the updates are scheduled and that advanced notice is given.

    Having hosts that are already upgraded ahead of time would be great.

    The customer could then choose to move to the upgrade themselves at a time suited to their scheduled ahead of time and thereby avoid the scheduled maintenance window which may not suit them, or wait and be upgraded automatically during the scheduled upgrade window.

  5. Sean says:

    So let me get this straight, it's standard practice to randomly reboot production servers without any notice?

  6. Pete says:

    This is the worst business practice I've ever seen.  You will NOT reboot or shutdown any of our servers.

    If this is an issue, I expect someone to contact me about this immediately.

  7. Robert Chipperfield says:

    Is there any equivalent of fault domains / availability sets for Azure Web Sites running in standard mode?

    I see the recommendation to run multiple VMs for AWS to increase availability, but will these automatically get assigned to different fault domains when they're Web Sites VMs? If not, is there a way of exposing the availability set control available to Virtual Machines to Web Sites as well?

  8. jimh says:

    That really sucks. you MS guys just give a SHUTDOWN without any respect on your user's data. i would move to AWS or softlayer if this issue cannot be resolved

  9. Andy Ball says:

    Think of it as an incentive to deploy an effective High Availability solution 😉 . Seriously though , it would be nice to be able to Self schedule within a range of days via the Portal. I've worked in a few enterprises where they had this option and it was very effective and popular. ie Customers get notified that your VM's have been scheduled to be patched between 2-6am on 29/12/2013, Click here to edit schedule

  10. Naz says:

    Hi Guy's,

    Why can't you live migrate the VM's off the host and patch your hosts. This function is there so please use it otherwise it makes life difficult for people using Azure when VM's just shutdown and restart.

  11. Kas says:

    Has anyone at MS seen these comments? Is anything being done? This lack of control is really not acceptable. We just had 2 servers restart back-to-back for about 30 minutes each in the middle of the afternoon. No notice and a generic shutdown message in the event log. Effectively your recommendation is to double the infrastructure cost across the board to get around an update policy you have forced driving random server updates. What a joke.

  12. jimbob says:

    I am completely blown away that you guys think random restarts of production VMs is even remotely acceptable. Or that you think doubling up on VMs (and thus cost) is actually a reasonable answer. Positively boggles the mind. If we had known in advance that this was considered par for the course on Azure IaaS, we wouldn't have even considered putting production services on it. I think you know full well that this is true of just about any prospective Azure customer. Radical idea: how about if, instead of joking about it being "incentive to deploy .. HA", you bloody well fix it?  

    Seriously. Web server down at 2PM EST for us today. Then within 30 minutes of it being back up - after ensuring customers that, yes, it's back up now, so sorry, just a glitch - the DB server goes down. Just amazing that you think this is OK. Do you seriously think any other VM provider does?

  13. Roberto says:

    This rebooting thing is our main reason for not choosing Azure. Actually, it is the ONLY reason. We are a service provider and host several virtual servers for our customers in a third party datacenter. None of our customers would accept a planned reboot during work hours, not even when they are notified in advance.

  14. Bob says:

    SQL Azure doesn't meet all the needs so if someone has to deploy their SQL Server in Azure VM and it randomly reboots, who in their right mind would think of using Windows Azure VM's?

  15. Jeff says:

    This is ridiculous! I can't believe that a service provider would expect their customers to be ok with this process. Rebooting all servers with zero notification and in the middle of the day is unacceptable. Looks like we will have to find a new provider. The escalation desk told me they were unaware of any scheduled maintenance, so I guess they don't even provide internal notification.

  16. Mike says:

    This is the first page that should come up when any potential customer searches for Azure hosting or Azure vs Amazon Web Services

  17. Mike says:

    Another caveat is that the load balancer is not told that the reboot is about to occur.   Thus, incoming HTTP requests will be routed to the stopped VM for some time.   Even the HA solution suffers from this problem.

  18. Naz says:

    Hi Guy's, why is Azure not using Live Migration to move customer Vm's to other nodes when they have to do updates? also when a VM is restarted it goes down for 20-30 mins not just 2mins which makes it impossible to explain to the customer, especially when your application relies on multiple VM's which can then cause an outage of a few hours while Microsoft reboots them, also the lack of any notifications of upgrades taking place is a huge issue. If we had a notification at least we can warn our customers.

    Azure Vm's are by no means ready for production use, be warned everyone!

  19. Bob says:

    Engineer your product properly and this doesn't become an issue.

  20. Josh says:

    This just happened to me twice in one evening.  and not only did it take each server much longer than normal to boot, the essential services (iis on web services, sql on SQL services, etc ) did not restart with out manual intervention.  I have fail overs and HA, but it is of no use when i have to babysit all the servers so that they actually come back up before another server restarts.  This does not happen when i restart the servers myself so why when MS does it?   I have been pretty happy with Azure so far but this is a pretty big thorn in my side.

  21. pfr0g says:

    Josh,

    Same issue here. Last night my servers, in an availability set, were rebooted an hour apart from each other. MS support confirmed the reboots were initiated because of host upgrades.  After the Microsoft initiated restarts my IIS, ADFS and SQL services did not come up without manual intervention. This caused an outage of my entire Office 365 environment that I specifically designed to NOT happen with HA, Availability set, Load Balancing, 2 Farm servers, 2 proxy and a separate AD site with RODC servers!

    To add insult to injury it happened again on 2/22 at 12am EST to both servers. My guess is service healing after they realized something with the host upgrade went wrong which causes another restart and again services had to be manually restarted.

    Issue I have with this

    1. Servers in an availability set being updated within an hour of each other. This is, in my opinion, an unacceptable practice. At least separate by a day or two in case something goes wrong, as it did, to allow the customer time to recover the first server.

    2. Updates causing server issues where services don't start. I can understand the no notification, I can even tolerate the reboots in the middle of the day. I can't tolerate maintenance causing my services to fail.

    Needless to say we are looking at moving our entire Azure IaaS to another platform.

    -Nick

  22. josh says:

    Nick, did MS support provide any reason why the services failed after the restart?  I have gotten no where on finding the answer to that question.  This really gets me thinking, if I had a couple hundred servers, and they begin restarting haphazardly, am I to wait for days on end for each server while it restarts? What is the solution?

    If I find out any information I will be sure to post it back here.

  23. josh says:

    Not to go on a rant but one other thing I think could be helpful - during these maintenance windows where no notification is given, the azure service dashboard indicates everything is hunky dory - when clearly it is not.  If Microsoft is not going to send out a notification to its customers it could at least acknowledge the SOMETHING is happening - with any updates that I could then pass along to my customers. It baffles me that the whole process has to be so opaque.  My $10/mo small time web site host tells me when maintenance is going to happen - why cant an organization as large as MS make this happen?

  24. Nick says:

    Josh,

    I have a pending case still. I will let you know when I get an answer. I am also calling our Microsoft TAM on Monday to escalate. Did this occur on your servers on 3/21? Just curious if it was on the same day.

    Nick

  25. Larry says:

    Microsoft reboots and does whatever the hell they want. There is no regard for the customer.  Vm's will reboot any time any where. Get used to it.

  26. Connor says:

    Just got a prod DB server (linux) rebooted with NO notification. Absolutely INSANE.

  27. Connor says:

    and now the web server... Time to move to AWS

  28. drbell says:

    Is it safe to conclude that with 6 months worth of comments, that no MS moderator is responding to customer needs discussed on this post?

  29. Lucian D says:

    Hello,

    We've also had machines rebooting for a few months. Things only got worse in the last month: starting March 21st, when there was a major storage outage in West Europe (which Microsoft didn't bother to point out in the service dashboard until 7 hours AFTER we started having significant issues), our instances started restarting about once or twice a day per instance.

    I've added a pretty long comment detailing our problem to a MSDN thread and got no reply, even though it was a  fresh post: social.msdn.microsoft.com/.../ubuntu-vm-completely-unusable

    The only way to get an answer was to get in touch with Microsoft Romania and convince them that we must get through to the Azure team because this can't keep happening. We got a promise, but no support for the moment.

    Unfortunately we're stuck with Azure because of a large commitment we made (Enterprise Agreement), but right now nobody wants to start another instance on Azure so we're left with the money spent and an unstable platform to run our services on (the ones that were already migrated).

    As somebody pointed out before: it is NOT ok for instances to be stopping randomly (this happened for a few weeks in December, nobody explained why, even though it was widespread). It is also NOT even close to OK to have instances restarting at random. Out of the few hundred servers we have on Amazon, we've had maybe 10-15 restarts in the last two years. And that's with notifications in advance (we've detected that the instance is degraded, we might have to restart it), which gives you enough time to either restart it when you decide it's best or just start another instance to send traffic to.

    So please, Microsoft, start being more transparent and respect the people who want to develop applications on Azure. Hiding your face in the sand every time we start complaining is not a great communication strategy. Also, not replying to even one of these comments will hurt you WAY more (as in people will migrate to AWS) than replying and saying "yeah, we screwed it up and we have no fix for it at the moment. But at least we're trying".

  30. Alan says:

    Riding a wave of post-BUILD enthusiasm, we were seriously looking at moving some.of our stuff into Azure.  Now I'm having second thoughts.  I can't understand why they let this sort of thread smoulder unattended.

  31. Hari says:

    Thank you very much for providing the feedback, sharing your experience. Feedback shared on this thread, shared via support channels have been communicated internally and it has been very valuable, well received and acknowledged. We recognize that customers do experience disruption, application outages during updates to VM infrastructure maintenance that cause VM restarts. Teams are rigorously making efforts to minimize the impact to customers during the Azure platform maintenance.

  32. Tom says:

    This is not acceptable as there are services which are running and to stop non gracefully means that there's a possibility of data corruption.

  33. IanC says:

    Wow, so this is why my VM was restarted at 6pm on Saturday... could you at least not wait until early morning?! My server is located in Europe so I think it's safe to assume the restarting at 3AM CET would be better than 6pm. That said I'd rather you didn't restart it at all.

    Will be moving to a different provider ASAP.

  34. Rob says:

    For the past three days, between 8:30AM and 9:30AM, the virtual network connectivity between VM's in our Azure cloud disappeared. Over the past year, I have documented cases of servers being randomly rebooted (healing MS calls it) in the middle of the day. I have documented errors of a SQL Server VM not being able to write to a SQL database file (I/O taking longer than 15s) and then the errors "vanish" after a day or two. I have documented support requests from Microsoft where they admit that you must rewrite applications to actually take advantage of "availability sets" - e.g. any architecture that uses a shared network drive. I have logged at least a dozen support requests, and Microsoft has never once gotten to any conclusion. Microsoft support has admitted (in writing) they have no access to the actual hardware or data center diagnostics. Now, they give me two 12 hour time windows on Friday and Saturday when they're going to reboot my servers again. They can't figure out within 24 hours what they're doing.

    We have created a new cloud with a new provider and are copying content and database files over to it now. We've had to send a message to our customers coming clean on how bad Azure is, and that we're moving as fast as we can to the new cloud, and that we made a mistake in putting a couple new customers in this environment.

    I have been writing Web software for well over a decade (and have been programming for 30 years), hosted with many providers, and I have never encountered anything as unstable as Azure - and I have it all documented so this isn't just a rant. It's truly awful. Everything above is true, if you still think Azure is something to consider, do yourself a favor and move on.

  35. Rob says:

    And one more important p.s., I have an e-mail exchange from a Microsoft Support person who admits that the health dashboard does not always reflect all outages and they "recognize" that it is at the discretion of Microsoft what they post in the health dashboard.  Pretty easy to tout records of stability when the records aren't correct huh?

  36. Rob says:

    A conclusion to the aforementioned case.  There was a VM that everyday would simply disappear from the network of VM's in my cloud at exactly 8 AM PST.  Microsoft confirmed there is a "problem with the host upon which the VM is running".  So, just move it Microsoft, right?  And by the way, how come it didn't "heal" itself already?  Nope, I have to completely deallocate the VM or resize it to force the VM to jump to a new host.

    So, no automatic "healing", and they cannot even move VM's off malfunctioning hosts themselves, you have to do it. An entire week of customer down time, code debugging, diagnostics, etc. all lost to the above.

  37. Tw Bert says:

    We just had a restart of our linux VM's, without prior notice. We need a warning from Azure 48 hours before any server reboot. I can't find any information about the reboot on the dashboard, not even after the fact.

  38. Dan says:

    I am terrified reading this thread! I just had a shutdown of our production SQL database, and had to start it up manually from the portal!

    I opened a support ticket, but from reading these posts I am not too optimistic about this! We just signed up for a prepaid 3 year Enterprise Agreement - what a nightmare

  39. Mike says:

    We just finished moving an 80 VM Azure enterprise deployment to AWS for the reasons others have commented on above.  When we engaged MS Premium Support to help diagnose why VMs were restarting automatically, they offered to bring an Australian Azure Architect in to discuss the deployment architecture.  We took them up on this offer and surprise, surprise, we were told by the architect we needed an additional 30 VM DR production deployment to get around various Azure problems.  Needless to say the client wasn't impressed that the architect acknowledged problems and suggested they pay significant extra dollars to work around them!  Since moving to AWS we've been satisfied (it has its issues too) but the level of maturity is much higher and the whole offering is a lot more robust.  Additionally their support has been responsive so far and helpful without recommending we buy more VMs!

  40. Markus says:

    If we cannot do anything about these updates, why is this performed in the european data center during BUSINESS HOURS???? (last incident on a Monday at 3:46 PM)

    Why do you not shut down the machines cleanly? -> Got a Kernel-Power EventID 41...

    In our local Hyper-V environment i can move VMs Live (VSM) to another Hyper-V host - does Azure not support this functionality????

  41. Brandon S says:

    Dear Microsoft,

    I am a very senior Microsoft engineer. I have been building and supporting large Microsoft networks for over 20 years. I have managed large scale, internal data centers at AT&T, Aetna, and Prudential. I've also managed servers from hosting providers Amazon, Rackspace, GoDaddy, CBeyond, HostMySite, Hosting.com, HostGator, LiquidWeb, and a few others. At no point over the past 20 years have I ever experienced random, unscheduled reboots at the frequency that Azure seems to have. Not to mention the massive outages that have occurred in August and November '14.

    Please help me understand why you chose to get into the hosting business when you clearly cannot keep servers up reliably. Your support people keep referring me to the SLA which states that you need 2 servers in a balanced configuration in order to have any sort of reliability. That's some bold SLA language you've got there Microsoft. How is it that Amazon can keep a single Windows server up more reliably than you can? Their SLA covers single servers, why doesn't yours?

    In fact, I'm having trouble trying to come up with something... anything else that I could purchase where the manufacturer tells me that I need 2 of them in order for it to be reliable. Does Ford tell me I need to buy a second Fiesta in order to guarantee that the car will work? I realize that is a different industry, but in the hosting world, no other provider tells me that I need 2 servers in order for them to be reliable. Everyone else just seems to be able to keep their single servers up for extended periods of time. But for some reason, this is too much to ask of Microsoft.

    At the moment, I have a client with 30 servers in their data center that they are looking to move. I really want to bring them over to Azure, however, telling them that they need 60 servers is NOT an option, and frankly it's just silly to suggest that.

    Please, please fix this. At least make Azure as reliable as your biggest competitor AWS, and put your SLA behind it, just like they do. I agree that having 2 servers is better, and does allow for "high availability", however that shouldn't suggest that 1 server means limited availability and zero guarantee.

    Thanks for listening. Hope you can do something, soon.

    -Brandon

  42. Bob says:

    Yep this sucks.  Who can work like this?  I've been evaluating Azure and unless this policy changes, I think I've evaluated enough.

  43. Timothy says:

    I've been doing a trial run of Azure and it's pretty stable for the most part, but I get random reboots just like everybody else here.  I'm running a single VM with a non-critical workload, so that's actually okay for me.  My problem is how this always seems to happen in the middle of the day local time.  I mean if I buy a VM hosted in USA East, I would expect that host maintenance would happen at 3AM, not 3PM.

    I initially suspected the host went down and it was a force failover to a new physical host, but the fact that it's happened about 10 times now, all in the middle of the day, I'm pretty sure I haven't had the bad luck of 10 physical hosts going down over four months.

    Some of the posters here have unrealistic expectations, but everybody is correct that the current way this is being done is well below par compared to other hosting providers.

    Personally, I'd be happy with just a few small improvements.

    1. Email notification immediately after a failed physical host and forced migration.

    2. 48 hours notice ahead of time for host updates that will cause a shutdown.

    As others above have mentioned... why the heck isn't live migration being used for scheduled host updates?  With Server 2012 R2, in my tiny little 12 physical server setup, I can do shared-nothing live migration between my Hyper-V hosts.  How on earth is this not available in Azure?

  44. Sean says:

    Alright MS, not to be that whiny kid but this is a real deal breaker.  Like many people point out, you do have live migration and yes it might be some work to set it up at first but come on.. people don't want random server restarts... Honestly I was looking forward to working with and understanding Azure but this is a real problem..

  45. Nigel Moore says:

    We have had this in the Australian East data centre right in the middle of the Australian work day (11.30am). Completely unexpected / non graceful reboots. Out of the machines in question, one of them was a Domain Controller and suffered data corruption, obviously NOT ideal.

    We are highly concerned about putting any more clients in Azure until this is resolved. We just reverted back to putting a deal in our old cloud VM provider and we have another few more deals on the table that we'd love to put through Azure, however are too concerned about the business risk. Small businesses in Australia (or anywhere in the world) can't justify putting twice (if not more) the amount just to stick with Microsoft - they'll simply go to other providers until this is sorted.

    Honestly, this really does sound like a money grab (customers needs twice the number of services to run properly) than a technical issue as we all know Microsoft has the technology to ensure that this doesn't happen.

    Will be keeping a keen eye on this thread as we REALLY want to just sell/support a full Microsoft stack however Azure is the one thing stopping us from doing that at the moment (O365, Server, Win8, Office are all amazing and are core of everything else we do, we just can't rely on Azure just yet).

  46. Darian says:

    I am in the middle of "Lift and Shift" operation and have been planning on Azure as the target.  In fact, I have spent plenty of time learning the PowerShell cmdlets and prepping the design for a truly automated deployment for a subscription and all of its services.  After reading this, and all the comments, to say that we need two servers in an availability set to rely on uptime due to random server downtime on MS's end is, quite simply, insane.  "Proactive notification" shouldn't be a "suggested feature" for goodness sakes - it's a required element!   True...new client connections will be routed properly to the other host on a shutdown  (and some will be lost in the shuffle..depends on how fast the Load Balancer detects the outage.)   But that's only half the issue... there's not a single comment made to all of the current work being processed by the host at the time some tech just happens to flip the power switch without even performing a proper shutdown, let alone a notification.  True, we can use Aysnc Queues.. but at some point in the life of a process, it has to do some actual work...like database access and data manipulation, computation, interaction with third party services...ya know, stuff.  It's pretty difficult to coordinate ACID type behavior with third parties and remote database operations while experiencing frequent system drops. To know full well that you are guaranteed to get hard stops at random points in time for every active process going on gives me some serious pause when it's dismissed so easily...  I wish I saw this thread earlier as now I'm seriously re-examining the decision.  True, we are supposed to code defensively and we do... but the point of moving to hardened data centers is to get IMPROVED operations, not random restarts.  If my NOC guys randomly rebooted my servers in my datacenter routinely without notice, I'd fire them on the spot.  Azure is supposed to be the 'better way' for goodness sakes!    
    100+ Servers, 5+TB of data, 20,000 active users, 5,000 SQL transactions per second with guaranteed financial calculations.. I cannot have random system drops.  I don't think I can even bother trying to Lift and Shift small, less important components over given this behavior, let alone the core/important pieces.  Azure VMs apparently must be relegated to simple website hosts reading static files from disks?  Very disappointed as I thought for sure Azure was the way to go.  Might be good for PaaS with simple websites but apparently not in the decision tree for IaaS until this is corrected.

  47. Rini Boo says:

    I have been running a small business and I have been hosting my web site in my basement for 18+ years, from Windows 2000 back then to Windows 2012 R2 now. Recently I have been trying to move my business to the cloud, and evaluating both Amazon AWS and Azure.  It is unbelievable that my home made servers (made with standard desktop hardware) are more stable than Azure, seriously no kidding. I have desktop hardware with UPS battery backup and RAID 5 and the only non-scheduled downtime I had was the 2003 North East blackout and the 2014 ice storm in Toronto. I signed up for Azure VPC for less than 1.5 months, and there have been 3 critical reboots!!!???  What???  On the other hand, my Amazon AWS has been running for 2 months ROCK SOLID. Everyone, the choice is so clear.  Amazon is more than VPC, it hosts infrastructure, networking even DNS with Route 53.  Microsoft Azure is a joke, it cannot even beat my basement setup in terms of stability. LOL..

  48. Michael Smith says:

    Microsoft, can you do everyone a huge favor and put this at the top of all your Azure VM pages?  It would have saved me like 5 months of research time and work.

    "DO NOT USE AZURE VMs UNLESS YOUR AN ENTERPRISE THAT USES SOFTWARE WITH FULL HIGH AVAILABILITY FEATURES AND SUPPORT."

    Please put that above all the "look how great & cool azure is" stuff and stop wasting everyone's time.

  49. Frustrated says:

    My DS Series VM running SQL Server Enterprise just restarted for the 7th time in the last 2 days. This is insane.

  50. Dion says:

    Please people, if you are affected by this then up vote the suggestion here. It's mind boggling that this doesn't have more attention, being such a critical flaw

    feedback.azure.com/.../7031369-host-reboots-without-vm-reboot

  51. Mike says:

    I would like to be able to stipulate a default time of day for restarts. 03:30 AM for example
