(originally written by Gaurav Patole)
Microsoft Azure makes every effort to provide 100% availability. However, like all hyperscale cloud providers, Azure services occasionally encounter SLA-impacting downtime. When an event impacts you, your resources, or your customers, it is critical that you have the latest information and can make the right decisions to keep your business environment running optimally.
The majority of events that may impact services happen at such a small scale that we can usually identify the customers that are impacted or potentially impacted, and we communicate with you through your management portal.
You will receive notifications for the following types of communications:
- Service Impacting Events (Incident)
- Maintenance events
- Informational messaging
Service Impacting Events (Incident)
A service impacting event is an incident that is currently impacting one or more Azure resources. It could affect your ability to provision or manage resources or, in less likely cases, the availability or performance of existing resources. During an incident for which we send targeted notifications, you can see what communications are being sent to you by:
- Logging into the portal (https://portal.azure.com).
- When you reach the landing page, you may have enabled the Service Health Map. This is an indicator of incidents about which you are receiving communications.
- If you have not enabled the map, you will see the monitor icon on the left-hand side of your portal. If you do not see the icon, you can click on “More services” and search for “monitor”.
- NOTE: You can bypass steps 2 and 3 by following this link: https://aka.ms/portal-notifications
- Once you click on any of the regions, you will be taken to the Service Notifications blade.
As the issue progresses we will update you accordingly. Once the issue is mitigated, a “resolved” message with further details regarding the event will be sent to your Service notifications blade logs. This post will include all the communications that have been sent regarding the incident.
Maintenance events
Whilst maintenance is an ongoing activity in Azure, there are times when planned maintenance will impact your resources. A common scenario is a Virtual Machine that will experience a reboot. Whenever this happens, Azure will communicate the planned maintenance events that will impact any of your Azure services.
Microsoft’s policy is to communicate via email to the registered Admin and Co-admins of the Azure subscription when sending notifications on maintenance events. A typical email with details of a maintenance event can be found here. At the bottom of the email we list the impacted resources on the account so that you can either prepare for, or understand the behavior of, those resources during the maintenance window.
A maintenance communication is also sent to your management portal. Maintenance notifications will not appear on your map but will appear in the Service notifications blade.
Informational messaging
From time to time we will need to communicate to you about an event that is not currently impacting your resources, but may have an impact in the future. As a result we will need to inform you to take some action. As an example, we may have identified that you have exceeded storage scalability targets for your Storage account. We would send a recommendation that you spread your VHD disks across multiple Storage accounts for better availability and performance. A message of this nature would be posted in your activity logs for a period of no more than 7 days.
We send informational messages to your management portal. This will be evident from your map: you will see a blue notification in the region of the impacted resource.
Configuring email/SMS/webhook notification messages
All of the aforementioned types of communications are entered into the logs of the Service notifications blade. From there you can create alert rules from:
- An existing event in the logs
- A new event from the template
Creating Alerts from existing events
If you navigate to the Service notifications blade you will be able to see ACTIVE and RESOLVED types of communications. If you click on a communication, you will see the option to “add activity log alert” just above the content of the communication.
You will then be taken to a blade where you can configure the notification.
Once an ‘action group’ is created it can be used again for other types of alerts. For example, the same action group can be used for both incident and maintenance communication types.
You can view/edit existing Action groups or create a new Action group by clicking on the “Action Groups” blade in the portal.
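If an action group includes a webhook, the receiving endpoint gets a JSON payload for each alert. The following is a minimal sketch of how a receiver might sort those payloads by communication type. The field names used here (`data.context.activityLog.properties` with `incidentType`, `title`, and `stage`) are assumptions about the activity log alert payload layout and should be checked against a real payload; `summarize_notification` and the sample payload are hypothetical.

```python
import json

# Assumed mapping from the payload's incidentType value to the three
# communication types described in this post.
KIND_BY_INCIDENT_TYPE = {
    "Incident": "Service impacting event",
    "Maintenance": "Maintenance event",
    "Information": "Informational message",
}


def summarize_notification(body: str) -> dict:
    """Extract the communication type, title, and stage from a webhook body.

    The nesting used here is an assumption about the payload layout;
    missing fields fall back to "Unknown" instead of raising.
    """
    payload = json.loads(body)
    activity_log = payload.get("data", {}).get("context", {}).get("activityLog", {})
    props = activity_log.get("properties", {})
    incident_type = props.get("incidentType", "Unknown")
    return {
        "kind": KIND_BY_INCIDENT_TYPE.get(incident_type, incident_type),
        "title": props.get("title", "Unknown"),
        "stage": props.get("stage", "Unknown"),
    }


# Hypothetical sample payload, trimmed to the fields used above.
sample = json.dumps({
    "schemaId": "Microsoft.Insights/activityLogs",
    "data": {
        "status": "Activated",
        "context": {
            "activityLog": {
                "eventSource": "ServiceHealth",
                "properties": {
                    "title": "Example maintenance notice",
                    "incidentType": "Maintenance",
                    "stage": "Active",
                },
            }
        },
    },
})

summary = summarize_notification(sample)
print(summary["kind"], "-", summary["title"], "-", summary["stage"])
```

Falling back to "Unknown" rather than raising keeps the receiver tolerant of payloads that do not match the assumed layout.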
A new alert from scratch
You can also create a new alert from scratch by clicking on the Alerts blade in the management portal. From there you can choose “Add activity log alert”, which brings you to the page where you configure the alert.
SMS and email notifications are rate limited. A particular phone number or email address will not be sent more than:
- 10 SMS messages per hour
- 100 emails per hour
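To get a feel for what these limits mean when many alerts target the same recipient, here is a minimal sketch of a sliding-window counter that applies the two limits listed above. It illustrates the arithmetic only, not how Azure enforces the limits; the class name `HourlyLimiter` and the phone number are hypothetical.

```python
from collections import defaultdict, deque

# Limits taken from the list above, per phone number or email address.
LIMITS_PER_HOUR = {"sms": 10, "email": 100}
WINDOW_SECONDS = 3600


class HourlyLimiter:
    """Sliding one-hour window per (channel, recipient) pair."""

    def __init__(self):
        self._sent = defaultdict(deque)  # (channel, recipient) -> send times

    def allow(self, channel: str, recipient: str, now: float) -> bool:
        """Return True and record the send if the recipient is under the limit."""
        times = self._sent[(channel, recipient)]
        # Drop sends that are an hour old or older.
        while times and now - times[0] >= WINDOW_SECONDS:
            times.popleft()
        if len(times) >= LIMITS_PER_HOUR[channel]:
            return False
        times.append(now)
        return True


limiter = HourlyLimiter()
# Twelve SMS attempts to one number, one per minute.
results = [limiter.allow("sms", "+15550100", now=60.0 * i) for i in range(12)]
print(results.count(True), "sent,", results.count(False), "dropped")
# One hour after the first send, the window has room again.
print(limiter.allow("sms", "+15550100", now=3600.0))
```

In this run the first ten attempts go through and the last two are dropped, so if SMS is your only channel for a noisy alert, some notifications in a busy hour may never arrive.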
Azure Status Page
Certain events have impact at a larger, regional scale. In these scenarios, we publish to the Azure status page.
The status page is organized as a large matrix of every service and its regional offerings. The services and regions affected will have a warning, error, or information icon displayed in the corresponding cell of the matrix.
Communications posted here apply to impacted customers in general. Not all customers who have a service in an affected region are guaranteed to be impacted. For customized notifications about the impact to your own resources, the management portal is the place to visit.
Users can also subscribe to the RSS feed of the status page and get notified any time there is a change. This is helpful when you want to be proactively notified about a platform issue rather than reactively checking the status page when you believe there is an outage.
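As a rough sketch of what a feed subscriber does, the snippet below parses an RSS document and reports items it has not seen before. The sample XML is a made-up stand-in for the status page feed, and `new_items` is a hypothetical helper; a real subscriber would download the feed on a schedule and persist the set of seen identifiers.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample standing in for the status page feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example status feed</title>
    <item>
      <guid>example-1</guid>
      <title>Example incident in one region</title>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <guid>example-2</guid>
      <title>Example incident mitigated</title>
      <pubDate>Mon, 01 Jan 2018 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def new_items(feed_xml: str, seen: set) -> list:
    """Return (guid, title) pairs for items whose guid is not in `seen`.

    Newly found guids are added to `seen` so the next call skips them.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        title = item.findtext("title", default="")
        if guid and guid not in seen:
            seen.add(guid)
            fresh.append((guid, title))
    return fresh


seen_guids = {"example-1"}  # pretend the first item was already reported
for guid, title in new_items(SAMPLE_FEED, seen_guids):
    print("New post:", guid, title)
```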
Let’s say you believe you were impacted by an Azure outage but it’s no longer on the status page. Or let’s say you want to view details about a past outage. You can visit the status history page, which displays outages posted in the last 90 days. If you were impacted, the incident should also appear in your Service Notifications blade. Any RCAs for incidents that potentially impacted customers will be published to that location.
There is also the capability to filter by product or service, region, and date of the outage.
Let us know if you found this useful! To give feedback on the features of Azure App Service, you can submit it here: Azure App Services - User Voice