Monitoring and Alerting in Azure

02/20/2016

The following features are provided to monitor Azure resources:

Azure Status – portal showing status of all Azure services
Service Health – information on health of Azure services
Resource Health – information on health of Azure resources
Audit Logs – outcome of management operations
Metric Alerts – metric alerts created for Azure resources
Azure Insights – consumption of monitoring information
Log Analytics – searchability of monitoring information
Security Center – results of a security analysis of virtual machines

Azure Status

The Azure Status web site provides up-to-date information on the current status of Azure services. It comprises two pages:

Current Status – information on the current state of all Azure services in all public Azure regions
History – impact summary for significant Azure service outages

Service Health

Azure Service Health provides information about service interruptions and performance degradations that may impact the Azure services used in a subscription. The information is about the service in general and is not specific to its use by individual resources in a subscription. Azure Service Health provides the history of changes to the incident status.

The Service Health information is accessed through the Service Health blade in the Production Portal, where it is available in summary or detail form, filtered by timespan. It can also be displayed by setting an Event Category filter of Service Health in the Audit Logs blade in the Production Portal. Stephen Siciliano documents how to do this here.

The Service Health information can be accessed using the Get-AzureRmLog command as in the following example:

Get-AzureRmLog -ResourceProvider Azure.Health -StartTime '01/01/2016' -EndTime '01/12/2016' -DetailedOutput

The Azure Resource Manager Insights API can be used to configure the automatic emailing of Azure Service Health alerts when a service health incident occurs. Matt Loflin has written a blog post showing how to set up email alerts for Azure Service Health using the preview Azure Insights .NET API. It is not currently possible to configure these alerts in the Production Portal or directly in either Azure PowerShell or the Azure CLI; however, it is possible to use this ARM Template to “deploy” an alert.

Resource Health

Azure Resource Health provides information about the current health of a resource. Bernardo Munoz introduces Azure Resource Health in this post.

Currently, Resource Health provides information for the following Azure services:

Virtual Machine (classic)
Web App
SQL Database

The Resource Health blade in the Production Portal lists all the resources for which resource health information is available. Additional information is then accessible for each resource, along with the opportunity to perform additional health checks and troubleshooting.

This information can also be found using the Check health link on the Settings property for an individual resource.

There is a preview Azure Resource Manager API that allows the retrieval of the resource health reports for a subscription, resource group or individual resource. The information retrieved is the same as that displayed in the Production Portal.

Audit Logs

Audit Logs provide status on every management operation performed against an Azure resource.

The Audit Logs blade in the Production Portal lists all the operations in the past week. Additional summary and detailed information is then accessible for each operation.

The primary Audit Logs blade has a filter that can be used to filter the summary list by:

Resource group
Resource type – e.g., Microsoft.Compute/virtualMachines
Resource Level – critical, error, warning, informational
Time span
Caller – i.e., user who invoked the operation
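
The effect of combining these filters can be sketched in a few lines of Python. This is only an illustration of the filtering logic: the event records and their field names (resource_group, resource_type, level, timestamp, caller) are hypothetical and do not reflect the schema used by the portal or returned by Get-AzureRmLog.

```python
from datetime import datetime

# Hypothetical records: the field names below are illustrative assumptions,
# not the schema used by the portal or returned by Get-AzureRmLog.
EVENTS = [
    {"resource_group": "myRG", "resource_type": "Microsoft.Compute/virtualMachines",
     "level": "error", "timestamp": datetime(2016, 1, 2, 9, 30),
     "caller": "alice@example.com"},
    {"resource_group": "myRG", "resource_type": "Microsoft.Storage/storageAccounts",
     "level": "informational", "timestamp": datetime(2016, 1, 3, 14, 0),
     "caller": "bob@example.com"},
    {"resource_group": "otherRG", "resource_type": "Microsoft.Compute/virtualMachines",
     "level": "warning", "timestamp": datetime(2016, 1, 6, 8, 15),
     "caller": "alice@example.com"},
]


def filter_events(events, resource_group=None, resource_type=None,
                  level=None, start=None, end=None, caller=None):
    """Return events that satisfy every supplied filter (None means 'any')."""
    exact = {"resource_group": resource_group, "resource_type": resource_type,
             "level": level, "caller": caller}
    selected = []
    for event in events:
        # Exact-match filters: skip the event if any supplied value differs.
        if any(want is not None and event[field] != want
               for field, want in exact.items()):
            continue
        # Time span filter: inclusive at both ends.
        if start is not None and event["timestamp"] < start:
            continue
        if end is not None and event["timestamp"] > end:
            continue
        selected.append(event)
    return selected
```

For instance, calling filter_events with resource_type set to Microsoft.Compute/virtualMachines and a time span of 1 to 5 January 2016 selects only the first sample record, because the third falls outside the time span.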

Audit Logs are also accessible from Azure PowerShell. The Get-AzureRmLog command can be used to retrieve the same summary and detailed information provided on the Production Portal. The command has various parameters allowing the returned information to be filtered. (Note that Azure PowerShell 0.x has three separate commands, one each for subscription, resource group and resource provider.)

For example, the following Azure PowerShell command retrieves detailed Audit Log events in the specified time interval for the Microsoft.Compute resource provider:

Get-AzureRmLog -ResourceProvider Microsoft.Compute -StartTime '1/1/2016' -EndTime '1/5/2016' -DetailedOutput

Metric Alerts

Some Azure resources support the collection of performance metrics and the creation of metric alert rules that fire when certain criteria are met, such as CPU % greater than 90% for 5 minutes.
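
As a rough illustration of such a criterion, the following Python sketch evaluates "CPU % greater than 90% for 5 minutes" against a list of one-minute samples. It assumes the rule compares the average over the window with the threshold; how the service actually aggregates samples is not stated here, and the function name rule_fires and its parameters are hypothetical.

```python
from statistics import mean


def rule_fires(samples, threshold=90.0, window=5):
    """Decide whether a threshold rule fires.

    `samples` holds one CPU percentage per minute, oldest first.
    Assumption: the rule fires when the average of the most recent
    `window` samples is strictly greater than `threshold`.
    """
    if len(samples) < window:
        return False  # not enough data to cover the whole window
    return mean(samples[-window:]) > threshold
```

With this assumption, a single low sample inside the window can keep the rule from firing even when the other samples are high, since only the window average is compared.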

Metric alert rules can be configured to send alert emails to specified Azure administrators and email addresses. They can also be configured to invoke web hooks which can cause additional notification, such as to pagers, messaging services, etc.

Stephen Siciliano has documented how to configure metric alerts in the Production Portal. Rob Boucher has documented how to configure alert web hooks in the Production Portal. It is also possible to use the Azure Resource Manager API to create metric alerts, similar to the creation of Azure Service Health alerts.

Azure Insights

Azure Insights is an Azure Resource Manager API supporting the retrieval of information about Azure resources and the use of this information to alter the behavior of the resources. Among other features it provides support for:

Management of Azure service health alerts
Management of metric alerts for a resource
Retrieval of metrics
Retrieval of Azure Service Events
Retrieval of Audit Logs
Retrieval of usage quotas

Azure Insights can be used only with ARM resources, and it provides no support for ASM services. Not all ARM resources support Azure Insights currently.

Virtual Machines

Azure Insights is an Azure Resource Manager REST API that can be used with Virtual Machines to:

Configure the collection of metric information
Create alerts based on metric values
Configure autoscaling

The collection of metric information from a VM requires the deployment of a VM extension:

Name: Microsoft.Insights.VMDiagnosticsSettings
Type: Microsoft.Azure.Diagnostics.IaaSDiagnostics

This can be done in the Production Portal or using the Set-AzureRmVmDiagnosticsExtension PowerShell cmdlet. The information captured, including performance counters, is specified in a configuration file that is generated automatically when the portal is used, or uploaded when PowerShell is used. Once captured, the diagnostic information is persisted into a specified Azure Storage account. The schema for the configuration is documented here. Note that the use of a Metrics element is essential for the information to be accessible using Azure Insights. For example:

<Metrics resourceId="/subscriptions/SUBSCRIPTION_GUID/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM">
  <MetricAggregation scheduledTransferPeriod="PT1H"/>
  <MetricAggregation scheduledTransferPeriod="PT1M"/>
</Metrics>
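
When the configuration file is produced by a script, the Metrics element can be generated rather than typed by hand. The Python sketch below builds the same fragment with the standard library; the helper names vm_resource_id and build_metrics_element are hypothetical, and the sketch covers only this one element, not the rest of the diagnostics configuration.

```python
import xml.etree.ElementTree as ET


def vm_resource_id(subscription_id, resource_group, vm_name):
    """Compose the ARM resource ID of a VM, in the form used above."""
    return ("/subscriptions/{0}/resourceGroups/{1}"
            "/providers/Microsoft.Compute/virtualMachines/{2}"
            ).format(subscription_id, resource_group, vm_name)


def build_metrics_element(resource_id, transfer_periods=("PT1H", "PT1M")):
    """Build a Metrics element with one MetricAggregation child per period.

    The periods are ISO 8601 durations; the defaults mirror the example.
    """
    metrics = ET.Element("Metrics", {"resourceId": resource_id})
    for period in transfer_periods:
        ET.SubElement(metrics, "MetricAggregation",
                      {"scheduledTransferPeriod": period})
    return metrics


if __name__ == "__main__":
    element = build_metrics_element(
        vm_resource_id("SUBSCRIPTION_GUID", "myRG", "myVM"))
    # Serialize to text; the output is unindented but equivalent XML.
    print(ET.tostring(element, encoding="unicode"))
```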

Once configured, the metric information is available both on the Production Portal and through the Azure Insights API where it can be used for monitoring, autoscaling and the generation of performance-based alerts (such as high CPU % for 10 minutes).

Log Analytics

Log Analytics is the monitoring feature of the Operations Management Suite (OMS). Among other features, it provides a way to ingest logs from IaaS Virtual Machines and PaaS Cloud Services and make these logs searchable in the Log Analytics Portal. Note that Log Analytics was recently renamed from Operational Insights.

Log Analytics supports two ingestion techniques: agent-based and Azure Storage account-based. Agent-based ingestion uses the Microsoft Monitoring Agent deployed onto the VM to transfer data directly into Log Analytics. Azure Storage account ingestion uses a standard PaaS or IaaS Diagnostics extension to capture and persist monitoring data to an Azure Storage account, from which Log Analytics can then pull it into its datastore.

The types of data supported vary between agent and storage-based ingestion.

Agent-based ingestion supports the following types of data:

Windows Event logs
Windows Performance counters
Linux Performance counters
IIS logs
Syslog (Unix)

Storage-based ingestion supports the following types of data (specifically not including performance counters):

Windows Event logs
IIS logs
Syslog (Unix)
ETW Logs
Service Fabric Events
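
The two lists can be captured as a small lookup, which makes it easy to check which ingestion technique is available for a given type of data. This Python sketch simply transcribes the lists above; the function name ingestion_options and the labels "agent" and "storage" are hypothetical.

```python
# Data types per ingestion technique, transcribed from the two lists above
# and stored in lower case so lookups ignore capitalization.
AGENT_BASED = {
    "windows event logs",
    "windows performance counters",
    "linux performance counters",
    "iis logs",
    "syslog (unix)",
}
STORAGE_BASED = {
    "windows event logs",
    "iis logs",
    "syslog (unix)",
    "etw logs",
    "service fabric events",
}


def ingestion_options(data_type):
    """Return the ingestion techniques that support the given data type."""
    key = data_type.strip().lower()
    options = []
    if key in AGENT_BASED:
        options.append("agent")
    if key in STORAGE_BASED:
        options.append("storage")
    return options
```

Performance counters come back with only the agent option, while ETW logs and Service Fabric events come back with only the storage option.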

The Production Portal can be used to configure a Log Analytics workspace and configure VMs to be monitored and storage accounts to be used for data ingestion. Other elements of the configuration must be performed using the Log Analytics Preview Portal – such as the configuration of which data is to be ingested through the agent (Overview/Settings/Data tab). The ingested logs can be queried using the Log Analytics Preview Portal.

The Log Analytics Preview Portal provides 32-bit and 64-bit versions of the Microsoft Monitoring Agent (Overview/Settings/Connected Sources tab). The agent can be deployed anywhere and is then configured with Log Analytics workspace credentials so that it can connect securely. ARM Templates (Windows, Ubuntu) can be used to deploy and configure the Microsoft Monitoring Agent on Windows or Ubuntu VMs.

Alternatively, the IaaS Diagnostics extension can be deployed onto the VM and configured to persist monitoring information to an Azure Storage account configured as a source for Log Analytics ingestion. For PaaS cloud services, the PaaS Diagnostics Extension can be deployed into the cloud service roles and then configured to persist monitoring information to an Azure Storage account configured as a source for Log Analytics ingestion.

Azure Security Center

Azure Security Center is a preview service available in the Production Portal that provides for the creation of security policies and the monitoring of various Azure resources to identify non-compliance with the policies. It also provides information on mitigating non-compliance. Currently, Azure Security Center supports the monitoring of Virtual Machines (v1 and v2), Endpoints, Network Security Groups, Web Application Firewall and Azure SQL Database.

For Virtual Machines, Azure Security Center works by installing several monitoring extensions (Microsoft.EnterpriseCloud.Monitoring/MicrosoftMonitoring and Microsoft.Azure.Security/Monitoring) onto VMs as well as mitigation agents (Microsoft.Azure.Security/IaaSAntimalware).

The following policy areas can be configured currently:

System updates
Baseline rules
Antimalware
Access Control List on endpoints
Network Security Groups
Web Application Firewall
SQL Auditing
SQL Transparent Data Encryption

Support for system updates, baseline rules and antimalware requires that data collection be configured for VMs.