Connection Timeouts in Multi-subnet Availability Group


THE DEFINITION

One of the issues that generates a lot of call volume we see on the AlwaysOn team is dealing with connectivity issues to the availability group listener in multi-subnet environments.

A “multi-subnet” environment is defined when the OS cluster used as the backbone for AlwaysOn has server nodes that are located in multiple, different subnets. Usually there are only 2 subnets, however, there can be more.

When the availability group listener (AGL) is configured properly, it will have an IP address for each defined subnet and have an “OR” dependency on each of the IP addresses. By default, when it is brought online it will be registered in DNS by Windows Cluster. The cluster will submit all of the IP addresses that are in the dependency list and the DNS server will generally register an A record for each IP address. (If non Microsoft Windows DNS servers are used, the exact implementation can be different).

When a client operating system (OS) needs to resolve the AGL name to IP by querying the DNS server, the DNS server will return multiple IP addresses – one for each subnet. The listener IP address in the subnet currently hosting the availability group primary replica will be online. The other listener IP address(es) will be offline. Because not all of the IP addresses returned by DNS will be online, client applications can run into problems when attempting to connect to the listener.

THE PROBLEM

By default, the behavior of the SQL client libraries is to try all IP addresses returned by the DNS lookup - one after another (serially) until the all of the IP addresses have been exhausted and either a connection is made, or a connection timeout threshold has been reached. This can be problematic, because depending upon DNS configurations, the “correct” or “online” IP address may not be the first IP address returned. The default timeout for a TCP connection attempt is 21 seconds and if the first IP address attempted is not online, it will wait 21 seconds before attempting the next IP address. For each subsequent IP address, it will again have to wait 21 seconds before moving to the next IP address until the connection attempt times out or it establishes a connection to an IP address that responds.

The default connection timeout period for .NET client libraries is 15 seconds, therefore, some applications may experience intermittent connection timeouts – or delays in connecting – which can cause application delays or performance issues.

THE RESOLUTION

Beginning with the native client libraries for SQL 2012 as well as the .NET 4.5 libraries (earlier .NET libraries with hotfixes – see below in Appendix A), Microsoft added a new connection string parameter that can be added to change the connection behavior. This new parameter, MultiSubnetFailover, should be used and set to “TRUE.”  When set to TRUE, the connection attempt behavior changes. It will no longer attempt all of the IP addresses serially, but in parallel. That is, all of the IP addresses that the availability group listener is dependent on will receive a SYN request at the TCP layer “in parallel” (technically one immediately after the other, but not waiting for acknowledgement – so effectively “in parallel”). This means that whichever IP address is online will be attempted immediately rather than waiting for any timeouts on IP addresses that are not online. The server will respond immediately and establish a connection, while the other IP addresses and their respective connection attempts will eventually timeout – but since the application is already connected it does not matter that those connection attempts timeout and fail.

THE PROBLEM 2 – THE SEQUEL

The client libraries by default do not enable this parameter (i.e. it is set to “FALSE”). The connection strings must be modified in order to ensure successful consistent and successful connections to a multi-subnet listener. Sometimes it isn’t possible to modify the connection strings – and so some applications will still encounter timeout issues when trying to connect. In some cases, the connection timeouts can be intermittent or they can be very consistent - depending upon the order in which IP addresses are returned.

RESOULTION 2 – THE SEQUEL

One option to resolve the issue if an application cannot use the MultiSubnetFailover parameter is to change the behavior of how the AGL is registered with DNS. This assumes that dynamic updating of DNS is allowed within the environment. If you are unsure, please check with your DNS administrators to determine if dynamic updating of DNS is allowed.

There are two parameters that affect how the AGL is registered with DNS. By modifying these parameters on the server we can transparently change the experience of the client OS in its name resolution caching.

The first parameter of interest is called RegisterAllProvidersIP. This parameter determines whether the Windows Cluster will register all of the IP addresses the AGL is dependent on, or only the one active IP address. When set to 1 (default if the AGL is created from SQL Server), the AGL clustered resource is created with all of the IP addresses the AGL is dependent on, registered in DNS. When set to 0, only the one active IP address is registered in DNS (the IP address in the subnet hosting the primary replica). (NOTE: if a Client Access Point is created using Windows Failover Cluster Manager, the RegisterAllProvidersIP parameter is set to 0 by default.)

The second parameter is called HostRecordTTL. This parameter governs how long (in seconds) before cached DNS entries on a client OS are expired, forcing the client OS to re-query the DNS server again to obtain the current IP address. By default, this value is 1200 (20 minutes). This means that after a client OS makes a call to the DNS server to resolve this name to an IP address, the client OS will cache that value for 20 minutes, only querying the DNS server again after that cached record expires. If this value is reduced to 120 or 60 for example, then the client OS cached copy will expire much more quickly.

This is important because during a failover in which the primary replica moves from one subnet to the other, the old IP address that was online is un-registered, and the new IP address that is brought online is registered. This updates DNS with the new IP address as soon as the AGL comes online, but client operating systems will not resolve the AGL name to the new IP address until the currently cached entry expires, which if it had just re-queried DNS immediately before the failover, the client OS would have to wait up to 20 minutes before expiring its cached copy and querying the DNS server again to get the new IP address. This causes the client OS to continue trying to connect to the OLD IP address until its cached copy has expired. By changing the HostRecordTTL parameter value to a much lower setting than 1200, it will cause the cached value to expire more quickly. So if set to 60, the client OS will only have to wait at most 60 seconds after a failover before acquiring the new IP address – allowing client operating systems to resolve to the new, correct IP address much sooner.

The drawback to setting the value to a lower number is how often the client OS will query the DNS server. If you have a handful of application servers, then changing the value from 1200 to 60 would probably have no perceptible impact on the DNS server(s). However, if there are thousands of client machines that all must resolve the AGL name to IP, this increases the load on the DNS server(s) and could cause problems.

A balance must be drawn between the lowest possible cache expiration time and the increased DNS server load.

The following PowerShell instructions show how to change the RegisterAllProvidersIP and HostRecordTTL settings. It is important to note that these settings cannot take effect until the AGL is brought offline and then online again, forcing it to re-register with DNS. Remember, the availability group is dependent on the AGL. If the AGL goes offline, so will the availability group and the databases. However, this dependency can be temporarily removed, allowing for the OFFLINE and re-ONLINE of the AGL without taking the availability group offline.

If there are applications or users that are actively using the AGL to connect to a replica (primary or secondary with read-only Routing), then the OFFLINE/ONLINE process will cause service interruption regardless of dependencies. Therefore, if the AGL is in use, perform the following steps during a maintenance window.


STEPS TO CHANGE

Availability group listener resource parameters:  RegisterAllProvidersIP and HostRecordTTL

The following steps show how to turn off RegisterAllProvidersIP and reduce the Client OS DNS cache timeouts (HostRecordTTL) parameters, in the event that you cannot use the MultiSubnetFailover=True parameter on all connection strings.

If you prefer, there are two sample scripts (one TSQL and one PowerShell) in Appendix B at the end of this document that has all of the required commands already configured.  These scripts can then be executed to perform all of the steps necessary.

NOTES:

  • The “>” symbol at the beginning of each line represents the command prompt and should not be typed.
  • Resource names are listed inside brackets (“<” and “>”) – do not include the brackets when typing the command for execution.

       

Before making any changes, it is necessary to get the PORT values for each listener defined in the AG.  This is because temporarily removing the cluster resource dependencies for the AG resource on the AGL will eliminate the port assignment for the listener.  If the port assigned was not 1433, the port needs to be specified again for the listener.   Capturing the existing port assignments before making any changes will allow the restoration of the proper port assignments at the end of the script.

1.  Capture the existing port assignment for each listener in the AG.

In SQL Server Management Studio, connect to the AG primary node execute the following TSQL and then keep the results to execute after all of the PowerShell commands have been completed.

      

SELECT '-- (1) Copy/Paste the results of this query '
    + ' into a query window.'
    AS [Generated TSQL script:]
UNION
SELECT '-- (2) After all PowerShell scripts/command'
    + ' have been executed,'
UNION
SELECT '-- (3) Execute the following TSQL commands'
    + ' to restore PORT settings.'
UNION
SELECT 'ALTER AVAILABILITY GROUP '
    + ag.name + ' MODIFY LISTENER '''
    + agl.dns_name + ''' (PORT = '
    + CAST(ISNULL(agl.port,1433) AS VARCHAR(5)) + ');'
    + CHAR(13) + CHAR(10)
    FROM sys.availability_group_listeners agl
         INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id

The above script should yield one or more TSQL statements that can be copied and pasted into a query window in SSMS later to re-configure the port for each listener.

image

image

2. On any one of the nodes in the cluster, open an administrative PowerShell window.

3. Get the cluster resource name for the availability group resource and the availability group listener resource using the following commands:

  • >Import-Module FailoverClusters
  • >Get-ClusterResource

This will produce an output similar to the following:

image

In this list of resources, we will concern ourselves with three different columns:

  • the Name (left most, with heading surrounded in light green box)
  • the OwnerGroup (third column from left, with heading in orange box)
  • the ResourceType (right most column with heading in yellow box)

The scripts below require the use of the resource name (left most column) for the resource on which we will make changes.

To get the correct resource, first find the name of your availability group in the third column (OwnerGroup) (light orange box in picture below). Once you have found the correct group, then find the resources that we need to change. We will look for two types: “SQL Server Availability Group” and “Network Name”. The resource types will be found in the right most column. In the picture below, the availability group resource is underlined in red, and the network name resource (for the listener) is underlined in yellow.

image

For the subsequent steps, use the following resource names:

  • TestAG_TestAGList_”, to substitute for <AG Listener Resource Name>
  • TestAG”, to substitute for <AG RESOURCE NAME>

4. Change the parameters with the following commands:

  • >Get-ClusterResource <AG Listener Resource Name> | Set-ClusterParameter -Name HostRecordTTL -Value 120
  • >Get-ClusterResource <AG Listener Resource Name> | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0

image

As can be seen in the example above, the resource name for the “Network Name” type is used, “TestAG_TestAGList”. After the command is executed, a yellow warning message is shown that indicates the parameter change will not take effect until the resource is taken offline and then brought back online. This can be done during a normal availability group cluster failover, or through PowerShell script (later in this document). Similarly, the second command above can be issued to change the parameter RegisterAllProvidersIP. It, too, will return a yellow warning message identical to the one shown – indicating the parameter change will not take effect until the resource is taken offline and brought back online.

5. Temporarily remove dependency between the availability group resource and the listener name resource.

Because the listener name resource has to be taken offline and back online for the above changes to take effect, and the fact that the availability group resource is dependent on the listener name, simply taking the listener name resource offline will also take the availability group (and its databases) offline in the process. To avoid taking the availability group resource offline, the dependency that the availability group has on the listener name can be temporarily removed and then re-applied. This can either be done by the Windows Failover Cluster Manager utility or through PowerShell commands.

To remove the dependency using Windows Failover Cluster Manager:

  • Select the availability group resource.
  • Right click and select properties.
  • On the Properties dialog, navigate to the dependencies tab
  • Select the resource and click the “Delete” button and then “OK” to close the dialog box.

image

To remove the dependency using PowerShell:

  • >Remove-ClusterResourceDependency -Resource <AG RESOURCE NAME> -Provider <AG Listener Resource Name>

6. Offline and re-online the listener resource to force re-registration with DNS and complete the changes:

  • >Stop-ClusterResource <AG Listener Resource Name>
  • >Start-ClusterResource <AG Listener Resource Name>

To force updating DNS on Windows Server 2012 or 2012 R2:

  • >Get-ClusterResource <AG Listener Resource Name> | Update-ClusterNetworkNameResource

To force updating DNS on Windows Server 2008 or 2008R2:

  • >Cluster.exe RES <AG Listener Resource Name> /registerdns

7. Re-add the dependency of the AG resource on the Listener name resource. (The dependency should exist for proper function of the availability group and access to the databases within the availability group. Failure to re-add the dependency could cause unintended behavior of the availability group and database availability.)

To re-add the dependency using Windows Server Failover Cluster Manager:

  • Select the availability group resource.
  • Right click and select properties.
  • On the Properties dialog, navigate to the dependencies tab
  • Click the “drop down underneath “Resource” and select the listener name resource.
  • Click the “apply” button, then “OK” to close the dialog box.

image

To re-add the dependency using PowerShell execute the following command:

  • >Add-ClusterResourceDependency -Resource <AG RESOURCE NAME> -Provider <AG Listener Resource Name>

8. Verification – view the dependency to make sure it is re-applied, ensure all cluster resources for this AG are online, and review the parameters to make sure they’re set to the new value:

  • >Get-ClusterResourceDependency <AG RESOURCE NAME>
  • >Get-ClusterResource <AG Listener Resource Name> | Get-ClusterParameter HostRecordTTL, RegisterAllProvidersIP
  • >Get-ClusterResource

9. Re-configure Listener PORT settings.

In step 1 above, a TSQL script was executed that generated additional TSQL commands that will restore the original PORT settings for the AG Listeners.   Copy and paste the results from the query in step 1 into a query window in SSMS – connected to the primary and execute the TSQL.   The TSQL generated from step one should look similar to:

image

Do not type the TSQL from the above image, use the TSQL that was generated in step 1 on your machine!

After pasting into a query window, the generated TSQL text should look something similar to the following with ALTER AVAILABILITY GROUP statements inside TRY/CATCH blocks.

image

Upon execution it should return successful completion:

image

Finally, the following query will return the list of listeners and their port settings – for all availability groups on the machine:

/* this script will obtain the ports defined
  * for each availability group listener that
  * exists.*/
SELECT ag.name AS [Availability Group],
    agl.dns_name AS [Listener DNS Name],
    agl.port AS [Port]
    FROM sys.availability_group_listeners agl
        INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id
    ORDER BY ag.name, agl.dns_name

image

HELPFUL REFERENCES

FINAL NOTES

For older operating systems such as Windows 7 and Windows Server 2008 R2, it is recommended the hotfixes referenced below be applied to ensure connection timeouts do not occur. Because of an issue with TDX/TDI filter drivers, timeouts can still occur when connecting to a name with multiple IP addresses – even when the correct client libraries are used and the MultisubnetFailover=True parameter is specified. These drivers are usually installed as part of older security systems such as anti-virus and intrusion detection. Ensuring these hotfixes have been applied will help prevent connectivity timeouts. Please note however, there is no hotfix for Windows Server 2008, only Windows Server 7 and Windows Server 2008 R2. If using Windows Server 2008 as a client, please refer to the articles below for more options in resolving timeout issues.

Two additional things to consider with respect to registering the listener name with DNS after making changes to either the HostRecordTTL or RegisterAllProvidersIP parameters --- DNS replication and previous settings.

The DNS server that is contacted by the OS cluster when registering or de-registering hostnames may not be the same DNS server that clients are using to resolve names to IP addresses. If this is the case, then it is possible to have additional delays in the client’s ability to get freshly updated information – simply because the client’s DNS server may not have the updated information. DNS replication topology and configuration settings can cause additional delays before all changes are replicated throughout an enterprise network. If significant delays are experienced either after a failover, or when changing these parameters, the network or DNS administrator should be contacted to investigate the DNS replication topology for the enterprise to determine if the time required to replicate across the entire organization can be reduced.

The other item to consider is that the “previous settings” are most likely already cached on client machines. If the parameter settings were “default” prior to making any changes, then any cached entries on client machines will still have the “old” expiration setting (TTL) – which is 20 minutes. That means, that even after changing the RegisterAllProvidersIP and HostRecordTTL settings – and taking the resource offline and back online to take effect – previously cached entries are not automatically expired. The client must wait the for the current TTL setting before it expires its cached copy. This means that it could still be up to 20 minutes before a client will get the new settings.

After the changes have been made and cached entries have been expired, the new settings will take effect and any subsequent TTL expirations will take place based upon the new setting (for example after 60 or 120 seconds) rather than the original default value of 20 minutes. This can be expedited on client machines if necessary by issuing an IPCONFIG /FLUSHDNS command from an elevated command prompt. This will cause the client to expire all cached entries and re-query the DNS server to obtain the new settings.

APPENDIX A

Section 5.7.1 Client-Connectivity For AlwaysOn Availability Groups from: SQL Server 2012 Release Notes.

The following table summarizes driver support for AlwaysOn Availability Groups:

Driver

Multi-Subnet Failover

Application Intent

Read-Only Routing

Multi-Subnet Failover: Faster Single Subnet Endpoint Failover

Multi-Subnet Failover: Named Instance Resolution For SQL Clustered Instances

SQL Native Client 11.0 ODBC

Yes

Yes

Yes

Yes

Yes

SQL Native Client 11.0 OLEDB

No

Yes

Yes

No

No

ADO.NET with .NET Framework 4.0 with connectivity patch*

Yes

Yes

Yes

Yes

Yes

ADO.NET with .NET Framework 3.5 SP1 with connectivity patch **

Yes

Yes

Yes

Yes

Yes

Microsoft JDBC driver 4.0 for SQL Server

Yes

Yes

Yes

Yes

Yes

* Download the connectivity patch for ADO .NET with .NET Framework 4.0: http://support.microsoft.com/kb/2600211.

** Download the connectivity patch for ADO.NET with .NET Framework 3.5 SP1: http://support.microsoft.com/kb/2654347.

MultiSubnetFailover Keyword and Associated Features

MultiSubnetFailover is a new connection string keyword used to enable faster failover with AlwaysOn availability groups and AlwaysOn Failover Cluster Instances in SQL Server 2012. The following three sub-features are enabled when MultiSubnetFailover=True is set in connection string:

  • Faster multi-subnet failover to a multi-subnet listener for an AlwaysOn Availability Group or Failover Cluster Instances.
    • Named instance resolution to a multi-subnet AlwaysOn Failover Cluster Instance.
  • Faster single subnet failover to a single subnet listener for an AlwaysOn Availability Group or Failover Cluster Instances.
    • This feature is used when connecting to a listener that has a single IP in a single subnet. This performs more aggressive TCP connection retries to speed up single subnet failovers.
  • Named instance resolution to a multi-subnet AlwaysOn Failover Cluster Instance.
    • This is to add named instance resolution support for an AlwaysOn Failover Cluster Instances with multiple subnet endpoints.

MultiSubnetFailover=True Not Supported by NET Framework 3.5 or OLEDB

Issue: If your Availability Group or Failover Cluster Instance has a listener name (known as the network name or Client Access Point in the WSFC Cluster Manager) depending on multiple IP addresses from different subnets, and you are using either ADO.NET with .NET Framework 3.5SP1 or SQL Native Client 11.0 OLEDB, potentially, 50% of your client-connection requests to the availability group listener will hit a connection timeout.


APPENDIX B - SCRIPTS

This is the script that should be run first to collect the port assignments for the listeners and generate TSQL code to be executed after the PowerShell script to re-configure the port settings to their original values.

/* this script will obtain the ports defined
  * for each availability group listener that
  * exists.  If no port is defined, it will
  * assign it will use port 1433.
  * The output will show the TSQL syntax
  * to alter the listeners to apply the
  * same port values later, should they
  * need to be re-configured to the same
  * ports.*/

DECLARE @CRLF CHAR(2) = CHAR(13) + CHAR(10)
DECLARE @EndTryCatch VARCHAR(max) = 'END TRY' + @CRLF +
    'BEGIN CATCH' + @CRLF +
    'IF (@@ERROR <> 19468)' + @CRLF +
    'SELECT ERROR_NUMBER() AS ErrNum, ERROR_MESSAGE() AS ErrMsg' +
    @CRLF + 'END CATCH' + @CRLF

SELECT '-- (1) Copy/Paste the results of this query '
    + ' into a query window.'
    AS [Generated TSQL script:]
UNION
SELECT '-- (2) After all PowerShell scripts/command'
    + ' have been executed,'
UNION
SELECT '-- (3) Execute the following TSQL commands'
     + ' to restore PORT settings.'
UNION
SELECT 'BEGIN TRY' + @CRLF + 'ALTER AVAILABILITY GROUP '
    + ag.name + ' MODIFY LISTENER '''
    + agl.dns_name + ''' (PORT = '
    + CAST(ISNULL(agl.port,1433) AS VARCHAR(5)) + ');'
    + @CRLF + @EndTryCatch + @CRLF
    FROM sys.availability_group_listeners agl
        INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id

PowerShell script to change the HostRecordTTL and RegisterAllProvidersIP settings.

There are five variables that need to be changed before executing the script. They are located toward the top of the script underneath “CHANGE THESE VARIABLES”. It is recommended you become familiar with the script and its options in a test environment before attempting in production. The script is written such that it will affect ALL availability group listener resources for the specified availability group, since if the parameters need to be changed for one listener, then most likely, if the availability group has more than one listener, they should all be changed.

        

#**************************************************************************
#This script is provided "AS IS" with no warranties, and confers no rights.
#   Use of included script samples are subject to the terms specified at
#   http://www.microsoft.com/info/cpyright.htm
#**************************************************************************

#**************************************************************************
#  VARIABLES
#$strAGName          the name of the availability group
#$TTLValue           the # of seconds for HostRecordTTL timeout value
#$AllIPs             [0 | 1] 0 = only register one IP, 1 = register all IPs
#$RestartListener    [0 | 1] 1 = restart listener / 0 = do not restart
#$RemoveDependencies [0 | 1] 1 = temporarily remove / 0 = leave alone
#**************************************************************************

#**************************************************************************
#
# CHANGE THESE VARIABLES
#Define Variables
$strAGName = "TestAG"         #<<<<<<<<<<<<<<<<<<<<<<<<<
$TTLValue = "120"             #<<<<<<<<<<<<<<<<<<<<<<<<<
$AllIPs = 0                   #<<<<<<<<<<<<<<<<<<<<<<<<<
$RestartListener = 1          #<<<<<<<<<<<<<<<<<<<<<<<<<
$RemoveDependencies = 1       #<<<<<<<<<<<<<<<<<<<<<<<<<
#
#**************************************************************************

#**************************************************************************
#Notes: 
#   1) Test this script in non-production environments first.
#   2) This script will change the parameters for _all_ listeners
#      for the specified availability group 
#   3) This script can optionally restart the listener(s)
#   4) if restaring listeners, it can optionally temporarily
#      remove and restore the dependencies to take the
#      listener(s) offline without taking the availability group
#      itself offline.  If choosing not to temporarily remove
#      and restore dependencies, then when the listener(s) are
#      taken offline, the availability group resource will also
#      go offline - thus making the databases in the AG inaccessible.
#   5) if choosing to remove dependencies, the existing depenedencies
#      are collected and restored after restaring the listener(s)
#   6) Windows Server 2012/2012R2 has a powershell command to
#      re-register listener(s) with DNS.  Server 2008/2008R2 does
#      does not.  there is logic to determine and use the CLUSTER.EXE
#      command for Windows Server 2008/2008R2
#**************************************************************************

#no changes required below this point

#Get OS version
$OSMajor = ([System.Environment]::OSVersion.Version).Major
$OSMinor = ([System.Environment]::OSVersion.Version).Minor

#load cluster module
Import-Module FailoverClusters

#get the cluster role (group) object based on the AG name provided above
$objAGGroup = Get-ClusterGroup $strAGName -ErrorAction SilentlyContinue

if ($objAGGroup -eq $null)
    {Write-Host "Error:  Availability Group not found."}
else
    {
    #get the AG resource object in this cluster role (group)
    $objAGRes = $objAGGroup | Get-ClusterResource |
        Where-Object {$_.ResourceType -match "SQL Server Availability Group*"}
    #get the listener(s) object(s) in this cluster role (group)
    $objListener = $objAGGroup | Get-ClusterResource |
          Where-Object {$_.ResourceType -match "Network Name*"}

    #change the parameter settings: HostRecordTTL & RegisterAllProvidersIP
    Write-Host "Making changes to Network Name:"  $list.Name
    $objListener | Set-ClusterParameter -Name HostRecordTTL -Value $TTLValue
    $objListener | Set-ClusterParameter -Name RegisterAllProvidersIP -Value @AllIPs
    $objListener | Get-ClusterParameter -Name HostRecordTTL
    $objListener | Get-ClusterParameter -Name RegisterAllProvidersIP

    if ($RestartListener -eq 1) {
        if($RemoveDependencies -eq 1) {
            #capture the dependency(ies) that the AG resource depends on
            $DepStr = ($objAGRes | Get-ClusterResourceDependency).DependencyExpression
            Write-Host "Removing dependecny for " $objAGRes.Name  " on '" $DepStr "'"
             Set-ClusterResourceDependency -Resource $objAGRes -Dependency $null
        } #if remove dependencies

        #restart the listener resource(es)
        Write-Host "Restarting Network Name resource:" $list.Name
        $objListener | Stop-ClusterResource
        $objListener | Start-ClusterResource

        #force re-registration in DNS
        if ($OSMajor -ge 6 -and $OSMinor -ge 2) {
            #Windows Server 2012 and up
            $objListener | Update-ClusterNetworkNameResource -Verbose
         }
        else {
            #for Windows Server 2008/2008R2
             ForEach($list in $objListener) {
                cluster.exe res $list.name  /registerdns
            }#foreach
         }
        if($RemoveDependencies -eq 1) {
            #restore the dependency(ies) to previous setting
            Write-Host "Reapplying dependencies for " $objAGRes.Name
            Set-ClusterResourceDependency -Resource $objAGRes -Dependency $DepStr
            #show dependency (so it can be compared) / show the settings
            $objAGRes | Get-ClusterResourceDependency
        } #if remove dependencies
        else {
        #if we chose not to remove dependencies we need to restart
        #the availability group resource
        $objAGRes | Start-ClusterResource
        }
    } #if restart
}#else - availability group found


Comments (24)

  1. Daniel Antenor says:

    That´s really a nice and very useful post. It helps a lot to troubleshoot common issues in AG Listener connectivity.

    Thanks!

  2. AS says:

    Fantastic article, thanks.

  3. Dave V says:

    thank you very much.  great article, exactly what I was looking for.

  4. M. Kokoy says:

    Thank you very much for this article.

  5. Someone says:

    Saved my job! Thank you.

  6. Someone says:

    Saved my job! thank you.

  7. Nicolas Soukoff (SQL PFE) says:

    Thank you so much for this post, very useful for our customers.

    you can also find an official KB from Microsoft website  that can help on that issue: support.microsoft.com/.../2792139

    Thanks again for the post.

    regards,

    Nick.

  8. Stefan Boychev says:

    Outstanding and truly comprehensive article. Thank you very much.

  9. Trent says:

    Hello,

    Great article! When a failover does occur to the other subnet, will the AG automatically register the new IP address in DNS or will that have to be done manually?

  10. @Trent - yes whenever the AG Listener comes online, the cluster service will call the APIs to re-register with DNS automatically.   As long the DNS server is configured to allow updates and the computer object for the OS cluster has permissions - then the re-registration will happen.

    1. Paul-Andre Panon says:

      My observation is that, for Always On Availability Groups in multi subnet failover configurations, IP addresses from nodes in both/all subnets are entered in a single A record for the listener in DNS. Perhaps you were describing behaviour in Failover cluster instances and that is handled differently? With AGs, the newer drivers try to connect to both IP addresses concurrently and only the primary node responds to the connection request. The DNS record therefore does not change when you fail over AGs, it only changes when you add or remove secondary replicas on additional cluster nodes to an availability group. The driver updates appear to add support for the new connection string keywords and for handling listener DNS records with multiple IP addresses.

  11. Marty Bostick says:

    Thanks so much for the article.  I was able to get a third node for DR working in our remote office using the details provided.

  12. Paul-Andre Panon says:

    Is there a current kb article for the connectivity patch for ADO.NET with .NET Framework 3.5 SP1 on Windows 7? Your Final Notes recommends applying hotfixes on Windows 7, but then the versions seem a little garbled in the rest of that paragraph. However KB 2654347 only provides links for Windows XP SP3, Windows 2008, and Windows 2008 R2.
    Why ever would Microsoft provide a .Net patch for Windows XP SP3, but not for Windows 7?

    We have a client app using .Net 2.0 APIs on Windows 7 and we would like to move its database to a SQL Server Always On Cluster.

    1. jatin pathak says:

      Paul, have you implemented this?

  13. Ravindranath Chowdary says:

    Hi,

    we have an application servers hosted in XXX.XX.YYY.00 subnet and Database servers which are hosted in Multi Subnet Cluster Environment(3 node Cluster- 2 nodes are in One Subnet(XXX.XX.ZZZ.00) and one node in another subnet(XXX.XX.PPP.00). MultisubnetFailover=True is not supported by application as it is .NetFramework3.5. so we have gone to another option RegisterAllProvidersIP. we have set the RegisterAllprovidersIP to '0'. but still we are facing the connection time Out issue. when we try to ping Sql Server network name it is pointing to Offline IP and some times it is pointing Online IP.

    Note: Application and database servers are in Different Subnets.

    databases are hosted in Multi subnet cluster which are not in app subnet

    1. Remember that after failovers, the new IP address is registered with the DNS server at the new subnet. Then DNS replication has to occur to the DNS server that the olds servers were hitting before they get the new IP address. Generally the servers in the same subnet as where it failed over to -- should see the new IP address after the TTL setting. The servers at the old site could take 15 minutes or even longer depending upon the DNS topology before they get notified of the new IP address. Also, .net 3.5 SP1 with a patch does support MultisubnetFailover=True;

  14. Ambarish Sawant says:

    Is it mandatory to temporarily remove dependency between the availability group resource and the listener name resource. Or can I do a cluster failover to achieve the same results?

    1. Yes you can do a cluster failover and achieve the same results.

  15. jatin pathak says:

    We are going to implement AlwaysON so want want to foresee if we can face this issue on not . we are on SQL Server 2014 SP2 with windows server 2012 R2. Do we need to implement this ? HostRecordTTL Value is currently 1200 and RegisterAllProvidersIP Value is 1 . We are on multisubnet cluster environment . Does it matter which version of .net or OLEDB or JDBC is running on client to connect to DB. Because although we are on higher version on DB side , the application is running on lower .net and JDBC version.

    1. If your client drivers are .net 4.6.2 (technically 4.6.1, but you really want 4.6.2), you do not necessarily need to implement MSF=True; The drivers now have "TransparentNetworkIPResolution" which eliminates the need for MSF=true by attempting 1 IP address but timing out after 500ms rather than the normal 21 seconds. If the 1 IP address times out, then it switches automatically to trying all IPs in parallel. This works fine except for environments where the 500ms would still potentially introduce too much application to database latency. Check out the information on TransparentNetworkIPResolution here: TransparentNetworkIPResolution in SQLClient for .NET 4.6.1
      https://blogs.msdn.microsoft.com/dbrowne/2016/04/04/transparentnetworkipresolution-in-sqlclient-for-net-4-6-1/

      My recommendation is to still use MultisubnetFailover=True if at all possible and let DNS have all the IPs registered. Then you never have to work about DNS changes upon failover, and you don't have the 500ms delay. Then the only area where you still have problems is if you're on older OS and anti-virus (and similar) uses TDX.SYS as described in one of the article references above in this article.

      1. Vikas says:

        My Listener is accessible only from the primary replica node itself. Do you have any idea how to resolve this?
        FYI, Always on has been setup in Amazon EC2 .Not working even i set up RegisterAllProvidersIP =0 and the changes you mentioned in the article.

  16. George Popescu says:

    This is so pointless and I can not believe MS is telling this to customers. I discovered that you are able to get this working properly without this non sense TTL and register just a single IP. Let me explain:
    1. First of all obvious I don't like this custom setups, long run my folks always need to be aware they can't create a regular AG, they need parameters....long term will cause issues with new folks etc.
    2. Everything works properly if your Application client is on the same subnet with one of the SQL Multi-subnets nodes, doesn't matter which one. So I am wondering how come MS is not capable of figure it out why as soon as I move my Application client to another subnet outside of the Nodes subnet it won't work anymore.
    3. Challenge with lowerTTL let`s say 30 seconds. Good luck buddy. Your application client in Subnet A let`s say talks to DNS server A. Now your SQL Node B in Subnet B fail-over, updates the DNS server B in site B, no way in hell your Client A can now take the new IP of that record even with a TTL of 30 seconds, WHYYY ? Cause DNS Server B needs to replicate though AD replication most likely 15 minutes or 5 minutes whatever, so my point is .. Lower TTL for fail-over is just a story...

    George Popescu ...

Skip to main content