As part of the SQL Server 2012 release several changes were made to Analysis Services that should alleviate some issues previously seen on large NUMA node machines.
Separation of IO and Processing Jobs
To better understand how the changes that were made work, it is helpful to have some background information about what types of issues were seen in earlier releases.
During testing, it was found that a 2 NUMA node, 24 core server handled roughly 100 queries per second during stress testing. The same tests were then run on a 4 NUMA node server with 32 cores, and the queries answered per second actually decreased. Investigation identified the cause as cross NUMA node access to the system file cache. For example, if the first job that read a file was scheduled on a CPU assigned to NUMA node 1, and a subsequent job that needed to read the same pages from the file was scheduled on a CPU assigned to NUMA node 3, a significant performance penalty was incurred reading the cached file page. Deeper investigation determined that the bottleneck was in accessing the standby list of file cache pages. The performance impact became more pronounced as the number of CPUs and NUMA nodes on the machine increased. For machines with fewer than 4 nodes the impact was not found to be significant; however, on machines with 4 or more nodes the impact was such that performance could begin to regress.
One possible workaround is to open files in Random mode, since this changes the file cache access behavior. To accommodate this, SSAS 2012 now allows random file access to be configured through the msmdsrv.ini file by setting the RandomFileAccessMode property to a value of 1. This server property does not require a service restart to take effect, but a restart is recommended: if the server is not restarted, Analysis Services will not release open files or change the way it accesses them, and the setting will only apply to newly opened or created files.
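As a sketch of what this looks like in the configuration file (the exact placement shown is an assumption; RandomFileAccessMode is presented here as a root-level server property alongside the other server-wide settings):

```xml
<ConfigurationSettings>
  <!-- other server properties omitted -->
  <!-- 1 = open files in Random mode; 0 = default file access behavior -->
  <RandomFileAccessMode>1</RandomFileAccessMode>
</ConfigurationSettings>
```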
While changing the file access mode may provide some relief, eliminating cross NUMA node file access whenever possible is a better long term solution. Toward this goal, SQL Server 2012 Analysis Services now has separate Process and IOProcess thread pools that are NUMA node aware. The new IOProcess thread pool handles read jobs, while the Process thread pool continues to handle ROLAP and processing related jobs, including writing new files. On machines with 4 or more NUMA nodes, the IOProcess thread pool is not a single thread pool but a collection of thread pools, with each NUMA node having its own pool of IOProcess threads. Assigning each NUMA node its own IOProcess pool doesn't by itself help with cross NUMA node file access unless file IO operations are consistently assigned to the same IOProcess thread pool, so an algorithm for assigning partition file read operations to specific IO thread pools was also added. At RTM, the algorithm spreads partition reads across IO thread pools based on the ordinal position of the partition in the partitions collection for a measure group. The algorithm is subject to change without notice in future builds, so design decisions should not be made based on this behavior.
An attentive reader will notice that the discussion has highlighted partition reads up to this point; what about dimension operations? Dimension read jobs are always assigned to the IOProcess thread pool for NUMA node 0. While one could argue that this scheme could result in NUMA node 0 being assigned a larger percentage of the work, it is expected that most dimension operations will operate on cached data in memory and won't have a noticeable impact.
IOProcess threads are expected to perform short duration read operations on files and as such do not register themselves as cancellable objects. This means that even if a query that requested an I/O operation is cancelled, the in-process I/O jobs could continue to run for a short period of time after the cancellation; however, new I/O jobs will not be created for the cancelled query.
As with other SSAS features, the default behavior of the per NUMA node IOProcess thread pools is intended to cover the most common scenarios. However, in some situations it may make sense to override the 4 NUMA node threshold for the NUMA node affinitized IOProcess thread pools. To allow administrators to revert to a single IOProcess thread pool, or to force per NUMA node IOProcess thread pools on machines with fewer than 4 NUMA nodes, a new entry has been added to the Analysis Services configuration file (msmdsrv.ini). The PerNumaNode setting under ThreadPool\IOProcess has a default value of -1, which tells the server to use the automatic 4 NUMA node threshold. Changing this value to 0 disables the per NUMA node thread pool behavior, while setting it to 1 turns the behavior on even if there are fewer than 4 nodes.
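Based on the description above, forcing per NUMA node IOProcess thread pools would look something like the following fragment (a sketch; the element nesting follows the ThreadPool\IOProcess path named above, with unrelated settings omitted):

```xml
<ConfigurationSettings>
  <ThreadPool>
    <IOProcess>
      <!-- -1 = automatic (per NUMA node pools only with 4 or more nodes),
            0 = single IOProcess thread pool,
            1 = per NUMA node pools regardless of node count -->
      <PerNumaNode>1</PerNumaNode>
    </IOProcess>
  </ThreadPool>
</ConfigurationSettings>
```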
The splitting of Process and I/O jobs into separate thread pools and the assignment of I/O jobs to consistent NUMA nodes should alleviate some of the performance impacts of cross NUMA node operations, significantly increasing the performance of SSAS workloads on higher end servers.
Greater than 64 CPU support and Thread Affinitization
In addition to giving IO operations their own thread pools, Analysis Services now supports more than 64 CPUs, something the relational engine has had for a while.
To support more than 64 CPUs, Windows uses the concept of processor groups. A processor group in Windows can contain a maximum of 64 CPUs, and systems with more than 64 CPUs will contain multiple processor groups. For more details on processor groups and support for more than 64 CPUs, read the following:
To support multiple processor groups, and thus more than 64 CPUs, SSAS 2012 was updated to set the process affinity mask for the msmdsrv.exe process to span multiple processor groups. Along with this capability, a new configuration property named GroupAffinity was added for each thread pool in the server. This property gives an SSAS administrator fine-grained control over which CPUs on a machine are used for each thread pool. The GroupAffinity setting is a bitmask that determines which CPUs in a processor group can be used by the thread pool in which the GroupAffinity mask is defined. For example, if the following entry:
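One such entry, reconstructed here as an example (the hex values are an assumption consistent with the description that follows, with 0xFF selecting the first 8 CPUs, bits 0-7, in each of the first two processor groups):

```xml
<GroupAffinity>0xFF,0xFF</GroupAffinity>
```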
were to appear under <ThreadPool> <Process> in the msmdsrv.ini file, it would affinitize threads to 16 logical processors in the first two processor groups on the server. Whereas the following entry:
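Such an entry could be written as follows (again a reconstructed example; 0xF0 sets bits 4-7 of the first processor group's mask, and 0xFFFFFFFF sets bits 0-31 of the second group's mask):

```xml
<GroupAffinity>0xF0,0xFFFFFFFF</GroupAffinity>
```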
would affinitize threads to CPUs 4-7 in the first processor group and the first 32 CPUs in the second processor group. The GroupAffinity property can have as many comma separated hex values as there are processor groups defined on a server. If the mask contains fewer bits than the number of CPUs in the processor group, the unspecified bits are assumed to be zero. If no GroupAffinity value is specified for a thread pool (the default), that thread pool is allowed to spread work across all processor groups and CPUs on the server.
For diagnostic purposes the msmdsrv.log file now contains entries at service start that reflect the size of each of the five thread pools (Query, ParsingShort, ParsingLong, Processing, and IOProcessing) and their settings, including affinity.
(12/21/2011 4:16:04 PM) Message: The Query thread pool now has 1 minimum threads, 10 maximum threads, and a concurrency of 4. Its thread pool affinity mask is 0x0000000000000003. (Source: \\?\C:\Program Files\Microsoft SQL Server\MSAS11.DENALIRC0\OLAP\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000A)
Note: Although VertiPaq can use more than 64 CPUs, group affinitization is not currently supported for the VertiPaq thread pool, even though a GroupAffinity entry for it exists in the msmdsrv.ini file.
While the GroupAffinity setting was added as part of the work to support more than 64 CPUs, the property can also be used on servers with fewer than 64 CPUs to control which CPUs are used for specific operations. Through the GroupAffinity mask, administrators can push I/O, processing, query, or parsing threads to specific CPUs to obtain optimal resource usage, or to better enable resource sharing across multiple processes on the same server.
Wayne Robertson, Sr. Escalation Engineer Analysis Services