A new platform layer in SQL Server 2005 to exploit new hardware capabilities and their trends
Hardware trends affect choices we make when designing and developing software. In the current market personal computers with high number of CPUs are no longer fiction - they are reality. Exploiting of such features as symmetric multi-threading, SMT, multiple CPUs per socket, weaker consistency models, hot memory and CPU add-ons and removals become more and more important to achieve required levels of system performance, scalability and administration. There are two major problems associated with designing software for modern hardware: lack of good tools and non-adequate developers experience. Database servers attempt to take advantage of new hardware features as they become available. A database server usually contains a platform layer that hides hardware specifics from most of the database server developers. SQL Server in its previous versions had a rather thin platform layer with limited responsibilities. In SQL 2005 we have created a new platform layer that contains features helping to exploit concurrency, partitioning and locality to their full degree. The layer is called SQLOS. SQLOS is a user mode operating system that leverages a hierarchical design similar to the hardware on which it runs. SQLOS’s major objects are nodes, schedulers and tasks. SQLOS is highly configurable and can adjust itself according to the hardware configuration on top of which it runs. It has a sophisticated API design that enables high level and low level developers to do their job best. It has built-in support for locality, concurrency, and administration. SQLOS has given a significant performance and scalability boost to SQL Server 2005. Moreover, it enabled DBAs to partition SQL Server according to hardware configuration or application needs. We have learned several valuable lessons while working on SQLOS and have identified possible future improvements.
Today’s computer architectures are increasingly diverse and difficult to exploit for maximum performance and scalability. What yesterday was research - today is widely used in the industry. Just looking back to several years ago, computers with multiple CPUs were rare. Nowadays multiple CPUs are no longer a privilege of large corporations. Features such as Symmetric Multi Threading (SMT) are in the mainstream, available in a large number of desktop computers shipping today to households. Moreover, computer manufacturers have been delivering multi-CPU configuration with multiple cores per socket for some time now. If this trend continues, in tomorrow’s market computers with high number of CPUs and large amount of memory will become the mainstream. Such computer architectures as Non Uniform Memory Access (cc-NUMA) will continue gaining their market share. Some processor manufacturers have built private memory bus for CPU, thereby enabling cc-NUMA on even the low-end architectures.
Trends in the enterprise hardware design have become even more dramatic. Enterprise computers today look very similar to yesterday’s supercomputers. Nowadays computers with 64 CPUs and 512GB of memory are common. As current trends continue, soon configuration with 256 physical/logical CPUs and with more than 1TB of memory will dominate enterprise installations. As a number of CPUs grows, hardware manufacturers prefer more and more cc-NUMA architectures to SMP. In today’s market cc-NUMA’s computers have become the standard for manufacturers developing platforms with a number of CPUs higher than 16. Memory sizes and a number of CPUs is not the only trend present on the enterprise market. Hardware manufactures attempt to minimize systems down time and put more flexibility in their systems by introducing support for hot memory and CPU add-ons and removals. Hot add-ons and removals enable system administrators automatically to reconfigure hardware without rebooting systems.
As the processing power and the amount of available memory have continued their growth at a speed of Moore’s law, memory access latency hasn’t kept up. Hardware manufacturers have been heavily investing resources in improving CPU-memory behavior. They have built not only multiple level caches but also added more support for instruction level parallelism, weak consistency models, and prefetching.
Understanding and taking advantage of these features has become very important to software engineers designing performance-oriented, scalable applications with flexible administration support.
Current hardware trends have significantly affected software designs. In order to get adequate performance and scalability, software developers have to explicitly take advantages of the hardware features.
Some of the burden, but not all, is put on the compilers. Compilers are becoming increasingly sophisticated. Today compilers take advantage of instruction level parallelism, exploit different consistency semantics, and even perform cache prefetching. However, with all advanced features, compilers still fall short of helping developers with design of concurrent, parallel, and multi-threaded applications. Applications not designed for parallel multi-CPU architectures can see their performance dropping on new hardware installations. Compilers only assist developers with exploiting some of the hardware features but definitely not all.
To take advantage of new hardware trends, software should be able to exploit parallelism of its domain by leveraging concurrent solutions. Historically the design and development of concurrent programs and taking advantage of platform features have been troublesome. Major reasons are: 1) concurrent programming is different from our natural way of thinking, and 2) available tools such as compilers, programming languages, and debuggers don’t provide adequate support. Tools usually fall into two extremes: some tools attempt to hide concurrency completely or don’t support it at all, while some other tools require complete knowledge about exploitation of concurrency – explicit concurrency. Such situation requires highly skilled developers to design and implement concurrent programs correctly.
When design concurrent solutions software developers need to be careful - concurrent applications can still perform as non-concurrent ones or even worse due to global shared state. Software has to be designed to exploit locality. Global state should be partitioned or completely eliminated. There are two different types of locality: locality of execution and locality of memory, data. Each type of locality may have multilevel semantics. For example, data can be present in CPU L2 cache and be allocated from cc-NUMA node that a CPU belongs to. As in exploiting concurrency, exploiting locality in many cases can be difficult due to non-adequate support from the platforms on which an application might run. Moreover locality can be significantly affected by incorrect design choices. For example an application with more number of instructions might have less CPI, Clock Per Instruction, and perform better than application with shorter number of instructions but with high CPI. High CPI could be caused by many reasons such as incorrect data layout, unnecessary virtual function calls and wrong design choices.
As well as hardware software has to support hot hardware add-ons and removal so that system administrators can reconfigure systems without reboot. It is not a simple task from software perspective since software has to build sufficient abstraction layer to hide the complexity of reconfiguration.
Database Servers Impact
Database servers, DBMS, are one of the first software types that are eager to exploit new hardware trends. DBMS not only have to take advantage of such trends but also in some cases they have to predict them. Database servers usually employ layered architecture. In such cases Hardware and OS abstraction layer is at the bottom, enabling applications to take advantage of the platforms.
Microsoft’s SQL Server is one of the examples of database server applications. In the previous version of SQL Server, the platform layer was rather thin. It had limited support for enabling concurrency but it did not have many features for exploiting locality. Even though some locality support did exist, it was scattered around the code and was hard to make use of. Moreover, the layer did not have any built-in prediction for new hardware trends especially for dynamic hardware changes.
We had realized that in order to move forward, the server needs to have a complete platform layer that can enable full support for the current and the future hardware features. Considering research trends, we had come to the conclusion that we needed to build a new platform - a user level operating system. The new operating system had to bring together the existing support for parallelism and the new support for locality and dynamic configuration. We named the new operating system layer SQLOS. The major goal of SQLOS was not to provide platform independence, but to empower existing and future hardware capabilities for SQL Server developers.
There were several major requirements for SQLOS. The SQLOS layer had to be highly configurable so that the SQL Server could run on low-end as well as on high-end hardware platforms. The SQLOS layer should hide complexity from the high level developers but give broad range of flexibility to the low level developers. In addition it should expand the operating system’s services over the new hardware even when real OS would have limited support for such services.
In the latest version of SQL Server we built SQLOS - a user level highly configurable operating system with powerful API, enabling automatic locality and advanced parallelism. SQLOS attempts to hide the complexity of underlying hardware from high level programmers and at the same time it provides powerful and comprehensive set of features to programmers willing to take advantage of the hardware underneath the system. SQLOS provides operating system services such as a non preemptive scheduling, memory management, deadlock detection, exception handling, hosting for external components such as Common Language Runtime, CLR and other services. Currently SQL Server and CLR developers are the major users of SQLOS, in the near future we expect to convert other server products to take advantage of SQLOS layer. In the following sections I will describe SQLOS internal architecture, touch briefly on design principles of SQLOS, uncover its API, talk about lessons learned and possible future work.
Hardware trends show that locality exploration is the key in enabling application’s scalability, especially for different types of hardware. Locality can be explored by different means. Common way to achieve it is to create hierarchical family of objects, in which each object provides locality-centric functionality and services that are meaningful for its level.
One of the major requirements of SQLOS design was the enabling of application’s scalability. SQLOS had to enable scalability both on the high-end and the low-end hardware. Since locality exploration is one of the ways to satisfy scalability requirements, and locality can be explored through hierarchy, SQLOS leverages hierarchical architecture. The major objects in SQLOS design are nodes, schedulers, and tasks. Each object at its level exposes functionality maximizing local state and minimizing global state. SQLOS attempts to minimize global state as much as possible. Figure 1 depicts SQLOS configurations on cc-NUMA hardware with two nodes and two CPUs per node.
Memory and CPU Nodes
The first objects in SQLOS’s hierarchy are memory nodes. One can think of a memory node as an abstraction primitive over memory attached to a CPU or a set of CPUs. Different hardware configurations have different relationships between CPU and memory. For example, the SMP architecture shares memory across all CPUs, while the cc-NUMA architecture shares memory across a set of CPUs. Memory sharing can be described as memory or residence affinity and represents memory locality. Memory node’s goal is to provide locality of memory management on cc-NUMA or cc-NUMA-like hardware.
A CPU node is a next object in SQLOS’s hierarchy. A CPU node provides a logical grouping of CPUs. On the SMP architecture there is a single CPU node. The cc-NUMA architecture has as many CPU nodes as the hardware platform has, by default. CPU nodes provide locality of reference as well as scheduling affinity. They enable developers to dispatch related tasks on CPU close to each other.
The relationship between a memory node and a CPU node is important. A CPU node is a proper subset of a memory node. As depicted in Figure 1, A CPU node can be associated only with one memory node but a memory node can be associated with multiple CPU nodes. This rule addresses two issues. First, it simplifies software model and enables a well-defined relationship between nodes. Second, it enables SQLOS to model different node configurations on cc-NUMA as well as on the high-end SMP hardware platforms.
In addition to strict hardware configuration, SQLOS can be configured to use logical CPU nodes. Due to historical reasons, a configuration of the system that doesn't fully map to the real hardware configuration is called “Soft NUMA”. Soft NUMA concept is extremely powerful. It enables applications servers built on top of SQLOS to exploit different types of locality provided by hardware as well as by the nature of the application domain. For example, the SMP system with two dual core CPUs could be configured to have a single memory node containing two CPU nodes, in which case each node would manage two cores. The configuration is shown in Figure 2. The benefits of Soft NUMA also include a way of testing SQLOS and SQL Server’s cc-NUMA support without actually requiring costly hardware.
A CPU node also provides functionality such as load balancing across its schedulers, I/O affinity, and local storage. Figure 3 depicts high level CPU Node infrastructure.
Each CPU node contains a set of schedulers. A scheduler is bound to the node’s CPU. A CPU node makes decisions where to dispatch new task depending on the schedulers load. As shown in Figure 3, a CPU node, in addition to schedulers, has an I/O completion port associated with it. An I/O completion port can be bound to multiple I/O devices such as network cards and disks. Having I/O port per node provides I/O locality so that all I/O requests can be scheduled and completed directly on a node’s CPU by which they were initiated.
CPU node’s Local storage can be used to store and associate data local to the node. Having local storage at the node level enables developers to have local state per node as well as to partition global state across nodes.
Schedulers & Tasks
The main goal of a scheduler is to schedule tasks on a given CPU. A scheduler runs one task at the time – only one task is actively running on a CPU that scheduler is affinitized to. By limiting one active task per scheduler SQLOS minimizes CPU demand on the system, which leads to smaller amount of context switches, and, therefore, improves throughput and scalability of the server application. There is some burden put on developers because SQLOS is non-preemptive user level operating system. Non-preemptiveness causes developers to write code with sprinkled yields. A task that doesn’t properly yield could affect performance dramatically. As well as CPU nodes, schedulers could be used to store local state per CPU as well as to partition global state across CPUs. In addition to task dispatching schedulers provide support for non preemptive I/Os. Asynchronous I/Os that are completed during scheduler context switch are called non-preemptive I/Os. Their counterpart asynchronous I/Os processed by CPU Node’s I/O completion port called preemptive I/Os. The advantage of non-preemptive I/O is that their completion doesn’t introduce extra context switches. Their disadvantage is the in some case I/Os might take longer than otherwise required. SQL Server uses preemptive I/Os to complete all network I/Os and non-preemptive I/Os to complete disk I/Os.
SQLOS’s Task object completes main SQLOS hierarchy. A task is a unit of execution. Depending on the mode task either maps to a thread or a fiber. A task is dispatched by a scheduler on a CPU. As well as nodes and schedulers tasks have local storage to enable even finer data partitioning and provide finer locality. When dispatched a task runs as long as its quantum allows or until it gets suspended on synchronization object. It checks its quantum expiration by periodically calling SQLOS yield methods sprinkled in SQLOS code as well as in SQL Server. In SQLOS tasks could asynchronously aborted. When aborted every wait including yield returns abort return code until abort is handled and reset.
SQLOS’s Concurrency Support
SQLOS provides versatile support for concurrency. It hides concurrency details from high level developers and gives fine control to low level ones. When using SQLOS, high level developers that implement single-threaded code should not be concern with execution locality – locality is exploited automatically for them. For example 1) a single task can be enqueued to SQLOS 2) a single task can be enqueued to a given CPU node or 3) multiple tasks could be enqueued to multiple schedulers or nodes in one call. The first method is used by developers when they rely on the SQLOS load balancing mechanism to start task on the correct node. The second method provides a developer with ability to start the task on a specific node. This method should be used when task requires specific locality. The developers should be careful when using this method since incorrect decision might put SQLOS out of balance. The last method is useful and powerful. It is used when multiple tasks need to be running at the same time in parallel but with different locality requirements. For example SQL Server’s Query Execution layer utilizes this method to create tasks on different schedulers when running parallel queries.
For high level developers SQLOS provides a set of powerful constructs enabling them to find solutions to complex concurrent problems and at the same time avoiding the complexity. These constructs include caches, resource semaphores, highly concurrent library of common data structures such as lists, hash tables, enumerators and etc.
For low level developers preferring to build their own concurrent solutions and requiring more flexibility SQLOS supports a set of low level synchronization objects such as spinlocks, events, mutexes, reader-writer locks, partitioned mutexes, and others.
SQLOS’s Dynamic Configuration Support
SQLOS provides a support for dynamic configuration – its configuration can be automatically adjusted without restart. SQLOS supports both online and offline states for nodes and schedulers. An offline scheduler doesn’t get any tasks enqueued. An offline CPU node has all of its schedulers offline. Offline CPU node doesn’t process any work. A memory node with all offline CPU nodes is not in use by SQLOS so memory node, if present in hardware, could be used by other applications. If memory node is not present it could be added latter so that SQLOS could leverage it. During boot SQLOS allocates required resources to support possible physical configuration provided by hardware and configures itself based on logical configuration supplied by an administrator. Once SQLOS is started an administrator can move nodes and schedulers offline and online depending on the desired configuration.
SQLOS API’s design was the key to the SQL Server’s acceptance of SQLOS. Requirements for API design were conflicting: we had to hide locality awareness as much as possible from high level developers and at the same time enable powerful developers to exploit capabilities of the underneath hardware up to its limits. The solution was to provide two sets of APIs: one with implicit locality built in, and another one that would require explicit locality knowledge.
The first set of API provides implicit support to a set of SQLOS’s local objects during current execution context – task, scheduler, node and memory node. This API set enables developers to do their everyday work without having to think about the underneath hardware platform. It also helps them to avoid mistakes often made by choosing the wrong context and thereby hurting scalability and performance. For example, SQLOS has a heap called a “memory object” per task. This heap can be only used by its task and not by anything else. There are several ways to get to the task memory object. Here is one example on how to achieve it:
SOS_Task* pTask = SOS_OS::GetCurrentTask ();
SOS_MemoryObject* memObj = pTask->GetMemoryObject ().
This approach is rather dangerous – it allows developers to retrieve a task object and gives the impression and ability that it is OK to pass both the task and the memory objects around. SQLOS takes different approach in such case. It hides explicit access to the task object:
SOS_MemoryObject* pMemoObj = SOS_Task::GetMemoryObject ();
This approach is still not perfect however, because developers are free to pass the memory object around. To address these issues SQLOS exposes the macro to allocate memory from the task object:
MyObject* pMyObject = NewTask () MyObject;
Both schedulers as well as CPU nodes have memory objects. SQLOS utilizes the same mechanism to provide local allocation and to avoid object traveling. Most of the developers leverage the macro to allocate memory from the task objects. However, exposing API is still useful in cases when a developer needs to perform a number of allocations.
The second API set is used by powerful users that would like to take full advantage of the hardware, when for example the query execution layer makes decision to which CPU node to enqueue tasks, based on the current node’s load and memory availability. As a part of powerful API SQLOS provides partitioned memory object. Partitioned memory object enables developers to allocate a piece of memory on either every CPU node or scheduler. Such pointer could be used to implement hot data structures that could be strongly affected by false sharing. For example:
MyObj* pPointerOnCurrentNode = partitionedObject->Alloc (sizeof (MyObj));
// In order to use the pointer efficiently on the other node we
// retrieve the partition
MyObj pPointerOnOtherNode = pObjOnCurrentNode ->GetSibling (NodeId);
Partitioned memory object concept is powerful but on the other hand if misused it can cause significant performance degradations – it is very expensive to allocate objects out of it. In general we tried to keep the explicit portion of the API small and efficient. It only gets extended when there is a real demand.
One of the major SQLOS API’s goals was to eliminate extra memory references that are sometimes referred to as “pointer chasing”. “Pointer chasing” can significantly hurt performance especially when objects allocated on different cache lines or even worse on different OS pages. Having objects on different pages might cause Transition Lookaside Buffer, TLB, misses which are really bad for performance. From the beginning of the project we recognized that SQLOS’s objects are very hot. We have achieved required performance characteristics by keeping object model simple, allocation of objects close to each other and by eliminating virtual functions altogether. These choices have significantly improved SQLOS CPU cache behavior and hence overall performance.
SQL Server & SQLOS
SQLOS enables SQL Server’s scalability through its infrastructure and API design. SQL Server developers have highly configurable platform that enables them to exploit underline hardware and write highly performant and scalable database server. SQL Server developers have taken full advantage of SQLOS’ exposed functionality. Moreover starting with SQL Server 2005, a DBA has a choice of configuring and dynamically reconfiguring SQL Server according to either the hardware or the application requirements. For example, if a DBA has two applications sharing a SQL Server, SQL Server can be configured to run in two-node configuration so that every application will use its own node. This will make both work loads completely separate. Without having SQLOS it would be very hard to achieve.
While working on SQLOS we have learned many interesting lessons. SQLOS hierarchical multi-level design proved to be very valuable. SQLOS has given significant performance and scalability improvements to SQL Server both on the cc-NUMA and the Soft NUMA architectures. Soft NUMA support showed to be very useful when testing features for cc-NUMA without actually having expensive hardware. SQLOS’s two-headed API design proved to be easy to use and hard to get it wrong by both high level as well as low level developers. Removing global shared state turned out to be not an easy task which can sometimes require a complete reengineering of applications. Incomplete partitioning might cause troubles. It turned out that in many cases incomplete partitioning does not give any performance benefit – single global shared state can hurt performance and scalability just as much as multiple global states can do. Design of SQLOS showed that following hardware trends and research is an important part in predicting and designing software for the future. Originally SQLOS was designed to support cc-NUMA. In addition to cc-NUMA and Soft NUMA, we put support for different types of topologies and support for dynamic reconfiguring. These decisions enabled immediately SQLOS to take advantage of the multi core systems with add-ons of hardware blocks appearing on today’s market
As SQLOS moves forward it will continue its evolvement around exploiting locality and concurrency. There is already work started on building partitioning data structures, such as stacks, linked lists, etc. The next version of SQLOS will provide support for partitioned caches and pools. It will help developers to write concurrent applications more easily by enabling them to exploit explicit and implicit parallelism of their problem space. For example, Single Instruction Multiple Data (SIMD) types of load could be processed implicitly without developers’ involvement in any type of synchronization. In the future we expect SQLOS to be heavily used by server type applications other than SQL Server.
1. D. Patterson, “Computer Organization and Design Second Edition: The Hardware/Software Interface.” Morgan Kaufmann; 2nd edition (August 1, 1997) ISBN: 1558604286.
2. A. Tanenbaum, “Modern Operating Systems (2nd Edition).” Prentice Hall; 2nd edition (February 28, 2001) ISBN: 0130313580.
3. J. Gray, and A, Reuter, “Transaction Processing: Concepts and Techniques (Morgan Kaufmann Series in Data Management Systems.” Morgan Kaufmann; 1st edition (1993) ISBN: 1558601902.
4. M. Russinovich, and D. Solomon, “Microsoft Windows Internals, Fourth Edition: Microsoft Windows Server(TM) 2003, Windows XP, and Windows 2000 (Pro-Developer).” Microsoft Press; 4th edition (December 8, 2004) ISBN: 0735619174.
5. J. Richter, and J. Clark, “Programming Server-Side Applications for Microsoft Windows 2000.” Microsoft Press; (March 1, 2000) ISBN: 0735607532.
6. M. Stonebraker, “Operating System support for Database Management”, in Comm. of ACM, Vol. 24, No. 7, 1981