Partitioning Large NUMA-Based Servers into Small VMs – Question from Jason
The idea of taking a large 4 Processor or 8 Processor server and dividing it into smaller VMs is not recommended, from both a cost and a performance point of view. This comment is specific to SAP application servers, due to the rather unique architecture SAP developed in the early 1990s and still uses today. Virtualization solutions still face considerable challenges around NUMA handling, and these challenges become critical on larger servers with more physical processors.
The architecture of Hyper-V and VMware differs a little here. When a Hyper-V VM starts it is “assigned” to a NUMA node. This blog details how to monitor and change this. You can often spot this in Task Manager: when one VM is under high load, Task Manager will show an unbalanced Logical Processor distribution (some Task Manager bars higher than others).
More information on Hyper-V can be found here: I highly recommend watching the video from Microsoft Technical Fellow Mark Russinovich (who started sysinternals.com). The topic of NUMA comes up after about the first 5 minutes – keep going until you see the diagrams on the whiteboard; it is worth watching. More from Mark here and his blog here – I recommend the “Pushing the Limits” series, especially the memory one.
VMware has more controls for allocating VMs onto specific processors in an effort to reduce remote NUMA memory accesses. As the documentation from VMware states, this is only possible when the VM's vCPU count and memory fit within a single NUMA node. This article from VMware is interesting, but lacking in detail. It would be very interesting to see the metrics Local Memory KB/sec, Remote Memory KB/sec, sec/Remote Memory Access and sec/Local Memory Access in addition to Local+Remote Memory Accesses/sec. This information would expose some of the scalability issues. Once the HyperTransport links are saturated, performance is severely impacted. This becomes particularly significant when a VM's memory size is larger than the physical memory attached to one NUMA node.
For example: take a VM with 32GB RAM running an SAP ECC 6.0 application server, where many poorly written ABAP programs read entire database tables into SAP internal tables. These programs force multiple Dialog work processes to run in HEAP (they display PRIV in SM66 or SM50) and consume 8-10GB RAM each. The server in this case is a 4 Processor AMD Opteron with 64GB RAM, so each of the four Opteron processors has 16GB of local RAM. In this scenario 50% of the VM's memory accesses are remote, because the VM uses a full 32GB (16GB local and 16GB reached via a memory link to another CPU).
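The arithmetic in this example can be sketched in a few lines of Python. The latency figures below are purely hypothetical round numbers used for illustration; only the memory sizes come from the scenario above.

```python
# Illustrative sketch of the NUMA example above. The latency figures are
# hypothetical round numbers, not measured values; only the memory sizes
# come from the scenario in the text.

def remote_fraction(vm_ram_gb, local_node_gb):
    """Fraction of the VM's memory that must live on a remote NUMA node."""
    remote_gb = max(vm_ram_gb - local_node_gb, 0)
    return remote_gb / vm_ram_gb

def avg_access_cost(f_remote, local_ns=100, remote_ns=180):
    """Weighted average memory access latency (hypothetical latencies)."""
    return (1 - f_remote) * local_ns + f_remote * remote_ns

f = remote_fraction(32, 16)   # the 32GB VM on a 16GB-per-node Opteron box
print(f)                      # 0.5 -> half of all accesses are remote
print(avg_access_cost(f))     # 140.0 -> average latency under the assumed numbers
```

The point of the sketch is simply that once a VM outgrows one node, every additional gigabyte is remote, so the average access cost climbs toward the remote latency.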
Testing this scenario will show a dramatic difference from physical hardware. We have already come across customers with these kinds of issues in the field. The unfortunate reality is that customer systems are often not representative of the scenarios developers test in a lab; many customer systems become extremely busy during month end or quarter end closing. Check page 21 of this document on vSphere 4.0 and note the comment about VMs with more vCPUs than the cores on a single NUMA node. Page 147 onwards has some interesting information, but nothing specific on performance in real world scenarios.
Customers are advised to avoid using large scale up servers as SAP application servers. Our internal testing and real customer deployments have shown that scale out 2 Processor commodity Intel/AMD servers deliver not only the best performance but do so at a far lower cost than larger 4 & 8 Processor systems. Certain server models – one 8 CPU AMD based server in particular – have shown severe performance issues when used as SAP application servers.
SAP SD Benchmark rules prohibit any correlation between cost and an SAP benchmark. Without reference to SAP benchmarks in particular, other industry benchmarks show that the price/performance ratios of 2 Processor servers are far better than those of 4 or 8 Processor servers. The majority of 2 Processor servers from HP, Dell, IBM, NEC, Cisco, Fujitsu etc. with 96GB RAM sell for less than $8,000 USD, while most 4 Processor servers with 256GB RAM sell for around $35,000 USD – roughly twice the performance (as measured on most industry benchmarks) at around 4.5 times the cost. It should also be noted that the SAP application server is not a Single Point of Failure. Commodity 2 Processor rack or blade servers are the perfect solution for the SAP application server. Another critical issue, particularly for SAP batch jobs, is that SAP work processes are single threaded on all platforms (Windows/UNIX/Linux). Therefore it is essential to have the highest SAPS per Logical Processor (thread) for best performance.
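The price/performance claim above is easy to check with simple arithmetic, using the ballpark prices quoted in the text and the roughly 2x performance factor:

```python
# Rough price/performance comparison from the figures quoted above.
# Prices and the ~2x performance factor are the article's ballpark numbers,
# not benchmark results.

two_proc_price, four_proc_price = 8_000, 35_000   # USD, as quoted
perf_ratio = 2.0    # 4P server ~= 2x the throughput of a 2P server

cost_ratio = four_proc_price / two_proc_price
print(round(cost_ratio, 2))                 # ~4.38x the cost
print(round(cost_ratio / perf_ratio, 2))    # ~2.19x worse price/performance
```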
Many customers are using SAP Multi-SID Active/Active 2, 3 or 4 node clusters to consolidate multiple SAP systems onto a single larger cluster. Technically the SQL Server instance can be clustered on up to 16 nodes. The SAP ASCS can only be clustered on two nodes, because the Enqueue Replication Server can only replicate the lock table to one target. SQL Server scales very well on 2, 4 and 8 Processor ISA systems. In contrast to our recommendation for the application server, we do recommend scaling up the database layer. Care must be taken to ensure sufficient memory and HBA cards are specified. Currently we recommend at least 2 x Dual Port HBA cards (a total of 4 ports), with appropriately tuned queue depth/execution throttle settings, even for small systems.
After Installing New 2 Processor SAP Application Servers CPU Utilization is Never Higher than 10% Even When the System is Busy. Is Something Wrong?
Many customers have observed that CPU utilization on modern Intel Nehalem or AMD based application servers is seldom higher than 10%. A modern ISA 2 Processor server is very powerful, rated at around 28,000 SAPS regardless of the hardware vendor. To improve CPU utilization, customers are implementing the following:
- Ensure an absolute minimum of 6GB RAM/CPU core is installed
- Install between 4 and 8 SAP instances, each with a separate SAP System number such as 00, 01, 02, 03
- Configure the PHYS_MEMSIZE profile parameter to 1/n of the physical RAM in the server, where n is the number of SAP instances installed
- Use Windows Zero Administration Memory as per OSS Note 88416 – if required, reduce em/max_size_MB from the default of 100GB to a smaller value (note: the effective maximum is limited to 64GB)
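The PHYS_MEMSIZE rule in the list above amounts to an even split of physical RAM across the instances on a host; a minimal sketch:

```python
# Sketch of the PHYS_MEMSIZE rule above: give each of the n SAP instances
# on a host roughly 1/n of the server's physical RAM.

def phys_memsize_gb(total_ram_gb, n_instances):
    """PHYS_MEMSIZE per instance when RAM is split evenly across instances."""
    return total_ram_gb // n_instances

# A 128GB server with 4 instances gets 32GB each; with 6 instances, ~21GB
# each - consistent with the 20-32GB range in the deployment example.
print(phys_memsize_gb(128, 4))   # 32
print(phys_memsize_gb(128, 6))   # 21
```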
A typical customer deployment we see (3 to 12 physical servers configured as below):
- 2 Processor Intel Nehalem EP/Westmere 56xx (12 cores, 24 Logical Processors) or AMD 24 core
- 128GB RAM
- 4-6 SAP Instances installed each with PHYS_MEMSIZE set to 20-32GB RAM
- SAP System Numbers 00,01,02,03….
- Around 50* Work Processes per SAP Instance – usually 35 DIA, 7 BGD, 4 UPD, 4 SPO & 2 UP2
- Each Instance: Program Buffer = 1000MB, Generic Key = 450MB and Single Record = 350MB
*We do not recommend making instances larger than 50 work processes, though it is technically possible to do so. OSS Note 9942 discusses the disadvantages of large instances. Two 50 work process instances will perform better than one large 100 work process instance. We recommend testing this to confirm.
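As a quick sanity check on the per-instance buffer sizes listed above, the three buffers add up to well under 2GB per instance, a small slice of each instance's 20-32GB PHYS_MEMSIZE:

```python
# Sanity check of the per-instance buffer sizes from the deployment example:
# Program Buffer 1000MB, Generic Key 450MB, Single Record 350MB.

buffers_mb = {"program": 1000, "generic_key": 450, "single_record": 350}
per_instance_mb = sum(buffers_mb.values())
print(per_instance_mb)   # 1800 MB per instance

for n in (4, 6):
    # total buffer memory across all instances on one host, in GB
    print(n, round(n * per_instance_mb / 1024, 1))
```

Across the 4-6 instances of the example this is roughly 7-10.5GB of buffers on a 128GB host, leaving the bulk of each instance's memory for extended memory and work process heaps.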
Often customers use a single “pool” of SAP application servers for ECC, BW, CRM, XI, SCM etc. This is fully supported and works well. Each SAP component has its own configuration and executables under Drive:\usr\sap\<SID>\Dxx (where xx is the system number). The instances can be started, stopped and even upgraded independently of one another.