Customer Proof of Concept on New HP DL980

Recently we conducted a Performance Proof of Concept for a large customer using the new HP DL980 G7 server with 8 Intel Nehalem-EX E7540 8-core processors. This blog discusses some of the configuration and tuning work carried out during the PoC. One HP DL980 with 512GB of RAM was used for SQL Server, and 9 two-processor Intel Nehalem-EP 5670 servers were used as application servers.

HP and other hardware manufacturers have recently released 8-way Intel based servers. The Intel QuickPath Interconnect (QPI) architecture requires additional OEM node controllers to support more than 4 processors. The configuration and installation of Intel based servers with more than 4 processors therefore requires careful planning and execution. This blog discusses some of the steps required.

1. Background on HP DL980

The DL980 G7 is the first HP ProLiant scale-up server with 8 processor sockets, using the new HP PREMA architecture, which incorporates node controllers with smart CPU caching and a redundant system fabric. The first release of the DL980 G7 uses Intel® Xeon® 6500 and 7500 processors (4-, 6- and 8-core SKUs) and provides 128 DIMM sockets and 16 PCIe slots. The next version supports the Xeon E7 2800 and 4800 processors (up to the 10-core SKU). With the current Intel Nehalem-EX configuration of 8 processors, each with 8 cores, Windows Server 2008 R2 enables a total of 128 Logical Processors (with HyperThreading) and supports the OS maximum of 2TB of RAM and all 16 PCIe slots. With Intel Xeon E7 2800 and 4800 (Westmere-EX) 10-core SKU processors, Windows Server 2008 R2 supports a total of 160 Logical Processors on the DL980 G7.
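The logical processor totals above follow directly from the socket, core and thread counts. The short Python sketch below reproduces that arithmetic and also estimates the minimum number of Windows KGROUPs needed, assuming the limit of 64 logical processors per KGROUP that is mentioned later in this blog. The function names are illustrative only.

```python
import math

# Limit of logical processors per Windows KGROUP, as stated later in this blog.
MAX_LPS_PER_KGROUP = 64


def logical_processors(sockets, cores_per_socket, threads_per_core=2):
    """Logical processors seen by Windows (HyperThreading gives 2 threads per core)."""
    return sockets * cores_per_socket * threads_per_core


def min_kgroups(total_lps, group_size=MAX_LPS_PER_KGROUP):
    """Smallest number of KGROUPs that can hold all logical processors."""
    return math.ceil(total_lps / group_size)


for name, cores in (("Nehalem-EX 8-core", 8), ("Westmere-EX 10-core", 10)):
    lps = logical_processors(8, cores)
    print(f"{name}: {lps} logical processors, at least {min_kgroups(lps)} KGROUPs")
```

In both cases the system needs more than one KGROUP, which is why KGROUP-aware drivers and applications matter on this platform.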

The performance and scalability of these new 8-way Intel based servers with HP scalable node controllers exceed anything that has previously existed on Intel/AMD and meet or exceed the performance of many UNIX platforms, at a considerably lower price point. In addition, these new Nehalem-EX and Westmere-EX servers are close to parity with UNIX platforms with respect to the R.A.S. feature set, according to Gartner* and other industry analysts.

The installation and configuration of a HP DL980 differs from earlier simpler SMP servers in a number of respects:

a. Placement of PCIe cards such as HBA and Network cards is critical

b. Device Drivers must be Windows KGROUP-aware and NUMA-aware

c. Windows Operating System and Applications must be the correct versions and patches to fully support and scale on the DL980 G7 platform

As Windows x64 Server workloads scale to configurations with 8 or more sockets, incorrect configuration or use of platforms like the HP DL980 G7 can result in substantial performance penalties. With this 8-processor configuration, incorrect device drivers or non-NUMA-aware applications (such as SAP’s ABAP application server) can drastically reduce the performance capabilities of the DL980 G7. Typical problems include flooding the QPI, IO Hub and PREMA chipset interconnects with non-scalable transactions. It is strongly recommended to read this blog on NUMA and this blog on using large scale-up servers like the DL980 for SAP applications. We strongly advise against the use of large scale-up servers as SAP application servers.

The Microsoft SQL Server team has extensively tested, deployed and benchmarked SQL Server on the DL980 G7 with exceptional results. The HP DL980 G7 and other 8-processor systems (such as NEC) have demonstrated excellent outcomes as database servers. The Windows Server and SQL Server teams and hardware partners such as HP are continuing to expand scale-up performance capabilities with new products and releases over the next 1-2 years as Windows 8 and SQL 11 are released.

The following figure illustrates the DL980 G7 block diagram. It is provided as a convenience for readers as they follow the configuration options and tuning presented in the following paragraphs.

[Figure: DL980 G7 block diagram (trays)]

2. Configuration Options

The DL980 G7 ships in a variety of configurations.  4 and 8 processor socket models are available. Additionally, there are several different configurations for PCI expansion slots.  These are documented in the HP Technical Reference Guide for the DL980.

Prior to installing and configuring the DL980 G7, it is essential to read and understand the server architecture. This is far more important for the DL980 G7 than for any 2-processor or 4-processor server.

Throughout this document, we will refer to Processors in terms of how Windows recognizes and describes them: Processors 0 – 7 and NUMA nodes 0 – 7.

The DL980 G7 has two “trays”. The upper tray holds Processors 0-3 and directly controls the Main and Sub IO boards. The lower tray holds Processors 4-7 and controls the “Option” or Low Profile (LP) IO board.

The DL980 G7 has three IO Buses:

a. Main PCI board – Directly connected to Processors 0-1, this board provides 5 PCIe Gen 2 IO slots [2 (x8) and 3 (x4) electrical connectors] suitable for high bandwidth PCI devices such as full height HBA, NIC and FusionIO/SSD cards. It also connects the embedded devices such as the LAN On Motherboard (LOM – NC375i), video, internal disk controller (Smart Array P410i), SATA DVD, USB ports, etc.

b. Sub IO PCI board (optional) – Directly connected to Processors 2-3, this board provides 5 PCIe Gen 2 slots [4 (x8) and 1 (x4) electrical connectors] and 1 PCIe Gen 1 (x4) slot [slot ID 1] (optionally a PCI-X slot – not recommended), suitable for full height high bandwidth devices.

c. Low Profile (LP) IO board – Directly connected to Processors 4-7, this board provides 4 x PCIe x8 and 1 x PCIe x4 slots. These slots are half height only. Most recent network and HBA cards ship with a low-profile bracket that can replace the standard one in order to fit in half height slots.

3. Recommended Configuration & Settings

a. RAM

The DL980 G7 has extremely powerful processors, each containing 2 memory controllers, and is capable of massive IO throughput when correctly configured. The platform is configured so that memory accesses are spread evenly across both memory controllers of each processor. To ensure a “balanced” system design, we recommend at least 512GB of RAM; typical deployments as of June 2011 have 1TB. A DL980 G7 with less than 512GB of RAM is unlikely to be able to leverage its very powerful processors. With sufficient RAM, most customers will observe a dramatic decrease in IOPS and a huge improvement in IO performance due to the large SQL Server data cache and the effects of SQL Server 2008 R2 compression.

Please note the following memory configuration facts:

- DDR3 DIMMs only.

- Each processor connects to 2 Memory Risers, with each riser supporting 1 memory controller and 8 DIMM connectors.

- Supports Registered DIMMs (RDIMM) only. Unbuffered DIMMs (UDIMM) are not supported.

o LR or DDR3L are only supported with Westmere-EX processors.

- Supports single-rank (SR), dual-rank (DR) and quad-rank (QR) DIMM modules

- 1Gb and 2Gb DRAM technologies are supported with Nehalem-EX processors and 4Gb are also supported with Westmere-EX processors.

- DIMMs are added in Quads across 2 memory controllers.

- Supports Advanced ECC, Online Rank Sparing and Mirroring.

- Memory ECC support includes correction of x4 and x8 chip fail.
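The memory facts above can be turned into a quick capacity check. The Python sketch below assumes 8 populated processors and that every processor receives the same number of identical DIMMs; the DIMM sizes used in the example calls are hypothetical and are not taken from the HP documentation.

```python
SOCKETS = 8            # processors in a fully populated DL980 G7
RISERS_PER_SOCKET = 2  # each processor connects to 2 memory risers
DIMMS_PER_RISER = 8    # each riser has 8 DIMM connectors


def total_dimm_slots():
    """Total DIMM connectors in the system."""
    return SOCKETS * RISERS_PER_SOCKET * DIMMS_PER_RISER


def total_ram_gb(dimms_per_socket, dimm_size_gb):
    """Total RAM for a symmetric population (same DIMMs on every processor)."""
    if dimms_per_socket % 4 != 0:
        raise ValueError("DIMMs are added in quads across the 2 memory controllers")
    if not 0 < dimms_per_socket <= RISERS_PER_SOCKET * DIMMS_PER_RISER:
        raise ValueError("a processor has between 4 and 16 DIMMs in this sketch")
    return SOCKETS * dimms_per_socket * dimm_size_gb


print(total_dimm_slots())    # DIMM connectors in the system
print(total_ram_gb(16, 4))   # hypothetical 4GB DIMMs in every slot
print(total_ram_gb(16, 8))   # hypothetical 8GB DIMMs in every slot
```

A check like this makes it easy to see which DIMM sizes are needed to reach the recommended 512GB or 1TB while keeping every memory controller populated evenly.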

b. Network

10 Gigabit network adaptors are recommended as standard. 1 Gigabit network adaptors are very likely to be a bottleneck on these powerful systems.

i. Reduce the total number of Network Adaptors required to achieve the required network performance by using 10G NIC

ii. 10 Gigabit network adaptors have significantly improved device drivers and device driver configuration options

iii. The HP 10Gb NC550SFP (Intel) network card has been tested and has proven to perform very well. It requires a PCIe x8 slot on the Main or Sub IO Board to reach full performance

iv. HP DL980s support up to 4 x 10 Gbps NICs such as NC550SFP and NC523SFP (refer to HP DL980 quickspecs documents for up to date information)

v. For load balancing purposes 10 Gigabit cards can be balanced between the Main and Sub IO boards (example: 2 x Dual port 10G NIC on Main Board and 2 x Dual port 10G on Sub IO Board)

vi. The onboard Broadcom quad port device should only be used for heartbeat and HP iLO/management network

vii. It is recommended to leverage Receive Side Scaling (RSS) on modern Network Adaptors with modern device drivers. 1 Gigabit Network Adaptors usually only support up to 8 RSS “Rings” or “Queues”. 10 Gigabit Adaptors support at least 16 Rings/Queues. Receive Side Scaling is a mechanism to balance the DPC load across multiple logical processors. This avoids the problem sometimes seen during extremely high network activity where high kernel time is seen on one processor only (often logical processor 0 or 1, but not always). Note: The RSS implementation for Windows Server 2008 R2 covers only logical processors in the first Windows KGROUP of processors (KGROUP#0). This restriction will be eliminated with the next version of Windows Server.

viii. The latest version of the HP Network Configuration Utility (NCU) should be used as this will enable RSS on the Team. The purpose of the HP network teaming software is to increase the server’s network availability and performance. Fault tolerance is enabled by default when network ports are teamed together (up to 8), but load balancing must be carefully planned to allow both RX and TX bandwidth aggregation. It is often necessary to configure the network switches to provide this capability (refer to the HP NCU documentation for details).

c. HBA & IO

i. For high IO applications, such as SQL Server, install at least 2 x dual port HBA; more common configurations are 4 x dual port HBA. To fully leverage the IO capabilities of a DL980, a powerful SAN with hundreds of disks should be used

ii. It is important that all the HBA ports are connected and active.  MPIO software must be installed and configured correctly. Automatic Load balancing (ALB) is recommended for HP EVA series SAN.

iii. Different HBA have different settings and different SAN models have different capabilities, however as a general guidance it is recommended:

1. Emulex – set HBA Queue Depth to around 64-254 in “OC Manager / HBA Anywhere” in most configurations

2. QLogic – set Execution Throttle to 64-96 in “SanSurfer”

3. Brocade - Queue Depth is documented in the Brocade Admin Guide

iv. HBA cards should be distributed across all 3 IO Hubs according to their PCIe interface characteristics. Always try to put a x8 capable card in a x8 slot to benefit from its full features and to balance the load across the system.
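As an illustration of the balancing idea, the Python sketch below spreads x8 cards round-robin across the three IO boards, using the x8 slot counts listed in the Configuration Options section. This is only a planning aid under simplifying assumptions: the function and card names are hypothetical, it ignores NICs that also need x8 slots on the Main or Sub IO boards, and it does not model slot numbers or low-profile bracket requirements.

```python
# x8 slots per IO board, as listed in the Configuration Options section.
X8_SLOTS = {"Main": 2, "Sub": 4, "LP": 4}


def place_x8_cards(card_names, slots=X8_SLOTS):
    """Distribute cards round-robin across boards, skipping boards that are full."""
    free = dict(slots)
    placement = {board: [] for board in slots}
    boards = list(slots)
    position = 0  # board to try first for the next card
    for card in card_names:
        for attempt in range(len(boards)):
            board = boards[(position + attempt) % len(boards)]
            if free[board] > 0:
                placement[board].append(card)
                free[board] -= 1
                position = (position + attempt + 1) % len(boards)
                break
        else:
            raise ValueError("no free x8 slot left for " + card)
    return placement


hbas = [f"HBA{i}" for i in range(1, 9)]  # hypothetical: 8 HBA cards
for board, cards in place_x8_cards(hbas).items():
    print(board, cards)
```

The final placement should always be checked against the HP Technical Reference Guide, since the actual slot layout and any NIC requirements take precedence over an even split.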

Special Note on FusionIO Devices (generally applicable to other SSD cards):

FusionIO cards (HP IO accelerator boards) provide ultra fast access (up to 10,000 times faster than mechanical disks). So far extensive testing has only been conducted with FusionIO 640GB and 1.28TB cards. The results have been exceptional provided the following guidance is followed:

i. Use FusionIO 2.2.3 device driver – K-Group Aware (available from FusionIO website)

ii. Place the FusionIO card on the Main or Sub IO Board only, because these are not low profile cards.

iii. If you use more than 4 HP IO accelerators, modifications must be made within the BIOS to allow for more cooling.

iv. The current FusionIO stack still needs scalability improvements and FusionIO is working on them. Meanwhile, please note that with a larger number of Fusion IO cards in this system, we have pushed over 1 million IOPs and above 16GB/s. With the current FusionIO driver implementation, you might find:

a. Very high processor utilization on a subset of the logical processors. This is because current FusionIO HBAs don’t support MSI-X, so each HBA sends its interrupts to a single logical processor. Additionally, the current implementation supports only one DPC and one FusionIO completion worker thread per HBA. For highly demanding IO workloads, expect to identify these logical processors by their CPU utilization at or near 100%. FusionIO has made configuration parameters available that partially mitigate this by allowing you to specify which logical processors are dedicated to these tasks. Use SQL affinity masks to keep SQL Server off these logical processors.

b. IO throughput imbalances. The best IO performance is obtained by generating read/write requests on the same socket as where the FusionIO completion thread runs.

c. Throughput variations due to the impact of grooming the SSD, an issue seen with both previous and current generations of SSDs.
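To show what keeping SQL Server off the FusionIO logical processors could look like, the Python sketch below builds an ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU statement that lists every logical processor except a reserved set. The reserved processor numbers in the example are hypothetical, and the sketch assumes that the CPU IDs SQL Server expects match the logical processor numbers you identified; with multiple KGROUPs this mapping should be verified on the target system before applying anything.

```python
def affinity_ranges(total_lps, reserved):
    """Return 'a TO b, c' style ranges covering every LP not in `reserved`."""
    reserved = set(reserved)
    ranges = []
    start = None
    # One extra iteration past the last LP closes the final run.
    for lp in range(total_lps + 1):
        usable = lp < total_lps and lp not in reserved
        if usable and start is None:
            start = lp
        elif not usable and start is not None:
            end = lp - 1
            ranges.append(str(start) if start == end else f"{start} TO {end}")
            start = None
    return ", ".join(ranges)


def affinity_statement(total_lps, reserved):
    """Build the T-SQL statement; refuses to exclude every processor."""
    ranges = affinity_ranges(total_lps, reserved)
    if not ranges:
        raise ValueError("at least one logical processor must remain for SQL Server")
    return "ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = " + ranges


# Hypothetical example: 128 LPs, with LPs 0 and 64 busy servicing FusionIO cards.
print(affinity_statement(128, {0, 64}))
```

Generating the statement from a list of reserved processors avoids hand-editing long bitmasks, which is error-prone on systems with more than 64 logical processors.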

d. BIOS

It is recommended to change the following default values in the DL980 G7 BIOS

SETTING NAME | RECOMMENDED SETTING | DEFAULT SETTING
HP Power Profile | Custom | Balanced Power & Perf
HP Power Regulator | OS Control Mode | HP Dynamic Power Savings Mode
Min Processor Idle Pwr State | For ultra-high load systems configure no C-States (no processor power gains) or C1e. Most systems can run on the default “Green” energy saving power plan | C6 State
Memory Power Capping | Disabled | Enabled
Collaborative Power Control | Disabled | Enabled
MPS Table Mode | Full Table APIC | Full Table APIC
Address Mode 44-bit | Enabled (with Windows Server 2008 R2) | Disabled
Thermal Configuration | Increased cooling, or even “max fans” (blowout) if using many HP IO accelerator/FusionIO cards | Optimal cooling
Windows Debugging: ASR Status (disabled when no debugger is attached) | Disabled | Enabled
Windows Debugging: iLO CLI (from iLO session) | Disabled | Enabled

· With DL980 G7 platforms it is recommended to enable x2APIC mode with Windows Server 2008 R2 SP1 for improved IO scaling. However, this requires KB2303458 and KB2398906 plus a Windows QFE to be installed on the platform, and an additional opt-in BCDEDIT command to be executed.

e. Firmware

It is critical that the firmware for the below components is updated to the latest available.

i. DL980 G7 System

ii. HBA Cards

iii. Network Cards

iv. FusionIO Cards (if used)

It is strongly recommended to verify the firmware of these components on the vendors support web sites. It is generally not a good idea to run DL980 G7 configurations on the default firmware that comes with the DL980 G7 and associated PCI cards.

f. Windows versions & Configuration

i. Windows 2008 R2 + SP1 or higher is recommended. Service Pack 1 contains many critical performance fixes for > 64 logical processor support. 

ii. Windows 2008 cannot support 44-bit addressing correctly and should not be installed on the DL980 G7. If for some reason Windows 2008 (non-R2 version) is deployed on a DL980, both HyperThreading and 44-bit memory addressing must be disabled; the system is then limited to 64 logical processors and 1 TB of RAM. Windows Server 2008 SP2 is the only supported Windows Server 2008 version on the HP DL980 G7. If Windows Server 2008 is still required, please refer to the HP Support web site and organization for the complete list of Windows QFEs for Windows Server 2008 SP2 on the HP DL980 G7.

iii. You should remove Internet Explorer and any other non-essential software that will increase the patching and security update requirements. Removing Internet Explorer dramatically reduces the need to apply security patches, as most of the vulnerabilities require a user to access external sites. This KB article describes how to remove Internet Explorer. Testing has shown that all features and functions of most software work without IE (including Cluster Validation, as the XML file can be saved locally and opened on another PC)

iv. Create a large, single, contiguous C: drive, ideally large enough to hold a complete memory dump in case a kernel dump is insufficient (pagefile = RAM size). Note: it is highly unlikely that Windows would need to use much more than 30GB of pagefile under normal circumstances.

v. Processor Scheduling. System Properties ->Advanced ->Processor scheduling ->Adjust for best performance: Background Services

vi. Control Panel -> Power Options: select the High Performance power plan (optionally leave it on Balanced for better energy efficiency)

If disabling of C-states is required:

powercfg -setacvalueindex scheme_min sub_processor 5d76a2ca-e8c0-402f-a133-2158492d58ad 1

powercfg -setacvalueindex scheme_max sub_processor 5d76a2ca-e8c0-402f-a133-2158492d58ad 1

powercfg -setacvalueindex scheme_balanced sub_processor 5d76a2ca-e8c0-402f-a133-2158492d58ad 1

powercfg -setactive scheme_current

Note: To re-enable C-states, repeat commands above with a value of 0 instead of 1
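Because the same setting has to be applied to three power schemes, and reversed with a different value, it can help to generate the command lines instead of retyping them. The Python sketch below only builds and prints the powercfg commands shown above; it does not execute them. The function name is illustrative, and the GUID and scheme aliases are copied from the commands in this blog.

```python
# Processor idle setting GUID and scheme aliases, copied from the commands above.
IDLE_SETTING_GUID = "5d76a2ca-e8c0-402f-a133-2158492d58ad"
SCHEMES = ("scheme_min", "scheme_max", "scheme_balanced")


def cstate_commands(disable_cstates):
    """Return the powercfg command lines: value 1 disables C-states, 0 re-enables them."""
    value = 1 if disable_cstates else 0
    commands = [
        f"powercfg -setacvalueindex {scheme} sub_processor {IDLE_SETTING_GUID} {value}"
        for scheme in SCHEMES
    ]
    # Re-apply the current scheme so the changed value takes effect.
    commands.append("powercfg -setactive scheme_current")
    return commands


for line in cstate_commands(disable_cstates=True):
    print(line)
```

Calling cstate_commands(False) produces the matching commands to re-enable C-states.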

vii. Storport DPC scheduling (https://support.microsoft.com/kb/982383):

\Registry\MACHINE\System\CurrentControlSet\Control\Session Manager\Kernel\MaximumDpcQueueDepth = REG_DWORD 0x00000001

viii. A reminder that the RSS implementation of Windows Server 2008 R2 supports only logical processors in KGROUP#0 (maximum 64 logical processors per KGROUP). This is effectively a NETIO scaling issue on the DL980 G7 with 128 LPs (Nehalem-EX) or 160 LPs (Westmere-EX).

ix. It has been identified that the default Windows Server 2008 R2 assignment of logical processors to KGROUPs might not be as robust as expected. To avoid confusing assignments, the default assignments can be overridden via registry settings. As examples:

1. For DL980 G7 - 8 Nehalem-EX 8-core SKU processor socket configurations and a 32LP assignment for each KGROUP, up to 4 for a total of 128LPs; i.e a “32-32-32-32” configuration:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NUMA]

"Group Assignment"=hex:08,00,00,00,00,00,00,00,00,00,00,00,01,00,00,00,00,00,\

00,00,02,00,00,00,01,00,00,00,03,00,00,00,01,00,00,00,04,00,00,00,02,00,00,\

00,05,00,00,00,02,00,00,00,06,00,00,00,03,00,00,00,07,00,00,00,03,00,00,00

2. For DL980 G7 - 8 Westmere-EX 10-core SKU processor socket configurations and a 40LP assignment for each KGROUP, up to 4 for a total of 160LPs; i.e. a “40-40-40-40” configuration:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NUMA]

"Group Assignment"=hex:08,00,00,00,00,00,00,00,00,00,00,00,01,00,00,00,00,00,\

00,00,02,00,00,00,01,00,00,00,03,00,00,00,01,00,00,00,04,00,00,00,02,00,00,\

00,05,00,00,00,02,00,00,00,06,00,00,00,03,00,00,00,07,00,00,00,03,00,00,00

3. For DL980 G7 - 8 Westmere-EX 10-core SKU processor socket configurations and a 60LP assignment for the first 2 KGROUPs and only 40LPs for the third KGROUP, for a total of 160LPs; i.e. a “60-60-40” configuration. This configuration might be used to maximize the use of Logical Processor targets for RSS and IO Drivers not supporting KGROUPs other than KGROUP#0.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NUMA]

"Group Assignment"=hex:08,00,00,00,00,00,00,00,00,00,00,00,01,00,00,00,00,00,\

00,00,02,00,00,00,00,00,00,00,03,00,00,00,01,00,00,00,04,00,00,00,01,00,00,\

00,05,00,00,00,01,00,00,00,06,00,00,00,02,00,00,00,07,00,00,00,02,00,00,00
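The three registry values above appear to share one layout: a little-endian 32-bit count of NUMA nodes followed by one (NUMA node, KGROUP) pair of 32-bit values per node. That layout is inferred from the examples in this blog and is an assumption, not a documented format. Under that assumption, the Python sketch below encodes and decodes a "Group Assignment" value and reports how many logical processors land in each KGROUP. All function names are illustrative.

```python
import struct


def encode_group_assignment(node_to_group):
    """node_to_group[i] is the KGROUP for NUMA node i (layout assumed, see above)."""
    data = struct.pack("<I", len(node_to_group))
    for node, group in enumerate(node_to_group):
        data += struct.pack("<II", node, group)
    return data


def decode_group_assignment(data):
    """Inverse of encode_group_assignment: returns the KGROUP of each NUMA node."""
    (count,) = struct.unpack_from("<I", data, 0)
    pairs = [struct.unpack_from("<II", data, 4 + 8 * i) for i in range(count)]
    return [group for _node, group in pairs]


def as_reg_hex(data):
    """Format the bytes the way a .reg file writes a binary value (single line)."""
    return "hex:" + ",".join(f"{byte:02x}" for byte in data)


def lps_per_group(node_to_group, lps_per_node):
    """Logical processors per KGROUP, assuming every NUMA node has the same LP count."""
    totals = {}
    for group in node_to_group:
        totals[group] = totals.get(group, 0) + lps_per_node
    return [totals[group] for group in sorted(totals)]


layout_60_60_40 = [0, 0, 0, 1, 1, 1, 2, 2]   # mapping used in example 3 above
print(as_reg_hex(encode_group_assignment(layout_60_60_40)))
print(lps_per_group(layout_60_60_40, lps_per_node=20))
```

Decoding a value before importing it is a cheap way to confirm that no KGROUP exceeds 64 logical processors and that the intended split (for example 60-60-40) is really what the bytes describe.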

g. SQL Server versions

SQL Server 2008 R2 + CU7 (or latest CU/Service Pack) or higher is recommended. SQL Server 2008 cannot address more than 64 logical processors and is therefore not recommended for the DL980. SQL Server 2008 R2 Enterprise Edition x64 can address 8 physical processors and 256 logical processors, as detailed in this blog.

Finally, a recommended server installation sequence would be as follows:

i. Insert MS Volume Licensing Win2008 R2 EE x64 Retail DVD – boot and install

ii. Install Service Pack 1 for Windows 2008 R2

iii. Download the HP Smart Start 8.7 or higher ISO image from www.hp.com/support/DL980G7

a. Run support pack for Windows 2008 R2 (see below for instructions)

b. Run HP Smart Update 8.7 or higher

iv. Steps a and b above should load the correct, validated and tested device drivers. Do not download device drivers from vendor websites unless these are clearly tested and verified (example: do not run the Emulex HBA device driver; use the HP-branded device driver).

v. HP Smart Update 8.7 must be downloaded and run. Smart Update 8.7 is required to support Windows 2008 R2 Service Pack 1

vi. Perform the tuning steps as detailed in this blog

h. What to do if you still have issues? HP Support can run an Easy Assist session and check the DL980. The purpose of this session is to allow HP to tune the DL980 and to run utilities that expose the hardware performance counters of the QPI and HP chipsets under workload, which helps with configuring the system and identifying software limitations.

4. Sample customer configuration

Below is a resource and configuration map that was used by one of our large SAP customers. It shows 8 HBAs on the Sub, Main and LP IO boards. The NIC cards are linked to the first group of 4 processors for optimal performance. The configuration meets or exceeds the customer’s requirements in terms of IO throughput, redundancy and network performance. Note that all 8 HBAs are on PCIe generation 2.0 x8 ports. Low Profile adapters will be needed.

[Figure: sample customer resource and configuration map]

5. Summary of Do’s & Don’ts

a. Don’t use PCIe Slot 1 – it is much slower than other slots

b. Don’t use PCI X cards

c. Don’t use DL980 as a PC – install only essential software that is required. Remove Internet Explorer to reduce security patching requirements

d. Don’t run Windows 2008 – install Windows 2008 R2 + Service Pack 1

e. Don’t run SQL 2008 – install SQL 2008 R2 + CU7 or higher (use SQL SP1 when released)

f. Do use the latest HP Smart Start – 8.70 as of May 2011

g. Do use the latest HP Smart Update – 8.70 as of April 2011 (required for Win2008 R2 SP1)

h. Do use 10G NIC – improved drivers, more RSS rings

i. Do consider using FusionIO or other SSD  to speed up IO intensive operations (Log or Tempdb) – Windows 2008 R2 & SQL Server fully support the use of SSD disks such as FusionIO

j. Do set Windows Power Savings feature to “High Performance” for systems on high load (Balanced can be used to save energy)

k. Do use at least 16 to 32 LUNs for holding the SQL datafiles for larger systems

l. The DL980 has extremely powerful CPUs – do not underspec memory. The minimum is 512GB; 1TB or more is typical

6. Summary of Results

During the PoC the customer’s existing 6TB database was compressed to 1.34TB using SQL Server 2008 PAGE compression. Performance of key batch jobs increased by up to 1,500% through infrastructure improvements and query tuning. Below are some screenshots from the PoC system.

SAN throughput on backup to NUL: 1304 MB/sec. The SAN was mid-range with 168 spindles. All modern SCSI FC based SANs are capable of at least 500-600 MB/sec when correctly configured.

[Screenshot 1]

Time to run a CHECKDB physical only 49min 31sec (previously well over 2 hours)

[Screenshot 2]

Time to run a full CHECKDB 5hours 8min 51sec (Previously over 36 hours)

[Screenshot 3]

Time to run a full BACKUP to a high speed disk : 19min 43sec 1.34TB Database compresses to 598GB backup file. Previously over 5 hours. Time to restore database was just less than 1 hour.

[Screenshot 4]
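The headline numbers above can be cross-checked with simple arithmetic. The Python sketch below derives the average backup read rate, the compression ratios and the CHECKDB speedup from the figures quoted in this section. It assumes binary units (1TB = 1024GB) and that the full 1.34TB was read during the backup, and it treats "over 36 hours" as exactly 36 hours, so the speedup is a lower bound.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


def throughput_mb_per_sec(size_tb, elapsed_seconds):
    """Average throughput in MB/sec, assuming 1TB = 1024 * 1024 MB."""
    return size_tb * 1024 * 1024 / elapsed_seconds


# Full backup: 1.34TB database read in 19min 43sec.
backup_rate = throughput_mb_per_sec(1.34, seconds(minutes=19, secs=43))

# PAGE compression: 6TB down to 1.34TB.
database_compression_ratio = 6 / 1.34

# Backup compression: 1.34TB database to a 598GB backup file.
backup_compression_ratio = 1.34 * 1024 / 598

# Full CHECKDB: previously over 36 hours, now 5hours 8min 51sec.
checkdb_speedup = seconds(hours=36) / seconds(hours=5, minutes=8, secs=51)

print(f"Average backup read rate: {backup_rate:.0f} MB/sec")
print(f"Database compression ratio: {database_compression_ratio:.2f}x")
print(f"Backup compression ratio: {backup_compression_ratio:.2f}x")
print(f"CHECKDB speedup: at least {checkdb_speedup:.1f}x")
```

The derived backup read rate is in the same range as the SAN throughput measured on the backup to NUL, which suggests the backup was limited mainly by how fast the SAN could deliver data.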

7. References to HP DL980 G7 & Contributors to this Blog

1. https://h18004.www1.hp.com/products/quickspecs/DS_00190/DS_00190.pdf

2. https://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-0643ENW.pdf

3. https://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c02577023/c02577023.pdf

Special thanks to the HP BCS Windows Engineering : Thierry & Laurence for their assistance with this article

Contributors: Nicholas Dritsas & Jimmy May from SQL CAT and Rainer Goetzmann SQL PFE based at SAP-AG

*Gartner report : “Impact of the New Generation of x86 on the Server Market” – “Intel and OEMs have driven performance and reliability of the Intel Xeon 7500 processor series (code-named Nehalem-EX) to a level overlapping what Gartner judges to be about 80% of the RISC/Itanium function/feature set, but at a much lower price point. The result will cause increased server technology consolidation toward a bipolar distribution: x86-based platforms and one primary market-share-leading RISC technology (IBM Power), complementing the prevailing legacy of mainframes.”