# DirectXMath: F16C and FMA

In this installment of our series, we cover a few additional instructions that extend the AVX instruction set. These instructions use the VEX prefix and require the OS to implement “OSXSAVE” support. Without this support, these instructions are all invalid and will generate an invalid-instruction hardware exception.

# Half-precision Floating-point Conversion

The F16C instruction set (also called CVT16 by AMD) provides support for half-precision <-> single-precision floating-point conversions. These intrinsics are in the `immintrin.h` header.

``````
inline float XMConvertHalfToFloat( HALF Value )
{
    __m128i V1 = _mm_cvtsi32_si128( static_cast<uint32_t>(Value) );
    __m128 V2 = _mm_cvtph_ps( V1 );
    return _mm_cvtss_f32( V2 );
}

inline HALF XMConvertFloatToHalf( float Value )
{
    __m128 V1 = _mm_set_ss( Value );
    __m128i V2 = _mm_cvtps_ph( V1, 0 );
    return static_cast<HALF>( _mm_cvtsi128_si32(V2) );
}
``````

The underlying instructions actually convert four HALF <-> float values at a time, so they can also be used to improve the performance of both `XMConvertHalfToFloatStream` and `XMConvertFloatToHalfStream`.
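For reference, the per-lane conversion the hardware performs can be sketched in portable code. This scalar `HalfToFloatScalar` helper is purely an illustration of the half-float format (1 sign bit, 5 exponent bits, 10 mantissa bits), not part of DirectXMath:

```cpp
#include <cstdint>
#include <cstring>

// Illustrative scalar equivalent of what VCVTPH2PS computes per lane.
inline float HalfToFloatScalar(uint16_t h)
{
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;   // 5-bit biased exponent (bias 15)
    uint32_t mant = h & 0x3FF;          // 10-bit mantissa
    uint32_t bits;

    if (exp == 0)
    {
        if (mant == 0)
        {
            bits = sign;                // signed zero
        }
        else
        {
            // Subnormal half: normalize into the float format
            exp = 127 - 15 + 1;
            while ((mant & 0x400) == 0) { mant <<= 1; --exp; }
            mant &= 0x3FF;
            bits = sign | (exp << 23) | (mant << 13);
        }
    }
    else if (exp == 0x1F)
    {
        bits = sign | 0x7F800000u | (mant << 13);   // infinity / NaN
    }
    else
    {
        // Normal: rebias exponent from 15 to 127, widen mantissa to 23 bits
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }

    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

For example, the half-float pattern `0x3C00` (sign 0, exponent 15, mantissa 0) decodes to 1.0f.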

# Fused Multiply-Add

Computations often contain steps where two values are multiplied and the result is then accumulated with previous results. This can be done in a single instruction using a ‘fused’ multiply-add operation:

V = V1 * V2 + V3

DirectXMath provides this functionality with the `XMVectorMultiplyAdd` function. The challenge in making use of FMA is that Intel and AMD took a while to agree on the exact details. (ARM-NEON, by contrast, has long had a fused multiply-add instruction.)
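The defining property of a fused multiply-add is that the intermediate product is not rounded; only the final result is. A minimal portable sketch using the standard library's `std::fmaf` (shown for exposition only; DirectXMath itself uses the vector intrinsics shown below):

```cpp
#include <cmath>

// V = V1 * V2 + V3 computed with a single rounding, as FMA hardware does.
inline float FusedMultiplyAdd(float v1, float v2, float v3)
{
    return std::fmaf(v1, v2, v3);
}

// Because the product is exact inside the FMA, fma(a, a, -(a*a))
// yields the rounding error of the plain product a*a.
inline float ProductRoundingError(float a)
{
    float r = a * a;              // product rounded to float precision
    return std::fmaf(a, a, -r);   // exact a*a minus r, rounded once
}
```

For inputs whose square is not exactly representable (e.g. `1.0f + 2^-13`), `ProductRoundingError` returns a nonzero value, something a separate multiply-then-add sequence cannot observe.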

AMD Bulldozer implements FMA4, which uses a non-destructive destination form with 4 register operands. These intrinsics are located in the `ammintrin.h` header.

``````
inline XMVECTOR XM_CALLCONV XMVectorMultiplyAdd( FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3 )
{
    return _mm_macc_ps( V1, V2, V3 );
}

inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract( FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3 )
{
    return _mm_nmacc_ps( V1, V2, V3 );
}
``````

Intel “Haswell” is expected to implement FMA3, which uses a destructive destination form with only 3 register operands. These intrinsics are located in the `immintrin.h` header.

``````
inline XMVECTOR XM_CALLCONV XMVectorMultiplyAdd( FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3 )
{
    return _mm_fmadd_ps( V1, V2, V3 );
}

inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract( FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3 )
{
    return _mm_fnmadd_ps( V1, V2, V3 );
}
``````

AMD has announced it is planning to implement FMA3 in “Piledriver”. It is also fairly easy to generate both versions from the same source code by substituting one intrinsic for the other.
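One way to share the source is a preprocessor switch that selects the intrinsic at build time. The `USE_FMA4` macro here is a hypothetical build setting, not part of DirectXMath:

```cpp
#if defined(USE_FMA4)
#include <ammintrin.h>
#define XM_FMADD_PS(a, b, c) _mm_macc_ps((a), (b), (c))
#else
#include <immintrin.h>
#define XM_FMADD_PS(a, b, c) _mm_fmadd_ps((a), (b), (c))
#endif

inline XMVECTOR XM_CALLCONV XMVectorMultiplyAdd( FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3 )
{
    return XM_FMADD_PS( V1, V2, V3 );
}
```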

# Processor Support

F16C/CVT16 is supported by AMD “Piledriver”, Intel “Ivy Bridge”, and later processors.

FMA4 is supported by AMD Bulldozer.

FMA3 will be supported by Intel “Haswell” and AMD “Piledriver” processors.

As extensions of the AVX instruction set, these instructions all require OSXSAVE support. This support is included in Windows 7 Service Pack 1, Windows Server 2008 R2 Service Pack 1, Windows 8, and Windows Server 2012.

``````
#include <intrin.h>

int CPUInfo[4] = { -1 };
__cpuid( CPUInfo, 0 );

bool bOSXSAVE = false;
bool bAVX = false;
bool bF16C = false;
bool bFMA3 = false;
bool bFMA4 = false;

if ( CPUInfo[0] > 0 )
{
    __cpuid( CPUInfo, 1 );
    bOSXSAVE = (CPUInfo[2] & 0x8000000) != 0;
    bF16C = bOSXSAVE && (CPUInfo[2] & 0x20000000) != 0;
    bAVX = bOSXSAVE && (CPUInfo[2] & 0x10000000) != 0;
    bFMA3 = bOSXSAVE && (CPUInfo[2] & 0x1000) != 0;
}

__cpuid( CPUInfo, 0x80000000 );
if ( static_cast<uint32_t>(CPUInfo[0]) > 0x80000000 )
{
    __cpuid( CPUInfo, 0x80000001 );
    bFMA4 = bOSXSAVE && (CPUInfo[2] & 0x10000) != 0;
}
``````

# Compiler Support

FMA4 intrinsics were added to Visual Studio 2010 via Service Pack 1.

FMA3 and F16C/CVT16 intrinsic support requires Visual Studio 2012.

# Utility Code

Update: The source for this project is now available on GitHub under the MIT license.

Xbox One: This platform supports F16C, but does not support FMA3 or FMA4.


1. barbie says:

Thanks for this series Chuck. I was wondering, do you know if there is any method to the names of the intrinsic headers? They seem to be using a bit of a random distribution, making remembering them … hard!

2. clayman says:

Nice series !

3. Many of them are Intel codenames. `pmmintrin.h` houses SSE3, which was originally codenamed "Prescott New Instructions (PNI)", hence the 'p'. As such, the early ones tend to be a little unintuitive.

All the more recent stuff is in `immintrin.h` (Intel: AVX, F16C, FMA3) or `ammintrin.h` (AMD: FMA4, XOP, etc.), although that of course doesn't indicate which ones are vendor-specific and which have been adopted by the other vendor.

In theory most intrinsics should all be in intrin.h.

   | Header | Description |
   |---|---|
   | `intrin.h` | General intrinsics, notably `__cpuid`, and intrinsic forms of various CRT routines |
   | `ammintrin.h` | SSE5, FMA4, and XOP intrinsics |
   | `xmmintrin.h` | SSE intrinsics and the `__m128` type (single-precision float SIMD) |
   | `emmintrin.h` | SSE2 intrinsics and the `__m128i`/`__m128d` types (double-precision float and integer SIMD) |
   | `pmmintrin.h` | SSE3 intrinsics (horizontal add/subtract float/double operations, specific ‘dup’ operations) |
   | `tmmintrin.h` | SSSE3 intrinsics (more horizontal ops, integer abs, ‘byte’ shuffle to augment SSE2) |
   | `smmintrin.h` | SSE4.1 intrinsics (dot-product, rounding, augmented min/max support for SSE2) |
   | `nmmintrin.h` | SSE4.2 intrinsics |
   | `immintrin.h` | AVX, FMA3, F16C/CVT16, and AVX2 intrinsics |
   | `wmmintrin.h` | AES intrinsics |
4. Alecazam says:

What is the recommended safe method of working with intrin.h?  This can cause a crash if I don't have the appropriate SSE processor for an intrinsic, but they're all defined by that header (AVX, etc).  In VS2012, intrin.h is also included by <string> now since it wants the InterlockedIncrement/Decrement calls in there.    I much preferred explicitly including the SSE headers, but I see no way to mask out those in intrin.h.

5. Any use of intrinsics assumes the programmer knows what they are doing. It's basically like inline assembly.