When hyperthreading is enabled, all the processors are virtual


A common problem when answering technical questions is that people sometimes ask a question that can't or shouldn't be answered because it is based upon a misunderstanding. What's particularly frustrating is when they insist that you answer their question as posed, even when you try to explain to them that their question is itself flawed.

It's as if somebody asked you the question, "Do I have to use the remote control to lock my kangaroo?" You could answer the question literally ("No"), but the person asking the question would walk away with the wrong conclusion ("Wow, kangaroos are self-locking!"). Robert Flaming recalls a similar analogy I made with balsa wood and nails.

Here's an example of a question that betrays a misunderstanding.

I just enabled hyperthreading on my dual-Xenon machine, and Task Manager now shows four processors instead of two. Which of them are the physical processors and which are the virtual ones?

When you turn on hyperthreading, each individual physical processor acts as if it were two virtual processors. From Task Manager's point of view, the computer has four virtual processors. The two virtual processors associated with each physical processor are completely equivalent. It's not like one is physical and one is virtual. They are both virtual and compete equally for a share of the one physical CPU. When you set processor affinities, you set them to virtual processors.

To find out which virtual processors are associated with the same physical processor, you can call the GetLogicalProcessorInformation function.
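For each physical core, that function reports a bit mask of the virtual processors that belong to it. As a rough illustration of what you would do with those masks, here is a minimal Python sketch that turns them into sibling groups. The masks are made-up sample values for a dual-processor hyperthreaded machine, not output from the real API, and the helper names `processors_in_mask` and `sibling_groups` are hypothetical.

```python
def processors_in_mask(mask):
    """Return the indices of the bits set in an affinity-style bit mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def sibling_groups(core_masks):
    """Map each virtual processor to every virtual processor sharing its physical processor."""
    siblings = {}
    for mask in core_masks:
        members = processors_in_mask(mask)
        for cpu in members:
            siblings[cpu] = members
    return siblings


# Hypothetical sample data: one mask per physical processor.
# The real numbering is up to the system, so do not assume this layout.
SAMPLE_CORE_MASKS = [0b0011, 0b1100]

groups = sibling_groups(SAMPLE_CORE_MASKS)
for cpu in sorted(groups):
    print(f"virtual processor {cpu} shares a physical processor with {groups[cpu]}")
```

Note that nothing in the result marks one member of a group as "the physical one"; the members of each group are interchangeable.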

Comments (25)
  1. vince says:

    But of course in many cases turning on Hyperthreading reduces your performance due to Intel’s botched implementation.

    I ran extensive benchmarks on our heavy simulation workload, and we actually got poorer performance with HT turned on, especially because users would treat it like a 4-way rather than a 2-way machine.

    You are much better off if you can somehow communicate to the users "yes, it says 4 CPUs, but only use 2". If you treat it as a 4-way machine, tasks often take longer to run in the end.

    So maybe you should have told them "don’t turn it on at all".

  2. Roel says:

    The correct answer to such a question is ‘mu’. See http://en.wikipedia.org/wiki/Mu_%28Japanese_word%29.

    (Just happy to be able to provide a useful comment once, even if it’s not technical :) )

  3. Dave says:

    "you can call the GetLogicalProcessorInformation function"

    The documentation says that function requires Windows Vista, Windows XP Professional x64 Edition, Windows Server "Longhorn" or Windows Server 2003. So I can "try to call" or in five years "will be able to call" with decent success, but in today’s Windows installed base it will be rare that I "can call" GetLogicalProcessorInformation.

    I’m glad that future generations will have this function though. It’s pretty darned hard to scope out HT processors right now. Intel has some ugly code that plays with SetProcessAffinityMask in order to divine whether the "processors" are real or HT, but it would have been nice if they had added a CPUID function to just tell us whether HT was enabled.

  4. BryanK says:

    vince: We see the same thing with the 3D CAD program that we use. When HT is on, it runs slower.

    Of course, it’s also single-threaded, so maybe that has something to do with it.

  5. Mihai says:

    You can always answer "jein" (a German word that blends ‘ja’ and ‘nein’, meaning "yes and no" at the same time).

    Or you can start with "the standard expert answer" which is "well, it depends …"

    :-)

  6. BryanK says:

    Raymond: I’m not sure about vince, but our 3D CAD software slowdown happens on both 2K Pro+SP4 and XP Pro+SP2. According to your earlier post, XP understands HT and can schedule processes appropriately.

    Our issue doesn’t appear to be related to scheduling, just something strange that happens in the CPU when HT is on and it’s getting used heavily by one thread.

  7. Nick Lamb says:

    "Our issue doesn’t appear to be related to scheduling, just something strange that happens in the CPU when HT is on and it’s getting used heavily by one thread."

    Caches are very important, and HT has to share one cache between two virtual processors.

    Normally, when something else must run briefly on your UP (uni-processor) machine, the CAD thread is stopped, the other thread is started, it finishes, and then the CAD thread starts again. Each time this happens, the code & data for the CAD thread is (likely to be) flushed from the cache, and the cache warms up again when the CAD thread is re-started.

    Now, on the HT system, the OS knows it should prefer an empty physical CPU to one with a thread on it, but it can’t find such a CPU, so it starts the short-lived thread on another virtual processor sharing with your CAD thread. The CAD thread isn’t stopped, but it is sharing its cache with the other thread. This causes a lot more misses than normal, and in highly optimised inner loops (which your engineers may have used in heavy calculations) this makes things many times slower.

    (For example, suppose your code does millions of "random" accesses in a 400×400 array of int32s. With a 1MB data cache such an algorithm can be tuned to fit in the cache and run very fast. However if half the cache is being used by another thread, half your accesses go back to RAM, which is an order of magnitude slower. The code will run very slowly until that other thread goes away and you get all of the cache back.)

    The OS can’t really detect this, so the only thing to do about it is to turn off HT. The same can happen on a real multi-processor machine, and on any system where some resources are shared. But it’s annoyingly common on HT, which is why the technology hasn’t been as big a money-spinner as Intel hoped.
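The sizes in the 400×400 example above can be checked with a few lines of arithmetic. This is only a sketch of the size comparison: it assumes "1MB" means 1 MiB and that a sibling thread takes exactly half the cache, and it says nothing about the actual miss rate, which depends on the access pattern and on how the cache is organized. The helper `fits` is hypothetical.

```python
ARRAY_SIDE = 400             # 400 x 400 elements, as in the example above
BYTES_PER_INT32 = 4
CACHE_BYTES = 1024 * 1024    # assumption: "1MB" means 1 MiB

working_set = ARRAY_SIDE * ARRAY_SIDE * BYTES_PER_INT32


def fits(working_set_bytes, available_cache_bytes):
    """True if the whole working set could sit in the available cache."""
    return working_set_bytes <= available_cache_bytes


print(f"working set: {working_set} bytes")
print(f"fits in the whole cache: {fits(working_set, CACHE_BYTES)}")
print(f"fits in half the cache:  {fits(working_set, CACHE_BYTES // 2)}")
```

The array needs 640,000 bytes, which fits comfortably in the full cache but no longer fits once only 524,288 bytes are left.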

  8. Cooney says:

    The OS can’t really detect this, so the only thing to do about it is to turn off HT. The same can happen on a real multi-processor machine, and on any system where some resources are shared. But it’s annoyingly common on HT, which is why the technology hasn’t been as big a money-spinner as Intel hoped.

    I don’t see why this would be the case – most OSes use thread affinity to keep a thread on the same cpu, and most cpus (not the fakie HT ones) have their own cache. The dual core amd chips seem to be an exception, but they can share a larger cache at full speed, right?

  9. VAS says:

    If the hamburger came from Hamburg, where the heck is CHEESEBURG?

  10. Brent Dax says:

    Incidentally, the classic Unix version of this question is something like "I can use stat() to tell a soft link from a file. How can I tell a hard link from a file?" (You can’t; every entry pointing to a file, including the original one, is a hard link.)
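The point about hard links is easy to see for yourself. The following Python sketch assumes a POSIX system whose temporary directory supports both hard and symbolic links; the file names are arbitrary. It uses `os.lstat` rather than `os.stat` because `stat` follows symbolic links and would report on the target instead of the link.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.txt")
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")

    with open(original, "w") as f:
        f.write("hello\n")
    os.link(original, hard)      # a second directory entry for the same inode
    os.symlink(original, soft)   # a separate file that merely stores a path

    info_original = os.lstat(original)
    info_hard = os.lstat(hard)
    info_soft = os.lstat(soft)

    soft_is_symlink = stat.S_ISLNK(info_soft.st_mode)
    hard_is_symlink = stat.S_ISLNK(info_hard.st_mode)
    same_inode = info_original.st_ino == info_hard.st_ino
    link_count = info_original.st_nlink

print(f"soft link detected as a link: {soft_is_symlink}")
print(f"hard link detected as a link: {hard_is_symlink}")
print(f"original and hard link share an inode: {same_inode}")
print(f"link count on that inode: {link_count}")
```

The soft link is visibly a different kind of object, while the two hard links report the same inode and a link count of 2, with nothing to say which name came first.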

  11. David Heffernan says:

    Intel did add a CPUID feature bit that reports whether HT is available on the processor. The problem is that it might be on the processor but disabled by the BIOS or the OS.

    So the code on Intel’s site is what you need. It just so happens that I was porting this code to my app this week. This is what I came up with:

    ————————

    function AvailableProcessorCount: DWORD;
    //returns total number of processors available to system including logical hyperthreaded processors
    var
      i: Integer;
      ProcessAffinityMask, SystemAffinityMask: DWORD;
      Mask: DWORD;
    begin
      if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask) then begin
        Result := 0;
        for i := 0 to 31 do begin
          Mask := 1 shl i;
          if (ProcessAffinityMask and Mask)<>0 then begin
            inc(Result);
          end;
        end;
      end else begin
        //can't get the affinity mask so we just report the total number of processors
        Result := OperatingSystemInfo.ProcessorCount;
      end;
    end; (* AvailableProcessorCount *)

    function AvailableProcessorCoreCount: DWORD;
    (* Returns total number of processors available to system excluding logical hyperthreaded processors.

       We only have to do significant work for Intel processors since they are the only ones which implement
       hyperthreading.

       It's not 100% clear whether the hyperthreading bit (CPUID(1) -> EDX[28]) will be set for processors
       with multiple cores but without hyperthreading. My reading of the documentation is that it will be
       set but the code is conservative and performs the APIC ID decoding if either:
         1. The hyperthreading bit is set, or
         2. The processor reports >1 cores on the physical package.

       If either of these conditions hold then we proceed to read the APIC ID for each logical processor
       recognised by the OS. This ID can be decoded to the form (PACKAGE_ID, CORE_ID, LOGICAL_ID) where
       PACKAGE_ID identifies the physical processor package, CORE_ID identifies a physical core on that
       package and LOGICAL_ID identifies a hyperthreaded processor on that core.

       The job of this routine is therefore to count the number of unique cores, that is the number of
       unique pairs (PACKAGE_ID, CORE_ID).

       If the chip is not an Intel processor, or if it is Intel but doesn't have multiple logical processors
       on a physical package then the routine simply returns AvailableProcessorCount. *)

      function GetMaxBasicCPUIDLeaf: DWORD;
      begin
        asm
          PUSH EBX
          MOV EAX,0
          CPUID
          MOV Result,EAX
          POP EBX
        end;
      end; (* GetMaxBasicCPUIDLeaf *)

      function ProcessorPackageSupportsLogicalProcessors: Boolean;
      const
        HT_BIT = $10000000;
        FAMILY_ID = $00000F00;
        EXT_FAMILY_ID = $00F00000;
        PENTIUM4_ID = $00000F00;
      var
        VendorID: array [1..12] of char;
        RegEDX: DWORD;
        ProcessorSupportsHT: Boolean;
      begin
        ZeroMemory(@VendorID, SizeOf(VendorID));
        RegEDX := 0;
        Result := False;//may be overwritten later
        asm
          PUSH EBX
          //call CPUID with EAX=0 and record the result in VendorID
          MOV EAX,0
          CPUID
          //test the maximum basic CPUID leaf and quit if it's less than 1 which we need below
          CMP EAX,1
          JL @@quit
          //record Vendor ID
          MOV [DWORD PTR VendorID+0],EBX
          MOV [DWORD PTR VendorID+4],EDX
          MOV [DWORD PTR VendorID+8],ECX
          //call CPUID with EAX=1 and record the EDX register
          MOV EAX,1
          CPUID
          MOV RegEDX,EDX
        @@quit:
          POP EBX
        end;
        if VendorID='GenuineIntel' then begin
          if (RegEDX and HT_BIT)<>0 then begin
            Result := True;
          end;
        end;
      end; (* ProcessorPackageSupportsLogicalProcessors *)

      function GetLogicalProcessorCountPerPackage: DWORD;
      const
        NUM_LOGICAL_BITS = $00FF0000;
      var
        RegEBX: DWORD;
      begin
        asm
          PUSH EBX
          MOV EAX,1
          CPUID
          MOV RegEBX,EBX
          POP EBX
        end;
        Result := ((RegEBX and NUM_LOGICAL_BITS) shr 16);
      end; (* GetLogicalProcessorCountPerPackage *)

      function GetMaxCoresPerPackage: DWORD;
      var
        RegEAX: DWORD;
      begin
        if GetMaxBasicCPUIDLeaf>=4 then begin
          asm
            PUSH EBX
            MOV EAX,4
            MOV ECX,0
            CPUID
            MOV RegEAX,EAX
            POP EBX
          end;
          Result := (RegEAX shr 26) + 1;
        end else begin
          Result := 1;
        end;
      end; (* GetMaxCoresPerPackage *)

      function GetAPIC_ID: DWORD;
      var
        RegEBX: DWORD;
      begin
        asm
          PUSH EBX
          MOV EAX,1
          CPUID
          MOV RegEBX,EBX
          POP EBX
        end;
        Result := RegEBX shr 24;
      end; (* GetAPIC_ID *)

    var
      i: Integer;
      PackCoreList: TIntegerList;
      ThreadHandle: THandle;
      LogicalProcessorCountPerPackage, MaxCoresPerPackage, LogicalPerCore,
      APIC_ID, PACKAGE_ID, CORE_ID, LOGICAL_ID, PACKAGE_CORE_ID,
      CORE_ID_MASK, CORE_ID_SHIFT, LOGICAL_ID_MASK, LOGICAL_ID_SHIFT,
      ProcessAffinityMask, SystemAffinityMask, ThreadAffinityMask, Mask: DWORD;
    begin
      Result := 0;
      Try
        //see Intel documentation (Y:IntelIA32_manuals) for details on logical processor topology
        if OperatingSystemInfo.PlatformID=VER_PLATFORM_WIN32_NT then begin
          MaxCoresPerPackage := GetMaxCoresPerPackage;
          if ProcessorPackageSupportsLogicalProcessors or (MaxCoresPerPackage>1) then begin
            LogicalProcessorCountPerPackage := GetLogicalProcessorCountPerPackage;
            LogicalPerCore := LogicalProcessorCountPerPackage div MaxCoresPerPackage;
            LOGICAL_ID_MASK := $FF;
            LOGICAL_ID_SHIFT := 0;
            i := 1;
            while i<LogicalPerCore do begin
              i := i*2;
              LOGICAL_ID_MASK := LOGICAL_ID_MASK shl 1;
              inc(LOGICAL_ID_SHIFT);
            end;
            CORE_ID_SHIFT := 0;
            if MaxCoresPerPackage>1 then begin
              CORE_ID_MASK := LOGICAL_ID_MASK;
              i := 1;
              while i<MaxCoresPerPackage do begin
                i := i*2;
                CORE_ID_MASK := CORE_ID_MASK shl 1;
                inc(CORE_ID_SHIFT);
              end;
            end else begin
              CORE_ID_MASK := $FF;
            end;
            LOGICAL_ID_MASK := not LOGICAL_ID_MASK;
            CORE_ID_MASK := not CORE_ID_MASK;
            if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask) then begin
              ThreadHandle := GetCurrentThread;
              ThreadAffinityMask := SetThreadAffinityMask(ThreadHandle, ProcessAffinityMask);//get the current thread affinity
              if ThreadAffinityMask<>0 then begin
                Try
                  PackCoreList := TIntegerList.Create;
                  Try
                    for i := 0 to 31 do begin
                      Mask := 1 shl i;
                      if (ProcessAffinityMask and Mask)<>0 then begin
                        if SetThreadAffinityMask(ThreadHandle, Mask)<>0 then begin
                          Sleep(0);//allow OS to reschedule thread onto the selected processor
                          APIC_ID := GetAPIC_ID;
                          LOGICAL_ID := APIC_ID and LOGICAL_ID_MASK;
                          CORE_ID := (APIC_ID and CORE_ID_MASK) shr LOGICAL_ID_SHIFT;
                          PACKAGE_ID := APIC_ID shr (LOGICAL_ID_SHIFT + CORE_ID_SHIFT);
                          PACKAGE_CORE_ID := APIC_ID and (not LOGICAL_ID_MASK);//mask out LOGICAL_ID
                          //identifies the processor core - it's not a value defined by Intel, rather it's defined by us!
                          if PackCoreList.IndexOf(PACKAGE_CORE_ID)=-1 then begin
                            //count the number of unique processor cores
                            PackCoreList.Add(PACKAGE_CORE_ID)
                          end;
                        end;
                      end;
                    end;
                    Result := PackCoreList.Count;
                  Finally
                    FreeAndNil(PackCoreList);
                  End;
                Finally
                  //restore thread affinity
                  SetThreadAffinityMask(ThreadHandle, ThreadAffinityMask);
                End;
              end;
            end;
          end;
        end;
      Except
        ;//some processors don't support CPUID and so will raise exceptions when it is called
      End;
      if Result=0 then begin
        //if we haven't modified Result above, then assume that all logical processors are true physical processor cores
        Result := AvailableProcessorCount;
      end;
    end; (* AvailableProcessorCoreCount *)

    ————————

    It works (I think) but what a pain in the backside!
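The heart of the routine above is the bit-field decoding of the APIC ID, which is easier to follow without the assembly and affinity juggling around it. Here is a minimal Python sketch of just that step. It follows the same assumption as the Delphi code, namely that each field is as wide as the count rounded up to a power of two; the APIC IDs are hypothetical sample values rather than anything read from CPUID, and the function names are made up for the illustration.

```python
def field_width(count):
    """Bits needed to number `count` items, i.e. count rounded up to a power of two."""
    width = 0
    while (1 << width) < count:
        width += 1
    return width


def decode_apic_id(apic_id, logical_per_core, cores_per_package):
    """Split an APIC ID into (PACKAGE_ID, CORE_ID, LOGICAL_ID)."""
    logical_bits = field_width(logical_per_core)
    core_bits = field_width(cores_per_package)
    logical_id = apic_id & ((1 << logical_bits) - 1)
    core_id = (apic_id >> logical_bits) & ((1 << core_bits) - 1)
    package_id = apic_id >> (logical_bits + core_bits)
    return package_id, core_id, logical_id


def count_cores(apic_ids, logical_per_core, cores_per_package):
    """Count the unique (PACKAGE_ID, CORE_ID) pairs among the given APIC IDs."""
    unique_cores = set()
    for apic_id in apic_ids:
        package_id, core_id, _ = decode_apic_id(apic_id, logical_per_core, cores_per_package)
        unique_cores.add((package_id, core_id))
    return len(unique_cores)


# Hypothetical IDs: two single-core packages with two logical processors each.
SAMPLE_APIC_IDS = [0, 1, 6, 7]
print(count_cores(SAMPLE_APIC_IDS, logical_per_core=2, cores_per_package=1))
```

With the sample IDs this reports two cores for four logical processors, which is the answer the person with the dual-processor machine was really after.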

  12. C. Calculus says:

    Dual-*Xenon* machine? Dude, quit hogging the XBox 360s and sell one of those on EBay!

  13. RevMike says:

    "Incidentally, the classic Unix version of this question is something like ‘I can use stat() to tell a soft link from a file. How can I tell a hard link from a file?’ (You can’t; every entry pointing to a file, including the original one, is a hard link.)"

    The Unix way is very powerful, but it takes a little time to wrap one’s head around it. "Every attribute of a file EXCEPT its name and path are associated directly with the file through the inode? But the thing I care about most often – the file name – is dereferenced? WTF!"

  14. bluestix says:

    One word dude: Opteron

  15. Joe Huffman says:

    "mu" is good answer for people who know the definition. The lawyers (and I, even though I’m not a lawyer) use, "The question presumes facts not in evidence."

  16. Yuliy says:

    "Every attribute of a file EXCEPT its name and path are associated directly with the file through the inode? But the thing I care about most often – the file name – is dereferenced? WTF!"

    Path & Name have a *:1 relationship to inodes. Inodes are constant size, hence you can’t store all of the paths to a file in its inode.

  17. Fei Liu says:

    The Wikipedia HT page has an excellent illustration of the HT performance problem. It comes down to the false assumption by the HT pipeline that all data are immediately accessible in L1 cache. When the cache is shared and data access is delayed, the HT pipeline stalls in a disastrous way without retiring any u-ops.

  18. Peter says:

    This type of question is called a "Karlsson’s question". In the book by Astrid Lindgren, the Swedish storyteller, Karlsson asks Freken Bok: "Have you stopped drinking brandy in the morning?" If she answers "Yes, I have!", you can tell she drank brandy before. And if Freken Bok answers no, this evidently means she is drunk now. This is not a yes/no question; you should explain to the person that the premise of the question is incorrect.

  19. VAS – since the German word for cheese is kaese or käse, the place that my "name" will take you to should help.

    Not that I know what this has to do with hyperthreading, but what the heck.

  20. Rick C says:

    @Peter:

    The canonical example in the US of the question you posed is "Have you stopped beating your wife yet?"

  21. Miral says:

    All this talk of dual-core machines reminds me: any explanation on why QueryPerformanceCounter is horribly horribly b0rken on multicore machines? Even though the docs claim otherwise?

  22. Yaniv says:

    Additional considerations regarding hyperthreading performance can be found at http://msdn.microsoft.com/msdnmag/issues/05/06/HyperThreading/default.aspx

  23. A raging email thread on one of our internal aliases led me to an old blog entry about self-locking kangaroos
