The cost-benefit analysis of bitfields for a collection of booleans

Consider a class with a bunch of BOOL members:

// no nitpicking over BOOL vs bool allowed
class Pear {
 BOOL m_peeled;
 BOOL m_sliced;
 BOOL m_pitted;
 BOOL m_rotten;

You might be tempted to convert the BOOL fields into bitfields:

class Pear {
 BOOL m_peeled:1;
 BOOL m_sliced:1;
 BOOL m_pitted:1;
 BOOL m_rotten:1;

Since a BOOL is typedef'd as INT (which on Windows platforms is a signed 32-bit integer), this takes sixteen bytes and packs them into one. That's a 93% savings! Who could complain about that?

How much did that savings cost you, and how much did you save anyway?

Let's look at the cost of that savings. Code that updated the plain BOOL m_sliced member could do it by simply storing the result into the member. Since it was a normal field, this could be accomplished directly:

  mov [ebx+01Ch], eax ; m_sliced = sliced

On the other hand, when it's a bitfield, updating it becomes trickier:

  add eax, eax        ; shift "sliced" into the correct position
  xor eax, [ebx+01Ch] ; merge "sliced" with other bits
  and eax, 2
  xor [ebx+01Ch], eax ; store the new bitfield

Exercise: Figure out how the above trick works.

Converting a BOOL to a single-bit field saved three bytes of data but cost you eight bytes of code when the member is assigned a non-constant value. Similarly, extracting the value gets more expensive. What used to be

 push [ebx+01Ch]      ; m_sliced
 call _Something@4    ; Something(m_sliced);


 mov  ecx, [ebx+01Ch] ; load bitfield value
 shl  ecx, 30         ; put bit at top
 sar  ecx, 31         ; move down and sign extend
 push ecx
 call _Something@4    ; Something(m_sliced);

The bitfield version is bigger by nine bytes.

Let's sit down and do some arithmetic. Suppose each of these bitfielded fields is accessed six times in your code, three times for writing and three times for reading. The cost in code growth is approximately 100 bytes. It won't be exactly 102 bytes because the optimizer may be able to take advantage of values already in registers for some operations, and the additional instructions may have hidden costs in terms of reduced register flexibility. The actual difference may be more, it may be less, but for a back-of-the-envelope calculation let's call it 100. Meanwhile, the memory savings was 15 byte per class. Therefore, the breakeven point is seven. If your program creates fewer than seven instances of this class, then the code cost exceeds the data savings: Your memory optimization was a memory de-optimization.

Even if you manage to come out ahead in the accounting ledger, it may be a win of just a few hundred bytes. That's an awful lot of extra hassle to save a few hundred bytes. All somebody has to do is add an icon to a dialog box and your savings will vanished.

When I see people making these sorts of micro-optimizations, sometimes I'll ask them, "How many instances of this class does the program create?" and sometimes the response will be, "Oh, maybe a half dozen. Why do you ask?"

But wait, there's more. Packing all these members into a bitfield has other costs. You lose the ability to set a hardware write breakpoint on a specific bit, since hardware breakpoints are done at the byte level (at a minimum). You also lose atomicity: An update to m_sliced will interfere with a simultaneous update to m_peeled on another thread, since the update process merges the two values and stores the result non-atomically. (Note that you also lose atomicity if you had used a byte-sized bool instead of a 32-bit BOOL because some CPU architectures such as the original Alpha AXP cannot access memory in units smaller than a DWORD.)

These are just a few things to take into account when considering whether you should change your fields to bitfields. Sure, bitfields save data memory, but you have to balance it against the cost in code size, debuggability, and reduced multithreading. If your class is going to be instantiated only a few times (and by "a few" I'm thinking less than a few thousand times), then these costs most likely exceed the savings.

Comments (68)
  1. nathan_works says:

    With the multi-threading and the second setup, no locking occurs unless the author codes in the locking ? Interesting. Wouldn’t have considered that.

  2. laonianren says:

    A further problem is that this optimization might silently break your code.

    If sliced is (say) 8 then "m_sliced = sliced" will correctly set m_sliced to TRUE in the first class, but will set m_sliced to FALSE in the bit field version.

    In other words, you can’t reliably assign a BOOL to a bit field.

  3. Adam says:

    Exercise: all but the second bit get set to 0 from the and instruction and then get set to the original bit values on the second xor.  The second bit (m_sliced) gets set to ((sliced ^ m_sliced) ^ m_sliced) = sliced, since xor is associative and x^x = 0 for all x.

    I would have implemented that taking the original value, and’ing by ~2, and or’ing by (sliced << 1) & 2, but then I realized that doing so would have required an extra register.  Even if an extra register is available, I’m not convinced my code would be faster than yours.

  4. SuperKoko says:

    There’s simplier “micro-optimization”, with no major issue: Use 8-bits BYTE rather than 32-bits BOOL.

    IMO, even for as few as 256 (very-frequently accessed) elements, it might be worth.

    1 KiB istiny, but it might save enough L1 data cache to give a small performance boost.

    For example: The EeePC has only 32KiB of L1 cache and maybe as small as 16KiB of data cache.

    [I already called out a major issue of using BYTEs in the article: Not all CPU architectures are byte-addressible. -Raymond]
  5. Frederik Slijkerman says:

    I agree that usually bitfields aren’t worth the trouble, except perhaps when you need to store a huge amount of bits in an array.

    You say “takes sixteen bytes and packs them into one” — shouldn’t this be 32, since there are 32 bits in BOOL?

    [Actually we’re both wrong. It should be “packs 32 bytes into 4.” Math is hard… -Raymond]
  6. Pierre B. says:

    Another way you lose when using bit fields is that you can often simplify code by converting the discrete BOOL into an array which let you easily apply a given algorithm to each field in loops. No looping possible over bit fields.

    (For example, you can have an array of UI elements corresponding to the array of BOOL and update all UI easily, of enable / disable UI based on BOOL, save the BOOL to persistent data, etc.)

  7. SuperKoko says:

    @Frederik Slijkerman:

    32 bits = 4 bytes.

    4 bytes * 4 = 16 bytes.

  8. Stalker says:

    I’ve never seen bit fields as a way to save data memory. Only time that I have used them (when its not because I’m just really tight on RAM) is when mapping to hardware; reading a stream of bytes into a struct with bit fields in order to be able to access individual bits, nibbles and even two- and three-bit values by logical names without using (manual) masks and shifts.

  9. Leo Davidson says:

    It’s also worth remembering that (simplifying things drastically) structures can  add padding after their members that are not 32-bits wide so using smaller built-in types doesn’t always have much/any benefit.

  10. jondr says:

    "Premature optimization is the root of all evil."  — Donald Knuth

  11. KJK::Hyperion says:

    laonianren: not if you use bool instead of BOOL. Or you can normalize the value to a C boolean (int with value 0 or 1) with the !! operator

  12. Phill says:

    The only time I’ve ever considered (and done) this was when writing a server for sending a file using UDP multicast to many recipients.

    It seperated the file into packets and assigned each a sequence number. When a recipient acknowledged receipt of the packet it flipped a bit in a byte array.

    It was faster and had a smaller footprint than using an array of booleans or ints, etc.

  13. Tim Smith says:

    I’m currently working on optimizing a program that has to worry about such things as L1 and L2 caches.  Packing data can have significant performance boosts when you find yourself (unfortunately) jumping all around memory.  But fixing the original problem of jumping all around memory is probably the better way to improve performance.  ;)

    I’ve also seen some very interesting optimizations.  Consider the following code:

    if (aFlag || bFlag) {}

    When both of these bit packed flags end up in the same longword, the optimizer ends up fetching the longword once and then testing for both bits at the same time.  Interesting, but to be perfectly honest, fragile since if you have a lot of bit fields, like we do, those two variables can easily end up crossing a longword boundary causing you to silently lose that optimization.

    In the end, I have to agree with Raymond, in general, this is a bad thing to do.  Then again, in the games industry, we commit all sorts of sins against programming for performance reasons.

  14. Frederik Slijkerman says:

    @Raymond, superkoko:

    Say you have 32 BOOLs, taking 32 * 4 = 128 bytes of space. You can fit them into 4 bytes with bitfields, so this is a saving of 128/4=32.

    So it should be “packs 128 bytes into 4” …

    [I told you math was hard… I think I should have just written “packs a bunch of stuff into less stuff.” -Raymond]
  15. Andy Simpson says:

    This is interesting because I just wrote some code for a numerical modelling project – I have an n x m array, some points of which (most of them) get numerically simulated, while the others are held fixed, with a corresponding array of bools to store which is which.

    Since I’m storing a really large number of bools (easily over 10,000 just at the moment, and the resolution of the simulation is only going up) I reckon this is still a memory win, and I don’t need multithreaded writes or individual bit debugging.

    Hopefully this is another case where bitfields are OK…

  16. asymtote says:

    In a few edge cases the improvement data cache hit rate can be worth the code bloat that bit fields incur to get a better data cache hit rate. They are truly edge cases however because the extra code typically causes a reduction in the instruction cache hit rate.

    A lot of computer industry commentators bemoan the fact that too few students learn C any more but I think the real problem is that too few people learn assembly language programming, and almost nobody studies the code emitted by compilers. Randall Hyde’s Writing Great Code books are a good place to start for those interested in self-education as is Dr Paul Carter’s PC Assembly Language introduction to the x86 instruction set.

  17. - says:

    Suppose each of these bitfielded fields is accessed six times in your code

    Unless I’m mistaken (duh!), your calculation assumes just one of the fields accessed six times. The cost of accessing each of the four fields six times would have been about 400 bytes.

    That said I generally agree with using bytes instead of INTs: realistically everything is x86/x64 now, and this architecture has excellent support for byte-sized stuff (same length opcodes). Besides, multithreaded simultaneous writes to the same class aren’t that common (and either way a performance disaster since each processor will be trashing the other’s cache – if you are going to have two threads writing at once, you better do it on different cachelines at least)

    I’ve seen performance-and-footprint-minded portable code which goes one step further: it defines types such as FAST_U16, which basically means "I need a fast integer which is at least 16-bit unsigned". If the target architecture would struggle with 16-bit sized stuff, it gets defined as a 32-bit one. Best of both worlds.

  18. Mike Gibson says:

    @ Andy Simpson

    Is memory really the factor you’re trying to minimize?  If it’s numerical modelling, then isn’t it speed?  Because that was one of Ray’s points.  Not only are bitfields often not memory wins, but they’re a definite speed loss.

  19. strik says:

    Raymond, many of your points are valid. However, you also have to note that on some architectures, there are more efficient ways to handle bit fields. For example, x86 has BT, BTS and BTR for test, set and reset bits, respectively. The only drawback is that I do not know a compiler that actually emits these assembler instructions. So, in this sense, it is rather a limitation of the compiler.

    Of course, I must admit that many of your points still remain valid. It depends upon what you/I have to achieve. For example, using a bitfield is more efficient if there are really many bits. If I need some hundreds of thousands of bits, this micro-optimisation perfectly suits my needs.

    The issue you raised is similar to people using char variables instead of ints, as they want to save memory. They never check that in general, they do not save much data memory; however, the code footprint increases drastically, as the compiler will promote the char to an int and back multiple times. So, unless you have really many char variables, it is not worth the effort, and even counter-productive.


    "I’ve never seen bit fields as a way to save data memory. Only time that I have used them (when its not because I’m just really tight on RAM) is when mapping to hardware;"

    If you do this, make sure you never have to change to another compiler, and your compiler writer never changes the way the bits are allocated in the bitfield. This is UB in C. If you ever had to change such code from a co-worker to work on another machine, you would never ever do this again.

  20. Fredy Treboux says:


    Wouldn’t it be 128 "bits" into 4??

    math, hard, indeed, i know.

  21. Stalker says:

    "If you ever had to change such code from a co-worker to work on another machine, you would never ever do this again."

    Actually, that is one of the reason we use it. This way the definitions can be the same for all platforms. (How bit fields are allocated is explicitly spelled out in the documentation for this compiler, which isn’t likely to change.) If we used masks and shifts then the definitions would have to be adapted to match that particular CPU’s native size.

  22. Gabe says:

    Mike Gibson: If you have many thousands of bitfields, you are saving enough memory to use fewer pages of RAM. Since each page could mean a pagefault, this could be quite a substantial time savings. Additionally, if you access the bits in the right pattern, having the bits all fit in cache could prevent enough cache misses to provide a speed-up.

    Keep in mind that this is the sort of situation that a SQL Server developer might be keenly aware of, but that most coders should never need to care about.

  23. Evan says:

    There’s yet another problem with "BOOL f:1", which you may see as nitpicky or may not.

    A signed integer of n bits is only guaranteed by the C or C++ standard to hold numbers from -2^(n-1) + 1 to 2^(n-1) – 1; e.g. you can only rely on a signed char to hold a number between -127 and 127. (This is so that C can be implemented on a machine that uses, e.g., sign-and-magnitude integer representations.)

    A signed four-bit bitfield is only guaranteed to be able to hold numbers between -7 and 7. But note what this means for our signed one-bit bitfield: the only value that it’s guaranteed to be able to hold is 0!

    In order to use a one-bit bitfield truly portably to any conforming C platform, you need to make it unsigned.

  24. Aleko says:

    I would have used something like this:

    enum FruitStatus {

    Peeled = 0x0001,

    Sliced = 0x0002,

    Pitted = 0x0004,

    Rotten = 0x0008


    class Pear



    int status; // this is public just for simplicity


    void main()


    Pear* pear = FruitBowl::GetPear();

    pear->status |= Sliced;

    if(pear->status & Rotten)



  25. Sergey Vlasov says:

    @strik: Bit manipulation instructions on x86 are slow, because they are implemented in microcode. They are useful mostly in cases when atomic bit modification is needed (with the LOCK prefix to ensure multiprocessor locking).

  26. Daniel says:

    I like reading these articles on programming real computers.

    Sometimes I work on embedded systems where you have, say, 128 bytes of RAM, but these are bit-addressible and there are commands to manipulate a single bit in one instruction, so you can make the best use of the tiny memory.

  27. ton says:


    I think everything else you said was correct in your last comment except for this sentence:

    "so this is a saving of 128/4=32."

    It should be calculated as 128 – 4 = 124 to report the savings in memory I suspect.

  28. someone says:

    But 4 bits cann’t be stored, so the compiler uses 8, so the saving is 128 – 8 = 120 bits!

  29. Evan says:

    Even saving just 8 bits is suspect because of alignment issues. If your struct was just those four flags, it would almost certainly be given a full word (32 bits) by the compiler anyway. If you had another int member, it would almost certainly get 64 bits. Thus the savings would probably be more like 128-32 = 96 bits.

  30. steveg says:

    Another reason not to use C bit fields (or any obscure language feature) is they make you think something like "Oh! Right, haven’t seen those since 92… how do they work again. Where’s my K+R?". Maintenance costs go up.

  31. SuperKoko says:

    @Frederik Slijkerman:

    The original example contains only four BOOL members: m_peeled, m_sliced, m_pitted and m_rotten, so that 4 bits are packed into a byte and there’re 4 padding bits. Moreover, I suspect that some compilers may pack that in 4 bytes rather than 1 byte.

    Anyway, as Raymond told us, math is hard.

  32. Kaenneth says:

    I was an admin for a multidimensional database, it was 2 GB on disk, and the server had 256 megs of ram.

    The database software would track block usage, so an entirly empty/zero block would just take a single index entry (like integer ‘-1’). a series of such blocks would also take just one entry (‘-2’ for 2 blocks, etc.)

    The database was at first setup so the pattern of full/empty blocks was like {-1,7,-1,2,-1,4,-1,8} reordering the dimensions aligning the empty blocks changed it to to be like {7,2,4,8,-4}.

    The database shrunk from 2 gigs to 500 megs, recalc times went from 2 hours to 10 minutes; and it stopped crashing everyday.

    If you are tracking a large number of bools, you should consider the higher level structure of the data, it’s almost never ‘a bunch of bools’ unless you are working with compression/crypto/rngs. Is there a bias in the values?, will there always be less trues than falses? do the values cluster/occur in runs?

  33. Simon Cooke says:

    Reasons to bit-pack:

    Cache coherency. Very important on XBOX and PS3. The smaller your object size, the more likely it is to fit in a cacheline (or the more you can fit in a cache line..)

    Reasons not to bit pack that you haven’t covered:

    GCC will happily merge bitfields across class boundaries. eg:

    class A {

      BYTE mydata : 5;


    class B : public A {

      BYTE moreData : 3;


    sizeof (B) = 1 byte, as in memory, the bits from B are merged into the bits from A. And it’ll do this with char data too on byte boundaries when they nestle up against bitfields.

    Talk about scary behavior.

  34. Robert says:

    1) If it was a universally good idea, compilers would pack the structs for you unless you specified some kind of portable struct.

    2) I work with GIS.  My current data set is 18.3 billion records.  This can save a LOT of space in data files, and disk IO’s are not cheap..

    Pick the right tools..

  35. Ken Hagan says:

    If your processors have a weakly ordered memory model, multi-threaded writes to BOOL aren’t safe either, but on Windows one can at least use InterlockedXxx() functions on BOOL-sized data. That apart, bool is a fairer comparison in C++ and char probably fairer for C. There *are* machines which can’t address bytes efficiently, but most of us will never write code for one and there are other machines that can address bits efficiently that "we" will never use either so using an int "for efficiency" strikes me as a premature optimisation in its own right!

  36. Mike Gibson says:

    @Tim Smith

    If you need performance, and you have to violate some rules of programming to do it, fine.  That rule applies anywhere, not just in games.  What people fail to understand is that you should use a profiler or other *real-world* results to figure out what you should be optimizing, not just doing it wily-nily.

    But I think that was your point anyway.

  37. Nuno Lucas says:

    I almost never use the bit struct feature, but it’s a common technique to pack a bunch of flags in an (unsigned) int.

    Things like

    flags |= FLG_1 | FLG_2;

    flags &= ~(FLG_3 | FLG2);

    etc. are really handy.

  38. mikeb says:

    Bitfields are almost always a bad idea.  People often try to use them to model fields in a network packet or hardware register, but the problem with that is that the ordering, alignment and some of the packing behavior of bits fields is largely left as ‘implementation defined’ in the standard, so your code ends up being fragile and completely non-portable.

    If you need to actually model bitfields in a network packet, file format, or hardware register – please just define constants and inline functions (or simple macros if you must) to mask, extract, and set the fields.

    Everyone will be much happier in the long run.

  39. Andy says:

    From memory way back when I seem to remember something along the lines of bitmasks being optimum only when they were the same size as the data bus, data was padded if it was smaller meaning that if you sent 4 bits over an 8 bit bus, all 8 bits were used.

    One can use bitwise operations on the decimal values to test for each bit eg.  (flag & 3) would equate to true if bits 0 and 1 where turned on.

    In assembler would this statement not equate to something like?

    (assuming 1=peeled, 2=sliced, 4=pitted, 8=…)

      and ah,al  ; assume AH is flag and al is 3

      cmp ah,al  

      je peeledAndSliced  

    So I guess the optimium number of bits one can use is dependent on the default register size for the processor mode.  On a 32bit wintel boxes I suspect much padding is performed when bitmasks are used. Could not the spare space be filled with data for another operation?

    [I have not written assembler for over 20 years, so a bit rusty].

  40. Karellen says:

    I think I’d still go for the packed representation, on the very basis that you should ignore micro-optimisations most of the time.

    The point is that a 32x memory reduction is not a micro-optimisation. A 32x memory reduction (or even a 93% reduction) is a *huge* win.

    The natural way to implement a field that holds a single bit of data should be with a data element that holds a single bit of data. If you’re writing a program and you want a single bit, you should get/be in the habit of using a single bit to represent it.

    If you later find out that you can save 100 bytes by using 32 bits per bool because you only create one of these structs and it’s only got 4 bit fields anyway, *that’s* a micro-optimisation. If you later find out that you can save 3 instructions per write and that really helps because it turns out you’re doing a lot of writes to one of these things in your inner loop, *that’s* a micro-optimisation.

    Using a single bit instead of 4 bytes to store a single bit is *not* a micro-optimisation. It’s what you *should* *be* *doing*.

    [A 32x memory reduction is impressive when the starting point was 2MB. But if your starting point is 100 bytes, a 32x memory reduction is noise. -Raymond]
  41. > Using a single bit instead of 4 bytes to store a single bit is *not* a micro-optimisation. It’s what you *should* *be* *doing*.

    I can’t agree more.

  42. Ulric says:


    > Using a single bit instead of 4 bytes to store a single bit is *not* a micro-optimisation

    As this is semantic aspect of the question, you should program in something higher level with a ‘bool’ type, but not in C with bit fields.

    Using bit fields is not a about the nature ‘types’, but matching some memory layout for embedded system, or memory saving when you know how the machine works.  Here’s the problem: bit fields are not something that’s native to the CPU, they are not made to deal with them.  They deal with bytes and DWORDS. Bit fields have all the problems listed here, including being slow to access and having problems with threads.

    If you want to ‘say’ something in your code about saving ‘a bit’, use the higher-level ‘bool’ type, and ignore the implementation.  Bit fields are not about semantics, they’re about choosing an implementation.

  43. Worf says:

    The first time I saw bitfields was in a header file defining hardware. I certainly hated it as I couldn’t figure out the ordering of the bits … (I need to set bit 21… which one was that?). And if someone forgot to union it with unsigned ints, then setting them en masse was a bit more difficult, involving interesting casts.

    Nowadays people use macros – more readable! And portable, and the bits are more obvious.

    The other time I dealt with bitfields was really fun – had a packed bitfield (no choice – all the elements had to fit in 15 bytes due to hardware), but my destination machine was big-endian. My build machine was little. Was quite interesting exploring how the bits got set where, and writing a set of mask-shifts to rearrange the bits properly for big endian.

  44. cthulhu says:

    class Pear {

    BOOL pld;

    BOOL sld;

    BOOL ptd;

    BOOL rtn;


    fixed it for you, now the variables take up half as much space.

  45. strik says:

    @Sergey Vlasov:

    Yes, I know that the bit instructions are slow like hell. I never spoke about running time. ;)

    Here, it is a matter of what is more important, memory or speed. There can be reasons to try to optimise for memory. Then, these instructions come in handy.


    "Actually, that is one of the reason we use it. This way the definitions can be the same for all platforms."

    No, they cannot be the same for all platforms. You must implement at least two versions, one for a left-to-right compiler, one for a right-to-left compiler. Then, you must determine for each platform if it belongs to the one or the other group.

    Also note that as far as I remember – but I must admit, I have not checked -, C does not even guarantee that the compiler does not add padding to the bit field. Thus, the compiler could even use only "half of the bits" in the structure. For example, a bit field of two one bit elements could be generated with the corresponding bits being 0x80 and 0x08, or similar.

    Furthermore: You cannot guarantee that accessing some a.f field, that no other fields are accessed, too. Take, for example, a bit field of 12 bit: Will the compiler access 3 byte, 4 byte, 8 byte, or whatever on access? This can make a difference for hardware (bits might get cleared on read or write, and so on).

    Thus, you loose the control over what you are doing.

  46. Memet says:

    Is that compiler code or did you write it up yourself?

    I hadn’t seen the add eax, eax to accomplish a rol operation.

    I must say: assembly language has no code of honor. Anything goes so long as it works.

    [the add instruction can execute in either the u-pipe or the v-pipe, whereas rol executes only in the u-pipe. -Raymond]
  47. lh says:

    Good points Raymond as I am currently working on something just like that.

    But let me flip your problem on end.  What if the code is running out of a rom or flash memory.  Then your heap and stack need to run in a small confined area?  You are memory constrained not ‘disk’ constrained.

    If you have plenty of flash space yet not enough memory what you are doing is a *good* idea.  If both are being loaded into normal memory then it is like you said a speed and space deoptimization.

    Remember optimization is not always about speed.  You may be optimizing to a different constraint.  In my case a small memory area with plenty of flash area mapped into memory.  I am trading speed for memory.  In this case a slow down for more memory.

    Also another thing you may want to consider (and it doesnt work on all platforms) is the packing of a struct.  You can bloat code quickly by messing with the default packing.  As the compiler will emit extra instructions to get at the missaligned data.

    I am totaly with you though.  When you optimize make sure you are doing the right thing.  In windows what you said would probably be a bad thing to do.

    When you have 2 gig of memory saving 20k of memory is not that big of deal.  When you only have 64k of memory 2k can be a big deal.

    [It’s a cost-benefit analysis. Sometimes the benefit is worth the cost; sometimes it isn’t. If you’re in some special case like embedded programming, then of course my general discussion does not apply. My point is that the benefit comes at a cost. It’s not free. -Raymond]
  48. Ken Bateman says:

    It’s true that storing each field in a byte results in fewer instructions, but if you look at it from an aggregate standpoint, using a smaller bit-based data structure will result in less cache pressure.  Depending on the large-scale access pattern, the benefit of fewer cache misses can outweigh the cost of more instructions.

    [True in the general case, but here we have only six copies of the object. How much cache pressure is there from just 6 objects? -Raymond]
  49. meh says:

    Then again, if your file format and application memory requirement balloons to 2 times its size as you port your application to a 64 bit architecture, your customers might be rightly upset.

    A lot of companies do not apparently care; they say it’s not worth saving those bytes if it costs them a days work. Oh well, once the customers start to notice its their time and money being wasted, most companies will wise up. Vista anyone?

    [If there are only six copies of the object, the memory savings in data is overcome by the memory cost in code. So packing the data makes your program consume more memory. (Did you not read the article?) -Raymond]
  50. Me says:

    Yes I read the article. Obviously there is no memory saving if you only have 1 copy of the data. Often though, you have tons of copies of the data, and only a few copies of the code. Besides, even if you are wasting a few bytes, it’ll never automatically become a huge amount, just because the user had more data than you expected.

    There is a reason why in algorithms you choose the algorithm that has the lower complexity class, even though it may be slower for small numbers. It’s the large numbers that are important anyway.

    Nobody cares if you overoptimize and lose 50 bytes. People care if you underoptimize and waste 500 megabytes.

    That’s not to say that using bitfields is always the solution though. There needs to be a compromise, looking at what features you need (atomic access, really fast access) and what they cost (memory, debuggability).

    [The article was about the case where you have only a half dozen copies. Obviously if you have 100,000 copies, then the cost/benefit analysis tips the other way. I’m sorry you read my conclusions out of context. -Raymond]
  51. xix says:

    But will you always know how many copies might be in use?

    I was reminded of this post:

    The final lesson is qualified as when writing code for a system library.  But if you’re writing userland app code that today might be invoked in a half dozen copies, it still could be invoked in a half million a month later when you weren’t involved.  I guess the lesson is to check those corners.

  52. SuperKoko says:

    @strik: BTS, BT and BTR will LOOSE code space compared to AND or TEST on the relevant byte. They save space when the bit position is dynamic.


    mystruct.bit = 1;

    may generate (if the bit is at position 2 in a DWORD):

    AND byte ptr [mem], FCh ; hexa: 80 Mod/RM (SIB) FC ; 3 or 4 bytes


    BTS dword ptr [mem], 2 ; hexa: 0F BA Mod/RM (SIB) 02 ; 4 or 5 bytes

    Fetching the value:


    TEST byte ptr [mem], FCh ; hexa: F6 Mod/RM (SIB) FC ; 3 or 4 bytes

    BT dword ptr [mem], 2 ; hexa: 0F BA Mod/RM (SIB) 02 ; 4 or 5 bytes

    With dynamic bit values, that’s even worse: A conditionnal jump has to be used for BTS/BTR, while a simple SHL+AND can do the job.

  53. Eric Preisz says:

    In my opinion, I think you would have a hard time making any perfomance claims about this code across many different processors.  

    I still have a lot to learn about micro optimizations, can someone speak on the point of the extra assembly and how instruction level parallelism affects it?  

    Isn’t throughput really the way we make code faster today?

  54. pc486 says:

    In regards to the people who claim embedded environments would justify the cost, Raymond still makes a *very* valid point. I professionally work in the embedded realm and there is often no justification for bitfields in the name of saving space. The big-name compiler I use for the 68HC11 (which shall rename nameless to protect the guilty) has a hard time optimizing for bitfields. Using the native ALU width is the better way to go. 64k of memory means there is no such thing as 100+ structures, so the code generation cost isn’t worth the effort. Bitmapped structures are useful for register access so it’s not like this stuff is totally useless.

    Honestly my only complaint in Raymond’s article is the comment about threading. C makes no guarantees about atomic access to any type of variable except for sig_atomic_t. You, the programmer, must lock that access or use a lockless structure/access protocol. Anything else should be considered a hack and not portable or maintainable. Besides, both classes in Raymond’s example would need to be marked volatile. No volatile means no threading support, period. And what if we add the volatile keyword? More code generation! Ouch!

    [Win32 makes guarantees above and beyond what C guarantees. -Raymond]
  55. Dinev says:


    How does this require an extra register to Raymond’s version?

    add eax, eax

    and dword ptr [ebx+01Ch],-3

    or  dword ptr [ebx+01Ch],eax

    Only if you want to help modern cpus by using register renaming, I guess that another register is necessary.

    [(You also need to put an “and eax, 1” at the start.) This version reads the memory twice and writes it twice. It also leaves a window where the value of the bit is 0 even if it used to be 1 and you’re setting it to 1. -Raymond]
  56. Joseph Koss says:

    Finding one aspect of something to be under-performant is no reason to abandon it. The example is essentialy the worst-case in at least four ways.

    First is that there exists only a few fields.

    Second is that there exists only a few instances.

    Third is that the existing state of the field is unknown to the compiler within that code block.

    Fourth is that the new state of the field is unknown to the compiler within that code block.

    The last two are important because its actualy rare for the compiler to be put in that situation.

  57. peterchen says:

    @Joseph, I think the point is just behind what you say. Of course it is a bad example for using bit fields.

    A simple-minded student would see the lesson: "don’t use bit fields".

    A good student would say "don’t use bit fields, when the additional code exceeds the storage savings". And would probbly receive an A for that.

    A students who prefers to think in abstractions would learn: "When making an optimization, take all tradeoffs into account." He did learn a valuable lesson indeed – maybe the most valuable of the three – but one that isn’t easily applicable, and it doesn’t help the other students.

    Somewhere, there is even a lesson in it that will enlighten all three students, help them natively understand a similar situation.

  58. There are times when bitfields are useful.  This obviously isn’t one of them.

    Examples of useful ones: accessing an I/O device when writing device drivers (or in my case emulators.)  You can map out a structure (not a class) that maps out to the bits in an I/O device and this allows you access to the individual lines.

    There are issues with this too.  Endian issues, but now you’re dealing with bitwise endian issues where on one architecture (or compiler) the order that the bits are stored in are different from another.  So you need to be careful there too.

    Otherwise, yes, if you just want a bunch of booleans, it’s not worth the trouble.

  59. code taxing says:


    Simple-minded student vs good student vs abstract student:

    What you propose are 3 code taxes. The last tax is heavy, the first tax is lightweight.

  60. ashleigh says:

    Using bitfields like this makes sense on embedded systems where RAM can be limited and expensive (eg a processor upgrade to a more expensive version to get more RAM).

    But generally when you have a big OS and lots of memory, micro-optimisations like this are a waste of time and make the program run slower.

  61. 1bit=1bit says:

    But generally when you have a big OS and lots of memory, micro-optimisations like this are a waste of time and make the program run slower.

    Ofcourse, who doesn’t agree to that? But the actual micro-optimisations in this case is to use 32 bits instead of 1 bit to store 1 bit.

  62. Igor Levicki says:

    There is a something that has been bugging me for quite some time — compilers and language standards.

    Namely, bitfield ordering is "implementation specific" as noted in C standard.

    I must admit I do not quite understand how can we call something (in this case the C/C++ Standard) a standard when it doesn’t define precise rules for every situation? How can a rule still be considered a rule when all it says is "behavior is undefined and implementation specific"? It is the same as if they said "rule is that there are no rules".

    For example, the execution of your code is guaranteed to happen in the right order regardless of the compiler optimizations, but the standard doesn’t guarantee that your variables will be created on the stack in the same order you listed them in your code (unless of course they are part of a structure). Can you say double "standard"? How about other "implementation specific" details such as vtable layout, etc?

    Furthermore, languages are not evolving with hardware. An example of this would be the inability of new[] to return __m128 aligned memory pointer unless overriden using placement new[], and even then it may fail depending on where and how the compiler stores the number of elements to delete[] later.

    Finally, I wonder if someone from Microsoft knows the reason why BOOL was never defined as an unsigned int instead? That would avoid sign extension on bit extraction from a bitfield.

  63. SuperKoko says:

    I must admit I do not quite understand how can we call something (in this case the C/C++ Standard) a standard when it doesn’t define precise rules for every situation?

    Because there are things that are best handled by platform-specific specifications. Namely: ABI.

    Would you want int to be 16 bits (as most int were in 1990) FOREVER?

    but the standard doesn’t guarantee that your variables will be created on the stack in the same order you listed them in your code

    Specifying that would forbid major optimizations such as storing variables in registers, using "holes" for alignment issues (x86 is one of the very rare platform to support unaligned data), or packing small data together.

    What about platforms where the stack grows up compared to platforms where the stack grows down?

    What would you do for IA-64 where there’re MORE than one stack?

    What about platforms where the "hardware" stack is too small (e.g. 6502 has 256 bytes of stack) and compilers want to use a larger "software" stack?

    What would be the benefit of specifying these details? It would allow brain damaged code performing mad pointer arithmetic, and would either break debugging tools or make them useless.

    If you wish to exactly control code generation, use MASM or HLA, not C.

    If you wish to know your binary interface, simply read the ABI specification for your platform. Unfortunately, Win32 has no such ABI. All other platforms I’m aware of, have one.

    Similarly, vtable layout will depends on object format (ELF vs A.OUT vs COFF) and are best specified by a separate ABI.

    Without undefined behavior, C everything would’ve to check everything… That would be huge overhead. Pointers would’ve to check memory bounds, which would make sizeof(void*) much higher.

    An example of this would be the inability of new[] to return __m128 aligned

    Compilers are required to return 128-bits aligned memory on platforms where some builtin data types require this alignment. This is enough for pure C90 code.

    If you need another alignment for a platform-specific reason, then, use a platform-specific function: e.g. posix_memalign on POSIX systems.

  64. Igor Levicki says:

    >Would you want int to be 16 bits (as most int were in 1990) FOREVER?<<

    Changing size of an integer is not what I would do.

    I would rather standardize it and call it int8, int16, int32, etc. That would make developers more aware of the data size and make them chose more carefully which type to use.

    Having char (which backfired with unicode making it !char), short and long (and long long!) is extremely stupid. And __int64 is even more stupid, and I cringe everytime I have to type:

    unsigned __int64 crap;

    I mean what those two underscores mean to ME?!? What is the point in making it harder to type when the compiler doesn’t give a damn?

    >Specifying that would forbid major optimizations such as storing variables in registers…<<

    I never said it should keep all the variables on the stack, just that they should appear in written order, and if I chose so I can always add the volatile keyword to prevent some of them from being kept in registers.

    As for the other types of stack I simply don’t care. How many people today use C on 6502 or IA64 compared to C on Core 2 Duo or Athlon 64?

    >…or packing small data together.<<

    I simply do not want the compiler doing layout and space optimizations on my variables (as opposed to "on my code") because I wasn’t explicitly asking for those.

    >Without undefined behavior, C everything would’ve to check everything<<

    I am not asking C for 6502 to support all the features of C for Win32 — what I want is a C which is at least standardized across the platform (in this case x86).

    >Compilers are required to return 128-bits aligned memory on platforms where some builtin data types require this alignment.<<

    Geee… I wonder what that datatype would be for x86 and x64? Could it be a 128-bit SIMD register with its 16 byte alignment requirement? Wait, that data type is builtin since Pentium III era, but obviously only into the CPU, not into the compiler!

    But that has been a problem only since 1999, I guess it takes more time than 9 years for C/C++ Standard Committee to move their lazy arses into high gear and finally add some support for it.

    >If you need another alignment for a platform-specific reason, then, use a platform-specific function: e.g. posix_memalign on POSIX systems.<<

    And how do you advise using those platform-specific functions for returning an array of initialized C++ objects like new[] does?

    True, I could use placement new[] but I won’t. Why would I want to litter my projects with those, when they could just add another new[] prototype with alignment parameter to the standard?

    Apart from being much better at code optimization today (and at the same time making it harder to find the compiler induced bugs, and believe me those exist!), compilers and the C/C++ language have been stagnant over the last 10 years if not even longer.

  65. SuperKoko says:

    How many people today use C on 6502 or IA64 compared to C on Core 2 Duo or Athlon 64?

    You seem not to understand that, if you don’t need portability, you are allowed to use platform-specific things.

    But, the C standard is designed to be adopted on every platform, including the most recent IA-64 and ARM CPU.

    That would make developers more aware of the data size and make them chose more carefully which type to use.

    On the other hand, that would give the Win32 fixed-type-sizes fiasco on every platform. Most ILP32 UNIX platforms have been cleanly moved to LP64, benefitting from easy-to-use clean 64 bits file pointers with the good old C90 and POSIX file interfaces.

    Your view seems to be biaised towards the Win32 platform (which, unfortunately, has so poor C & C++ implementations, without proper ABI).

    Having char (which backfired with unicode making it !char), short and long (and long long!) is extremely stupid.

    long long is brain damaged, right. It seems to have been adopted by the C standard because bad programmers (mainly Win32 programmers) are too stupid to think that long might, someday, be longer than 32 bits, and so rely so heavily on it that the Windows C implementations cannot change its size anymore.

    Any proper platform make long the widest integer type available so that sizeof(long long)==sizeof(long) in C99 mode, otherwise the compiler would break C90 compatibility by making some conversions such as "unsigned long->size_t" lossy.

    In the end, specifying fixed types might be the only way to go in a world where programmers don’t care about future.

    (If you’re a C99 user, you can right now use them).

  66. Michiel says:

    "The C standard should specify everything" is the same thinking as "640K is enough for everybody". Because the C standard should obviously then specify at which point malloc() would fail. I never expected Igor Levicki to share that view.

Comments are closed.