Why is Identical COMDAT Folding called Identical COMDAT Folding?


We saw a while ago that the linker will recognize that two functions consist of the same code sequence and will use the same bytes to represent both functions, a process known as identical COMDAT folding. But why is it called identical COMDAT folding?

COMDAT is short for "common data", a feature of the FORTRAN programming language.

For those of you who need some brushing up on FORTRAN: Here's a crash course in common data.

In FORTRAN 77, if you want to share variables between functions and subroutines, you put them in a so-called "common data block", usually shortened to just "common block". For example, here are two FORTRAN subroutines that share a variable called LAST:

C     THE SETLAST SUBROUTINE TAKES ITS
C     PARAMETER AND SAVES IT IN THE
C     COMMON VARIABLE "LAST"

      SUBROUTINE SETLAST(I)

C     DECLARE THE DATA TYPE OF THE PARAMETER "I"
C     AS INTEGER. THIS IS TECHNICALLY NOT NECESSARY,
C     BECAUSE VARIABLES WHOSE NAMES BEGIN WITH THE
C     LETTERS I THROUGH N DEFAULT TO INTEGER.

      INTEGER I

C     DECLARE A VARIABLE CALLED LAST AND
C     PUT IT IN A COMMON BLOCK CALLED /LASTV/

      INTEGER LAST
      COMMON /LASTV/ LAST

C     OKAY, HERE WE GO. SAVE THE VALUE.
      LAST=I
      END

C     THE GETLAST FUNCTION RETURNS THE
C     VALUE SET BY THE MOST RECENT CALL TO
C     THE SETLAST SUBROUTINE.

      INTEGER FUNCTION GETLAST()

C     DECLARE A VARIABLE CALLED LAST AND
C     PUT IT IN A COMMON BLOCK CALLED /LASTV/

      INTEGER LAST
      COMMON /LASTV/ LAST

C     RETURN THE VALUE IN "LAST". THIS VALUE
C     WAS PUT THERE BY THE SETLAST SUBROUTINE.

      GETLAST = LAST
      END

(Modern FORTRAN supports lowercase, but I grew up in the days before lowercase was invented. Writing FORTRAN code in lowercase just looks wrong to me.)

Both SETLAST and GETLAST declare a variable called LAST and put it in a common block named LASTV. The compiler matches up all common blocks with the same name and aliases them together, so that they all refer to the same block of memory.

You can put multiple variables into a common block by separating them with commas.

Note that it is conventional to give the variables in a common block the same name each time they occur. But there's no requirement that they do. You can give the variables different names each time you declare the common block:

      SUBROUTINE SETLAST(I)
      INTEGER I
      INTEGER FRED
      COMMON /LASTV/ FRED
      FRED=I
      END

      INTEGER FUNCTION GETLAST()
      INTEGER BARNEY
      COMMON /LASTV/ BARNEY
      GETLAST = BARNEY
      END

This block of code is functionally equivalent to the previous one. Here, the SETLAST subroutine calls the sole variable in the block "FRED", whereas the GETLAST function calls it "BARNEY". This is perfectly legal, albeit strange.

You aren't even required to match up the data types, as long as the total size of the common block stays the same. For example, you might say

      INTEGER*2 A
      INTEGER*2 B
      COMMON /FOURBYTES/ A, B

in one function, declaring two two-byte integers in a common block called FOURBYTES, and then in a different function, declare it like this:

      INTEGER*4 I
      COMMON /FOURBYTES/ I

The two common blocks are four bytes long, so this is perfectly legal. Of course, the results depend on the endianness of the processor.
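To see why the answer depends on endianness, here is a rough sketch in Python (not FORTRAN) that models the four bytes of the common block with the standard struct module. The function name view_as_int32 and the sample values 1 and 2 are made up for illustration; this only imitates what the two declarations do to the same memory.

```python
import struct
import sys

def view_as_int32(a, b, byteorder=sys.byteorder):
    """Store two 2-byte integers back to back, then reread the
    same four bytes as a single 4-byte integer."""
    prefix = "<" if byteorder == "little" else ">"
    raw = struct.pack(prefix + "hh", a, b)    # like INTEGER*2 A, B
    (i,) = struct.unpack(prefix + "i", raw)   # like INTEGER*4 I
    return i

# Same two values, different answers depending on byte order.
print(hex(view_as_int32(1, 2, "little")))  # 0x20001
print(hex(view_as_int32(1, 2, "big")))     # 0x10002
```

Calling view_as_int32(1, 2) with no third argument uses the byte order of whatever machine you happen to be running on.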

Okay, so anyway, FORTRAN had these weird things called "common blocks", which let multiple functions share a chunk of memory. I'm guessing that these things are what the COMDAT object file segments were originally intended for. The rule that normally applies to COMDAT sections is that if the linker sees more than one COMDAT section with the same name, it will keep one of them and throw away the rest. This is why it's important that all common blocks with the same name have the same size: You don't know which one the linker is going to use!
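The keep-one-and-discard-the-rest rule can be sketched as a toy model in Python. This is not how any particular linker is implemented. The function select_comdats, the file names a.obj and b.obj, and the choice to keep the first copy seen are all assumptions for illustration; the point is only that selection goes by name and the contents are never compared.

```python
def select_comdats(sections):
    """sections: iterable of (object_file, comdat_name, contents).
    Returns {comdat_name: (object_file, contents)}, one entry per name."""
    kept = {}
    for obj, name, contents in sections:
        # Assumption: keep the first one seen. A real linker is free
        # to pick a different one. Contents are never compared.
        kept.setdefault(name, (obj, contents))
    return kept

sections = [
    ("a.obj", "LASTV", bytes(4)),
    ("b.obj", "LASTV", bytes(4)),
    ("b.obj", "OTHER", bytes(8)),
]
kept = select_comdats(sections)
print(sorted(kept))       # ['LASTV', 'OTHER']
print(kept["LASTV"][0])   # a.obj
```

If the two LASTV entries had different sizes, the model would still quietly keep just one of them, which is exactly the hazard.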

The C++ language introduced places where the compiler may end up emitting the same code multiple times, for example, vtables and non-inline versions of inline functions. The compiler can use these old FORTRAN COMDAT segments to hold those things, and rely on the linker to keep only one copy. (Note that the linker doesn't validate that the duplicates are identical. That's yet another reason why the C++ language requires that inline functions be identically defined in all translation units.)

And finally we get to identical COMDAT folding.

The idea is to put not just inline functions and vtables in COMDAT segments. Let's just put everything into COMDAT segments. And then let's tell the linker, "Hey, if you see two COMDAT code segments that are byte-for-byte identical, then go ahead and treat them as if they were the same thing."

That's how we got to the name "identical COMDAT folding". We are taking COMDATs, looking for those which are identical, and collapsing (folding) them together.
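Here is a toy model of the folding step, again in Python and again only a sketch. The function fold_identical, the function names GetWidth, GetCount, and GetHeight, and the byte strings standing in for machine code are all made up for illustration. A real linker has more to worry about (relocations, for one); this models only the byte-for-byte comparison.

```python
def fold_identical(code_sections):
    """code_sections: {function_name: machine_code_bytes}.
    Returns {function_name: name of the function whose bytes it shares}."""
    first_with_bytes = {}
    folded = {}
    for name, code in code_sections.items():
        # The first function seen with a given byte sequence becomes
        # the canonical copy; later identical ones fold onto it.
        folded[name] = first_with_bytes.setdefault(code, name)
    return folded

code_sections = {
    "GetWidth":  bytes.fromhex("8b4104c3"),
    "GetCount":  bytes.fromhex("8b4104c3"),
    "GetHeight": bytes.fromhex("8b4108c3"),
}
print(fold_identical(code_sections))
# {'GetWidth': 'GetWidth', 'GetCount': 'GetWidth', 'GetHeight': 'GetHeight'}
```

In this model, GetCount ends up sharing GetWidth's bytes, while GetHeight differs by one byte and keeps its own copy.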

Bonus chatter: I pulled a fast one in this article. Next week, I'll come back and unwind it a little.

Comments (13)
  1. George says:

    "before lowercase was invented". Of course I know what you mean, having entered Fortran IV on punch cards as the least bad input method long ago, but now I'm picturing you using hammer and chisel to enter code on marble slabs, with the comments in Latin.

    1. DWalker07 says:

      When I was a kid, we had sticks and a string. Uphill both ways!

  2. camhusmj38 says:

    I learned Fortran 95 - so Fortran 77 always looks wrong to me. In addition to the case business, there's the abandonment of the importance of column position. Comments start with a ! and can begin anywhere. It's so much more relaxing.

  3. Jim Lyon says:

    Oh, you youngsters!
    I started with FORTRAN II, on a machine with 6-bit characters. There literally weren't enough bits in a character to afford lower case.

    1. st says:

      And 10 characters per word? Whatever happened to CDC?

  4. Andrea D'Alessandro says:

    "Modern Fortran" is an oxymoron. ;-)

    1. Ivan K says:

      And yet Olde Fortran will be a popular brand of malt liquor / fuel for miscreant robots in the 31st century... at least in the Futurama universe.

  5. Martin Bonner says:

    Nitpick: The official name of the language you refer to was "Fortran 77" (with title case). Having said that, your code is perfectly valid FORTRAN IV.

    Where did you say the nitpicker's corner was? First door on the right? I'm on my way.

  6. Ray Koopa says:

    "DECLARE THE DATA TYPE OF THE PARAMETER "I" AS INTEGER. THIS IS TECHNICALLY NOT NECESSARY, BECAUSE VARIABLES WHOSE NAMES BEGIN WITH THE LETTERS I THROUGH N DEFAULT TO INTEGER."
    I guess I finally understand why iteration counters are often simply called 'i'.

    1. parkrrrr says:

      Perhaps not entirely. The Fortran rule wasn't arbitrary. It came from the common mathematical practice of using i, j, k as indices and m, n as matrix/vector dimensions, both usages that imply integers. (I suspect L just got swept up in the tide; it's rare to see a straight (i.e. not script or bolded) L in math, because mathematicians love lowercase, and lowercase L looks like 1.)

      1. Alex Cohn says:

        👍 good answer

      2. SI says:

        I once had to figure out what some code did that used 'i' 'I' 'l' '1' interchangeably.... some days you think the previous programmer was out to get you.

  7. Austin Donnelly (MSFT) says:

    Note that plain old C also has a similar notion: globals that are not explicitly initialized are put into the .common section of the object files, and when the linker produces the final output those common sections are merged, identical symbols are folded, and they are assigned a location in the .bss section. NB: initializing a global puts it into the .data section, and those symbols are not eligible for folding: you get a link-time error.
