C++98 -> C++11: Pass by value or pass by reference?

In GoingNative 2012, there are some discussions on the new coding style for C++11. One interesting thing which is mentioned by Bjarne Stroustrup, Stephan T. Lavavej and Herb Sutter is the most efficient way to pass the argument.

In C++98, pass by reference is in general more efficient because it saves extra copies (if the copy is expensive, otherwise the extra indirection may hurt the perf instead). However, after the introduction of move semantics in C++11, the story is now a little different (Note that premature optimization doesn’t buy you anything, so the following is to give you a full picture of the new thought introduced in C++11, see the guidelines at the end and use it when appropriate)

Let’s see the following example:

#include <iostream>
#include <vector>

using namespace std;

struct String {
    String(const wchar_t *) {
        cout << “ctor” << endl;
    String(const String &) {
        cout << “copy ctor” << endl;
    String(String &&) {
        cout << “move ctor” << endl;

void f1(String s) {
    vector<String> v;
void f2(const String &s) {
    vector<String> v;

void g() {
    String s(L“”);
    cout << “f1(s)” << endl;
    cout << “f1(L””)” << endl;
    cout << “f2(s)” << endl;
    cout << “f2(L””)” << endl;

If you run the code, you will find that ‘f1’ has one less copy ctor call for rvalue argument and the same  amount of copy ctor call for lvalue argument than ‘f2’, which means ‘f1’ is more efficient (assume that move ctor is much cheaper than copy ctor).

The trick is as follows: for lvalue argument, ‘f1’ has one extra copy to pass the argument because it is by-value, while ‘f2’ has one extra copy to call push_back. So no difference; for rvalue argument, the compiler has to create a temporary ‘String(L“”)’ and pass the temporary to ‘f1’ or ‘f2’ anyway. Because ‘f1’ can take advantage of move ctor when the argument is a temporary (which is an rvalue), the costs to pass the argument are the same now for ‘f1’ and ‘f2’.

This means in C++11 we can get better performance by using pass-by-value approach when:

  1. The parameter type supports move semantics.
  2. The cost of move constructor is much cheaper than the copy constructor (both the time and stack usage).
  3. Inside the function, the parameter type will be passed to another function or operation which supports both copy and move.
  4. It is common to pass a temporary as the argument.


I learn today (April 2nd, 2013) that ‘return m;’ is in general more efficient than ‘return std::move(m);’. Both will use move constructor of Matrix, but the latter inhibits NRVO (which means the compiler can’t eliminate the call to the move constructor).

They are the same in the above example because ‘operator-‘ returns function parameter. If you return a local variable instead of a function parameter, you can save one call to the move constructor.

The criteria for copy / move constructor elision is in C++11 standard draft n3485.pdf, 12.8 / 31:

    in a return statement in a function with a class return type, when the expression is the name of a non-volatile automatic object (other than a function or catch-clause parameter) with the same cv-unqualified type as the function return type, the copy/move operation can be omitted by constructing the automatic object directly into the function’s return value

The complex two-stage overload resolution to choose move constructor when you return an lvalue object is in 12.8 / 32:

    When the criteria for elision of a copy operation are met or would be met save for the fact that the source object is a function parameter, and the object to be copied is designated by an lvalue, overload resolution to select the constructor for the copy is first performed as if the object were designated by an rvalue. If overload resolution fails, or if the type of the first parameter of the selected constructor is not an rvalue reference to the object’s type (possibly cv-qualified), overload resolution is performed again, considering the object as an lvalue. [ Note: This two-stage overload resolution must be performed regardless of whether copy elision will occur. It determines the constructor to be called if elision is not performed, and the selected constructor must be accessible even if the call is elided. —end note ]


Here is an example to explain the rules.

Matrix Add(const Matrix &m1const Matrix &m2)
    Matrix m;
    // Calculate m
    return m;
Matrix NegateByRef(const Matrix &m)
    // Calculate m 
    return m;
Matrix NegateByValue(Matrix m)
    // Calculate m 
    return m;
void NoOpByRef(const Matrix &m)
void NoOpByValue(Matrix m)

The following two tables show how many copy ctor (number before the comma) and move ctor (number after the comma) are called in each scenario.
For each scenario, the row is the function we call, the column is the argument to the function. For example, row ‘NegateByRef’ and column ‘m’ means ‘NegateByRef(m)’.
m, m1 and m2 are known matrices.

Matrix has move ctor m Add(m1,m2)
NegateByRef 1,0 1,1
NegateByValue 1,1 0,2
NoOpByRef 0,0 0,1
NoOpByValue 1,0 0,1
Matrix doesn’t have move ctor m Add(m1,m2)
NegateByRef 1,0 2,0
NegateByValue 2,0 2,0
NoOpByRef 0,0 1,0
NoOpByValue 1,0 1,0
  1. Rule 1 is the basis of our discussion and we assume that it is always true.
  2. From the two tables, you can find that the total numbers of move + copy ctor are the same in each scenario. So if rule 2 is true, it is always better to add a move ctor to the type.
  3. You can find that ‘NoOpByValue’ is always worse than ‘NoOpByRef’. This is related to rule3. Note that, rule 3 may be true even if the parameter is read only. That is, whether you modify ‘m’ in ‘Negate’ doesn’t have impact on the number of ctor called. It is only related to whether you pass the parameter further. In this case, ‘return m;’ is the operation which supports both copy and move.
  4. ‘Add’ returns rvalue (temporary is an rvalue) and ‘m’ is an lvalue. You can find that ‘NegateByValue’ is only better than ‘NegateByRef’ when rule 4 (the argument is an rvalue) is true.
Comments (2)

  1. Arash Partow says:

    Xiang, refrain from hand-waiving and simply put up benchmarks demonstrating the times between each calling convention, between various versions of compiler pre/post c++11, for varing sizes of strings (SBO et al.)

  2. xiangfan says:

    Thanks, Arash. I update the post with tables to show the actual numbers of copy and move ctor calls. They can demonstrate the differences better than the raw running time.

    If move ctor is much cheaper than copy ctor and other operations take negligible time, then the total running time is proportional to the number of copy ctor calls which is the number before comma in each cell.