Threads, Events, and Mala (in C++)


By Chen Fu and Tanj Bennett

When it comes to constructing high performance server applications, there is a decade long debate between threads vs events. And it is not getting old.  Here we examine the story from both sides, and explore how back end server programmers can get the advantages from both sides, using an open source library Mala.

Thread-per-Request

"Threads" people usually assigns a single thread to process a client request. Accessing to shared resources, such as a remote database or a local memory table, can be wrapped behind blocking interfaces to hide the synchronization details. As a result, this approach provides an intuitive sequential programming model, and helps programmers avoid the headache of parallel programming.

The "Event" people like to accuse their "Threads" friends for being wasteful. Using blocking interfaces means a thread is suspended while waiting for completion. To keep the CPU busy, the system has to maintain a big thread pool to process lots of requests in parallel.  It is difficult to tune the size of the thread pool, since the portion of suspended threads varies. To avoid the sin of letting CPU idle and letting clients wait, programmers tend to allocate thread pools larger than necessary.

The cost of a big thread pool can not be ignored. Each thread has 2MB user space stack and more in kernel space. A context switch takes a few microseconds, a non-trivial cost, considering modern I/O devices provide response in tens of microseconds. And the cost of thread scheduling increases with the total number of threads in the system.

Events

In event driven systems, the request processing logic are broken into multiple event handlers (aka. callback functions). Examples of events include completion of disk I/O, arrival of packets over the network. Whenever an event is raised, the corresponding handler is dispatched to run in a thread pool. An event handler may raise other events, or call async I/O operations that would raise events when they finish.

Aha! So these guys also use a thread pool! But the size of this thread pool is usually equals to the number of CPU cores, since event handlers never block. Smaller thread pool leads to smaller memory overhead and less number of context switches.

On the other hand, the programming model of the event driven systems is not intuitive. It forces programmers to split naturally connected logic into multiple functions at the boundary of I/O operations, hiding explicit event sequences and reducing readability along the way. As a result, event driven systems are harder to write and maintain.

Then there is concurrency to deal with. Let's say you need to read multiple files and dump their content to the console output. So you wrote a loop that issue async read on each file and pass the same call back function Output that dumps the data to console, like the code below.  When you run the code, content of multiple files are mingled together, as multiple instances of Output function are run in parallel.

// Dump data to console
void Output(void* buffer, int length)
{
  char* ptr = reinterpret_cast<char*>(buffer);
  for (int i=0; i< length; i++){
    cout << *(ptr++);
  }
} 

void DumpFiles(vector<string> allFiles)
{
  char buffer[BIG_ENOUGH]; 
  for (auto& filename : allFiles) {
    asyncReadFile(filename, buffer, &Output);
    // Output function will be triggered after file read complete
    // multiple file read may finish around the same time,
    // causing multiple instances of Output to run in parallel
  }
}

Looks like the low runtime overhead of the event-driven systems does not come free either.

C++ Lambda

C++11 introduced Lambda expressions. Now we can write anonymous function objects right where it is passed into an async operation as a call back. It would have been great if we could use Lambda expressions this way to specify the sequence of event directly in code layout, almost like in thread based system.

Ideally we can write something like the following, where the code of writing file content to the console is right after invocation of an async read. Semantically connected code are placed next to each other, making it easier to read.

void DumpFiles(vector<string> allFiles)
{
   char buffer[BIG_ENOUGH]; 
   for (auto& filename : allFiles) {
      asyncReadFile(filename, buffer, 
          [filename&]
          (void* buffer, int length){
              // logic for dumping data here
          });
   }
}

Unfortunately, the above code does not work. The lambda expression is a function object allocated on the stack of the function DumpFiles. When the async file read finishes, chances are DumpFiles already returned, the function object is not there anymore, causing a memory corruption problem. So Lambda expressions can not be used directly as event handlers.

Plus you still have to deal with concurrency.

Mala, strung beads

To reduce the headache of parallel programming, we want a certain set of semantically related event handlers to execute sequentially. This does not necessarily mean they have to be executed on a single thread. Sequential execution is a requirement. Single threaded execution, on the other hand, is only one implementation of this requirement, a pretty restrictive and costly one. We believe there are more efficient implementations.

Mala, or prayers beads, is a bunch of beads strung together, like a necklace. We use it to name our open source thread library that supports async sequential execution.

In Mala, our event handlers are called Continuations, each of which is a wrapper of a function object. Each Activity has multiple continuations. All continuations in one activity are guaranteed to execute sequentially, with no overlap. An activity acts like a string that connects multiple beads (continuations) together sequentially, forming a mala. Our implementations takes care of the memory management of lambda expressions and synchronizations, so that the programmers can write event-driven systems in a more intuitive way.

The code below illustrates how the dump files example can be implemented using Mala. The key here is function lambdaContinue. It takes two parameters. The first parameter is an activity, the second is a lambda expression. It returns a continuation that belongs to the activity in the first parameter. When this continuation is triggered, it executes the lambda expression.

void dumpFiles(vector<string> allFiles, Scheduler& scheduler)
{
   // create a "string" to string all the beads
   auto pActivity = ActivityFactory(scheduler, "Test Activity");

   // context shared by all the beads
   auto pFilesRemain = new int(allFiles.size());
   for (auto& filename : allFiles){
      // open file
      auto f = AsyncFileFactory(filename, scheduler);
      char* readbuffer = new char[f.GetSize()];
      f->Read(0, f.GetSize(), readbuffer, // read entire file to buffer

         lambdaContinue(  // create a bead as call back
            *pActivity,   // all the beads are sequentialized by pActivity
            [pActivity, pFilesRemain, f, readBuffer]
         (WorkerThread&, intptr_t status, uint32_t length)
         {
            if (status != S_OK){
               cerr << "File read error " << f.GetName() << " \n";
            }
            else {
               cout << "==== File Start: " << f.GetName() << "====\n";
               for (int i=0; i < length; i++){
                  cout << readBuffer[i];
               }
               cout << endl;
            }
            delete readBuffer;
            f->Close();
            
            if (--(*pFilesRmain) == 0){
               delete pFilesRemain;
               pActivity->Shutdown();
            }
         })); // end of the bead

   } // end for each file
}

In the code above, lambdaContinue enables the lambda expression to live beyond the scope of dumpFiles function, and specifies all of the runtime instances execute sequentially, as they belong to the same activity.  This way, contents from different files will not be mingled together.

A continuation can also be triggered by invoking pContinuation->Post(). The continuation will be executed when all the previously triggered continuations in the same activity finish. This provides a way for one activity to send a notification to another activity, by posting a continuation of the second activity.

We hope Mala helps you in making event-drive code easier to write and read. We welcome your feedback to the library and invite you to contribute.


Comments (2)

  1. Chen, have you evaluated performance on large apps?

    1. An apple to apple comparison against existing programming paradigm is difficult. It requires rewriting of existing systems under Mala at the same time preserving all other design and implementation decisions. We didn’t do that.

      However, we did implement a key-value pair storage engine using Mala, named ExaVenger (exabyte scavenger). Our engine achieved same performance with RocksDB in random key access, but can host more data when CPU and memory resources is limited, with much lower write amplification on ssd. The saving comes from replacing B-tree with a new hash based index, so that we can get rid of the sorting compaction that is CPU and memory hogging, ssd wracking. By using Mala, we have further memory saving but more importantly, really fine control on CPU usage. At the same time we can limit the reasoning of race conditions to only a few places.

Skip to main content