SYSK 244: Writing Multi-Processor and Multi-Core Aware Applications with #pragma Directives

Imagine being able to create applications that leverage the parallel processing capabilities of computers simply by adding a #pragma directive…

 

Before you get too excited: at this time, this is only available in Visual C++. Yes, there are features C++ programmers have that C# or VB.NET developers don’t… yet. I believe the .NET concurrency team is working on making it available for .NET.

 

So, what am I talking about?

 

What if I told you that I can make the code below run 25 times faster or better with just a couple of #pragma omp statements:

for (int i = 0; i < size; i++)
{
      sum += SomeTimeConsumingFunction(a[i]);
}

 

First, if you’re not familiar with the OpenMP API, a de facto standard for writing shared-memory parallel processing applications, you may want to check out http://www.openmp.org/. In short, OpenMP is a “portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.”

 

If you’re writing your applications in VS 2005 using C++, you can define parallelizable sections of code (operations that are order-independent) with directives of the following form:

#pragma omp <directive> [clauses]
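For example, the following line (purely illustrative, not part of the samples below) combines the parallel and for directives with a couple of clauses, asking for a team of four threads and a static division of loop iterations; Work and n are placeholders:

#pragma omp parallel for num_threads(4) schedule(static)
for (int i = 0; i < n; i++)
      Work(i);   // each iteration is handled by one of the four threads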

 

Here are a couple of code samples:

 

1. Multiplication of two matrices (no caching)

 

static const __int32 size = 500;
. . .

__int32 i, j, k;

array<__int32, 2>^ a = gcnew array<__int32, 2>(size, size);
array<__int32, 2>^ b = gcnew array<__int32, 2>(size, size);
array<__int32, 2>^ c = gcnew array<__int32, 2>(size, size);
// ... fill a and b with input values ...

#pragma omp parallel for private(i, j, k) shared(a, b, c)
for (i = 0; i < size; i++)
{
      for (j = 0; j < size; j++)
      {
            for (k = 0; k < size; k++)
            {
                  c[i, j] += a[i, k] * b[k, j];
            }
      }
}

On my dual-processor laptop, the code above resulted in the following stats (debug build, running from VS):

No #pragma: 654 ms for matrix size 300
With #pragma: 496 ms for matrix size 300
No #pragma: 954 ms for matrix size 300
With #pragma: 447 ms for matrix size 300
No #pragma: 581 ms for matrix size 300
With #pragma: 434 ms for matrix size 300
No #pragma: 3804 ms for matrix size 500
With #pragma: 1585 ms for matrix size 500
No #pragma: 3779 ms for matrix size 500
With #pragma: 1538 ms for matrix size 500
No #pragma: 3108 ms for matrix size 500
With #pragma: 1608 ms for matrix size 500

 

2. Loop with a long-running function invocation

 

array<int>^ a = gcnew array<int>(size);
for (int i = 0; i < size; i++)
      a[i] = i;

int threadCount = 0;
long sum = 0;

omp_set_num_threads(50);
#pragma omp parallel shared(sum, a)
{
      threadCount = omp_get_num_threads();

      #pragma omp for reduction(+ : sum)
      for (int i = 0; i < size; i++)
      {
            sum += SomeTimeConsumingFunction(a[i]);
      }
}

. . .

private: int SomeTimeConsumingFunction(int data)
{
      System::Threading::Thread::CurrentThread->Join(20);
      return 1;
}

 

By default, I’ll get as many threads as processors… So, to further speed up this code, I indicated that I want to use up to 50 threads (see the omp_set_num_threads statement). The omp_get_num_threads() call tells me how many threads were actually used…
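For reference, here is a small, self-contained sketch (my own illustrative snippet, not part of the sample above) of the runtime calls involved: omp_get_num_procs() reports the number of processors, omp_get_max_threads() the default team size, and omp_set_num_threads() overrides it:

#include <omp.h>
#include <cstdio>

int main()
{
      printf("Processors available: %d\n", omp_get_num_procs());
      printf("Default max threads:  %d\n", omp_get_max_threads());

      // Ask for a larger team; useful when iterations mostly wait rather than compute.
      omp_set_num_threads(50);

      #pragma omp parallel
      {
            #pragma omp master
            printf("Threads in this team: %d\n", omp_get_num_threads());
      }
      return 0;
}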

 

Results? Quite impressive, in my view:

 

No #pragma: 15617 ms for array size 500

With #pragma: 533 ms for array size 500. 50 threads were used.

 

That’s about 29 times faster!

 

Now, if this post piqued your interest, look for the OpenMP directives in the Visual Studio documentation. There is so much more to OpenMP than the two small samples I’ve demonstrated, including updating memory locations atomically (like Interlocked-style statements) via the #pragma omp atomic directive, defining sections of code that can execute in parallel via #pragma omp [parallel] sections [clauses], and much, much more!
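To give a flavor of those two directives, here is a minimal, illustrative sketch (my own example, not taken from the documentation): omp atomic serializes a single memory update, much like an Interlocked increment, and omp parallel sections runs independent blocks on different threads:

#include <omp.h>
#include <cstdio>

int main()
{
      int hits = 0;

      // Every thread updates 'hits', but the atomic directive keeps each update safe.
      #pragma omp parallel for
      for (int i = 0; i < 1000; i++)
      {
            #pragma omp atomic
            hits += 1;
      }

      // Two independent blocks of work, each executed by a different thread.
      #pragma omp parallel sections
      {
            #pragma omp section
            printf("Section 1 ran on thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("Section 2 ran on thread %d\n", omp_get_thread_num());
      }

      printf("hits = %d\n", hits);
      return 0;
}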

 

Note: you must compile with the /openmp switch, and change the CLR switch from /clr:pure to /clr.
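As a rough command-line equivalent for the managed samples above (MyApp.cpp is just a placeholder name):

cl /clr /openmp /EHa MyApp.cpp

In the IDE, the corresponding settings live in the project’s C/C++ property pages (OpenMP Support under Language) and the Common Language Runtime support option.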