section example in C++ AMP

In this blog post I will give a simple example of using the section member function of array and array_view, demonstrating how to offset your origin point in order to operate on a smaller section of data in your computation. For example, if your data is a matrix that looks like this:

array_view<float, 2> qin(height, width, data);

Where height and width are divisible by 2, you can view it in four quarters as follows:

array_view<float, 2> q1 = qin.section(index<2>(0, 0), extent<2>(height/2, width/2));

array_view<float, 2> q2 = qin.section(index<2>(height/2,0), extent<2>(height/2, width/2));

array_view<float, 2> q3 = qin.section(index<2>(0,width/2), extent<2>(height/2, width/2));

array_view<float, 2> q4 = qin.section(index<2>(height/2, width/2));
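To make the origin and extent arithmetic concrete, here is a minimal CPU-only sketch of what a section amounts to on row-major storage. View2D and make_view are hypothetical names invented for this illustration; they are not part of C++ AMP, and the real array_view does considerably more (for example, managing copies to and from the accelerator).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only stand-in for a 2-D view; not part of C++ AMP.
struct View2D {
    float* base;             // first element of the underlying row-major storage
    std::size_t pitch;       // number of columns in the underlying storage
    std::size_t row0, col0;  // origin of this view within the storage
    std::size_t rows, cols;  // extent of this view

    // Element access relative to the view's own origin.
    float& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return base[(row0 + r) * pitch + (col0 + c)];
    }

    // Sub-view with an explicit origin and extent, both relative to this view.
    View2D section(std::size_t r, std::size_t c,
                   std::size_t h, std::size_t w) const {
        assert(r + h <= rows && c + w <= cols);
        return View2D{base, pitch, row0 + r, col0 + c, h, w};
    }

    // Sub-view with only an origin: the extent covers the rest of this view.
    View2D section(std::size_t r, std::size_t c) const {
        assert(r <= rows && c <= cols);
        return section(r, c, rows - r, cols - c);
    }
};

// Wrap a whole row-major buffer of the given height and width.
inline View2D make_view(std::vector<float>& data,
                        std::size_t height, std::size_t width) {
    assert(data.size() == height * width);
    return View2D{data.data(), width, 0, 0, height, width};
}
```

With a 4 x 4 buffer holding 0..15, make_view(d, 4, 4).section(2, 2) is the bottom-right quarter, and its element (0, 0) is the parent's element (2, 2).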


Below is a complete code example that sums all elements in the array_view ‘qin’ and places the result in the first element. The algorithm views the data as a two-dimensional matrix split into four quarters, and adds the four quarters element by element into a single quarter-sized output, ‘qout’. By repeating this operation with ‘qout’ becoming the new ‘qin’, it ends up with the overall reduction result in qin(0,0).


The code demonstrates the section functionality, but it is not intended to be (and indeed isn’t) an optimal implementation of a reduction algorithm (we have one of those in the pipeline). It was written simply to demonstrate usage of the section API.

  1: #include <amp.h>
  2: #include <cstdio>
  3: #include <vector>
  4: using namespace concurrency;
  5: using std::vector;
  6: int main()
  7: {
  8:   // a small data size for the example
  9:   // this sample requires width and height to be equal and a power of 2
  10:   int width = 16;
  11:   int height = 16;
  12:  
  13:   // generate dummy data
  14:   vector<float> data (width * height);
  15:  
  16:   for (int x = 0; x < (width * height); x++)
  17:   {
  18:     data[x] = x * 1.0f;
  19:   }
  20:  
  21:   // wrap data so it is ready to copy to accelerator
  22:   array_view<float,2> qin(height, width, data);
  23:  
  24:   // repeat reduction
  25:   // till data can't be reduced
  26:   while(width > 1)
  27:   {
  28:     height /= 2;
  29:     width /= 2;
  30:     extent<2> quarterdim(height, width);
  31:     array<float,2> qout(quarterdim);
  32:  
  33:     // view the data in 4 quarters
  34:     // create an array_view offset to each quarter
  35:     const array_view<const float,2> q1 =
  36:             qin.section(index<2>(0, 0) /*origin*/, quarterdim /*extent*/);
  37:     const array_view<const float,2> q2 =
  38:             qin.section(index<2>(height, 0), quarterdim);
  39:     const array_view<const float,2> q3 =
  40:             qin.section(index<2>(0, width), quarterdim);
  41:     const array_view<const float,2> q4 =
  42:             qin.section(index<2>(height, width));
  43:  
  44:     // execute the kernel to accumulate all quarters into the first one
  45:     parallel_for_each(quarterdim, [=, &qout] (index<2> idx) restrict(amp)
  46:     {
  47:       // accumulate all quarters in output quarter
  48:       // using same index but in different section
  49:       qout[idx] = q1[idx] + q2[idx] + q3[idx] + q4[idx];
  50:     });
  51:  
  52:     // use the output array as the input view
  53:     // for the next iteration
  54:     // NOTE: this doesn't sync data from the GPU to the host
  55:     qin = qout;
  56:  
  57:     // for demo only: print the intermediate
  58:     // output after every iteration
  59:     for(int y = 0; y < height; y++)
  60:     {
  61:       for (int x = 0; x < width; x++)
  62:       {
  63:         // accessing qin here forces a sync of that quarter back to the host
  64:         // this causes a performance hit
  65:         printf( "%0.1f ", qin(y, x));
  66:       }
  67:       printf("\n");
  68:     }
  69:     printf("===============================================\n");
  70:  
  71:   } // while loop
  72:  
  73:   // final summation result can be obtained from
  74:   // qin(0,0) here
  75: }
  76: // Sample print out
  77:  
  78: //272.0 276.0 280.0 284.0 288.0 292.0 296.0 300.0
  79: //336.0 340.0 344.0 348.0 352.0 356.0 360.0 364.0
  80: //400.0 404.0 408.0 412.0 416.0 420.0 424.0 428.0
  81: //464.0 468.0 472.0 476.0 480.0 484.0 488.0 492.0
  82: //528.0 532.0 536.0 540.0 544.0 548.0 552.0 556.0
  83: //592.0 596.0 600.0 604.0 608.0 612.0 616.0 620.0
  84: //656.0 660.0 664.0 668.0 672.0 676.0 680.0 684.0
  85: //720.0 724.0 728.0 732.0 736.0 740.0 744.0 748.0
  86: //===============================================
  87: //1632.0 1648.0 1664.0 1680.0
  88: //1888.0 1904.0 1920.0 1936.0
  89: //2144.0 2160.0 2176.0 2192.0
  90: //2400.0 2416.0 2432.0 2448.0
  91: //===============================================
  92: //7616.0 7680.0
  93: //8640.0 8704.0
  94: //===============================================
  95: //32640.0
  96: //===============================================
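
As a cross-check on the sample output above, the same quarter-summing reduction can be sketched on the CPU in plain C++. This is only a reference under stated assumptions: reduce_by_quarters is a hypothetical helper written for this illustration, it uses no C++ AMP at all, and it indexes the four quarters by hand instead of calling section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only reference for the quarter-summing reduction (hypothetical helper,
// no C++ AMP involved). `side` must be a power of two and `data` must hold
// side * side elements in row-major order.
inline float reduce_by_quarters(std::vector<float> data, std::size_t side) {
    assert(side > 0 && (side & (side - 1)) == 0);
    assert(data.size() == side * side);

    while (side > 1) {
        const std::size_t half = side / 2;
        std::vector<float> out(half * half);

        for (std::size_t r = 0; r < half; ++r) {
            for (std::size_t c = 0; c < half; ++c) {
                // Same index in each of the four quarters of the input.
                out[r * half + c] =
                    data[r * side + c] +                  // top-left     (q1)
                    data[(r + half) * side + c] +         // bottom-left  (q2)
                    data[r * side + (c + half)] +         // top-right    (q3)
                    data[(r + half) * side + (c + half)]; // bottom-right (q4)
            }
        }

        // The output quarter becomes the input of the next pass.
        data.swap(out);
        side = half;
    }
    return data[0];
}
```

For the 16 x 16 input used in the sample (values 0 through 255), this returns 32640, matching the last line of the print out.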

 

Observe in the sample that the array_view objects captured in the kernel need only read-only access to the data, which is why I declared them as array_view<const float,2>.

Also notice that the creation of ‘q1’ (line 35) can use another section overload to retrieve the same view, as follows:

array_view<float,2> q1 = qin.section(quarterdim);

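In this case the origin is inferred to be index<2>(0, 0). The overload that takes only an origin, used for ‘q4’ on line 42, works the other way round: the extent is inferred to cover the rest of the parent array/array_view. The origin and extent can also be passed as plain integer values: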

array_view<float,2> q1 = qin.section(0, 0, height, width);

Similarly q2 and q3 can be created using the latter section function call.

Finally, one might look closely at ‘q1’ and ask: couldn’t ‘qin’ replace it and reduce the number of lines of code? The answer is “yes”, but that would introduce a performance overhead: instead of copying 4 quarters to GPU memory, this change would copy 3 quarters plus the whole matrix. Copying data back to the host would also copy the whole matrix instead of just one quarter of it.
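
The size of that overhead is easy to put into numbers. The sketch below counts elements only, assuming (as the paragraph above describes) that each captured view is copied once in full; the two function names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Elements captured when the kernel uses four quarter-sized sections.
constexpr std::size_t elements_with_four_sections(std::size_t height,
                                                  std::size_t width) {
    return 4 * (height / 2) * (width / 2);
}

// Elements captured when the full view replaces q1:
// three quarter-sized sections plus the whole matrix.
constexpr std::size_t elements_with_full_view(std::size_t height,
                                              std::size_t width) {
    return 3 * (height / 2) * (width / 2) + height * width;
}

// For the first pass over the 16 x 16 sample: 256 elements versus 448.
static_assert(elements_with_four_sections(16, 16) == 256, "four quarters");
static_assert(elements_with_full_view(16, 16) == 448, "three quarters + whole");
```

Under that assumption, replacing ‘q1’ with ‘qin’ moves 1.75 times as many elements to the accelerator in each pass.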

That completes my example for creating sub-sections using the section member function. Feel free to ask questions in the comments section below or in our MSDN forum.