Getting started with GPU acceleration for MicrosoftML's rxNeuralNet

Article
03/10/2017

MicrosoftML's rxNeuralNet model supports GPU acceleration. To enable GPU acceleration, you need to do a few things:

Install and configure CUDA on your Windows machine.

For the purpose of this setup and later performance comparison, this is the machine used in this blog.

mypcinfo

There is an excellent old blog on how to do this. However, it's slightly out of date for the purpose of our setup, so the steps are repeated here with updated product versions.

(If you were to get a Deep Learning toolkit for the DSVM which comes with R Server, Visual Studio 2015 and CUDA Toolit 8.0 preinstalled, then all you is the four dlls, basically step 3, 5 and 6. Update: Not needed any more for Deep Learning toolkit for the DSVM. )

Check to make sure your graphic card is CUDA-enabled. To find what graphic card you are using, there are two ways. One is press Win+R, type dxdiag and look under the Display tab(s). The other is open device manager and look under display adapters. Then see if it's on the list of CUDA-enabled products.
Install Visual Studio Community 2013.
Install NVidia CUDA Toolkit 6.5.
This step is optional. Use Visual Studio 2013 to build deviceQuery_vs2013.sln found in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\1_Utilities\deviceQuery and bandwidthTest_vs2013.sln found in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\1_Utilities\bandwidthTest, you need to set solution configuration to release and solution platform to x64. Once these two solutions are built successfully, open command line from C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\bin\win64\Release and run deviceQuery.exe and bandwidthTest.exe. Make sure the results are pass.
Copy cublas64_65.dll, cudart64_65.dll and cusparse64_65.dll from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin to MicrosoftML's libs directory. MicrosoftML's libs directory can be found by calling system.file("mxLibs/x64", package = "MicrosoftML").
Download NVidia cuDNN v2 Library (It's free but you need to register). Extract somewhere. Copy cudnn64_65.dll from the extracted files into MicrosoftML's libs directory.

Experiment: Performance analysis of rxNeuralNet in Microsoft ML with OR without GPU through a simple example

(By performance here we simply mean the training speed. )

We make a 10,000 by 10,000 matrix composed completely of random noise and another column of 10,000 random 0s and 1s. They are combined together as our training data. Then we simply build two rxNeuralNet models for binary classification. One with GPU acceleration and the other without. The parameter miniBatchSize is the number of training examples used to take a step in stochastic gradient descent. The bigger it is, the faster progress is made. But large step sizes can lead to difficulty for the algorithm to converge. We used ADADELTA here as our optimizer here as it can automatically adjust the learning rate. miniBatchSize is only used with GPU acceleration. Without GPU acceleration, it's by default set to 1.

Output fromrxNeuralNet with GPU acceleration

Using: GPU CUDA Math on Quadro K1200(0000:01:00), Compute Capability: 5.0, 4 SMs, Global Memory: 4,096 MB, Theoretical B/W: 80.16 GB/sec

***** Net definition *****

input Data [10000];

hidden H [100] sigmoid { // Depth 1

from Data all;

}

output Result [1] sigmoid { // Depth 0

from H all;

}

***** End net definition *****

…

Using Adadelta with decay 0.950000 and conditioner 0.000001

…

Mini-Batch Size: 256

___________________________________________________________________

Initializing 1 Hidden Layers, 1000201 Weights...

Estimated Pre-training MeanError = 0.693474

Iter:1/50, MeanErr=0.695739(0.33%%), 6335.03M WeightUpdates/sec

Iter:2/50, MeanErr=0.595985(-14.34%%), 5956.85M WeightUpdates/sec

Iter:3/50, MeanErr=0.448573(-24.73%%), 6900.08M WeightUpdates/sec

Iter:4/50, MeanErr=0.309050(-31.10%%), 6151.36M WeightUpdates/sec

Iter:5/50, MeanErr=0.184206(-40.40%%), 6673.38M WeightUpdates/sec

…

Iter:46/50, MeanErr=0.000003(-6.52%%), 6670.44M WeightUpdates/sec

Iter:47/50, MeanErr=0.000002(-5.89%%), 6436.20M WeightUpdates/sec

Iter:48/50, MeanErr=0.000002(-5.77%%), 7371.10M WeightUpdates/sec

Iter:49/50, MeanErr=0.000002(-7.83%%), 6233.96M WeightUpdates/sec

Iter:50/50, MeanErr=0.000002(-2.50%%), 6282.75M WeightUpdates/sec

Done!

Estimated Post-training MeanError = 0.000002

___________________________________________________________________

Elapsed time: 00:02:15.7357840

Output fromrxNeuralNet with SSE acceleration

Using: SSE Math

***** Net definition *****

input Data [10000];

hidden H [100] sigmoid { // Depth 1

from Data all;

}

output Result [1] sigmoid { // Depth 0

from H all;

}

***** End net definition *****

…

Using Adadelta with decay 0.950000 and conditioner 0.000001

…

___________________________________________________________________

Initializing 1 Hidden Layers, 1000201 Weights...

Estimated Pre-training MeanError = 0.693470

Iter:1/50, MeanErr=0.718174(3.56%%), 467.50M WeightUpdates/sec

Iter:2/50, MeanErr=0.630149(-12.26%%), 456.83M WeightUpdates/sec

Iter:3/50, MeanErr=0.353611(-43.88%%), 462.52M WeightUpdates/sec

Iter:4/50, MeanErr=0.108437(-69.33%%), 460.95M WeightUpdates/sec

Iter:5/50, MeanErr=0.019884(-81.66%%), 469.45M WeightUpdates/sec

…

Iter:46/50, MeanErr=0.000003(-2.81%%), 516.87M WeightUpdates/sec

Iter:47/50, MeanErr=0.000003(-2.74%%), 547.21M WeightUpdates/sec

Iter:48/50, MeanErr=0.000003(-2.65%%), 527.95M WeightUpdates/sec

Iter:49/50, MeanErr=0.000003(-2.60%%), 537.68M WeightUpdates/sec

Iter:50/50, MeanErr=0.000003(-2.53%%), 540.10M WeightUpdates/sec

Done!

Estimated Post-training MeanError = 0.000003

___________________________________________________________________

Elapsed time: 00:19:56.9637877

Both models achieved similar training errors by the end of 50 iterations. (In fact, the GPU accelerated model did slightly better.)

With GPU acceleration, the training time is 2'16'' compared to 19'57'' without GPU acceleration. The performance boost is almost 9X.

We can also see in Net Definition the structure of our neural nets. Here both models are using a neural net of an input layer of 10,000 nodes (the number of nodes in the input layer is always either equal to the number of input variables or plus one node for a bias term), a single hidden layer of 100 nodes and an output layer of one node for this binary classification problem.

In rxNeuralNet the number of nodes in the hidden layer defaults to 100. For a single hidden layer, it's usually recommended to be between the number of nodes in the input layer and number of nodes in the output layer. MicrosoftML also supports user-defined network architectures like convolution neural networks using the NET# language.

Training time vs. mini batch size

Further experiments for the same data give the following relation between training time and miniBatchSize:

traintime

Along with the learning curve for the same combinations of acceleration and miniBatchSize above.

learningcurve

REFERENCES

MicrosoftML manual

Install and Configure CUDA on Windows CUDA Installation Guide for Microsoft Windows

Getting started with GPU acceleration for MicrosoftML's rxNeuralNet

REFERENCES

Additional resources