Running Spark on a GPU-enabled cluster with AZTK

The ability to run Spark on a GPU-enabled cluster demonstrates a unique convergence of big data and high-performance computing (HPC) technologies.

Many institutions now want to pair high-performance hardware such as GPUs with big data engines such as Spark, so that academics, data scientists and data engineering students can tackle scenarios that would be difficult to achieve without the power of the cloud.

More importantly, with the availability of cloud resources and opex cost models, this can now be done with minimal capital expenditure.

Microsoft has made significant investments in NVIDIA's latest GPU SKUs, and we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK).

Azure Distributed Data Engineering Toolkit

In a single command, AZTK allows you to provision on-demand GPU-enabled Spark clusters on top of Azure Batch's infrastructure, helping you take high-performance implementations that are usually single-node only and distribute them across your Spark cluster.
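To make the idea of distributing a single-node implementation more concrete, here is a minimal sketch in PySpark style. It is an illustration under assumptions, not AZTK's own code: the names `transform_partition` and `run_distributed` are hypothetical, and the pure-Python arithmetic is only a stand-in for a real GPU kernel. The pattern it shows is the common one of handing each Spark partition to a routine that processes a whole batch at once.

```python
from typing import Iterable, Iterator, List


def transform_partition(rows: Iterable[float], a: float = 2.0, b: float = 1.0) -> Iterator[float]:
    """Single-node routine applied to one partition of the data.

    Hypothetical placeholder: on a GPU node, this body would pass the whole
    batch to a GPU library instead of computing a * x + b in plain Python.
    """
    batch: List[float] = list(rows)  # materialize the partition once
    return iter([a * x + b for x in batch])


def run_distributed(sc, data: List[float], num_partitions: int = 4) -> List[float]:
    """Apply the routine to every partition.

    `sc` is assumed to be an existing SparkContext, for example the one a
    PySpark session on the cluster already provides.
    """
    rdd = sc.parallelize(data, num_partitions)
    # mapPartitions calls the function once per partition, not once per row,
    # so each worker can process its batch in a single call.
    return rdd.mapPartitions(transform_partition).collect()
```

Working per partition rather than per row matters for GPU work, because the cost of moving data to the device is paid once per batch instead of once per element.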

The Microsoft team has created GPU-enabled Docker images for AZTK, including a Python image that comes packaged with Anaconda, Jupyter and PySpark, and an R image that comes packaged with Tidyverse, RStudio-Server and SparklyR.

Getting Started

Additional resources