Recently, CDAP (Cask Data Application Platform) by Cask was added to the set of applications available for installation on HDInsight clusters. This blog post illustrates the scenarios enabled by using CDAP on HDInsight.
HDInsight application platform
Azure HDInsight recently announced an easy way to distribute, discover, and install applications that are built for the Apache Hadoop ecosystem. The HDInsight team partnered with Cask to add the CDAP application to its platform.
For more information about the integration of CDAP with HDInsight, please read the blog post titled “Integrating CDAP with Microsoft Azure HDInsight” by Derek Wood, DevOps Engineer at Cask.
Challenges in the traditional Hadoop world
Developing applications in the traditional Hadoop world is not an easy task. Listed below are some key aspects that add to the challenges faced by a Hadoop developer:
- Over the past few years, the increased interest in the Big Data space has resulted in a technology explosion in the Hadoop ecosystem. It has become progressively more difficult to keep track of all the existing technologies as well as the new ones that keep appearing.
- Simple processes like data ingestion and ETL require a complicated setup which is not generally extensible or reusable.
- Apart from the significant learning curve involved in using each of the different Hadoop technologies, a substantial amount of time is spent integrating all of them to form a data processing solution.
- Moving from a proof-of-concept solution to a production-ready one is far from trivial. It involves multiple iterations and can make delivery times increasingly unpredictable.
- It is hard to locate data and trace its flow in an application. Collecting metrics and auditing is generally a challenge and often requires building a separate solution.
How does CDAP help?
CDAP (Cask Data Application Platform) is a unified integration platform for big data. The highlight of CDAP is that a user can focus on building applications rather than on the underlying infrastructure and integration.
CDAP works with high-level concepts and abstractions that are familiar to developers and empowers them to use their existing skills to build new solutions. These abstractions hide the complexities of internal systems and encourage reusability of solutions.
An extension called Cask Hydrator is available in CDAP, which provides a rich user interface to develop and manage data pipelines. A data pipeline is composed of plugins that perform tasks such as data acquisition, transformation, analysis, and post-run operations.
Each CDAP plugin has well-defined interfaces which essentially means that evaluating different technologies would just be a matter of replacing a plugin with another one – there is no need to touch the rest of the application.
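To make the plugin-swapping idea concrete, here is a minimal plain-Python sketch. It is not CDAP or Hydrator code: the `Sink` contract, the two sink classes, and `run_pipeline` are hypothetical names invented for illustration. It only shows how a well-defined interface lets one component be replaced without touching the rest of the application.

```python
from typing import Iterable, Protocol


class Sink(Protocol):
    """Hypothetical plugin contract: anything that offers write()."""

    def write(self, records: Iterable[dict]) -> None: ...


class ListSink:
    """Stand-in for one storage technology: keeps records in a list."""

    def __init__(self) -> None:
        self.rows = []

    def write(self, records: Iterable[dict]) -> None:
        self.rows.extend(records)


class CsvLinesSink:
    """Stand-in for another technology: renders records as CSV-style lines."""

    def __init__(self) -> None:
        self.lines = []

    def write(self, records: Iterable[dict]) -> None:
        for record in records:
            self.lines.append(",".join(str(value) for value in record.values()))


def run_pipeline(source: Iterable[dict], sink: Sink) -> None:
    """The rest of the 'application': it depends only on the Sink contract."""
    cleaned = ({**record, "user": record["user"].lower()} for record in source)
    sink.write(cleaned)


# Made-up sample records.
events = [{"user": "Alice", "clicks": 3}, {"user": "BOB", "clicks": 5}]

list_sink = ListSink()
run_pipeline(events, list_sink)

csv_sink = CsvLinesSink()
run_pipeline(events, csv_sink)  # same pipeline code, different plugin
```

Only the object passed as `sink` changes between the two runs; `run_pipeline` itself is untouched, which is the property the paragraph above describes.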
CDAP pipelines provide a high-level pictorial view of the data flow in your application. Developers can easily visualize the end-to-end flow and every processing step: ingestion, the various transformations and analyses performed on the data, and the eventual write to an external data store.
Here is an example of a data pipeline that ingests Twitter data in real time, filters out tweets based on pre-defined criteria, transforms and projects the data into a more readable format, groups the records by a set of values, and writes the results to an HBase store.
The end-to-end pipeline was completely built using the Cask Hydrator UI, utilizing its plugin interface and drag-and-drop functionality to form connections between each stage. It is easy to isolate and modify the functionality of each plugin independent of the rest of the pipeline. Using CDAP, similar pipelines can be built and validated in less than a couple of hours. In the traditional Hadoop world, constructing such solutions could easily take a few days.
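For readers who prefer code to diagrams, the logical stages of that pipeline (filter, project, group, write) can be sketched in plain Python. This is only an illustration of the stage sequence, not what Hydrator generates or runs: the sample tweets, the English-only filter criterion, the grouping key, and the dict standing in for the HBase table are all assumptions.

```python
from collections import Counter

# Made-up sample input standing in for a real-time tweet stream.
tweets = [
    {"id": 1, "lang": "en", "user": "ann", "text": "Trying CDAP on HDInsight"},
    {"id": 2, "lang": "fr", "user": "luc", "text": "Bonjour Hadoop"},
    {"id": 3, "lang": "en", "user": "bob", "text": "HBase cluster is up"},
    {"id": 4, "lang": "en", "user": "ann", "text": "Pipelines without glue code"},
]


def keep(tweet):
    """Filter stage: an assumed criterion (English tweets only)."""
    return tweet["lang"] == "en"


def project(tweet):
    """Transform/project stage: keep only the readable fields."""
    return {"user": tweet["user"], "text": tweet["text"]}


def group_by_user(records):
    """Group stage: count records per user (assumed grouping key)."""
    return Counter(record["user"] for record in records)


def write(counts, store):
    """Sink stage: a plain dict stands in for the HBase table."""
    for user, count in counts.items():
        store[user] = count


store = {}
filtered = filter(keep, tweets)
projected = map(project, filtered)
write(group_by_user(projected), store)
print(store)
```

Each stage is an independent function, so any one of them can be modified or replaced without changing the others, mirroring how individual plugins are isolated in the pipeline.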
Additionally, CDAP provides an extension called Cask Tracker, where you can visually trace the data as it flows through the application. Cask Tracker adds data governance to the system so that data assets are formally managed throughout the application. You can track the data's lineage, collect relevant metrics, and audit the data trail throughout the process.
Here is an illustration of how data is flowing in the above pipeline:
How can I install CDAP on HDInsight?
CDAP is available to install on Linux clusters of type HBase with HDI version 3.4. The application can be installed during cluster creation or added to an existing cluster.
Install CDAP on a new HDInsight cluster:
- Navigate to the Azure management portal and choose the option to create a new HDInsight cluster. While configuring the cluster, set Cluster Type to ‘HBase’, Operating System to ‘Linux’, and HDI Version to ‘3.4’. Note: The CDAP application requires at least 16 GB of memory and 16 cores, so ensure that your cluster meets these requirements.
- In the next step, click ‘Applications’ to show the list of applications that can be installed on the cluster. Select ‘CDAP’ and accept the legal terms.
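The sizing note in the first step can be turned into a quick pre-check before creating the cluster. The sketch below is a hypothetical helper, not part of any Azure or CDAP tooling. The post does not say whether the 16 GB and 16 cores apply per node or to the cluster as a whole; this sketch assumes a cluster-wide total, and the node sizes are made up.

```python
from dataclasses import dataclass

# Thresholds quoted in the sizing note.
MIN_MEMORY_GB = 16
MIN_CORES = 16


@dataclass
class Node:
    """Hypothetical description of one cluster node."""
    memory_gb: int
    cores: int


def meets_cdap_minimum(nodes):
    """Return True if summed memory and cores reach both thresholds.

    Assumption: the requirement is interpreted as a cluster-wide total.
    """
    total_memory = sum(node.memory_gb for node in nodes)
    total_cores = sum(node.cores for node in nodes)
    return total_memory >= MIN_MEMORY_GB and total_cores >= MIN_CORES


# Made-up node sizes for illustration.
small = [Node(memory_gb=7, cores=2)] * 2
larger = [Node(memory_gb=14, cores=4)] * 4

small_ok = meets_cdap_minimum(small)
larger_ok = meets_cdap_minimum(larger)
print(small_ok, larger_ok)
```

If the requirement turns out to be per node, the check would compare each node against the thresholds instead of the sums.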
Install CDAP on an existing cluster:
- Navigate to your existing HBase cluster in the Azure portal and click ‘Applications’, as shown below.
- The applications pane opens and lists the applications installed on the cluster. Click the ‘Add’ button to list the applications that can be installed on this cluster. Select CDAP and accept the legal terms.
Using the CDAP application
- Once the application has been installed successfully using the above steps, click the ‘Portal’ link that appears next to CDAP in the applications pane to bring up the CDAP portal.
- You will be taken to the CDAP site and prompted to enter credentials. Enter the cluster login username and password that you configured while creating the cluster. You should then see the CDAP website, where you can start using the application.