OozieBot: Automated Oozie Workflow and Coordinator Generation

Article
10/20/2016

Introducing OozieBot, a tool that helps customers automate Oozie job creation. Learn how to use OozieBot to generate Apache Oozie coordinators and workflows for Hive, Spark and Shell actions and run them on a Linux-based HDInsight cluster.

Introduction

Apache Oozie is a workflow and coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, Apache Spark and Apache Sqoop. It can also be used to schedule jobs that are specific to a system, such as Java programs or shell scripts.

The Oozie workflow generation article describes in detail how to generate workflows on an HDInsight cluster. However, setting up an Oozie job is a tedious and time-consuming process: you define workflows and the coordinators that enclose them, scrape through several cluster configurations to set up a properties file for the workflows to consume, set up environment variables, and copy all sources to Windows Azure Storage Blobs (WASB) or HDFS. OozieBot automates most of these tasks and provides a configurable way for you to launch the job in a few minutes.

Prerequisites

Before you begin, you must have the following:

An Azure Subscription: See Get Azure free trial
An HDInsight cluster: See Get Started with HDInsight on Linux
Python lxml: This will be installed using steps in this document
Hue [Optional]: See Install Hue on HDInsight

Where to download OozieBot and prerequisites

The OozieBot source must be downloaded from Git, and OozieBot uses the Python lxml package to build workflows. To install both on your cluster, use the following steps:

Use SSH to connect to the Linux-based HDInsight cluster: See Use SSH with Linux-based Hadoop on HDInsight from Linux, OS X or Unix
Run these commands on your prompt:

Downloading the Source

The source for OozieBot is hosted on Git. To clone the repository, run the following command:

Example Coordinator Scenario

The coordinator that we are implementing here runs a workflow once every 10 minutes for the next day. The workflow runs a Spark job that uses Spark SQL to do the following:

Drop a table called “mysampletable” if it already exists
Create an external table called “mysampletable” that contains the device platform and the number of devices using that platform. This table is stored in the default container of your WASB storage account, at wasb://<default_Container>@<storage_account>/sample/data
Extract records from hivesampletable, which is included with HDInsight, group them by device platform and write the results to mysampletable
Collect all entries from mysampletable and print them.

The Spark job described above is built as part of the spark-scala-example-0.0.1-SNAPSHOT jar included in the OozieBot resources.

Note: The tool currently supports Hive, Spark and Shell actions. Each action has an example included as part of the resources directory.

Deploying a Spark Workflow

OozieBot has custom scripts that assist in building individual workflows. In our example, we use the deploySpark script to build a Spark workflow.

Generate workflows

Each workflow can be generated using a custom script. For instance, to generate Spark workflows:

Here:

WASB_PATH: This refers to the WASB directory where Oozie Workflows and source files should be copied to. OozieBot creates this directory in the default container of your WASB storage account.

SPARK_DRIVER_CLASS: The name of the main class that you want to run from the Spark job.

SPARK_JAR: The jar file that contains your source. This jar will be copied to WASB_PATH.

This script does the following things in the background:

Generates a Spark shell action workflow as part of workflow.xml. The goal here is to parameterize this workflow to support most cases.
Generates the job.properties file by scraping through cluster-specific configurations and setting the rest of the configurable parameters to default values.
Exports OOZIE_URL, which allows you to submit Oozie jobs without explicitly specifying the Oozie path every time.
Creates a directory in WASB and copies all the necessary sources.
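The first two steps above can be sketched in Python. This is a minimal illustration, not OozieBot's actual code: it uses the standard library's xml.etree instead of lxml, it assumes the shell action runs spark-submit, and the property names sparkDriver and sparkJar as well as the default executor values are hypothetical. The element names and namespaces follow the Oozie workflow and shell-action schemas, and fs.defaultFS and yarn.resourcemanager.address are standard Hadoop configuration keys.

```python
import xml.etree.ElementTree as ET

WORKFLOW_NS = "uri:oozie:workflow:0.5"
SHELL_NS = "uri:oozie:shell-action:0.2"


def build_workflow(app_name):
    """Return workflow.xml text with a single, fully parameterized shell action."""
    app = ET.Element("workflow-app", {"xmlns": WORKFLOW_NS, "name": app_name})
    ET.SubElement(app, "start", {"to": "spark-shell"})
    action = ET.SubElement(app, "action", {"name": "spark-shell"})
    shell = ET.SubElement(action, "shell", {"xmlns": SHELL_NS})
    ET.SubElement(shell, "job-tracker").text = "${jobTracker}"
    ET.SubElement(shell, "name-node").text = "${nameNode}"
    ET.SubElement(shell, "exec").text = "spark-submit"  # assumption
    # Every tunable value is a ${...} placeholder resolved from job.properties.
    for arg in ("--class", "${sparkDriver}",
                "--master", "yarn", "--deploy-mode", "cluster",
                "--num-executors", "${numExecutors}",
                "--executor-cores", "${executorCores}",
                "--executor-memory", "${executorMemory}",
                "${sparkJar}"):
        ET.SubElement(shell, "argument").text = arg
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = (
        "Action failed: [${wf:errorMessage(wf:lastErrorNode())}]")
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


def read_hadoop_property(site_xml, name):
    """Look up one <property> value in the text of a Hadoop *-site.xml file."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    raise KeyError(name)


def build_job_properties(core_site_xml, yarn_site_xml, wasb_path, driver, jar):
    """Combine scraped cluster settings with hypothetical default values."""
    name_node = read_hadoop_property(core_site_xml, "fs.defaultFS")
    app_path = name_node + "/" + wasb_path.strip("/")
    props = {
        "nameNode": name_node,
        "jobTracker": read_hadoop_property(
            yarn_site_xml, "yarn.resourcemanager.address"),
        "oozie.wf.application.path": app_path,
        "oozie.use.system.libpath": "true",
        "sparkDriver": driver,
        # The jar is copied next to workflow.xml, so only its file name is kept.
        "sparkJar": app_path + "/" + jar.rsplit("/", 1)[-1],
        "numExecutors": "2",      # hypothetical default
        "executorCores": "2",     # hypothetical default
        "executorMemory": "1g",   # hypothetical default
    }
    return "".join(f"{key}={value}\n" for key, value in props.items())
```

Keeping every tunable value as a placeholder means the XML never has to change after it is copied to WASB; only the local properties file does.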

Note: Since the script sets up environment variables like OOZIE_URL, it must be executed using “source” so that the variables are set in your current shell.
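The reason behind this note is that a child process cannot change its parent's environment. The short Python sketch below illustrates the effect: the child sets OOZIE_URL and exits, and the parent never sees the value. The URL is a hypothetical placeholder used only for illustration.

```python
import os
import subprocess
import sys

# Hypothetical endpoint, used only for illustration.
CHILD_CODE = "import os; os.environ['OOZIE_URL'] = 'http://localhost:11000/oozie'"

os.environ.pop("OOZIE_URL", None)
# The child sets OOZIE_URL in its own environment and then exits, much like
# a script that is executed directly instead of being sourced.
subprocess.run([sys.executable, "-c", CHILD_CODE], check=True)
leaked = os.environ.get("OOZIE_URL")
print(leaked)  # None: the parent's environment is unchanged
```

Sourcing a script avoids this because the commands run inside the current shell rather than in a child process.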

Running the Oozie Workflow

This command submits the job to Oozie and returns an ID for you to monitor the job:

oozie job -config target/sparkShellSample/job.properties -run

To check the status of the job, use:

oozie job -info <JOB_ID>
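If you want to script these two commands, the following Python sketch wraps them. It assumes the oozie client is on the PATH, that OOZIE_URL is already exported, and that the CLI reports the new job on a line of the form "job: <JOB_ID>". The helper names are hypothetical.

```python
import re
import subprocess

# Assumed output format of "oozie job ... -run": a line such as "job: <JOB_ID>".
JOB_ID_PATTERN = re.compile(r"^job:\s*(\S+)\s*$", re.MULTILINE)


def parse_job_id(cli_output):
    """Extract the job ID from the text printed by 'oozie job -run'."""
    match = JOB_ID_PATTERN.search(cli_output)
    if match is None:
        raise ValueError("no job ID found in: " + repr(cli_output))
    return match.group(1)


def submit(properties_path):
    """Submit a job with the given properties file and return its ID."""
    result = subprocess.run(
        ["oozie", "job", "-config", properties_path, "-run"],
        capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)


def job_info(job_id):
    """Return the raw status report for a job."""
    result = subprocess.run(
        ["oozie", "job", "-info", job_id],
        capture_output=True, text=True, check=True)
    return result.stdout
```

For example, job_info(submit("target/sparkShellSample/job.properties")) submits the workflow and immediately fetches its status report.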

Deploying a Coordinator

In this example, we will set up a coordinator on top of the Spark workflow we just built to run it once every hour for the next day. A typical use case for such a coordinator is understanding data skew and inflow rate when data flows continuously into your source.

Generate Coordinator

The deployCoordinator script can be used to develop such coordinators.

Similarly, to run other actions, the commands are listed below:

Hive Action: source deployCoordinator <frequency> hive <WASB_PATH> HIVE_SCRIPT
Shell Action: source deployCoordinator <frequency> shell <WASB_PATH> SHELL_SCRIPT

Here:

FREQUENCY: The time interval at which the workflow should run.

HIVE_SCRIPT: The Hive source file (HQL file).

SHELL_SCRIPT: Shell Script to be run by the Oozie job.

The script does the following things in the background:

Calls deploySpark, deployHive or deployShell, depending on the type of workflow that the coordinator encompasses, to finish the tasks explained in the previous section.
Generates coordinator.xml, which defines the date/time and the action source.
Generates coordinator.properties, which contains both the properties generated in job.properties and the additional properties required for the coordinator action.
Copies coordinator specific sources to WASB_PATH.
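The two generation steps above can be sketched as follows. This is a minimal illustration rather than OozieBot's actual code: the property names frequency, startTime and workflowPath are hypothetical ("endTime" is the name used later in this article), the frequency is assumed to be in minutes, and coordinator.xml is assumed to live in the same directory as workflow.xml. The element names, the oozie.coord.application.path key and the timestamp format follow the Oozie coordinator specification.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

COORD_NS = "uri:oozie:coordinator:0.4"
OOZIE_TIME = "%Y-%m-%dT%H:%MZ"  # Oozie expects UTC timestamps in this shape


def build_coordinator(app_name):
    """Return coordinator.xml text; all schedule values are placeholders."""
    app = ET.Element("coordinator-app", {
        "xmlns": COORD_NS,
        "name": app_name,
        "frequency": "${frequency}",
        "start": "${startTime}",
        "end": "${endTime}",
        "timezone": "UTC",
    })
    action = ET.SubElement(app, "action")
    workflow = ET.SubElement(action, "workflow")
    ET.SubElement(workflow, "app-path").text = "${workflowPath}"
    return ET.tostring(app, encoding="unicode")


def build_coordinator_properties(job_properties, frequency_minutes, start, days=1):
    """Extend the workflow's properties with the coordinator's schedule.

    'start' must be a datetime in UTC.
    """
    props = dict(job_properties)
    # A submission may name only one application path, so the workflow path
    # moves to a plain property that coordinator.xml refers to.
    workflow_path = props.pop("oozie.wf.application.path")
    props.update({
        "oozie.coord.application.path": workflow_path,  # assumption: same dir
        "workflowPath": workflow_path,
        "frequency": str(frequency_minutes),
        "startTime": start.strftime(OOZIE_TIME),
        "endTime": (start + timedelta(days=days)).strftime(OOZIE_TIME),
    })
    return props
```

Calling build_coordinator_properties(props, 10, datetime.now(timezone.utc)) would describe a run every 10 minutes for one day, starting now.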

Running the Coordinator

The properties file will be saved to the target directory. To run a coordinator, use:

oozie job -config target/coordinator/coordinator.properties -run

To monitor the job, use:

oozie job -info <JOB_ID>

Sample Output

Below is a sample workflow generated by OozieBot:

A sample job.properties corresponding to the above workflow would look like this:

You should see output similar to this in your YARN stdout once the job completes:

Customizing Workflows and Coordinators

Most of these workflows and coordinators are built with default configurations to ensure smooth onboarding for most scenarios, irrespective of cluster capacity. To keep them flexible, however, we have parameterized most of the inputs to workflow.xml and coordinator.xml so that they stay as generic as possible, and we have pushed most of the customization to job.properties and coordinator.properties. The rationale is that the workflow and coordinator files must be copied to HDFS/WASB, whereas the properties file can be maintained locally.

Sample scenarios where customization can come into play are:

If we have a big enough cluster and want to provide more cores and more memory to this Oozie job.
If we want this coordinator to run for the next week instead of the next day.

Doing this only requires modifying a few properties before submitting the job.

Open the properties file using the editor of your choice: nano target/coordinator/coordinator.properties
Change “endTime” to today+7 instead of the current today+1.
Change “numExecutors”, “executorCores”, “executorMemory”, etc. to the desired values.
Save and exit the editor. In nano, use Ctrl-X, then Y and Enter to save the file.
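The same edits can also be applied programmatically. The sketch below is one possible approach, assuming the properties file uses plain key=value lines and that “endTime” uses Oozie's UTC timestamp format. The helper names and the example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def end_time_in(days, now=None):
    """Return an Oozie-style UTC timestamp 'days' days after 'now'."""
    now = now or datetime.now(timezone.utc)
    return (now + timedelta(days=days)).strftime("%Y-%m-%dT%H:%MZ")


def override_properties(text, overrides):
    """Return properties text with the given keys replaced.

    Lines for other keys, comments and blank lines are kept untouched;
    keys that were not present are appended at the end.
    """
    remaining = dict(overrides)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_property = "=" in line and not line.lstrip().startswith("#")
        if is_property and key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


def customize(path, overrides):
    """Rewrite the properties file at 'path' with the given overrides."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(override_properties(text, overrides))
```

For example, customize("target/coordinator/coordinator.properties", {"endTime": end_time_in(7), "numExecutors": "8"}) extends the run to a week and raises the executor count before the job is submitted.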

Note: If there are properties that you want to change by default across all workflows and coordinators on your cluster, you can also edit the corresponding job Python file in the src directory to reflect your customized settings.

More to follow…

Chaining multiple workflows to a coordinator
Passing custom arguments to OozieBot

Furthermore, HDI 3.5 provides a GUI-based way of building these workflows and coordinators by adding an Oozie view to Ambari. More details on that to follow.

PS: Feel free to drop in your questions and provide any feedback in the comments section.