Run Hue Spark Notebook on Cloudera


When you deploy a CDH cluster using Cloudera Manager, you can use the Hue web UI to run, for example, Hive and Impala queries.  But the Spark notebook is not configured out of the box, and it turns out that installing and configuring Spark notebooks on CDH isn't as straightforward as the existing documentation suggests.  In this blog, we provide step-by-step instructions on how to enable the Hue Spark notebook with Livy on CDH.

These steps have been verified on a default deployment of a Cloudera CDH cluster on Azure.  At the time of this writing, the deployed versions are CDH 5.7, Hue 3.9, and Livy 0.2.  The steps should be similar for any CDH cluster deployed with Cloudera Manager.  Note that Livy is not yet supported by Cloudera.

1. In Cloudera Manager, go to Hue and find the host name of the Hue Server.

 

2. In Cloudera Manager, go to Hue->Configurations and search for “safety”. In Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini, add the following configuration, save the changes, and restart Hue:

[desktop]
app_blacklist=

[spark]
server_url=http://<your_hue_server>:8998/

languages='[{"name": "Scala", "type": "scala"},{"name": "Python", "type": "python"},{"name": "Impala SQL", "type": "impala"},{"name": "Hive SQL", "type": "hive"},{"name": "Text", "type": "text"}]'

 

Now if you go to the Hue web UI, you should be able to see the Spark notebook.  The Spark notebook uses Livy to submit Spark jobs, so it won’t function until Livy is running.

 

3. In the Hue web UI, go to Configuration, find hadoop_conf_dir, and note it down:

 

 

4. SSH to your Hue server. For simplicity, unless otherwise specified, we’ll run the following commands with sudo or as the root user:

# Download and unpack Livy
wget http://archive.cloudera.com/beta/livy/livy-server-0.2.0.zip
unzip livy-server-0.2.0.zip -d /<your_livy_dir>

# Set environment variables for Livy
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export HADOOP_CONF_DIR=<your hadoop_conf_dir found in the previous step in Hue configuration>
export HUE_SECRET_KEY=<your Hue superuser password; the superuser is usually the account you created the first time you logged in to the Hue web UI>

# Run Livy. You must run Livy as a user who has access to HDFS, for example, the superuser hdfs.
# Note: plain "su" (without "-") keeps the variables exported above in the environment.
su hdfs
/<your_livy_dir>/livy-server-0.2.0/bin/livy-server
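Before going back to Hue, you can sanity-check that Livy is listening by talking to its REST API directly. The following is a minimal sketch, assuming Livy is on its default port 8998 on the same host; it builds the JSON body Livy expects when creating a session (POST /sessions) and sends it with the standard library:

```python
import json
from urllib import request  # Python 3; on the Python 2 that ships with CDH 5.7 hosts, use urllib2

LIVY_URL = "http://localhost:8998"  # assumption: Livy on its default port, same host

def session_payload(kind="spark"):
    """Build the JSON body for creating a Livy session (POST /sessions)."""
    return json.dumps({"kind": kind}).encode("utf-8")

def create_session(kind="spark"):
    """POST /sessions and return the parsed response (fields such as id and state)."""
    req = request.Request(
        LIVY_URL + "/sessions",
        data=session_payload(kind),
        headers={"Content-Type": "application/json"},
    )
    return json.loads(request.urlopen(req).read().decode("utf-8"))

if __name__ == "__main__":
    # Requires a running Livy server; the response includes the new session's id and state.
    print(create_session("spark"))
```

If the server responds with a session object in a "starting" state, Livy is up; a GET on /sessions lists all existing sessions.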

 

5. Now if you go to the Hue web UI again, the warning about Spark from Step 2 should be gone. Go to the root of your Hue web UI URL and append “/notebook”. You should be able to add a notebook and run code:

 

In the following example, we added a Scala snippet and ran it in the notebook:

 

If you keep the SSH console with Livy running open, you will see a Spark job being submitted when you start running code in a notebook. If there’s any error, you will also see it in that console.

You can also add an Impala snippet and use the various graphing tools:

For more information about the Hue Spark notebook, please see gethue.com/spark.

 

Comments (7)

  1. Amit says:

    Hi,
    I followed your solution. It does now show notebook. But I run a pyspark job in notebook. It throws following error:

    “Session ‘-1’ not found.” (error 404)
    may be its related to pyspark or scala.
    I tried to run spark-shell in command line.
    It does start spark shell in command line but with following error:

    :16: error: not found: value sqlContext
    import sqlContext.implicits._
    ^
    :16: error: not found: value sqlContext
    import sqlContext.sql

    …I am using quickstart Cloudera CDH on Virtual Box.

    I would be thankful if you can help me out with this.

    Thank you.

    1. Paige Liu says:

      Sorry I have not tried CDH on virtual box. But if you can’t even use sqlContext from spark-shell, then something is probably wrong with the installation, not related to the notebook. A default sqlContext should be loaded with spark-shell. You might want to ask this question on Cloudera forum about virtual box installation.

  2. Stan Yan says:

    It’s really helpful, thanks for sharing.
    I have a question,
    when I finished running Scala script from Hue Notebook,
    in yarn log, it says that the user who submitted the job is the user who’s running Livy server,
    not the user logon in Hue, which means no matter who runs Scala script from Hue Notebook, it will be the same user on yarn.
    In this way, it’s hard to manage under multi-user situation.
    Do you have any solutions?
    Thank you.

    1. Paige Liu says:

      Interesting. Sorry I don’t have any solutions, but I think this is probably because Livy is still very young, only version 0.2. As it matures, it will have more robust security. Thanks for sharing this observation.

  3. Chris Royles says:

    Minor update, the ‘languages’ parameter does not work as posted. Please use this format instead.

    show_notebooks=true
    [[interpreters]]
    [[[hive]]]
    name=Hive
    interface=hiveserver2
    [[[impala]]]
    name=Impala
    interface=hiveserver2
    [desktop]
    use_new_editor=true
    app_blacklist=security

    1. Paige Liu says:

      What is the version of CDH, Hue, and Livy you are using? This works with the version stated at the beginning of the post.

      1. Chris Royles says:

        Fair comment, it is the latest version of HUE in CDH (3.10/3.11)

        Pulling the source though, I reference this as a cheat sheet of the options available.
        https://github.com/cloudera/hue/blob/master/desktop/conf/pseudo-distributed.ini.tmpl
