Geni on Dataproc

Dataproc Setup

See the following guide to set up Dataproc on GCP and to create a Dataproc cluster. We will be using the Google Cloud SDK and its gcloud CLI.

For this example, use the preview image version so that the cluster runs Spark 3. For instance, the following gcloud command creates a small Dataproc cluster called geni-cluster:

gcloud dataproc clusters create geni-cluster \
    --region=asia-southeast1 \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 30 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 30 \
    --image-version=preview

This could take a few minutes to run. Then access the primary node using:

gcloud compute ssh ubuntu@geni-cluster-m
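If the SSH command fails because the cluster is still being provisioned, its state can be checked first. The following is a minimal sketch, assuming the cluster name and region from the create command above; the helper name is_ready is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether the Dataproc cluster has finished provisioning.
# CLUSTER and REGION mirror the values used in the create command.
CLUSTER="geni-cluster"
REGION="asia-southeast1"

# is_ready STATE -> exit status 0 only when STATE is RUNNING.
# (Hypothetical helper, not part of gcloud.)
is_ready() {
    [ "$1" = "RUNNING" ]
}

# Only query GCP when the gcloud CLI is actually installed.
if command -v gcloud >/dev/null 2>&1; then
    state="$(gcloud dataproc clusters describe "$CLUSTER" \
        --region="$REGION" --format='value(status.state)')"
    if is_ready "$state"; then
        echo "$CLUSTER is ready"
    else
        echo "$CLUSTER is in state: ${state:-unknown}"
    fi
fi
```

A state such as CREATING means the cluster is still being set up, so waiting and re-running the check is usually enough.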

Running Geni on Yarn

Java should already be installed on the primary node. Install Leiningen using:

wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein && \
    sudo mv lein /usr/bin/ && \
    chmod a+x /usr/bin/lein && \
    lein

Then, create a templated Geni app and step into the app directory:

lein new geni app +dataproc && cd app

To spawn the Spark REPL, run:

lein spark-submit

This is a shortcut for creating an uberjar and running it using spark-submit. By default, the templated main function:

prints the Spark configuration;
runs a Spark ML example;
starts an nREPL server on port 65204; and
steps into a REPL(-y).

Verify that spark.master is set to "yarn". To submit a standalone application, edit the -main function in core.clj and remove the call to launch-repl so that the application does not step into the REPL.
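As a cross-check from the shell on the primary node, the cluster-wide default for spark.master can be read from the Spark defaults file. This is a sketch under assumptions: the path /etc/spark/conf/spark-defaults.conf is assumed to be where the Dataproc image keeps its Spark defaults, and spark_master is a hypothetical helper. The configuration printed by the application itself remains the authoritative value.

```shell
#!/usr/bin/env bash
# Sketch: read spark.master from a Spark defaults file.
# The default path below is an assumption about the Dataproc image layout.
SPARK_DEFAULTS="${SPARK_DEFAULTS:-/etc/spark/conf/spark-defaults.conf}"

# spark_master FILE -> prints the value of the last spark.master entry in FILE.
# Accepts both "key value" and "key=value" forms; commented lines are ignored
# because their first field is not exactly "spark.master".
spark_master() {
    awk -F'[[:space:]=]+' \
        '$1 == "spark.master" { value = $2 } END { if (value != "") print value }' "$1"
}

# Only report when the file exists and is readable on this machine.
if [ -r "$SPARK_DEFAULTS" ]; then
    echo "spark.master = $(spark_master "$SPARK_DEFAULTS")"
fi
```

If the file is stored elsewhere, setting the SPARK_DEFAULTS environment variable before running the script points it at the right location.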

Cleaning Up

Once finished with the exercise, the easiest way to clean up is to simply delete the GCP project.

Alternatively, delete the cluster using:

gcloud dataproc clusters delete geni-cluster --region=asia-southeast1

There may be dangling storage buckets that have to be deleted separately.
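To find such buckets, the project's bucket list can be filtered by name. This is a minimal sketch: the dataproc-staging- and dataproc-temp- prefixes are an assumption about how Dataproc names the buckets it creates automatically, and dataproc_buckets is a hypothetical helper. Buckets created under other names will not be matched, so the full output of gsutil ls is still worth reviewing.

```shell
#!/usr/bin/env bash
# Sketch: pick out buckets that look like Dataproc staging or temp buckets.
# The name prefixes are an assumption about Dataproc's default naming.

# dataproc_buckets reads bucket URLs on stdin, one per line, and prints
# only those matching the assumed prefixes. "|| true" keeps the exit
# status at 0 when nothing matches.
dataproc_buckets() {
    grep -E '^gs://dataproc-(staging|temp)-' || true
}

# Only list buckets when the gsutil CLI is actually installed.
if command -v gsutil >/dev/null 2>&1; then
    gsutil ls | dataproc_buckets
fi
```

The script only lists candidates and deletes nothing. After checking that a listed bucket is no longer needed, it can be removed together with its contents using gsutil rm -r followed by the bucket URL.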
