See Google's Dataproc documentation to set up Dataproc on GCP and to create a Dataproc cluster. We will be using the Google Cloud SDK with the gcloud
CLI commands.
For this example, use the preview image version so that the cluster runs Spark 3. For instance, the following gcloud command creates a small Dataproc cluster called geni-cluster:
gcloud dataproc clusters create geni-cluster \
--region=asia-southeast1 \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 30 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 30 \
--image-version=preview
This could take a few minutes to run. Then access the primary node using:
gcloud compute ssh ubuntu@geni-cluster-m
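If the SSH connection fails, it may be worth confirming that the cluster has finished provisioning. The sketch below polls the cluster state with gcloud dataproc clusters describe. The function names cluster_state and wait_for_cluster, as well as the retry count and delay, are hypothetical choices for illustration and are not part of the original setup.

```shell
# Hypothetical helper: prints the state of a Dataproc cluster (e.g. RUNNING).
# Usage: cluster_state CLUSTER REGION
cluster_state() {
  gcloud dataproc clusters describe "$1" --region="$2" \
    --format='value(status.state)'
}

# Hypothetical helper: polls until the cluster reports RUNNING or the
# attempts run out. The defaults (30 attempts, 10 s apart) are assumptions.
# Usage: wait_for_cluster CLUSTER REGION [ATTEMPTS] [DELAY_SECONDS]
wait_for_cluster() {
  local cluster="$1"
  local region="$2"
  local attempts="${3:-30}"
  local delay="${4:-10}"
  local state=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(cluster_state "$cluster" "$region")"
    if [ "$state" = "RUNNING" ]; then
      echo "$cluster is RUNNING"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$cluster did not reach RUNNING (last state: ${state:-unknown})" >&2
  return 1
}

# Example call (not executed here):
#   wait_for_cluster geni-cluster asia-southeast1
```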
Java should already be installed on the primary node. Install Leiningen using:
wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein && \
sudo mv lein /usr/bin/ && \
chmod a+x /usr/bin/lein && \
lein
Then, create a templated Geni app and step into the app directory:
lein new geni app +dataproc && cd app
To spawn the Spark REPL, run:
lein spark-submit
This is a shortcut for creating an uberjar and running it using spark-submit. By default, the templated main function:
- prints the Spark configuration;
- runs a Spark ML example;
- starts an nREPL server on port 65204; and
- steps into a REPL(-y).
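Since the templated main function starts an nREPL server on port 65204, a client on a local machine could, in principle, attach to it through an SSH tunnel. The following is a minimal sketch under that assumption. The helper name nrepl_tunnel is hypothetical, and depending on the local gcloud configuration a --zone flag may also be needed.

```shell
# Hypothetical helper: forwards a local port to the same port on the
# primary node, so that a local nREPL client can reach the remote server.
# Usage: nrepl_tunnel [NODE] [PORT]
nrepl_tunnel() {
  local node="${1:-geni-cluster-m}"
  local port="${2:-65204}"
  # Arguments after "--" are passed to the underlying ssh command:
  # -N opens no remote shell; -L forwards local PORT to localhost:PORT
  # as seen from the primary node.
  gcloud compute ssh "$node" -- -N -L "${port}:localhost:${port}"
}

# Example session on the local machine (not executed here):
#   nrepl_tunnel &                       # keep the tunnel open in the background
#   lein repl :connect localhost:65204   # attach a Leiningen REPL client
```

The tunnel only works while the application started by lein spark-submit is still running on the primary node.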
Verify that spark.master is set to "yarn". To submit a standalone application, simply edit the -main function in core.clj. Remove the call to launch-repl to prevent stepping into the REPL.
Once finished with the exercise, the easiest way to clean up is to simply delete the GCP project.
Alternatively, delete the cluster using:
gcloud dataproc clusters delete geni-cluster --region=asia-southeast1
There may be dangling storage buckets that have to be deleted separately.
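One way to look for such buckets is to filter the project's bucket list by name. The sketch below assumes that Dataproc names its automatically created buckets with a dataproc-staging- or dataproc-temp- prefix. The helper name dataproc_buckets is hypothetical, and the listed buckets should be checked by hand before anything is deleted.

```shell
# Hypothetical helper: lists buckets in the current project whose names
# start with the assumed Dataproc prefixes. Prints nothing if none match.
dataproc_buckets() {
  gcloud storage ls | grep -E '^gs://dataproc-(staging|temp)-' || true
}

# Example usage (not executed here):
#   dataproc_buckets
#   # After reviewing the list, delete a bucket and its contents with:
#   # gcloud storage rm --recursive gs://BUCKET_NAME
```

Deletion is left as a manual, commented-out step on purpose, since a prefix match alone does not prove that a bucket belongs to the deleted cluster.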