Apache Spark (standalone)

Using Apache Spark in standalone mode is quite simple. We need one Spark master node and multiple Spark worker nodes. The Spark job is a separate node that is related to both the master and the worker nodes: the master submits the application, while the relationship to the workers exists only for ordering, because we do not want to submit a Spark job to a partially prepared cluster.

node_templates:

  ${SPARK}_master_firewall:
    type: dice.firewall_rules.spark.Master

  ${SPARK}_master_vm:
    type: dice.hosts.ubuntu.${HOST_SIZE_MASTER}
    relationships:
      - type: dice.relationships.ProtectedBy
        target: ${SPARK}_master_firewall

  ${SPARK}_master:
    type: dice.components.spark.Master
    relationships:
      - type: dice.relationships.ContainedIn
        target: ${SPARK}_master_vm

  ${SPARK}_worker_firewall:
    type: dice.firewall_rules.spark.Worker

  ${SPARK}_worker_vm:
    type: dice.hosts.ubuntu.${HOST_SIZE_WORKER}
    instances:
      deploy: ${SPARK_WORKER_COUNT}
    relationships:
      - type: dice.relationships.ProtectedBy
        target: ${SPARK}_worker_firewall

  ${SPARK}_worker:
    type: dice.components.spark.Worker
    relationships:
      - type: dice.relationships.ContainedIn
        target: ${SPARK}_worker_vm
      - type: dice.relationships.spark.ConnectedToMaster
        target: ${SPARK}_master

  ${SPARK_JOB}:
    type: dice.components.spark.Application
    properties:
      jar: ${SPARK_JOB_JAR_LOCATION}
      class: ${SPARK_JOB_CLASS}
      name: ${SPARK_JOB_NAME}
      args: ${SPARK_JOB_ARGUMENTS}
    relationships:
      - type: dice.relationships.spark.SubmittedBy
        target: ${SPARK}_master
      - type: dice.relationships.Needs
        target: ${SPARK}_worker
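
For orientation, here is a sketch of how the worker part of the blueprint might look once the template variables described below are substituted; the concrete values (SPARK set to spark, SPARK_WORKER_COUNT to 3, HOST_SIZE_WORKER to Medium) are only illustrative:

  spark_worker_vm:
    type: dice.hosts.ubuntu.Medium
    instances:
      deploy: 3                               # three worker VMs are created
    relationships:
      - type: dice.relationships.ProtectedBy
        target: spark_worker_firewall

  spark_worker:
    type: dice.components.spark.Worker
    relationships:
      - type: dice.relationships.ContainedIn
        target: spark_worker_vm               # one Spark worker per worker VM
      - type: dice.relationships.spark.ConnectedToMaster
        target: spark_master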

Template variables

SPARK
The name of the Spark cluster. This is usually set to spark, which gives us spark_master and spark_worker nodes.
SPARK_WORKER_COUNT
Number of Spark worker instances that should be created when deploying the cluster.
SPARK_JOB
The name of the Spark job that we wish to submit.
SPARK_JOB_JAR_LOCATION
Location of the Spark job jar. This can be either a URL or a relative path; in the latter case, the jar needs to be bundled with the blueprint.
SPARK_JOB_CLASS
Name of the class that should be executed when submitting the Spark job.
SPARK_JOB_NAME
Name that should be used for the application when the jar is submitted. This name can be seen in the Spark UI.
SPARK_JOB_ARGUMENTS
Array of arguments that should be passed to the jar when it is submitted. If the application takes no additional arguments, set this to [].
HOST_SIZE_MASTER, HOST_SIZE_WORKER
Sizes of the master and worker virtual machines. Available values are Small, Medium and Large.
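
Similarly, a fully substituted job node could look like the sketch below; the jar name, class, application name and arguments are hypothetical and would be replaced with the ones belonging to your application:

  spark_job:
    type: dice.components.spark.Application
    properties:
      jar: word-count.jar                     # relative path, so the jar is bundled with the blueprint
      class: com.example.WordCount            # hypothetical main class of the job
      name: word-count                        # application name shown in the Spark UI
      args: ['--input', 'hdfs:///input.txt']  # or [] if the application takes no arguments
    relationships:
      - type: dice.relationships.spark.SubmittedBy
        target: spark_master
      - type: dice.relationships.Needs
        target: spark_worker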