Apache Spark (standalone)¶
Using Apache Spark in standalone mode is quite simple. We need one Spark master node and multiple Spark worker nodes. The Spark job is a separate node that is related to both the master and the worker nodes: the master submits the application, while the relationship to the workers exists only for ordering, since we do not want to submit a Spark job to a partially prepared cluster.
node_templates:
  ${SPARK}_master_firewall:
    type: dice.firewall_rules.spark.Master

  ${SPARK}_master_vm:
    type: dice.hosts.ubuntu.${HOST_SIZE_MASTER}
    relationships:
      - type: dice.relationships.ProtectedBy
        target: ${SPARK}_master_firewall

  ${SPARK}_master:
    type: dice.components.spark.Master
    relationships:
      - type: dice.relationships.ContainedIn
        target: ${SPARK}_master_vm

  ${SPARK}_worker_firewall:
    type: dice.firewall_rules.spark.Worker

  ${SPARK}_worker_vm:
    type: dice.hosts.ubuntu.${HOST_SIZE_WORKER}
    instances:
      deploy: ${SPARK_WORKER_COUNT}
    relationships:
      - type: dice.relationships.ProtectedBy
        target: ${SPARK}_worker_firewall

  ${SPARK}_worker:
    type: dice.components.spark.Worker
    relationships:
      - type: dice.relationships.ContainedIn
        target: ${SPARK}_worker_vm
      - type: dice.relationships.spark.ConnectedToMaster
        target: ${SPARK}_master

  ${SPARK_JOB}:
    type: dice.components.spark.Application
    properties:
      jar: ${SPARK_JOB_JAR_LOCATION}
      class: ${SPARK_JOB_CLASS}
      name: ${SPARK_JOB_NAME}
      args: ${SPARK_JOB_ARGUMENTS}
    relationships:
      - type: dice.relationships.spark.SubmittedBy
        target: ${SPARK}_master
      - type: dice.relationships.Needs
        target: ${SPARK}_worker
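As an illustration, a hypothetical instantiation of the job node might look as follows once the template variables are filled in. The jar URL, class, application name, and arguments below are made-up example values, not part of the blueprint library:

  pi_job:
    type: dice.components.spark.Application
    properties:
      jar: http://example.com/jobs/spark-examples.jar  # hypothetical URL
      class: org.apache.spark.examples.SparkPi
      name: pi-estimation
      args: [ "100" ]
    relationships:
      - type: dice.relationships.spark.SubmittedBy
        target: spark_master
      - type: dice.relationships.Needs
        target: spark_worker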
Template variables¶
- SPARK: The name of the Spark cluster. This is usually set to spark, which gives us spark_master and spark_worker nodes.
- SPARK_WORKER_COUNT: The number of Spark worker instances that should be created when deploying the cluster (see the sketch after this list).
- SPARK_JOB: The name of the Spark job that we wish to submit.
- SPARK_JOB_JAR_LOCATION: The location of the Spark job jar. This can be either a URL or a relative path, in which case the jar needs to be bundled with the blueprint.
- SPARK_JOB_CLASS: The name of the class that should be executed when submitting the Spark job.
- SPARK_JOB_NAME: The name that should be used for the application when the jar is submitted. This name can be seen in the Spark UI.
- SPARK_JOB_ARGUMENTS: An array of arguments that should be passed to the jar when it is submitted. If the application takes no additional arguments, set this to [].
- HOST_SIZE_MASTER, HOST_SIZE_WORKER: The sizes of the master and worker virtual machines. Available values are Small, Medium and Large.
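For instance, with SPARK set to spark, SPARK_WORKER_COUNT set to 3 and HOST_SIZE_WORKER set to Medium, the worker VM node from the blueprint above would expand to the following (a sketch under those assumed values):

  spark_worker_vm:
    type: dice.hosts.ubuntu.Medium
    instances:
      deploy: 3
    relationships:
      - type: dice.relationships.ProtectedBy
        target: spark_worker_firewall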