Internal Working of Spark Applications — How a Spark Job is executed?

Apache Spark is an open-source, multi-language, in-memory, large-scale data processing engine. It provides high-level APIs in the Java, Scala, Python & R programming languages. It is built around in-memory computation, which can make it up to 100x faster than Hadoop MapReduce for certain workloads. It also provides tools & libraries such as Spark SQL (structured data processing), MLlib (machine learning), Spark Streaming (stream processing) & GraphX (graph processing).

Decoding a Spark-Submit Command

Below is a sample Spark-Submit command to run a Spark job; the relevant parameters are described after it.

./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number of cores>  \
  --jars <comma-separated dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]
  • --class The entry point of the application; Spark invokes the main method of this class (a minimal sketch of such a class follows this list).
  • --master The cluster manager (URL) to be set for the application. Types of cluster managers:
  1. Standalone: A simple cluster manager included with Spark that makes it easy to set up a cluster.
  2. Mesos: A general-purpose cluster manager that supports Hadoop MapReduce and service applications.
  3. YARN: The resource manager that ships with Hadoop 2.0+.
  4. Kubernetes: An open-source system for automating deployment and management of containerized applications.
  • --deploy-mode Determines whether the driver runs on the client machine where the job is submitted (client mode) or on one of the nodes inside the cluster (cluster mode). The default is client.
  • --conf <key>=<value> Set configuration property in "key=value" format. Multiple configurations can be passed to the job using the --conf keyword.
  • --jars Comma-separated list of additional jars (dependencies) to include on the driver and executor classpaths.
  • application-jar Path to the bundled jar containing the application and all its dependencies. The URL must be visible to the whole cluster, e.g. an hdfs:// or file:// path.
  • application-arguments Space-separated arguments passed to the main method of the main class, if any.
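
For illustration, here is a minimal sketch of what the class passed via --class might look like (the object name, input path and logic below are hypothetical, purely for illustration):

import org.apache.spark.sql.SparkSession

// Hypothetical entry point; after packaging into a jar it would be referenced with --class SimpleApp
object SimpleApp {
  def main(args: Array[String]): Unit = {
    // [application-arguments] from the Spark-Submit command arrive here, e.g. an input path
    val inputPath = if (args.nonEmpty) args(0) else "data/input.txt"

    // The driver creates the SparkSession (and the underlying SparkContext)
    val spark = SparkSession.builder()
      .appName("SimpleApp")
      .getOrCreate()

    // Count the lines of the input file and print the result on the driver
    val lineCount = spark.sparkContext.textFile(inputPath).count()
    println(s"Lines in $inputPath: $lineCount")

    spark.stop()
  }
}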

What Happens when you run the Spark-Submit Command?

[Figure: Architecture diagram for the internal working of a Spark application]

Spark uses a Master/Slave architecture. Whenever we submit the command, a Driver program is launched on a node, depending on the mode of execution (client/cluster mode); it essentially runs the main method of the class being executed. The Driver program creates a Spark Context/Session that stays alive for the entire lifetime of the application. The Driver is a separate JVM process, and its memory is allocated based on the configuration passed to the Spark-Submit command.

The Spark Context helps create the operator graph, or DAG (Directed Acyclic Graph), based on the transformations in the running program. This DAG describes the steps of the program and also records the RDD lineage, which can be used to re-create an RDD in case of failures.
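
As a small illustration (the file path and variable names are made up), the snippet below shows how transformations only extend the DAG, while an action finally triggers a job; toDebugString prints the recorded lineage:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Transformations are lazy: these lines only extend the DAG / RDD lineage,
// no data is read or processed yet.
val lines    = sc.textFile("data/input.txt")     // hypothetical path
val words    = lines.flatMap(_.split("\\s+"))
val nonEmpty = words.filter(_.nonEmpty)

// The lineage recorded for an RDD can be inspected; Spark uses it to
// re-compute lost partitions after a failure.
println(nonEmpty.toDebugString)

// Only when an action is called does Spark turn the DAG into a job and run it.
val total = nonEmpty.count()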

Once an action is encountered, a Job is created and submitted to the DAG Scheduler. The DAG Scheduler divides the DAG/operator graph into stages, which are further divided into tasks. These tasks are then handed to the Task Scheduler, which launches them via the Cluster Manager on the different worker nodes, and the executors run them. One task is launched for each partition of the data.
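
A rough sketch of how a shuffle splits a job into stages (assuming the sc handle from a spark-shell session or the snippet above; the data is made up):

// reduceByKey needs a shuffle, so the job triggered below has two stages.
val pairs  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)
  .map(word => (word, 1))               // narrow transformation: stays in the first stage
val counts = pairs.reduceByKey(_ + _)   // shuffle boundary: starts the second stage

// The action triggers the job; Spark launches one task per partition in each stage
// (three tasks for the first stage here, since three partitions were requested).
counts.collect().foreach(println)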

The Cluster Manager is responsible for launching the executors and allocating resources to them. When dynamic allocation is enabled, Spark can request additional executors or release idle ones based on the workload being processed. These executors are responsible for running the tasks.
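
Elastic scaling of executors is opt-in. A hedged illustration of enabling dynamic allocation through configuration (the property names are standard Spark settings, but the values are arbitrary examples):

import org.apache.spark.sql.SparkSession

// Illustrative values only; suitable numbers depend on the cluster and workload.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")      // let Spark grow/shrink the executor pool
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.shuffle.service.enabled", "true")        // typically required alongside dynamic allocation
  .getOrCreate()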

Each Executor is also a separate JVM process, with its own memory allocated based on the configuration passed to the Spark-Submit command. An executor can cache data, which can then be re-used in later stages.
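
A small sketch of executor-side caching (the path and variable names are hypothetical; assumes the sc handle from a spark-shell session):

val parsed = sc.textFile("data/events.log")    // hypothetical path
  .map(_.toLowerCase)

// cache() marks the RDD to be kept in executor memory; it is materialised
// the first time an action computes it.
parsed.cache()

val total  = parsed.count()                                 // computes and caches the partitions
val errors = parsed.filter(_.contains("error")).count()     // later job re-uses the cached data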

Once the tasks are completed, the results are sent back to the Driver program. When the execution of the code finishes, the Driver program exits and the Spark Context/Session is shut down.
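
For example (assuming the spark and sc handles from a spark-shell session), collect() is one such action that brings results back to the driver, and stop() ends the session:

val squares = sc.parallelize(1 to 5).map(n => n * n)

// collect() sends the results back to the driver process, so it should only be
// used on data small enough to fit in driver memory.
val onDriver: Array[Int] = squares.collect()
println(onDriver.mkString(", "))

// When the application is done, stopping the session shuts down the SparkContext.
spark.stop()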

Short Summary of the Spark Application Jargon

  • Driver The process that runs the main method of the class in execution.
  • Executors The processes responsible for executing the tasks.
  • Job A Spark application consists of many jobs; each job corresponds to an action encountered in the application (the sketch after this list shows two actions producing two jobs).
  • DAG Directed Acyclic Graph. A job is a combination of several operations which can be represented in the form of a DAG. A job can include multiple RDD transformations, and each RDD has its own lineage.
  • Stages A job is divided into stages; stage boundaries are placed where the DAG requires a shuffle between operators. In the Spark UI's stage view, all RDDs belonging to that stage are expanded and can be viewed in detail.
  • Tasks Each stage is divided into tasks, and one task is launched per partition, so a task can be seen as the smallest unit of execution.
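
To make the jargon concrete, a short sketch (assuming the sc handle from a spark-shell session): each action below produces its own job, and the shuffle in the second one adds an extra stage:

val nums = sc.parallelize(1 to 100, numSlices = 4)

// Action #1: a narrow pipeline -> one job with a single stage and 4 tasks (one per partition).
val doubledCount = nums.map(_ * 2).count()

// Action #2: reduceByKey needs a shuffle -> a second job with two stages.
val byParity = nums.map(n => (n % 2, n)).reduceByKey(_ + _).collect()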

This article is an overview of the execution of a Spark job. In the next article, I will explain Query Execution Plans, which will provide more detail on the DAG, stages and tasks in a Spark job.

PS: I would love to hear suggestions or feedback from readers to help improve my articles. Cheers !!!