
As I started learning about Flink after becoming quite skilled with Spark, a key question bothered me: What sets Flink apart from Spark? While many articles listed differences between them, I couldn't find detailed insights into these differences and how their individual designs benefit various use cases.
So, let's dive in as I highlight the significant design differences and explain how they shape the capabilities of these two technologies to serve specific purposes.
Continuous vs Microbatch
If you search "Flink vs Spark" on Google, most articles will mention this difference.
Spark: Spark Streaming (Structured Streaming) follows a micro-batching approach. Initially designed as an enhancement to Hadoop's MapReduce for batch processing, Spark later added support for streaming. Micro-batch processing works by scheduling jobs to run at fixed intervals set by the user, often referred to as trigger times, or by running each micro-batch as soon as the previous one completes. The system processes data in small batches collected over these intervals. A crucial point to note is that in Spark Streaming, new jobs are generated at each trigger interval.
Flink: Flink adopts continuous processing, emphasizing real-time streaming while still supporting batch processing when needed. Continuous processing involves a perpetual reading loop where data is read from the upstream source, immediately processed, and then the system proceeds to read the next available set of data. This approach eliminates the scheduling overhead seen in microbatching. Data is processed as soon as it becomes available, reducing the processing latency to single-digit milliseconds or less. In this model, jobs are scheduled just once, and data flows continuously through a pipeline of operators to be processed.
Note: Flink uses a buffer with a configured size limit to batch data. As soon as the buffer reaches its limit or a maximum interval (in milliseconds) elapses, the data is pipelined to the operators for processing.
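The difference between the two loops can be sketched in plain Python. This is a conceptual illustration only, not real Spark or Flink code; FakeSource, micro_batch_loop, and continuous_loop are hypothetical names invented for the sketch:

```python
import collections

class FakeSource:
    """Hypothetical in-memory source used only for this illustration."""
    def __init__(self, records):
        self._queue = collections.deque(records)

    def drain(self):
        # Micro-batch style: take everything buffered since the last trigger.
        batch = list(self._queue)
        self._queue.clear()
        return batch

    def next(self):
        # Continuous style: take one record as soon as it is available.
        return self._queue.popleft() if self._queue else None

def micro_batch_loop(source, process):
    """Spark-style micro-batching: a new job runs per trigger, over a whole batch."""
    out = []
    while True:
        batch = source.drain()
        if not batch:
            break
        out.extend(process(batch))  # one scheduled job per trigger interval
    return out

def continuous_loop(source, process):
    """Flink-style continuous processing: each record flows through the
    operator pipeline immediately, with no per-batch job scheduling."""
    out = []
    while (record := source.next()) is not None:
        out.extend(process([record]))
    return out
```

Both loops produce the same results for the same input; the difference is that the micro-batch loop pays a scheduling step per batch, while the continuous loop pays nothing between records.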
Implication : The key distinction lies in how data is collected and processed.
- Microbatching entails scheduling a job at every trigger interval. While it's theoretically possible to reduce the interval to zero to speed up processing, the scheduling overhead practically results in increased latency. Continuous streaming sidesteps this overhead entirely.
- While microbatching may introduce some slowdown in streaming, it offers the advantage of configurable processing intervals. Some scenarios call for continuous processing at a less frequent cadence, such as every minute, which falls somewhere between traditional batch processing and real-time streaming. In such situations, Spark's microbatching approach is the better fit.
Resource Handling
Spark: Multiple applications (each with one SparkContext and one or more queries/actions) can share a cluster, i.e., tasks of multiple applications can share a worker node, but each application gets dedicated JVMs (executors) that are not shared among applications. Spark uses the concept of stages so that the same task (thread) can handle multiple narrow transformations (no shuffle).
Flink: Multiple jobs (a job in Flink is roughly equivalent to an application in Spark) can share a cluster, as in Spark, but a Flink session cluster additionally allows jobs to share the same JVMs (TaskManagers). Flink also has the concept of task slots, a further logical division of a TaskManager's memory. Within a slot, non-shuffling operators can be chained and processed by the same thread (analogous to narrow transformations).
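Operator chaining can be illustrated with a tiny sketch (plain Python, not the Flink API; chain_operators is a made-up helper): fusing non-shuffling operators means one thread runs them back to back, with no serialization or thread hand-off in between.

```python
def chain_operators(*operators):
    """Fuse non-shuffling operators so one thread runs them back to back,
    the way Flink chains operators within a task slot (and Spark fuses
    narrow transformations within a stage)."""
    def chained(record):
        for op in operators:
            record = op(record)  # no serialization or thread hop between ops
        return record
    return chained

# Hypothetical pipeline: parse -> scale -> format, all executed in one thread.
pipeline = chain_operators(int, lambda x: x * 10, str)
```

A shuffle (or keyed repartitioning) breaks such a chain, because records must cross thread or network boundaries at that point.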
Implication: 1. Because a Flink session cluster allows multiple jobs to run in the same JVM, it can lead to isolation problems: if the JVM has an issue, multiple jobs are impacted, and an issue with one job can affect another. 2. Sharing JVMs can optimize resource utilization, especially in scenarios where one job isn't fully using the JVM's capacity.
Checkpointing and Failure Recovery
Spark: Spark stores a checkpoint for each microbatch synchronously, as an integral part of processing. This checkpointing adds to the processing time, though the overhead is usually negligible.
Flink: Flink uses the Chandy-Lamport algorithm for checkpointing, taking snapshots that are used for failure recovery. There is also the larger savepoint concept, used for application upgrades and similar operations. A key aspect of Flink's checkpointing is that it runs independently of data processing.
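A minimal sketch of the barrier-alignment idea behind Flink's Chandy-Lamport-style snapshots (plain Python; AlignedOperator is a hypothetical class, and the buffering of post-barrier records during alignment is omitted for brevity): an operator snapshots its state only once a barrier for the same checkpoint id has arrived on every input channel, so the snapshot happens off the critical data path.

```python
class AlignedOperator:
    """Sketch of barrier alignment: snapshot state only after a barrier
    has been seen on every input channel for the same checkpoint."""
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.state = 0
        self.barriers_seen = set()
        self.snapshots = {}  # checkpoint_id -> copy of state at alignment

    def on_record(self, value):
        self.state += value  # normal processing continues between barriers

    def on_barrier(self, channel, checkpoint_id):
        self.barriers_seen.add(channel)
        if len(self.barriers_seen) == self.num_inputs:
            # All inputs aligned: take the snapshot for this checkpoint.
            self.snapshots[checkpoint_id] = self.state
            self.barriers_seen.clear()
```

On failure, every operator restores the snapshot of the last completed checkpoint and sources rewind to the matching offsets, giving a consistent global state.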
Implication: 1. Flink's asynchronous checkpointing is more efficient than Spark's synchronous approach, contributing to lower latency.
2. Flink has savepoints, which help with upgrading an application. Spark does not support this.
Note: Recent Spark 3.x versions have a setting that makes checkpointing asynchronous, but only for stateless queries with a Kafka sink. Databricks provides it for stateful operators too as part of the Databricks Runtime.
Watermarking
Spark: In Spark, the distinction lies between local and global watermarks. A local watermark operates per stream, while the global watermark is the minimum across streams, used in stream-stream joins.
Flink: Flink offers versatile watermarking options. Watermarks can be custom-designed and generated either in sources or in operators. For example, in the case of a Kafka source, watermarks can be generated per partition during reading, with the minimum of these per-partition watermarks propagated downstream to the operators. Alternatively, they can be generated while data is transformed by operators such as aggregations and joins.
Implication :
- Flink demonstrates greater flexibility by allowing custom watermark generation.
- Flink also supports both source and operator watermark generation. Both Spark and Flink use the latest seen event to compute watermarks, but they differ in source watermark generation. In Flink, watermarks are generated per partition, and the minimum across partitions is taken as the watermark. Spark, on the other hand, calculates the watermark from the latest event seen per stream, across all partitions.
- Flink offers a helpful feature called .withIdleness, which proves particularly advantageous when one of the streams (or a partition of a stream) remains idle for an extended period. In such scenarios, processing can proceed based on the behavior defined using .withIdleness. Note: this approach won't be effective if all streams are idle, which is a major problem in Spark as well.
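The watermark differences above can be condensed into a small sketch (plain Python, not the real Spark or Flink APIs; all function names are invented): Flink takes the minimum over per-partition watermarks and can drop idle partitions from that minimum, while Spark derives one watermark per stream from the maximum event time seen across all partitions.

```python
def flink_source_watermark(partition_max_event_times, max_out_of_orderness):
    """Flink-style: a watermark per partition; the minimum is propagated."""
    return min(t - max_out_of_orderness for t in partition_max_event_times)

def spark_stream_watermark(partition_max_event_times, delay_threshold):
    """Spark-style: one watermark per stream, from the max event time
    across all partitions."""
    return max(partition_max_event_times) - delay_threshold

def watermark_with_idleness(partitions, now, idle_timeout):
    """Sketch of .withIdleness semantics: partitions with no recent record
    are excluded from the minimum, so one quiet partition no longer stalls
    the watermark. If every partition is idle, the watermark cannot advance
    (the limitation noted above, shared with Spark)."""
    active = [p["watermark"] for p in partitions
              if now - p["last_record_at"] <= idle_timeout]
    return min(active) if active else None
```

With the same partition event times, the Flink-style minimum lags behind the Spark-style maximum; that conservatism is what keeps late data from a slow partition from being dropped.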
State
Spark: By default, state is stored in memory, backed by HDFS. State can also be stored in RocksDB, again backed by HDFS.
Flink: Flink emphasizes using RocksDB backed by HDFS.
Implication: Both Spark and Flink have similar options and behave similarly when it comes to state storage.
Shuffle
Spark: It might surprise you, but in Spark, every data shuffle takes a detour through the disk. The data meant for shuffling is first written to the disk and then read by other tasks for further processing. It's important to clarify that this process is distinct from shuffle spill to disk due to memory limitations. Additionally, Spark benefits from various optimized external shuffle services, thanks to contributions from different companies.
Flink: On the other hand, Flink operates differently. It bypasses disks for shuffling and instead directly delivers data to the downstream tasks.
Implication: This has a significant impact, making Flink's shuffles much faster due to the absence of disk involvement. Flink has fewer (or no) external shuffle services, but the need for them is not as pronounced given Flink's streaming-oriented design.
Flexibility
Spark: Spark offers flexibility in various aspects such as sinks, sources, serializations, and Catalyst optimizations. These components are easily pluggable, and features like watermarking are pre-built for convenience.
Flink: Flink provides an extensive level of customization. Nearly every step of a Flink job can be tailored, from how data is deserialized to how watermarks are designed.
Implication: Spark is adept at handling a wide range of use cases thanks to its pre-built frameworks and optimizations. Making configurations for these aspects is straightforward, making implementation easier. On the other hand, Flink's implementations can be more involved, requiring the addition of classes and extensions. However, this greater complexity is balanced by Flink's expansive customization options, making it particularly useful for specialized and niche use cases.
Languages Supported
Spark: Spark supports Java, Scala, Python, and SQL, with Scala being the predominant language choice. Among these, PySpark, SQL, and Scala have the most extensive resources available.
Flink: Similarly, Flink accommodates Java, Scala, SQL, and Python as supported languages.
Implication: In my own experience, I found that SQL followed by Java tends to be the most straightforward in Flink. In Spark, I noticed that all languages had a comparable level of implementation ease.
Note : My perspective on this matter is somewhat subjective, considering my relatively recent introduction to Flink. It's worth noting that Flink, as a technology, is still relatively new compared to Spark.
Summary: These are some of the significant differences that stand out between Spark and Flink, influencing the specific use cases they excel in. In the realm of streaming use cases, Flink's design approach, encompassing continuous processing, shuffling mechanisms, and resource management, lends itself to achieving sub-second latencies with ease compared to Spark. However, this doesn't imply that Spark is unsuitable for streaming. It is still very much on top of the game. Various factors, including the developer ecosystem, integrations, supported languages, and more, play a pivotal role in influencing the choice between Flink and Spark for streaming use cases.
Please feel free to share any additional insights you believe could enhance this article further.
Grateful for the insights provided by the following references & many more:
https://towardsdatascience.com/heres-how-flink-stores-your-state-7b37fbb60e1a
https://www.linkedin.com/pulse/spark-streaming-vs-flink-bo-yang/
https://www.youtube.com/watch?v=sdhwpUAjqaI&ab_channel=Confluent