In 2020, AWS launched Amazon Managed Workflows for Apache Airflow (MWAA). Apache Airflow is an open-source job orchestration platform that was built at Airbnb in 2014. Since then, many companies have adopted it for a variety of use cases. It is a workflow orchestration tool that lets users run jobs in a defined order, either on a schedule or as an ad-hoc execution. Because its architecture does not rely on embedded storage that users can't control, you can move the queueing service and metadata storage outside Airflow and design your Airflow stack to be horizontally scalable.
However, as with other open-source projects, maintaining your Airflow resources becomes your responsibility, and it is not a small job. Recognizing this need, AWS launched MWAA to take care of the operational side. This recent AWS service made me think about AWS Glue, a managed ETL (Extract, Transform, and Load) service, and about how the two differ.
MWAA and AWS Glue Comparisons
Apache Airflow and AWS Glue were built with different aims, but they share some common ground. Both allow you to create and manage workflows. Because of this similarity, some tasks you can do with Airflow can also be done with Glue, and vice versa. This overlap makes it hard to choose one over the other. To make an informed decision, it is important to understand how they differ.
Server vs. Serverless
Let's start with an architectural difference that has a major impact on how they work. MWAA is server-based while AWS Glue is serverless. We will not discuss the two concepts themselves here, but MWAA and Glue inherit the benefits and drawbacks of their respective architectures, and these show up in three main areas.
Cost
Cloud platform providers offer serverless services, and people choose them, among other reasons, for cost efficiency. Apache Airflow was designed to run on servers. This means that even when there is no job to run, your Airflow resources stay active and incur costs during idle hours. MWAA is still server-based, but it gives you a way to save cost with auto-scaling: it monitors resource usage and adds workers when more capacity is needed or removes them when less is needed. Still, because it is server-based, you cannot avoid running servers during idle time. AWS Glue is serverless and event-driven. It allocates server resources only when you trigger a job. During idle time, it neither uses any resources nor incurs any cost.
In MWAA, there are four main factors that decide your bill.
- Instance size
- Additional worker instance
- Additional scheduler instance
- Meta database storage
These costs will be multiplied by usage hours.
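To make the billing factors above concrete, here is a minimal sketch of a cost estimate. Every rate in it is a made-up placeholder, not an AWS price, and the names `MwaaRates` and `estimate_mwaa_cost` are hypothetical. It also assumes that metadata storage is passed in as a lump sum for the period, since storage is commonly billed by volume rather than by the hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MwaaRates:
    """Hypothetical USD rates; real prices vary by region and environment size."""
    environment_per_hour: float
    extra_worker_per_hour: float
    extra_scheduler_per_hour: float


def estimate_mwaa_cost(rates, environment_hours, extra_worker_hours=0.0,
                       extra_scheduler_hours=0.0, metadata_storage_cost=0.0):
    """Sum the four billing factors for one billing period."""
    return (
        rates.environment_per_hour * environment_hours
        + rates.extra_worker_per_hour * extra_worker_hours
        + rates.extra_scheduler_per_hour * extra_scheduler_hours
        + metadata_storage_cost  # assumption: supplied as a lump sum
    )


# Placeholder numbers: the environment runs the whole month (720 hours),
# idle or not, while extra workers are only billed for the hours they exist.
rates = MwaaRates(environment_per_hour=0.50,
                  extra_worker_per_hour=0.05,
                  extra_scheduler_per_hour=0.05)
monthly = estimate_mwaa_cost(rates, environment_hours=720,
                             extra_worker_hours=100,
                             metadata_storage_cost=2.0)
print(f"Estimated monthly cost: ${monthly:.2f}")
```

The point of the sketch is the `environment_hours` term: it is paid for every hour of the month, whether or not any DAG runs.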
In AWS Glue, you are charged per DPU (Data Processing Unit) multiplied by usage hours. The number of DPUs you can use depends on the type of job you run. There are three types.
- Python shell: you can choose either 0.0625 or 1 DPU.
- Apache Spark: you can use from a minimum of 2 DPUs up to a maximum of 100 DPUs.
- Spark Streaming: you can use from a minimum of 2 DPUs up to a maximum of 100 DPUs.
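The Glue side of the bill can be sketched the same way: DPUs multiplied by runtime multiplied by a price per DPU-hour. The DPU limits below come from the list above. The price is a placeholder you would replace with the current rate for your region, and the function names are hypothetical.

```python
def dpus_allowed(job_type, dpus):
    """Check a DPU count against the limits listed for each job type."""
    if job_type == "python_shell":
        return dpus in (0.0625, 1)
    if job_type in ("spark", "spark_streaming"):
        return 2 <= dpus <= 100
    raise ValueError(f"unknown job type: {job_type}")


def estimate_glue_cost(job_type, dpus, runtime_hours, price_per_dpu_hour):
    """Cost of one job run: DPUs x hours x price per DPU-hour."""
    if not dpus_allowed(job_type, dpus):
        raise ValueError(f"{dpus} DPUs is not valid for a {job_type} job")
    return dpus * runtime_hours * price_per_dpu_hour


# Placeholder price per DPU-hour; look up the real one for your region.
PRICE = 0.44
spark_run = estimate_glue_cost("spark", dpus=10, runtime_hours=0.5,
                               price_per_dpu_hour=PRICE)
idle = estimate_glue_cost("spark", dpus=10, runtime_hours=0,
                          price_per_dpu_hour=PRICE)
print(f"30-minute Spark run: ${spark_run:.2f}, idle time: ${idle:.2f}")
```

Unlike the MWAA estimate, there is no term here that keeps accruing while nothing runs.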
Job Isolation
In a server environment, workflows share the same resources. When one workflow process consumes a large amount of memory or CPU, it can negatively impact other processes. This can happen in MWAA if you don't distribute trigger times evenly or if you misconfigure your maximum worker count. In contrast, AWS Glue provides an isolated environment for each workflow because each one uses separate resources. When you run heavy, data-intensive work, it does not impact other workflows running at the same time.
Responsiveness
In MWAA, servers are always running and ready to be used. When a workflow is triggered, it starts with almost no latency. In AWS Glue, because it is serverless, an ETL workflow begins by allocating the necessary resources. This means that there is some delay before the actual job starts. Moreover, when you use external libraries, each job run first installs those libraries and only then begins the actual work.
Work Orchestration vs. ETL
It is worth talking about what each service was designed for. Remember that Airflow is a job orchestration tool. If you look up orchestration on Wikipedia, you also find the definition of an orchestrator.
"An orchestrator is a trained musical professional who assigns instruments to an orchestra…"
The word we want to focus on here is "assigns". An ideal use case for Airflow is delegating jobs to other resources. Of course, we can still run non-intensive jobs within Airflow, but where possible, it is better to delegate them. Keeping heavy jobs out of Airflow also prevents them from eating into shared resources. A big advantage of MWAA is that it lives within AWS. This means that you can more easily access other AWS services such as EMR, Athena, S3, Redshift, and more using IAM and security groups.
AWS Glue specializes in ETL. Extracting, transforming, and loading data often entails expensive processes. To handle intensive jobs, you can use Apache Spark clusters in Glue. Thanks to the ability to use Spark clusters internally, you can run large-scale data processing without having to worry about exhausting resources.
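As one illustration of delegation, the sketch below shows the kind of function an Airflow task could call to hand heavy work to a Glue job and simply wait for the result. It assumes a boto3 Glue client is passed in (for example `boto3.client("glue")`) and uses its `start_job_run` and `get_job_run` calls. The function name and the polling interval are my own choices, not something Airflow or AWS prescribes.

```python
import time

# Job run states after which a Glue run will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, arguments=None,
                 poll_seconds=30, sleep=time.sleep):
    """Start a Glue job and wait for it, so the heavy work runs outside Airflow."""
    started = glue_client.start_job_run(JobName=job_name,
                                        Arguments=arguments or {})
    run_id = started["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            break
        # The Airflow worker only sleeps here; Glue does the processing.
        sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job {job_name} run {run_id} ended in state {state}")
    return run_id
```

Inside a DAG, this could be wrapped in a Python task. The Amazon provider package for Airflow also ships a ready-made Glue operator, which is usually the better choice in practice.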
Monitoring and Logging
Both MWAA and AWS Glue provide convenient ways of monitoring. This is a big improvement for Airflow. When you set up and use Airflow on your own server, you will realize that checking Airflow logs is not user-friendly. Airflow writes different types of logs for tasks, web servers, schedulers, workers, and DAGs. By default, it writes these logs on the server itself. To read the logs, you have to SSH into the server and run commands. It becomes more complicated when you want to use distributed servers for scalability. This requires you to create central log storage and do additional setup so that all servers write their logs to that single place.
Because MWAA is managed by AWS, all the logs are written into CloudWatch. That means you can search certain logs using Logs Insights and have a dashboard that displays the usage of server resources like CPU, memory, and network traffic. Also, you can monitor numerous other Airflow-specific metrics. You can also set up alerts and manage notification recipients programmatically.
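A minimal sketch of such a search, assuming a boto3 CloudWatch Logs client is passed in (for example `boto3.client("logs")`) and using its `start_query` and `get_query_results` calls. The function name is hypothetical, and the log group name is something you would look up for your own environment.

```python
import time

# Logs Insights query: the 20 most recent messages containing "ERROR".
ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 20"
)


def find_error_logs(logs_client, log_group, start_epoch, end_epoch,
                    poll_seconds=1, sleep=time.sleep):
    """Run a Logs Insights query and return each result row as a dict."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,   # seconds since the epoch
        endTime=end_epoch,
        queryString=ERROR_QUERY,
    )["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Because the same function works for any log group, it can be pointed at task, scheduler, or worker logs without SSH access to any server.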
AWS Glue, since it is a native AWS service, also has good monitoring capabilities. What's different is that, in addition to CloudWatch, you can monitor your Glue resources in Glue Studio within Glue.
I found that it is easier to drill down and track failed jobs in Glue. The image above is the first screen of the Glue Studio monitoring menu. It gives you high-level numbers. You can click each number to go to the relevant jobs. When you select a job there, you can search its CloudWatch logs and view other metrics. When you have many workflows to monitor, this design gives you clearer visibility and lets you track issues more easily.
CloudFormation
Since Airflow has become part of the managed services in AWS, you can define your Airflow infrastructure with CloudFormation. This covers not only the infrastructure itself but also Airflow's native config variables. If you are already managing your AWS resources with CloudFormation, this gives you consistency in resource management and makes it easier to manage Airflow across different environments. AWS Glue, of course, can also be managed by CloudFormation. A big difference is that with Glue you can even use CloudFormation to create and define workflows, which gives you finer control.
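As a rough sketch of what this looks like, the snippet below builds a small CloudFormation template as a Python dictionary and prints it as JSON. It uses the `AWS::MWAA::Environment`, `AWS::Glue::Workflow`, and `AWS::Glue::Trigger` resource types. All names and values are placeholders, the function name is hypothetical, and a deployable MWAA environment needs further properties (S3 bucket, execution role, network configuration) that are left out to keep the example short.

```python
import json


def build_template(environment_name, workflow_name, glue_job_name):
    """Return a trimmed-down CloudFormation template as a dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "AirflowEnvironment": {
                "Type": "AWS::MWAA::Environment",
                "Properties": {
                    "Name": environment_name,
                    "MaxWorkers": 5,
                    # Airflow's own settings, written as "section.key" pairs.
                    "AirflowConfigurationOptions": {
                        "core.default_timezone": "utc",
                    },
                },
            },
            # On the Glue side, the workflow itself is a template resource.
            "EtlWorkflow": {
                "Type": "AWS::Glue::Workflow",
                "Properties": {"Name": workflow_name},
            },
            "StartEtl": {
                "Type": "AWS::Glue::Trigger",
                "Properties": {
                    "Type": "ON_DEMAND",
                    "WorkflowName": workflow_name,
                    "Actions": [{"JobName": glue_job_name}],
                },
            },
        },
    }


template = build_template("my-airflow-env", "my-etl-workflow", "my-etl-job")
print(json.dumps(template, indent=2))
```

Note the asymmetry: the Glue workflow and its trigger appear in the template, while Airflow DAGs stay in Python files outside it.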
AWS SDK and CLI
In addition to CloudFormation, you can interact with MWAA through the AWS SDK and CLI, as you can with other AWS resources. Using them, you can create, update, and delete MWAA environments and retrieve environment information such as logging policies, the number of workers and schedulers, and much more. You can also run Airflow's internal commands to control DAGs, but the MWAA CLI does not support all of Airflow's native commands. Unsupported commands include backfill (check this AWS document), dags list, dags list-runs, dags next-execution, and more. These limitations exist mainly because Airflow was not created as a native AWS service. AWS Glue, by contrast, can be controlled more completely through the SDK and CLI. You can perform essentially all tasks and control all Glue resources using these libraries.
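The sketch below shows roughly how an Airflow command is sent to MWAA. It assumes you have already called the MWAA client's `create_cli_token` (for example `boto3.client("mwaa").create_cli_token(Name=...)`), which returns a `CliToken` and a `WebServerHostname`. The command is then posted as plain text to the `/aws_mwaa/cli` path on that host, and the reply carries base64-encoded output. The helper names are hypothetical, and the DAG id is a placeholder.

```python
import base64
import json
import urllib.request


def build_cli_request(token_response, command):
    """Build the HTTP request that sends one Airflow CLI command to MWAA."""
    url = f"https://{token_response['WebServerHostname']}/aws_mwaa/cli"
    return urllib.request.Request(
        url,
        data=command.encode("utf-8"),  # e.g. "dags trigger my_dag"
        headers={
            "Authorization": f"Bearer {token_response['CliToken']}",
            "Content-Type": "text/plain",
        },
    )


def decode_cli_response(body):
    """Decode the base64 stdout and stderr from the JSON reply."""
    payload = json.loads(body)
    stdout = base64.b64decode(payload["stdout"]).decode("utf-8")
    stderr = base64.b64decode(payload["stderr"]).decode("utf-8")
    return stdout, stderr


# Sending the request is left commented out because it needs a real token:
# request = build_cli_request(token_response, "dags trigger my_dag")
# with urllib.request.urlopen(request) as reply:
#     stdout, stderr = decode_cli_response(reply.read())
```

A command that MWAA does not support simply reports its failure in the decoded output, so it is worth checking `stderr` on every call.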
Wrapping up
MWAA and AWS Glue are both great tools for orchestrating jobs: MWAA for general jobs and Glue for ETL specifically. The architectural and functional differences discussed above will influence your choice of tool. Of course, knowing these differences, you may want to consider using both of them, which gives you much more flexibility. When you use both, you can, for example, control Glue resources from MWAA, since Glue's SDK gives you more control and Airflow has a Glue operator. These two services have large ecosystems and a lot to learn, so it would be good to dip your toes into them before you start full-fledged development.