This content originally appeared on Level Up Coding - Medium and was authored by Karan Pratap Singh
Deployment of Airflow on AWS ECS
Prerequisites
→ Basic knowledge of Airflow and DAGs
→ Basic understanding of containers and container deployment (ECS, Kubernetes, etc.)
→ Docker
Airflow
Airflow is an open-source tool for scheduling and monitoring workflows. Airflow has a webserver, a scheduler, and worker nodes as its components: the webserver is the UI for interacting with and monitoring workflows, the scheduler schedules workflows, and the workers execute the scheduled tasks. In Airflow, workflows are defined as DAGs (Directed Acyclic Graphs).
Motivation
Earlier I was using a primitive Airflow setup in which Airflow was installed on multiple Ubuntu servers and DAGs were copied to the $AIRFLOW_HOME/dags/ directory from GitHub as part of a cron job.
Issues:
→ The system was not scalable
→ No appropriate versioning of DAGs
→ No monitoring of the system (e.g., when a worker goes down)
→ No appropriate way of maintaining and versioning Airflow variables
To resolve these issues, we will ship Airflow DAGs, plugins, and configs as part of a Docker image.
Requirements
- A master node with the Airflow webserver and scheduler running on it, with the webserver accessible from the public internet
- Multiple workers that can reach the master node (running in the same network as the master)
- (CI/CD) Copying DAGs to $AIRFLOW_HOME/dags/ and plugins to $AIRFLOW_HOME/plugins/ on all worker and master nodes
- A mechanism to maintain Airflow variables
Steps
- Packaging DAGs, plugins and configs
- Deploying airflow
- CI/CD of DAGs, plugins and configs
Step 1 (Packaging)
Creating a Docker image with DAGs and plugins
Dockerfile
ARG BASE_IMAGE
ARG TAG
FROM $BASE_IMAGE:$TAG

# startup scripts for the master and worker containers
COPY ./run-master.sh /opt/
COPY ./run-worker.sh /opt/

# DAGs, plugins and variable configs baked into the image
COPY ./dags/ /opt/airflow/dags/
COPY ./plugins/ /opt/airflow/plugins/
COPY ./config/ /opt/airflow/config/
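For reference, the build context this Dockerfile expects would look roughly like the following (the layout is illustrative, not taken from the original project):
.
├── Dockerfile
├── run-master.sh
├── run-worker.sh
├── dags/       (DAG definition files)
├── plugins/    (Airflow plugins)
└── config/     (JSON files with Airflow variables as key-value pairs)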
run-master.sh and run-worker.sh are the scripts that start the master (webserver, scheduler, Flower) and the worker processes respectively.
run-master.sh
#!/bin/bash
airflow initdb
# import every JSON variables file from the config directory
for f in /opt/airflow/config/*; do
  airflow variables --import "$f"
done
nohup airflow scheduler >> /opt/airflow/logs/scheduler.logs 2>&1 &
nohup airflow flower >> /opt/airflow/logs/flower.logs 2>&1 &
exec airflow webserver -p 8080
This script imports all the JSON configs (key-value pairs) from the /opt/airflow/config/ directory, saves them as Airflow variables, and then starts the scheduler, Flower, and the webserver.
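For example, a variables file placed in config/ could look like this (the file name and keys below are hypothetical):
config/example-variables.json
{
  "environment": "production",
  "data_bucket": "my-airflow-bucket"
}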
run-worker.sh
#!/bin/bash
airflow initdb
exec airflow worker
This will start an Airflow (Celery) worker on the worker nodes.
"apache/airflow:latest" can be used as the base image for Airflow.
$ docker build --build-arg BASE_IMAGE=apache/airflow --build-arg TAG=latest -t airflow:1.0.0 .
This command builds the Docker image, with the image tag serving as the package version (1.0.0 in this case).
Step 2 (Deploying)
We will be using AWS ECS for deployment in this article.
To deploy the Docker image on AWS ECS, a Docker registry and a MySQL instance with a database for Airflow will be required.
For that, create an ECR repository from the AWS console, then push the Docker image to ECR:
$ aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
$ docker tag airflow:1.0.0 <account_id>.dkr.ecr.<region>.amazonaws.com/<repository_name>:1.0.0
$ docker push <account_id>.dkr.ecr.<region>.amazonaws.com/<repository_name>:1.0.0
This will push the image to AWS ECR, from where it will later be pulled by AWS ECS.
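(The ECR repository can also be created from the CLI instead of the console; a minimal example, assuming a repository named airflow:)
$ aws ecr create-repository --repository-name airflow --region <region>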
The second step is to create task definitions for the deployment.
We will create two task definitions: one for the master and one for the worker.
Task definition master specifications
Network mode: awsvpc
Memory: 4 GB
CPU: 2 vCPU
Container spec
- Image: <image path from ECR>
- Entry point: ["bash","-c"]
- Command: ["/bin/bash -c '/opt/run-master.sh'"]
- Port: 8080
Environment variables
- AIRFLOW__CORE__SQL_ALCHEMY_CONN: mysql://<user>:<password>@<mysql_uri>:3306/<database>
- AIRFLOW__CORE__LOAD_EXAMPLES: False
- AIRFLOW__CORE__EXECUTOR: CeleryExecutor
Task definition worker specifications
Network mode: awsvpc
Memory: 4 GB
CPU: 2 vCPU
Container spec
- Image: <image path from ECR>
- Entry point: ["bash","-c"]
- Command: ["/bin/bash -c '/opt/run-worker.sh'"]
- Port: 8080
Environment variables
- AIRFLOW__CORE__SQL_ALCHEMY_CONN: mysql://<user>:<password>@<mysql_uri>:3306/<database>
- AIRFLOW__CORE__LOAD_EXAMPLES: False
- AIRFLOW__CELERY__FLOWER_HOST: airflow.airflow-master
- AIRFLOW__CORE__EXECUTOR: CeleryExecutor
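These specifications can also be registered through the AWS CLI. Below is a minimal sketch for the master task definition, assuming the Fargate launch type and a standard ecsTaskExecutionRole (these assumptions, the file name, and all placeholder values are mine, not from the original setup); the worker definition is analogous, swapping the command and adding the Flower host variable:
$ cat > airflow-master-taskdef.json <<'EOF'
{
  "family": "airflow-master",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "executionRoleArn": "arn:aws:iam::<account_id>:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "airflow-master",
      "image": "<account_id>.dkr.ecr.<region>.amazonaws.com/<repository_name>:1.0.0",
      "entryPoint": ["bash", "-c"],
      "command": ["/opt/run-master.sh"],
      "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
      "environment": [
        {"name": "AIRFLOW__CORE__SQL_ALCHEMY_CONN", "value": "mysql://<user>:<password>@<mysql_uri>:3306/<database>"},
        {"name": "AIRFLOW__CORE__LOAD_EXAMPLES", "value": "False"},
        {"name": "AIRFLOW__CORE__EXECUTOR", "value": "CeleryExecutor"}
      ]
    }
  ]
}
EOF
$ aws ecs register-task-definition --cli-input-json file://airflow-master-taskdef.json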
After the task definitions, an ECS cluster will be required for the deployment.
On the cluster, we will add two services: one for the master and another for the worker.
For the master service, an application load balancer will be required so that the webserver is reachable from outside the private subnet, along with service discovery using the host airflow.airflow-master (already set in the environment variables of the worker container spec) so that the workers can communicate with the master.
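A rough sketch of creating the master service with the AWS CLI (the cluster name, subnets, security group, and ARNs below are placeholders, and the target group and Cloud Map service discovery entry are assumed to exist already):
$ aws ecs create-service \
    --cluster airflow-cluster \
    --service-name airflow-master \
    --task-definition airflow-master \
    --desired-count 1 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[<private_subnet_id>],securityGroups=[<security_group_id>],assignPublicIp=DISABLED}" \
    --load-balancers "targetGroupArn=<target_group_arn>,containerName=airflow-master,containerPort=8080" \
    --service-registries "registryArn=<cloud_map_service_arn>"
The worker service can be created the same way, without the --load-balancers option.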
Common issues:
→ Containers not being in a private subnet (AWS doesn't provide internet access inside containers if they are in a public subnet)
→ Communication between the master container and the ALB
→ ALB health check failures: Airflow returns a 302 on the "/" route, which fails the ALB health check; the health check path should be "/health" for Airflow
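If the target group was created with the default health check, the path can be changed to /health with something along these lines (the ARN is a placeholder):
$ aws elbv2 modify-target-group \
    --target-group-arn <target_group_arn> \
    --health-check-path /health \
    --matcher HttpCode=200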
Conclusion
- Versions of the deployment are now available, which makes rollbacks easier
- Containerized packaging ensures consistency across environments
- Scaling is as easy as increasing the desired count of the worker service (see the command after this list)
- Container logs are available for the master and all the workers; if any worker stops, a new container will be launched by ECS, and CloudWatch monitoring and alerting can be enabled on top of this
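For instance, scaling the workers is a single CLI call (the cluster and service names are the hypothetical ones used above):
$ aws ecs update-service --cluster airflow-cluster --service airflow-worker --desired-count 4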