Implementing the Airflow HA solution
Well, let’s continue building the HA cluster.
This chapter of the tutorial covers installing and configuring csync2, the application used for DAG synchronization, and Airflow itself.
Install and configure csync2:
Pre-shared keys can be generated using the following:
Then specify the file with your keys in the string
The key is used to authenticate the nodes to each other.
Which directory is synchronized should be specified in
So, add a job to crontab:
Then you can create a file or directory on one node and it will be synchronized to all the other nodes. Note that a single key is shared across all nodes here.
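To make the shape of the csync2 configuration more concrete, here is a minimal sketch that renders a group definition with the host list, the shared key file, and the synchronized directory. The group name, host names, key path, and DAG directory are hypothetical placeholders, not the values from this setup, and the helper `render_csync2_config` is illustrative only.

```python
def render_csync2_config(group, hosts, key_file, include_dirs):
    """Return the text of a minimal csync2 group definition.

    All arguments are placeholders for illustration; substitute the
    values that match your own cluster.
    """
    lines = [f"group {group}", "{"]
    # All cluster members go on one "host" line, separated by spaces.
    lines.append(f"    host {' '.join(hosts)};")
    # The same pre-shared key file is referenced on every node.
    lines.append(f"    key {key_file};")
    # One "include" line per directory that should be synchronized.
    for directory in include_dirs:
        lines.append(f"    include {directory};")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_csync2_config(
        group="airflow",                      # hypothetical group name
        hosts=["node1", "node2", "node3"],    # hypothetical host names
        key_file="/etc/csync2.key",           # hypothetical key path
        include_dirs=["/opt/airflow/dags"],   # hypothetical DAG directory
    ))
```

Generating the file from one place and copying it to every node is one way to keep the configuration identical across the cluster.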
To start the csync2 daemon automatically with the required parameters, modify its systemd unit:
And finally, all the preparations are done and we can install Airflow itself:
Initialize the database:
Create systemd units for the Airflow components:
Don’t forget to enable them.
Create the config file in the default directory with the following command:
Find and modify the following settings:
in the [core] section:
in the [celery] section:
For broker_url I specified all three nodes. This is arguably redundant: Celery is expected to connect to localhost, and if the whole node is down there is no reason to reconnect to any other node. I added all three nodes for reliability anyway. You can consider specifying only localhost.
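As an illustration of the two options, the sketch below builds a broker_url value either for all three nodes or for localhost only. It assumes a RabbitMQ broker reached over the amqp:// scheme on port 5672 with the default virtual host, which the text above does not state, and it relies on Celery accepting several broker URLs as one semicolon-delimited string for failover. The host names and credentials are hypothetical.

```python
from urllib.parse import quote


def build_broker_url(hosts, user, password, port=5672):
    """Build a semicolon-delimited broker_url for Celery.

    Assumes an AMQP broker (for example RabbitMQ) and the default
    virtual host; adjust the scheme and path for other brokers.
    """
    # Percent-encode credentials so characters like "@" or ":" do not
    # break URL parsing.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    urls = [
        f"amqp://{safe_user}:{safe_password}@{host}:{port}//"
        for host in hosts
    ]
    # Celery treats a ";"-separated list as failover alternatives.
    return ";".join(urls)


if __name__ == "__main__":
    # All three nodes (hypothetical names), as described above.
    print(build_broker_url(["node1", "node2", "node3"], "airflow", "secret"))
    # The localhost-only alternative.
    print(build_broker_url(["localhost"], "airflow", "secret"))
```

The resulting string would go into the broker_url setting of the [celery] section.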
Create a user:
Restart Airflow:
Check whether any errors occur.
To access the web interface, I recommend using an SSH tunnel.
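A small sketch of such a tunnel follows. It assembles an ssh command that forwards a local port to the webserver port on a cluster node. It assumes the webserver listens on Airflow's default port 8080. The node name and user are hypothetical, and `ssh_tunnel_command` is an illustrative helper, not part of any tool.

```python
import shlex


def ssh_tunnel_command(remote_host, user, local_port=8080, remote_port=8080):
    """Return the argv for an SSH local port forward to the web UI."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{remote_host}",
    ]


if __name__ == "__main__":
    # Print the command so it can be copied into a terminal.
    print(shlex.join(ssh_tunnel_command("node1", "admin")))
```

With the tunnel open, the web interface should be reachable at http://localhost:8080 on the local machine.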
At this point the Airflow cluster is up and running. Double-check all settings again to make sure you didn’t make a mistake.
Try the following:
Sometimes you can remove all PostgreSQL data:
But be careful! Do this only in a test environment. Production servers are not for experiments, and you should understand exactly what you are doing.
And restart the services:
Issues with csync2 can be resolved by removing the directories that hold its database:
Then check the status of the processes with the command:
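If you prefer to check several services at once, the following sketch wraps `systemctl is-active` and collects the reported state per unit. The unit names are hypothetical and depend on how you named your systemd units. The `run` parameter exists only so the function can be exercised without systemd.

```python
import subprocess

# Hypothetical unit names; replace them with the ones you created.
UNITS = ["airflow-webserver", "airflow-scheduler", "airflow-worker", "csync2"]


def unit_state(unit, run=subprocess.run):
    """Return the state systemd reports for a unit, e.g. 'active'."""
    # "systemctl is-active" prints the state and exits non-zero when the
    # unit is not active, so check=False keeps that from raising.
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() or "unknown"


def report(units, run=subprocess.run):
    """Map each unit name to its current state."""
    return {unit: unit_state(unit, run) for unit in units}
```

Calling `report(UNITS)` on a node returns a dictionary such as one mapping each unit to `active`, `inactive`, or `failed`, which is easy to feed into a monitoring check.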
Last but not least, I strongly recommend setting up monitoring for the system. Which system to use is up to you. I use Zabbix with dedicated templates for PostgreSQL, templates for tracking whether processes are running, and so on. Any comments are welcome!