Airbnb is one of the largest online marketplaces for lodging. Founded in 2007, the platform currently has more than 5.5 million listings across more than 100,000 cities and 220 countries.
Initially, a small team of engineers at Airbnb ran a Ruby on Rails monolith, which they called Monorail. Later on, the company grew and started having dependency issues with Monorail. “We began to experience tight coupling between the modules as our engineering team grew larger,” noted Jessica Tai, Engineering Manager at Airbnb, during a presentation at QCon 2018. “Modules began to assume too many responsibilities and became highly dependent on one another.”
In 2015, Airbnb had over 200 engineers who were adding features to Monorail. With 200 commits deployed to the monolith per day, Monorail was blocked for an average of 15 hours per week because of reverts and rollbacks.
In response, Airbnb started to break apart Monorail and shifted to a service-oriented architecture (SOA) on Amazon Elastic Compute Cloud (EC2). While this change resolved issues with tight coupling, the company now faced a new problem with scaling. “We needed to scale continuous delivery horizontally,” explained Melanie Cebula, Software Engineer at Airbnb, during a presentation at KubeCon NA 2018.
According to Melanie, each service needed its own continuous delivery cycle. This would enable Airbnb to scale development for engineers by simply adding new services. To address scaling, the company began migrating to Kubernetes in 2017.
Migrating Monorail to SOA and microservices
In its early days, Monorail’s front end was built with Backbone.js, as stated by Jens Vanderhaeghe, Senior Software Engineer at Airbnb, during GitHub Universe 2018. In 2015, the front end was migrated to Redux and React, while the back end utilized Java services. The Monorail codebase grew rapidly and, at some point, reached 220 changes deployed daily, 30,000 SQL database columns, 155,000 pull requests merged in GitHub, and 1,254 unique contributors. Eventually, Monorail was divided into two major services: Hyperloop and Treehouse.
In order to reduce complexity and abstract away configurations, Airbnb’s engineering team relied on YAML templates rather than file inheritance. This also made it easier to migrate legacy services and retrain engineers, according to Melanie Cebula.
Building on the concept of templates to reduce repetition, engineers at Airbnb began storing configuration files in Git. This way, reviewing, updating, and committing configuration files is streamlined. Additionally, Kubernetes deployment best practices can easily be set as default parameters.
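The idea of setting best practices as default parameters can be sketched as follows. This is a minimal illustration in Python, not Airbnb's actual tooling: the DEFAULTS values, the field names, and the render_config helper are all hypothetical.

```python
from copy import deepcopy

# Hypothetical organization-wide defaults; the values are illustrative only.
DEFAULTS = {
    "replicas": 2,
    "resources": {"cpu": "500m", "memory": "512Mi"},
    "probes": {"liveness": "/health", "readiness": "/ready"},
}


def deep_merge(base, overrides):
    """Return a new dict in which overrides take precedence over base, recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def render_config(service_name, overrides=None):
    """Build a service's configuration from the defaults plus its own overrides."""
    config = deep_merge(DEFAULTS, overrides or {})
    config["name"] = service_name
    return config


# A hypothetical service that only overrides what it needs to.
payments = render_config(
    "payments", {"replicas": 6, "resources": {"memory": "2Gi"}}
)
```

Under these assumptions, a service owner writes only the values that differ from the defaults, and a change to a default reaches every service that has not overridden it.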
Scaling up with multiclusters
In September 2018, Airbnb’s main production cluster had reached 450 nodes. By December of the same year, this had doubled to 900 nodes. Around this point, concerns about etcd began to surface, as explained by Ben Hughes, Software Engineer at Airbnb, during KubeCon North America 2019. “It would be bad if our etcd instances were suddenly getting out of memory,” he added.
Before the end of 2018, etcd in Airbnb’s Kubernetes deployment would get backed up and fall over. This was caused by multiple problems, including a low cache hit rate on the API server and chains of events overwhelming etcd.
Fortunately, when etcd failed, Kubernetes would just stop any deploys and scaling operations in progress. No workloads got taken down. Issues with etcd were resolved by upgrading to the etcd v3 data format.
By March 2019, the company’s production cluster again doubled in size to 1,800 nodes. By April, this increased to around 2,300 nodes. At this point, the engineering team was encountering more issues as the number of nodes increased and opted to add more clusters instead.
Airbnb was able to transition to a multicluster environment without much difficulty. This was largely thanks to SmartStack, Airbnb’s service mesh, and to the legacy infrastructure not having any colocation requirements.
Keeping multiple clusters consistent
To ensure that clusters perform equally, the Airbnb team created kube-system, an in-house method for deploying clusters. Components for clusters are written as Helm charts that are templated into a single manifest. Applications are then deployed using kube-gen, Airbnb’s internal framework. Under kube-system, deploys take less than 10 minutes.
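Templating several components into a single manifest can be sketched as below. This is a simplified stand-in that uses Python's string.Template rather than Helm itself, and the component names, template contents, and render_manifest function are assumptions made for illustration.

```python
from string import Template

# Hypothetical component templates; real Helm charts are far richer than this.
COMPONENT_TEMPLATES = {
    "dns": Template(
        "kind: Deployment\n"
        "metadata:\n"
        "  name: dns\n"
        "  labels:\n"
        "    cluster: $cluster\n"
    ),
    "logging": Template(
        "kind: DaemonSet\n"
        "metadata:\n"
        "  name: logging\n"
        "  labels:\n"
        "    cluster: $cluster\n"
    ),
}


def render_manifest(cluster_name, templates=COMPONENT_TEMPLATES):
    """Render every component and join them into one multi-document YAML string."""
    documents = [
        templates[name].substitute(cluster=cluster_name).strip()
        for name in sorted(templates)  # sorted so the output order is stable
    ]
    return "\n---\n".join(documents) + "\n"


manifest = render_manifest("prod-a")
```

Rendering everything into one manifest means each cluster is built from the same reviewed artifact, which is one plausible way to keep clusters from drifting apart.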
To better organize multiple clusters, Airbnb additionally introduced the concept of types. These cluster types serve as classes, whereas clusters serve as instances.
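The class-and-instance analogy for cluster types maps directly onto code. The sketch below is only an illustration of that relationship: the type names, component lists, and cluster names are hypothetical and do not reflect Airbnb's actual definitions.

```python
class Cluster:
    """Base cluster type: components that every cluster is assumed to run."""

    components = ("dns", "logging")

    def __init__(self, name):
        self.name = name


class ProductionCluster(Cluster):
    """A cluster type: production clusters add an ingress component."""

    components = Cluster.components + ("ingress",)


class TestingCluster(Cluster):
    """Another cluster type that keeps the base component set."""


# Clusters are instances of a type, so clusters of one type share a definition.
prod_a = ProductionCluster("prod-a")
prod_b = ProductionCluster("prod-b")
test_a = TestingCluster("test-a")
```

Here prod_a and prod_b differ only by name, so a change to ProductionCluster applies to both, while test_a is unaffected.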
Since the migration to Kubernetes, Airbnb has reached at least 125,000 production deploys per year. Since 2019, more than 50% of the company’s services have been running on over 7,000 nodes across 36 Kubernetes clusters. This includes over 250 critical services. Moreover, the Airbnb team has added over 22 cluster types, such as production, testing, development, and special security groups.