The recently concluded Black Friday weekend was potentially the most challenging shift of the year for on-call engineers working in the Retail or E-Commerce sector. Because peak-traffic events like this push systems to their limits, engineering teams are under a lot of pressure while preparing for them.
This is because the holiday season, globally and especially in the US, is a busy period for shopping enthusiasts, and that excitement brings a lot of website traffic with it. Eager customers now visit websites more than they visit local stores, in part due to the pandemic.
Online retail sales in the US are about $1.4 billion on a normal day. On peak traffic days like Black Friday, however, sales are several times that amount. On Black Friday 2018, U.S. online sales totaled $6.22 billion (about 4.4x a normal day), and on Cyber Monday 2018 sales surged to $7.9 billion (about 5.6x), the biggest online sales day up to that point in the US.
Increased web traffic means the load will hit the systems hard, which in turn means pagers buzzing, alert notifications flying, grumpy stakeholders, unhappy customers, and much more. This is the worst scenario for a business: at the very time it should be making more money, it is losing customers and brand value.
Whether your servers crashed because of increased transactions per second or because page load time tripled, failed transactions can mean losses of thousands of dollars for every second of downtime. Downtime costs per minute are roughly $220K at Amazon and around $40K at Walmart, making outages scary and expensive.
The role of SRE / Infrastructure teams
Ask any engineer working in E-Commerce or Retail, and they will talk about the ‘capacity planning horror show’ they typically face during such peak seasons, with systems firing alerts all over the place. But it doesn’t need to be this way. This blog by Google Cloud talks about how teams can prepare early, perform testing, and leverage war rooms to quickly overcome downtime during peak season.
Adopting best practices and converting these learnings into action items will not only help on-call engineers / SREs enjoy a chaos-free holiday break, but it will also help them understand a thing or two about their customers and how systems respond to a periodic increase in footfall.
For example, if your systems received 1,000 qps (queries per second) during the peak hours of last year's Black Friday, and your business has grown by about 20% since then, you need to ensure your systems can handle 10%-30% more qps this Black Friday, that is, roughly 1,100 to 1,300 qps.
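The capacity arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a planning tool: the function name `peak_qps_targets` and the optional `headroom` safety margin are assumptions made for this example and do not come from the Google Cloud post.

```python
def peak_qps_targets(last_peak_qps, low_growth, high_growth, headroom=0.0):
    """Return the (low, high) qps range the system should be able to serve.

    last_peak_qps: peak queries per second observed during last year's event.
    low_growth / high_growth: expected growth as fractions (0.10 means 10%).
    headroom: optional extra safety margin as a fraction (an assumption of
              this sketch; 0.0 means no extra margin).
    """
    if last_peak_qps <= 0:
        raise ValueError("last_peak_qps must be positive")
    if low_growth > high_growth:
        raise ValueError("low_growth must not exceed high_growth")
    low = last_peak_qps * (1 + low_growth) * (1 + headroom)
    high = last_peak_qps * (1 + high_growth) * (1 + headroom)
    return round(low), round(high)


# The numbers from the example: 1,000 qps last year, 10%-30% expected growth.
low, high = peak_qps_targets(1000, 0.10, 0.30)
print(f"Plan for {low} to {high} qps")
```

Load tests can then be aimed at the upper end of the range instead of at last year's peak.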
So what can teams do to make the holiday season less chaotic?
- Learn from past incidents
- Load testing and Performance testing
- Observability: monitoring, logging, tracing, etc.
- SLO-based alerting: revisit SLOs and plan releases keeping in mind peak season traffic
- Other best practices: SRE automation, on-call, alerting and monitoring
Learn from past Incidents
Analyzing postmortems of past incidents will give you a fair idea of the limitations of your infrastructure. You will understand the breaking point of various services and how to tackle outages if they were to happen again.
The ideal way to take away learnings from past incidents is to make a checklist of action items from previous outages, and ensure they are addressed this time around.
Load testing and Performance testing to understand system thresholds
A good Site Reliability Engineering practice is to routinely perform:
- Load tests, to understand how much stress the systems can withstand, and
- Performance tests, to understand how the systems behave under normal load conditions.
By definition, load testing is the process of determining how a system behaves when many users access it at the same time. Its behavior under extreme load helps determine the breaking point, and answers questions such as whether the system can sustain a particular load and what its operating capacity is.
On the other hand, performance testing measures system attributes such as speed, scalability, reliability, and stability, and how the system adapts to change under normal load conditions. The idea here is to validate that the system performs efficiently when the load is both above and below the breaking point.
Ideally these should be ongoing practices rather than something done a few days before a major event, since a last-minute test will not give you a complete picture of system behaviour under stress. Routine load tests will help SREs be more prepared and assist them in scaling up or scaling out accordingly.
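To make the idea of a load test concrete, here is a minimal sketch using only the Python standard library. It fires concurrent calls at a target and reports failures and latency percentiles. The target `fake_checkout` is a hypothetical stand-in that just sleeps; in practice it would be replaced by a real request to the system under test, and a dedicated load testing tool would usually be a better fit than a hand-rolled script.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_checkout(_request_id):
    """Hypothetical stand-in for one request to the system under test."""
    time.sleep(0.005)  # simulate 5 ms of work
    return True


def timed_call(target, request_id):
    """Call target once and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        ok = bool(target(request_id))
    except Exception:
        ok = False  # count exceptions as failed requests
    return ok, time.perf_counter() - start


def run_load_test(target, total_requests, concurrency):
    """Send total_requests calls using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda i: timed_call(target, i), range(total_requests))
        )
    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Nearest-rank 95th percentile, clamped to a valid index.
    p95_index = min(len(latencies) - 1,
                    max(0, math.ceil(0.95 * len(latencies)) - 1))
    return {
        "requests": total_requests,
        "failures": failures,
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }


report = run_load_test(fake_checkout, total_requests=40, concurrency=8)
print(report)
```

Running the same test repeatedly with increasing `concurrency` and watching where failures or the 95th percentile latency start to climb gives a rough picture of the breaking point.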
Make Observability truly actionable
‘You can’t fix what you can’t see.’ This well-known saying applies directly to fixing production issues. One of the most important aspects of running systems effectively in production is making your system more observable and taking proactive measures when a red signal is flagged.
Having a clear view of your system makes early recognition and preemptive solving of problems possible. Getting the right data at the right time with associated context is a game changer for those who want better system stability. For example, if an outage occurs and an on-call engineer gets notified, it is ideal to give them more context on why the outage occurred. By referring to a dashboard with key system metrics recorded, they can debug the issue faster, reduce the duration of the outage, and bring down the overall MTTR.
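As a small illustration of the MTTR figure mentioned above, the sketch below computes it as the average time between detection and resolution across a set of incidents. The function name and the two incident timestamps are made up for this example, and teams define MTTR in slightly different ways (for instance, measuring from the start of the outage rather than from detection).

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Illustrative, made-up incidents: one took 30 minutes, the other 50 minutes.
incidents = [
    (datetime(2020, 11, 27, 9, 0), datetime(2020, 11, 27, 9, 30)),
    (datetime(2020, 11, 27, 14, 0), datetime(2020, 11, 27, 14, 50)),
]
mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number before and after adding context to alerts is one way to check whether the extra observability is actually shortening outages.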
There are different kinds of observability tools, such as log aggregation, APM, time series databases, distributed tracing, and metrics collection tools, each covering different system signals. The table below will give you a better understanding of the different tools and how/when SREs can use them. (read more here: https://www.squadcast.com/blog/top-observability-tools-for-devops-engineers-and-sres)