Preface
In 2013, a team at Twitter, struggling with the complexities of managing a growing network of hundreds of interconnected services across its datacenters, published a blog post titled "Observability at Twitter." As their infrastructure scaled, traditional monitoring approaches proved insufficient, a fact that became painfully clear during incidents. Engineers found themselves unable to answer critical questions about system behavior, prompting the development of a unified observability stack to provide deep visibility into system performance and service interactions.
At that time, Twitter's observability stack gathered over 170 million metrics per minute. It used TwitterServer, Finagle, and a host agent for system-level data, retrieving metrics from tens of thousands of endpoints via HTTP. A custom time-series database (TSDB) backed by Cassandra ingested 500 million writes per minute, aggregating and indexing data in real time. Scribe and HDFS handled batch metric processing. A custom query language powered 400,000 queries per minute for dashboards and alerts. During critical events, engineers could enable 1-second resolution metrics. The alert system evaluated over 10,000 queries per minute. This made Twitter's monitoring system one of the largest-scale time-series metric implementations of its time.
This system provided a window into the health and performance of Twitter's massive infrastructure. It enabled rapid responses to incidents and data-driven optimization. The volume of data processed highlighted the challenges of maintaining observability at this scale and required innovations in data collection, storage, querying, and alerting. Faced with the limitations of traditional monitoring systems that focused primarily on predefined alerts, Twitter's engineers articulated a new paradigm: the ability to ask arbitrary questions about a system's internal state, including questions that might not have been anticipated during the initial instrumentation phase. This shift in perspective was profound, moving monitoring from a reactive, alert-driven approach to a proactive, investigative one.
The blog post described how Twitter's stack was built and operated, helped popularize the concept of "observability" in the broader tech community, and contributed to a shift in how complex systems are understood and managed.
Observability rests on three fundamental cornerstones: metrics, logs, and traces. Each contributes uniquely to a comprehensive understanding of a system's health, performance, and behavior. Logs provide detailed textual records of events, offering granular insight into specific operations. Traces follow requests as they propagate through various services, highlighting performance bottlenecks and dependencies. Of the three, metrics have become particularly prominent in the context of observability, largely due to their efficiency, scalability, and ability to provide high-level, at-a-glance insights. They offer a crucial summary of system behavior and enable engineers to quickly identify trends and anomalies before investigating the more detailed information provided by logs and traces.
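To make the three signals concrete, here is a minimal, stdlib-only Python sketch of instrumenting a single request handler with all three. The service name, paths, and counter are illustrative assumptions; a real deployment would use a Prometheus client library for metrics and a dedicated tracing library rather than hand-rolled timers.

```python
import logging
import time
from collections import Counter

# Log: discrete, human-readable event records.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

# Metric: a cheap numeric aggregate, queryable at a glance.
request_count = Counter()

def handle_request(path: str) -> float:
    """Instrument one request with a metric, a log line, and a span duration."""
    span_start = time.monotonic()                # trace: span begins
    request_count[path] += 1                     # metric: increment a counter
    log.info("handling request path=%s", path)   # log: record the event
    # ... real request handling would happen here ...
    return time.monotonic() - span_start         # trace: span duration

handle_request("/cart")
handle_request("/cart")
print(request_count["/cart"])  # → 2
```

The asymmetry the preface describes is visible even here: the counter stays one integer no matter how many requests arrive, while each log line and span grows with traffic, which is why metrics scale so cheaply for at-a-glance summaries.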
Observability with Prometheus and Grafana
A Complete Hands-On Guide to Operational Clarity in Cloud-Native Systems
