The article discusses how GitLab improved quality of life for its on-call SREs by reducing the number of pages they receive during service-wide degradation.
GitLab achieved this by grouping alerts by service and by introducing service dependencies for alerting and paging. The on-call engineer now receives a single page per service that lists the affected SLIs, rather than a separate page for each SLI.
Service dependencies suppress alerts on downstream services when alerts are already firing for their upstream services. Together, these changes have produced an overall downward trend in pages for the on-call engineer.
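
The two mechanisms described above, grouping by service and suppressing on dependencies, can be illustrated with a minimal Python sketch. This is not GitLab's implementation: the service names, SLI names, the `DEPENDS_ON` map, and the `pages_for` function are all hypothetical, and the sketch assumes that a "downstream" service is one that depends on an "upstream" service.

```python
from collections import defaultdict

# Hypothetical firing alerts as (service, SLI) pairs; names are illustrative only.
firing = [
    ("web", "apdex"),
    ("web", "error_rate"),
    ("database", "error_rate"),
    ("database", "apdex"),
]

# Hypothetical dependency map: downstream service -> upstream services it depends on.
DEPENDS_ON = {"web": {"database"}}


def pages_for(firing_alerts, depends_on):
    """Return one page per service, skipping services whose upstream is already alerting."""
    # Step 1: group the firing SLIs by service, so each service yields at most one page.
    slis_by_service = defaultdict(list)
    for service, sli in firing_alerts:
        slis_by_service[service].append(sli)

    # Step 2: drop the page for any service that has an upstream service also firing.
    pages = {}
    for service, slis in slis_by_service.items():
        upstream = depends_on.get(service, set())
        if any(u in slis_by_service for u in upstream):
            continue  # the upstream page already covers this degradation
        pages[service] = sorted(slis)
    return pages


pages = pages_for(firing, DEPENDS_ON)
print(pages)  # {'database': ['apdex', 'error_rate']}
```

In this toy scenario, four firing alerts collapse into a single page for `database`, because the `web` alerts are suppressed while their upstream is alerting. Without the dependency map, the same input would produce two pages, one per service.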