@squadcast · Dec 08, 2024 · 8 min read · Originally posted on www.squadcast.com
Effective communication is critical during incident management to maintain trust and minimize the impact of outages. This blog emphasizes the importance of clear, timely, and honest communication between technical teams, business teams, and customers. By addressing common pitfalls in communication and outlining best practices such as direct communication, status updates, and post-mortems, organizations can foster teamwork and customer confidence. The blog also highlights how Squadcast simplifies incident communication with tools like Incident Notes, StatusPage, and unified incident response workflows, enabling businesses to handle outages transparently and efficiently.
Communication is key. This is true in all aspects of life, and it especially applies to managing critical incidents in your company. Managing communication effectively with your customers can ensure goodwill is maintained and they continue to use your product; alternatively, failing to keep your customers informed can result in angry customers and a loss of business. Building and maintaining good communication channels within your company and with your customers is key to ensuring your product continues to be patronised. Done properly, when (inevitable) outages occur, the impact (both technological and emotional) is limited.
In a technical sense, proper communication channels help technical teams, who may be unaware of how each other operates, work together efficiently to action and resolve problems quickly. In an emotional sense, business teams, management, and customers will be happy to know that their data and technology are in the hands of people who know what they're doing. Levels of comfort among these stakeholders are enhanced by being included in the process, having their concerns proactively acknowledged, and being treated as equals by technical teams.
Many "techies" enjoy communicating using complex, domain-specific language when managing their services, and this rarely poses a problem in the day-to-day of their jobs. However, when a service-impacting incident is underway, it's not just you and your team members who are fixing it: everyone is. In the same way spectators at a football game cheer on their home team, your managers and customers are there to provide you with the support you need to resolve issues. But when you speak to them in complex language that is difficult to understand, you unintentionally gate-keep, which prevents those same people from assisting you because they just don't understand what you're saying. In the same way, product managers and marketing have a tendency to "sanitise" public communication to customers; by the time information makes its way into customers' inboxes, it's functionally useless.
Consider the following: "our internet service provider has published incorrect routing information, which means that everything on our internal network does not know how to reach the internet. We can fix it by temporarily overriding the incorrect routing information, but our ISP will need to correct their configuration". This explanation uses very clear, common phrases to explain the issue: internet service provider, internal network, internet, route. It contains no overtly technical information, and it also provides methods for resolution. Ironically, I have very rarely seen an example of clear communication from most technical staff. Usually, a manager comes along with a question like "why is the network down?", and the following answers are given:
The first response lacks any clarifying information and doesn't provide any additional context to the question. The last two cases are so technical that, without domain-specific knowledge, anyone who is not on your team, including people in other technical teams, will not understand them. This example was adapted from the service outage review conducted by Cloudflare for their outage on the 17th of July 2020. I'm a customer with Cloudflare, and the way they clearly communicated their understanding of the issue, the steps they took to resolve it, and the post-mortem of the issue gave me the confidence to continue being their customer. If they had responded with "the service is out, we're looking into it", I would have moved to a better provider. This is usually what happens when these simplistic, pointless updates are given, because people lose trust in a service provider to actually do their job.
Hot tip: customers already know the service is out. They don't need reassurance that the outage is occurring; they need assurance that the service is being fixed.
These pointless updates are usually caused by technical teams who provide little-to-no context to product managers and external communications. In turn, these teams do the same for customers. Conversely, it is possible to be "too communicative", whereby you notify customers of outages to individual infrastructure that is redundant or will not impact the customer in a material way.
There are a few keys to communicating with your customers that will ensure that they remain customers:
These four behaviours ultimately result in a better experience overall, and they develop and enhance the relationship you have with your customers. They change the dynamic from "us vs. them" to "we are in this together".
Now that we understand what we need to communicate, we need to know how to convey that information to those who need it. During an outage, there are four methods of communication, each building on the last, to provide each audience with the appropriate information they need to do their jobs.
Whether this is in-person, via chat, or conference, this communication happens directly between the people âon the groundâ fixing the issues. This should be highly technical to give technical staff the information they need to fix technical problems.
A War Room is a place where technical and non-technical staff come together to provide updates and discuss a critical outage. Generally, updates should only be provided when the status of an incident changes (e.g. the cause is discovered or service restoration is beginning). An Incident Commander (IC) should also provide non-technical updates in the incident notes to ensure that appropriate communications can be drafted for all stakeholders.
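The "only update when the status changes" rule can be illustrated with a minimal Python sketch. This is not a Squadcast API; the `IncidentLog` class, its `report` method, and the status names are all hypothetical and exist only to show the idea of skipping updates that carry no new status.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentLog:
    """Hypothetical log that records an update only when the status changes."""
    last_status: Optional[str] = None
    updates: List[str] = field(default_factory=list)

    def report(self, status: str, summary: str) -> bool:
        """Record a plain-language update if `status` differs from the last one.

        Returns True when the update was recorded, False when it was skipped.
        """
        if status == self.last_status:
            return False  # nothing changed, so stay quiet
        self.last_status = status
        self.updates.append(f"[{status}] {summary}")
        return True


log = IncidentLog()
log.report("investigating", "Our internal network cannot reach the internet.")
log.report("investigating", "Still looking.")  # skipped: status unchanged
log.report("identified", "Our ISP published incorrect routing information.")
```

Only the first and third calls add an entry, which keeps the update stream short enough for non-technical staff to follow.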
A public Status Page should be made available to all customers and potential customers of your service. This is vital as it ensures that your service is fully transparent, and it also provides a central place for your customers to find information in the case of an outage. The information here should be non-technical in nature and provide customers with what they need to make critical decisions regarding their own services.
Within 48 hours of an incident being resolved, you should provide a full postmortem of the incident on your blog and/or status page. This provides your customers with a full understanding of why an incident has occurred, how it was resolved, and actions that can be taken to limit similar outages in the future. This is a blend of non-technical and technical, with a business summary at the start followed by a technical analysis. Customers should also be invited to ask questions about the incident on social media channels to ensure that any concerns they have are addressed.
Squadcast has many features that can enable you to keep your customers engaged and informed during an outage. Implementing these into your incident management process is very simple, and they integrate into your existing Squadcast usage.
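If you want to track the 48-hour target mechanically, a small sketch like the following could work. The function names and the example timestamps are hypothetical; only the 48-hour window comes from the guidance above.

```python
from datetime import datetime, timedelta, timezone

# The 48-hour window suggested above for publishing a postmortem.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_due(resolved_at: datetime) -> datetime:
    """Return the latest time the postmortem should be published."""
    return resolved_at + POSTMORTEM_WINDOW


def is_overdue(resolved_at: datetime, now: datetime) -> bool:
    """True if the postmortem window has already passed at `now`."""
    return now > postmortem_due(resolved_at)


# Example timestamps, chosen only for illustration.
resolved = datetime(2024, 12, 8, 9, 30, tzinfo=timezone.utc)
due = postmortem_due(resolved)
```

Using timezone-aware timestamps avoids ambiguity when responders and customers are in different regions.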
Incident Notes (previously War Room) is an excellent tool for keeping everyone up-to-date. Use this effectively to drive inter-team communication by having an Incident Commander (IC) who can translate between technical and non-technical staff. Ensure that all communications are addressed to teams or individuals so that nothing is missed. Finally, be sure to send your account managers, support staff, and management to War Rooms for updates; your IC should be providing non-technical updates as the status of your incident changes.
StatusPage is Squadcast's tool for providing public updates to your customers. StatusPage allows you to provide updates from within Squadcast's Incident Page, reducing the need for your team to jump between tools to provide customer updates. Users can simply select the option to Update the StatusPage, provide a status and message for the incident and publish it to customers. Having such an easily accessible solution for support staff means that communication processes can be augmented without adding burden or extra work. It's all conveniently located in one central place.
The Incidents Page should be your one-stop-shop for all information pertaining to an incident. Your post-mortem should derive all of its information from this page, and staff should be encouraged to ensure that technical and non-technical updates are adequately managed within an incident. By doing this, technical staff can be easily removed from the external communications process (which they probably find boring) and communications staff know they can rely on the information they can obtain via the incident.
Making your organisation more transparent is not always an easy process, but using some of the tips and tools we've provided in this article will give you an idea of how to begin. The core message is that you need to make communication a cultural pillar for your organisation. Don't just write a procedure that says that "staff should communicate with each other"; encourage communication in every part of your organisation. When outages occur, get everyone on a phone call. Have your communications teams sit with technical staff to understand how the business runs. Encourage customers to follow up with your team for information following an outage. There are many things you can do to get started, but the most important thing is that you do something!
Squadcast is an Incident Management tool that's purpose-built for SRE. Get rid of unwanted alerts, receive relevant notifications and integrate with popular ChatOps tools. Work in collaboration using virtual incident war rooms and use automation to eliminate toil.