Post-incident analysis helps Site Reliability Engineers (SREs) improve system resilience and prevent repeat outages. A set of practical prompts can guide a team through a thorough investigation and keep the review focused on improvement rather than blame. The seven prompts below cover the main stages of a post-incident review.
Prompt 1: Incident Timeline Reconstruction
Describe the sequence of events leading up to the incident. Include timestamps, system states, and user reports. What was the first sign of the issue, and how did it escalate?
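As a rough illustration of timeline reconstruction, the sketch below merges events from several sources and sorts them by timestamp. The TimelineEvent type, the source labels, and the sample events are hypothetical, chosen only for this example; real data would come from your own alerting, deployment, and support tools.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    timestamp: datetime  # timezone-aware, so events from different zones sort correctly
    source: str          # illustrative labels: "alert", "deploy", "user_report"
    description: str


def build_timeline(*sources):
    """Merge events from several sources into one chronologically ordered list."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda e: e.timestamp)


def first_sign(timeline):
    """Return the earliest event, or None if the timeline is empty."""
    return timeline[0] if timeline else None


# Hypothetical sample data, for illustration only.
alerts = [TimelineEvent(datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
                        "alert", "error rate above threshold")]
deploys = [TimelineEvent(datetime(2024, 1, 1, 9, 58, tzinfo=timezone.utc),
                         "deploy", "config change rolled out")]
reports = [TimelineEvent(datetime(2024, 1, 1, 10, 12, tzinfo=timezone.utc),
                         "user_report", "checkout failing")]

timeline = build_timeline(alerts, deploys, reports)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.description)
```

Keeping every timestamp timezone-aware avoids ordering mistakes when logs and reports come from systems in different zones.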
Prompt 2: Root Cause Identification
What was the underlying cause of the incident? Consider both technical failures and process gaps. Was there a specific code change, configuration error, or external factor?
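One way to start on the question of a specific code or configuration change is to list the changes applied shortly before the incident began. The sketch below is a minimal example under assumptions: the changes_before helper, the two-hour lookback window, and the sample change log are all hypothetical. Proximity in time is only a lead for investigation, not proof of cause.

```python
from datetime import datetime, timedelta, timezone


def changes_before(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes applied within `lookback` before the incident, newest first.

    `changes` is a list of (timestamp, description) tuples.
    The default lookback of two hours is an arbitrary assumption.
    """
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c[0] <= incident_start]
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Hypothetical change log, for illustration only.
incident_start = datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)
changes = [
    (datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc), "database minor version upgrade"),
    (datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc), "feature flag enabled"),
    (datetime(2024, 1, 1, 9, 58, tzinfo=timezone.utc), "load balancer config change"),
    (datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), "rollback of config change"),
]

candidates = changes_before(changes, incident_start)
for applied_at, description in candidates:
    print(applied_at.isoformat(), description)
```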
Prompt 3: Impact Assessment
Assess the scope and severity of the impact. Which users, services, or regions were affected? How long did the outage last, and what was the business impact?
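Two of the figures this prompt asks for, outage duration and the share of users affected, are simple to compute once the timestamps and counts are known. The sketch below assumes hypothetical helper names and made-up numbers; business impact usually needs further inputs that are not modeled here.

```python
from datetime import datetime, timezone


def outage_minutes(start, end):
    """Return the outage duration in minutes."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return (end - start).total_seconds() / 60


def affected_share(affected_users, total_users):
    """Return the fraction of users affected, between 0 and 1."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return affected_users / total_users


# Hypothetical figures, for illustration only.
start = datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 10, 47, tzinfo=timezone.utc)

duration = outage_minutes(start, end)
share = affected_share(affected_users=1500, total_users=20000)
print(f"Outage lasted {duration:.0f} minutes; {share:.1%} of users affected")
```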
Prompt 4: Detection and Response Evaluation
Evaluate the effectiveness of detection mechanisms and response actions. Were alerts timely and accurate? Did the team follow established incident response procedures?
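Whether alerts were timely can be made concrete with a few intervals: time to detect, time to acknowledge, and time to resolve. The sketch below is one possible way to compute them. The response_metrics function, its parameter names, and the sample timestamps are assumptions for illustration, and teams define these intervals in different ways.

```python
from datetime import datetime, timezone


def response_metrics(started, detected, acknowledged, resolved):
    """Return detection and response intervals as timedeltas.

    The interval definitions used here are one common convention,
    not a standard: adjust them to match your team's definitions.
    """
    stamps = [started, detected, acknowledged, resolved]
    if stamps != sorted(stamps):
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect": detected - started,
        "time_to_acknowledge": acknowledged - detected,
        "time_to_resolve": resolved - started,
    }


# Hypothetical timestamps, for illustration only.
metrics = response_metrics(
    started=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    detected=datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
    acknowledged=datetime(2024, 1, 1, 10, 9, tzinfo=timezone.utc),
    resolved=datetime(2024, 1, 1, 10, 47, tzinfo=timezone.utc),
)
for name, interval in metrics.items():
    print(name, interval)
```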
Prompt 5: Lessons Learned and Preventative Measures
Identify key lessons from the incident. What changes can be made to monitoring, alerting, or infrastructure? How will these improvements reduce the risk of recurrence?
Prompt 6: Communication and Stakeholder Engagement
Review how communication was handled during the incident. Were stakeholders kept informed? What can be improved in future communications?
Prompt 7: Documentation and Follow-up
Ensure all findings are documented clearly. Schedule follow-up actions and assign responsibilities. How will the team track progress on improvements?
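Tracking progress can be as simple as a list of action items with an owner and a due date. The sketch below is a minimal example. The ActionItem fields, the helper functions, and the sample items are hypothetical; in practice these records would usually live in an issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A follow-up task from the review (hypothetical structure)."""
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]


def completion_rate(items):
    """Return the fraction of items completed, or 0.0 for an empty list."""
    return sum(item.done for item in items) / len(items) if items else 0.0


# Hypothetical action items, for illustration only.
items = [
    ActionItem("Add alert on error rate", "alice", date(2024, 1, 15)),
    ActionItem("Document rollback procedure", "bob", date(2024, 1, 20), done=True),
    ActionItem("Load-test the new config", "carol", date(2024, 3, 1)),
]

today = date(2024, 2, 1)
for item in overdue(items, today):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
print(f"Completion rate: {completion_rate(items):.0%}")
```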
Conclusion
Using these prompts during post-incident reviews can surface deeper insights and strengthen system reliability. Encourage open discussion, thorough analysis, and continuous learning so that each incident leaves the infrastructure more resilient.