# Security Incident Response
When things go wrong. Learn how to detect, contain, and recover from security incidents.
## 🎯 What You'll Learn
- Understand the incident response lifecycle
- Learn detection and containment strategies
- Know how to investigate incidents
- Recover and learn from incidents
- Build an incident response plan
## When Things Go Wrong
Security incidents happen — breaches, malware, data leaks. What matters is how you respond.
Fast, effective response minimizes damage. Panic and improvisation make it worse.
## The Incident Response Lifecycle
Preparation → Detection → Containment → Eradication → Recovery → Lessons
---
## Phase 1: Preparation
Before incidents happen:
### Build Your Team
| Role | Responsibility |
|------|----------------|
| Incident Commander | Decisions, communication |
| Technical Lead | Investigation, containment |
| Communications | Internal/external messaging |
| Legal | Compliance, notification |
### Create Runbooks
Pre-written playbooks for common scenarios:
- Malware infection
- Data breach
- DDoS attack
- Account compromise
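A runbook can be as simple as an ordered list of steps keyed by scenario. The sketch below is a minimal illustration, not a prescribed format: the `RUNBOOKS` dictionary and the `next_step` helper are hypothetical, and the steps are borrowed from the containment and eradication actions covered in Phases 3 and 4.

```python
# Hypothetical runbook registry: scenario name -> ordered response steps.
RUNBOOKS = {
    "malware infection": [
        "Isolate the affected system",
        "Preserve logs and other evidence",
        "Remove the malware",
        "Reset compromised credentials",
    ],
    "account compromise": [
        "Disable the affected account",
        "Preserve logs and other evidence",
        "Reset compromised credentials",
    ],
}


def next_step(scenario, completed):
    """Return the first runbook step not yet completed, or None when all are done."""
    for step in RUNBOOKS[scenario.lower()]:
        if step not in completed:
            return step
    return None


print(next_step("Malware infection", {"Isolate the affected system"}))
```

Keeping the steps as plain data makes the runbook easy to review and update after each post-mortem.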
### Prepare Tools
- Log aggregation (ELK, Splunk, or simpler alternatives)
- Forensic tools
- Communication channels (out-of-band — if email is compromised, you need a backup)
- Contact lists
---
## Phase 2: Detection
Recognize that an incident is occurring.
### Detection Sources
| Source | Example |
|--------|---------|
| Monitoring alerts | Unusual login patterns |
| User reports | "My computer is acting weird" |
| External notification | Researcher, customer, attacker |
| Log analysis | Failed auth spikes |
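As one illustration of the "failed auth spikes" source, the sketch below counts failed logins per minute and reports the minutes that reach a threshold. The log format, the `LOGIN_FAILED` event name, the sample lines, and the threshold are all assumptions made for the example; real logs will need their own parsing.

```python
from collections import Counter

# Assumed log format: "<ISO timestamp> <event> user=<name>"; real logs will differ.
SAMPLE_LOG = [
    "2024-01-01T10:00:05 LOGIN_FAILED user=alice",
    "2024-01-01T10:00:40 LOGIN_OK user=bob",
    "2024-01-01T10:01:02 LOGIN_FAILED user=alice",
    "2024-01-01T10:01:10 LOGIN_FAILED user=alice",
    "2024-01-01T10:01:30 LOGIN_FAILED user=root",
    "2024-01-01T10:01:55 LOGIN_FAILED user=admin",
]


def failed_auth_spikes(lines, threshold):
    """Return {minute: count} for minutes with at least `threshold` failed logins."""
    per_minute = Counter()
    for line in lines:
        timestamp, event = line.split()[:2]
        if event == "LOGIN_FAILED":
            per_minute[timestamp[:16]] += 1  # bucket by "YYYY-MM-DDTHH:MM"
    return {minute: n for minute, n in per_minute.items() if n >= threshold}


print(failed_auth_spikes(SAMPLE_LOG, threshold=3))
```

A fixed threshold is the simplest possible rule; in practice it would be tuned against normal traffic for the system being monitored.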
### Triage Questions
1. What is happening?
2. When did it start?
3. What systems are affected?
4. Is it ongoing?
5. What's the potential impact?
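The five questions above can be captured as a small record so that nothing is skipped under pressure. This is a sketch only: the `TriageRecord` class and its field names are hypothetical, chosen to mirror the questions one to one.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TriageRecord:
    """One field per triage question; None means 'not answered yet'."""
    what_is_happening: Optional[str] = None
    when_started: Optional[str] = None
    systems_affected: Optional[str] = None
    is_ongoing: Optional[bool] = None
    potential_impact: Optional[str] = None

    def unanswered(self):
        """List the questions that still need an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


record = TriageRecord(what_is_happening="Weird files on a desktop", is_ongoing=True)
print(record.unanswered())
```

Using `None` for "unknown" keeps an explicit `False` for `is_ongoing` distinct from an unanswered question.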
---
## Phase 3: Containment
Stop the bleeding.
### Short-Term Containment
| Action | Purpose |
|--------|---------|
| Isolate system | Prevent lateral movement |
| Block IPs | Stop ongoing attack |
| Disable accounts | Prevent access |
| Preserve evidence | Don't destroy logs |
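```bash
# Example: isolate a host at the network layer (emergency use only; document first)
# Warning: this also cuts off your own SSH session, so run it from a local console.
iptables-save > /root/iptables-pre-isolation.rules  # keep a record of the old rules (path is an assumption)
iptables -I INPUT 1 -j DROP    # insert at the top so earlier ACCEPT rules cannot override it
iptables -I OUTPUT 1 -j DROP
# Or: move the host to a quarantine VLAN via the network team
```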
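One way to support the "preserve evidence" action is to record a hash of each collected file at the moment of collection, so later analysis can show the copy was not altered. The sketch below is a minimal example under stated assumptions: the `evidence_manifest` function and the manifest fields are hypothetical, and the demo hashes a throwaway file rather than real evidence.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def evidence_manifest(paths):
    """Return one entry per file with its SHA-256 digest, size, and collection time."""
    entries = []
    for path in map(Path, paths):
        data = path.read_bytes()  # fine for small files; hash in chunks for large ones
        entries.append({
            "file": str(path),
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })
    return entries


# Demo on a throwaway file standing in for a collected log copy.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "auth.log"
    sample.write_bytes(b"sample log line\n")
    manifest = evidence_manifest([sample])
    print(json.dumps(manifest, indent=2))
```

Store the manifest somewhere other than the affected system so it cannot be altered along with the evidence.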
### Long-Term Containment
Keep business running while you investigate:
- Temporary workarounds
- Rebuild affected systems in parallel
- Monitor for re-infection
---
## Phase 4: Eradication
Remove the threat completely.
### Find Root Cause
- How did they get in?
- What did they access?
- What did they leave behind (persistence mechanisms)?
### Clean Up
- Remove malware
- Close vulnerabilities
- Reset compromised credentials
- Patch exploited systems
---
## Phase 5: Recovery
Return to normal operations.
### Restore Safely
1. Verify system is clean
2. Restore from known-good backup
3. Monitor closely after restoration
4. Gradual return to production
### Validation
- Systems functioning correctly
- Security controls in place
- No signs of persistent access
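The validation list above can be treated as a gate: production traffic returns only when every check passes. The sketch below shows the idea with hypothetical stand-in checks; in a real environment each lambda would be replaced by a query against monitoring, scanning, or configuration tooling.

```python
def ready_for_production(checks):
    """Run every named check; return (all_passed, names_of_failed_checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)


# Hypothetical stand-ins; the hard-coded results are for illustration only.
checks = {
    "systems functioning correctly": lambda: True,
    "security controls in place": lambda: True,
    "no signs of persistent access": lambda: False,
}

ok, failed = ready_for_production(checks)
print(ok, failed)
```

Running every check, rather than stopping at the first failure, gives the team the full list of what still blocks recovery.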
---
## Phase 6: Lessons Learned
Every incident is a learning opportunity.
### Post-Mortem Meeting
Hold the meeting within 1-2 weeks of the incident. Cover:
- What happened (timeline)
- What went well
- What could improve
- Action items
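A timeline becomes more useful in the meeting when it is turned into durations. The sketch below derives three of them from timestamped events. The event names, the `response_metrics` helper, and the timestamps are hypothetical and purely illustrative.

```python
from datetime import datetime

# Hypothetical timeline; the timestamps are illustrative, not real incident data.
timeline = {
    "initial_access": datetime(2024, 1, 1, 9, 0),
    "detection": datetime(2024, 1, 4, 9, 0),
    "containment_began": datetime(2024, 1, 4, 9, 30),
    "systems_restored": datetime(2024, 1, 5, 17, 0),
}


def response_metrics(events):
    """Derive a few durations from the timeline."""
    return {
        "dwell_time": events["detection"] - events["initial_access"],
        "time_to_contain": events["containment_began"] - events["detection"],
        "time_to_restore": events["systems_restored"] - events["detection"],
    }


for name, duration in response_metrics(timeline).items():
    print(f"{name}: {duration}")
```

Tracking the same durations across incidents shows whether detection and response are actually getting faster.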
### Document Everything
```markdown
# Incident Report: [Title]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Detection
- HH:MM - Containment began
- HH:MM - Root cause identified
- HH:MM - Systems restored
## Impact
- Systems affected
- Data exposed
- Duration
- Cost
## Root Cause
How the incident occurred.
## Response Actions
What was done to contain and eradicate.
## Lessons Learned
What to improve.
## Action Items
- [ ] Item 1 (Owner, Due date)
- [ ] Item 2 (Owner, Due date)
```
---
## Practice Exercises
### Exercise 1: Triage (Beginner)
An employee reports: “I can’t access my email and there are weird files on my desktop.”
What are your first 3 questions?
Answer
- When did you first notice this?
- Did you click any links or open attachments recently?
- Are your coworkers experiencing the same issue?
### Exercise 2: Containment (Intermediate)
You’ve confirmed malware on a developer’s laptop that has SSH access to production servers.
What containment actions do you take?
Answer
- Isolate the laptop from network
- Revoke developer’s SSH keys
- Check production access logs for suspicious activity during the infection window
- Force password reset on affected accounts
- Monitor for unusual production activity
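Checking access logs for the infection window amounts to filtering entries by user and time range. The sketch below shows that filter on made-up data: the entry shape, the user and host names, the commands, and the window boundaries are all assumptions for illustration.

```python
from datetime import datetime

# Assumed entry shape: (ISO timestamp, user, source host, command). Illustrative only.
ACCESS_LOG = [
    ("2024-01-02T08:15:00", "dev1", "laptop-17", "ls /srv/app"),
    ("2024-01-03T02:41:00", "dev1", "laptop-17", "scp db_dump.sql remote:"),
    ("2024-01-03T03:05:00", "ops2", "bastion", "systemctl status app"),
    ("2024-01-06T11:00:00", "dev1", "laptop-17", "git pull"),
]


def activity_in_window(entries, user, start, end):
    """Return the user's entries with start <= timestamp <= end."""
    return [
        e for e in entries
        if e[1] == user and start <= datetime.fromisoformat(e[0]) <= end
    ]


window_start = datetime(2024, 1, 3, 0, 0)
window_end = datetime(2024, 1, 5, 0, 0)
for entry in activity_in_window(ACCESS_LOG, "dev1", window_start, window_end):
    print(entry)
```

If the start of the infection is uncertain, widen the window rather than risk missing early activity.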
### Exercise 3: Post-Mortem (Advanced)
Write a brief post-mortem for this scenario:
- Attacker gained access via phishing
- Had access for 3 days before detection
- Exfiltrated customer database
## Knowledge Check
1. What are the six phases of incident response?
2. Why preserve evidence during containment?
3. What’s the difference between short-term and long-term containment?
4. Why do a post-mortem?
5. What’s the first thing you should do when you detect an incident?
Answers
1. Preparation, Detection, Containment, Eradication, Recovery, Lessons Learned.
2. For investigation and potential legal action. Destroying evidence makes it impossible to understand what happened and may complicate legal proceedings.
3. Short-term = immediate isolation. Long-term = temporary workarounds while you investigate and clean up properly.
4. Learn and improve. Understand what happened, what worked, what didn’t, so you’re better prepared next time.
5. Assess scope and impact before taking containment actions. Don’t panic and don’t act without documenting what you’re doing.
## Summary
| Phase | Goal |
|---|---|
| Preparation | Be ready before incidents |
| Detection | Recognize incidents quickly |
| Containment | Stop the damage |
| Eradication | Remove the threat |
| Recovery | Return to normal |
| Lessons | Improve for next time |
## What’s Next?
- Security Logging - Detection foundation
- Zero Trust - Reduce blast radius