Post-mortem template
This runbook is published verbatim from fremverk’s internal operator documentation; some internal identifiers (cluster IDs, credential paths, internal contact handles) have been redacted. The timing commitments (RPO, RTO, paging cadence) are unchanged from the operator-side procedure.
This is the canonical post-mortem template for SEV-1 + SEV-2 incidents. Per SLA §11.4, a post-mortem MUST be published on the status page within 14 days of any SEV-1; this template is the source format that gets rendered to a status-page entry by the operator.
Filed copies live in runbooks/post-mortem-archive/<YYYY-MM-DD>-<short-slug>.md, date-prefixed so chronological listing is the natural sort order.
Publishing procedure
Customer-facing version (status.frem.sh) is a markdown render of the same file with the “Operator-only” sections stripped. Operator copy stays in repo; customer-facing copy gets posted to the status page via the Kuma admin UI Incidents feature, manually. SLA §11.4 commits publication within 14 days of any SEV-1, the steps below are the rehearsed manual procedure.
One-shot publication (per incident)
- Strip operator-only sections from the archived post-mortem. Anything in a section flagged
<!-- operator-only -->…<!-- /operator-only -->is removed. - Reach Kuma admin (port-forward via the operator’s kubectl access). Navigate to the Kuma dashboard and sign in.
- Open the status page, click Edit Status Page (top-right), then click Create Incident in the editor.
- Fill in:
- Title, match the post-mortem front-matter
title - Style,
Majorfor SEV-1,Minorfor SEV-2,Maintenancefor planned events (post-hoc maintenance doesn’t fit the post-mortem flow; use only for SEV). - Content, paste the stripped markdown. Kuma renders markdown natively; verify the preview before publish.
- Title, match the post-mortem front-matter
- Click Save then Click Publish. The incident lands at
https://status.frem.sh/immediately (Kuma caches ~30s; Bunny CDN cache TTL on the status page is 1 min). Verify by fetching the public URL, the new entry should appear in the “Past Incidents” list. - Update the post-mortem front-matter, set
status: publishedandpublic-url:to the new status-page URL.
Why no auto-publisher
A status-page automation was scoped at one point but deferred: the operator’s review pass before publishing (stripping operator-only sections, verifying the markdown renders cleanly, choosing severity) is a small fraction of the total post-mortem time. The 14-day SLA window is comfortably wider than any automation gain. Revisit if the operator team grows past ~3 people or if incident volume exceeds ~1/month.
Evidence capture
Before publishing, archive screenshots of any operator dashboards / observability panels referenced in the post-mortem to a date-stamped evidence folder. These DON’T appear on the customer-facing page, but they back the customer-facing claims if a customer asks for detail and provide context for any future similar incident.
Template body
---
title: <one-line incident title>
date: <YYYY-MM-DD of incident-start>
severity: SEV-1 | SEV-2
duration: <human readable, e.g. "47 minutes">
status: published | draft
public-url: https://status.frem.sh/incidents/<slug>
owner: <incident commander>
---
# <Same title as front-matter>
## TL;DR
Two-sentence summary, written for a customer reading on the status page.
What happened, what was the customer impact, what was the resolution.
No internal jargon, no hostnames a customer doesn't recognise. If this
section reads to a non-engineer like the rest of the post-mortem could
be skimmed safely, you've nailed it.
## Customer impact
| Surface | Window (UTC) | Severity | Behaviour |
|---|---|---|---|
| <e.g. Forgejo web> | 2026-MM-DD HH:MM – HH:MM | total outage / partial | what users saw |
| <e.g. Git HTTPS> | … | … | … |
Affected tenants: <count> (or "all", or "tenants on plan X only").
SLA credits: per [SLA §6.2](https://www.frem.sh/legal/sla/), customers whose
monthly availability dipped below the 99.5% commitment are entitled to a
service credit. The billing cron applied credits automatically on
`<YYYY-MM-DD>` at the next invoice cycle.
## Timeline (UTC)
| Time | Actor | Event |
|---|---|---|
| HH:MM | system | First failure detected by `<alarm name>`. |
| HH:MM | <on-call> | Acknowledged page; began investigation. |
| HH:MM | <on-call> | Identified root cause (… or "still investigating"). |
| HH:MM | <on-call> | Applied fix `<commit / kubectl / tofu apply>`. |
| HH:MM | system | Health checks recovered; alarm cleared. |
| HH:MM | <on-call> | Posted public notice on status.frem.sh. |
Use UTC consistently — Datatilsynet correspondence requires it. Show
local Europe/Copenhagen in parentheses for the operator's own clarity.
## Root cause
Plain-English explanation of what was broken. Include:
- **What changed?** A deploy, a config flip, an external vendor incident,
a slow-rolling resource exhaustion?
- **Why didn't tests catch it?** If a test should have, that's a follow-up.
- **What signal did we miss?** If a metric or alarm should have fired
earlier, that's a follow-up.
Two-paragraph max — if it's longer, split into "what" + "why" sub-headers.
## Detection
How was the incident first noticed?
- [ ] LTS keyword alarm (which one?)
- [ ] CES metric alarm (which one?)
- [ ] Customer report (channel — email, ticket, social?)
- [ ] Operator caught it during another task
- [ ] Routine probe / synthetic check
If detection was from a customer rather than our own monitoring, that's a
follow-up: "what alarm should have fired before the customer noticed?"
## Resolution
What was actually done to bring the system back. Include:
- The exact `kubectl` / `tofu apply` / restart command, OR a commit SHA + PR.
- Whether a workaround was applied first (cited as a "stop the bleeding"
step) before a proper fix.
- Whether the fix was reverted later (some hotfixes are rollback-on-stable).
## What went well
Even on a bad incident, something usually went right. Three bullets:
- The on-call ack was in 12 minutes (vs SLA target 30).
- The runbook for `<symptom>` was current and saved 20 minutes of debug.
- The `<X>` alarm fired before customer reports landed — meaning monitoring
is in the right place.
## What went poorly
Be specific. "Communication was poor" is too vague. Better:
- "We didn't post a status update for the first 38 minutes; customers
who hit the issue had no signal."
- "The runbook said kubectl exec into `<pod>`, but the pod no longer
exists in the new cluster topology — runbook drift."
- "The alarm threshold (`<X>`) was tuned for the old volume of `<Y>` and
is now firing 4× per week with false positives."
## Action items
Each one is a real PR or runbook ticket — not "do better next time".
| ID | What | Owner | Severity | Target | Status |
|---|---|---|---|---|---|
| AI-1 | Add CES alarm on `<metric>` so a similar incident pages within 5min instead of 38min | <on-call> | P1 | 2026-MM-DD | open |
| AI-2 | Update `runbook-X.md` to reflect cluster topology changes | <on-call> | P2 | next ops weekly | open |
| AI-3 | Backfill SLA credits for the 3 affected tenants | <on-call> | P0 | next billing cycle | open |
## References
- Status-page entry: <URL>
- Related Datatilsynet notification (if any)
- Vendor incident reports (if any)
- PR / commit linksWorkflow
- Within 48h of incident close: file a draft with
status: draft. Use this template; fill what you know. - Within 7 days: complete the document. Brief review with the team (asynchronous OK at this scale).
- Within 14 days: publish the customer-facing version (template body minus “Operator-only” section) on status.frem.sh. Mark
status: published+ addpublic-url. - Within 21 days: action items materialise as PRs / runbook updates / tofu changes. Update the table to mark each shipped.
Special cases
- Datatilsynet-notifiable Class A incidents (e.g. credential compromise, audit-chain tampering, cross-tenant data leak): the post-mortem template is filed as above AND a parallel Datatilsynet notification follows the controller-incident-response procedure. The two documents share evidence, keep them synchronised.
- Vendor-caused incidents (Bunny outage, payments-vendor outage, T Cloud regional event): file the post-mortem locally even though we didn’t cause it; the action items are usually about our DETECTION + COMMS rather than the vendor’s fix. Cross-reference vendor’s RCA when published.
- Near-misses (alarm fired but no customer impact): file an OPTIONAL “near-miss” post-mortem, same template, severity tier marked
near-miss. Useful when an alarm threshold prevented a real incident; the action items are usually about validating the threshold’s continued correctness.
Related
- SLA §11.4, 14-day publication commitment
- On-call rotation (operator runbook, available under NDA via
compliance@frem.sh), who acks the page - Incident comms template, what goes on status.frem.sh during the incident