Post-mortem template

This runbook is published verbatim from fremverk’s internal operator documentation; some internal identifiers (cluster IDs, credential paths, internal contact handles) have been redacted. The timing commitments (RPO, RTO, paging cadence) are unchanged from the operator-side procedure.

This is the canonical post-mortem template for SEV-1 + SEV-2 incidents. Per SLA §11.4, a post-mortem MUST be published on the status page within 14 days of any SEV-1; this template is the source format that gets rendered to a status-page entry by the operator.

Filed copies live in runbooks/post-mortem-archive/<YYYY-MM-DD>-<short-slug>.md, date-prefixed so chronological listing is the natural sort order.
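
The naming convention above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's tooling: the `slugify` rule (lower-case alphanumeric runs joined by hyphens) is an assumption, since the runbook does not define how the short slug is derived, and the function names are hypothetical.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Archive directory as named in the runbook.
ARCHIVE_DIR = PurePosixPath("runbooks/post-mortem-archive")


def slugify(title: str) -> str:
    """Assumed slug rule: lower-case alphanumeric runs joined by hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def archive_path(incident_start: date, title: str) -> PurePosixPath:
    """Build <YYYY-MM-DD>-<short-slug>.md under the archive directory."""
    return ARCHIVE_DIR / f"{incident_start.isoformat()}-{slugify(title)}.md"
```

Because ISO dates are zero-padded, a plain lexicographic sort of the file names is also a chronological sort, which is what the date prefix is for.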

Publishing procedure

The customer-facing version (status.frem.sh) is a markdown render of the same file with the “Operator-only” sections stripped. The operator copy stays in the repo; the customer-facing copy is posted manually to the status page via the Kuma admin UI Incidents feature. SLA §11.4 commits to publication within 14 days of any SEV-1; the steps below are the rehearsed manual procedure.

One-shot publication (per incident)

Strip operator-only sections from the archived post-mortem. Anything in a section flagged  …  is removed.
Reach Kuma admin (port-forward via the operator’s kubectl access). Navigate to the Kuma dashboard and sign in.
Open the status page, click Edit Status Page (top-right), then click Create Incident in the editor.
Fill in:
- Title: match the post-mortem front-matter title.
- Style: Major for SEV-1, Minor for SEV-2. Maintenance is for planned events (post-hoc maintenance doesn’t fit the post-mortem flow; use this procedure only for SEV incidents).
- Content: paste the stripped markdown. Kuma renders markdown natively; verify the preview before publishing.
Click Save, then click Publish. The incident goes live at https://status.frem.sh/ immediately, subject to caching (Kuma caches for ~30s; the Bunny CDN cache TTL on the status page is 1 min). Verify by fetching the public URL; the new entry should appear in the “Past Incidents” list.
Update the post-mortem front-matter: set status: published and set public-url: to the new status-page URL.
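
The stripping step at the start of this procedure could be scripted roughly as follows. This is a sketch under stated assumptions: the actual flag that marks operator-only sections is not reproduced here, so the function takes the start and end markers as parameters, and the marker strings in the usage note are placeholders, not the runbook's real ones. `strip_flagged` is a hypothetical helper name.

```python
def strip_flagged(markdown: str, start_marker: str, end_marker: str) -> str:
    """Drop every line from a start-marker line through its end-marker line.

    Markers are matched against whole lines (ignoring surrounding whitespace).
    An unclosed section raises, so a half-stripped file is never published.
    """
    kept = []
    skipping = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not skipping and stripped == start_marker:
            skipping = True  # start of an operator-only section
            continue
        if skipping:
            if stripped == end_marker:
                skipping = False  # end of the section; resume keeping lines
            continue
        kept.append(line)
    if skipping:
        raise ValueError(f"unclosed section: {start_marker!r} without {end_marker!r}")
    return "\n".join(kept) + "\n"
```

A call such as `strip_flagged(text, "<!-- operator-only -->", "<!-- /operator-only -->")` uses placeholder markers; substitute whatever flag the archived post-mortems actually carry. The manual preview check in Kuma still applies after stripping.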

Why no auto-publisher

A status-page automation was scoped at one point but deferred: the operator’s review pass before publishing (stripping operator-only sections, verifying the markdown renders cleanly, choosing severity) is a small fraction of the total post-mortem time. The 14-day SLA window is comfortably wider than any automation gain. Revisit if the operator team grows past ~3 people or if incident volume exceeds ~1/month.

Evidence capture

Before publishing, archive screenshots of any operator dashboards / observability panels referenced in the post-mortem to a date-stamped evidence folder. These DON’T appear on the customer-facing page, but they back the customer-facing claims if a customer asks for detail and provide context for any future similar incident.

Template body

---
title: <one-line incident title>
date: <YYYY-MM-DD of incident-start>
severity: SEV-1 | SEV-2
duration: <human readable, e.g. "47 minutes">
status: published | draft
public-url: https://status.frem.sh/incidents/<slug>
owner: <incident commander>
---

# <Same title as front-matter>

## TL;DR

Two-sentence summary, written for a customer reading on the status page.
What happened, what was the customer impact, what was the resolution.
No internal jargon, no hostnames a customer doesn't recognise. If a
non-engineer can read this section and safely skim the rest of the
post-mortem, you've nailed it.

## Customer impact

| Surface | Window (UTC) | Severity | Behaviour |
|---|---|---|---|
| <e.g. Forgejo web> | 2026-MM-DD HH:MM – HH:MM | total outage / partial | what users saw |
| <e.g. Git HTTPS> | … | … | … |

Affected tenants: <count> (or "all", or "tenants on plan X only").

SLA credits: per [SLA §6.2](https://www.frem.sh/legal/sla/), customers whose
monthly availability dipped below the 99.5% commitment are entitled to a
service credit. The billing cron applied credits automatically on
`<YYYY-MM-DD>` at the next invoice cycle.

## Timeline (UTC)

| Time | Actor | Event |
|---|---|---|
| HH:MM | system | First failure detected by `<alarm name>`. |
| HH:MM | <on-call> | Acknowledged page; began investigation. |
| HH:MM | <on-call> | Identified root cause (… or "still investigating"). |
| HH:MM | <on-call> | Applied fix `<commit / kubectl / tofu apply>`. |
| HH:MM | system | Health checks recovered; alarm cleared. |
| HH:MM | <on-call> | Posted public notice on status.frem.sh. |

Use UTC consistently — Datatilsynet correspondence requires it. Show
local Europe/Copenhagen in parentheses for the operator's own clarity.

## Root cause

Plain-English explanation of what was broken. Include:

- **What changed?** A deploy, a config flip, an external vendor incident,
  a slow-rolling resource exhaustion?
- **Why didn't tests catch it?** If a test should have, that's a follow-up.
- **What signal did we miss?** If a metric or alarm should have fired
  earlier, that's a follow-up.

Two-paragraph max — if it's longer, split into "what" + "why" sub-headers.

## Detection

How was the incident first noticed?

- [ ] LTS keyword alarm (which one?)
- [ ] CES metric alarm (which one?)
- [ ] Customer report (channel — email, ticket, social?)
- [ ] Operator caught it during another task
- [ ] Routine probe / synthetic check

If detection was from a customer rather than our own monitoring, that's a
follow-up: "what alarm should have fired before the customer noticed?"

## Resolution

What was actually done to bring the system back. Include:

- The exact `kubectl` / `tofu apply` / restart command, OR a commit SHA + PR.
- Whether a workaround was applied first (cited as a "stop the bleeding"
  step) before a proper fix.
- Whether the fix was reverted later (some hotfixes are rollback-on-stable).

## What went well

Even on a bad incident, something usually went right. Three bullets:

- The on-call ack was in 12 minutes (vs SLA target 30).
- The runbook for `<symptom>` was current and saved 20 minutes of debug.
- The `<X>` alarm fired before customer reports landed — meaning monitoring
  is in the right place.

## What went poorly

Be specific. "Communication was poor" is too vague. Better:

- "We didn't post a status update for the first 38 minutes; customers
  who hit the issue had no signal."
- "The runbook said kubectl exec into `<pod>`, but the pod no longer
  exists in the new cluster topology — runbook drift."
- "The alarm threshold (`<X>`) was tuned for the old volume of `<Y>` and
  is now firing 4× per week with false positives."

## Action items

Each one is a real PR or runbook ticket — not "do better next time".

| ID | What | Owner | Severity | Target | Status |
|---|---|---|---|---|---|
| AI-1 | Add CES alarm on `<metric>` so a similar incident pages within 5min instead of 38min | <on-call> | P1 | 2026-MM-DD | open |
| AI-2 | Update `runbook-X.md` to reflect cluster topology changes | <on-call> | P2 | next ops weekly | open |
| AI-3 | Backfill SLA credits for the 3 affected tenants | <on-call> | P0 | next billing cycle | open |

## References

- Status-page entry: <URL>
- Related Datatilsynet notification (if any)
- Vendor incident reports (if any)
- PR / commit links

Workflow

Within 48h of incident close: file a draft with status: draft. Use this template; fill what you know.
Within 7 days: complete the document. Brief review with the team (asynchronous OK at this scale).
Within 14 days: publish the customer-facing version (template body minus the “Operator-only” sections) on status.frem.sh. Mark status: published and add public-url.
Within 21 days: action items materialise as PRs / runbook updates / tofu changes. Update the table to mark each shipped.
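
The four deadlines above can be computed from a single timestamp. This is a minimal sketch, assuming every window is measured from incident close; the runbook states that explicitly only for the 48h draft, and SLA §11.4 defines the binding reference point for the 14-day publication. The milestone labels and the function name are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the workflow; all assumed to run from incident close.
MILESTONES = {
    "draft filed": timedelta(hours=48),
    "document complete": timedelta(days=7),
    "published on status page": timedelta(days=14),
    "action items shipped": timedelta(days=21),
}


def workflow_deadlines(incident_close: datetime) -> dict[str, datetime]:
    """Return each milestone's deadline in UTC for a timezone-aware close time."""
    if incident_close.tzinfo is None:
        raise ValueError("incident_close must be timezone-aware")
    close_utc = incident_close.astimezone(timezone.utc)
    return {name: close_utc + offset for name, offset in MILESTONES.items()}
```

Rejecting naive timestamps keeps the output in UTC, in line with the template's rule to use UTC consistently in the timeline.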

Special cases

Datatilsynet-notifiable Class A incidents (e.g. credential compromise, audit-chain tampering, cross-tenant data leak): the post-mortem template is filed as above AND a parallel Datatilsynet notification follows the controller-incident-response procedure. The two documents share evidence; keep them synchronised.
Vendor-caused incidents (Bunny outage, payments-vendor outage, T Cloud regional event): file the post-mortem locally even though we didn’t cause it; the action items are usually about our DETECTION + COMMS rather than the vendor’s fix. Cross-reference the vendor’s RCA once it is published.
Near-misses (alarm fired but no customer impact): file an OPTIONAL “near-miss” post-mortem, same template, severity tier marked near-miss. Useful when an alarm threshold prevented a real incident; the action items are usually about validating the threshold’s continued correctness.

SLA §11.4: 14-day publication commitment
On-call rotation (operator runbook, available under NDA via compliance@frem.sh): who acks the page
Incident comms template: what goes on status.frem.sh during the incident
