Incident comms template
This runbook is published verbatim from fremverk’s internal operator documentation; some internal identifiers (cluster IDs, credential paths, internal contact handles) have been redacted. The timing commitments (RPO, RTO, paging cadence) are unchanged from the operator-side procedure.
Use this template when a customer-facing incident is in progress. Goal: keep the audience informed at the cadence the SLA promises without leaking unconfirmed detail.
Channels (priority order)
- status.frem.sh, primary. Post updates every 30 min during a SEV-1, every 2h during a SEV-2, and always on resolution. Run by the on-call rota; rotate update duty between primary and secondary if the incident lasts longer than 4h.
- Customer comms-list, fires automatically on a status-page state change to “Major outage” or “Partial outage”. Manual trigger for SEV-2 if the impact spans >2 customers.
- Per-tenant audit-log entry, every customer-facing message also lands in the affected tenant’s audit log under incident.update so the customer’s procurement/compliance team has an exportable record.
Do not post to public Twitter/Mastodon during the incident. The trust page incident timeline at /trust/incidents/<slug>/ is the canonical record after the fact; status-page updates auto-archive there on resolution.
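The cadence rules above can be sketched as a small helper. This is a minimal illustration, not the operator tooling: the function and constant names are hypothetical, and only the intervals (30 min for SEV-1, 2h for SEV-2, rotation after 4h) come from this runbook.

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the channel list above; the names are illustrative.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(hours=2),
}
# Rotate update duty between primary and secondary once an incident exceeds this.
ROTATE_AFTER = timedelta(hours=4)


def next_update_due(severity: str, last_posted: datetime) -> datetime:
    """Return the latest time by which the next status-page update is due."""
    return last_posted + UPDATE_INTERVAL[severity]


def should_rotate_author(started: datetime, now: datetime) -> bool:
    """True once the incident has lasted longer than 4h."""
    return now - started > ROTATE_AFTER


if __name__ == "__main__":
    posted = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(next_update_due("SEV-1", posted).isoformat())
```

An unknown severity raises KeyError here, which seems preferable to silently picking a cadence the SLA does not define.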
Status-page update, initial detection
Status: Investigating
Affected components: <comma-separated list>
Posted: <UTC timestamp>
We are investigating reports of [user-visible symptom — e.g. "elevated error
rates on the api"] starting at [UTC timestamp]. Initial scope appears to be
[X% of tenants OR specific tenant subset OR full-service]. Engineering is
on the incident.
Next update: [UTC timestamp; 15-30 min from now]
Don’t speculate on cause. Don’t promise a fix-by time on the initial detection. The customer wants three things from this update: (1) we know about it, (2) we know how big it is roughly, (3) when to check back.
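If the initial post is ever generated rather than typed by hand, the template could be filled in along these lines. This is a sketch under assumptions: the function names and the timestamp format are hypothetical, and the only rule enforced is the 15-30 min next-update window stated in the template.

```python
from datetime import datetime, timedelta, timezone

INITIAL_TEMPLATE = """\
Status: Investigating
Affected components: {components}
Posted: {posted}
We are investigating reports of {symptom} starting at {started}. Initial scope appears to be {scope}. Engineering is on the incident.
Next update: {next_update}"""


def fmt_utc(ts: datetime) -> str:
    """Format a timezone-aware timestamp in UTC (the format itself is an assumption)."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def render_initial_update(components, symptom, scope, started, posted,
                          next_in=timedelta(minutes=30)):
    """Fill the initial-detection template; no cause and no fix-by time by design."""
    if not timedelta(minutes=15) <= next_in <= timedelta(minutes=30):
        raise ValueError("next update must be 15-30 min from now")
    return INITIAL_TEMPLATE.format(
        components=", ".join(components),
        posted=fmt_utc(posted),
        symptom=symptom,
        started=fmt_utc(started),
        scope=scope,
        next_update=fmt_utc(posted + next_in),
    )
```

The renderer deliberately has no field for a cause or an estimated fix time, so the initial post cannot carry either.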
Status-page update, investigation in progress
Status: Identified | Monitoring
Affected components: <list>
Posted: <UTC timestamp>
We have [identified | confirmed mitigation of] [one-sentence cause if
public-safe; "an internal infrastructure issue" if not]. [Mitigation
posture: what's degrading, what's working, what manual actions customers
can take if any.]
Next update: <UTC timestamp; 30-60 min from now if SEV-1, 2h if SEV-2>
A cause is public-safe when naming it doesn’t disclose a security weakness or pin blame on a single sub-processor whose PR team is unprepared. “Database connectivity” is fine; “[vendor X] has a regional outage” is fine if the vendor’s public status page shows a matching incident; “exploiting CVE-2026-XXXXX” is not, until patched.
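The public-safe rule lends itself to a pre-post check on the draft cause sentence. The sketch below is an assumption-laden illustration: the function name, its flags, and the regular expression are hypothetical, and a human still makes the final call. It only encodes the examples given above and falls back to the generic wording when a draft fails.

```python
import re

# Matches CVE-style identifiers, including placeholders such as CVE-2026-XXXXX.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\w{4,}", re.IGNORECASE)
FALLBACK = "an internal infrastructure issue"


def public_safe_cause(draft: str, *, cve_patched: bool = False,
                      names_vendor: bool = False,
                      vendor_status_public: bool = False) -> str:
    """Return the draft if it passes the public-safe rule, else the fallback."""
    # A CVE reference is not public-safe until the weakness is patched.
    if CVE_PATTERN.search(draft) and not cve_patched:
        return FALLBACK
    # Naming a vendor is fine only if its public status page shows the outage.
    if names_vendor and not vendor_status_public:
        return FALLBACK
    return draft
```

The vendor check is passed in as flags because deciding whether a draft names a sub-processor is a judgment the author makes, not something a pattern match can settle.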
Status-page update, resolution
Status: Resolved
Affected components: <list>
Posted: <UTC timestamp>
Started: <UTC timestamp>
Resolved: <UTC timestamp>
The incident is resolved. [One-sentence cause + one-sentence fix.]
[Customer-action ask if any — e.g. "Customers who saw failed CI runs
during the window can re-trigger from the runner-jobs page."]
We will publish a post-mortem at [trust page link] within 5 business days.
The post-mortem is the long-form. The resolution post is short. Don’t combine them.
Customer email
Wired to fire on a status-page state transition to “Major outage” or “Partial outage” via the existing webhook. A manual trigger from the operator console is planned for Phase 1.5; today, manual sends are operator-driven via the email-provider dashboard.
Subject: [fremforge incident] <one-line summary>
Body: short version of the status-page update, plus a CTA to the status page for live updates. Sender: info@fremverk.com. Reply-to: compliance@frem.sh (so a customer reply lands with the operator who can act, not in marketing).
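As an illustration of the header choices above, the message could be assembled with Python's standard email library. The function names and the https:// scheme on the status-page link are assumptions; the subject prefix, sender, reply-to, and triggering states are taken from this section. The actual webhook payload and email-provider API are not described here and are not modelled.

```python
from email.message import EmailMessage

# States that trigger the automatic send, per this section.
AUTO_SEND_STATES = {"Major outage", "Partial outage"}
SENDER = "info@fremverk.com"
REPLY_TO = "compliance@frem.sh"
STATUS_PAGE = "https://status.frem.sh"  # scheme is an assumption


def should_auto_send(new_state: str) -> bool:
    """True if a status-page transition to new_state fires the customer email."""
    return new_state in AUTO_SEND_STATES


def build_incident_email(summary: str, short_update: str, recipient: str) -> EmailMessage:
    """Build the customer email: short update plus a pointer to the status page."""
    msg = EmailMessage()
    msg["Subject"] = f"[fremforge incident] {summary}"
    msg["From"] = SENDER
    msg["Reply-To"] = REPLY_TO  # replies reach an operator, not marketing
    msg["To"] = recipient
    msg.set_content(f"{short_update}\n\nLive updates: {STATUS_PAGE}\n")
    return msg
```

Setting Reply-To separately from From keeps the visible sender stable while routing customer replies to the address that can act on them.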
Post-mortem
After resolution, file a post-mortem using the post-mortem template. The customer-facing version is published on status.frem.sh/incidents/<slug>/ within 14 days of any SEV-1 per SLA §11.4.
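The two post-mortem deadlines mentioned in this runbook (5 business days to fill out the template, 14 days to publish the customer-facing version after a SEV-1) can be computed as below. This is a rough sketch: the function names are hypothetical, public holidays are ignored, and "14 days" is assumed to mean calendar days, which the SLA text may define differently.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri); public holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def post_mortem_deadlines(resolved: date) -> dict:
    """Return both deadlines counted from the resolution date."""
    return {
        "template_filled": add_business_days(resolved, 5),
        "customer_facing_published": resolved + timedelta(days=14),  # assumed calendar days
    }


if __name__ == "__main__":
    print(post_mortem_deadlines(date(2026, 1, 7)))
```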
Related
- SLA §5, incident response and SLA credits
- On-call rotation (operator runbook, available under NDA via compliance@frem.sh), rota composition + escalation paths
- Post-mortem template, fill out within 5 business days of resolution per SLA §5.4
- T Cloud region outage (operator runbook, available under NDA via compliance@frem.sh), regional-outage decision tree