
Hosted runner security model

fremforge hosted runners execute customer Forgejo Actions workflows on single-use T Cloud Public ECS virtual machines in the EU-DE region. Every queued job provisions a fresh VM with embedded Docker; the runner registers ephemerally with Forgejo, picks up exactly one task, runs it, and the VM is destroyed. There is no persistent runner pool that a previous customer’s code could leave residue in.

This page documents the isolation guarantees fremforge makes, the per-tenant limits that apply, and the threat model: what customer code can do, what it can’t, and the design choices a workflow author should make to keep their secrets safe.

Where workflows run

Component	Implementation
Compute	T Cloud Public ECS (Elastic Cloud Server) virtual machines, region eu-de, three availability zones
Per-job lifecycle	Single-use VM per job — the platform spawns a fresh ECS VM when a job is queued, registers `forgejo-runner` on it with a one-job limit, and destroys the VM when the runner exits. A small warm-VM buffer hides the ~25 s ECS boot. No two jobs share a kernel — same model as GitHub-hosted and GitLab-hosted runners. Customer workflows never share a VM with platform workloads OR with other jobs (same tenant or otherwise).
Network placement	Each VM gets its own ENI in a dedicated runner subnet of the VPC, split across three availability zones, with a per-VM Security Group enforcing egress
VM runtime	One VM per queued job; the platform provisions the VM when the job is queued, then destroys it when the job completes. Per-job wall-clock cap (60 min default, 360 min hard ceiling) is enforced on the runner
Reaper backstop	A separate, out-of-band reaper deletes orphan VMs — runner never registered, runner exited but the delete was missed, VM age beyond its reap deadline, or an absolute 6h10m backstop.
Runner binary	Upstream `forgejo-runner` (`act_runner`) v12.10.2, no fork. Registers ephemerally with `--ephemeral`, picks one task, then the runner process exits and the platform destroys the VM
VM image	The hosted runner image: Debian 13 (trixie) base, baked with toolchains from EU-cached binaries (no direct upstream pulls during boot). The image is the VM’s boot disk; the Docker daemon is pre-installed and runs natively on the VM (no separate sidecar). Cold start is ~25 s, or ~5 s when a warm-pool VM is already provisioned

Isolation guarantees

Network isolation (per VM)

Every runner VM gets its own ENI bound to a per-VM Security Group. The SG denies, by default, all traffic except:

DNS (53/UDP, 53/TCP) to the platform’s internal DNS resolver
Egress through the runner egress proxy — the only outbound HTTP path

Customer code cannot reach:

The fremforge API or its private database
Forgejo’s internal endpoint (runner VMs talk to Forgejo through the public Bunny-fronted apex with a runner-specific scope; the internal endpoint is unreachable)
T Cloud Public platform services (RDS, DCS, OBS, SFS-Turbo, IAM, KMS)
Other tenants’ runner VMs
The cloud metadata IP (169.254.169.254), which is explicitly blocked at the SSRF guard
RFC1918 private ranges or loopback

The Security Group is applied at VM provisioning and cannot be modified from inside the VM — the VM has no IAM agency attached, so the network-config APIs are unreachable through the metadata-derived identity.

Outbound egress

Customer workflow steps that need the public internet (cloning a public repo, downloading a tool, calling a SaaS API) route through the runner egress proxy via HTTPS_PROXY injected into the VM. This proxy:

Accepts any public unicast IP, plus the Carrier-Grade NAT range 100.64.0.0/10 (matches GitHub-hosted runner behaviour; needed for T Cloud Public’s own services like SWR, which sit in CGNAT)
Blocks RFC1918 private, link-local, loopback, multicast, reserved, and explicit cloud-metadata IPs (169.254.169.254, T Cloud Public’s 100.100.100.200)
Logs every CONNECT tunnel attempt with hostname + resolved IP for audit retention

The proxy is deliberately permissive in the GitHub-hosted-runner sense: customer workflows can reach any public host they need. The boundary is drawn at SSRF targets and internal platform endpoints, not at vendor allowlists.

Package caches (sovereign, EU-resident)

Before a workflow reaches the public internet, common package-manager fetches are routed through in-VPC pull-through caches. The caches are EU-resident and fast, and they keep dependency traffic off US-hosted public indexes. Each runner has these injected as the default index/registry:

Ecosystem	Cache	Injected as
npm / pnpm / yarn / bun	Verdaccio	`NPM_CONFIG_REGISTRY`, `COREPACK_NPM_REGISTRY`, …
Python (pip / uv)	proxpi	`PIP_INDEX_URL` (+ `PIP_TRUSTED_HOST`), `UV_DEFAULT_INDEX`
Go modules	Athens	`GOPROXY` (Go module and checksum-DB proxy — `GOSUMDB` validation is preserved through the cache)
Docker images (Docker Hub)	distribution	dockerd `registry-mirrors`
Trivy vulnerability DB	cache-ghcr (in-VPC mirror of `ghcr.io/aquasecurity`)	`TRIVY_DB_REPOSITORY`, `TRIVY_JAVA_DB_REPOSITORY` — so a `trivy` scan in your workflow pulls the vuln-DB from the EU cache instead of ghcr.io directly

For the common case (public dependencies) this is transparent — the caches proxy the public index, so nothing in your workflow changes.

If you use a private index or registry, note the injected value is the default, so it takes precedence over a default configured in a committed config file (pip.conf, .npmrc, go env). Point your private source explicitly so it wins:

pip: `pip install --index-url <your-private> ...` (a CLI `--index-url` overrides the injected `PIP_INDEX_URL`), or set it in the workflow step’s `env:`. To add a private index alongside the cache, use `--extra-index-url`.
npm: scope your private registry in `.npmrc` with a `@scope:registry=` line (scoped registries are honoured alongside the default), or set `NPM_CONFIG_REGISTRY` in the step `env:`.
Go: set `GOPRIVATE` (e.g. `GOPRIVATE=git.example.com/*`) so the toolchain bypasses the proxy + checksum DB for your private module paths.
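
As an illustration of the step-level overrides described above, a workflow might look roughly like the following. This is a sketch, not a platform-defined configuration: the job name, the use of `fremforge` as a single `runs-on` label, the hosts `pypi.example.com` and `git.example.com`, and the `requirements.txt` file are all assumptions made for the example.

```yaml
# Sketch only: hostnames, job name, and file names are hypothetical.
on: push

jobs:
  build:
    runs-on: fremforge
    steps:
      - uses: actions/checkout@v4

      # A step-level PIP_INDEX_URL replaces the injected default for this step only.
      - name: Install Python dependencies from a private index
        env:
          PIP_INDEX_URL: https://pypi.example.com/simple
        run: pip install -r requirements.txt

      # GOPRIVATE makes the Go toolchain bypass the proxy and checksum DB for
      # matching module paths; public modules still use the injected GOPROXY.
      - name: Build with private Go modules
        env:
          GOPRIVATE: git.example.com/*
        run: go build ./...
```

Keeping the override on the individual step, rather than on the whole job, means every other step still resolves public dependencies through the in-VPC caches.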

VM-level isolation

Surface	Enforcement
Process	Workflow steps run as the VM’s `runner` user by default; `sudo` is available so workflows can `apt install`, run docker, configure system tools, etc. (matches GitHub-hosted-runner expectations). The boundary is the hypervisor below the VM, not user namespaces
Workspace	VM-local scratch at `/workspace` (and `/tmp`) — destroyed with the VM when the job exits. No persistence between jobs
Identity	The VM has no IAM agency attached. Metadata-service queries return only the network identity needed to boot; no platform-API access
Resource bounds	The VM’s ECS flavor sets a hard CPU + memory ceiling. A runaway job can’t consume capacity beyond its own VM’s bounds. Runaway across many jobs is bounded by the per-tenant concurrent cap + the platform’s compute capacity
Container runtime	Docker (native) — the daemon runs on the VM host, not as a sidecar. `docker build`, `docker compose`, `kind`, `testcontainers` work directly with no DinD complexity. For workflows that need extra container-runtime capabilities, the `dind-privileged` label routes to a flavor with KVM nesting enabled
Hypervisor	T Cloud Public’s KVM-based ECS hypervisor — the same multi-tenant VM boundary used for every commercial T Cloud Public customer running ECS in the region

Cross-tenant + intra-tenant kernel isolation

This is the load-bearing claim of the hosted-runner model. Every queued job gets its own ECS VM — a guest Linux kernel running on T Cloud Public’s hypervisor. The runner registers, runs the job, exits, and the platform destroys the VM. The next job (same tenant or another) gets a fresh VM with a fresh kernel.

Mechanism	How
Per-job VM	The platform spawns 1 VM per pending job. The VM literally only exists while one runner is running on it. A kernel-level escape (e.g. CVE-2024-1086 `nf_tables`) compromises only THAT throwaway VM, not co-located tenants
Fast-down on completion	The runner exits → the platform deletes the VM immediately. No idle drain window. A separate out-of-band reaper catches orphans (runner-never-registered, missed-delete) within a 6h10m absolute backstop
Per-VM ENI + per-VM SG	Network isolation independent of the kernel boundary. Tenant A’s VM cannot reach Tenant B’s VM over the network — separate ENIs, separate Security Groups
Single-task lifecycle	A VM provisions, the runner registers + claims exactly one task, runs it, exits. No reuse, no cached secrets from a previous tenant. Runner exit triggers VM destroy
OIDC tokens	Each VM gets a fresh OIDC token signed by the runner-OIDC-issuer with `aud=fremforge` and `sub=repo:<owner>/<repo>:ref:<refname>`. Customer cloud trust policies key off this audience + sub for `assume_role_with_web_identity`

Per-tenant limits

Limit	Default	Override
Concurrent jobs per tenant	30 (between GitHub Free: 20 and GitHub Pro: 40; GitHub Team: 60)	Paid plans can lift via per-tenant override
Global concurrent jobs across all tenants	Bound by the platform’s compute capacity in eu-de	—
Job rate cap	100 jobs per 5-minute rolling window per tenant	Per-tenant override
Wall-clock per job	3600 s (60 min)	Configurable per workflow step, capped at the 6 h (360 min) ceiling
Runner-minute pool	`seat_cap × 1000 minutes/month` (or an explicit per-tenant cap)	Operator config; matches the public pricing
VM flavor	`c9.large.2` (2 vCPU / 4 GiB RAM) default; `c9.xlarge.4` (4 vCPU / 16 GiB RAM) for `runs-on: [fremforge, large]`	The job’s `runs-on:` label routes to alternate flavors

When a tenant exceeds the runner-minute pool and overage isn’t enabled (or is frozen), the platform refuses dispatch: the workflow’s queued job transitions to Cancelled in Forgejo. Tenants with overage enabled keep running and are invoiced for the excess.

When a tenant hits the concurrent cap, additional queued jobs wait and are dispatched as soon as a slot frees.
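
The flavor routing in the limits table maps onto the ordinary `runs-on:` key. A minimal sketch, assuming two hypothetical jobs (`lint` and `integration`), hypothetical `make` targets, and that `fremforge` on its own selects the default flavor; only the `[fremforge, large]` label pair is taken from the table:

```yaml
# Sketch only: job names and make targets are hypothetical.
on: push

jobs:
  lint:
    # Assumed to land on the default c9.large.2 flavor (2 vCPU / 4 GiB RAM).
    runs-on: fremforge
    steps:
      - uses: actions/checkout@v4
      - run: make lint

  integration:
    # Label pair from the limits table: routes to c9.xlarge.4 (4 vCPU / 16 GiB).
    runs-on: [fremforge, large]
    steps:
      - uses: actions/checkout@v4
      - run: make integration-test
```

Both jobs count toward the same per-tenant concurrent cap, so fanning out many large jobs does not bypass the limit; it only changes which flavor each VM uses.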

What customer code can NOT do

Reach the fremforge API, Forgejo’s internal endpoint, or any other platform service (the API is only available through the public Bunny apex with proper authentication, just like for any external caller)
Read another tenant’s runner VM (separate ENI; cross-VM traffic explicitly denied at the Security Group)
Read another tenant’s secrets (workflow secrets are passed only to the runner that’s running the workflow that owns them; VM destruction wipes the in-memory credentials)
Persist state across job invocations (single-task VM, no shared disk)
Reach the T Cloud Public API directly with privileged scope (no IAM agency attached; metadata-derived identity has no roles)
Reach cloud metadata endpoints (169.254.169.254, 100.100.100.200 — both explicitly denied at the SSRF guard)
Exhaust the runner pool (per-tenant concurrent cap + the platform’s global compute ceiling)
Cause queue starvation across tenants (per-tenant concurrent cap)
Escape the hypervisor (the VM is the boundary; the host below it is T Cloud Public’s, isolated by the same VMM that separates every commercial ECS tenant in the region)

Recommended workflow shapes

Split `validate` and `apply` jobs

Run untrusted code (PR builds, dependency scans, contributor-supplied test fixtures) in a validate job with no production credentials, and gate the apply job behind an environment approval. A compromised validate job cannot reach apply-job secrets.
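
The split could be sketched as below. The job bodies, the `make` targets, and the `DEPLOY_TOKEN` secret name are hypothetical, as is the single `fremforge` label. The approval gate itself is not shown because this page does not specify its syntax; the `if:` condition only restricts `apply` to pushes on `main`.

```yaml
# Sketch only: make targets and the secret name are hypothetical.
on:
  push:
    branches: [main]
  pull_request:

jobs:
  validate:
    runs-on: fremforge
    steps:
      - uses: actions/checkout@v4
      # No secrets are referenced in this job, so untrusted PR code
      # running here has no production credentials to read.
      - run: make test

  apply:
    # Runs only after validate succeeds, and never for pull requests.
    needs: validate
    if: github.event_name == 'push'
    runs-on: fremforge
    steps:
      - uses: actions/checkout@v4
      - run: make deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
```

Because each job runs on its own single-use VM, the secret exists only on the VM that runs `apply`; nothing from `validate` shares a disk or kernel with it.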

Pin `uses:` references by SHA when supply-chain matters

`uses: actions/checkout@v4` resolves to `frem.sh/actions/checkout` (fremforge’s Forgejo-native pull-mirror of `github.com/actions/checkout`, synced every 8 hours). The mirror tracks upstream tags, including force-pushes, so pinning by SHA is the only way to guarantee bytewise stability:

# Bytewise-stable across maintainer force-pushes
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4.2.0

# Picks up upstream's current HEAD at last mirror sync — including
# any maintainer-credential-compromise force-push
- uses: actions/checkout@v4

Use OIDC for cloud auth, not long-lived secrets

Every runner VM gets a fresh OIDC token with aud=fremforge. Configure your cloud provider’s trust policy to accept this token and exchange it for short-lived credentials. Avoid storing long-lived AWS access keys / GCP service-account JSONs in workflow secrets where they end up in process env.
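
On the cloud side, a trust policy keyed on the token’s audience and subject might look like the CloudFormation sketch below for AWS. Everything specific in it is a placeholder: the issuer host `oidc.example.com`, the account ID, the repository `acme/infra`, and the assumption that `<refname>` in the `sub` claim (`repo:<owner>/<repo>:ref:<refname>`) takes the form `refs/heads/main`. It also assumes an IAM OIDC identity provider for the issuer has already been created in the account.

```yaml
# Sketch only: issuer host, account ID, repo, and ref format are placeholders.
Resources:
  FremforgeDeployRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: sts:AssumeRoleWithWebIdentity
            Principal:
              Federated: arn:aws:iam::123456789012:oidc-provider/oidc.example.com
            Condition:
              StringEquals:
                # Accept only tokens minted for the fremforge audience...
                "oidc.example.com:aud": fremforge
                # ...and only for one repository and one ref.
                "oidc.example.com:sub": "repo:acme/infra:ref:refs/heads/main"
```

Matching on `sub` as well as `aud` matters: an audience-only condition would let any repository whose token carries `aud=fremforge` assume the role.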

Cap workflow step deadlines

Set `timeout-minutes:` on individual steps to bound how long a hung step can run. The VM-level wall-clock cap (`activeDeadlineSeconds=3600`, 60 min by default) is the platform’s backstop rather than a tuning knob; step-level timeouts give faster feedback.
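
A step-level deadline could be set as in this sketch; the job name, the `make` target, the 10-minute value, and the single `fremforge` label are illustrative assumptions:

```yaml
# Sketch only: job name, make target, and the 10-minute value are hypothetical.
on: push

jobs:
  test:
    runs-on: fremforge
    steps:
      - uses: actions/checkout@v4

      - name: Run integration tests
        # Fail this step after 10 minutes instead of waiting for the
        # platform's per-job wall-clock cap to kill the whole job.
        timeout-minutes: 10
        run: make integration-test
```

A short step deadline also limits runner-minute consumption when a step hangs, since the VM is destroyed as soon as the job fails.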

Threat-model summary

Threat	Mitigation
Customer workflow contains a credential	Push protection (Gitleaks) blocks the commit before it lands in history
Customer workflow tries to reach the fremforge platform	Per-VM ENI + SG denies; no IAM agency attached to the VM; API only reachable via the public authenticated path
One tenant’s job tries to read another’s VM	Per-VM ENI on separate IPs; SG default-deny across the runner subnet; and they’re on different VMs — different kernels
Kernel-level escape (root inside the VM → host kernel via CVE)	Compromises only the throwaway VM running that one job. Co-located tenants are not affected: each job runs on its own VM. The VM is destroyed within seconds of job completion regardless
Customer workflow tries to exfiltrate via DNS	DNS is allowed (workflows legitimately need it) but the egress proxy logs every CONNECT with full hostname for audit
Compromised `act_runner` upstream	Upstream-tested binary (we don’t fork). Mirrored through the EU cache with Trivy gates on the runner image. CVE fixes flow via weekly rebuild
Customer workflow tries to reach metadata IPs	Explicit deny at SSRF guard: `169.254.169.254` (AWS-style) and `100.100.100.200` (T Cloud Public)
Customer monopolises capacity	Per-tenant cap (30 concurrent), job rate cap (100 / 5 min), pool-minute cap
Stale runner VM survives past task	Per-job wall-clock cap hard-kills at the deadline; the platform deletes the VM on runner exit; a separate out-of-band reaper catches orphans with a 6h10m absolute backstop

Auditability

Every dispatch is logged in three places:

runner_jobs table — tenant_id, repo, forgejo_task_id, started_at, finished_at, duration_seconds, vm_id. Source for billing aggregation and the operator console’s Infrastructure tab
metering_events table — code=runner_minute rows derived from runner_jobs.duration_seconds. Drives invoice line items and the tenant’s “Runner minutes used” panel
Operator log stream — structured spawn logs for the operator, with continuous alerting on dispatch refusals, stuck VMs, overrun deadlines, and capacity exhaustion

Tenants see their own usage at Org admin → CI runners → Usage, including a per-month minute count and a per-job durations table.
