Private preview. fremforge is in private preview for invited customers only. Content is still subject to change.
Hosted runner security model

fremforge hosted runners execute customer Forgejo Actions workflows in T Cloud CCE Turbo — a dedicated Kubernetes cluster running in the EU-DE region. Every queued job spawns a fresh, single-task pod; the pod registers ephemerally with Forgejo, picks up exactly one task, runs it, and exits. There is no persistent runner pool that a previous customer’s code could leave residue in.

This page documents the isolation guarantees fremforge makes, the per-tenant limits that apply, and the threat model: what customer code can do, what it can’t, and the design choices a workflow author should make to keep their secrets safe.

Where workflows run

| Component | Implementation |
| --- | --- |
| Cluster | T Cloud CCE Turbo (Cloud Native Network 2.0), region eu-de, three availability zones |
| Node pool | Ephemeral node-per-job: the controller spawns a fresh CCE Turbo VM node when a job is queued, places exactly one runner pod on it (podAntiAffinity + dedicated taint), and deletes the node when the pod completes. A small warm-node buffer hides the ~20s VM boot + image-pull. No two jobs share a kernel, the same model as GitHub-hosted and GitLab-hosted runners. Customer workflows never share a node with platform workloads or with other jobs (same tenant or otherwise) |
| Node placement | Taint fremforge.sh/dedicated=runners:NoSchedule + label fremforge.sh/role=runner. podAntiAffinity on app.kubernetes.io/name=fremforge-runner with topologyKey=kubernetes.io/hostname enforces the one-pod-per-node invariant at the scheduler layer |
| Pod runtime | One pod per queued job; the spawner creates the pod when the job is queued, then deletes both the pod and its node when the job completes. The fast-down uses the controller’s scoped CCE node:delete IAM. activeDeadlineSeconds is capped per job (60 min default, 360 min hard ceiling) |
| Reaper backstop | Out-of-cluster FunctionGraph that deletes orphan nodes: pod never started, pod exited but controller missed the delete, node age beyond the reap-after tag, or the absolute 6h10m backstop. See reaper service account scope for the IAM |
| Runner binary | Upstream forgejo-runner (act_runner) v6.2.1, no fork. Registers via an ephemeral admin token, picks one task, then the pod exits |
| Container image | fremforge-prd/runner-base: Alpine 3.22 base mirrored in SWR, layered with toolchains baked from SWR-cached binaries (no direct upstream pulls). The image is pre-baked into the warm-node disk so cold start is ~20s, not 47s |
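
The reaper's four orphan conditions can be read as a small decision function. The following is a minimal sketch, not the FunctionGraph's actual code: the `NodeState` fields, the reason strings, and the 5-minute `POD_START_GRACE` are hypothetical, and only the 6h10m backstop and the reap-after tag come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Absolute backstop stated in the table above.
ABSOLUTE_BACKSTOP = timedelta(hours=6, minutes=10)
# Hypothetical grace period before an assigned node with no pod counts as orphaned.
POD_START_GRACE = timedelta(minutes=5)


@dataclass
class NodeState:
    """Hypothetical snapshot of one runner node as a reaper might see it."""
    created_at: datetime
    reap_after: Optional[datetime]   # parsed reap-after tag, if present
    assigned_at: Optional[datetime]  # when a job was bound; None for warm-buffer nodes
    pod_started: bool                # the runner pod has started at least once
    pod_exited: bool                 # the runner pod has terminated


def reap_reason(node: NodeState, now: datetime) -> Optional[str]:
    """Return why the node should be deleted, or None to leave it alone."""
    if now - node.created_at >= ABSOLUTE_BACKSTOP:
        return "absolute_backstop"
    if node.reap_after is not None and now >= node.reap_after:
        return "reap_after_elapsed"
    if node.pod_exited:
        # The controller normally deletes the node itself; reaching this
        # branch means that delete was missed.
        return "controller_missed_delete"
    if (
        node.assigned_at is not None
        and not node.pod_started
        and now - node.assigned_at >= POD_START_GRACE
    ):
        return "pod_never_started"
    return None
```

Checking the absolute backstop first means a node with a missing or malformed tag is still deleted eventually, which is the point of having a backstop that does not depend on any other state.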

Isolation guarantees

Network isolation (per pod)

Every runner pod gets its own VPC ENI (Yangtse Cloud Native Network 2.0) bound to a per-pod Security Group fremforge-prd-runner-pod. The SG denies, by default, all traffic except:

  • DNS (53/UDP, 53/TCP) to the cluster’s CoreDNS service
  • Egress through outbound-proxy-runners (port 80) — the only outbound HTTP path

Customer code cannot reach:

  • The fremforge api or its private database
  • Forgejo’s internal ClusterIP (runner pods talk to Forgejo through the public Bunny-fronted apex with a runner-specific scope; the internal service is unreachable)
  • T Cloud platform services (RDS, DCS, OBS, SFS-Turbo, IAM, KMS)
  • Other tenants’ runner pods
  • The cluster’s metadata IP (169.254.169.254) — explicitly blocked at the SSRF guard
  • RFC1918 private ranges or loopback

A Kubernetes NetworkPolicy enforces the same rules at the CNI layer (default-deny on the fremforge-prd-runners namespace), so an SG misconfiguration alone cannot open a path.

Outbound egress: outbound-proxy-runners

Customer workflow steps that need the public internet (cloning a public repo, downloading a tool, calling a SaaS API) route through outbound-proxy-runners via HTTPS_PROXY injected into the pod. This proxy:

  • Accepts any public unicast IP, plus Carrier-Grade NAT space (100.64.0.0/10). This matches GitHub-hosted runner behaviour and is needed for T Cloud’s own services such as SWR, which sit in CGNAT
  • Blocks RFC1918 private, link-local, loopback, multicast, reserved, cluster Service-IP CIDR, and explicit cloud-metadata IPs (169.254.169.254, T Cloud’s 100.100.100.200)
  • Logs every CONNECT tunnel attempt with hostname + resolved IP for audit retention

The proxy is deliberately permissive in the GitHub-hosted-runner sense: customer workflows can reach any public host they need. The boundary is at SSRF + cluster-internal, not at vendor allowlists.
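
The allow/deny rule described above can be sketched with Python's standard `ipaddress` module. This is an illustration under stated assumptions, not the proxy's implementation: `CLUSTER_SERVICE_CIDR` is a placeholder value, the function name is hypothetical, and the IPv6 ranges are added for completeness rather than taken from the text.

```python
import ipaddress

# Placeholder: the real Service-IP CIDR is cluster-specific and not stated here.
CLUSTER_SERVICE_CIDR = "10.247.0.0/16"

BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC1918 private
    "169.254.0.0/16", "fe80::/10",                     # link-local
    "127.0.0.0/8", "::1/128",                          # loopback
    "224.0.0.0/4", "ff00::/8",                         # multicast
    "0.0.0.0/8", "240.0.0.0/4",                        # reserved
    "fc00::/7",                                        # IPv6 unique-local
    CLUSTER_SERVICE_CIDR,
)]

# Metadata IPs named in the text. 100.100.100.200 sits inside the CGNAT range
# (100.64.0.0/10), which is otherwise allowed, so it needs an explicit entry.
METADATA_IPS = {
    ipaddress.ip_address("169.254.169.254"),
    ipaddress.ip_address("100.100.100.200"),
}


def egress_allowed(resolved_ip: str) -> bool:
    """Decide on the resolved IP, not the hostname, so DNS tricks cannot bypass it."""
    ip = ipaddress.ip_address(resolved_ip)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot dodge the IPv4 rules.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    if ip in METADATA_IPS:
        return False
    return not any(ip in net for net in BLOCKED_NETWORKS)
```

With these lists, `egress_allowed("100.64.1.1")` is true (CGNAT is accepted) while `egress_allowed("100.100.100.200")` is false, which mirrors the exception the bullet list describes.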

Pod-level isolation

| Surface | Enforcement |
| --- | --- |
| Process | Pod runs as root inside the container. This is deliberate, so workflows can install packages (apk add on the Alpine base image), run rootless buildkit, etc. (matches GitHub-hosted-runner expectations). The boundary is the VM below the container, not the container’s user namespace |
| seccomp | seccompProfile: RuntimeDefault on every pod. Shrinks the syscall surface available to the in-container root without breaking CI ergonomics |
| Workspace | Per-pod emptyDir mount at /workspace, destroyed when the pod (and the node beneath it) exits. No persistence between jobs |
| Service-account token | automountServiceAccountToken: false, so even the pod’s own SA token isn’t mounted. Workflows have no credential for the Kubernetes API |
| Resource quota | Per-namespace ResourceQuota caps total CPU + memory + pod count across the runner namespace. A runaway customer can’t exhaust cluster capacity |
| LimitRange | Per-pod default request 500m CPU / 1 GiB RAM, max 4 CPU / 8 GiB. Customers can request more in the workflow resources: block (capped at the LimitRange ceiling) |
| Container runtime | containerd. No Docker socket fanout. Docker-in-Docker steps work via rootless buildkit when needed (workflow-author opt-in) |
| Pod Security Standards | baseline enforce, restricted audit on the runner namespace |

Cross-tenant + intra-tenant kernel isolation

This is the load-bearing claim of the hosted-runner model. Every queued job gets its own CCE Turbo VM node — a guest Linux kernel running on T Cloud’s hypervisor. The pod runs on that node, the job runs in the pod, the job completes, the controller deletes the pod, and the controller deletes the node. The next job (same tenant or another) gets a fresh node with a fresh kernel.

| Mechanism | How |
| --- | --- |
| Per-job VM node | podAntiAffinity ensures the scheduler will not place two fremforge-runner pods on the same node. Combined with the autoscaler growing the runner pool one node per pending pod, the result is 1 job = 1 node = 1 kernel. A kernel-level escape (e.g. CVE-2024-1086 nf_tables) compromises only that throwaway VM, not other tenants |
| Fast-down on completion | The controller watches the pod → pod terminates → controller calls CCE node:delete immediately. No idle 10-min autoscaler drain window. The out-of-cluster reaper FunctionGraph catches orphans (pod-never-started, controller-missed-delete) within the 6h10m absolute backstop |
| Per-pod ENI + per-pod SG | Network isolation, independent of the kernel boundary. Tenant A’s pod cannot reach Tenant B’s pod over the network: separate VPC NICs, separate Security Groups |
| Dedicated runner namespace + node taint | fremforge-prd-runners namespace, tainted nodes. No platform workload ever co-schedules with a runner pod |
| Single-task pod lifecycle | A pod registers, runs one task, exits. No reuse, no cached secrets from a previous tenant. Pod exit triggers node delete |
| OIDC tokens | Each pod gets a fresh OIDC token signed by the runner-OIDC-issuer with aud=fremforge and sub=repo:<owner>/<repo>:ref:<refname>. Customer cloud trust policies key off this audience + sub for assume_role_with_web_identity |

Per-tenant limits

| Limit | Default | Override |
| --- | --- | --- |
| Concurrent jobs per tenant | 30 (for comparison, GitHub-hosted concurrency limits: Free 20, Pro 40, Team 60) | Per-deploy via RUNNER_PER_TENANT_CONCURRENCY_CAP. Paid plans can lift via per-tenant override |
| Global concurrent jobs across all tenants | Bound by the runner node-pool autoscaler’s max_node_count (one node per concurrent job) | Configurable at platform level |
| Job rate cap | 100 jobs per 5-minute rolling window per tenant | tenants.max_jobs_per_5min column |
| Wall-clock per job | 3600s (60 min), set as activeDeadlineSeconds on the pod | Configurable per workflow step but capped at the 6h ceiling |
| Runner-minute pool | seat_cap × 1000 minutes/month (or explicit runner_minutes_cap on the tenant) | Operator config; matches the public pricing |
| CPU per pod | 500m default, 4 CPU max (LimitRange) | Workflow step resources: request, bounded by LimitRange |
| Memory per pod | 1 GiB default, 8 GiB max (LimitRange) | Same |
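
As a worked example of how the defaults and ceilings in the table combine, the sketch below resolves a job's effective limits. The function and key names are hypothetical. The table says requests are "capped", so the sketch clamps; whether the platform clamps or rejects an over-ceiling request is not stated here, and a plain Kubernetes LimitRange rejects the pod at admission.

```python
from typing import Dict, Optional

# Defaults and ceilings taken from the limits table.
DEFAULT_CPU_MILLI, MAX_CPU_MILLI = 500, 4000          # 500m default, 4 CPU max
DEFAULT_MEM_MIB, MAX_MEM_MIB = 1024, 8192             # 1 GiB default, 8 GiB max
DEFAULT_DEADLINE_S, MAX_DEADLINE_S = 3600, 6 * 3600   # 60 min default, 6h ceiling


def _resolve(requested: Optional[int], default: int, ceiling: int) -> int:
    """Use the default when nothing is requested, otherwise cap at the ceiling."""
    if requested is None:
        return default
    if requested <= 0:
        raise ValueError("requested value must be positive")
    return min(requested, ceiling)


def effective_limits(
    cpu_milli: Optional[int] = None,
    mem_mib: Optional[int] = None,
    deadline_s: Optional[int] = None,
) -> Dict[str, int]:
    """Hypothetical helper: the limits a pod would end up with for one job."""
    return {
        "cpu_milli": _resolve(cpu_milli, DEFAULT_CPU_MILLI, MAX_CPU_MILLI),
        "mem_mib": _resolve(mem_mib, DEFAULT_MEM_MIB, MAX_MEM_MIB),
        "deadline_s": _resolve(deadline_s, DEFAULT_DEADLINE_S, MAX_DEADLINE_S),
    }
```

A job that asks for 8 CPU and a 10-hour deadline would, under this reading, run with 4 CPU and a 21600-second deadline.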

When a tenant exceeds the runner-minute pool and overage isn’t enabled (or is frozen), the spawner refuses dispatch: the workflow’s queued job transitions to Cancelled in Forgejo with reason runner_minute_cap. Tenants with runner_overage_enabled=true keep running and are invoiced for the excess.

When a tenant hits the concurrent cap, additional queued jobs wait — the spawner polls every 2s and dispatches as soon as a slot frees.
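
The dispatch rules in this section can be sketched as a single decision function. This is an illustration, not the spawner's code: the `Tenant` class, `overage_frozen`, `concurrency_cap`, and the return strings are hypothetical names, and treating a tripped job rate cap as "wait" is an assumption, since the text only states the outcomes for the minute pool and the concurrent cap.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional

MINUTES_PER_SEAT = 1000   # seat_cap x 1000 minutes/month
RATE_WINDOW_S = 300       # 5-minute rolling window


@dataclass
class Tenant:
    seat_cap: int
    runner_minutes_cap: Optional[int] = None   # explicit cap overrides seat_cap
    runner_overage_enabled: bool = False
    overage_frozen: bool = False               # hypothetical flag name
    concurrency_cap: int = 30
    max_jobs_per_5min: int = 100


def minute_pool(t: Tenant) -> int:
    """Monthly runner-minute pool: explicit cap if set, else seat_cap x 1000."""
    if t.runner_minutes_cap is not None:
        return t.runner_minutes_cap
    return t.seat_cap * MINUTES_PER_SEAT


def decide(t: Tenant, minutes_used: int, running_jobs: int,
           recent_dispatches: Deque[float], now: float) -> str:
    """Return 'dispatch', 'wait', or 'cancel:runner_minute_cap' for one queued job."""
    # Forget dispatch timestamps that have left the rolling window.
    while recent_dispatches and now - recent_dispatches[0] >= RATE_WINDOW_S:
        recent_dispatches.popleft()

    overage_ok = t.runner_overage_enabled and not t.overage_frozen
    if minutes_used > minute_pool(t) and not overage_ok:
        return "cancel:runner_minute_cap"
    if running_jobs >= t.concurrency_cap:
        return "wait"   # retried on the next poll
    if len(recent_dispatches) >= t.max_jobs_per_5min:
        return "wait"   # assumption: rate-capped jobs queue rather than fail
    recent_dispatches.append(now)
    return "dispatch"
```

The minute-pool check comes first because it is the only terminal outcome; the other two conditions clear on their own as jobs finish or the window slides.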

What customer code can NOT do

  • Reach the fremforge api, Forgejo’s internal ClusterIP, or any other platform service (the api is only available through the public Bunny apex with proper authentication, just like for any external caller)
  • Read another tenant’s runner pod (separate VPC ENI; cross-pod traffic explicitly denied)
  • Read another tenant’s secrets (workflow secrets are mounted only into the pod that’s running the workflow that owns them; pod exit destroys the mount)
  • Persist state across job invocations (single-task pod, no shared volume)
  • Reach the Kubernetes API server (no SA token mounted)
  • Reach cloud metadata endpoints (169.254.169.254, 100.100.100.200 — both explicitly denied at the SSRF guard)
  • Exhaust cluster capacity (ResourceQuota + global cap)
  • Cause queue starvation across tenants (per-tenant concurrent cap)
  • Run privileged containers (PSA baseline enforced on the namespace). Root inside the container is allowed by design; see Pod-level isolation
  • Use raw network sockets, capabilities, host networking, or hostPID/hostIPC

Recommended workflow shapes

Split validate and apply jobs

Run untrusted code (PR builds, dependency scans, contributor-supplied test fixtures) in a validate job with no production credentials, and gate the apply job behind an environment approval. A compromised validate job cannot reach apply-job secrets.

Pin uses: references by SHA when supply-chain matters

uses: actions/checkout@v4 resolves to frem.sh/actions/checkout (fremforge’s Forgejo native pull-mirror of github.com/actions/checkout, 8-hour sync). The mirror tracks upstream tags including force-pushes, so pinning by SHA is the only way to guarantee bytewise stability:

# Bytewise-stable across maintainer force-pushes
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4.2.0

# Resolves to whatever commit the v4 tag pointed at as of the last mirror
# sync, including any force-push made with compromised maintainer credentials
- uses: actions/checkout@v4
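
A pinning rule like this is easy to check mechanically. The sketch below is a hypothetical lint helper, not a fremforge tool: it treats a `uses:` value as pinned only when the part after `@` is a full 40-character hexadecimal commit SHA, and it skips local `./` actions, which have no ref to pin.

```python
import re
from typing import Iterable, List

# A full SHA-1 commit id: exactly 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_sha_pinned(uses: str) -> bool:
    """True when the reference after '@' is a full commit SHA."""
    _, sep, ref = uses.rpartition("@")
    return bool(sep) and _FULL_SHA.fullmatch(ref) is not None


def unpinned(uses_refs: Iterable[str]) -> List[str]:
    """Return the remote action references that are not pinned by SHA."""
    return [
        u for u in uses_refs
        if not u.startswith("./") and not is_sha_pinned(u)
    ]
```

Abbreviated SHAs and tags such as `v4` both count as unpinned here, since neither guarantees the same bytes after a force-push.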

Use OIDC for cloud auth, not long-lived secrets

Every runner pod gets a fresh OIDC token with aud=fremforge. Configure your cloud provider’s trust policy to accept this token and exchange it for short-lived credentials. Avoid storing long-lived AWS access keys / GCP service-account JSONs in workflow secrets where they end up in pod env vars.
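
To make the trust-policy condition concrete, the sketch below shows the claim comparison a relying party would apply after it has already verified the token's signature and expiry against the issuer's keys. It performs no cryptographic verification itself. The function names are hypothetical, and the exact refname format (for example `refs/heads/main` versus `main`) is an assumption to confirm against a real token.

```python
from typing import Any, Iterable, Mapping

EXPECTED_AUDIENCE = "fremforge"


def expected_subject(owner: str, repo: str, refname: str) -> str:
    """Build the sub claim in the documented shape repo:<owner>/<repo>:ref:<refname>."""
    return f"repo:{owner}/{repo}:ref:{refname}"


def claims_acceptable(claims: Mapping[str, Any], owner: str, repo: str,
                      allowed_refs: Iterable[str]) -> bool:
    """Check aud and sub on already-verified claims; signature checks happen elsewhere."""
    aud = claims.get("aud")
    # Per the JWT spec, aud may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if EXPECTED_AUDIENCE not in audiences:
        return False
    allowed = {expected_subject(owner, repo, ref) for ref in allowed_refs}
    return claims.get("sub") in allowed
```

Matching on the full sub string, including the ref, lets a trust policy grant production credentials only to a protected branch instead of to every branch in the repository.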

Cap workflow step deadlines

Set timeout-minutes: on individual steps to bound how long a hung step can run. The pod-level activeDeadlineSeconds=3600 is the platform’s default upper bound on the whole job, not a minimum; step-level timeouts give faster feedback.

Threat-model summary

| Threat | Mitigation |
| --- | --- |
| Customer workflow contains a credential | Push protection (Gitleaks) blocks the commit before it lands in history |
| Customer workflow tries to read fremforge platform | Per-pod ENI + SG denies; no kube SA token mounted; api only reachable via public-auth path |
| One tenant’s job tries to read another’s pod | Per-pod VPC ENI on separate IPs; NetworkPolicy default-deny across the runner namespace; and the pods are on different VM nodes with different kernels |
| Kernel-level escape (root-in-container → host kernel via CVE) | Compromises only the throwaway VM running that one job. Other tenants are not affected because each job runs on its own VM. The node is deleted within seconds of job completion regardless |
| Customer workflow tries to exfiltrate via DNS | DNS is allowed (workflows legitimately need it), but the egress proxy logs every CONNECT with full hostname for audit |
| Compromised act_runner upstream | Upstream-tested binary (we don’t fork). Mirrored in SWR with Trivy gates on the runner image. CVE fixes flow via weekly rebuild |
| Customer workflow tries to reach metadata IPs | Explicit deny at SSRF guard: 169.254.169.254 (AWS-style) and 100.100.100.200 (T Cloud) |
| Customer monopolises capacity | Per-tenant cap (30 concurrent), job rate cap (100 / 5 min), pool-minute cap |
| Stale runner pod survives past task | activeDeadlineSeconds hard kill at the per-job deadline; controller fast-down deletes the node on pod completion; out-of-cluster reaper FunctionGraph catches orphans via the reap-after ECS tag with a 6h10m absolute backstop |

Auditability

Every dispatch is logged in three places:

  1. runner_jobs table: tenant_id, repo, forgejo_task_id, started_at, finished_at, duration_seconds, pod_name. Source for billing aggregation and the operator console’s Infrastructure tab
  2. metering_events table: code=runner_minute rows derived from runner_jobs.duration_seconds. Drives invoice line items and the tenant’s “Runner minutes used” panel
  3. LTS stream fremforge-prd-runner-spawner: structured pod-spawn logs for the operator. Keyword alarms on runner_dispatch_refused, runner_pod_stuck_pending, runner_job_overran_deadline, runner_quota_exhausted, runner_pool_at_max

Tenants see their own usage at Org admin → CI runners → Usage, including a per-month minute count and a per-job durations table.