Every resource declared. Every secret rotated. Every deployment traceable from commit to pod.
Right-sized clusters — not a fleet you can’t afford to operate.
Reproducible from a clean account, end-to-end.
Trunk-based, gated, observable end-to-end.
Three signals, one query language, no $1M Datadog bills.
Built so the next team can read it, run it, and improve it without rewriting it.
Modular, versioned, OPA-gated. Apply from CI only — no laptop applies. Drift detected nightly and ticketed.
Argo CD or Flux managing all clusters. Promotion via PR. Manual kubectl restricted to break-glass use only.
Every user-facing service has a named SLO, an error budget, and a paged alert. Burn-rate alerts, not threshold alerts.
Severity rubric, paging policy, incident commander rotation, postmortem template, action-item burndown. All in writing.
Monthly cost review. Per-team chargeback if useful. Targeted optimization PRs — not “please use fewer resources.”
Documented RPO/RTO per service. Restore drill quarterly, results published. No DR plan that hasn’t been tested.
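For readers unfamiliar with the difference between burn-rate and threshold alerts, here is a minimal sketch in Python. It assumes a request-based SLO and a two-window check. The function names and the default threshold of 14.4 (roughly 2% of a 30-day error budget consumed in one hour) are illustrative assumptions, not a description of any specific alerting stack.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out proportionally sooner.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening. The threshold is an assumed example.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

A fixed threshold alert would fire on any error spike above a set percentage. The burn-rate form scales with the SLO itself, so the same 2% error ratio pages a 99.9% service but not a 95% one.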
From “help us run prod” to “take the pager.” Same engineers, different scope of responsibility.
Boring infrastructure is the goal. The first eight weeks are the unglamorous reshaping that gets you there.
Written audit of current state: IaC coverage, deploy path, secret hygiene, alert quality, RPO/RTO. Plan with named owners and timelines.
Bring 80% of resources under Terraform. Set up Atlantis or Terraform Cloud. Remove clickops paths. Drift detection in place.
Argo CD or Flux owns deploys. Every user-facing service gets a written SLO and a burn-rate alert. Runbooks for each.
We join the rotation, then take it. A postmortem for every incident. Monthly written ops review with engineering leadership.
If something isn’t answered here, ask in your intro email — we keep this list short on purpose.
No. Most of our work is AWS or GCP; we know both deeply, plus Cloudflare and Fly. We’ll say honestly when a multi-cloud constraint is paying its way and when it’s a tax.
For Kubernetes on bare metal: yes, with Talos/k0s. Pure on-prem with no cloud at all: usually not — let’s talk if it’s a real constraint.
Yes, and we’ll first ask if you should. Many teams are paying a managed-platform premium for genuinely good leverage. Migration only makes sense at specific scale or pricing inflections.
Skeptical by default. A mesh like Istio or Linkerd earns its place when you have cross-team mTLS, cross-cluster traffic, or genuine observability needs. For most teams it adds operational cost without ROI.
Send a paragraph about the problem. We’ll come back inside 48 hours with a written take — team shape, cost envelope, riskiest assumptions.