Service 04 / 10 / Run · From $48k/mo · 24×7 on-call optional

Cloud & DevOps
without fragile prod.

Kubernetes you can reason about. Terraform you don’t fear. GitOps from day one. SLOs as the contract between engineering and the business — not a wall of green dashboards no one reads.

$48k
From / month
99.99%
Achieved SLO
< 30m
P1 MTTR
8m
Avg. deploy time

Infrastructure as code,
not as a screenshot.

Every resource declared. Every secret rotated. Every deployment traceable from commit to pod.

A / Compute

Kubernetes & Nomad

Right-sized clusters — not a fleet you can’t afford to operate.

  • EKS / GKE: 1.30+
  • Karpenter / autoscale: cost-aware
  • GitOps (Argo/Flux): mandatory
  • Spot mix: 60–80%
  • Service mesh: only if needed
B / IaC

Terraform & Pulumi

Reproducible from a clean account, end-to-end.

  • Per-env workspaces: isolated
  • Module registry: versioned
  • Drift detection: nightly
  • OPA policy: pre-apply
  • Atlantis / TFC: PR-gated
C / CI/CD

Continuous delivery

Trunk-based, gated, observable end-to-end.

  • GitHub / Buildkite: parallel
  • Mean lead time: < 1 hr
  • Canary rollouts: Argo Rollouts
  • Rollback: < 90s
  • Change failure rate: < 5%
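
For readers who want to see how the two delivery targets above are usually measured, here is a minimal sketch. It assumes each deploy is recorded with a commit time, a deploy time, and a flag for whether it caused an incident; the `Deploy` record and both function names are hypothetical, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    """Hypothetical record of one production deploy."""
    committed_at: datetime   # when the change landed on trunk
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # True if it needed a rollback or hotfix


def mean_lead_time(deploys: list[Deploy]) -> timedelta:
    """Average time from commit to production across the given deploys."""
    if not deploys:
        raise ValueError("need at least one deploy")
    total = sum((d.deployed_at - d.committed_at for d in deploys), timedelta())
    return total / len(deploys)


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Share of deploys that caused an incident, as a fraction from 0 to 1."""
    if not deploys:
        raise ValueError("need at least one deploy")
    return sum(d.caused_incident for d in deploys) / len(deploys)
```

Against the targets listed above, a healthy window would show `mean_lead_time(...)` under `timedelta(hours=1)` and `change_failure_rate(...)` under `0.05`.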
D / Observability

Logs / metrics / traces

Three signals, one query language, no $1M Datadog bills.

  • OpenTelemetry: native
  • SLO budgets: paged
  • Grafana / Tempo: if cost matters
  • Honeycomb / DD: if it earns it
  • Runbook per alert: enforced

Infrastructure that survives
your tenure.

Built so the next team can read it, run it, and improve it without rewriting it.

01

Terraform monorepo

Modular, versioned, OPA-gated. Apply from CI only — no laptop applies. Drift detected nightly and ticketed.
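
As a rough illustration of a nightly drift check, the sketch below wraps `terraform plan -detailed-exitcode`, which exits 0 when live infrastructure matches the code, 2 when the plan contains changes, and 1 on error. The function names and the working-directory argument are assumptions for illustration; ticketing is left out and would depend on the tracker in use.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes present
    return "error"         # plan failed (exit code 1 or anything unexpected)


def check_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and report in-sync, drift, or error."""
    result = subprocess.run(
        [
            "terraform", "plan",
            "-detailed-exitcode",  # distinguish "no changes" from "changes"
            "-input=false",        # never prompt in CI
            "-lock=false",         # a read-only check should not hold the state lock
            "-no-color",
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled CI job could call `check_drift("envs/prod")` (a hypothetical path) for each environment and open a ticket whenever the result is `"drift"`.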

02

GitOps delivery

Argo CD or Flux managing all clusters. Promotion via PR. Manual kubectl restricted to break-glass only.

03

SLO catalog

Every user-facing service has a named SLO, an error budget, and a paged alert. Burn-rate alerts, not threshold alerts.
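
To make the burn-rate idea concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below pages only when both a long and a short window are burning fast. The 14.4 threshold and the two-window shape follow a commonly cited default (roughly 2% of a 30-day budget spent in one hour); they are assumptions here, not figures from this page.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default, tune per service
) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and would page; a static "errors above 1%" threshold alert says nothing about how quickly the budget is running out.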

04

Incident process

Severity rubric, paging policy, incident commander rotation, postmortem template, action-item burndown. All in writing.

05

FinOps cycle

Monthly cost review. Per-team chargeback if useful. Targeted optimization PRs — not “please use fewer resources.”

06

Disaster recovery

Documented RPO/RTO per service. Restore drill quarterly, results published. No DR plan that hasn’t been tested.

Three shapes
of operate.

From “help us run prod” to “take the pager.” Same engineers, different scope of responsibility.

Hands-on coaching

Embed light

From $48k/mo · 2 engineers
  • Embedded with your team
  • IaC + CI/CD foundation
  • SLO catalog + alert hygiene
  • You keep the pager
Most common

Embed standard

From $84k/mo · 3 engineers
  • All of light, plus:
  • Shared on-call rotation
  • Monthly capacity + cost review
  • Quarterly DR drill
Full operate

Run for you

From $128k/mo · 4–5 engineers
  • 24×7 on-call we own
  • Named SLO contract
  • P1 MTTR < 30 min
  • Monthly written ops review

Eight weeks
to boring.

Boring infrastructure is the goal. The first eight weeks are the unglamorous reshaping that gets you there.

01 / Week 1

Audit & plan

Written audit of current state: IaC coverage, deploy path, secret hygiene, alert quality, RPO/RTO. Plan with named owners and timelines.

02 / Weeks 2–4

IaC baseline

Bring 80% of resources under Terraform. Set up Atlantis or Terraform Cloud. Remove clickops paths. Drift detection in place.

03 / Weeks 5–6

GitOps + SLOs

Argo or Flux owns deploys. Every user-facing service gets a written SLO and a burn-rate alert. Runbooks for each.

04 / Week 7+

Take the pager

We join the rotation, then take it. A postmortem for every incident. Monthly written ops review with engineering leadership.

Things buyers ask
on the first call.

If something isn’t answered here, ask in your intro email — we keep this list short on purpose.

We’re multi-cloud: is that a problem?

No. Most of our work is on AWS or GCP; we know both deeply, plus Cloudflare and Fly. We’ll say honestly when a multi-cloud constraint is paying its way and when it’s a tax.

Do you do bare metal or on-prem?

For Kubernetes on bare metal: yes, with Talos/k0s. Pure on-prem with no cloud at all: usually not — let’s talk if it’s a real constraint.

Can you migrate us off Heroku / Render / Vercel?

Yes, and we’ll first ask if you should. Many teams are paying a managed-platform premium for genuinely good leverage. Migration only makes sense at specific scale or pricing inflections.

What’s your stance on service mesh?

Skeptical by default. A mesh such as Istio or Linkerd earns its place when you have cross-team mTLS, cross-cluster traffic, or genuine observability needs. For most teams it adds operational cost without a matching return.

Got something hard
that needs to be real?

Send a paragraph about the problem. We’ll come back inside 48 hours with a written take — team shape, cost envelope, riskiest assumptions.

hello@kvb.dev