Running a fully declarative homelab, five years in

Every service in my house rebuilds from a git repo. One laptop, one command, twenty minutes later it’s all back. Here’s how, why, and what I’d still do differently.

I’ve had some version of a homelab since 2021, and a few rewrites later it’s finally something I can recommend without asterisks. The current shape is a dozen-ish self-hosted services running on one laptop, all configured in NixOS, scheduled by k3s, with secrets in SOPS and the entire world described by one git repo I push to from that same laptop.

What follows is the architecture, the pieces I think are worth stealing, and the bits that I would not recommend even to people I like.

Why declarative

The shortest version: I want the repo to be the source of truth. The longer version is that running imperative servers at home is mostly fine until the day you need to remember why a thing works, and the only living memory of that decision is in your fingers.

Declarative configuration moves that memory into the repo. The cost is up-front pain. The payoff is that future-me can wipe the disk, pull the repo, run one command, and have it all back - without remembering what got configured by hand at 2am six months ago.

// rule of thumb
If a config change isn’t in git, it didn’t happen. I run a hook that screams when /etc drifts.
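
For a rough idea of what a drift check can look like, here is a minimal sketch as a NixOS module. It is not the hook from the repo: the unit name, the hourly schedule, and the allowlist are all assumptions. On NixOS most of /etc is symlinked out of the Nix store, so the sketch treats any regular file outside a small allowlist as drift.

```nix
# etc-drift.nix: hypothetical module, not taken from the repo.
{ pkgs, ... }:
let
  # Regular files that legitimately live in /etc on this (assumed) host.
  # Extend per machine; any other non-symlink file counts as drift.
  allowlist = pkgs.writeText "etc-drift-allowlist" ''
    /etc/NIXOS
    /etc/machine-id
    /etc/passwd
    /etc/group
    /etc/shadow
    /etc/subuid
    /etc/subgid
    /etc/resolv.conf
  '';

  check = pkgs.writeShellScript "etc-drift-check" ''
    # -type f skips the symlinks NixOS manages; grep then drops exact
    # matches from the allowlist (grep exits 1 when nothing is left).
    drift=$(${pkgs.findutils}/bin/find /etc -xdev -type f \
      | ${pkgs.gnugrep}/bin/grep -vxFf ${allowlist} || true)

    if [ -n "$drift" ]; then
      echo "unmanaged files in /etc:" >&2
      echo "$drift" >&2
      exit 1   # a failed unit is the "scream"; wire alerting to taste
    fi
  '';
in
{
  systemd.services.etc-drift-check = {
    description = "Report hand-edited files in /etc";
    startAt = "hourly";   # creates a matching systemd timer
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${check}";
    };
  };
}
```

The first run on a real machine will almost certainly flag things like SSH host keys; the allowlist is where those judgment calls end up in git.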

The shape of it

One node. Laptop, not a rack: ~12 vCPU, 64 GB RAM, ~1 TB NVMe, Intel iGPU for the occasional bit of QuickSync transcoding, no discrete anything. NixOS at the base, k3s in single-node server mode on top, flake at the root of the repo. Every workload is a NixOS module, a Helm release, or a Kubernetes manifest pinned in that flake.

This is intentional. It’s a homelab, not an aspirational datacenter. A second node (sentry-level-01) is roughed in and commented out in flake.nix for the day I actually want HA; until then, one laptop is plenty, and “the cluster” is mostly a vocabulary choice.

// flake.nix (abridged)
{
  outputs = inputs: {
    nixosConfigurations = {
      engineer = mkHost "engineer" { role = "server"; };
      # sentry-level-01 = mkHost "sentry-level-01" { role = "agent"; };
    };
    apps.deploy = ./scripts/deploy.nix;
  };
}

I deploy with nix run .#deploy. It evaluates the flake, builds the new system closure, and activates it. If the switch fails, the previous generation is still bootable. This is the single feature that turned NixOS from “fun” into “trustworthy.”
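
The abridged flake leaves out mkHost and the app wiring. A fuller sketch of how those pieces could fit together follows. The body of mkHost, the hosts/ directory layout, the nixpkgs branch, and the deploy wrapper are assumptions for illustration, not copied from the repo.

```nix
# flake.nix: a fuller sketch under the assumptions stated above.
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs, ... }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};

      # Build one NixOS system. `role` is passed to modules via specialArgs
      # so they can branch on "server" vs "agent".
      mkHost = name: { role }:
        nixpkgs.lib.nixosSystem {
          inherit system;
          specialArgs = { inherit role; };
          modules = [
            (./hosts + "/${name}")          # assumed: hosts/<name>/default.nix
            { networking.hostName = name; }
          ];
        };

      # Wrapper that builds and activates this host's configuration.
      deploy = pkgs.writeShellScriptBin "deploy" ''
        exec sudo nixos-rebuild switch --flake .#engineer "$@"
      '';
    in
    {
      nixosConfigurations.engineer = mkHost "engineer" { role = "server"; };

      # `nix run .#deploy` resolves to apps.<system>.deploy.
      apps.${system}.deploy = {
        type = "app";
        program = "${deploy}/bin/deploy";
      };
    };
}
```

If an activation goes wrong, `sudo nixos-rebuild switch --rollback` or picking the previous generation in the boot menu brings the old system back.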

Secrets without panic

Anything sensitive lives in SOPS-encrypted files inside the repo, using age keys. Each host has its own key, the laptops have their own keys, and the master key lives in Bitwarden where past-me has already trusted it.
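
On the NixOS side, one common way to consume SOPS files is the sops-nix module. The sketch below assumes sops-nix is a flake input with its NixOS module imported elsewhere; the file path, key location, and secret name are hypothetical.

```nix
# secrets.nix: hypothetical module; assumes inputs.sops-nix.nixosModules.sops
# is already in this host's module list.
{ ... }:
{
  sops = {
    # Encrypted YAML committed to the repo (assumed path).
    defaultSopsFile = ./secrets/engineer.yaml;

    # This host's private age key, kept outside the repo and the Nix store.
    age.keyFile = "/var/lib/sops-nix/key.txt";

    # Decrypted at activation time. By default the plaintext appears at
    # /run/secrets/postgres/password, which is also available as
    # config.sops.secrets."postgres/password".path in other modules.
    secrets."postgres/password" = { };
  };
}
```

Only the encrypted file lives in git, so the repo can stay public-ish without the passwords coming along.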

The thing I want to talk people into is the recovery path. If my laptop goes in the river:

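1. Buy a new laptop.
2. Pull the master key from Bitwarden.
3. Generate a new age key, add it to .sops.yaml, re-encrypt the existing secrets for it with sops updatekeys (the master key is what makes this possible), push.
4. Run the deploy. Done.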

No re-encrypting from memory, no “wait, was that DB password the one with the dollar sign or the underscore.”
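
The key-handling part of that recovery path can be sketched as a small helper. This is an illustration, not a script from the repo: the file locations, the secrets/ directory, and the helper’s name are assumptions. The age-keygen and sops updatekeys invocations are the standard CLI ones.

```nix
# rekey.nix: hypothetical helper; build with `nix-build rekey.nix`.
{ pkgs ? import <nixpkgs> { } }:

pkgs.writeShellScriptBin "rekey-secrets" ''
  set -euo pipefail
  export PATH=${pkgs.lib.makeBinPath [ pkgs.age pkgs.sops pkgs.findutils ]}:$PATH

  # The master key pulled from Bitwarden is assumed to be saved here.
  export SOPS_AGE_KEY_FILE="$HOME/.config/sops/age/master.txt"
  new_key="$HOME/.config/sops/age/keys.txt"

  # Generate the new laptop key unless one already exists.
  if [ ! -e "$new_key" ]; then
    age-keygen -o "$new_key"
  fi

  # Print the public half so it can be pasted into .sops.yaml.
  echo "Add this recipient to .sops.yaml, then press Enter:"
  age-keygen -y "$new_key"
  read -r _

  # Re-encrypt each file's data key for the recipients now listed in
  # .sops.yaml. Assumes secrets are YAML files under ./secrets.
  find secrets -type f -name "*.yaml" -exec sops updatekeys --yes {} \;

  echo "Re-keyed. Review with git diff, commit, push, then run the deploy."
''
```

Run it from the root of the repo checkout. Editing .sops.yaml alone only affects files created afterwards, which is why the updatekeys pass is there.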

k3s, but only just

A homelab does not need Kubernetes. I will say that again because I mean it: a homelab does not need Kubernetes. I run k3s because it’s the same primitives I use professionally, and dogfooding is cheaper than reading docs. If your homelab is for watching films, run Docker Compose and go outside.

With that out of the way: k3s is great. Single-binary install, a bundled local-path storage provisioner, sane defaults. The bundled Traefik is k3s’s default ingress and stays - I customize it via a HelmChartConfig (extra entrypoints, Prometheus metrics scraping, a couple of per-app header middlewares) rather than ripping it out for nginx-ingress or something fancier. cert-manager is in the repo too, but its certs are used inside the cluster only - the public TLS edge lives on a Hetzner VPS running Pangolin/Gerbil, so the cluster never sees Let’s Encrypt directly. The rest is just Helm releases pinned in the flake.
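
As a sketch of the HelmChartConfig approach expressed from NixOS: the block below assumes a nixpkgs recent enough to have services.k3s.manifests, and the entrypoint name and port are made up for illustration. The real values in the repo will differ.

```nix
# k3s.nix: hypothetical module, not the repo's actual configuration.
{ ... }:
{
  services.k3s = {
    enable = true;
    role = "server";   # single-node server mode

    # Placed in k3s's auto-deploy manifests directory. k3s merges a
    # HelmChartConfig named after a bundled chart into that chart's values.
    manifests.traefik-config.content = {
      apiVersion = "helm.cattle.io/v1";
      kind = "HelmChartConfig";
      metadata = {
        name = "traefik";
        namespace = "kube-system";
      };
      spec.valuesContent = ''
        # Extra Traefik entrypoint via a static-config flag. Exposing the
        # port on the Service is a separate chart value, not shown here.
        additionalArguments:
          - "--entrypoints.mqtt.address=:8883/tcp"
      '';
    };
  };
}
```

Keeping the bundled chart and layering values on top means k3s upgrades still own the Traefik version, while the customizations stay in git.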

Things that broke (so you don’t have to break them)

SOPS + git hooks + impatience. If you forget to re-encrypt after rotating a key, your deploy will succeed and your services will fail to start with a wonderful, unhelpful error. Add a pre-push hook.
Flake input churn. nix flake update on a Friday is a special kind of optimism. Pin the inputs you care about, update on purpose, and read the release notes of nixpkgs and any CRD-shipping chart before you switch.
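
One way to make updates deliberate is to point inputs at a release branch and let secondary inputs follow it. A minimal sketch: the branch name and the presence of sops-nix as an input are assumptions.

```nix
# flake.nix inputs: a sketch of deliberate pinning.
{
  inputs = {
    # Track a release branch rather than a moving unstable one (assumed
    # choice). flake.lock still records the exact commit.
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";

    sops-nix = {
      url = "github:Mic92/sops-nix";
      # Reuse the nixpkgs above so there is one revision to reason about.
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { self, nixpkgs, sops-nix, ... }: { };
}
```

With this in place, `nix flake update nixpkgs` bumps just that one input on recent Nix releases (older ones spell it `nix flake lock --update-input nixpkgs`), and the resulting flake.lock diff is the thing to review before switching.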

What I’d do again

Flake-only. configuration.nix is a sweet trap.
One repo for all hosts. Multi-repo seemed clean for a week.
Pangolin tunnels for any service I want from outside - no exposed ports, no VPN to onboard guests onto.
A real observability stack, even at one-node scale. The current setup is kube-prometheus-stack 84.5, Loki for logs, an Alloy DaemonSet shipping everything, plus a few oddballs - an Intel iGPU exporter, a local-path du-exporter, version-checker and nova for upgrade nags. Massive overkill on paper; I look at it more than I’d ever admit.

What I’d do differently

Skip k3s for the first version. NixOS modules + systemd would have got me to v1 a month sooner.
Write the disaster-recovery doc on day one, not after the first near-disaster.

If you want the actual repo, it’s at github.com/bovf/homelab-overkill. Forks and questions both welcome - preferably in that order.

filed under #nix #homelab #gitops
