Two engineers at different organisations, both running what their team calls a “Kubernetes platform,” can have environments that behave so differently they barely resemble the same technology. Same control plane version, different CNI. Same node OS, different systemd configurations. Same Helm chart, different values files that have accumulated two years of undocumented changes. Both environments technically run workloads. Neither environment can be reliably rebuilt by someone who was not present when it was originally constructed.
This is not an edge case. It is the normal state of infrastructure that has evolved organically rather than been built to a reproducible standard. And it is a risk posture that only becomes visible at the worst moment, when recovery is urgent and the process for rebuilding the environment turns out not to exist in a usable form.
What reproducibility actually means
Reproducibility in infrastructure means that the same inputs produce the same environment, every time, regardless of who runs the process or what state they start from. It does not mean identical: environments can legitimately differ in size, credential scope, or regional target. But the structure, the configuration, and the operational behaviour should be deterministic given the same module and profile.
This is a stronger claim than “we use infrastructure-as-code.” Having Terraform does not make your infrastructure reproducible. Having Terraform with hand-edited state files, provider versions that are not pinned, and modules that depend on external resources that may or may not exist at plan time is not reproducible. It is code that sometimes works, in the specific environment where it was last successfully run.
Genuine reproducibility requires the entire execution path to be deterministic: pinned tool versions, defined input schemas, explicit pre-condition checks, validated output states. Each of those is a design decision that has to be made and enforced. None of them happen automatically.
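Input schemas, for instance, can be enforced directly in the provisioning tool rather than documented and hoped for. A minimal Terraform sketch, where the variable name and allowed values are illustrative rather than taken from any real module:

```hcl
# Reject invalid inputs at plan time instead of failing mid-apply.
variable "region" {
  type        = string
  description = "Deployment region for this environment profile."

  validation {
    condition     = contains(["eu-west-1", "eu-central-1"], var.region)
    error_message = "region must be one of: eu-west-1, eu-central-1."
  }
}
```

A failed validation stops the run before any resource is touched, which is exactly the behaviour a deterministic execution path needs.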
Why it matters more than teams realise
The value of reproducibility is not obvious until you need it, and then it becomes urgent.
The most common moment is disaster recovery. You have a primary environment that has failed, or is at risk of failing, and you need to stand up a working environment in a different location. If your infrastructure is reproducible, this is a defined operation: run the module against the DR target, verify the output, cut traffic. If it is not reproducible, you are improvising under pressure. Experienced engineers can spend four hours rebuilding an environment from memory that should have taken thirty minutes, because the process had never been written down, only done.
DR tests that are effectively scripted demos represent a recognisable failure mode. The test environment was built once, manually, and then kept running as the “DR environment” rather than being rebuilt from scratch as part of the test. That is not testing recovery. It is testing connectivity to a static environment that happens to exist.
A genuine reproducibility test is running the build process against a clean target and verifying that the result matches the production definition. If you cannot do that in a test environment, you cannot do it reliably in an actual disaster.
Where reproducibility breaks down
The typical failure modes are familiar.
Tool version drift. Minor version changes in provisioning tools can produce subtle differences in how provider operations are handled, or turn deprecated options from warnings into errors. When tool versions are not pinned and enforced, different engineers running the same module get different results. The module appears to work, but the environment it produces differs depending on who ran it and when.
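Pinning is mechanical once the decision is made. In Terraform, for example, both the CLI and the providers can be constrained in the module itself; the versions shown here are illustrative:

```hcl
terraform {
  # Fail fast if the engineer's CLI is outside the supported range.
  required_version = "~> 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pessimistic pin: any 5.x, never 6.0
    }
  }
}
```

With this in place, "it worked on my machine" becomes a version error at init time rather than a subtle difference in the provisioned environment.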
State drift. Manual changes made outside the provisioning process accumulate. The Terraform state reflects what was last applied, not the current live state. The gap between them grows until someone runs a plan that wants to undo a change that was made by hand three months ago and is now load-bearing. The plan is technically correct. Applying it would break the environment.
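Drift does not have to be discovered by accident. A scheduled plan against the live environment surfaces it continuously. This sketch assumes GitHub Actions as the CI system and relies on `terraform plan -detailed-exitcode`, which exits 0 when state matches, 1 on error, and 2 when the plan differs from live state:

```yaml
name: drift-check
on:
  schedule:
    - cron: "0 6 * * *" # daily; the frequency is a judgment call

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      # Exit code 2 (drift) fails the job, turning silent divergence
      # into a routine alert instead of a recovery-day surprise.
      - run: terraform plan -detailed-exitcode -input=false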
Implicit dependencies. Modules that pull from external sources at plan time (dynamic lookups, shared data sources, remote state references) introduce variability that is invisible until it fails. A remote state reference that works in the lab might return different data in the DR environment because the remote workspace has a different configuration. The dependency existed; it just was not declared.
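The fix is to promote the lookup into a declared input. A Terraform sketch of the difference, with bucket, key, and variable names chosen for illustration:

```hcl
# Implicit: resolved at plan time from whatever the remote workspace
# currently contains. Works in the lab, may differ in DR.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "tf-state"
    key    = "network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Explicit: the dependency is a named input, supplied by the
# environment profile and visible in the module's input schema.
variable "vpc_id" {
  type        = string
  description = "Target VPC, supplied per environment profile."
}
```

The explicit version costs one extra value in the profile. In exchange, the module's requirements are readable before the first run.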
Undocumented environment assumptions. The module works in the lab because the lab was built with certain baseline services already in place. That baseline is not declared anywhere. Run the module against a clean environment and it fails in the third step, for reasons that are not obvious from the error output.
```yaml
# Module pre-conditions make assumptions explicit
preconditions:
  - name: netbox-reachable
    probe: http
    target: "{{ netbox_url }}/api/"
    expect: status_200
  - name: terraform-version-check
    probe: version
    binary: terraform
    constraint: "~> 1.7"
  - name: target-workspace-empty
    probe: tf-state
    expect: resource_count_zero
```
When pre-conditions are declared and checked automatically, implicit assumptions become visible failures. The engineer learns exactly what the module requires before anything executes, rather than discovering dependencies mid-run in the environment they least want to be debugging.
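A pre-condition runner does not need to be elaborate. This Python sketch assumes the probe names and the "~>" pessimistic-constraint semantics from the YAML above; it is an illustration, not a real HybridOps API:

```python
import re

def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a '~> MAJOR.MINOR' constraint: same major version, minor
    at least the pinned value (assumed semantics, mirroring Terraform's '~>')."""
    m = re.fullmatch(r"~>\s*(\d+)\.(\d+)", constraint.strip())
    if not m:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    parts = [int(p) for p in version.split(".")]
    return parts[0] == major and parts[1] >= minor

def run_preconditions(checks, probes):
    """Evaluate every declared check before anything executes.
    `probes` maps a probe type to a callable returning True on success;
    the names of the checks that failed are returned."""
    return [c["name"] for c in checks if not probes[c["probe"]](c)]

# Usage: the 'version' probe is stubbed here; a real one would shell
# out to the binary and parse its reported version.
checks = [
    {"name": "terraform-version-check", "probe": "version",
     "binary": "terraform", "constraint": "~> 1.7"},
]
probes = {"version": lambda c: satisfies_pessimistic("1.7.5", c["constraint"])}
print(run_preconditions(checks, probes))  # -> [] means all pre-conditions hold
```

The point is the shape, not the implementation: checks are data, probes are small and testable, and nothing executes until the list of failures is empty.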
The lab as proof
A reproducible platform can rebuild its own lab environments on demand. That is the practical test. If you can run `hyops run authoritative-foundation --env lab` against a clean Proxmox cluster and get a working environment (SDN configured, IPAM populated, control nodes deployed), then you have a foundation that is genuinely reproducible. The process exists, it works, and it is documented in a form that a run record can verify.
If doing that requires manual steps, specific pre-existing state, or operator knowledge that is not encoded anywhere, then you have an environment that was built once and has been maintained since. That is a different thing with a different risk profile. It can appear stable for a long time. It does not survive recovery scenarios.
HybridOps treats lab rebuilds as a routine operation. The same modules used in production build the lab, against the same profiles with appropriately scoped credentials. The run records from a lab build are comparable to production records. Differences surface as discrepancies, not surprises encountered during a recovery event.
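Comparing run records can be as simple as a key-by-key diff, assuming each record is a flat mapping of check names to observed values; the record shape here is an assumption for illustration, not the HybridOps format:

```python
def record_discrepancies(reference: dict, candidate: dict) -> dict:
    """Return {key: (reference_value, candidate_value)} for every key
    whose observed value differs between two run records."""
    keys = reference.keys() | candidate.keys()
    return {
        k: (reference.get(k), candidate.get(k))
        for k in sorted(keys)
        if reference.get(k) != candidate.get(k)
    }

# Usage: a lab rebuild compared against the last production record.
prod = {"terraform": "1.7.5", "cni": "cilium", "node_count": 5}
lab  = {"terraform": "1.7.5", "cni": "calico", "node_count": 3}
print(record_discrepancies(prod, lab))
# -> {'cni': ('cilium', 'calico'), 'node_count': (5, 3)}
```

Some differences are expected (node counts, credential scope); the value of the diff is that every difference is enumerated, so "expected" becomes a reviewed list rather than a guess.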
The investment
Building for reproducibility takes longer upfront. Declaring pre-conditions, pinning versions, defining input schemas, writing output probes: these are not optional additions to a module. They are what makes it a module rather than a collection of commands that worked once in a specific environment.
The return comes in every subsequent use. Onboarding a new engineer into a reproducible environment is a different experience from onboarding into one that requires institutional knowledge to operate safely. Recovering from a failure is a different experience when the recovery path is a defined, tested module rather than a sequence of steps someone will have to reconstruct under pressure while the business waits.
Reproducibility is not a property you add to a platform after it is built. It is a design decision made at the beginning, expressed in the choices about how modules are structured and executed, and it shapes everything that follows. The organisations that treat it as foundational are the ones that can recover confidently. The ones that treat it as optional discover the cost of that decision at the moment they can least afford it.