Disaster Recovery as a Governance System

The technical side of DR is largely solved. The harder problem is governance: who decides to use it, based on what information, and what evidence the decision produces.

Disaster recovery is usually treated as a technical problem. You set up replication, you configure a standby, you test the failover, and you document the procedure. If the primary goes down, the standby comes up. That is the model most organisations default to.

The technical side of that model is largely solved. Patroni handles PostgreSQL HA. Cloud SQL supports external replicas. VPN failover can be automated. The tooling exists and is mature. The harder problem is the one that comes after the technical setup works: who decides when to actually use it, based on what information, and what are the consequences of that decision?

That is a governance problem, not a technical one. And treating it as a technical problem is how organisations end up with recovery systems that are technically correct but operationally unreliable.


Why automatic failover is not always the right answer

Automatic failover is appealing because it removes the need for human judgment in the moment. The monitoring detects a failure, the decision is made, the standby is promoted. Fast, clean, no 3am phone call required.

But automatic failover makes a specific assumption: that the detected condition always warrants immediate recovery action. In practice, that assumption breaks down in ways that are entirely foreseeable.

A transient network partition can look identical to a primary failure. Automatic failover can fire on exactly this condition, promoting a replica while the original primary is still alive and accepting writes, and creating a split-brain scenario that is significantly worse than the original blip. Recovery from split-brain requires deliberate, manual intervention: the exact thing automatic failover was supposed to eliminate.

A brief spike in replication lag might trigger failover logic that moves database traffic to a cloud replica, incurring egress costs and latency penalties, for a condition that resolves itself in four minutes. The automated system was not providing resilience. It was making expensive decisions based on incomplete information.

An application bug that crashes one service might trigger cascading alerts that make the monitoring surface look like a site-wide outage when it is nothing of the sort.

In each of these cases, the problem is not the automation itself; it is automation applied without access to the business context that would change the decision.


Recovery as a decision

A more useful model is to treat recovery as a governed decision rather than an automatic trigger.

The signals that inform that decision (health checks, replication lag, cost posture, service availability, cross-region connectivity) are still captured automatically and continuously. The difference is that they feed a decision surface rather than directly triggering actions.

recovery_triggers:
  evaluate_when:
    - probe: primary-health
      status: failing
      duration_seconds: 120
    - probe: replica-lag
      threshold_seconds: 30
      status: exceeded

  cost_gate:
    max_monthly_egress_usd: 400
    action_if_exceeded: alert_and_hold

  decision_mode: governed   # not: automatic
  approval_required: true
  notify: ["oncall-lead", "platform-team"]

Under this model, the platform evaluates the signals, determines whether recovery conditions are met, and then routes the decision through a governance layer before executing. An operator confirms the action. The approval, the signal state at the time, and the recovery action are all written into a structured record.

That record is what makes the recovery auditable. Not just “we failed over at 02:14” but: here are the signals that triggered evaluation, here is what the environment looked like at the point of decision, here is who approved the action, here is the outcome and the post-recovery probe results. That is the difference between a recovery that happened and a recovery that can be understood, reviewed, and improved.
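As a rough sketch, such a record might look something like this (field names and values are illustrative, not a fixed schema):

recovery_record:
  triggered_by:
    - probe: primary-health
      status: failing
      duration_seconds: 140
  environment_at_decision:
    replica_lag_seconds: 4
    cross_region_connectivity: healthy
    projected_monthly_egress_usd: 310
  decision:
    mode: governed
    approved_by: oncall-lead
    approved_at: "02:14"
  actions_executed:
    - promote: cloud-replica
    - repoint: application-traffic
  post_recovery:
    probes: passing
    verified_at: "02:21"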


Cost posture as a recovery signal

One of the more underappreciated aspects of cloud-based DR is the cost model. Running active DR into a cloud target (a Cloud SQL external replica, GCP compute, cross-region networking) has ongoing costs. Failing over into that target for an extended period has larger costs. Failing over unnecessarily, for a condition that resolved itself, has costs with no corresponding benefit.

Most DR designs acknowledge this at the architecture stage and then ignore it at runtime. The failover happens, traffic moves to the cloud target, and nobody thinks about cost posture until the monthly bill arrives.

Including cost posture as a first-class signal in the recovery decision changes that. The platform knows the current cost position. If a failover would push egress past a defined threshold, it flags that condition before executing, not as a hard blocker, but as information that the operator should have before making the call.
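As an illustration, the cost_gate from the earlier contract might surface a condition like this when a proposed failover would cross the threshold (the numbers are invented):

cost_gate_evaluation:
  gate: max_monthly_egress_usd
  limit_usd: 400
  month_to_date_usd: 310
  projected_if_failover_usd: 680    # estimate based on expected traffic to the cloud target
  result: exceeded
  action: alert_and_hold            # held for operator confirmation, not blocked outright
  notify: ["oncall-lead", "platform-team"]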

This is not about being cheap with infrastructure. It is about making recovery decisions with complete information. An operator who knows that a failover will cost an additional £800 this month is in a better position to weigh the trade-off than one who is purely reacting to a health check alert with no financial context.


What governed recovery actually looks like

The practical components of a governed recovery model are not complex. They are mostly decisions that need to be made explicitly and encoded somewhere. The technology to implement them is not the constraint.

What signals trigger evaluation? Define the health checks, replication lag thresholds, and connectivity probes that indicate a recovery condition might exist. These should be specific and measurable, not “the primary looks unhealthy” but “the primary health probe has returned a failing status for 120 consecutive seconds.”

What is the decision mode? Some conditions warrant immediate automatic action: a primary that has been genuinely unreachable for twenty minutes is a failover scenario where speed matters more than deliberation. Others warrant evaluation and a human call. A well-designed recovery model distinguishes between them explicitly rather than applying the same mode to all conditions.
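One way to encode that distinction, staying with the contract style from earlier (condition names are illustrative):

decision_modes:
  - condition: primary-unreachable
    sustained_seconds: 1200        # twenty minutes of confirmed unreachability
    mode: automatic                # speed matters more than deliberation
  - condition: replica-lag-exceeded
    mode: governed                 # evaluate, then route for operator approval
  - condition: partial-service-degradation
    mode: governed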

What does the evidence record contain? Every recovery action, whether automatic or governed, should produce a structured record: the signal state, the decision path, the approver, the recovery steps executed, and the post-recovery verification results. That record is the difference between a recovery that happened and one that can be reviewed.

What is the failback path? Recovery into a cloud target is temporary in most DR architectures. The failback path (restoring data to the primary, reestablishing replication, cutting traffic back) needs to be as well-defined as the failover path. Failback is where many organisations discover that their DR plan only covered the first half of the problem.
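Sketched in the same contract style (step names are illustrative, not a prescribed sequence), a failback path might look like:

failback:
  decision_mode: governed          # failback is rarely urgent enough to justify automation
  steps:
    - restore: primary-data        # rebuild the original primary from the cloud target
    - reestablish: replication     # the restored primary catches up as a replica first
    - verify:
        probe: replica-lag
        max_seconds: 5
    - cutover: traffic-to-primary
    - demote: cloud-target
  evidence: structured_record      # same record format as failover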

HybridOps structures DR orchestration around these questions. The decision service evaluates signals against a defined contract. Governed modes route through an approval layer before executing. Every action produces a structured record that documents the full decision chain.


The argument for deliberate recovery

Automatic systems are valuable. They respond faster than humans, they do not panic, and they do not miss alerts at 3am. Removing them from DR entirely would be the wrong conclusion from this argument.

But infrastructure recovery is not purely a speed problem. A recovery action that is fast but wrong (one that fires on a transient signal, incurs unnecessary cost, or creates a worse condition than the one it was responding to) is not a success. It is a different kind of failure, and one that is harder to diagnose because the system behaved as designed.

Deliberate recovery is slower in the moment and more reliable over time. It produces evidence. It incorporates business context that no purely automated system has access to. It distributes decision-making appropriately rather than encoding all judgment into automation that was designed before the incident scenario was known.

The goal is not to remove automation from DR. It is to put it in the right places, and to build the governance layer that makes recovery decisions, including automated ones, traceable and auditable.