Database High Availability: Clustering, Replication and Failover

February 26, 2026 Editorial Team 7 min read

When the database goes down, everything goes down. Whether it is an ERP system, a web application, or a point-of-sale platform, database availability underpins every transaction. This article explores the major approaches to database high availability — active-passive clustering, active-active replication, and automated failover — across SQL Server, PostgreSQL, and MySQL, with practical guidance on choosing the right architecture for your clients.

Why Database High Availability Matters

Databases sit at the heart of virtually every business-critical application. A few minutes of database downtime can halt order processing, prevent customer logins, freeze manufacturing lines, and stop payroll runs. The cost of downtime varies by industry, but even for a small Australian business running a cloud-hosted ERP, an hour of database unavailability can mean tens of thousands of dollars in lost revenue and productivity. High availability (HA) architectures are designed to eliminate or minimise this risk by ensuring that if one database instance fails, another takes over with minimal interruption and data loss.

Key Concepts: RPO, RTO, and Failover

Before diving into specific technologies, it is essential to understand the two metrics that define any HA solution. Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. An RPO of zero means no transactions can be lost, which requires synchronous replication. An RPO of five minutes means the business can tolerate losing up to five minutes of committed transactions, which can be achieved with asynchronous replication. Recovery Time Objective (RTO) is the maximum acceptable downtime before the database is operational again. Automated failover can achieve RTOs measured in seconds; manual failover may take minutes to hours depending on staff availability and runbook quality.

Active-Passive vs Active-Active Architectures

Active-Passive vs Active-Active

Feature Active-Passive Active-Active
Write handling All writes go to the primary node Writes accepted on multiple nodes
Read scaling Reads can be offloaded to standby Reads distributed across all nodes
Failover complexity Moderate — promote standby to primary Low — other nodes continue serving
Conflict resolution Not applicable — single writer Required — conflicts possible on concurrent writes
Data consistency Strong — single source of truth Eventual or conflict-resolved — depends on implementation
Typical use case OLTP workloads, ERP, accounting Geographically distributed apps, read-heavy workloads

In an active-passive configuration, one database instance handles all read and write traffic while one or more standby instances receive a continuous stream of replicated changes. If the primary fails, a standby is promoted to take over. This is the most common HA topology for transactional workloads because it avoids write conflicts entirely. The trade-off is that the standby hardware sits idle under normal conditions, though many systems allow read-only queries against the standby to improve utilisation.

In an active-active configuration, multiple database instances accept writes simultaneously. This delivers higher aggregate throughput and better geographic distribution but introduces the challenge of write conflicts — two nodes may attempt to modify the same row at the same time. Conflict resolution strategies range from last-write-wins timestamps to application-level conflict handlers. Active-active is best suited to workloads that are naturally partitioned (e.g., each region writes to its own subset of data) or where eventual consistency is acceptable.

SQL Server Always On Availability Groups

Microsoft SQL Server's Always On Availability Groups (AG) is the premier HA feature for SQL Server Enterprise and Standard editions. An Availability Group consists of a primary replica that handles read-write traffic and one or more secondary replicas that receive transaction log records via synchronous or asynchronous replication. Synchronous commit ensures zero data loss (RPO = 0) by requiring the secondary to harden the log record before the primary acknowledges the commit to the application. Asynchronous commit allows the primary to commit without waiting, improving write performance at the cost of potential data loss during failover.

Always On AGs require a Windows Server Failover Cluster (WSFC) or, from SQL Server 2017 onwards, a Pacemaker cluster on Linux. The cluster provides the health-monitoring and automatic-failover mechanism. A Listener — a virtual network name and IP address — abstracts the client connection so that applications do not need to know which replica is currently primary. When failover occurs, the Listener DNS record is updated and client connections are redirected within seconds. Secondary replicas can be configured for read-only access, allowing reporting and analytics workloads to be offloaded from the primary.

PostgreSQL Streaming Replication

PostgreSQL provides built-in streaming replication that ships Write-Ahead Log (WAL) records from a primary server to one or more standby servers in near real-time. In synchronous mode, the primary waits for the standby to confirm that the WAL record has been written to disk before acknowledging the commit, achieving RPO = 0. In asynchronous mode, the primary streams WAL records without waiting, which reduces latency but introduces a small replication lag. Standby servers can serve read-only queries via hot standby mode, making them useful for reporting and read scaling.

PostgreSQL does not include a built-in automatic failover manager, so the community and vendors have developed several external tools for this purpose. Patroni, backed by a distributed consensus store such as etcd, is the most popular choice for automated failover in production. Patroni monitors the primary, detects failures, orchestrates the promotion of a standby, and updates connection routing — all without human intervention. Other options include repmgr, pg_auto_failover from Citus, and managed services like Amazon RDS and Azure Database for PostgreSQL that handle failover transparently.

MySQL Group Replication and InnoDB Cluster

MySQL Group Replication is a plugin that enables a group of MySQL servers to coordinate writes using a distributed consensus protocol based on Paxos. In single-primary mode, one member accepts writes while the others are read-only replicas; if the primary fails, the group automatically elects a new primary. In multi-primary mode, all members accept writes, and the group replication protocol detects and rolls back conflicting transactions. Single-primary mode is recommended for most workloads because it avoids the complexity of conflict detection and certification.

InnoDB Cluster wraps Group Replication with MySQL Shell for cluster administration and MySQL Router for transparent client routing. MySQL Router acts as a lightweight proxy that directs read-write traffic to the current primary and distributes read-only traffic across secondaries. This three-component stack — Group Replication, MySQL Shell, and MySQL Router — provides an integrated HA solution that is straightforward to deploy and manage, particularly in environments where MySQL is already the standard database engine.

RPO Implications: Synchronous vs Asynchronous Replication

The choice between synchronous and asynchronous replication has a direct impact on RPO and application performance. Synchronous replication guarantees that every committed transaction exists on at least two nodes, delivering RPO = 0. However, every write incurs the latency of the network round trip to the standby, which can add one to several milliseconds depending on distance. For replicas in the same data centre or availability zone, this overhead is usually negligible. For replicas across cities — say Sydney to Melbourne — the added latency is around 10-15 ms per commit, which can noticeably affect high-throughput OLTP workloads.

Asynchronous replication decouples the commit acknowledgement from the replication process. The primary commits immediately and streams changes to the standby in the background. This delivers near-zero write overhead but means the standby is always slightly behind. In the event of a primary failure, any transactions that were committed on the primary but not yet replicated to the standby will be lost. The actual data loss depends on the replication lag at the moment of failure, which is typically sub-second under normal conditions but can grow during periods of heavy write activity.

Pros

  • Dramatically reduces unplanned downtime for business-critical applications
  • Automated failover eliminates dependence on on-call DBAs for recovery
  • Read replicas improve query performance for reporting and analytics
  • Synchronous replication can achieve zero data loss (RPO = 0)
  • Modern solutions like Patroni and InnoDB Cluster simplify deployment

Cons

  • Synchronous replication adds write latency, especially over long distances
  • Licensing costs can be significant (e.g., SQL Server Enterprise for advanced AG features)
  • Split-brain scenarios require careful quorum and fencing configuration
  • Active-active architectures introduce write conflict complexity
  • Monitoring and testing failover adds ongoing operational overhead

Choosing the Right HA Architecture for Your Client

For most SMB clients running a single-site deployment with SQL Server, Always On Availability Groups in synchronous mode with two replicas and a file-share witness is the sweet spot. It provides automatic failover, zero data loss, and read offloading. For clients on PostgreSQL, a three-node Patroni cluster with etcd provides equivalent capability at zero licensing cost. MySQL shops should look at InnoDB Cluster in single-primary mode for a self-contained HA stack. In all cases, test failover regularly — at least quarterly — and ensure that connection strings use the virtual listener or router endpoint rather than pointing directly at a specific server IP address.

Share:
Back to Blog

Related Posts

Ubiquiti U7 Pro XG Review: WiFi 7 With a 10 GbE Uplink
Jun 01, 2026
Ubiquiti U7 Pro XG Review: WiFi 7 With a 10 GbE Uplink

The U7 Pro XG brings WiFi 7, a 10 GbE PoE+ uplink and a silent metal-heatsink design to UniFi’s flagship …

Feb 26, 2026
Building a Home Lab for IT Professionals: Hardware and Software Guide

A home lab is one of the best investments an IT professional can make. It provides a safe environment to …

Feb 26, 2026
Cyber Insurance: What Australian Businesses Need to Qualify

Cyber insurance has shifted from a nice-to-have to a boardroom priority, but getting coverage is no longer simple. Australian insurers …