High Availability Cluster Setup
Designing resilient cluster topologies that fail over automatically during outages.
For critical infrastructure that cannot tolerate downtime, relying on a single instance is dangerous. This guide describes configuration architectures for high availability across regions and datacenters. It covers peer synchronization topologies, automated node election triggers, and application failover strategies that keep communication channels up through a sudden rack loss or a cascading failure triggered by a single point of failure.
Core Disaster Recovery (DR) Topology
The CXMind Active-Active High Availability model relies on a decoupled, stateless ingestion layer working in tandem with replicated persistent data stores. To deploy a true HA environment, your architecture must implement the following multi-region structure:
- Stateless Ingestion Scaling: Ingestion nodes store no state themselves. Run at least two Ingestion Engine instances behind a UDP load balancer (e.g., F5 BIG-IP, HAProxy, or cloud-native Network Load Balancers) that uses consistent hashing on the SIP Call-ID, so that all messages belonging to one call are routed to the same instance.
- Cross-Region Redis Pub/Sub: The heartbeat of the system. Deploy Redis with replication and automatic failover: either self-managed Redis monitored by Redis Sentinel, or a managed service such as Redis Enterprise or AWS ElastiCache with Multi-AZ automatic failover, which manage failover internally. If the primary region's master node fails, a replica is promoted automatically, and the stateless Ingestion nodes reconnect using exponential backoff.
- ClickHouse ReplicatedMergeTree: Configure your persistent CDR storage with at least two shards and two replicas per shard. ClickHouse Keeper coordinates replication and distributed DDL operations; because it requires a quorum to make progress, it keeps replicas consistent during network partitions and prevents split-brain.
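The first two items above name two techniques that are easier to reason about in code: consistent hashing on the SIP Call-ID and reconnecting with exponential backoff. The following is a minimal Python sketch of both, not CXMind's implementation. In a real deployment the load balancer performs the hashing, and the class name, node names, virtual-node count, and backoff parameters (`base`, `cap`, `attempts`) are all assumptions chosen for illustration.

```python
import bisect
import hashlib
import random


class CallIdHashRing:
    """Illustrative consistent-hash ring mapping SIP Call-IDs to ingestion nodes."""

    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring at `vnodes` points so load spreads evenly
        # and removing a node only remaps the calls that node was serving.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Stable 64-bit hash; Python's built-in hash() is randomized per process.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, call_id):
        # Pick the first ring point clockwise from the Call-ID's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, self._hash(call_id)) % len(self._ring)
        return self._ring[idx][1]


def backoff_delays(base=0.5, cap=30.0, attempts=8, rng=random.random):
    """Yield capped exponential reconnect delays (seconds) with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles each attempt up to `cap`; jitter keeps many
        # clients from reconnecting to the new master at the same instant.
        yield rng() * min(cap, base * 2 ** attempt)
```

With a ring built from hypothetical nodes such as `CallIdHashRing(["ingest-a", "ingest-b", "ingest-c"])`, repeated calls to `node_for` with the same Call-ID always return the same node, and dropping one node leaves the assignments of the other two untouched. A reconnect loop would sleep for each value from `backoff_delays()` between attempts and stop as soon as the connection succeeds.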
Application Node Failover Configuration
App Nodes serve the Admin UI and orchestrate LLM dispatching. They are typically fronted by an application load balancer.
# Example NGINX Load Balancing Block for App Nodes
upstream backend_app_nodes {
    least_conn;  # Distribute based on fewest active connections
    # A server is taken out of rotation for 10s after 3 failed attempts within 10s
    server 10.0.1.50:5173 max_fails=3 fail_timeout=10s;
    server 10.0.2.50:5173 max_fails=3 fail_timeout=10s;
    server 10.0.3.50:5173 backup;  # Standby disaster recovery node, used only when the primary servers are unavailable
}