DNS Failover Strategy
DNS failover reduces downtime by routing traffic away from a failing origin to a standby origin. It only works if the standby can serve correct content and, for dynamic sites, correct state.
What Failover Does (and Doesn't)
Failover is routing. It does not magically replicate your database, uploads, sessions, or cache state.
If the backup does not have current content/state, failover trades downtime for incorrect behavior (stale content, broken logins, carts, or admin tasks).
Choose a Model
| Model | How it behaves | Who it fits |
|---|---|---|
| Active-passive | one primary, one standby | most sites that want better uptime without full HA complexity |
| Active-active | both origins serve traffic | high-traffic sites that can keep state shared consistently |
Option A: Cloudflare Load Balancing (If Available)
Cloudflare Load Balancing (LB) gives you health checks and automatic pool failover at the edge.
Recommended settings:
| Setting | Recommendation | Why |
|---|---|---|
| Monitor type | HTTPS | tests the real request path |
| Monitor path | custom health endpoint | avoids cached-homepage false positives |
| Timeout | 3-5s | fail fast during real outages |
| Retries | 1-2 | reduce flapping |
| Steering | failover | only use backup when needed |
| Session affinity | enable for logged-in flows | keeps a logged-in user on one origin, avoiding session mismatches |
Avoid using / as the health check path on cached sites. A cached homepage can return 200 even when PHP or the database is down.
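The settings in the table above could be expressed as a monitor definition along these lines. This is a minimal sketch, not a definitive configuration: the field names follow the Cloudflare Load Balancing monitor object as I understand it, the `interval` of 60 seconds and the `/?health_check=1` path are assumptions, and `monitor_payload`, `CF_ACCOUNT_ID`, and `CF_API_TOKEN` are hypothetical names used only for illustration. Check the values against your plan limits before using them.

```shell
#!/bin/sh
# Print a load balancer monitor definition as JSON.
# $1 = health check path (assumed default: a dedicated health endpoint, not "/")
# $2 = timeout in seconds (table recommends 3-5)
# $3 = retries (table recommends 1-2)
monitor_payload() {
  path="${1:-/?health_check=1}"
  timeout="${2:-5}"
  retries="${3:-2}"
  printf '{"type":"https","method":"GET","path":"%s","timeout":%s,"retries":%s,"interval":60,"expected_codes":"200","follow_redirects":false}\n' \
    "$path" "$timeout" "$retries"
}

# Example submission (left commented out; account ID and token are placeholders):
# curl -s -X POST \
#   "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/load_balancers/monitors" \
#   -H "Authorization: Bearer $CF_API_TOKEN" \
#   -H "Content-Type: application/json" \
#   --data "$(monitor_payload)"
```

Keeping the payload in a function makes it easy to review the exact values in version control before anything is sent to the API.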
Health Check Endpoint (WordPress)
This MU plugin provides a lightweight endpoint that returns 200 when every critical check passes and 503 when any check fails.
```php
<?php
/**
 * Load balancer health check (MU plugin).
 * Request: https://example.com/?health_check=1
 */
add_action('init', function () {
    if (!isset($_GET['health_check'])) {
        return;
    }

    // Keep page caches and the CDN from storing the health response.
    nocache_headers();
    header('Content-Type: application/json');

    $checks = [];

    // Database: a trivial query proves the connection is alive.
    global $wpdb;
    $checks['database'] = ($wpdb->get_var('SELECT 1') == 1);

    // Object cache (optional): only test a persistent backend (Redis, Memcached).
    if (wp_using_ext_object_cache()) {
        wp_cache_set('health_test', 'ok', '', 60);
        $checks['object_cache'] = (wp_cache_get('health_test') === 'ok');
    }

    // Healthy only if no check returned false.
    $healthy = !in_array(false, $checks, true);
    http_response_code($healthy ? 200 : 503);
    echo wp_json_encode(['healthy' => $healthy, 'checks' => $checks]);
    exit;
});
```
If you need an endpoint that still answers when PHP is down, serve a simple static endpoint from the web server instead. Be aware that such a check only proves "web server alive", not "WordPress alive".
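To see what a monitor would conclude, you can probe the endpoint by hand. The sketch below is an illustration only: `classify_status` and `probe_health` are hypothetical helper names, the 5-second timeout mirrors the recommended monitor timeout, and treating every 5xx as unhealthy is an assumption (when PHP itself is down, the web server typically answers 500 or 502 rather than the plugin's 503).

```shell
#!/bin/sh
# Map an HTTP status code to a health verdict.
classify_status() {
  case "$1" in
    200) echo healthy ;;       # all checks passed
    000) echo unreachable ;;   # curl reports 000 when no response was received
    5??) echo unhealthy ;;     # plugin's 503, or a web server error such as 502
    *)   echo unexpected ;;    # redirects, 403, 404: the endpoint is misconfigured
  esac
}

# Probe a health URL and print the verdict.
# $1 = full URL, for example https://example.com/?health_check=1
probe_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
  classify_status "${code:-000}"
}
```

Running `probe_health "https://example.com/?health_check=1"` against each origin before a drill shows whether the monitor has a meaningful signal to act on.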
Keep the Backup Ready
At minimum, the backup origin needs:
- the same web/PHP configuration as primary (or automation to recreate it)
- current WordPress code and configuration
- current uploads/media (prefer object storage for HA)
- a plan for database continuity (managed DB, replication, or clear RPO/RTO)
Common sync patterns (examples)
```
# WordPress files (every 5 minutes)
*/5 * * * * rsync -az --delete /var/www/html/ backup:/var/www/html/

# Uploads (every minute)
* * * * * rsync -az /var/www/html/wp-content/uploads/ backup:/var/www/html/wp-content/uploads/
```
Rsync-based failover is usually fine for mostly-static marketing sites, but it is not enough for transactional sites (WooCommerce, memberships) unless your database/session strategy is solid.
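One way to notice a stale backup before a failover exposes it is a heartbeat file: the primary writes the current time into a file inside the synced tree, and the backup checks how old that timestamp is. This is a minimal sketch under stated assumptions: the function names, the heartbeat path, and the 600-second threshold are all hypothetical, and it only detects a stalled file sync, not database lag.

```shell
#!/bin/sh
# On the primary: record the current Unix time in the heartbeat file.
# $1 = heartbeat path inside the synced tree (hypothetical location)
write_heartbeat() {
  date +%s > "$1"
}

# On the backup: print the heartbeat age in seconds.
# $1 = heartbeat path, $2 = "now" as Unix time (optional, defaults to the clock)
heartbeat_age() {
  now="${2:-$(date +%s)}"
  stamp=$(cat "$1") || return 2
  echo $(( $now - $stamp ))
}

# Succeed if the heartbeat is at most $2 seconds old, fail otherwise.
# $1 = heartbeat path, $2 = maximum age in seconds, $3 = "now" (optional)
sync_is_fresh() {
  age=$(heartbeat_age "$1" "$3") || return 2
  [ "$age" -le "$2" ]
}

# Example cron usage (paths are placeholders):
#   primary: * * * * * write_heartbeat /var/www/html/.sync-heartbeat
#   backup:  sync_is_fresh /var/www/html/.sync-heartbeat 600 || echo "backup is stale"
```

The threshold should sit comfortably above the sync interval so that one slow rsync run does not raise an alert.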
Test Failover and Failback
Run drills on purpose (quarterly is a good baseline):
- Verify both origins serve the same build/config.
- Trigger an outage on primary (stop the web server or block port 443 at the firewall).
- Confirm edge routing moves traffic to backup.
- Test critical user flows (login, checkout, form submissions).
- Restore primary and confirm failback behavior.
To see which origin answered at each step, inspect the response headers:

```
curl -sI https://example.com/ | grep -iE 'server|cf-ray|x-litespeed-cache'
```
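During a drill it helps to log the identifying header repeatedly and to query each origin directly. The helpers below are a sketch, not a finished tool: all three function names are hypothetical, the origin IP is a placeholder you supply, and the header you watch is an assumption. Behind a proxy the `server` header usually names the proxy, so a header that differs per origin is more useful if you have one.

```shell
#!/bin/sh
# Print the value of one header (case-insensitive name) from a header block on stdin.
header_value() {
  name="$1"
  tr -d '\r' | awk -v n="$name" '
    BEGIN { n = tolower(n) }
    {
      i = index($0, ":")
      if (i > 0 && tolower(substr($0, 1, i - 1)) == n) {
        v = substr($0, i + 1)
        sub(/^ +/, "", v)
        print v
        exit
      }
    }'
}

# Fetch headers from one origin directly, bypassing DNS and the edge.
# $1 = hostname, $2 = origin IP address
origin_headers() {
  curl -sI --max-time 5 --resolve "$1:443:$2" "https://$1/"
}

# Poll the public URL and log one header value per poll.
# $1 = URL, $2 = header name, $3 = number of polls, $4 = seconds between polls
watch_failover() {
  count=0
  while [ "$count" -lt "$3" ]; do
    value=$(curl -sI --max-time 5 "$1" | header_value "$2")
    echo "$(date +%H:%M:%S) $2=${value:-none}"
    count=$((count + 1))
    sleep "$4"
  done
}
```

For example, `watch_failover https://example.com/ x-litespeed-cache 30 2` started just before stopping the primary produces a timestamped log of when responses change, and `origin_headers example.com <origin-ip> | header_value server` confirms what each origin returns on its own.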
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| health check hits cached content | LB says "healthy" during a real outage | use a real health endpoint (dynamic or purpose-built) |
| backup is stale | users see old content or broken assets | automate sync/deploy; prefer shared storage for uploads |
| no database plan | logins/carts/admin fail after failover | decide replication/managed DB strategy before calling it HA |
| never testing failback | "recovered" origin causes surprises | drill both failover and failback |