DNS Failover Strategy

DNS failover reduces downtime by routing traffic away from a failing origin to a standby origin. It only works if the standby can serve correct content and, for dynamic sites, correct state.

What Failover Does (and Doesn't)

Failover is routing. It does not magically replicate your database, uploads, sessions, or cache state.

Caution: If the backup does not have current content/state, failover trades downtime for incorrect behavior (stale content, broken logins, carts, or admin tasks).

Choose a Model

| Model | How it behaves | Who it fits |
| --- | --- | --- |
| Active-passive | one primary, one standby | most sites that want better uptime without full HA complexity |
| Active-active | both origins serve traffic | high-traffic sites that can keep state shared consistently |

Option A: Cloudflare Load Balancing (If Available)

Cloudflare Load Balancing (LB) gives you health checks and automatic pool failover at the edge.

Recommended settings:

| Setting | Recommendation | Why |
| --- | --- | --- |
| Monitor type | HTTPS | tests the real request path |
| Monitor path | custom health endpoint | avoids cached-homepage false positives |
| Timeout | 3-5 s | fail fast during real outages |
| Retries | 1-2 | reduce flapping |
| Steering | failover | only use backup when needed |
| Session affinity | enable for logged-in flows | reduces cross-origin session weirdness |

Tip: Avoid using / as the health check path on cached sites. A cached homepage can return 200 even when PHP or the database is down.
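
To see how the timeout and retry settings interact, the monitor's behavior can be approximated from a shell. This is a minimal sketch, not Cloudflare's implementation: the function names `fetch_status` and `probe` are made up for illustration, and "retries" is assumed to mean extra attempts after the first failure.

```shell
# Sketch of an LB-style monitor. Function names are illustrative and are
# not part of Cloudflare's tooling.

# fetch_status URL TIMEOUT_SECONDS
# Prints the HTTP status code; curl prints 000 on timeout or connection failure.
fetch_status() {
  curl -s -o /dev/null --max-time "$2" -w '%{http_code}' "$1" || true
}

# probe URL TIMEOUT_SECONDS RETRIES [FETCH_COMMAND]
# Makes 1 + RETRIES attempts and reports "healthy" on the first 200.
# FETCH_COMMAND defaults to fetch_status; pass a stub to test the logic offline.
probe() {
  local url="$1" timeout="$2" retries="$3" fetch="${4:-fetch_status}"
  local attempt=0 code
  while [ "$attempt" -le "$retries" ]; do
    code=$("$fetch" "$url" "$timeout")
    if [ "$code" = "200" ]; then
      echo "healthy"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "unhealthy"
  return 1
}
```

For example, `probe "https://example.com/?health_check=1" 5 2` allows five seconds per attempt and tolerates two failed attempts before reporting the origin as unhealthy. A single slow response therefore does not trigger failover, which is the point of keeping retries at 1-2 rather than 0.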

Health Check Endpoint (WordPress)

This MU plugin provides a lightweight endpoint that can return 200 or 503 based on critical checks.

wp-content/mu-plugins/health-check.php
```php
<?php
/**
 * Load balancer health check.
 * Request: https://example.com/?health_check=1
 */
add_action('init', function () {
    if (!isset($_GET['health_check'])) {
        return;
    }

    // Keep page caches and CDNs from storing the health response.
    nocache_headers();
    header('Content-Type: application/json');

    $checks = [];

    // Database: a trivial round-trip query.
    global $wpdb;
    $checks['database'] = ((int) $wpdb->get_var('SELECT 1') === 1);

    // Object cache (optional): only tested when a persistent cache is in use,
    // because the default in-memory cache would always pass.
    if (wp_using_ext_object_cache()) {
        wp_cache_set('health_test', 'ok', '', 60);
        $checks['object_cache'] = (wp_cache_get('health_test') === 'ok');
    }

    // Any failed check marks the origin unhealthy.
    $healthy = !in_array(false, $checks, true);
    http_response_code($healthy ? 200 : 503);
    echo json_encode(['healthy' => $healthy, 'checks' => $checks]);
    exit;
});
```
Note: If you want a health check that stays valid even when PHP is down, implement a simple static endpoint at the web server instead (but then you are only testing "web server alive", not "WordPress alive").

Keep the Backup Ready

At minimum, the backup origin needs:

  • the same web/PHP configuration as primary (or automation to recreate it)
  • current WordPress code and configuration
  • current uploads/media (prefer object storage for HA)
  • a plan for database continuity (managed DB, replication, or clear RPO/RTO)

Common sync patterns (examples)

Rsync cron examples (files and uploads)

```shell
# WordPress files (every 5 minutes)
*/5 * * * * rsync -az --delete /var/www/html/ backup:/var/www/html/

# Uploads (every minute)
* * * * * rsync -az /var/www/html/wp-content/uploads/ backup:/var/www/html/wp-content/uploads/
```

Caution: Rsync-based failover is usually fine for mostly-static marketing sites, but it is not enough for transactional sites (WooCommerce, memberships) unless your database/session strategy is solid.
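
A scheduled sync only helps if it keeps succeeding, so it can be worth checking how stale the backup is against your recovery point objective. The sketch below assumes the sync job records the epoch time of its last successful run somewhere readable. That marker, its path, and the function names `sync_age` and `is_fresh` are hypothetical and not part of rsync or WordPress.

```shell
# sync_age NOW_EPOCH LAST_SYNC_EPOCH
# Prints the number of seconds since the last successful sync.
sync_age() {
  echo $(( $1 - $2 ))
}

# is_fresh NOW_EPOCH LAST_SYNC_EPOCH MAX_AGE_SECONDS
# Exit status 0 when the backup is within the allowed staleness budget.
is_fresh() {
  [ "$(sync_age "$1" "$2")" -le "$3" ]
}

# Example (marker path is hypothetical; 300 s matches a 5-minute sync interval):
#   is_fresh "$(date +%s)" "$(cat /var/lib/failover/last_sync)" 300 \
#     || echo "backup is stale" >&2
```

Passing the timestamps as arguments keeps the comparison easy to test and leaves the choice of marker (a file, a monitoring metric, a database row) open.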

Test Failover and Failback

Run drills on purpose (quarterly is a good baseline):

  1. Verify both origins serve the same build/config.
  2. Trigger an outage on primary (stop the web server or firewall off 443).
  3. Confirm edge routing moves traffic to backup.
  4. Test critical user flows (login, checkout, form submissions).
  5. Restore primary and confirm failback behavior.

Check which origin responded (headers)

```shell
# -s hides curl's progress meter so only the matching headers are printed
curl -sI https://example.com/ | grep -iE 'server|cf-ray|x-litespeed-cache'
```
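
During a drill, step 3 can be automated by polling until the backup is the origin that answers. This sketch assumes each origin adds a response header that names it. `X-Origin-Name` is a hypothetical header you would have to configure yourself, and the function names are illustrative.

```shell
# origin_from_headers
# Reads HTTP response headers on stdin and prints the value of X-Origin-Name
# (a hypothetical header; each origin must be configured to send it).
origin_from_headers() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-origin-name" && !seen++ { print $2 }'
}

# wait_for_origin URL EXPECTED_NAME MAX_TRIES
# Polls about once per second until the expected origin answers.
# Exit status 0 on success, 1 if MAX_TRIES is exhausted.
wait_for_origin() {
  local url="$1" expected="$2" max_tries="$3" try=0
  while [ "$try" -lt "$max_tries" ]; do
    if [ "$(curl -sI --max-time 5 "$url" | origin_from_headers)" = "$expected" ]; then
      echo "served by $expected"
      return 0
    fi
    try=$((try + 1))
    sleep 1
  done
  echo "not served by $expected after $max_tries tries" >&2
  return 1
}
```

After stopping the primary, `wait_for_origin https://example.com/ backup 120` gives a rough measure of how long failover takes. Running it again with the primary's name after restoring it covers the failback half of the drill.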

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| health check hits cached content | LB says "healthy" during a real outage | use a real health endpoint (dynamic or purpose-built) |
| backup is stale | users see old content or broken assets | automate sync/deploy; prefer shared storage for uploads |
| no database plan | logins/carts/admin fail after failover | decide replication/managed DB strategy before calling it HA |
| never testing failback | "recovered" origin causes surprises | drill both failover and failback |

What's Next