A real DR program is not a binder. It is a rhythm of detection, containment, recovery, communication, and learning that runs whether or not anything is going wrong. This is ours.
A reliability program is a closed loop. Skipping a phase is how outages get longer.
Synthetic health checks from three U.S. regions hit every Worker every 60 seconds. An anomaly detector watches the audit log hourly. Either signal can open an incident automatically.
Per-tenant feature flags let us disable a misbehaving feature for one customer without redeploying. Per-provider circuit breakers route around failing dependencies in seconds.
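
The per-provider circuit breaker described above can be pictured roughly as follows. This is a minimal sketch, not the production implementation: the class name, the failure threshold, the cooldown, and the `pick_provider` helper are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """Per-provider breaker: opens after repeated failures, retries after a cooldown."""

    failure_threshold: int = 3       # assumed value, not from the source
    cooldown_seconds: float = 30.0   # assumed value, not from the source
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the provider may be attempted."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_provider(breakers: dict[str, CircuitBreaker], preference: list[str]) -> Optional[str]:
    """Return the first preferred provider whose breaker allows traffic."""
    for name in preference:
        if breakers[name].allow():
            return name
    return None
```

Keeping one breaker per provider means a failing dependency is skipped without affecting calls to the healthy ones, and the cooldown lets traffic return on its own once the provider recovers.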
D1 Time Travel restores any tenant database to any point in the last 30 days. A break-glass admin tool restores a single tenant in under five minutes without operator scripting.
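
Before a point-in-time restore is requested, the target timestamp has to fall inside the 30-day window. The check below is a sketch of that guard only; `validate_restore_point` is a hypothetical helper, and the actual restore call is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # restore window stated in the text


def validate_restore_point(target: datetime, now: datetime) -> datetime:
    """Return the target normalized to UTC, or raise if it is outside the window."""
    if target.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    target_utc = target.astimezone(timezone.utc)
    now_utc = now.astimezone(timezone.utc)
    if target_utc > now_utc:
        raise ValueError("restore point is in the future")
    if now_utc - target_utc > RETENTION:
        raise ValueError("restore point is older than the 30-day window")
    return target_utc
```

Rejecting naive timestamps up front avoids restoring to the wrong moment when an operator's local time zone differs from UTC.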
Incidents post updates to the public status page in real time. RSS, JSON, and email subscriptions are available, and customers can subscribe per component.
Every incident with customer-visible impact of fifteen minutes or more receives a public post-mortem within five business days. Remediation items are tracked and linked.
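
The post-mortem rule reduces to a threshold check and a business-day count. The sketch below assumes the five business days are counted from the date the incident is resolved and that only weekends are skipped; the source specifies neither, and both function names are illustrative.

```python
from datetime import date, timedelta

IMPACT_THRESHOLD_MINUTES = 15   # from the text
POSTMORTEM_BUSINESS_DAYS = 5    # from the text


def needs_public_postmortem(customer_visible_minutes: float) -> bool:
    """True when customer-visible impact reaches the fifteen-minute threshold."""
    return customer_visible_minutes >= IMPACT_THRESHOLD_MINUTES


def postmortem_due(resolved_on: date) -> date:
    """Count five business days forward (Mon-Fri; holidays are ignored here)."""
    due = resolved_on
    remaining = POSTMORTEM_BUSINESS_DAYS
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return due
```

For example, an incident resolved on a Friday would have its post-mortem due the following Friday under these assumptions.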
Authentication recovers before everything else, because nothing works without it. The audit log recovers second, because forensic continuity must not be lost.
| Data class | RPO target | RTO target | Notes |
|---|---|---|---|
| Authentication | ≤ 30s | ≤ 2m | Sessions are stored in KV with global replication. The auth database is the smallest of the five and recovers quickly. |
| Tenant operations | ≤ 60s | ≤ 5m | Companies, users, plans, and billing settings. Restore order: brought online after authentication and the audit log. |
| Business records | ≤ 60s | ≤ 5m | Accounting, CRM, HR, e-commerce, and POS data. The largest of the five databases; carries the bulk of customer state. |
| Communications | ≤ 60s | ≤ 10m | Call history, voicemails, transcripts. Real-time delivery resumes within 60s of restore. |
| Audit log | ≤ 5s | ≤ 5m | Append-only, replicated continuously. Restored immediately after authentication and before any other database, to preserve forensic continuity. |
| Customer files (R2) | ≤ 60s | Continuous | Object storage with cross-region replication and 11 nines of durability. Object-lock retention is enforced at the storage layer. |
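
The targets in the table above can be encoded as data and checked against the numbers measured during a drill. The sketch below is illustrative: the key names and the `drill_violations` helper are assumptions, and the "Continuous" RTO for customer files is modeled as "no restore step to time", which is one possible reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    rpo_seconds: int
    rto_seconds: Optional[int]  # None models "Continuous" (assumed: nothing to time)


# Values copied from the table; the key names are illustrative.
TARGETS = {
    "authentication": Target(30, 120),
    "tenant_operations": Target(60, 300),
    "business_records": Target(60, 300),
    "communications": Target(60, 600),
    "audit_log": Target(5, 300),
    "customer_files": Target(60, None),
}


def drill_violations(measured: dict[str, tuple[float, float]]) -> list[str]:
    """Compare measured (data-loss seconds, restore seconds) against the targets."""
    problems = []
    for name, (rpo, rto) in measured.items():
        target = TARGETS[name]
        if rpo > target.rpo_seconds:
            problems.append(f"{name}: RPO {rpo}s exceeds {target.rpo_seconds}s")
        if target.rto_seconds is not None and rto > target.rto_seconds:
            problems.append(f"{name}: RTO {rto}s exceeds {target.rto_seconds}s")
    return problems
```

An empty result means every measured data class met its targets. Anything else names the class and the objective it missed.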
Restore any tenant database to any moment in the last 30 days with one click in the break-glass admin tool.
Dead-letter queues capture failed messages with full payloads. One click replays a checkpoint or a single message.
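
Replaying from a dead-letter queue amounts to selecting entries, re-running the handler, and keeping whatever fails again. The in-memory sketch below is hypothetical throughout: `DeadLetter`, `DeadLetterQueue`, and the idea that a checkpoint is a label shared by a batch of messages are assumptions, not details from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DeadLetter:
    message_id: str
    checkpoint: str   # assumed: a label shared by a batch of messages
    payload: dict     # the full original payload, kept for replay


@dataclass
class DeadLetterQueue:
    entries: list[DeadLetter] = field(default_factory=list)

    def replay(
        self,
        handler: Callable[[dict], None],
        *,
        message_id: Optional[str] = None,
        checkpoint: Optional[str] = None,
    ) -> int:
        """Re-run handler on matching entries and return how many succeeded.

        Entries that succeed are removed; entries that fail again are kept.
        """
        if (message_id is None) == (checkpoint is None):
            raise ValueError("pass exactly one of message_id or checkpoint")
        remaining: list[DeadLetter] = []
        replayed = 0
        for entry in self.entries:
            if message_id is not None:
                selected = entry.message_id == message_id
            else:
                selected = entry.checkpoint == checkpoint
            if not selected:
                remaining.append(entry)
                continue
            try:
                handler(entry.payload)
                replayed += 1
            except Exception:
                remaining.append(entry)  # still failing: keep it for a later attempt
        self.entries = remaining
        return replayed
```

Because a failed replay leaves the entry in place, the operation is safe to repeat after the underlying fault is fixed.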
Versioned deploys let us pin traffic back to the previous working version of any Worker in seconds.
A super-admin can freeze a tenant during an investigation, preventing new writes while the audit log is preserved.
Disable a misbehaving feature for one customer or globally without redeploying. The flag table lives in the auth database, the most resilient of the five.
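
One way to resolve a flag with both global and per-tenant scope is to let the global switch win and fall back to a tenant override. This is a sketch under assumed semantics; the two dictionaries stand in for the flag table, and `is_enabled` is a hypothetical name.

```python
def is_enabled(
    feature: str,
    tenant_id: str,
    global_flags: dict[str, bool],
    tenant_overrides: dict[tuple[str, str], bool],
) -> bool:
    """Resolve a feature flag for one tenant.

    A feature that is off globally (or unknown) is off for everyone;
    otherwise a per-tenant override can switch it off for that tenant only.
    """
    if not global_flags.get(feature, False):
        return False
    return tenant_overrides.get((tenant_id, feature), True)
```

With this ordering, turning a feature off globally cannot be undone by a stale tenant override, which is the safer default during an incident.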
Open, update, and resolve incidents from the admin console. The public status page reflects changes in real time.
The same source of truth that drives the marketing site, the runbooks, and the synthetic alerts.
| Metric | Target | What it means |
|---|---|---|
| Uptime | 99.99% | Target uptime across the platform. Roughly 4.4 minutes of monthly downtime budget. |
| Read latency (p95, US) | < 50 ms | Read p95 measured coast-to-coast within the United States, served from the nearest edge replica. |
| Write latency (p95, US) | < 150 ms | Write p95 measured against the regional primary database, including queue acknowledgements. |
| Recovery Point Objective (RPO) | ≤ 60 seconds | Maximum window of data potentially lost in a worst-case region failure for critical tables. |
| Recovery Time Objective (RTO) | ≤ 5 minutes | Maximum time to restore the service for a regional incident affecting a critical workload. |
| Hot backup retention | 30 days | Time-Travel-style point-in-time restore window covering every tenant database. |
| Cold archive retention | 7 years | Object-locked archive in geographically redundant storage with retention enforced at the storage layer. |
| Backup durability | 11 nines | Backups are stored on R2 with eleven nines of annual durability, replicated cross-region. |
| US edge presence | 30+ POPs | Compute and cache run on Cloudflare’s anycast network with more than thirty US points of presence. |
| DDoS / WAF | Always on | Layer 3, 4, and 7 protection plus a managed WAF rule pack are enabled for every customer. |
| Encryption | AES-256 + TLS 1.3 | Customer data is encrypted at rest with AES-256 and in transit with TLS 1.3 on every connection. |
| Admin MFA | Mandatory | Administrators must enroll TOTP or WebAuthn before performing privileged operations. |
| DR drills | Quarterly, public | A scripted disaster-recovery game day is run every quarter and the results are posted publicly. |
| Public post-mortems | ≥ 15 minutes impact | Any incident with customer-visible impact of fifteen minutes or more is documented in a public post-mortem. |
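
The downtime budget in the uptime row follows directly from the availability target. The helper below is a small worked calculation; its name and the choice of month length are assumptions.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 60


# A 30-day month gives about 4.32 minutes at 99.99%.
thirty_day = round(downtime_budget_minutes(99.99, 30), 2)

# An average month (365.25 / 12 days) gives about 4.38 minutes,
# which is where a figure of roughly 4.4 minutes comes from.
average_month = round(downtime_budget_minutes(99.99, 365.25 / 12), 2)
```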
In a worst-case region failure, the maximum amount of data potentially unrecoverable is the work performed in the last 60 seconds. For most data classes, the observed window is far smaller because replication runs continuously.
Customers on enterprise plans can opt into hands-on participation in a quarterly disaster-recovery drill, run jointly with CloudIP engineering.