Lessons from Running WordPress at Scale
Jerico Aragon · Cloud Engineer, Human Made (Altis Cloud)
Who am I and why listen
Manages systems serving 100M+ requests/day across multiple regions.
Key engineer for PlayStation's high-traffic, enterprise-scale platform.
Works with Standard Chartered on secure, high-stakes financial applications.
Agenda
- Scaling realities: traffic patterns, blast radius, trade-offs
- Reliability habits: caching layers, statelessness, and graceful failure
- Deployments that stick: CI/CD, flags, and safe DB changes
- Observability at sane cost: logs, metrics, traces, and SLOs
- Disaster recovery that actually works
- Checklist you can use next week
1) What changes when WordPress scales
Reality Shifts
- Traffic spikes & tail latency become the norm.
- Editors and contributors are everywhere, all the time.
- Plugins collide and the `wp_options` table bloats.
Mindset
Deployments must be repeatable and automated. Caching is no longer a feature; it is a core part of your architecture. Failures are not exceptional events. They are normal, so recovery must be designed in from the start.
Goal: Understand why new habits are needed for scaled environments.
From WP monolith to distributed flow
Gradually move things off the app: edge cache, media offload, shared caches.
The Monolith
One server does everything. Simple, but a single point of failure.
Distributed Flow
Services are decoupled for resilience and performance.
Result: less origin load, faster responses, safer deploys.
Gradual optimization ladder
Full-page CDN Caching
Cache anonymous traffic at the edge and normalize cookies to maximize hit rates.
Media Offload
Move media assets to a dedicated service like S3 and serve them through a CDN.
Persistent Object Cache
Use a shared, persistent object cache like Redis to store query results and options, so every app server reads from the same cache.
Background Jobs
Offload heavy tasks to background jobs and use a real cron runner.
Canary Releases
Use feature flags and canary releases for safer deployments.
Each step reduces origin pressure and improves resilience without big rewrites.
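As an illustration of the first rung, cookie normalization can be sketched as below. This is a minimal sketch, not any CDN's actual configuration: the function names are hypothetical, and the prefix allowlist assumes WordPress's default login, password-protected-post, and comment cookies. Real sites usually need to extend it for their plugins.

```python
from typing import Iterable

# Assumption: only cookies with these prefixes change the rendered page.
# Analytics and marketing cookies are dropped from the cache key.
VARY_PREFIXES = ("wordpress_logged_in_", "wp-postpass_", "comment_author_")


def normalize_cookie_header(cookie_header: str,
                            vary_prefixes: Iterable[str] = VARY_PREFIXES) -> str:
    """Return only the cookies that should be part of the cache key."""
    prefixes = tuple(vary_prefixes)
    kept = []
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name.startswith(prefixes):
            kept.append(f"{name}={value}")
    # Sort so that cookie order never creates duplicate cache entries.
    return "; ".join(sorted(kept))


def is_cacheable(cookie_header: str) -> bool:
    """A request with no relevant cookies is anonymous and edge-cacheable."""
    return normalize_cookie_header(cookie_header) == ""
```

With this rule, a visitor carrying only tracking cookies shares one cached page with every other anonymous visitor, which is where the hit-rate gain comes from.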
2) Disaster recovery basics anyone can start
- Offsite backups + weekly restore drills: Test your backups regularly instead of just trusting them.
- Health checks: Monitor your app, database, cache, search, and queues.
- Runbooks with owners: Document recovery procedures and assign clear ownership.
- Learn from outages: Analyze what broke and what changed to prevent future incidents.
Takeaway: Avoid being caught off guard by failure.
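The health-check item above could look something like this in practice. It is a sketch under stated assumptions: `run_health_checks` and the probe names are hypothetical, and each probe is any callable that returns True when its dependency responds.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run every probe and summarize the results in one report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:
            # A probe that crashes or times out counts as a failure,
            # but it must not stop the remaining probes from running.
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "ok" if healthy else "degraded", "checks": results}
```

Reporting each dependency separately (app, database, cache, search, queues) tells the on-call engineer which runbook to open, not just that something is wrong.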
3) Caching as part of architecture
Layers
CDN for anonymous HTML, object cache for queries, tuned database, and offloaded media.
Hit Targets
Aim for >85% anonymous HTML hit rate and >95% for static assets. Measure and tune.
Anti-patterns
Cookies that bypass the CDN cache, `private` cache headers on public pages, unbounded search queries, and logged-in traffic that always misses the cache.
Headers
Set `Cache-Control` per asset type, purge by tag or surrogate key, and test hit ratios in staging.
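One way to picture a per-asset-type header policy is the sketch below. The function name, extension list, and TTL values are illustrative assumptions, not figures from this talk, and the long static TTL assumes asset URLs are versioned.

```python
import posixpath

# Assumption: these extensions are versioned static assets on this site.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                     ".svg", ".webp", ".woff2"}


def cache_control_for(path: str, logged_in: bool = False) -> str:
    """Pick a Cache-Control value from the request path and login state."""
    path_only = path.split("?", 1)[0]  # ignore the query string
    if logged_in or path_only.startswith(("/wp-admin", "/wp-login.php")):
        # Personalized or admin responses must never be stored in shared caches.
        return "private, no-store"
    ext = posixpath.splitext(path_only)[1].lower()
    if ext in STATIC_EXTENSIONS:
        # Long TTL is only safe because the URL changes when the file changes.
        return "public, max-age=31536000, immutable"
    # Anonymous HTML: short browser TTL, longer CDN TTL via s-maxage.
    return "public, max-age=60, s-maxage=600"
```

Keeping the policy in one function makes it easy to test in staging before it reaches the edge.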
4) Deployments without fear
- Separate deploy from release: Use feature flags and canary releases.
- CI gates: Enforce coding standards, run tests, and perform visual diffs.
- Infrastructure as Code: Use tools like Terraform to manage infrastructure.
- Safe database changes: Make changes backward compatible and use background data moves.
- Rollback plan: Have a rollback switch and a preflight checklist for each release.
Goal: Improve uptime immediately with better deploy habits.
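Separating deploy from release usually comes down to a percentage rollout. The sketch below shows one common way to do it with a stable hash. The function and flag names are hypothetical, not a specific flag library's API.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for users whose bucket falls below the rollout percentage."""
    return rollout_bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% still sees it at 50%, and setting the percentage to 0 acts as the rollback switch without a new deploy.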
5) Observability unlocks confidence
Metrics
Define SLOs for key user journeys, using indicators such as p95 latency and error rate. Alert on user pain, not just server health.
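To make the two indicators concrete, here is a minimal sketch that checks one latency SLO and one error SLO over a window of requests. The targets (800 ms, 1% errors) and function names are illustrative assumptions, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_report(latencies_ms: Sequence[float], status_codes: Sequence[int],
               p95_target_ms: float = 800.0, max_error_rate: float = 0.01) -> dict:
    """Compare a window of requests against a latency SLO and an error SLO."""
    p95 = percentile(latencies_ms, 95)
    errors = sum(1 for code in status_codes if code >= 500)
    error_rate = errors / len(status_codes)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "latency_ok": p95 <= p95_target_ms,
        "errors_ok": error_rate <= max_error_rate,
    }
```

Feeding this only the requests for one journey (for example, publishing a post) keeps the alert tied to what users actually experience.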
Logs
Centralize logs with request IDs, use structured JSON, and redact PII. Ship logs to a cost-effective service.
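A structured log line with a request ID and basic redaction might be produced as sketched below, using Python's standard `logging` module. The field names are assumptions, and the email regex is a deliberately simple stand-in for real PII rules.

```python
import json
import logging
import re
import uuid

# Assumption: emails are the only PII handled in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace email addresses before the message leaves the process."""
    return EMAIL_RE.sub("[redacted-email]", text)


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, including the request ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": redact(record.getMessage()),
            # request_id is attached per call through logging's `extra` argument.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def new_request_id() -> str:
    """Generate an ID once per request and pass it to every log call."""
    return uuid.uuid4().hex


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("wp.requests")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("login failed for jane@example.com",
             extra={"request_id": new_request_id()})
```

With the same request ID on every line, one search in the central log store reconstructs a single request across app, cache, and database layers.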
Traces
Trace slow templates and queries. Instrument key WordPress hooks to identify performance bottlenecks.
Detect Partial Failures
Monitor slow requests and specific user flows, not just uptime. Conduct weekly postmortems with actionable fixes.
Takeaway: Know what to measure and start today.
Case Slices (Quick)
- Launch-day spike: CDN cache rules + purge by tag → p95 drop, DB CPU normalized.
- Slow admin from `wp_options` autoload bloat: Indexed, trimmed, and moved to object cache → load time recovery.
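For the autoload case, the first step is usually finding which options are large and loaded on every request. The sketch below runs that query against any DB-API connection; it is shown with `sqlite3` so it is self-contained, whereas a real site would point it at MySQL. The function name is hypothetical, and the set of autoload values is an assumption to verify against your WordPress version.

```python
import sqlite3

# Assumption: 'yes' is the classic autoload value; newer WordPress releases
# also use 'on', 'auto', and 'auto-on' for options loaded on every request.
AUTOLOADED = ("yes", "on", "auto", "auto-on")


def largest_autoloaded_options(conn, limit: int = 10):
    """List the biggest autoloaded options, largest first."""
    placeholders = ",".join("?" for _ in AUTOLOADED)
    sql = (
        # LENGTH counts characters in SQLite and bytes in MySQL.
        "SELECT option_name, LENGTH(option_value) AS size "
        "FROM wp_options "
        f"WHERE autoload IN ({placeholders}) "
        "ORDER BY size DESC LIMIT ?"
    )
    return conn.execute(sql, (*AUTOLOADED, limit)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wp_options "
                 "(option_name TEXT, option_value TEXT, autoload TEXT)")
    conn.executemany(
        "INSERT INTO wp_options VALUES (?, ?, ?)",
        [("siteurl", "https://example.com", "yes"),
         ("big_plugin_blob", "x" * 5000, "yes")],
    )
    print(largest_autoloaded_options(conn, limit=5))
```

The rows at the top of this list are the candidates to trim or switch off autoload.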
Checklist to take home
- >100k users or lumpy traffic? Add layered caching and measure hit rates.
- Deploys cause downtime? Add flags/canary and a switch to roll back to a previous build.
- Errors mysterious? Centralize logs + set one latency and one error SLO.
- Outages linger? Weekly restore drill and owned runbooks with roles.
- Growing fast? Treat it like a system: health checks, CI, and cache headers baked in.
Q&A
Ping me: jerico@humanmade.com · WP.org/jericoaragon
Let’s make your WordPress behave like an enterprise platform, whatever the scale.