WordPress Manila Meetup

Lessons from Running WordPress at Scale

Jerico Aragon · Cloud Engineer, Human Made (Altis Cloud)

Who am I and why listen

Jerico Aragon

Cloud Engineer, Altis Cloud (Human Made)

Manages systems serving 100M+ requests/day across multiple regions.

Key engineer for PlayStation's high-traffic, enterprise-scale platform.

Works with Standard Chartered on secure, high-stakes financial applications.

Agenda

Scaling realities: traffic patterns, blast radius, trade-offs
Reliability habits: caching layers, statelessness, and graceful failure
Deployments that stick: CI/CD, flags, and safe DB changes
Observability at sane cost: logs, metrics, traces, and SLOs
Disaster recovery that actually works
Checklist you can use next week

1) What changes when WordPress scales

Reality Shifts

Traffic spikes & tail latency become the norm.
Editors and contributors are everywhere, all the time.
Plugins collide and the `wp_options` table bloats.

Mindset

Deployments must be repeatable and automated. Caching is no longer a feature; it is a core part of your architecture. Failures are not exceptional events. They are normal, so recovery must be designed in from the start.

Goal: Understand why new habits are needed for scaled environments.

From WP monolith to distributed flow

Gradually move things off the app: edge cache, media offload, shared caches.

The Monolith

One server does everything. Simple, but a single point of failure.

Single Server: WordPress (PHP), Database (MySQL), Media Files, Object Cache

Distributed Flow

Services are decoupled for resilience and performance.

CDN (Edge Cache)
WordPress (PHP App)
Managed Database
Shared Object Cache
Media Offload (S3)
External Search

Result: less origin load, faster responses, safer deploys.

Gradual optimization ladder

1. Full-page CDN Caching: Cache anonymous traffic at the edge and normalize cookies to maximize hit rates.
2. Media Offload: Move media assets to a dedicated service like S3 and serve them through a CDN.
3. Persistent Object Cache: Use a shared, persistent object cache like Redis to store queries and options.
4. Background Jobs: Offload heavy tasks to background jobs and use a real cron runner.
5. Canary Releases: Use feature flags and canary releases for safer deployments.

Each step reduces origin pressure and improves resilience without big rewrites.
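To make "normalize cookies" from the first rung concrete, here is a minimal sketch of the idea: ignore cookies that do not change the rendered page, and bypass the cache only when a personalizing cookie is present. The prefix list and the `cache_key` helper are illustrative assumptions, and in practice this logic usually lives in CDN configuration rather than in application code.

```python
from typing import List, Optional

# Cookie name prefixes assumed to personalize WordPress output.
# This list is an illustration; adjust it for your plugins and theme.
BYPASS_PREFIXES = ("wordpress_logged_in_", "wp-postpass_", "comment_author_")


def cookie_names(cookie_header: str) -> List[str]:
    """Extract cookie names from a raw Cookie header value."""
    names = []
    for part in cookie_header.split(";"):
        name = part.split("=", 1)[0].strip()
        if name:
            names.append(name)
    return names


def cache_key(host: str, path: str, cookie_header: str) -> Optional[str]:
    """Return a cookie-free cache key, or None when the request must skip the cache."""
    if any(name.startswith(BYPASS_PREFIXES) for name in cookie_names(cookie_header)):
        return None
    # Analytics and consent cookies are dropped, so they cannot fragment the cache.
    return f"{host.lower()}{path}"
```

With this shape, two anonymous visitors with different analytics cookies share one cached page, while a logged-in editor always reaches the origin.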

2) Disaster recovery basics anyone can start

Offsite backups + weekly restore drills: Test your backups regularly instead of just trusting them.
Health checks: Monitor your app, database, cache, search, and queues.
Runbooks with owners: Document recovery procedures and assign clear ownership.
Learn from outages: Analyze what broke and what changed to prevent future incidents.

Takeaway: Avoid being caught off guard by failure.
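The health-check item can be sketched as a small aggregator that runs one probe per dependency and reports an overall status. The probe callables here are hypothetical placeholders; real probes would ping the database, cache, search, and queue with short timeouts.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run each named probe and summarize the results per component."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "fail"
        except Exception as exc:
            # A probe that raises is reported, not allowed to crash the endpoint.
            components[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in components.values())
    return {"healthy": healthy, "components": components}
```

Reporting per component, rather than a single up/down flag, tells the on-call person which runbook to open.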

3) Caching as part of architecture

Layers

CDN for anonymous HTML, object cache for queries, tuned database, and offloaded media.

Hit Targets

Aim for >85% anonymous HTML hit rate and >95% for static assets. Measure and tune.
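A quick way to "measure and tune" is to compute hit rates from CDN hit and miss counts and compare them with the targets above. This is a minimal sketch; the `counts` shape and the class names `html` and `static` are assumptions, since every CDN exports these numbers differently.

```python
from typing import Dict, List, Tuple

# Targets from the slide: above 85% for anonymous HTML, above 95% for static assets.
TARGETS = {"html": 0.85, "static": 0.95}


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def below_target(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the content classes whose hit rate does not exceed its target."""
    return [
        name
        for name, (hits, misses) in counts.items()
        if hit_rate(hits, misses) <= TARGETS[name]
    ]
```

For example, `below_target({"html": (900, 100), "static": (940, 60)})` flags only `static`, because 94% does not clear the 95% target.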

Anti-patterns

Cookies that fragment or bypass the CDN cache, `private` headers on public pages, unbounded search queries, and logged-in traffic that always misses the cache.

Headers

Set `cache-control` per asset type, purge by tag/surrogate key, and test hit ratios in staging.
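One way to read "per asset type" is a small policy table keyed by what the URL serves. The sketch below is illustrative only: the TTL values, the extension list, and the `cache_control_for` helper are assumptions, and `immutable` is only safe when static file names are versioned.

```python
import posixpath

# Assumed policies; tune the TTLs for your own content and purge strategy.
CACHE_POLICIES = {
    "static": "public, max-age=31536000, immutable",
    "html": "public, max-age=0, s-maxage=300",
    "private": "private, no-store",
}
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".woff2"}


def cache_control_for(path: str, logged_in: bool = False) -> str:
    """Pick a Cache-Control value from the request path and login state."""
    extension = posixpath.splitext(path)[1].lower()
    if extension in STATIC_EXTENSIONS:
        return CACHE_POLICIES["static"]
    if logged_in or path.startswith("/wp-admin/"):
        return CACHE_POLICIES["private"]
    # Anonymous HTML: browsers revalidate, the CDN keeps it for a short shared TTL.
    return CACHE_POLICIES["html"]
```

Short shared TTLs on HTML pair well with purge by tag: the purge handles freshness on publish, and the TTL is the safety net.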

Request Flow:

Request → CDN → WordPress → DB

4) Deployments without fear

Separate deploy from release: Use feature flags and canary releases.
CI gates: Enforce coding standards, run tests, and perform visual diffs.
Infrastructure as Code: Use tools like Terraform to manage infrastructure.
Safe database changes: Make changes backward compatible and use background data moves.
Rollback plan: Have a rollback switch and a preflight checklist for each release.

Goal: Improve uptime immediately with better deploy habits.
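Separating deploy from release can be as simple as a percentage rollout: the code ships dark, and a flag decides who sees it. Below is a minimal sketch using deterministic hashing so the same user always lands in the same bucket. The function names and the flag and user identifiers are hypothetical, not part of any specific flag library.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a flag and user pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """Enable the flag for roughly `percent` of users; 0 acts as the rollback switch."""
    return rollout_bucket(flag, user_id) < percent
```

Raising `percent` from 5 to 50 keeps the original 5% enabled and only adds users, and setting it to 0 turns the feature off without a redeploy.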

5) Observability unlocks confidence

Metrics

Define SLOs for key user journeys, like p95 latency and error rates. Alert on user pain, not just server health.
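As a worked example of a latency and error-rate SLO, the sketch below computes p95 with the nearest-rank method and checks both numbers against targets. The thresholds passed in, and treating only 5xx responses as errors, are assumptions to adapt per user journey.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_report(latencies_ms: List[float], status_codes: List[int],
               p95_target_ms: float, max_error_rate: float) -> Dict[str, object]:
    """Summarize one journey's p95 latency and error rate against its targets."""
    p95 = percentile(latencies_ms, 95)
    errors = sum(1 for code in status_codes if code >= 500)  # assumption: 5xx only
    error_rate = errors / len(status_codes)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "latency_ok": p95 <= p95_target_ms,
        "errors_ok": error_rate <= max_error_rate,
    }
```

Running this per journey (checkout, login, publish) rather than per server is what turns it into an alert on user pain.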

Logs

Centralize logs with request IDs, use structured JSON, and redact PII. Ship logs to a cost-effective service.
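The three logging habits above (request IDs, structured JSON, PII redaction) fit in a few lines. This is a minimal sketch: the email pattern, the `SENSITIVE_KEYS` set, and the `log_line` helper are illustrative assumptions, and real redaction rules should come from your own data classification.

```python
import json
import re
import uuid

# Deliberately simple email pattern for illustration; not a full address validator.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SENSITIVE_KEYS = {"password", "token", "authorization"}  # assumed list


def redact(value):
    """Recursively mask emails in strings and values stored under sensitive keys."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[redacted-email]", value)
    if isinstance(value, dict):
        return {
            key: "[redacted]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_line(message, request_id=None, **fields):
    """Build one JSON log line that carries a request ID for cross-service correlation."""
    record = {
        "request_id": request_id or uuid.uuid4().hex,
        "message": redact(message),
        **redact(fields),
    }
    return json.dumps(record, sort_keys=True)
```

Passing the same `request_id` from the CDN through PHP to background jobs is what lets you stitch one user request together later.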

Traces

Trace slow templates and queries. Instrument key WordPress hooks to identify performance bottlenecks.

Detect Partial Failures

Monitor slow requests and specific user flows, not just uptime. Conduct weekly postmortems with actionable fixes.

Takeaway: Know what to measure and start today.

Case Slices (Quick)

Launch-day spike: CDN cache rules + purge by tag → p95 drop, DB CPU normalized.
Slow admin from `wp_options` autoload bloat: Indexed, trimmed, and moved to object cache → admin load times recovered.
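For the autoload case, the first step is usually finding which options are heavy. The sketch below works on rows already exported from `wp_options` as (option name, autoload flag, value size in bytes). The row shape and the set of flag values treated as autoloaded are assumptions; the stored flag values differ between WordPress versions, so check yours.

```python
from typing import List, Tuple

# Assumed flag values that mean "load on every request"; verify for your version.
AUTOLOAD_VALUES = {"yes", "on", "auto", "auto-on"}

Row = Tuple[str, str, int]  # (option_name, autoload flag, size in bytes)


def autoload_report(rows: List[Row], top_n: int = 3) -> Tuple[int, List[Tuple[str, int]]]:
    """Return total autoloaded bytes and the largest autoloaded options."""
    autoloaded = [(name, size) for name, flag, size in rows if flag in AUTOLOAD_VALUES]
    total = sum(size for _, size in autoloaded)
    largest = sorted(autoloaded, key=lambda item: item[1], reverse=True)[:top_n]
    return total, largest
```

The largest entries are the candidates to trim or to stop autoloading.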

Checklist to take home

>100k users or lumpy traffic? Add layered caching and measure hit rates.
Deploys cause downtime? Add flags/canary and a rollback/past-build switch.
Errors mysterious? Centralize logs + set one latency and one error SLO.
Outages linger? Weekly restore drill and owned runbooks with roles.
Growing fast? Treat it like a system: health checks, CI, and cache headers baked in.

Q&A

Ping me: jerico@humanmade.com · WP.org/jericoaragon

Let’s make your WordPress behave like an enterprise platform, whatever the scale.