Q1 Platform Reliability Review

A summary of incidents, trends, and action items for the first quarter.

Executive summary

We operated at 99.94% availability across the primary services, slightly above our 99.9% SLA target. Three incidents accounted for the majority of downtime, all traced back to the storage layer.

Takeaway: storage failover is our weakest link. The Q2 plan prioritizes it.

Incident table

Date	Service	Duration	Root cause
2026-01-14	`reports`	22 min	Disk pressure on node-3
2026-02-08	`ingest`	9 min	Redis OOM
2026-03-21	`auth`	41 min	Stale DNS TTL
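
As a rough cross-check of the figures above, the sketch below converts the availability percentages into minutes and compares them with the durations in the incident table. It assumes the 99.94% figure is a single time-weighted number over calendar Q1 2026 (1 January through 31 March); if availability is instead averaged per service, the result will differ. The names `downtime_minutes`, `QUARTER_START`, and `QUARTER_END` are illustrative and not part of any existing tooling.

```python
from datetime import date

# Assumption: the review covers calendar Q1 2026, matching the incident dates.
QUARTER_START = date(2026, 1, 1)
QUARTER_END = date(2026, 4, 1)  # exclusive upper bound


def downtime_minutes(availability: float, total_minutes: int) -> float:
    """Minutes of downtime implied by an availability fraction."""
    return (1.0 - availability) * total_minutes


# 90 days in the quarter, expressed in minutes.
total_minutes = (QUARTER_END - QUARTER_START).days * 24 * 60

# Durations in minutes from the incident table: reports, ingest, auth.
incident_minutes = [22, 9, 41]

observed = downtime_minutes(0.9994, total_minutes)  # implied by 99.94%
budget = downtime_minutes(0.999, total_minutes)     # allowed by the 99.9% SLA
listed = sum(incident_minutes)

print(f"implied downtime: {observed:.1f} min")
print(f"SLA budget:       {budget:.1f} min")
print(f"listed incidents: {listed} min ({listed / observed:.0%} of implied)")
```

Under those assumptions, 99.94% corresponds to roughly 78 minutes of downtime against an SLA budget of about 130 minutes, and the three listed incidents total 72 minutes.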

Code example

Here's the health check we added after February's incident:

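# Assumes `redis` is an async client instance (e.g. redis.asyncio.Redis) and
# HealthCheckFailed is the service's own exception type.
async def check_redis():
    mem = await redis.info("memory")  # INFO "memory" section as a dict
    # Fraction of host RAM used by Redis; fail the check above 85%.
    pct = mem["used_memory"] / mem["total_system_memory"]
    if pct > 0.85:
        raise HealthCheckFailed(f"Redis at {pct:.0%}")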
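
The check above measures Redis memory against total host RAM. One possible refinement, sketched below and not the deployed check, is to measure against `maxmemory` when a limit is configured, since Redis reports `maxmemory` as 0 when none is set. The helper names `memory_pressure` and `is_unhealthy` are hypothetical, and the sample values are made up for illustration; only the `used_memory`, `maxmemory`, and `total_system_memory` field names come from the Redis `INFO memory` output.

```python
from typing import Mapping


def memory_pressure(info: Mapping[str, int]) -> float:
    """Return used memory as a fraction of the effective limit."""
    # `maxmemory` is 0 when no limit is configured; fall back to host RAM.
    limit = info.get("maxmemory", 0) or info["total_system_memory"]
    return info["used_memory"] / limit


def is_unhealthy(info: Mapping[str, int], threshold: float = 0.85) -> bool:
    """True when memory pressure exceeds the threshold (85% by default)."""
    return memory_pressure(info) > threshold


# Illustrative values only, not measurements from the February incident.
sample = {"used_memory": 900, "maxmemory": 1000, "total_system_memory": 16000}
print(is_unhealthy(sample))
```

Keeping the threshold logic in a pure function like this makes it testable without a running Redis instance; the async check would only need to fetch the `INFO` dict and pass it in.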

Next steps

Migrate reports storage to the new tier
Add automated failover drills (monthly)
Revisit DNS TTLs across services

Owner: Platform team
Review date: 2026-05-01