Q1 Platform Reliability Review
Q1 platform achieved 99.94% availability, exceeding the 99.9% SLA target. Three storage-related incidents drove prioritization of failover improvements in Q2.
Q1 Platform Reliability Review#
A summary of incidents, trends, and action items for the first quarter.
Executive summary#
We operated at 99.94% availability across the primary services, slightly above our 99.9% SLA target. Three incidents accounted for the majority of downtime, all traced back to the storage layer.
Takeaway: storage failover is our weakest link. The Q2 plan prioritizes it.
Incident table#
| Date | Service | Duration | Root cause |
|---|---|---|---|
| 2026-01-14 | reports |
22 min | Disk pressure on node-3 |
| 2026-02-08 | ingest |
9 min | Redis OOM |
| 2026-03-21 | auth |
41 min | Stale DNS TTL |
Code example#
Here's the health check we added after February's incident:
async def check_redis():
used = await redis.info("memory")
pct = used["used_memory"] / used["total_system_memory"]
if pct > 0.85:
raise HealthCheckFailed(f"Redis at {pct:.0%}")
Next steps#
- Migrate
reportsstorage to the new tier - Add automated failover drills (monthly)
- Revisit DNS TTLs across services
- Owner: Platform team
- Review date: 2026-05-01