Kosha
Reports q1-platform-reliability-review-2698f0

Q1 Platform Reliability Review

Q1 platform achieved 99.94% availability, exceeding the 99.9% SLA target. Three storage-related incidents drove prioritization of failover improvements in Q2.

md 4h ago 1.3 KB sample.md
View source Download

Q1 Platform Reliability Review#

A summary of incidents, trends, and action items for the first quarter.

Executive summary#

We operated at 99.94% availability across the primary services, slightly above our 99.9% SLA target. Three incidents accounted for the majority of downtime, all traced back to the storage layer.

Takeaway: storage failover is our weakest link. The Q2 plan prioritizes it.

Incident table#

Date Service Duration Root cause
2026-01-14 reports 22 min Disk pressure on node-3
2026-02-08 ingest 9 min Redis OOM
2026-03-21 auth 41 min Stale DNS TTL

Code example#

Here's the health check we added after February's incident:

async def check_redis():
    used = await redis.info("memory")
    pct = used["used_memory"] / used["total_system_memory"]
    if pct > 0.85:
        raise HealthCheckFailed(f"Redis at {pct:.0%}")

Next steps#

  1. Migrate reports storage to the new tier
  2. Add automated failover drills (monthly)
  3. Revisit DNS TTLs across services
  • Owner: Platform team
  • Review date: 2026-05-01