--- title: Q1 Platform Reliability Review author: Sumanth --- # Q1 Platform Reliability Review A summary of incidents, trends, and action items for the first quarter. ## Executive summary We operated at **99.94%** availability across the primary services, slightly above our 99.9% SLA target. Three incidents accounted for the majority of downtime, all traced back to the storage layer. > Takeaway: storage failover is our weakest link. The Q2 plan prioritizes it. ## Incident table | Date | Service | Duration | Root cause | |------------|-----------|----------|-------------------------| | 2026-01-14 | `reports` | 22 min | Disk pressure on node-3 | | 2026-02-08 | `ingest` | 9 min | Redis OOM | | 2026-03-21 | `auth` | 41 min | Stale DNS TTL | ## Code example Here's the health check we added after February's incident: ```python async def check_redis(): used = await redis.info("memory") pct = used["used_memory"] / used["total_system_memory"] if pct > 0.85: raise HealthCheckFailed(f"Redis at {pct:.0%}") ``` ## Next steps 1. Migrate `reports` storage to the new tier 2. Add automated failover drills (monthly) 3. Revisit DNS TTLs across services - Owner: Platform team - Review date: 2026-05-01