---
title: Q1 Platform Reliability Review
author: Sumanth
---

# Q1 Platform Reliability Review

A summary of incidents, trends, and action items for the first quarter.

## Executive summary

We operated at **99.94%** availability across the primary services, slightly above our 99.9% SLA target. Three incidents accounted for the majority of downtime, all traced back to the storage layer.

> Takeaway: storage failover is our weakest link. The Q2 plan prioritizes it.

## Incident table

| Date       | Service   | Duration | Root cause              |
|------------|-----------|----------|-------------------------|
| 2026-01-14 | `reports` | 22 min   | Disk pressure on node-3 |
| 2026-02-08 | `ingest`  | 9 min    | Redis OOM               |
| 2026-03-21 | `auth`    | 41 min   | Stale DNS TTL           |

## Code example

Here's the health check we added after February's incident:

```python
async def check_redis():
    used = await redis.info("memory")
    pct = used["used_memory"] / used["total_system_memory"]
    if pct > 0.85:
        raise HealthCheckFailed(f"Redis at {pct:.0%}")
```

## Next steps

1. Migrate `reports` storage to the new tier
2. Add automated failover drills (monthly)
3. Revisit DNS TTLs across services

- Owner: Platform team
- Review date: 2026-05-01
