Monitoring Setup Checklist
Build comprehensive observability with metrics, logs, traces, and alerts. Know what is happening in production before your users tell you.
Production monitoring is the foundation of operational reliability. Without proper observability, you are flying blind — incidents go undetected, performance degrades silently, and debugging takes hours instead of minutes. This checklist covers the three pillars of observability (metrics, logs, traces) plus alerting and dashboarding.
Checklist Items
- Define key service-level indicators (SLIs) [critical]: Identify the metrics that matter most: latency, error rate, throughput, and saturation for each service (see the instrumentation sketch after this list).
- Set up infrastructure metrics collection [critical]: Collect CPU, memory, disk, and network metrics from all servers and containers using Prometheus, Datadog, or CloudWatch.
- Implement structured application logging [critical]: Log in JSON format with consistent fields: timestamp, level, request ID, service name, and context (a JSON logging sketch follows this list).
- Configure centralized log aggregation [important]: Ship all logs to a central system (ELK, Datadog, Loki) with retention policies and search capability.
- Set up distributed tracing [important]: Implement OpenTelemetry or Jaeger tracing across services to track requests from ingress to database and back (see the OpenTelemetry sketch after this list).
- Create alerting rules for critical conditions [important]: Define alerts for error rate spikes, latency degradation, service downtime, and resource exhaustion with appropriate thresholds.
- Configure alert routing and on-call schedules [important]: Route alerts to the right team via PagerDuty or Opsgenie with escalation policies and on-call rotations.
- Build service health dashboards [recommended]: Create dashboards showing the RED metrics (Rate, Errors, Duration) for each service at a glance.
- Implement synthetic monitoring [recommended]: Run automated health checks from external locations to detect outages and performance issues before users report them (a simple probe sketch follows this list).
- Set up SLO tracking and error budget dashboards [recommended]: Define service-level objectives and track error budgets to balance reliability with feature velocity (an error-budget calculation follows this list).
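A minimal sketch of what the SLI item can look like inside a Python service, using the prometheus_client library to expose a latency histogram and a request/error counter. The metric names, labels, and handler shape are illustrative assumptions, not a required convention.

```python
# Sketch: exposing latency and error-rate SLIs from a Python service with
# prometheus_client. Prometheus scrapes the /metrics endpoint on port 8000.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "route", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["method", "route"]
)

def handle_request(method: str, route: str) -> None:
    start = time.perf_counter()
    status = "200"
    try:
        pass  # real handler work goes here
    except Exception:
        status = "500"
        raise
    finally:
        # Error rate = rate of status="500" over rate of all requests;
        # latency percentiles come from the histogram buckets.
        LATENCY.labels(method, route).observe(time.perf_counter() - start)
        REQUESTS.labels(method, route, status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    handle_request("GET", "/orders")
```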
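For the structured-logging item, a standard-library-only sketch of JSON log lines. The field names and the hypothetical "checkout-api" service name are placeholders for whatever schema your team standardizes on.

```python
# Sketch: JSON-structured logging with only the standard library.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout-api",  # hypothetical service name
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches per-request context fields to the record.
logger.info("order created", extra={"request_id": "req-42"})
```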
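For distributed tracing, a minimal OpenTelemetry sketch (it assumes the opentelemetry-api and opentelemetry-sdk packages are installed). It exports spans to the console; a real deployment would configure an OTLP exporter pointed at Jaeger or Tempo instead.

```python
# Sketch: manual OpenTelemetry instrumentation with a console exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-api")  # hypothetical service name

def fetch_order(order_id: str) -> None:
    # Parent span for the request; child span for the database call.
    with tracer.start_as_current_span("fetch_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.query"):
            pass  # database call goes here

fetch_order("order-123")
```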
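For synthetic monitoring, a bare-bones availability-and-latency probe. The URL and latency budget are illustrative, and a production setup would run probes like this from several external regions on a schedule.

```python
# Sketch: a single synthetic check measuring availability and latency.
import time
import urllib.request

CHECK_URL = "https://example.com/healthz"  # hypothetical health endpoint
LATENCY_BUDGET_SECONDS = 1.0

def probe(url: str) -> tuple[bool, float]:
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = 200 <= resp.status < 300
    except OSError:  # covers timeouts, DNS failures, and HTTP errors
        ok = False
    return ok, time.perf_counter() - start

if __name__ == "__main__":
    healthy, elapsed = probe(CHECK_URL)
    if not healthy or elapsed > LATENCY_BUDGET_SECONDS:
        print(f"ALERT healthy={healthy} latency={elapsed:.2f}s")  # page on-call here
    else:
        print(f"OK latency={elapsed:.2f}s")
```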
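Finally, the error-budget arithmetic behind SLO tracking is simple enough to show directly; the SLO, window, and request counts below are made-up numbers.

```python
# Sketch: deriving an error budget from an SLO (illustrative numbers).
SLO = 0.999                    # 99.9% of requests succeed over a 30-day window
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Full-outage budget: {budget_minutes:.1f} minutes per window")  # ~43.2

requests_total = 10_000_000
requests_failed = 4_200
budget_consumed = requests_failed / (requests_total * (1 - SLO))
print(f"Error budget consumed: {budget_consumed:.0%}")  # 42%
```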
Common Mistakes
- Alert fatigue from noisy alerts: Start with a few high-signal alerts on symptoms (error rate, latency), not causes (CPU usage). Add more only when needed.
- No correlation between metrics, logs, and traces: Include trace IDs in log entries and link metrics to traces. Correlated observability data cuts debugging time dramatically (see the sketch after this list).
- Monitoring only the happy path: Monitor error paths, timeout rates, retry rates, and queue depths. Problems hide in the flows you do not watch.
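One common way to get that correlation, sketched with Python's logging module and the OpenTelemetry API: a logging filter stamps the active trace ID onto every record. It assumes a tracer provider is already configured (as in the tracing sketch above); the logger name and log format are illustrative.

```python
# Sketch: include the current OpenTelemetry trace ID in every log line so
# logs can be joined to traces during an incident.
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        return True  # never drop records, only annotate them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed, see trace for details")
```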
Frequently Asked Questions
- What monitoring stack should I use?
- For most teams: Prometheus + Grafana for metrics, Loki or ELK for logs, and Jaeger or Tempo for traces. Datadog is an excellent all-in-one alternative if budget allows.
- How many alerts should I start with?
- Start with 5-10 critical alerts: service down, error rate above threshold, P99 latency above the SLO, low disk space, and certificate expiry. Add more gradually based on incident learnings.
- Do I need distributed tracing from day one?
- If you have more than two services, yes. Tracing becomes essential for debugging cross-service issues. Instrument with OpenTelemetry early — retrofitting is much harder.