# Monitoring
Metrics, dashboards, and alerts for the NESO data portal.
## Purpose
Gives operators visibility into the health and behaviour of the running portal — request rates, latencies, error rates, infrastructure load, and outages — so problems are caught before users report them.
## Tech stack
| Layer | Tech |
|---|---|
| Metrics collection | Prometheus |
| Dashboards | Grafana |
| Alerting | Email/Chat |
| Log aggregation | TODO — confirm log stack |
## Dashboards
Three consolidated Grafana dashboards cover day-to-day operations. Each supports filtering by environment (Dev / Staging / Prod) and by time range as a baseline; per-dashboard filters are listed below.
### 1. Traffic monitoring
End-to-end view of inbound traffic and API behaviour.
#### Metrics
- Request volume — total requests per second / minute / hour.
- Requests per IP — traffic broken down by source IP.
- Requests per endpoint — traffic broken down by API path.
- Latency — average, p50, p95, p99.
- Error rates — 4xx and 5xx, with breakdown by error type.
- Status code distribution — success vs error.
- Geographic distribution — requests by country, where available.
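As a sketch, the metrics above map onto PromQL queries like the following. The metric names (`http_requests_total`, `http_request_duration_seconds`) follow common exporter conventions and are assumptions — the actual names and labels depend on the exporters deployed with the portal:

```promql
# Request volume: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Requests per endpoint (label name "handler" is illustrative)
sum by (handler) (rate(http_requests_total[5m]))

# p95 latency from a histogram metric
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx error rate as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```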
#### Filters
- Specific IP.
- Specific endpoint.
#### Alerts
- Sudden traffic spikes.
- Email notification when traffic exceeds a configured threshold, so the source can be identified and blocked.
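A traffic-spike alert could be expressed as a Prometheus alerting rule along these lines; the threshold, duration, and metric name are placeholders to be tuned for the portal's actual baseline:

```yaml
groups:
  - name: traffic
    rules:
      - alert: TrafficSpike
        # Threshold of 500 req/s is illustrative only
        expr: sum(rate(http_requests_total[5m])) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate above threshold for 10 minutes"
```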
### 2. Uptime monitoring
Service availability and health, current and historical.
#### Metrics
- Current service status — up / down / degraded.
- Uptime percentage — real-time and historical (daily / weekly / monthly).
- Downtime incidents — timeline of outages.
- Per-service health — CKAN, PortalJS Admin, Data API, etc.
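Service status and uptime percentage can come directly from Prometheus's built-in `up` series, which is 1 when a scrape target responds and 0 when it does not. Per-service health (CKAN, PortalJS Admin, Data API) would appear via the `job` label defined in the scrape configuration:

```promql
# Current status of each scrape target (1 = up, 0 = down)
up

# Uptime percentage over the last 24 hours, per job
avg_over_time(up[24h]) * 100
```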
#### Filters
- Specific service or dependency.
#### Alerts
- Immediate notification when a service goes down.
- Health-check failures.
- Certificate expiration warnings — added after a prior incident; treat as a hard requirement.
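The uptime alerts could be sketched as rules like the following. The certificate check assumes the blackbox exporter is probing the public endpoints; `probe_ssl_earliest_cert_expiry` is the metric it exposes for that purpose:

```yaml
groups:
  - name: uptime
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      # Fires when a probed certificate expires within 14 days
      # (assumes blackbox exporter probes are configured)
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
```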
### 3. Resource usage monitoring
Infrastructure and resource consumption.
#### Metrics
- CPU — utilisation percentage and load averages.
- Memory — used vs available, broken down by process.
- Disk — utilisation percentage and growth trends.
- Network — inbound / outbound throughput.
- Container health — status, restart count, resource limits.
- Database — connection pool, query performance, storage size. (Not yet confirmed which of these existed on the old instance — to verify before sign-off.)
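Assuming node_exporter for host metrics and cAdvisor for container metrics (both standard companions to Prometheus, though not confirmed in this deployment), the resource metrics above correspond to queries like:

```promql
# CPU utilisation per node, as a percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Fraction of memory still available per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Working-set memory per container (cAdvisor metric)
sum by (container) (container_memory_working_set_bytes)
```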
#### Filters
- Namespace.
- Container or service.
- Resource type.
#### Alerts
- CPU usage > 80%.
- Memory usage > 85%.
- Disk usage > 90%.
- Container restarts / crashes — desired; pending confirmation that the metric is exposed by the runtime.
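The fixed thresholds above translate into alerting rules such as the following sketch (again assuming node_exporter metric names):

```yaml
groups:
  - name: resources
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 15m
      - alert: DiskNearlyFull
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes)
              / node_filesystem_size_bytes * 100 > 90
        for: 10m
```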
## Future dashboards
Planned for a later phase, requiring separate implementation:
- API usage — breakdown by `api_action`, dataset, resource, and organisation.
- Data downloads — counts and volumes per resource / dataset.
- DXP frontend usage — pulled from Google Analytics (per prior discussion); not a Prometheus source.
## What it does NOT do
- Application logging. Structured logs are emitted by each service to the log stack; Prometheus is for metrics, not log search.
- User-facing analytics. Page views, downloads, and similar product metrics live in the future dashboards above, not in the operational stack.
- Tracing. Distributed traces, if used, are out of scope here.
## Dependencies
- Every service exposes a `/metrics` endpoint scraped by Prometheus.
- Grafana reads from Prometheus.
- Alertmanager (if configured) routes alerts onward.
- Google Analytics (future DXP usage dashboard only).
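The `/metrics` scraping dependency amounts to one scrape job per service in `prometheus.yml`; job names and targets below are illustrative, not the actual deployment values:

```yaml
scrape_configs:
  - job_name: ckan
    metrics_path: /metrics
    static_configs:
      - targets: ["ckan:5000"]
  - job_name: data-api
    metrics_path: /metrics
    static_configs:
      - targets: ["data-api:8080"]
```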
Last reviewed: 2026-05-04