# Monitoring
Metrics, dashboards, and alerts for the NESO data portal.
## Purpose
Gives operators visibility into the health and behaviour of the running portal — request rates, latencies, error rates, infrastructure load, and outages — so problems are caught before users report them.
## Tech stack
| Layer | Tech |
|---|---|
| Metrics collection | Prometheus |
| Dashboards | Grafana |
| Alerting | Email/Chat |
| Log aggregation | TODO — confirm log stack |
## Dashboards
Three consolidated Grafana dashboards cover day-to-day operations. Each supports filtering by environment (Dev / Staging / Prod) and by time range as a baseline; per-dashboard filters are listed below.
### 1. Traffic monitoring
End-to-end view of inbound traffic and API behaviour.
#### Metrics
- Request volume — total requests per second / minute / hour.
- Requests per IP — traffic broken down by source IP.
- Requests per endpoint — traffic broken down by API path.
- Latency — average, p50, p95, p99.
- Error rates — 4xx and 5xx, with breakdown by error type.
- Status code distribution — success vs error.
- Geographic distribution — requests by country, where available.
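As a sketch, the metrics above map onto PromQL queries like the following. The metric names (`http_requests_total`, `http_request_duration_seconds`) follow common exporter conventions and are assumptions — the actual names and labels depend on the exporters deployed with the portal:

```promql
# Request volume: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Requests per endpoint (label name "handler" is illustrative)
sum by (handler) (rate(http_requests_total[5m]))

# p95 latency from a histogram metric
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx error rate as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```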
#### Filters
- Specific IP.
- Specific endpoint.
#### Alerts
- Sudden traffic spikes.
- Email notification when traffic exceeds a configured threshold, so the source can be identified and blocked.
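A traffic-spike alert could be expressed as a Prometheus alerting rule along these lines; the threshold, duration, and metric name are placeholders to be tuned for the portal's actual baseline:

```yaml
groups:
  - name: traffic
    rules:
      - alert: TrafficSpike
        # Threshold of 500 req/s is illustrative only
        expr: sum(rate(http_requests_total[5m])) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate above threshold for 10 minutes"
```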
### 2. Uptime monitoring
Service availability and health, current and historical.
#### Metrics
- Current service status — up / down / degraded.
- Uptime percentage — real-time and historical (daily / weekly / monthly).
- Downtime incidents — timeline of outages.
- Per-service health — CKAN, PortalJS Admin, Data API, etc.
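Service status and uptime percentage can come directly from Prometheus's built-in `up` series, which is 1 when a scrape target responds and 0 when it does not. Per-service health (CKAN, PortalJS Admin, Data API) would appear via the `job` label defined in the scrape configuration:

```promql
# Current status of each scrape target (1 = up, 0 = down)
up

# Uptime percentage over the last 24 hours, per job
avg_over_time(up[24h]) * 100
```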
#### Filters
- Specific service or dependency.
#### Alerts
- Immediate notification when a service goes down.
- Health-check failures.
- Certificate expiration warnings — added after a prior incident; treat as a hard requirement.
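The uptime alerts could be sketched as rules like the following. The certificate check assumes the blackbox exporter is probing the public endpoints; `probe_ssl_earliest_cert_expiry` is the metric it exposes for that purpose:

```yaml
groups:
  - name: uptime
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      # Fires when a probed certificate expires within 14 days
      # (assumes blackbox exporter probes are configured)
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
```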
### 3. Resource usage monitoring
Infrastructure and resource consumption.
#### Metrics
- CPU — utilisation percentage and load averages.
- Memory — used vs available, broken down by process.
- Disk — utilisation percentage and growth trends.
- Network — inbound / outbound throughput.
- Container health — status, restart count, resource limits.
- Database — connection pool, query performance, storage size. (Not yet confirmed which of these existed on the old instance — to verify before sign-off.)
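Assuming node_exporter for host metrics and cAdvisor for container metrics (both standard companions to Prometheus, though not confirmed in this deployment), the resource metrics above correspond to queries like:

```promql
# CPU utilisation per node, as a percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Fraction of memory still available per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Working-set memory per container (cAdvisor metric)
sum by (container) (container_memory_working_set_bytes)
```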
#### Filters
- Namespace.
- Container or service.
- Resource type.
#### Alerts
- CPU usage > 80%.
- Memory usage > 85%.
- Disk usage > 90%.
- Container restarts / crashes — desired; pending confirmation that the metric is exposed by the runtime.
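The fixed thresholds above translate into alerting rules such as the following sketch (again assuming node_exporter metric names):

```yaml
groups:
  - name: resources
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 15m
      - alert: DiskNearlyFull
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes)
              / node_filesystem_size_bytes * 100 > 90
        for: 10m
```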
## Future dashboards
Planned for a later phase, requiring separate implementation:
- API usage — breakdown by `api_action`, dataset, resource, and organisation.
- Data downloads — counts and volumes per resource / dataset.
- DXP frontend usage — pulled from Google Analytics (per prior discussion); not a Prometheus source.
## What it does NOT do
- Application logging. Structured logs are emitted by each service to the log stack; Prometheus is for metrics, not log search.
- User-facing analytics. Page views, downloads, and similar product metrics live in the future dashboards above, not in the operational stack.
- Tracing. Distributed traces, if used, are out of scope here.
## Dependencies
- Every service exposes a `/metrics` endpoint scraped by Prometheus.
- Grafana reads from Prometheus.
- Alertmanager (if configured) routes alerts onward.
- Google Analytics (future DXP usage dashboard only).
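The `/metrics` scraping dependency amounts to one scrape job per service in `prometheus.yml`; job names and targets below are illustrative, not the actual deployment values:

```yaml
scrape_configs:
  - job_name: ckan
    metrics_path: /metrics
    static_configs:
      - targets: ["ckan:5000"]
  - job_name: data-api
    metrics_path: /metrics
    static_configs:
      - targets: ["data-api:8080"]
```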
Last reviewed: 2026-05-04