
Monitoring

Purpose

Gives operators visibility into the health and behaviour of the running portal — request rates, latencies, error rates, infrastructure load, and outages — so problems are caught before users report them.

Tech stack

Layer                 Tech
Metrics collection    Prometheus
Dashboards            Grafana
Alerting              Email/Chat
Log aggregation       TODO — confirm log stack

Dashboards

Three consolidated Grafana dashboards cover day-to-day operations. Each supports filtering by environment (Dev / Staging / Prod) and by time range as a baseline; per-dashboard filters are listed below.

1. Traffic monitoring

End-to-end view of inbound traffic and API behaviour.

Metrics

  • Request volume — total requests per second / minute / hour.
  • Requests per IP — traffic broken down by source IP.
  • Requests per endpoint — traffic broken down by API path.
  • Latency — average, p50, p95, p99.
  • Error rates — 4xx and 5xx, with breakdown by error type.
  • Status code distribution — success vs error.
  • Geographic distribution — requests by country, where available.
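
The request metrics above map naturally onto Prometheus counters and histograms. Below is a minimal sketch, assuming the prometheus_client Python library; the metric names, labels, and port are illustrative, not the portal's actual configuration:

```python
# Minimal sketch of request instrumentation with prometheus_client.
# Metric names, labels, and the port are illustrative assumptions.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "portal_http_requests_total",
    "Total HTTP requests",
    ["endpoint", "status"],
)
LATENCY = Histogram(
    "portal_http_request_duration_seconds",
    "Request latency in seconds",
    ["endpoint"],
)

def handle(endpoint: str) -> None:
    """Stand-in for a real request handler; records volume and latency."""
    start = time.monotonic()
    status = 200  # would come from the actual response
    LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
    REQUESTS.labels(endpoint=endpoint, status=str(status)).inc()

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics for Prometheus to scrape
    handle("/api/3/action/package_list")
```

Averages and the p50/p95/p99 quantiles are then derived in Grafana from the histogram buckets (PromQL's histogram_quantile). Per-IP breakdowns are usually taken from proxy logs or a proxy-level exporter instead, since a source-IP label would explode metric cardinality.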

Filters

  • Specific IP.
  • Specific endpoint.

Alerts

  • Sudden traffic spikes.
  • Email notification when traffic exceeds a configured threshold, so the source can be identified and blocked.
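
The spike alert reduces to a threshold on the request-rate series. As an illustration of the moving parts, here is a sketch against the standard Prometheus HTTP query API, reusing the hypothetical metric from the earlier sketch; the URL, query, and threshold are placeholders:

```python
# Illustrative spike check against the Prometheus HTTP API.
# PROM_URL, the query, and THRESHOLD are assumptions, not real config.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = "sum(rate(portal_http_requests_total[5m]))"  # current req/s
THRESHOLD = 500.0  # req/s; tune per environment

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
current = float(result[0]["value"][1]) if result else 0.0

if current > THRESHOLD:
    # In production this fires via an alerting rule and Alertmanager's email route.
    print(f"ALERT: {current:.0f} req/s exceeds the {THRESHOLD:.0f} req/s threshold")
```

In practice the same expression would live in a Prometheus alerting rule, with Alertmanager handling the email delivery, rather than a polling script.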

2. Uptime monitoring

Service availability and health, current and historical.

Metrics

  • Current service status — up / down / degraded.
  • Uptime percentage — real-time and historical (daily / weekly / monthly).
  • Downtime incidents — timeline of outages.
  • Per-service health — CKAN, PortalJS Admin, Data API, etc.
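
The per-service status above can be fed by a simple health-check probe. A sketch, assuming prometheus_client and requests; the service list and health-check paths are illustrative:

```python
# Sketch of a per-service health probe; URLs and paths are assumptions.
import time

import requests
from prometheus_client import Gauge, start_http_server

SERVICES = {
    "ckan": "http://ckan:5000/api/3/action/status_show",
    "portaljs-admin": "http://admin:3000/healthz",
    "data-api": "http://data-api:8000/healthz",
}

UP = Gauge("portal_service_up", "1 if the last health check passed", ["service"])

def probe() -> None:
    for name, url in SERVICES.items():
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        UP.labels(service=name).set(1 if ok else 0)

if __name__ == "__main__":
    start_http_server(9101)  # Prometheus scrapes the gauge from /metrics
    while True:
        probe()
        time.sleep(30)
```

Uptime percentage then falls out of the recorded samples, e.g. avg_over_time(portal_service_up[30d]) in PromQL for a 30-day figure.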

Filters

  • Specific service or dependency.

Alerts

  • Immediate notification when a service goes down.
  • Health-check failures.
  • Certificate expiration warnings — added after a prior incident; treat as a hard requirement.
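
Expiry can be read straight off the served certificate with the Python standard library. A minimal sketch; the hostname and warning window are placeholders:

```python
# Minimal certificate-expiry check using only the standard library.
# HOST and WARN_DAYS are illustrative placeholders.
import socket
import ssl
import time

HOST = "portal.example.org"
WARN_DAYS = 21

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

days_left = int((ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) // 86400)
if days_left < WARN_DAYS:
    print(f"ALERT: certificate for {HOST} expires in {days_left} days")
```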

3. Resource usage monitoring

Infrastructure and resource consumption.

Metrics

  • CPU — utilisation percentage and load averages.
  • Memory — used vs available, broken down by process.
  • Disk — utilisation percentage and growth trends.
  • Network — inbound / outbound throughput.
  • Container health — status, restart count, resource limits.
  • Database — connection pool, query performance, storage size. (Not yet confirmed which of these existed on the old instance — to verify before sign-off.)
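
In a containerised setup these numbers normally come from exporters such as node_exporter and cAdvisor rather than hand-rolled code, but the shape of the data is easy to see in a sketch, assuming the psutil library; metric names are illustrative:

```python
# Sketch of host-level resource gauges, assuming psutil.
# node_exporter / cAdvisor provide the production equivalents.
import time

import psutil
from prometheus_client import Gauge, start_http_server

CPU = Gauge("portal_cpu_percent", "CPU utilisation percentage")
MEM = Gauge("portal_memory_used_bytes", "Memory in use, bytes")
DISK = Gauge("portal_disk_percent", "Disk utilisation percentage", ["mount"])

def collect() -> None:
    CPU.set(psutil.cpu_percent(interval=1))
    MEM.set(psutil.virtual_memory().used)
    DISK.labels(mount="/").set(psutil.disk_usage("/").percent)

if __name__ == "__main__":
    start_http_server(9102)
    while True:
        collect()
        time.sleep(15)
```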

Filters

  • Namespace.
  • Container or service.
  • Resource type.

Alerts

  • CPU usage > 80%.
  • Memory usage > 85%.
  • Disk usage > 90%.
  • Container restarts / crashes — desired; pending confirmation that the metric is exposed by the runtime.

Future dashboards

Planned for a later phase, requiring separate implementation:

  • API usage — breakdown by api_action, dataset, resource, and organisation.
  • Data downloads — counts and volumes per resource / dataset.
  • DXP frontend usage — pulled from Google Analytics (per prior discussion); not a Prometheus source.

What it does NOT do

  • Application logging. Structured logs are emitted by each service to the log stack; Prometheus is for metrics, not log search.
  • User-facing analytics. Page views, downloads, and similar product metrics live in the future dashboards above, not in the operational stack.
  • Tracing. Distributed traces, if used, are out of scope here.

Dependencies

  • Every service exposes a /metrics endpoint scraped by Prometheus.
  • Grafana reads from Prometheus.
  • Alertmanager (if configured) routes alerts onward.
  • Google Analytics (future DXP usage dashboard only).
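
For the alert-routing dependency, the email/chat leg is typically a small webhook receiver that Alertmanager posts its grouped alerts to. A sketch, assuming Flask; the route and chat URL are placeholders:

```python
# Sketch of a chat-forwarding receiver for Alertmanager's webhook output.
# The /alerts route and CHAT_WEBHOOK URL are illustrative placeholders.
import requests
from flask import Flask, request

app = Flask(__name__)
CHAT_WEBHOOK = "https://chat.example.org/hooks/ops"

@app.post("/alerts")
def alerts():
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        requests.post(
            CHAT_WEBHOOK,
            json={"text": f"[{alert.get('status')}] {name}: {summary}"},
            timeout=10,
        )
    return {"ok": True}
```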

Last reviewed: 2026-05-04
