Observability

XTrinode exposes observability through Kubernetes events, Prometheus metrics, gateway metrics, and structured logs.

Kubernetes Events

The operator writes events for lifecycle, reconciliation, scaling, KEDA, catalog, node-pool, and cleanup actions.

kubectl get events --field-selector involvedObject.name=analytics -n team-a
kubectl get events --field-selector involvedObject.name=analytics,type=Warning -n team-a
kubectl get events --watch --field-selector involvedObject.name=analytics -n team-a

Events are the fastest way to understand why a runtime is stuck in Reconciling, Suspended, or Error.
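When a runtime has accumulated many events, it can help to collapse the warnings into a per-reason count before reading them one by one. The following is a minimal sketch, not part of XTrinode: it assumes the input is the output of `kubectl get events -n team-a -o json`, and the function name `summarize_warnings` is hypothetical.

```python
import json
from collections import Counter


def summarize_warnings(events_json: str, name: str) -> Counter:
    """Count Warning events per reason for one involved object.

    `events_json` is assumed to be the JSON printed by
    `kubectl get events -o json` (a List object with an `items` array).
    """
    items = json.loads(events_json).get("items", [])
    reasons = Counter()
    for event in items:
        if event.get("type") != "Warning":
            continue
        if event.get("involvedObject", {}).get("name") != name:
            continue
        # `count` records how often the event repeated; it can be missing or null.
        reasons[event.get("reason", "Unknown")] += event.get("count") or 1
    return reasons
```

One way to use it is to save the `kubectl` JSON output to a file, pass its contents to `summarize_warnings(text, "analytics")`, and print `most_common()` on the result to see which warning reason dominates.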

Operator Metrics

The operator exposes metrics covering reconciliation, lifecycle, scaling, state, catalog, auto-suspend, wake TTL, KEDA, and node-pool signals.

Common examples:

xtrinode_reconcile_total{namespace="team-a",name="analytics"}
xtrinode_reconcile_errors_total{namespace="team-a",name="analytics"}
xtrinode_workers_current{namespace="team-a",name="analytics"}
xtrinode_workers_desired{namespace="team-a",name="analytics"}
xtrinode_state{namespace="team-a",name="analytics"}
xtrinode_catalog_sync_errors_total{namespace="team-a",name="analytics"}
xtrinode_nodepool_provision_failed_total{namespace="team-a",name="analytics"}

Useful dashboard panels:

Panel	Query shape
Runtime count by phase	`count by (phase) (xtrinode_state)`
Reconciliation error rate	`rate(xtrinode_reconcile_errors_total[5m])`
Worker count by runtime	`sum by (namespace,name) (xtrinode_workers_current)`
Slow reconciliation steps	`histogram_quantile(0.95, sum by (le) (rate(xtrinode_reconcile_step_duration_seconds_bucket[5m])))`
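The same query shapes can be run outside a dashboard through the Prometheus HTTP API (`/api/v1/query`). The sketch below is illustrative only: the base URL `http://prometheus:9090` is an assumption about your cluster, the helper names are hypothetical, and the parser assumes the `phase` label used by the first panel.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def runtimes_by_phase(payload: dict) -> dict:
    """Turn an instant-vector response into {phase: count}."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    counts = {}
    for sample in payload["data"]["result"]:
        phase = sample["metric"].get("phase", "unknown")
        # Each sample value is [timestamp, "<number as string>"].
        counts[phase] = int(float(sample["value"][1]))
    return counts


def fetch(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run the query; base_url is an assumption, e.g. http://prometheus:9090."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `runtimes_by_phase(fetch("http://prometheus:9090", "count by (phase) (xtrinode_state)"))` would return the runtime count per phase, provided that address is reachable from where the script runs.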

Gateway Metrics

The gateway owns the Trino-facing request path, so it is also the best place to observe routing and first-query demand.

xtrinode_gateway_inflight_queries{namespace="team-a",xtrinode="analytics"}
gateway_requests_total{routing_group="team-a--analytics"}
gateway_503_total{routing_group="team-a--analytics"}
gateway_auto_resume_total{routing_group="team-a--analytics"}
gateway_request_duration_seconds_bucket{routing_group="team-a--analytics"}

The xtrinode_gateway_inflight_queries metric is especially important for KEDA query scaling because it can exist before runtime workers are available. Prometheus scrape configuration can expose the runtime namespace under either the namespace label or the exported_namespace label, so the runtime KEDA default checks both label forms.
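To make the two label forms concrete, the sketch below builds a PromQL expression that matches the metric under either label and sums the result. It illustrates the idea only; it is not the operator's actual default query, and the function name `inflight_query` is hypothetical.

```python
def inflight_query(namespace: str, runtime: str) -> str:
    """Build a PromQL query matching both namespace label forms."""
    metric = "xtrinode_gateway_inflight_queries"
    selectors = [
        # Form used when Prometheus keeps the scraped label as exported_namespace.
        f'{metric}{{exported_namespace="{namespace}",xtrinode="{runtime}"}}',
        # Form used when the label is exposed directly as namespace.
        f'{metric}{{namespace="{namespace}",xtrinode="{runtime}"}}',
    ]
    # `or` returns the union of both selectors; sum() collapses it to one value.
    return "sum(" + " or ".join(selectors) + ")"
```

Running the resulting expression in the Prometheus UI for a known runtime is a quick way to confirm which label form your scrape configuration produces.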

ServiceMonitor

When Prometheus Operator is installed, enable ServiceMonitor resources for the components you want scraped. The operator and gateway charts use a top-level serviceMonitor key; the API server chart nests the same setting under metrics.serviceMonitor.

# operator and gateway
serviceMonitor:
  enabled: true

# api server
metrics:
  serviceMonitor:
    enabled: true

Prometheus must be able to reach the operator, gateway, and any runtime metrics endpoints used by KEDA triggers.
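One way to confirm that an endpoint is serving the expected series is to fetch its metrics page (for example through `kubectl port-forward`) and check the metric names in the text exposition. The helper names below are hypothetical, and the parser is a simplified sketch that only extracts metric names from the Prometheus text format.

```python
def metric_names(exposition: str) -> set:
    """Return the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or whitespace (value).
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(exposition: str, expected: list) -> list:
    """List expected metric names that the endpoint did not expose."""
    present = metric_names(exposition)
    return [name for name in expected if name not in present]
```

An empty list from `missing_metrics` means every expected name was present in that scrape; a metric that has not been observed yet may legitimately be absent, so treat a non-empty result as a prompt to investigate rather than proof of a fault.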

Alerting Starting Points

Start with alerts for reconciliation errors, error-state runtimes, pipeline step failures, auto-suspend drift, and catalog sync errors.

- alert: XTrinodeHighReconcileErrorRate
  expr: rate(xtrinode_reconcile_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical

- alert: XTrinodeStuckInErrorState
  expr: xtrinode_state{phase="Error"} == 4
  for: 15m
  labels:
    severity: critical

- alert: CatalogSyncErrors
  expr: rate(xtrinode_catalog_sync_errors_total[5m]) > 0
  for: 10m
  labels:
    severity: warning

For dashboards, keep separate views for runtime overview, reconciliation pipeline health, and auto-suspend or worker-scaling behavior. Those three views map directly to the main incident paths: platform health, operator progress, and runtime capacity.

Logs

The observability chart includes Vector for structured log collection. In the default local configuration, Vector can emit JSON logs to the console and expose its own internal metrics. Production deployments can route logs to systems such as Loki, object storage, or Elasticsearch.

Example Loki queries:

{namespace="xtrinode-system", log_source="xtrinode-operator"}
{namespace="xtrinode-system", level="error"}
{namespace="xtrinode-gateway"} | json | routing_group="team-a--analytics"
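When logs go only to the console, as in the local setup, the same kind of filtering can be done on the JSON lines directly. The sketch below assumes each log line is a JSON object with a `level` field, mirroring the `level` label in the Loki queries above; the actual field names depend on your Vector configuration, and the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def records_at_level(lines: Iterable[str], level: str = "error") -> Iterator[dict]:
    """Yield parsed JSON log records whose `level` field matches."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # Ignore lines that are not JSON, such as startup banners.
            continue
        if isinstance(record, dict) and str(record.get("level", "")).lower() == level:
            yield record
```

Passing an open file or `sys.stdin` as `lines` lets the function sit at the end of a `kubectl logs` pipeline.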

Runtime Health Checklist

kubectl get deployment -n xtrinode-system
kubectl get deployment -n xtrinode-gateway
kubectl get xtrinode analytics -n team-a -o yaml
kubectl get pods -n team-a
kubectl logs -n xtrinode-system -l app.kubernetes.io/name=xtrinode-operator --tail=200
kubectl logs -n xtrinode-gateway -l app.kubernetes.io/name=xtrinode-gateway --tail=200

For an incident, collect status YAML, recent events, operator logs, gateway logs, KEDA status, and any node-pool provider status before changing the runtime.
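The collection step can be scripted so that the same artifacts are captured every time. This is a minimal sketch under stated assumptions: it reuses the namespaces and label selectors from the checklist above, it assumes the KEDA ScaledObject lives in the runtime namespace, and the function names and output file names are hypothetical. Node-pool provider status is provider specific and is left out.

```python
import subprocess
from pathlib import Path


def incident_commands(namespace: str, runtime: str) -> dict:
    """Map an output file name to the kubectl command that fills it."""
    return {
        "status.yaml": ["kubectl", "get", "xtrinode", runtime, "-n", namespace, "-o", "yaml"],
        "events.txt": ["kubectl", "get", "events", "-n", namespace,
                       "--field-selector", f"involvedObject.name={runtime}"],
        "operator.log": ["kubectl", "logs", "-n", "xtrinode-system",
                         "-l", "app.kubernetes.io/name=xtrinode-operator", "--tail=500"],
        "gateway.log": ["kubectl", "logs", "-n", "xtrinode-gateway",
                        "-l", "app.kubernetes.io/name=xtrinode-gateway", "--tail=500"],
        # Assumption: the runtime's KEDA ScaledObject is in the runtime namespace.
        "keda.txt": ["kubectl", "get", "scaledobject", "-n", namespace, "-o", "wide"],
    }


def collect(namespace: str, runtime: str, out_dir: str) -> None:
    """Run each command and save stdout plus stderr; read-only against the cluster."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, argv in incident_commands(namespace, runtime).items():
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        (out / filename).write_text(result.stdout + result.stderr)
```

Calling `collect("team-a", "analytics", "incident-analytics")` writes one file per artifact. Every command is a `get` or `logs` call, so the script does not modify the runtime.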