Observability

XTrinode exposes observability through Kubernetes events, Prometheus metrics, gateway metrics, and structured logs.

The operator writes events for lifecycle, reconciliation, scaling, KEDA, catalog, node-pool, and cleanup actions.

kubectl get events --field-selector involvedObject.name=analytics -n team-a
kubectl get events --field-selector involvedObject.name=analytics,type=Warning -n team-a
kubectl get events --watch --field-selector involvedObject.name=analytics -n team-a

Events are the fastest way to understand why a runtime is stuck in Reconciling, Suspended, or Error.
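When a runtime is stuck, it usually helps to read its events in time order with only the fields that explain the transition. The sketch below wraps the commands above in a small helper. The function name `runtime_events` and the column layout are illustrative choices, not part of XTrinode.

```shell
# Print events for one runtime, oldest first, so the last lines show the
# most recent reason. Arguments: namespace, runtime name.
# runtime_events is a hypothetical helper name used only in this sketch.
runtime_events() {
  local ns="$1" name="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=${name}" \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns=LAST:.lastTimestamp,TYPE:.type,REASON:.reason,MESSAGE:.message
}
```

For example, `runtime_events team-a analytics` lists the events for the analytics runtime in team-a. Sorting uses the creation timestamp because it is always set; the LAST column can be empty for events that only carry an event time.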

The operator metrics surface includes reconciliation, lifecycle, scaling, state, catalog, auto-suspend, wake TTL, KEDA, and node-pool signals.

Common examples:

xtrinode_reconcile_total{namespace="team-a",name="analytics"}
xtrinode_reconcile_errors_total{namespace="team-a",name="analytics"}
xtrinode_workers_current{namespace="team-a",name="analytics"}
xtrinode_workers_desired{namespace="team-a",name="analytics"}
xtrinode_state{namespace="team-a",name="analytics"}
xtrinode_catalog_sync_errors_total{namespace="team-a",name="analytics"}
xtrinode_nodepool_provision_failed_total{namespace="team-a",name="analytics"}

Useful dashboard panels:

Panel | Query shape
Runtime count by phase | count by (phase) (xtrinode_state)
Reconciliation error rate | rate(xtrinode_reconcile_errors_total[5m])
Worker count by runtime | sum by (namespace,name) (xtrinode_workers_current)
Slow reconciliation steps | histogram_quantile(0.95, rate(xtrinode_reconcile_step_duration_seconds_bucket[5m]))

The gateway owns the Trino-facing request path, so it is also the best place to observe routing and first-query demand.

xtrinode_gateway_inflight_queries{namespace="team-a",xtrinode="analytics"}
gateway_requests_total{routing_group="team-a--analytics"}
gateway_503_total{routing_group="team-a--analytics"}
gateway_auto_resume_total{routing_group="team-a--analytics"}
gateway_request_duration_seconds_bucket{routing_group="team-a--analytics"}

The xtrinode_gateway_inflight_queries metric is especially important for KEDA query scaling because it can exist before runtime workers are available. Depending on how Prometheus scrape labeling is configured, the runtime namespace can appear under either the namespace label or the exported_namespace label, so the runtime KEDA default checks both label forms.
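To see which label form your Prometheus actually stores, it can help to query both at once. The helper below only builds a PromQL string that matches either form; it is one possible way to express "check both labels" and is not taken from the operator's own KEDA trigger. The function name `inflight_query` is hypothetical.

```shell
# Build a PromQL expression for in-flight gateway queries that matches the
# runtime namespace under either the namespace or exported_namespace label.
# Arguments: namespace, runtime name.
inflight_query() {
  local ns="$1" name="$2"
  printf 'sum(xtrinode_gateway_inflight_queries{namespace="%s",xtrinode="%s"} or xtrinode_gateway_inflight_queries{exported_namespace="%s",xtrinode="%s"})\n' \
    "$ns" "$name" "$ns" "$name"
}
```

Running `inflight_query team-a analytics` prints an expression you can paste into the Prometheus expression browser. If it returns no series while queries are in flight, the gateway is probably not being scraped, or the labels differ from both forms.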

When Prometheus Operator is installed, enable ServiceMonitor resources for the components you want scraped. Operator and gateway charts use serviceMonitor; the API server chart nests this under metrics.serviceMonitor.

# operator and gateway
serviceMonitor:
  enabled: true

# api server
metrics:
  serviceMonitor:
    enabled: true

Prometheus must be able to reach the operator, gateway, and any runtime metrics endpoints used by KEDA triggers.
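A quick way to check reachability is to ask Prometheus which scrape targets in the XTrinode namespaces report as down. The sketch below uses the standard Prometheus HTTP query API. The helper names, the PROM_URL variable, and its localhost default (for example, a port-forwarded Prometheus) are assumptions made for illustration.

```shell
# Build a PromQL expression that selects scrape targets reporting down
# in the namespaces matched by the given regex.
down_targets_query() {
  printf 'up{namespace=~"%s"} == 0\n' "$1"
}

# Run an instant query against the Prometheus HTTP API and print raw JSON.
# PROM_URL is an assumed variable; it defaults to a local port-forward.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query "$(down_targets_query 'xtrinode-system|xtrinode-gateway')"` lists the failing targets. An empty result is ambiguous: it can mean every target is up, or that Prometheus never discovered the targets, so confirm that the ServiceMonitor resources exist as well.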

Start with alerts for reconciliation errors, error-state runtimes, pipeline step failures, auto-suspend drift, and catalog sync errors.

- alert: XTrinodeHighReconcileErrorRate
  expr: rate(xtrinode_reconcile_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
- alert: XTrinodeStuckInErrorState
  expr: xtrinode_state{phase="Error"} == 4
  for: 15m
  labels:
    severity: critical
- alert: CatalogSyncErrors
  expr: rate(xtrinode_catalog_sync_errors_total[5m]) > 0
  for: 10m
  labels:
    severity: warning

For dashboards, keep separate views for runtime overview, reconciliation pipeline health, and auto-suspend or worker-scaling behavior. Those three views map directly to the main incident paths: platform health, operator progress, and runtime capacity.

The observability chart includes Vector for structured log collection. The default local path can emit JSON logs to console and expose Vector internal metrics. Production deployments can route logs to systems such as Loki, object storage, or Elasticsearch.

Example Loki queries:

{namespace="xtrinode-system", log_source="xtrinode-operator"}
{namespace="xtrinode-system", level="error"}
{namespace="xtrinode-gateway"} | json | routing_group="team-a--analytics"

Commands for a first look at component and runtime state:

kubectl get deployment -n xtrinode-system
kubectl get deployment -n xtrinode-gateway
kubectl get xtrinode analytics -n team-a -o yaml
kubectl get pods -n team-a
kubectl logs -n xtrinode-system -l app.kubernetes.io/name=xtrinode-operator
kubectl logs -n xtrinode-gateway -l app.kubernetes.io/name=xtrinode-gateway

For an incident, collect status YAML, recent events, operator logs, gateway logs, KEDA status, and any node-pool provider status before changing the runtime.
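The collection step above can be sketched as a small helper that writes each artifact to its own file before anyone touches the runtime. The function name, the output layout, and the use of `kubectl get scaledobject` for KEDA status are assumptions for this sketch. Node-pool provider status is provider-specific and is left as a manual step.

```shell
# Collect a read-only incident bundle for one runtime.
# Arguments: namespace, runtime name, optional output directory.
# collect_incident_bundle is a hypothetical helper, not an XTrinode command.
collect_incident_bundle() {
  local ns="$1" name="$2" out="${3:-./xtrinode-incident}"
  mkdir -p "$out"
  # Runtime status and recent events
  kubectl get xtrinode "$name" -n "$ns" -o yaml > "$out/status.yaml"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=${name}" > "$out/events.txt"
  kubectl get pods -n "$ns" -o wide > "$out/pods.txt"
  # With a label selector, kubectl logs returns only 10 lines per pod by
  # default, so request a larger tail explicitly.
  kubectl logs -n xtrinode-system \
    -l app.kubernetes.io/name=xtrinode-operator --tail=1000 > "$out/operator.log"
  kubectl logs -n xtrinode-gateway \
    -l app.kubernetes.io/name=xtrinode-gateway --tail=1000 > "$out/gateway.log"
  # Assumption: KEDA status lives in ScaledObjects in the runtime namespace.
  kubectl get scaledobject -n "$ns" -o yaml > "$out/keda.yaml"
}
```

For example, `collect_incident_bundle team-a analytics ./incident-analytics` writes the bundle to a local directory. Every command is read-only, so it is safe to run before deciding on a fix.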