Observability
XTrinode exposes observability through Kubernetes events, Prometheus metrics, gateway metrics, and structured logs.
Kubernetes Events
The operator writes events for lifecycle, reconciliation, scaling, KEDA, catalog, node-pool, and cleanup actions.

    kubectl get events --field-selector involvedObject.name=analytics -n team-a
    kubectl get events --field-selector involvedObject.name=analytics,type=Warning -n team-a
    kubectl get events --watch --field-selector involvedObject.name=analytics -n team-a

Events are the fastest way to understand why a runtime is stuck in Reconciling, Suspended, or Error.
Operator Metrics
The operator metrics surface includes reconciliation, lifecycle, scaling, state, catalog, auto-suspend, wake TTL, KEDA, and node-pool signals.
Common examples:

    xtrinode_reconcile_total{namespace="team-a",name="analytics"}
    xtrinode_reconcile_errors_total{namespace="team-a",name="analytics"}
    xtrinode_workers_current{namespace="team-a",name="analytics"}
    xtrinode_workers_desired{namespace="team-a",name="analytics"}
    xtrinode_state{namespace="team-a",name="analytics"}
    xtrinode_catalog_sync_errors_total{namespace="team-a",name="analytics"}
    xtrinode_nodepool_provision_failed_total{namespace="team-a",name="analytics"}

Useful dashboard panels:
| Panel | Query shape |
|---|---|
| Runtime count by phase | count by (phase) (xtrinode_state) |
| Reconciliation error rate | rate(xtrinode_reconcile_errors_total[5m]) |
| Worker count by runtime | sum by (namespace,name) (xtrinode_workers_current) |
| Slow reconciliation steps | histogram_quantile(0.95, sum by (le) (rate(xtrinode_reconcile_step_duration_seconds_bucket[5m]))) |
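If the panels above are evaluated often, recording rules can precompute the heavier queries. The following is a minimal sketch, assuming Prometheus Operator's PrometheusRule CRD is installed. The resource name, group name, and recorded series names are hypothetical, and the 5m rate window is only an illustrative choice.

```yaml
# Sketch only: resource, group, and recorded series names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xtrinode-dashboard-rules
  namespace: xtrinode-system
spec:
  groups:
    - name: xtrinode.dashboard
      interval: 1m
      rules:
        # Per-runtime reconciliation error rate over a 5m window.
        - record: xtrinode:reconcile_errors:rate5m
          expr: sum by (namespace, name) (rate(xtrinode_reconcile_errors_total[5m]))
        # p95 step duration; buckets are rated and aggregated by le first.
        - record: xtrinode:reconcile_step_duration_seconds:p95
          expr: histogram_quantile(0.95, sum by (le) (rate(xtrinode_reconcile_step_duration_seconds_bucket[5m])))
```

Dashboards can then query the recorded series directly instead of re-evaluating the histogram on every refresh.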
Gateway Metrics
The gateway owns the Trino-facing request path, so it is also the best place to observe routing and first-query demand.

    xtrinode_gateway_inflight_queries{namespace="team-a",xtrinode="analytics"}
    gateway_requests_total{routing_group="team-a--analytics"}
    gateway_503_total{routing_group="team-a--analytics"}
    gateway_auto_resume_total{routing_group="team-a--analytics"}
    gateway_request_duration_seconds_bucket{routing_group="team-a--analytics"}

The xtrinode_gateway_inflight_queries metric is especially important for KEDA query scaling because it can exist before runtime workers are available. Because Prometheus scrape labeling can rename the runtime namespace label, the runtime KEDA default checks both the exported_namespace and namespace label forms.
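To illustrate what checking both label forms can look like, here is a hedged sketch of a KEDA Prometheus trigger. It is not the operator's actual default. The ScaledObject name, the worker Deployment name, the Prometheus address, and the threshold are all hypothetical.

```yaml
# Sketch only: not the operator's generated trigger. Names, address, and threshold are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: analytics-query-scaling          # hypothetical
  namespace: team-a
spec:
  scaleTargetRef:
    name: analytics-worker               # hypothetical worker Deployment
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # hypothetical
        threshold: "5"                                          # illustrative value
        # "or" keeps whichever label form the scrape configuration produced.
        query: |
          sum(
            xtrinode_gateway_inflight_queries{exported_namespace="team-a",xtrinode="analytics"}
            or
            xtrinode_gateway_inflight_queries{namespace="team-a",xtrinode="analytics"}
          )
```

When Prometheus does not honor scraped labels, the runtime namespace typically lands in exported_namespace and namespace holds the scrape target's namespace, so the first selector matches. When scraped labels are honored, the second selector matches instead.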
ServiceMonitor
When Prometheus Operator is installed, enable ServiceMonitor resources for the
components you want scraped. Operator and gateway charts use serviceMonitor;
the API server chart nests this under metrics.serviceMonitor.

    # operator and gateway
    serviceMonitor:
      enabled: true

    # api server
    metrics:
      serviceMonitor:
        enabled: true

Prometheus must be able to reach the operator, gateway, and any runtime metrics endpoints used by KEDA triggers.
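For orientation, a ServiceMonitor for the operator might look roughly like the sketch below. This is an assumption about shape, not the chart's rendered output. The port name, metrics path, and scrape interval are hypothetical, and the selector label is borrowed from the operator's app.kubernetes.io/name label.

```yaml
# Sketch only: the chart's actual ServiceMonitor may differ.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: xtrinode-operator
  namespace: xtrinode-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: xtrinode-operator
  endpoints:
    - port: metrics      # assumed Service port name
      path: /metrics     # assumed metrics path
      interval: 30s      # illustrative scrape interval
```

If Prometheus is configured with a serviceMonitorSelector, the resource also needs whatever labels that selector expects.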
Alerting Starting Points
Start with alerts for reconciliation errors, error-state runtimes, pipeline step failures, auto-suspend drift, and catalog sync errors.

    - alert: XTrinodeHighReconcileErrorRate
      expr: rate(xtrinode_reconcile_errors_total[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
    - alert: XTrinodeStuckInErrorState
      expr: xtrinode_state{phase="Error"} == 4
      for: 15m
      labels:
        severity: critical
    - alert: CatalogSyncErrors
      expr: rate(xtrinode_catalog_sync_errors_total[5m]) > 0
      for: 10m
      labels:
        severity: warning

For dashboards, keep separate views for runtime overview, reconciliation pipeline health, and auto-suspend or worker-scaling behavior. Those three views map directly to the main incident paths: platform health, operator progress, and runtime capacity.
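The alert rules in this section are bare rule entries. With Prometheus Operator they would typically be wrapped in a PrometheusRule resource. The sketch below wraps the first alert; the resource name, group name, and annotation text are hypothetical.

```yaml
# Sketch only: resource name, group name, and annotations are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: xtrinode-alerts
  namespace: xtrinode-system
spec:
  groups:
    - name: xtrinode.alerts
      rules:
        - alert: XTrinodeHighReconcileErrorRate
          expr: rate(xtrinode_reconcile_errors_total[5m]) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            # Labels on the firing series are available through $labels.
            summary: "Reconcile errors for {{ $labels.namespace }}/{{ $labels.name }}"
```

The remaining alerts can be added as further entries under the same rules list.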
The observability chart includes Vector for structured log collection. The default local path can emit JSON logs to console and expose Vector internal metrics. Production deployments can route logs to systems such as Loki, object storage, or Elasticsearch.
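As a rough illustration of routing logs to Loki, a Vector pipeline could look like the sketch below. It is not the chart's shipped configuration. The component names and the Loki endpoint are hypothetical, and only a namespace label is mapped here.

```yaml
# Sketch only: component names and endpoint are assumptions.
sources:
  k8s_logs:
    type: kubernetes_logs          # collects pod logs on the node
sinks:
  loki:
    type: loki
    inputs: [k8s_logs]
    endpoint: http://loki.monitoring.svc:3100   # hypothetical
    encoding:
      codec: json                  # keep events as structured JSON
    labels:
      namespace: "{{ kubernetes.pod_namespace }}"
```

Labels such as log_source or level would need additional mapping or parsing in the pipeline before they can be used as Loki stream selectors.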
Example Loki queries:

    {namespace="xtrinode-system", log_source="xtrinode-operator"}
    {namespace="xtrinode-system", level="error"}
    {namespace="xtrinode-gateway"} | json | routing_group="team-a--analytics"

Runtime Health Checklist

    kubectl get deployment -n xtrinode-system
    kubectl get deployment -n xtrinode-gateway
    kubectl get xtrinode analytics -n team-a -o yaml
    kubectl get pods -n team-a
    kubectl logs -n xtrinode-system -l app.kubernetes.io/name=xtrinode-operator
    kubectl logs -n xtrinode-gateway -l app.kubernetes.io/name=xtrinode-gateway

For an incident, collect status YAML, recent events, operator logs, gateway logs, KEDA status, and any node-pool provider status before changing the runtime.