Distributed Compute

Roll out remote workers without moving identity, policy, or governance out of the control plane.

This runbook describes how to roll out remote compute without weakening Duck’s security and governance model.

Architecture Boundaries

  • the gateway remains the single policy enforcement point
  • workers execute already-rewritten SQL
  • gateway-to-worker transport uses internal gRPC
  • storage, auth, and governance metadata remain anchored in the control plane
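
The boundary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Principal`, `Gateway`, `rewrite`, and `send_to_worker` are hypothetical names, not Duck's API, and the fake rewrite merely tags the query instead of applying real policy. What it shows is the shape of the contract: the gateway holds identity and policy, and the worker call accepts rewritten SQL text and nothing else.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Principal:
    """Identity resolved by the control plane (hypothetical shape)."""
    user: str


@dataclass
class Gateway:
    # Policy enforcement hook: turns caller SQL into governed SQL (hypothetical).
    rewrite: Callable[[str, Principal], str]
    # Stand-in for the internal gRPC call; it takes SQL text and nothing else.
    send_to_worker: Callable[[str], List[tuple]]

    def execute(self, sql: str, principal: Principal) -> List[tuple]:
        governed_sql = self.rewrite(sql, principal)
        # Identity and policy stay here; only rewritten SQL crosses the boundary.
        return self.send_to_worker(governed_sql)


# Demo wiring with fakes so the flow can be observed end to end.
received: List[str] = []


def fake_rewrite(sql: str, principal: Principal) -> str:
    # Placeholder rewrite: tags the query so the demo shows it was processed.
    return f"/* governed for {principal.user} */ {sql}"


def fake_worker(sql: str) -> List[tuple]:
    # Records what the worker was handed and returns a canned row.
    received.append(sql)
    return [(1,)]


gateway = Gateway(rewrite=fake_rewrite, send_to_worker=fake_worker)
rows = gateway.execute("SELECT 1", Principal(user="alice"))
```

Because the worker signature has no parameter for identity, a worker in this sketch cannot make a policy decision even by accident.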

When to Use Remote Compute

  • you need worker isolation or a separate execution fleet
  • you want asynchronous execution with a managed result lifecycle (queued submission, result retention, cleanup)
  • you need a staged rollout with local fallback
  • query or orchestration load makes local-only execution an operational bottleneck

Admin Checklist

  • confirm the gateway feature flags match the intended rollout
  • set worker auth and listen addresses explicitly
  • start with fallback enabled on assignments
  • canary a small set of users or groups before widening traffic
  • monitor queue latency and failure reasons before widening scope
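
Part of this checklist can be automated as a preflight check. The sketch below is a minimal example, not a shipped tool: the setting names come from this runbook, but the boolean format (`true`, `1`, `on`, `yes`), the rule that remote routing needs the internal gRPC flag, and the `preflight` function itself are assumptions made for illustration.

```python
from typing import Dict, List

# Worker settings the checklist says to set explicitly.
WORKER_REQUIRED = ("AGENT_TOKEN", "LISTEN_ADDR", "GRPC_LISTEN_ADDR")


def _is_on(value: str) -> bool:
    # Assumed boolean encoding; adjust to whatever the deployment really uses.
    return (value or "").strip().lower() in {"1", "true", "yes", "on"}


def preflight(worker_env: Dict[str, str], gateway_env: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: List[str] = []

    for name in WORKER_REQUIRED:
        if not worker_env.get(name, "").strip():
            problems.append(f"worker: {name} is not set explicitly")

    if _is_on(gateway_env.get("FEATURE_REMOTE_ROUTING", "")):
        # Assumption: routing to workers relies on the internal gRPC transport.
        if not _is_on(gateway_env.get("FEATURE_INTERNAL_GRPC", "")):
            problems.append(
                "gateway: FEATURE_REMOTE_ROUTING is on but FEATURE_INTERNAL_GRPC is off"
            )
        if not gateway_env.get("REMOTE_CANARY_USERS", "").strip():
            problems.append(
                "gateway: REMOTE_CANARY_USERS is empty, so rollout is not limited"
            )

    return problems
```

Running it against the environment of each process before a rollout gives a short list of findings to clear, for example `preflight(worker_env, gateway_env)` in a deploy script that fails when the list is non-empty.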

Remote Compute Settings

Setting                 Applies To  Why It Matters
AGENT_TOKEN             Worker      Authenticates the worker to the control plane
LISTEN_ADDR             Worker      Binds the worker’s public listener correctly
GRPC_LISTEN_ADDR        Worker      Exposes the internal gRPC path for execution traffic
MAX_MEMORY_GB           Worker      Caps worker memory for safer isolation
QUERY_RESULT_TTL        Worker      Controls how long async results stay available
QUERY_CLEANUP_INTERVAL  Worker      Governs lifecycle cleanup pressure
FEATURE_REMOTE_ROUTING  Gateway     Enables routing work to remote workers
FEATURE_ASYNC_QUEUE     Gateway     Turns on queued async execution behavior
FEATURE_CURSOR_MODE     Gateway     Affects cursor-style remote result handling
FEATURE_INTERNAL_GRPC   Gateway     Enables the internal transport to workers
REMOTE_CANARY_USERS     Gateway     Limits early rollout to a known audience
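
To make the interaction between the worker's memory and retention settings concrete, here is a small Python sketch that loads them and estimates how long a result might linger. The value formats are assumptions (a numeric `MAX_MEMORY_GB`, and both `QUERY_RESULT_TTL` and `QUERY_CLEANUP_INTERVAL` as whole seconds), as is the periodic-sweep model behind `worst_case_retention_s`; `WorkerLimits` and `load_worker_limits` are hypothetical helpers, so check the real formats before reusing any of this.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerLimits:
    max_memory_gb: float
    query_result_ttl_s: int
    query_cleanup_interval_s: int

    @property
    def worst_case_retention_s(self) -> int:
        # Assumed model: cleanup is a periodic sweep, so a result can outlive
        # its TTL by up to one full cleanup interval before it is removed.
        return self.query_result_ttl_s + self.query_cleanup_interval_s


def load_worker_limits(env: Mapping[str, str]) -> WorkerLimits:
    """Parse and sanity-check the worker limits from an environment mapping."""
    limits = WorkerLimits(
        max_memory_gb=float(env["MAX_MEMORY_GB"]),
        query_result_ttl_s=int(env["QUERY_RESULT_TTL"]),
        query_cleanup_interval_s=int(env["QUERY_CLEANUP_INTERVAL"]),
    )
    if limits.max_memory_gb <= 0:
        raise ValueError("MAX_MEMORY_GB must be positive")
    if limits.query_result_ttl_s <= 0 or limits.query_cleanup_interval_s <= 0:
        raise ValueError("retention settings must be positive")
    return limits


example = load_worker_limits(
    {"MAX_MEMORY_GB": "8", "QUERY_RESULT_TTL": "600", "QUERY_CLEANUP_INTERVAL": "60"}
)
```

Under the assumed sweep model, the example values keep a finished result for at most 660 seconds, which is the number to weigh against `MAX_MEMORY_GB` when sizing retention.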

Health and Failure Handling

  • monitor GET /health and GET /metrics
  • expect fallback behavior when worker health degrades and assignments allow local execution
  • use the retention settings (QUERY_RESULT_TTL, QUERY_CLEANUP_INTERVAL) to limit how many async results are held in memory
  • document when fallback should happen automatically and when it should be disabled
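
The fallback decision described above can be written down as a small, testable function alongside a health probe. This is a sketch under assumptions: it treats an HTTP 200 from `GET /health` as healthy (the response format is not specified here), and `worker_is_healthy`, `choose_target`, and the `"remote"`/`"local"`/`"reject"` labels are hypothetical names chosen for illustration.

```python
import urllib.request


def worker_is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Probe GET /health; any error, timeout, or non-200 status counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def choose_target(healthy: bool, fallback_enabled: bool) -> str:
    """Decide where a query runs given worker health and the assignment's fallback flag."""
    if healthy:
        return "remote"
    if fallback_enabled:
        # Degraded worker, but the assignment allows local execution.
        return "local"
    # Fallback disabled: fail the request rather than silently run it locally.
    return "reject"
```

Keeping the decision separate from the probe makes the documented policy easy to review: the three branches of `choose_target` are exactly the cases an operator has to sign off on.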

Rollout Sequence

  1. enable remote support with local fallback
  2. route a limited audience
  3. observe queue latency and completion behavior
  4. widen scope only after the limited audience has completed a representative workload successfully
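
Steps 2 through 4 can be expressed as a canary router plus a gate for widening. The following Python sketch is illustrative only: it assumes `REMOTE_CANARY_USERS` is a comma-separated list, and the thresholds (`min_samples`, `min_success_rate`, `max_p95_queue_latency_s`) are placeholder values to be replaced with targets the team actually agrees on. None of these function names come from Duck.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Sequence


def parse_canary_users(raw: str) -> FrozenSet[str]:
    # Assumed format: comma-separated user names, whitespace ignored.
    return frozenset(u.strip() for u in raw.split(",") if u.strip())


def route(user: str, canary_users: FrozenSet[str], remote_enabled: bool) -> str:
    """Step 2: only canary users go remote; everyone else stays local."""
    return "remote" if remote_enabled and user in canary_users else "local"


@dataclass(frozen=True)
class QueryOutcome:
    queue_latency_s: float
    succeeded: bool


def ready_to_widen(
    outcomes: Sequence[QueryOutcome],
    min_samples: int = 200,                 # placeholder threshold
    min_success_rate: float = 0.99,         # placeholder threshold
    max_p95_queue_latency_s: float = 2.0,   # placeholder threshold
) -> bool:
    """Steps 3 and 4: widen only when enough canary traffic looks good."""
    if len(outcomes) < min_samples:
        return False  # not yet a representative sample
    success_rate = sum(o.succeeded for o in outcomes) / len(outcomes)
    latencies = sorted(o.queue_latency_s for o in outcomes)
    # Nearest-rank 95th percentile of queue latency.
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    return success_rate >= min_success_rate and p95 <= max_p95_queue_latency_s
```

A gate like this turns "representative success" into an explicit, reviewable rule, which also makes it clear when a rollout should pause rather than widen.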
