Distributed Deployment Guide
Adora supports deploying dataflows across multiple machines for multi-robot fleets, edge AI pipelines, and distributed robotics systems. This guide covers cluster management, node scheduling, binary distribution, auto-recovery, and operational best practices.
Table of Contents
- Overview
- Quick Start
- Features at a Glance
- Cluster Configuration Reference
- Cluster Commands Reference
- Node Scheduling
- Binary Distribution
- systemd Service Management
- Auto-Recovery
- Rolling Upgrade
- Use Cases
- Operations Runbook
- Deployment YAML Reference
- Best Practices
Overview
Adora’s distributed architecture has three tiers:
CLI --> Coordinator --> Daemon(s) --> Nodes / Operators
        (one)           (per machine)  (user code)
- CLI sends control commands (build, start, stop) to the coordinator.
- Coordinator orchestrates daemons, resolves node placement, and manages dataflow lifecycle.
- Daemons run on each machine, spawning and supervising node processes.
- Nodes communicate via shared memory (same machine) or Zenoh pub-sub (cross-machine).
There are two paths to distributed deployment:
Ad-hoc – manually start adora daemon on each machine, then use the coordinator for control. Good for development and testing. See Distributed Deployments in the CLI reference.
Managed (cluster.yml) – define your cluster topology in a YAML file, then use adora cluster commands for SSH-based lifecycle management. This guide focuses on the managed path.
Quick Start
- Create a cluster.yml:
coordinator:
addr: 10.0.0.1
machines:
- id: robot
host: 10.0.0.2
user: ubuntu
- id: gpu-server
host: 10.0.0.3
user: ubuntu
- Bring up the cluster:
adora cluster up cluster.yml
- Start a dataflow:
adora start dataflow.yml --name my-app --attach
- Check cluster health:
adora cluster status
- Tear down:
adora cluster down
Features at a Glance
| Feature | Command / Config | Description |
|---|---|---|
| Cluster lifecycle | adora cluster up/status/down | SSH-based daemon management from a single machine |
| Label scheduling | _unstable_deploy.labels | Route nodes to daemons by key-value labels |
| Binary distribution | _unstable_deploy.distribute | local, scp, or http strategies |
| systemd services | adora cluster install/uninstall | Persistent daemon services that survive reboots |
| Auto-recovery | Automatic | Re-spawn nodes when a daemon reconnects |
| Rolling upgrade | adora cluster upgrade | SCP binary and restart daemons, one machine at a time |
| Dataflow restart | adora cluster restart | Restart a running dataflow by name or UUID |
Cluster Configuration Reference
A cluster.yml file defines the coordinator address and the set of machines in the cluster.
Full Schema
coordinator:
addr: 10.0.0.1 # IP address the coordinator binds to (required)
port: 6013 # WebSocket port (default: 6013)
machines:
- id: edge-01 # Unique machine identifier (required)
host: 10.0.0.2 # SSH-reachable hostname or IP (required)
user: ubuntu # SSH user (optional, defaults to current user)
labels: # Key-value labels for scheduling (optional)
gpu: "true"
arch: arm64
- id: edge-02
host: 10.0.0.3
labels:
arch: arm64
Fields
coordinator
| Field | Type | Default | Description |
|---|---|---|---|
| addr | IP address | (required) | Address the coordinator binds to |
| port | u16 | 6013 | WebSocket port |
machines[]
| Field | Type | Default | Description |
|---|---|---|---|
| id | string | (required) | Unique machine identifier, used in _unstable_deploy.machine |
| host | string | (required) | SSH-reachable hostname or IP address |
| user | string | current user | SSH username |
| labels | map | empty | Key-value pairs for label-based scheduling |
Validation Rules
- At least one machine must be defined.
- Machine IDs must be non-empty and unique.
- Machine hosts must be non-empty.
- Unknown fields are rejected (deny_unknown_fields).
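The validation rules above can be sketched in Rust. This is an illustrative sketch, not the actual adora source: the type and field names are assumptions, and in practice the unknown-field rejection comes from serde's deny_unknown_fields attribute during deserialization.

```rust
use std::collections::{BTreeMap, HashSet};

/// One machine entry from cluster.yml (field names are illustrative).
#[derive(Debug, Clone)]
pub struct Machine {
    pub id: String,
    pub host: String,
    pub user: Option<String>,
    pub labels: BTreeMap<String, String>,
}

/// Enforce the rules listed above: at least one machine,
/// unique non-empty ids, non-empty hosts.
pub fn validate(machines: &[Machine]) -> Result<(), String> {
    if machines.is_empty() {
        return Err("at least one machine must be defined".into());
    }
    let mut seen = HashSet::new();
    for m in machines {
        if m.id.is_empty() {
            return Err("machine id must be non-empty".into());
        }
        if !seen.insert(m.id.as_str()) {
            return Err(format!("duplicate machine id: {}", m.id));
        }
        if m.host.is_empty() {
            return Err(format!("machine \"{}\" has an empty host", m.id));
        }
    }
    Ok(())
}
```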
Example: 3-Machine GPU Cluster
coordinator:
addr: 192.168.1.1
machines:
- id: coordinator-host
host: 192.168.1.1
labels:
role: control
- id: gpu-a100
host: 192.168.1.10
user: ml
labels:
gpu: a100
arch: x86_64
- id: jetson-01
host: 192.168.1.20
user: nvidia
labels:
gpu: jetson
arch: arm64
Cluster Commands Reference
All adora cluster commands operate on a cluster.yml file and use SSH to manage remote machines.
SSH options used: BatchMode=yes, ConnectTimeout=10, StrictHostKeyChecking=accept-new.
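An invocation with these options might be assembled like this (a hypothetical helper for illustration; the real CLI internals may differ):

```rust
use std::process::Command;

/// Build (but do not run) an ssh invocation with the options
/// listed above. Sketch only; not the actual adora code.
fn ssh_command(user: &str, host: &str, remote_cmd: &str) -> Command {
    let mut cmd = Command::new("ssh");
    cmd.args(["-o", "BatchMode=yes"])
        .args(["-o", "ConnectTimeout=10"])
        .args(["-o", "StrictHostKeyChecking=accept-new"])
        .arg(format!("{user}@{host}"))
        .arg(remote_cmd);
    cmd
}
```

BatchMode=yes makes SSH fail fast instead of prompting for a password, which is why key-based auth must be set up first (see the Operations Runbook).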
adora cluster up
Bring up a multi-machine cluster from a cluster.yml file. Starts the coordinator locally, then SSH-es into each machine to start a daemon.
adora cluster up <PATH>
Arguments:
| Argument | Description |
|---|---|
| PATH | Path to the cluster configuration file |
Behavior:
- Loads and validates the cluster config.
- Starts the coordinator locally on addr:port.
- For each machine, SSH-es in and runs nohup adora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> [--labels k1=v1,k2=v2] --quiet.
- Polls until all expected daemons register with the coordinator (30s timeout).
Example:
$ adora cluster up cluster.yml
Starting coordinator on 10.0.0.1:6013...
Starting daemon on robot (ubuntu@10.0.0.2)... OK
Starting daemon on gpu-server (ubuntu@10.0.0.3)... OK
All 2 daemons connected.
adora cluster status
Show the current status of the cluster. Displays connected daemons and active dataflow count.
adora cluster status [--coordinator-addr ADDR] [--coordinator-port PORT]
Flags:
| Flag | Default | Description |
|---|---|---|
| --coordinator-addr | localhost | Coordinator hostname or IP |
| --coordinator-port | 6013 | Coordinator WebSocket port |
Example:
$ adora cluster status
DAEMON ID LAST HEARTBEAT
robot 2s ago
gpu-server 1s ago
Active dataflows: 1
adora cluster down
Tear down the cluster (coordinator and all daemons).
adora cluster down [--coordinator-addr ADDR] [--coordinator-port PORT]
Terminates all daemons and the coordinator process.
adora cluster install
Install adora-daemon as a systemd service on each machine. SSH-es into each machine, writes a systemd unit file, and enables the service.
adora cluster install <PATH>
Arguments:
| Argument | Description |
|---|---|
| PATH | Path to the cluster configuration file |
Behavior:
For each machine, creates and enables a systemd service named adora-daemon-<id>. The unit file:
[Unit]
Description=Adora Daemon (<id>)
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=adora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> --labels k1=v1,k2=v2 --quiet
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Example:
$ adora cluster install cluster.yml
Installing adora-daemon-robot on ubuntu@10.0.0.2... OK
Installing adora-daemon-gpu-server on ubuntu@10.0.0.3... OK
2/2 succeeded.
adora cluster uninstall
Uninstall adora-daemon systemd services from each machine. Stops, disables, and removes the systemd unit.
adora cluster uninstall <PATH>
Behavior:
For each machine, runs:
sudo systemctl stop adora-daemon-<id>
sudo systemctl disable adora-daemon-<id>
sudo rm -f /etc/systemd/system/adora-daemon-<id>.service
sudo systemctl daemon-reload
adora cluster upgrade
Rolling upgrade: SCP the local adora binary to each machine and restart daemons. Processes machines sequentially to maintain availability.
adora cluster upgrade <PATH>
Behavior:
For each machine sequentially:
- SCP the local adora binary to /usr/local/bin/adora on the target machine.
- Restart the systemd service via sudo systemctl restart adora-daemon-<id>.
- Poll the coordinator until the daemon reconnects (30s timeout, 500ms intervals).
Nodes on other machines continue running while each machine is being upgraded.
Example:
$ adora cluster upgrade cluster.yml
Upgrading robot (ubuntu@10.0.0.2)...
SCP binary... OK
Restart service... OK
Waiting for reconnect... OK (3.2s)
Upgrading gpu-server (ubuntu@10.0.0.3)...
SCP binary... OK
Restart service... OK
Waiting for reconnect... OK (2.8s)
2/2 succeeded.
adora cluster restart
Restart a running dataflow by name or UUID. Stops the dataflow and immediately restarts it using the stored descriptor (no YAML path needed).
adora cluster restart <DATAFLOW>
Arguments:
| Argument | Description |
|---|---|
| DATAFLOW | Name or UUID of the dataflow to restart |
Example:
$ adora cluster restart my-app
Restarting dataflow `my-app`
dataflow restarted: a1b2c3d4-... -> e5f6a7b8-...
Node Scheduling
When the coordinator receives a dataflow, it decides which daemon runs each node based on the _unstable_deploy section in the dataflow YAML. Resolution priority: machine > labels > unnamed.
Machine-based scheduling
Assign a node to a specific machine by its id from cluster.yml:
nodes:
- id: camera
_unstable_deploy:
machine: robot
path: ./camera-driver
outputs:
- frames
The coordinator looks up the daemon whose machine-id matches. If no matching daemon is connected, the deployment fails with: no matching daemon for machine id "robot".
Label-based scheduling
Assign a node by requiring specific labels on the target daemon:
nodes:
- id: inference
_unstable_deploy:
labels:
gpu: "true"
path: ./ml-model
inputs:
frames: camera/frames
outputs:
- predictions
The coordinator finds the first connected daemon whose labels are a superset of the required labels. All required key-value pairs must match exactly. If no daemon satisfies the requirements, deployment fails with: no daemon matches labels {"gpu": "true"}.
Unassigned nodes
Nodes without an _unstable_deploy section (or with an empty one) are assigned to the first unnamed daemon – one that connected without a --machine-id flag.
How resolve_daemon() works internally
The coordinator resolves node placement in coordinator/run/mod.rs:
resolve_daemon(connections, deploy) -> DaemonId
1. If deploy.machine is Some(id):
-> look up daemon by machine-id
2. Else if deploy.labels is non-empty:
-> find first daemon where all required labels match
3. Else:
-> pick first unnamed daemon
The label matching function iterates over all connected daemons and checks that every required key-value pair exists in the daemon’s label set (conn.labels.get(k) == Some(v)). This is a superset check: a daemon with {gpu: "true", arch: "arm64", role: "edge"} satisfies the requirement {gpu: "true"}.
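That superset check can be sketched as follows (illustrative; the function and parameter names are assumptions, not the actual adora source):

```rust
use std::collections::BTreeMap;

/// Superset check: every required key-value pair must be present
/// in the daemon's label set with an exactly matching value.
fn labels_match(
    daemon_labels: &BTreeMap<String, String>,
    required: &BTreeMap<String, String>,
) -> bool {
    required.iter().all(|(k, v)| daemon_labels.get(k) == Some(v))
}
```

An empty requirement matches any daemon, which is why label requirements should be specific enough to exclude machines that cannot run the node.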
Binary Distribution
Control how node binaries are delivered to remote daemons via the distribute field.
Local (default)
Each daemon builds from source on its own machine. This is the current default behavior.
nodes:
- id: my-node
_unstable_deploy:
machine: edge-01
distribute: local
path: ./my-node
SCP mode
The CLI pushes the locally-built binary to the target machine via SSH/SCP before spawning.
nodes:
- id: my-node
_unstable_deploy:
machine: edge-01
distribute: scp
path: ./my-node
HTTP mode
The coordinator runs an artifact store. Daemons pull binaries from the coordinator via HTTP before spawning.
nodes:
- id: my-node
_unstable_deploy:
machine: edge-01
distribute: http
path: ./my-node
Artifacts are served from GET /api/artifacts/{build_id}/{node_id} on the coordinator’s WebSocket port. The endpoint requires authentication (Bearer token) and sanitizes node IDs to prevent path traversal.
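One way such node-ID sanitization might look is an allowlist of safe characters, which rules out path separators and ".." entirely (a sketch under that assumption, not the actual endpoint code):

```rust
/// Accept only node ids made of ASCII alphanumerics, '-' and '_',
/// so an id can never contain path separators or "..".
fn sanitize_node_id(id: &str) -> Option<&str> {
    let ok = !id.is_empty()
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    ok.then_some(id)
}
```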
When to use each strategy
| Strategy | Best for | Tradeoffs |
|---|---|---|
| local | Homogeneous clusters, CI builds | Requires build toolchain on every machine |
| scp | Heterogeneous clusters, cross-compiled binaries | Requires SSH access from CLI to all machines |
| http | Air-gapped daemons, firewalled networks | Requires coordinator reachability from all daemons |
systemd Service Management
For production deployments, install daemons as systemd services so they survive reboots and auto-restart on failure.
Install
adora cluster install cluster.yml
Creates a systemd unit file on each machine (see adora cluster install for the full unit template). Key properties:
- Restart=on-failure with RestartSec=5: daemon auto-restarts if it crashes.
- After=network-online.target: waits for network before starting.
- WantedBy=multi-user.target: starts on boot.
Uninstall
adora cluster uninstall cluster.yml
Stops, disables, and removes the unit file from each machine, then reloads the systemd daemon.
Verifying service status
After install, check services directly:
ssh ubuntu@10.0.0.2 sudo systemctl status adora-daemon-robot
Auto-Recovery
When a daemon disconnects and reconnects (e.g., after a network blip, machine reboot, or service restart), the coordinator automatically re-spawns any missing dataflows on that daemon.
How it works
- The daemon reconnects and sends a StatusReport listing its currently running dataflows.
- The coordinator compares the report against its expected state (dataflows that should have nodes on this daemon).
- For each running dataflow with nodes assigned to this daemon that the daemon did not report, the coordinator sends a SpawnDataflowNodes command to re-spawn the missing nodes.
30-second backoff
To prevent crash loops (e.g., a node that immediately crashes on spawn), recovery uses a per-daemon, per-dataflow backoff:
- After a recovery attempt, the coordinator records the timestamp.
- Subsequent recovery for the same daemon/dataflow pair is skipped until 30 seconds have elapsed.
- The backoff clears when the daemon reports the dataflow as running again.
This means a node that crashes immediately will only be re-spawned once every 30 seconds, not in a tight loop.
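The backoff bookkeeping described above might look like this (a sketch under the stated 30-second rule; type and method names are illustrative, not the actual adora source):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const BACKOFF: Duration = Duration::from_secs(30);

/// Per-daemon, per-dataflow recovery throttle.
struct RecoveryTracker {
    last_attempt: HashMap<(String, String), Instant>,
}

impl RecoveryTracker {
    fn new() -> Self {
        Self { last_attempt: HashMap::new() }
    }

    /// Returns true (and records the attempt) if recovery may run
    /// now for this daemon/dataflow pair; false while backing off.
    fn try_recover(&mut self, daemon: &str, dataflow: &str) -> bool {
        let key = (daemon.to_string(), dataflow.to_string());
        let now = Instant::now();
        match self.last_attempt.get(&key) {
            Some(t) if now.duration_since(*t) < BACKOFF => false,
            _ => {
                self.last_attempt.insert(key, now);
                true
            }
        }
    }

    /// Clear the backoff once the daemon reports the dataflow running.
    fn mark_running(&mut self, daemon: &str, dataflow: &str) {
        self.last_attempt
            .remove(&(daemon.to_string(), dataflow.to_string()));
    }
}
```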
Limitations
- Auto-recovery only applies to dataflows started via adora start (coordinator-managed). Local adora run dataflows are not tracked by the coordinator.
- Recovery re-spawns all nodes assigned to the reconnecting daemon, not individual nodes. For per-node restart on crash, use restart policies.
Rolling Upgrade
Upgrade the adora binary on all cluster machines with zero downtime using sequential per-machine upgrades.
Process
adora cluster upgrade cluster.yml
For each machine, sequentially:
- SCP the local adora binary to /usr/local/bin/adora on the target.
- Restart the systemd service (systemctl restart adora-daemon-<id>).
- Poll the coordinator until the daemon reconnects (30s timeout).
Because machines are upgraded one at a time, nodes on other machines continue running. After the daemon reconnects, auto-recovery re-spawns any dataflow nodes that were running on that machine.
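The reconnect wait can be sketched as a simple poll loop (illustrative only; the real implementation queries the coordinator for daemon registration):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `is_connected` every 500ms until it returns true or the
/// 30s timeout elapses. Returns whether the daemon reconnected.
fn wait_for_reconnect(mut is_connected: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + Duration::from_secs(30);
    while Instant::now() < deadline {
        if is_connected() {
            return true;
        }
        sleep(Duration::from_millis(500));
    }
    false
}
```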
Prerequisites
- Daemons must be installed as systemd services (adora cluster install).
- The local adora binary must be compatible with the cluster’s coordinator version.
- SSH access with sudo permissions on all target machines.
Use Cases
1. Edge AI Pipeline (Robot + GPU Server)
A camera node runs on the robot, sends frames to a GPU server for inference, and results flow back to an actuator on the robot.
cluster.yml:
coordinator:
addr: 192.168.1.1
machines:
- id: robot
host: 192.168.1.10
user: ubuntu
labels:
role: edge
- id: gpu-server
host: 192.168.1.20
user: ml
labels:
gpu: "true"
dataflow.yml:
nodes:
- id: camera
_unstable_deploy:
machine: robot
path: ./camera-driver
outputs:
- frames
- id: inference
_unstable_deploy:
labels:
gpu: "true"
path: ./ml-model
inputs:
frames: camera/frames
outputs:
- predictions
- id: actuator
_unstable_deploy:
machine: robot
path: ./actuator-driver
inputs:
commands: inference/predictions
2. Multi-Robot Fleet
A central coordinator manages N robots with heterogeneous hardware. Label scheduling routes nodes to the right machines without hardcoding machine IDs.
cluster.yml:
coordinator:
addr: 10.0.0.1
machines:
- id: bot-01
host: 10.0.0.11
user: robot
labels:
fleet: warehouse
lidar: "true"
- id: bot-02
host: 10.0.0.12
user: robot
labels:
fleet: warehouse
camera: rgbd
- id: bot-03
host: 10.0.0.13
user: robot
labels:
fleet: warehouse
lidar: "true"
camera: rgbd
dataflow.yml:
nodes:
- id: lidar-driver
_unstable_deploy:
labels:
lidar: "true"
path: ./lidar-driver
outputs:
- scans
- id: camera-driver
_unstable_deploy:
labels:
camera: rgbd
path: ./camera-driver
outputs:
- frames
With this configuration, lidar-driver runs on bot-01 or bot-03, and camera-driver runs on bot-02 or bot-03.
3. CI/CD Pipeline for Robotics
Automate cluster management in CI:
# Setup
adora cluster install cluster.yml
# Deploy new version
adora cluster upgrade cluster.yml
# Run integration tests
adora start test-dataflow.yml --name integration-test --attach
# Monitor
adora cluster status
adora top
# Cleanup
adora stop integration-test
4. Development to Production
| Stage | Approach | Command |
|---|---|---|
| Local dev | Single-process, no coordinator | adora run dataflow.yml |
| Staging | Ad-hoc daemons, manual setup | adora up + adora daemon on each machine |
| Production | Managed cluster, systemd services | adora cluster install cluster.yml |
Operations Runbook
Initial Setup Checklist
- SSH keys: Distribute SSH keys so the CLI machine can reach all cluster machines without a password (BatchMode=yes).
- Adora binary: Install the adora binary on all machines (same version).
- Network: Ensure the coordinator port (default 6013) is reachable from all machines. Ensure Zenoh ports are open between daemons for cross-machine node communication.
- cluster.yml: Create the cluster configuration with correct IPs, users, and labels.
Day-to-Day Operations
# Start a dataflow
adora start dataflow.yml --name my-app --attach
# List running dataflows
adora list
# Monitor resource usage
adora top
# View node logs
adora logs my-app <node-id> --follow
# Stop a dataflow
adora stop my-app
# Check cluster health
adora cluster status
Upgrading
- Build or download the new adora binary locally.
- Run adora cluster upgrade cluster.yml.
- Verify with adora cluster status that all daemons reconnected.
- Running dataflows are automatically re-spawned via auto-recovery.
Troubleshooting
Daemon not connecting
- Verify the coordinator is running and reachable: curl http://<addr>:6013/api/health (or check coordinator logs).
- Check daemon logs: journalctl -u adora-daemon-<id> -f (systemd) or the daemon’s stderr output (ad-hoc).
- Confirm --coordinator-addr and --coordinator-port match the coordinator’s actual bind address.
SSH failures during cluster commands
- Ensure ssh -o BatchMode=yes <user>@<host> echo ok works from the CLI machine.
- Check that StrictHostKeyChecking=accept-new is acceptable for your environment (the first connection auto-accepts the host key).
- Verify the user field in cluster.yml matches a valid SSH user on the target.
Label mismatch errors
- Error: no daemon matches labels {"gpu": "true"}.
- Check that the daemon was started with the correct --labels flag.
- Run adora cluster status to see connected daemons. Labels are set at daemon startup from cluster.yml and cannot be changed at runtime.
Auto-recovery not triggering
- Auto-recovery only applies to coordinator-managed dataflows (adora start), not adora run.
- Check coordinator logs for auto-recovery: re-spawning messages.
- If the node crashes immediately, recovery is throttled to once every 30 seconds per daemon per dataflow.
Deployment YAML Reference
The _unstable_deploy section on each node controls placement and distribution. All fields are optional.
nodes:
- id: my-node
_unstable_deploy:
machine: edge-01 # Target machine ID from cluster.yml
labels: # Label requirements (superset match)
gpu: "true"
arch: arm64
distribute: local # local | scp | http
working_dir: /opt/my-app # Working directory on the target machine
path: ./my-node
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| machine | string | none | Target machine ID. Takes priority over labels. |
| labels | map | empty | Required daemon labels. All key-value pairs must match. |
| distribute | string | local | Binary distribution strategy: local, scp, or http. |
| working_dir | path | none | Working directory on the target machine. |
Resolution priority
- machine – if set, the node is assigned to the daemon with that machine ID.
- labels – if set (and machine is not), the node is assigned to the first daemon whose labels are a superset of the required labels.
- Fallback – if neither is set, the node is assigned to the first unnamed (no machine-id) daemon.
Best Practices
- Use labels over machine IDs for flexibility. Labels decouple your dataflow from specific machines, making it easier to add, remove, or replace hardware.
- Use systemd install for production. Daemon services survive reboots and auto-restart on failure with Restart=on-failure.
- Use coordinator persistence (adora coordinator --store redb) with clusters so the coordinator survives restarts. See Coordinator State Persistence.
- Set restart policies on nodes for per-node resilience. Combine with auto-recovery for defense in depth. See Restart Policies.
- Monitor with multiple tools: adora cluster status for daemon health, adora top for resource usage, adora logs for node output.
- Test locally first. Develop with adora run dataflow.yml, then deploy to a cluster. The same dataflow YAML works in both modes – _unstable_deploy fields are ignored in local mode.
- Use rolling upgrades instead of stopping the entire cluster. adora cluster upgrade processes one machine at a time to maintain availability.
- Keep cluster.yml in version control alongside your dataflow definitions.