Distributed Deployment Guide
Dora supports deploying dataflows across multiple machines for multi-robot fleets, edge AI pipelines, and distributed robotics systems. This guide covers cluster management, node scheduling, binary distribution, auto-recovery, and operational best practices.
Table of Contents
- Overview
- Quick Start
- Features at a Glance
- Cluster Configuration Reference
- Cluster Commands Reference
- Node Scheduling
- Binary Distribution
- systemd Service Management
- Auto-Recovery
- Rolling Upgrade
- Use Cases
- Operations Runbook
- Deployment YAML Reference
- Best Practices
Overview
Dora’s distributed architecture has three tiers:
CLI --> Coordinator --> Daemon(s)     --> Nodes / Operators
        (one)           (per machine)     (user code)
- CLI sends control commands (build, start, stop) to the coordinator.
- Coordinator orchestrates daemons, resolves node placement, and manages dataflow lifecycle.
- Daemons run on each machine, spawning and supervising node processes.
- Nodes communicate via shared memory (same machine) or Zenoh pub-sub (cross-machine).
There are two paths to distributed deployment:
Ad-hoc – manually start dora daemon on each machine, then use the coordinator for control. Good for development and testing. See Distributed Deployments in the CLI reference.
Managed (cluster.yml) – define your cluster topology in a YAML file, then use dora cluster commands for SSH-based lifecycle management. This guide focuses on the managed path.
Quick Start
- Create a cluster.yml:
coordinator:
  addr: 10.0.0.1
machines:
  - id: robot
    host: 10.0.0.2
    user: ubuntu
  - id: gpu-server
    host: 10.0.0.3
    user: ubuntu
- Bring up the cluster:
dora cluster up cluster.yml
- Start a dataflow:
dora start dataflow.yml --name my-app --attach
- Check cluster health:
dora cluster status
- Tear down:
dora cluster down
Features at a Glance
| Feature | Command / Config | Description |
|---|---|---|
| Cluster lifecycle | dora cluster up/status/down | SSH-based daemon management from a single machine |
| Label scheduling | _unstable_deploy.labels | Route nodes to daemons by key-value labels |
| Binary distribution | _unstable_deploy.distribute | local, scp, or http strategies |
| systemd services | dora cluster install/uninstall | Persistent daemon services that survive reboots |
| Auto-recovery | Automatic | Re-spawn nodes when a daemon reconnects |
| Rolling upgrade | dora cluster upgrade | SCP binary + restart per-machine sequentially |
| Dataflow restart | dora cluster restart | Restart a running dataflow by name or UUID |
Cluster Configuration Reference
A cluster.yml file defines the coordinator address and the set of machines in the cluster.
Full Schema
coordinator:
  addr: 10.0.0.1       # IP address the coordinator binds to (required)
  port: 6013           # WebSocket port (default: 6013)
machines:
  - id: edge-01        # Unique machine identifier (required)
    host: 10.0.0.2     # SSH-reachable hostname or IP (required)
    user: ubuntu       # SSH user (optional, defaults to current user)
    labels:            # Key-value labels for scheduling (optional)
      gpu: "true"
      arch: arm64
  - id: edge-02
    host: 10.0.0.3
    labels:
      arch: arm64
Fields
coordinator
| Field | Type | Default | Description |
|---|---|---|---|
| addr | IP address | (required) | Address the coordinator binds to |
| port | u16 | 6013 | WebSocket port |
machines[]
| Field | Type | Default | Description |
|---|---|---|---|
| id | string | (required) | Unique machine identifier, used in _unstable_deploy.machine |
| host | string | (required) | SSH-reachable hostname or IP address |
| user | string | current user | SSH username |
| labels | map | empty | Key-value pairs for label-based scheduling |
Validation Rules
- At least one machine must be defined.
- Machine IDs must be non-empty and unique.
- Machine hosts must be non-empty.
- Unknown fields are rejected (deny_unknown_fields).
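The rules above can be sketched as a small checker over an already-parsed cluster.yml. This is only an illustration in Python, not the actual validation code: the function name validate_cluster and the error message wording are assumptions, and the requirement that coordinator.addr be present is taken from the field table rather than from the rule list.

```python
# Hypothetical sketch of the validation rules; names and messages are
# illustrative assumptions, not taken from the dora source.
ALLOWED_TOP = {"coordinator", "machines"}
ALLOWED_COORDINATOR = {"addr", "port"}
ALLOWED_MACHINE = {"id", "host", "user", "labels"}


def validate_cluster(config: dict) -> list:
    """Return a list of validation errors (empty if the config is valid)."""
    errors = []

    # Unknown fields are rejected at every level.
    for key in sorted(config.keys() - ALLOWED_TOP):
        errors.append(f"unknown field: {key}")

    coordinator = config.get("coordinator") or {}
    for key in sorted(coordinator.keys() - ALLOWED_COORDINATOR):
        errors.append(f"unknown field: coordinator.{key}")
    if not coordinator.get("addr"):
        errors.append("coordinator.addr is required")

    # At least one machine must be defined.
    machines = config.get("machines") or []
    if not machines:
        errors.append("at least one machine must be defined")

    seen_ids = set()
    for index, machine in enumerate(machines):
        for key in sorted(machine.keys() - ALLOWED_MACHINE):
            errors.append(f"unknown field: machines[{index}].{key}")
        machine_id = machine.get("id", "")
        if not machine_id:
            errors.append(f"machines[{index}].id must be non-empty")
        elif machine_id in seen_ids:
            errors.append(f"duplicate machine id: {machine_id}")
        seen_ids.add(machine_id)
        if not machine.get("host"):
            errors.append(f"machines[{index}].host must be non-empty")
    return errors
```

Running a check like this in CI before dora cluster up catches duplicate IDs and misspelled keys without touching any remote machine.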
Example: 3-Machine GPU Cluster
coordinator:
  addr: 192.168.1.1
machines:
  - id: coordinator-host
    host: 192.168.1.1
    labels:
      role: control
  - id: gpu-a100
    host: 192.168.1.10
    user: ml
    labels:
      gpu: a100
      arch: x86_64
  - id: jetson-01
    host: 192.168.1.20
    user: nvidia
    labels:
      gpu: jetson
      arch: arm64
Cluster Commands Reference
All dora cluster commands operate on a cluster.yml file and use SSH to manage remote machines.
SSH options used: BatchMode=yes, ConnectTimeout=10, StrictHostKeyChecking=accept-new.
dora cluster up
Bring up a multi-machine cluster from a cluster.yml file. Starts the coordinator locally, then SSH-es into each machine to start a daemon.
dora cluster up <PATH>
Arguments:
| Argument | Description |
|---|---|
| PATH | Path to the cluster configuration file |
Behavior:
- Loads and validates the cluster config.
- Starts the coordinator locally on addr:port.
- For each machine, SSH-es in and runs nohup dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> [--labels k1=v1,k2=v2] --quiet.
- Polls until all expected daemons register with the coordinator (30s timeout).
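To make the per-machine step concrete, the following Python sketch assembles the remote command and the ssh argument list from one machines[] entry, using the flags and SSH options documented in this section. The helper names daemon_command and ssh_argv are hypothetical, and details the guide does not specify (output redirection, backgrounding) are deliberately left out.

```python
import shlex

# SSH options listed in this guide.
SSH_OPTIONS = [
    "-o", "BatchMode=yes",
    "-o", "ConnectTimeout=10",
    "-o", "StrictHostKeyChecking=accept-new",
]


def daemon_command(machine: dict, coordinator_addr: str, coordinator_port: int = 6013) -> str:
    """Build the remote daemon command for one machine entry (illustrative helper)."""
    args = [
        "nohup", "dora", "daemon",
        "--machine-id", machine["id"],
        "--coordinator-addr", coordinator_addr,
        "--coordinator-port", str(coordinator_port),
    ]
    labels = machine.get("labels") or {}
    if labels:
        # --labels is only passed when the machine defines labels.
        args += ["--labels", ",".join(f"{k}={v}" for k, v in labels.items())]
    args.append("--quiet")
    return shlex.join(args)


def ssh_argv(machine: dict, remote_command: str) -> list:
    """Build the local ssh argv; without a user, ssh falls back to the current user."""
    user = machine.get("user")
    target = f"{user}@{machine['host']}" if user else machine["host"]
    return ["ssh", *SSH_OPTIONS, target, remote_command]
```

Printing ssh_argv(machine, daemon_command(machine, "10.0.0.1")) for each entry is a cheap way to preview what a bring-up would run on each host.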
Example:
$ dora cluster up cluster.yml
Starting coordinator on 10.0.0.1:6013...
Starting daemon on robot (ubuntu@10.0.0.2)... OK
Starting daemon on gpu-server (ubuntu@10.0.0.3)... OK
All 2 daemons connected.
dora cluster status
Show the current status of the cluster. Displays connected daemons and active dataflow count.
dora cluster status [--coordinator-addr ADDR] [--coordinator-port PORT]
Flags:
| Flag | Default | Description |
|---|---|---|
| --coordinator-addr | localhost | Coordinator hostname or IP |
| --coordinator-port | 6013 | Coordinator WebSocket port |
Example:
$ dora cluster status
DAEMON ID     LAST HEARTBEAT
robot         2s ago
gpu-server    1s ago
Active dataflows: 1
dora cluster down
Tear down the cluster (coordinator and all daemons).
dora cluster down [--coordinator-addr ADDR] [--coordinator-port PORT]
Terminates all daemons and the coordinator process.
dora cluster install
Install dora-daemon as a systemd service on each machine. SSH-es into each machine, writes a systemd unit file, and enables the service.
dora cluster install <PATH>
Arguments:
| Argument | Description |
|---|---|
| PATH | Path to the cluster configuration file |
Behavior:
For each machine, creates and enables a systemd service named dora-daemon-<id>. The unit file:
[Unit]
Description=Dora Daemon (<id>)
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> --labels k1=v1,k2=v2 --quiet
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
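As an illustration of how the template is filled in per machine, here is a short Python sketch that renders the unit text and the unit file path. The function names render_unit and unit_path are hypothetical; the path follows the one removed by dora cluster uninstall, and omitting --labels for machines without labels is an assumption.

```python
# Template mirrors the unit file shown above; {labels} expands to an optional
# " --labels k1=v1,k2=v2" segment.
UNIT_TEMPLATE = """\
[Unit]
Description=Dora Daemon ({id})
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=dora daemon --machine-id {id} --coordinator-addr {addr} --coordinator-port {port}{labels} --quiet
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
"""


def render_unit(machine_id: str, addr: str, port: int = 6013, labels: dict = None) -> str:
    """Render the systemd unit text for one machine (illustrative helper)."""
    label_arg = ""
    if labels:
        label_arg = " --labels " + ",".join(f"{k}={v}" for k, v in labels.items())
    return UNIT_TEMPLATE.format(id=machine_id, addr=addr, port=port, labels=label_arg)


def unit_path(machine_id: str) -> str:
    """Location of the unit file, matching the path removed by uninstall."""
    return f"/etc/systemd/system/dora-daemon-{machine_id}.service"
```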
Example:
$ dora cluster install cluster.yml
Installing dora-daemon-robot on ubuntu@10.0.0.2... OK
Installing dora-daemon-gpu-server on ubuntu@10.0.0.3... OK
2/2 succeeded.
dora cluster uninstall
Uninstall dora-daemon systemd services from each machine. Stops, disables, and removes the systemd unit.
dora cluster uninstall <PATH>
Behavior:
For each machine, runs:
sudo systemctl stop dora-daemon-<id>
sudo systemctl disable dora-daemon-<id>
sudo rm -f /etc/systemd/system/dora-daemon-<id>.service
sudo systemctl daemon-reload
dora cluster upgrade
Rolling upgrade: SCP the local dora binary to each machine and restart daemons. Processes machines sequentially to maintain availability.
dora cluster upgrade <PATH>
Behavior:
For each machine sequentially:
- SCP the local dora binary to /usr/local/bin/dora on the target machine.
- Restart the systemd service via sudo systemctl restart dora-daemon-<id>.
- Poll the coordinator until the daemon reconnects (30s timeout, 500ms intervals).
Nodes on other machines continue running while each machine is being upgraded.
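The reconnect wait in the last step is a bounded polling loop. A minimal Python sketch, assuming a caller-supplied is_connected check (hypothetical; in practice it would query the coordinator for the daemon's registration) and the 30s / 500ms values given above:

```python
import time
from typing import Callable


def wait_for_reconnect(
    is_connected: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_connected until it returns True or timeout_s elapses.

    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_connected():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result corresponds to a machine that did not come back within the timeout; a wrapper script would typically stop the rollout at that point rather than continue to the next machine.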
Example:
$ dora cluster upgrade cluster.yml
Upgrading robot (ubuntu@10.0.0.2)...
SCP binary... OK
Restart service... OK
Waiting for reconnect... OK (3.2s)
Upgrading gpu-server (ubuntu@10.0.0.3)...
SCP binary... OK
Restart service... OK
Waiting for reconnect... OK (2.8s)
2/2 succeeded.
dora cluster restart
Restart a running dataflow by name or UUID. Stops the dataflow and immediately re-starts it using the stored descriptor (no YAML path needed).
dora cluster restart <DATAFLOW>
Arguments:
| Argument | Description |
|---|---|
| DATAFLOW | Name or UUID of the dataflow to restart |
Example:
$ dora cluster restart my-app
Restarting dataflow `my-app`
dataflow restarted: a1b2c3d4-... -> e5f6a7b8-...
Node Scheduling
When the coordinator receives a dataflow, it decides which daemon runs each node based on the _unstable_deploy section in the dataflow YAML. Resolution priority: machine > labels > unnamed.
Machine-based scheduling
Assign a node to a specific machine by its id from cluster.yml:
nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames
The coordinator looks up the daemon whose machine-id matches. If no matching daemon is connected, the deployment fails with: no matching daemon for machine id "robot".
Label-based scheduling
Assign a node by requiring specific labels on the target daemon:
nodes:
  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions
The coordinator finds the first connected daemon whose labels are a superset of the required labels. All required key-value pairs must match exactly. If no daemon satisfies the requirements, deployment fails with: no daemon matches labels {"gpu": "true"}.
Unassigned nodes
Nodes without an _unstable_deploy section (or with an empty one) are assigned to the first unnamed daemon – one that connected without a --machine-id flag.
How resolve_daemon() works internally
The coordinator resolves node placement in coordinator/run/mod.rs:
resolve_daemon(connections, deploy) -> DaemonId
  1. If deploy.machine is Some(id):
       -> look up daemon by machine-id
  2. Else if deploy.labels is non-empty:
       -> find first daemon where all required labels match
  3. Else:
       -> pick first unnamed daemon
The label matching function iterates over all connected daemons and checks that every required key-value pair exists in the daemon’s label set (conn.labels.get(k) == Some(v)). This is a superset check: a daemon with {gpu: "true", arch: "arm64", role: "edge"} satisfies the requirement {gpu: "true"}.
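The resolution order and the superset check can be modelled in a few lines. The sketch below is a Python re-statement of the behavior described above, not the Rust implementation: the DaemonConnection type and its field names are assumptions, and the error text for the unnamed-daemon case is invented for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DaemonConnection:
    daemon_id: str
    machine_id: Optional[str] = None      # None models an unnamed daemon
    labels: dict = field(default_factory=dict)


def resolve_daemon(connections, machine: Optional[str] = None, labels: Optional[dict] = None) -> str:
    """Pick a daemon: machine takes priority, then labels, then first unnamed daemon."""
    if machine is not None:
        for conn in connections:
            if conn.machine_id == machine:
                return conn.daemon_id
        raise LookupError(f'no matching daemon for machine id "{machine}"')

    if labels:
        for conn in connections:
            # Superset check: every required pair must be present with the same value.
            if all(conn.labels.get(k) == v for k, v in labels.items()):
                return conn.daemon_id
        raise LookupError(f"no daemon matches labels {json.dumps(labels)}")

    for conn in connections:
        if conn.machine_id is None:
            return conn.daemon_id
    raise LookupError("no unnamed daemon connected")  # hypothetical message
```

Because matching returns the first daemon that satisfies the labels, the order in which daemons appear in connections decides ties between equally suitable machines.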
Binary Distribution
Control how node binaries are delivered to remote daemons via the distribute field.
Local (default)
Each daemon builds from source on its own machine. This is the current default behavior.
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: local
    path: ./my-node
SCP mode
The CLI pushes the locally-built binary to the target machine via SSH/SCP before spawning.
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: scp
    path: ./my-node
HTTP mode
The coordinator runs an artifact store. Daemons pull binaries from the coordinator via HTTP before spawning.
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: http
    path: ./my-node
Artifacts are served from GET /api/artifacts/{build_id}/{node_id} on the coordinator’s WebSocket port. The endpoint requires authentication (Bearer token) and sanitizes node IDs to prevent path traversal.
When to use each strategy
| Strategy | Best for | Tradeoffs |
|---|---|---|
| local | Homogeneous clusters, CI builds | Requires build toolchain on every machine |
| scp | Heterogeneous clusters, cross-compiled binaries | Requires SSH access from CLI to all machines |
| http | Air-gapped daemons, firewalled networks | Requires coordinator reachability from all daemons |
systemd Service Management
For production deployments, install daemons as systemd services so they survive reboots and auto-restart on failure.
Install
dora cluster install cluster.yml
Creates a systemd unit file on each machine (see dora cluster install for the full unit template). Key properties:
- Restart=on-failure with RestartSec=5: daemon auto-restarts if it crashes.
- After=network-online.target: waits for network before starting.
- WantedBy=multi-user.target: starts on boot.
Uninstall
dora cluster uninstall cluster.yml
Stops, disables, and removes the unit file from each machine, then reloads the systemd daemon.
Verifying service status
After install, check services directly:
ssh ubuntu@10.0.0.2 sudo systemctl status dora-daemon-robot
Auto-Recovery
When a daemon disconnects and reconnects (e.g., after a network blip, machine reboot, or service restart), the coordinator automatically re-spawns any missing dataflows on that daemon.
How it works
- Daemon reconnects and sends a StatusReport listing its currently running dataflows.
- Coordinator compares the report against its expected state (dataflows that should have nodes on this daemon).
- For each running dataflow with nodes assigned to this daemon that the daemon did not report, the coordinator sends a SpawnDataflowNodes command to re-spawn the missing nodes.
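The comparison step amounts to a set difference between what the coordinator expects on the daemon and what the daemon reports. A minimal Python sketch, with a hypothetical function name and a simplified data model (dataflow IDs as strings, node IDs as sets):

```python
def dataflows_to_respawn(expected: dict, reported_running: set) -> dict:
    """Return the dataflows, with their node IDs, that need re-spawning on one daemon.

    expected maps dataflow ID -> node IDs the coordinator assigned to this daemon.
    reported_running is the set of dataflow IDs listed in the daemon's status report.
    """
    return {
        dataflow_id: set(node_ids)
        for dataflow_id, node_ids in expected.items()
        # Skip dataflows with no nodes on this daemon and those already reported.
        if node_ids and dataflow_id not in reported_running
    }
```

Each entry in the result would correspond to one spawn command for that dataflow on the reconnecting daemon.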
30-second backoff
To prevent crash loops (e.g., a node that immediately crashes on spawn), recovery uses a per-daemon, per-dataflow backoff:
- After a recovery attempt, the coordinator records the timestamp.
- Subsequent recovery for the same daemon/dataflow pair is skipped until 30 seconds have elapsed.
- The backoff clears when the daemon reports the dataflow as running again.
This means a node that crashes immediately will only be re-spawned once every 30 seconds, not in a tight loop.
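The backoff bookkeeping described above fits in a small class keyed by the daemon/dataflow pair. The Python sketch below is an illustration under assumptions: the class and method names are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class RecoveryBackoff:
    """Per-daemon, per-dataflow recovery throttle (illustrative model)."""

    def __init__(self, window_s: float = 30.0, clock=time.monotonic):
        self._window_s = window_s
        self._clock = clock
        self._last_attempt = {}  # (daemon_id, dataflow_id) -> timestamp

    def should_attempt(self, daemon_id: str, dataflow_id: str) -> bool:
        """Return True and record the attempt unless one happened within the window."""
        key = (daemon_id, dataflow_id)
        now = self._clock()
        last = self._last_attempt.get(key)
        if last is not None and now - last < self._window_s:
            return False
        self._last_attempt[key] = now
        return True

    def mark_running(self, daemon_id: str, dataflow_id: str) -> None:
        """Clear the backoff once the daemon reports the dataflow as running."""
        self._last_attempt.pop((daemon_id, dataflow_id), None)
```

Keying on the pair rather than on the daemon alone means one crash-looping dataflow does not delay recovery of other dataflows on the same machine.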
Limitations
- Auto-recovery only applies to dataflows started via dora start (coordinator-managed). Local dora run dataflows are not tracked by the coordinator.
- Recovery re-spawns all nodes assigned to the reconnecting daemon, not individual nodes. For per-node restart on crash, use restart policies.
- Known issue (#260): when the daemon’s WebSocket connection to the coordinator drops, the daemon currently kills all running node processes before reconnecting. This means the coordinator’s auto-recovery path re-spawns the nodes from scratch rather than reclaiming still-running processes. The net effect is a brief disruption (nodes restart) rather than seamless continuity. A fix to preserve running processes across reconnect cycles is planned.
Rolling Upgrade
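Upgrade the dora binary on all cluster machines one machine at a time, so that nodes on the other machines keep running throughout. Nodes on the machine being upgraded restart briefly when its daemon restarts.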
Process
dora cluster upgrade cluster.yml
For each machine, sequentially:
- SCP the local dora binary to /usr/local/bin/dora on the target.
- Restart the systemd service (sudo systemctl restart dora-daemon-<id>).
- Poll the coordinator until the daemon reconnects (30s timeout).
Because machines are upgraded one at a time, nodes on other machines continue running. After the daemon reconnects, auto-recovery re-spawns any dataflow nodes that were running on that machine.
Prerequisites
- Daemons must be installed as systemd services (dora cluster install).
- The local dora binary must be compatible with the cluster’s coordinator version.
- SSH access with sudo permissions on all target machines.
Use Cases
1. Edge AI Pipeline (Robot + GPU Server)
A camera node runs on the robot, sends frames to a GPU server for inference, and results flow back to an actuator on the robot.
cluster.yml:
coordinator:
  addr: 192.168.1.1
machines:
  - id: robot
    host: 192.168.1.10
    user: ubuntu
    labels:
      role: edge
  - id: gpu-server
    host: 192.168.1.20
    user: ml
    labels:
      gpu: "true"
dataflow.yml:
nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames
  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions
  - id: actuator
    _unstable_deploy:
      machine: robot
    path: ./actuator-driver
    inputs:
      commands: inference/predictions
2. Multi-Robot Fleet
A central coordinator manages N robots with heterogeneous hardware. Label scheduling routes nodes to the right machines without hardcoding machine IDs.
cluster.yml:
coordinator:
  addr: 10.0.0.1
machines:
  - id: bot-01
    host: 10.0.0.11
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"
  - id: bot-02
    host: 10.0.0.12
    user: robot
    labels:
      fleet: warehouse
      camera: rgbd
  - id: bot-03
    host: 10.0.0.13
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"
      camera: rgbd
dataflow.yml:
nodes:
  - id: lidar-driver
    _unstable_deploy:
      labels:
        lidar: "true"
    path: ./lidar-driver
    outputs:
      - scans
  - id: camera-driver
    _unstable_deploy:
      labels:
        camera: rgbd
    path: ./camera-driver
    outputs:
      - frames
With this configuration, lidar-driver is eligible for bot-01 or bot-03, and camera-driver for bot-02 or bot-03. Each node is placed on exactly one machine: the first connected daemon whose labels match.
3. CI/CD Pipeline for Robotics
Automate cluster management in CI:
# Setup
dora cluster install cluster.yml
# Deploy new version
dora cluster upgrade cluster.yml
# Run integration tests
dora start test-dataflow.yml --name integration-test --attach
# Monitor
dora cluster status
dora top
# Cleanup
dora stop integration-test
4. Development to Production
| Stage | Approach | Command |
|---|---|---|
| Local dev | Single-process, no coordinator | dora run dataflow.yml |
| Staging | Ad-hoc daemons, manual setup | dora up + dora daemon on each machine |
| Production | Managed cluster, systemd services | dora cluster install cluster.yml |
Operations Runbook
Initial Setup Checklist
- SSH keys: Distribute SSH keys so the CLI machine can reach all cluster machines without a password (BatchMode=yes).
- Dora binary: Install the dora binary on all machines (same version).
- Network: Ensure coordinator port (default 6013) is reachable from all machines. Ensure Zenoh ports are open between daemons for cross-machine node communication.
- cluster.yml: Create the cluster configuration with correct IPs, users, and labels.
Day-to-Day Operations
# Start a dataflow
dora start dataflow.yml --name my-app --attach
# List running dataflows
dora list
# Monitor resource usage
dora top
# View node logs
dora logs my-app <node-id> --follow
# Stop a dataflow
dora stop my-app
# Check cluster health
dora cluster status
Upgrading
- Build or download the new dora binary locally.
- Run dora cluster upgrade cluster.yml.
- Verify with dora cluster status that all daemons reconnected.
- Running dataflows are automatically re-spawned via auto-recovery.
Troubleshooting
Daemon not connecting
- Verify the coordinator is running and reachable: curl http://<addr>:6013/api/health (or check coordinator logs).
- Check daemon logs: journalctl -u dora-daemon-<id> -f (systemd) or the daemon’s stderr output (ad-hoc).
- Confirm the --coordinator-addr and --coordinator-port match the coordinator’s actual bind address.
SSH failures during cluster commands
- Ensure ssh -o BatchMode=yes <user>@<host> echo ok works from the CLI machine.
- Check that StrictHostKeyChecking=accept-new is acceptable for your environment (first connection auto-accepts the host key).
- Verify the user field in cluster.yml matches a valid SSH user on the target.
Label mismatch errors
- Error: no daemon matches labels {"gpu": "true"}.
- Check that the daemon was started with the correct --labels flag.
- Run dora cluster status to see connected daemons. Labels are set at daemon startup from cluster.yml and cannot be changed at runtime.
Auto-recovery not triggering
- Auto-recovery only applies to coordinator-managed dataflows (dora start), not dora run.
- Check coordinator logs for auto-recovery: re-spawning messages.
- If the node crashes immediately, recovery is throttled to once every 30 seconds per daemon per dataflow.
Deployment YAML Reference
The _unstable_deploy section on each node controls placement and distribution. All fields are optional.
nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01          # Target machine ID from cluster.yml
      labels:                   # Label requirements (superset match)
        gpu: "true"
        arch: arm64
      distribute: local         # local | scp | http
      working_dir: /opt/my-app  # Working directory on the target machine
    path: ./my-node
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| machine | string | none | Target machine ID. Takes priority over labels. |
| labels | map | empty | Required daemon labels. All key-value pairs must match. |
| distribute | string | local | Binary distribution strategy: local, scp, or http. |
| working_dir | path | none | Working directory on the target machine. |
Resolution priority
- machine – if set, the node is assigned to the daemon with that machine ID.
- labels – if set (and machine is not), the node is assigned to the first daemon whose labels are a superset of the required labels.
- Fallback – if neither is set, the node is assigned to the first unnamed (no machine-id) daemon.
Best Practices
- Use labels over machine IDs for flexibility. Labels decouple your dataflow from specific machines, making it easier to add, remove, or replace hardware.
- Use systemd install for production. Daemon services survive reboots and auto-restart on failure with Restart=on-failure.
- Use coordinator persistence (dora coordinator --store redb) with clusters so the coordinator survives restarts. See Coordinator State Persistence.
- Set restart policies on nodes for per-node resilience. Combine with auto-recovery for defense in depth. See Restart Policies.
- Monitor with multiple tools: dora cluster status for daemon health, dora top for resource usage, dora logs for node output.
- Test locally first. Develop with dora run dataflow.yml, then deploy to a cluster. The same dataflow YAML works in both modes: _unstable_deploy fields are ignored in local mode.
- Use rolling upgrades instead of stopping the entire cluster. dora cluster upgrade processes one machine at a time to maintain availability.
- Keep cluster.yml in version control alongside your dataflow definitions.