Dora

Agentic Dataflow-Oriented Robotic Architecture – a 100% Rust framework for real-time robotics and AI applications.

Why Dora?

Performance

10-17x faster than ROS2 Python – 100% Rust internals with zero-copy shared memory IPC for messages >=4KB, flat latency from 4KB to 4MB payloads
Apache Arrow native – columnar memory format end-to-end with zero serialization overhead; shared across all language bindings
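Zenoh SHM data plane – nodes publish directly through Zenoh shared memory for zero-copy data transfer, with automatic network fallback across machines
Non-blocking event loop – Zenoh publishing is offloaded to a drain task; metrics collection runs in the background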

Developer Experience

Single CLI, full lifecycle – dora run for local dev, dora up/start for distributed prod, plus build, logs, monitoring, record/replay all from one tool
Declarative YAML dataflows – define pipelines as directed graphs, connect nodes through typed inputs/outputs, optional type annotations with static validation
Multi-language nodes – write nodes in Rust, Python, C, or C++ with native APIs (not wrappers); mix languages freely in one dataflow
Reusable modules – compose sub-graphs as standalone YAML files with typed inputs/outputs, parameters, and nested composition
Hot reload – live-reload Python operators without restarting the dataflow

Production Readiness

Fault tolerance – per-node restart policies (never/on-failure/always), exponential backoff, health monitoring, circuit breakers with configurable input timeouts
Distributed by default – local shared memory between co-located nodes, automatic Zenoh pub-sub for cross-machine communication, SSH-based cluster management with label scheduling
Coordinator HA – persistent redb state store, daemon auto-reconnect, dataflow records survive coordinator restart (full running-dataflow reclaim across restart is partial; see open tracker)
Dynamic topology – add and remove nodes on a running dataflow from the CLI, without a restart
Configurable queue policies – drop_oldest (default) or backpressure per input, with metrics on dropped messages
Soft real-time – optional --rt flag enables mlockall + SCHED_FIFO; per-node cpu_affinity pins nodes to cores
OpenTelemetry – built-in structured logging with rotation/routing, metrics, distributed tracing

Debugging and Observability

Record/replay – capture dataflow messages to .drec files, replay offline at any speed with node substitution
Topic inspection – topic echo to print live data, topic hz TUI for frequency analysis, topic info for schema and bandwidth
Resource monitoring – dora top TUI showing per-node CPU, memory, queue depth, network I/O across all machines
Log aggregation – subscribe to dora/logs to receive structured log messages from all nodes without extra wiring
Trace inspection – trace list and trace view for viewing coordinator spans without external infrastructure

Ecosystem

Communication patterns – built-in service (request/reply), action (goal/feedback/result), and streaming (session/segment/chunk) patterns via well-known metadata keys
ROS2 bridge – bidirectional interop with ROS2 topics, services, and actions
In-process operators – lightweight functions that run inside a shared runtime, avoiding per-node process overhead

Installation

From crates.io (recommended)

cargo install dora-cli           # CLI (the dora command)
pip install dora-rs              # Python node/operator API

From source

git clone https://github.com/dora-rs/dora.git
cd dora
cargo build --release -p dora-cli
PATH=$PATH:$(pwd)/target/release

# Python API (requires maturin >= 1.8: pip install maturin)
# Must run from the package directory for dependency resolution
cd apis/python/node && maturin develop --uv && cd ../../..

Platform installers

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf \
  https://github.com/dora-rs/dora/releases/latest/download/dora-cli-installer.sh | sh

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://github.com/dora-rs/dora/releases/latest/download/dora-cli-installer.ps1 | iex"

Build Features

Feature	Description	Default
`tracing`	OpenTelemetry tracing support	Yes
`metrics`	OpenTelemetry metrics collection	No
`python`	Python operator support (PyO3)	No
`redb-backend`	Persistent coordinator state (redb)	No

cargo install dora-cli --features redb-backend

Verify

dora --version
dora status

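Python Quick Start

This guide walks you through writing the nodes and operators of a dora dataflow in Python.

Prerequisites

cargo install dora-cli    # CLI (the dora command)
pip install dora-rs       # Python node/operator API

The dora-rs package already includes pyarrow as a dependency.

Building from source (an alternative to pip install dora-rs):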

pip install maturin  # requires >= 1.8
cd apis/python/node && maturin develop --uv && cd ../../..

Hello World: Sender and Receiver

Create three files:

sender.py – sends 100 numbered messages:

import pyarrow as pa
from dora import Node

node = Node()
for i in range(100):
    node.send_output("message", pa.array([i]))

receiver.py – receives and prints messages:

from dora import Node

node = Node()
for event in node:
    if event["type"] == "INPUT":
        values = event["value"].to_pylist()
        print(f"Received {event['id']}: {values}")
    elif event["type"] == "STOP":
        break

dataflow.yml – connects sender to receiver:

nodes:
  - id: sender
    path: sender.py
    outputs:
      - message

  - id: receiver
    path: receiver.py
    inputs:
      message: sender/message

Run it:

dora run dataflow.yml

Events

Every call to node.next(), and every iteration of for event in node, returns an event dictionary:

Key	Type	Description
`type`	str	`"INPUT"`, `"INPUT_CLOSED"`, `"STOP"`, or `"ERROR"`
`id`	str	Input name (e.g. `"message"`) – for `INPUT` and `INPUT_CLOSED` events
`value`	pyarrow.Array or None	The data payload
`metadata`	dict	Tracing/routing metadata

Handle events by checking event["type"] (the match statement requires Python 3.10 or later):

for event in node:
    match event["type"]:
        case "INPUT":
            process(event["id"], event["value"])
        case "INPUT_CLOSED":
            print(f"Input {event['id']} closed")
        case "STOP":
            break

Working with Arrow Data

All data flows through dora as Apache Arrow arrays. Common patterns:

import pyarrow as pa

# Simple values
node.send_output("count", pa.array([42]))
node.send_output("names", pa.array(["alice", "bob"]))

# Read values back
values = event["value"].to_pylist()  # [42] or ["alice", "bob"]

# Structured data
struct = pa.StructArray.from_arrays(
    [pa.array([1.5]), pa.array(["hello"])],
    names=["x", "y"],
)
node.send_output("point", struct)

# Raw bytes (images, serialized data, etc.)
node.send_output("frame", pa.array(raw_bytes))

Operators

Operators are lightweight alternatives to nodes. They run inside the dora runtime process (no separate OS process), making them faster for simple transformations.

Define an Operator class with an on_event method:

# doubler_op.py
import pyarrow as pa
from dora import DoraStatus

class Operator:
    def on_event(self, event, send_output) -> DoraStatus:
        if event["type"] == "INPUT":
            value = event["value"].to_pylist()[0]
            send_output("doubled", pa.array([value * 2]), event["metadata"])
        return DoraStatus.CONTINUE

Reference it in YAML with operator instead of path:

nodes:
  - id: timer
    path: dora/timer/millis/500
    outputs:
      - tick

  - id: doubler
    operator:
      python: doubler_op.py
      inputs:
        tick: timer/tick
      outputs:
        - doubled

When to use operators vs nodes:

	Nodes	Operators
Process model	Separate OS process	In-process (shared runtime)
Startup cost	Higher	Lower
Isolation	Full process isolation	Shared memory space
Best for	Long-running, heavy compute	Lightweight transforms, filters

Async Nodes

For nodes that need async I/O (HTTP calls, database queries, etc.), use recv_async():

import asyncio
from dora import Node

async def main():
    node = Node()
    for _ in range(50):
        event = await node.recv_async()
        if event["type"] == "STOP":
            break
        # Do async work here
        result = await fetch_data(event["value"])
        node.send_output("result", result)

asyncio.run(main())

See examples/python-async for a complete example.

Logging

Use node.log() for structured logging that integrates with dora logs:

node.log("info", "Processing item", {"count": str(i)})

Or use Python’s standard logging module – dora captures stdout/stderr automatically:

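import logging
logging.basicConfig(level=logging.INFO)  # the root logger defaults to WARNING, which would drop info()
logging.info("Processing item %d", i)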

See examples/python-logging for logging module integration.

Timers

Built-in timer nodes generate periodic ticks without writing any code:

nodes:
  - id: tick-source
    path: dora/timer/millis/100    # tick every 100ms
    outputs:
      - tick

  - id: my-node
    path: my_node.py
    inputs:
      tick: tick-source/tick

Also available: dora/timer/hz/30 for 30 Hz.

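Next Steps

Python API reference – complete API documentation for Node, Operator, DataflowBuilder, and CUDA
Communication patterns – service (request/reply) and action (goal/feedback/result) patterns
Examples – python-dataflow, python-async, python-drain, python-concurrent-rw, python-multiple-arrays
Distributed deployment – run across multiple machines with dora up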

Dora Architecture

Comprehensive architecture reference for Dora (AI-Dora, Agentic Dataflow-Oriented Robotic Architecture) — a 100% Rust framework for real-time robotics and AI applications.

Overview and Design Philosophy

Dora is built on four core principles:

Dataflow-oriented: Applications are directed graphs of nodes connected by typed data channels. Nodes declare inputs and outputs; the framework handles routing, scheduling, and lifecycle.
Zero-copy performance: Messages of 4 KiB and larger use shared memory with 128-byte aligned buffers and atomic coordination, achieving 10-17x lower latency than ROS2 Python.
Multi-language: First-class support for Rust, Python (PyO3), C, and C++ nodes — all sharing the same Apache Arrow data format.
Four-layer stack: Message protocol, core libraries, daemon/runtime execution, and CLI/coordinator orchestration.

Architecture Stack

┌─────────────────────────────────────────────────┐
│  CLI (dora)          Coordinator (orchestrator) │  Layer 4: Orchestration
├─────────────────────────────────────────────────┤
│  Daemon (per-machine)    Runtime (operators)     │  Layer 3: Execution
├─────────────────────────────────────────────────┤
│  dora-core    shared-memory-server    Node API  │  Layer 2: Core Libraries
├─────────────────────────────────────────────────┤
│  dora-message (protocol + Arrow types)          │  Layer 1: Protocol
└─────────────────────────────────────────────────┘

Workspace Structure

Rust edition 2024, MSRV 1.85.0, workspace version 0.1.0. All crates share the workspace version.

Binaries (7)

Path	Crate	Role
`binaries/cli`	dora-cli	CLI binary (`dora` command) — build, run, stop dataflows
`binaries/coordinator`	dora-coordinator	Orchestrates distributed multi-daemon deployments; WebSocket server
`binaries/daemon`	dora-daemon	Spawns nodes, manages shared-memory/TCP communication per machine
`binaries/runtime`	dora-runtime	In-process operator execution (Python/C/C++ via dlopen/PyO3)
`binaries/ros2-bridge-node`	dora-ros2-bridge-node	ROS2 integration node
`binaries/record-node`	dora-record-node	Records dataflow messages to `.drec` format
`binaries/replay-node`	dora-replay-node	Replays recorded messages from `.drec` files

Core Libraries (6)

Path	Crate	Role
`libraries/message`	dora-message	All inter-component message types, protocol definitions, Arrow metadata
`libraries/core`	dora-core	Dataflow descriptor parsing, build utilities, Zenoh config
`libraries/shared-memory-server`	shared-memory-server	Zero-copy IPC for messages >= 4 KiB
`libraries/recording`	dora-recording	Recording format (.drec): bincode header + entries + footer
`libraries/arrow-convert`	dora-arrow-convert	Arrow type conversions (numeric, datetime)
`libraries/coordinator-store`	dora-coordinator-store	State persistence for coordinator (in-memory or redb backend)

Extension Libraries (5)

Path	Crate	Role
`libraries/extensions/telemetry/tracing`	dora-tracing	OpenTelemetry distributed tracing (OTLP exporter)
`libraries/extensions/telemetry/metrics`	dora-metrics	System metrics collection (CPU, memory, disk)
`libraries/extensions/download`	dora-download	HTTP file download utility for operator/node binaries
`libraries/extensions/ros2-bridge`	dora-ros2-bridge	ROS2 integration: topic pub/sub, services, actions
`libraries/log-utils`	dora-log-utils	Log parsing, merging, filtering, formatting

API Crates (8)

Path	Crate	Language
`apis/rust/node`	dora-node-api	Rust
`apis/rust/operator`	dora-operator-api	Rust
`apis/rust/operator/macros`	dora-operator-api-macros	Rust (proc-macro)
`apis/rust/operator/types`	dora-operator-api-types	Rust (FFI-safe types)
`apis/python/node`	dora-node-api-python	Python (PyO3) – builds the `dora` module
`apis/python/operator`	dora-operator-api-python	Python (PyO3) – compiled into dora-node-api-python
`apis/c/node`	dora-node-api-c	C
`apis/c/operator`	dora-operator-api-c	C/C++

Component Architecture

CLI

The dora command provides three command groups:

Lifecycle (run, up, down, build, start, stop, restart):

dora run executes a dataflow locally without coordinator/daemon (single-machine shortcut)
dora up / dora down manage coordinator + daemon infrastructure
dora start / dora stop control dataflows on a running coordinator

Monitoring (list, logs, inspect, topic, node, record, replay, trace):

Real-time inspection with dora inspect top
Topic subscription and data inspection
Recording and replay via .drec files

Setup (status, new, graph, system, completion, self):

Project scaffolding, dataflow visualization, self-update

Coordinator

The coordinator is an Axum-based WebSocket server that orchestrates distributed deployments.

                          ┌──────────────────┐
                          │   Coordinator     │
            WS /api/control  │  ┌────────────┐  │  WS /api/daemon
   CLI ◄──────────────────►  │  │   State    │  │ ◄──────────────────► Daemon(s)
                          │  │   Store    │  │
                          │  └────────────┘  │
                          │  /api/artifacts  │
                          │  /health         │
                          └──────────────────┘

Routes (the first two are WebSocket endpoints):

/api/control — CLI control plane (build, start, stop, list, logs, topic subscribe)
/api/daemon — Daemon registration and event stream
/api/artifacts/{build_id}/{node_id} — Binary artifact downloads
/health — Health check endpoint

State management: In-memory by default, optional persistent storage via redb backend.

Daemon

One daemon runs per machine and manages the lifecycle of all nodes on that machine.

┌──────────────────────────────────────────────────────┐
│                     Daemon                           │
│                                                      │
│  ┌──────────┐  ┌───────────┐  ┌──────────────────┐  │
│  │ Event    │  │ Spawner   │  │ Node Comm        │  │
│  │ Loop     │──│ (nodes)   │  │ ┌──────────────┐ │  │
│  │          │  └───────────┘  │ │ TCP listener │ │  │
│  │ Sources: │  ┌───────────┐  │ │ Shmem server │ │  │
│  │ • Coord  │  │ Fault     │  │ │ Unix socket  │ │  │
│  │ • Nodes  │──│ Tolerance │  │ └──────────────┘ │  │
│  │ • Zenoh  │  └───────────┘  └──────────────────┘  │
│  │ • Timers │                                        │
│  └──────────┘                                        │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │ Running Dataflows                            │    │
│  │  ├─ Node A (process) ◄──► TCP/Shmem          │    │
│  │  ├─ Node B (process) ◄──► TCP/Shmem          │    │
│  │  └─ Runtime (operators) ◄──► TCP/Shmem       │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────┘

Event loop (Daemon::run_inner()): Async Tokio event loop merging:

Coordinator commands (WebSocket)
Node events (TCP/shared memory)
Inter-daemon events (Zenoh)
Heartbeat (5s interval), metrics collection (2s), health checks (5s default)

Node spawning:

Create working directory for the node
Set up communication channel (TCP or shmem)
Serialize NodeConfig to environment variable
Spawn process with sanitized environment (blocks LD_PRELOAD, DYLD_INSERT_LIBRARIES, etc.)
Monitor via ProcessHandle
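
As an illustration of the sanitization step, the sketch below filters an environment mapping before a child process would be spawned. It is written in Python for brevity (the daemon itself is Rust). `sanitized_env` and `BLOCKED_VARS` are hypothetical names, and the blocklist holds only the two variables named above, not the full set.

```python
import os

# Only the two variables named in the text; the real blocklist is longer.
BLOCKED_VARS = {"LD_PRELOAD", "DYLD_INSERT_LIBRARIES"}

def sanitized_env(env=None):
    """Return a copy of env without loader-injection variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in BLOCKED_VARS}
```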

Runtime

The runtime executes in-process operators (Python or shared library; WASM is planned) in a dedicated process.

┌──────────────────────────────┐
│          Runtime             │
│                              │
│  ┌────────────────────────┐  │
│  │ Operator Runner        │  │
│  │ (separate thread)      │  │
│  │                        │  │
│  │ SharedLibrary → dlopen │  │
│  │ Python → PyO3          │  │
│  │ Wasm → (planned)       │  │
│  └──────────┬─────────────┘  │
│             │ flume(2)       │
│  ┌──────────▼─────────────┐  │
│  │ Event Merge Loop       │  │
│  │ ├─ OperatorEvent       │  │
│  │ └─ DaemonEvent         │  │
│  └────────────────────────┘  │
└──────────────────────────────┘

Single-threaded Tokio runtime
Operator runs in a separate thread, communicates via flume::bounded(2) channel
Input queue size per data ID configurable (default: 10)
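
A bounded input queue with the drop_oldest policy can be pictured as follows. This is a minimal Python sketch, not the runtime's implementation: `DropOldestQueue` is a hypothetical name, and the `dropped` counter merely stands in for the dropped-message metric.

```python
from collections import deque

class DropOldestQueue:
    """Bounded input queue: when full, the oldest message is discarded."""

    def __init__(self, capacity=10):  # 10 is the documented default queue size
        self._items = deque()
        self.capacity = capacity
        self.dropped = 0  # stand-in for the dropped-message metric

    def push(self, message):
        if len(self._items) >= self.capacity:
            self._items.popleft()  # make room by discarding the oldest entry
            self.dropped += 1
        self._items.append(message)

    def pop(self):
        """Return the oldest queued message, or None if the queue is empty."""
        return self._items.popleft() if self._items else None
```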

Nodes

Nodes are standalone processes that communicate with the daemon.

Lifecycle:

Node starts, reads NodeConfig from environment
Registers with daemon via DaemonRequest::Register
Subscribes to events via DaemonRequest::Subscribe
Processes events in a loop (NextEvent → handle → SendMessage)
Reports drop tokens for shared memory cleanup
Signals completion via OutputsDone

Communication Protocols

CLI to Coordinator (WebSocket)

Property	Value
Transport	WebSocket over TCP
Default port	6013
Auth	Bearer token in `Authorization` header
Control messages	JSON text frames (request/response/event)
Topic data	Binary frames: `[16-byte UUID][bincode payload]`
Rate limit	20 connections per IP per 60s
Max connections	256

JSON-RPC-like message format:

// Request (client → server)
{"id": "uuid", "method": "control", "params": {...}}

// Response (server → client)
{"id": "uuid", "result": {...}}
// or
{"id": "uuid", "error": "message"}

// Event (fire-and-forget, either direction)
{"event": "log", "payload": {...}}

Key control methods: Build, Start, Stop, List, Logs, TopicSubscribe, TopicUnsubscribe, Reload, Restart, Destroy.
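
The three frame shapes can be told apart by their keys. The Python sketch below builds a request and classifies incoming text frames; `make_request` and `classify` are hypothetical helpers, and the contents of `params` are left empty because the source elides them.

```python
import json
import uuid

def make_request(method, params):
    """Build a request frame; the id lets the reply be matched to it."""
    return json.dumps({"id": str(uuid.uuid4()), "method": method, "params": params})

def classify(frame):
    """Return 'request', 'response', 'error', or 'event' for a JSON text frame."""
    msg = json.loads(frame)
    if "event" in msg:
        return "event"
    if "method" in msg:
        return "request"
    if "error" in msg:
        return "error"
    if "result" in msg:
        return "response"
    raise ValueError("unrecognized frame")
```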

Coordinator to Daemon (WebSocket)

Property	Value
Transport	WebSocket (daemon connects to coordinator)
Route	`/api/daemon`
Retry	Exponential backoff 1s → 30s, max 50 attempts
Registration	`DaemonRegisterRequest` with version, machine_id, labels

Daemon events (daemon → coordinator): BuildResult, SpawnResult, AllNodesReady, AllNodesFinished, Heartbeat, StatusReport, Log, NodeMetrics, Exit.

Coordinator commands (coordinator → daemon): Build, Spawn, AllNodesReady, StopDataflow, ReloadDataflow, Logs, Destroy, Heartbeat.

Daemon to Node (Local)

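Three transport options (TCP, shared memory, and the Unix socket shown in the daemon diagram), configured via LocalCommunicationConfig. TCP and shared memory are described here: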

TCP (default):

Binds 127.0.0.1:0 (ephemeral port), TCP_NODELAY enabled
Frame format: [8-byte u64 LE length][bincode payload]
Max message: 64 MiB, read timeout: 30s
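
The length-prefixed framing can be sketched in a few lines of Python. The payload is treated as opaque bytes here (in Dora it is bincode), and `encode_frame` and `decode_frame` are hypothetical helper names rather than part of any Dora API.

```python
import struct

MAX_MESSAGE_BYTES = 64 * 1024 * 1024  # 64 MiB limit from the text

def encode_frame(payload: bytes) -> bytes:
    """Prefix payload with its length as an 8-byte little-endian u64."""
    if len(payload) > MAX_MESSAGE_BYTES:
        raise ValueError("payload exceeds 64 MiB")
    return struct.pack("<Q", len(payload)) + payload

def decode_frame(buffer: bytes):
    """Return (payload, remaining), or (None, buffer) if the frame is incomplete."""
    if len(buffer) < 8:
        return None, buffer
    (length,) = struct.unpack_from("<Q", buffer)
    if length > MAX_MESSAGE_BYTES:
        raise ValueError("declared length exceeds 64 MiB")
    if len(buffer) < 8 + length:
        return None, buffer
    return buffer[8:8 + length], buffer[8 + length:]
```

Checking the declared length before reading the body is what lets a receiver reject an oversized frame without allocating for it.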

Shared Memory (zero-copy):

Four 4 KiB regions per node: control, events, drop tokens, events-close
Used for messages >= 4096 bytes (ZERO_COPY_THRESHOLD)
Atomic synchronization with acquire/release ordering

Node → Daemon requests: Register, Subscribe, SendMessage, CloseOutputs, OutputsDone, NextEvent, ReportDropTokens, SubscribeDrop, NodeConfig.

Daemon → Node replies: Result, PreparedMessage, NextEvents, NextDropEvents, NodeConfig, Empty.

Node events: Stop, Reload, Input, InputClosed, InputRecovered, NodeRestarted, AllInputsClosed.

Daemon to Daemon (Zenoh)

Property	Value
Transport	Zenoh pub-sub
Router port	7447
Peer port	5456
Routing	linkstate
Serialization	bincode

Topic pattern:

dora/{network_id}/{dataflow_id}/output/{node_id}/{output_id}

Default network_id is "default".
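
As a small illustration of the key layout, the Python sketch below builds and splits a topic key. Both function names are hypothetical. Splitting on "/" relies on node and output IDs never containing a slash, which the ID validation rules in this document guarantee.

```python
def output_topic(dataflow_id, node_id, output_id, network_id="default"):
    """Build the Zenoh key for one node output."""
    return f"dora/{network_id}/{dataflow_id}/output/{node_id}/{output_id}"

def parse_output_topic(key):
    """Split a key back into (network_id, dataflow_id, node_id, output_id)."""
    parts = key.split("/")
    if len(parts) != 6 or parts[0] != "dora" or parts[3] != "output":
        raise ValueError(f"not a dora output key: {key}")
    return parts[1], parts[2], parts[4], parts[5]
```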

InterDaemonEvent:

Output { dataflow_id, node_id, output_id, metadata, data } — data message
OutputClosed { dataflow_id, node_id, output_id } — stream end

Message Types and Wire Formats

Timestamped Wrapper

All inter-component messages are wrapped in a Timestamped envelope that adds a hybrid logical clock timestamp:

pub struct Timestamped<T> {
    pub inner: T,
    pub timestamp: uhlc::Timestamp,  // hybrid logical clock
}

DataMessage

Transport abstraction for payloads:

pub enum DataMessage {
    Vec(AVec<u8, ConstAlign<128>>),    // inline, 128-byte aligned
    SharedMemory {
        shared_memory_id: String,
        len: usize,
        drop_token: DropToken,          // UUIDv7, tracks lifetime
    },
}

LogMessage

pub struct LogMessage {
    pub build_id: Option<BuildId>,
    pub dataflow_id: Option<DataflowId>,
    pub node_id: Option<NodeId>,
    pub daemon_id: Option<DaemonId>,
    pub level: LogLevelOrStdout,       // Stdout | LogLevel(Error/Warn/Info/Debug/Trace)
    pub target: Option<String>,
    pub module_path: Option<String>,
    pub file: Option<String>,
    pub line: Option<u32>,
    pub message: String,
    pub timestamp: DateTime<Utc>,
    pub fields: Option<BTreeMap<String, String>>,
}

NodeError

pub struct NodeError {
    pub timestamp: uhlc::Timestamp,
    pub cause: NodeErrorCause,         // GraceDuration | Cascading | FailedToSpawn | Other
    pub exit_status: NodeExitStatus,   // Success | IoError | ExitCode | Signal | Unknown
}

Data Format and Metadata

Apache Arrow

All data payloads use Apache Arrow columnar format with 128-byte alignment. Arrow type information is carried in every message via ArrowTypeInfo:

pub struct ArrowTypeInfo {
    pub data_type: DataType,           // Arrow DataType
    pub len: usize,
    pub null_count: usize,
    pub validity: Option<Vec<u8>>,     // null bitmap
    pub offset: usize,
    pub buffer_offsets: Vec<BufferOffset>,
    pub child_data: Vec<ArrowTypeInfo>,  // recursive for nested types
}

Metadata

Every message carries structured metadata:

pub struct Metadata {
    metadata_version: u16,
    timestamp: uhlc::Timestamp,
    pub type_info: ArrowTypeInfo,
    pub parameters: MetadataParameters,   // BTreeMap<String, Parameter>
}

Parameter Types

pub enum Parameter {
    Bool(bool),
    Integer(i64),
    String(String),
    ListInt(Vec<i64>),
    Float(f64),
    ListFloat(Vec<f64>),
    ListString(Vec<String>),
    Timestamp(DateTime<Utc>),
}

Well-Known Metadata Keys

Key	Purpose
`request_id`	Service request/reply correlation
`goal_id`	Action goal identifier
`goal_status`	Action completion: `succeeded`, `aborted`, `canceled`
`session_id`	Streaming session identifier
`segment_id`	Streaming segment within a session
`seq`	Streaming chunk sequence number
`fin`	Last chunk of a streaming segment
`flush`	Discard older queued messages on input
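
To show how the streaming keys fit together, here is a Python sketch that groups chunks by session and segment and releases a segment once its final chunk arrives. `SegmentAssembler` is a hypothetical name, and the value types chosen for the keys (string session IDs, integer `seq`, boolean `fin`) are assumptions for illustration.

```python
from collections import defaultdict

class SegmentAssembler:
    """Collect chunks per (session_id, segment_id) and emit them once fin arrives."""

    def __init__(self):
        self._chunks = defaultdict(dict)  # (session, segment) -> {seq: payload}

    def add(self, metadata, payload):
        key = (metadata["session_id"], metadata["segment_id"])
        self._chunks[key][metadata["seq"]] = payload
        if not metadata.get("fin", False):
            return None
        chunks = self._chunks.pop(key)
        # Order by sequence number in case chunks arrived out of order before fin.
        return [chunks[seq] for seq in sorted(chunks)]
```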

Zero-Copy Shared Memory

Architecture

┌────────────────────────────────────────────────────┐
│                Shared Memory Region                │
│                                                    │
│  ┌──────────┐ ┌──────────┐ ┌──────┐ ┌─────┐ ┌────┐ │
│  │ Server   │ │ Client   │ │Discon│ │Len  │ │Data│ │
│  │ Event    │ │ Event    │ │(bool)│ │(u64)│ │    │ │
│  └──────────┘ └──────────┘ └──────┘ └─────┘ └────┘ │
│  (raw_sync_2) (raw_sync_2) AtomicBool AtomicU64    │
└────────────────────────────────────────────────────┘

ShmemChannel

pub struct ShmemChannel {
    memory: Shmem,
    server_event: Box<dyn EventImpl>,
    client_event: Box<dyn EventImpl>,
    disconnect_offset: usize,
    len_offset: usize,
    data_offset: usize,
    server: bool,
}

Synchronization Protocol

Send (write → release store length → signal event → check disconnect):

Copy data to shared memory buffer
Store message length with Release ordering (publishes data)
Signal event to wake receiver
Check disconnect flag with Acquire ordering

Receive (wait event → check disconnect → acquire load length → read data):

Wait for event signal
Check disconnect flag with Acquire ordering
Load message length with Acquire ordering (ensures all writes visible)
Read and deserialize data from buffer

Thresholds and Limits

Parameter	Value
`ZERO_COPY_THRESHOLD`	4096 bytes
Control region size	4 KiB per node
Events region size	4 KiB per node
Drop region size	4 KiB per node
Max cache count	20 regions
Max cache bytes	256 MiB

DropToken Lifecycle

Sender allocates shared memory, generates DropToken (UUIDv7)
Sender transmits DataMessage::SharedMemory { shared_memory_id, len, drop_token }
Receiver processes data, returns drop_token via ReportDropTokens
Sender receives confirmed token, returns memory to cache for reuse
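
The bookkeeping behind these four steps can be sketched as a small sender-side cache. This Python model is illustrative only: `RegionCache` is a hypothetical name, regions are represented by plain strings, and what happens to a region once the cache is full is an assumption.

```python
import uuid

class RegionCache:
    """Sender-side bookkeeping: regions in flight are keyed by their drop token."""

    MAX_CACHED = 20  # "Max cache count" from the thresholds table

    def __init__(self):
        self.in_flight = {}  # drop_token -> region id
        self.free = []       # regions ready for reuse

    def send(self, region_id):
        # The source specifies UUIDv7; uuid4 stands in because older Pythons lack uuid7.
        token = str(uuid.uuid4())
        self.in_flight[token] = region_id
        return token

    def report_drop(self, token):
        """Called when the receiver returns the token via ReportDropTokens."""
        region_id = self.in_flight.pop(token)
        if len(self.free) < self.MAX_CACHED:
            self.free.append(region_id)
        # Assumption: beyond the cache limit the region would be released instead.
```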

Dataflow Specification

YAML Format

nodes:
  # Standard node (executable)
  - id: my-node
    build: cargo build --release
    path: target/release/my-node
    inputs:
      tick: dora/timer/millis/100
      data: other-node/output
    outputs:
      - result
    restart_policy: on-failure
    max_restarts: 3
    restart_delay: 1.0
    env:
      DEBUG: true

  # Single operator (Python)
  - id: processor
    operator:
      python: process.py
      inputs:
        image: camera/frame
      outputs:
        - detection

  # Multi-operator runtime
  - id: pipeline
    operators:
      - id: stage1
        python: stage1.py
        inputs:
          data: source/output
        outputs:
          - intermediate
      - id: stage2
        shared-library: target/release/libstage2.so
        inputs:
          data: stage1/intermediate
        outputs:
          - final

  # ROS2 bridge
  - id: ros-input
    ros2:
      topic: /robot/state
      message_type: sensor_msgs/JointState
      direction: subscribe
      qos:
        reliable: true
    outputs:
      - joints

Descriptor Structs

pub struct Descriptor {
    pub nodes: Vec<Node>,
    pub communication: CommunicationConfig,
    pub deploy: Option<Deploy>,
    pub debug: Debug,
    pub health_check_interval: Option<f64>,  // default 5.0s
}

Node types (mutually exclusive fields):

path — standard executable/script
operator — single in-process operator
operators — multiple in-process operators
custom — legacy configuration
ros2 — declarative ROS2 bridge
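
The mutual-exclusion rule can be checked on a parsed node mapping, as in the Python sketch below. `node_kind` is a hypothetical helper and is not how dora-core performs its validation. It only demonstrates the "exactly one of" constraint on a plain dict.

```python
NODE_KINDS = ("path", "operator", "operators", "custom", "ros2")

def node_kind(node):
    """Return the single kind field set on a node mapping, or raise ValueError."""
    present = [kind for kind in NODE_KINDS if kind in node]
    if len(present) != 1:
        raise ValueError(
            f"node {node.get('id')!r} must set exactly one of {NODE_KINDS}, found {present}"
        )
    return present[0]
```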

Timer Nodes

Built-in timer nodes generate periodic ticks:

dora/timer/millis/<N> — every N milliseconds
dora/timer/secs/<N> — every N seconds
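
A timer source string maps directly to a tick period. The Python sketch below handles only the two forms listed here; `timer_period_seconds` is a hypothetical helper name.

```python
def timer_period_seconds(source):
    """Convert dora/timer/millis/<N> or dora/timer/secs/<N> to a period in seconds."""
    prefix, unit, value = source.rsplit("/", 2)
    if prefix != "dora/timer":
        raise ValueError(f"not a timer source: {source}")
    n = int(value)
    if unit == "millis":
        return n / 1000
    if unit == "secs":
        return float(n)
    raise ValueError(f"unsupported timer unit: {unit}")
```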

Operator Sources

pub enum OperatorSource {
    SharedLibrary(String),   // .so/.dll path
    Python(PythonSource),    // Python module
    Wasm(String),            // WebAssembly (planned)
}

Deploy Configuration

pub struct Deploy {
    pub machine: Option<String>,
    pub working_dir: Option<PathBuf>,
    pub labels: BTreeMap<String, String>,
    pub distribute: DistributeStrategy,  // Local | Scp | Http
}

Fault Tolerance

Restart Policies

pub enum RestartPolicy {
    Never,       // default
    OnFailure,   // restart on non-zero exit
    Always,      // restart unless user-stopped or inputs closed
}

Configuration fields per node:

max_restarts — 0 = unlimited
restart_delay — initial backoff in seconds (doubles each attempt)
max_restart_delay — caps exponential backoff
restart_window — reset counter after N seconds (enables “N restarts per M seconds”)
health_check_timeout — kill node if no activity within this duration
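
How these fields interact can be sketched as two small functions. This is an illustrative Python model, not the daemon's code. It assumes the first restart waits exactly restart_delay, and it ignores restart_window as well as the user-stopped and inputs-closed exceptions of the always policy. Both function names are hypothetical.

```python
def restart_delay_for(attempt, restart_delay=1.0, max_restart_delay=None):
    """Backoff before restart number `attempt` (1-based); doubles each attempt."""
    delay = restart_delay * (2 ** (attempt - 1))
    if max_restart_delay is not None:
        delay = min(delay, max_restart_delay)  # cap the exponential growth
    return delay

def should_restart(policy, exit_code, restarts_so_far, max_restarts=0):
    """policy is 'never', 'on-failure', or 'always'; max_restarts=0 means unlimited."""
    if max_restarts and restarts_so_far >= max_restarts:
        return False
    if policy == "always":
        return True
    if policy == "on-failure":
        return exit_code != 0
    return False
```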

Health Monitoring

Heartbeat interval: 5 seconds (daemon → coordinator)
Health check interval: 5 seconds (configurable per dataflow)
Metrics collection: 2-second interval (CPU, memory, disk, pending messages)

Circuit Breaker

Per-input timeout detection with automatic recovery:

Input configured with input_timeout: <seconds>
If no data arrives within timeout → InputClosed event sent to node
Node marks input as degraded, can use cached last-known value
When upstream recovers → InputRecovered event, circuit breaker closes again
Node status transitions: Running → Degraded → Running
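
The transitions above amount to a two-state machine per input. The Python sketch below models it with explicit timestamps so it can be stepped deterministically. `InputBreaker` is a hypothetical name, and the returned strings simply echo the event names from the list.

```python
class InputBreaker:
    """Tracks one input: 'Running' while data is fresh, 'Degraded' after a timeout."""

    def __init__(self, input_timeout):
        self.input_timeout = input_timeout
        self.last_seen = 0.0
        self.status = "Running"
        self.cached = None  # last-known value, usable while degraded

    def on_data(self, now, value):
        """Record fresh data; report recovery if the input was degraded."""
        self.last_seen, self.cached = now, value
        if self.status == "Degraded":
            self.status = "Running"
            return "InputRecovered"
        return None

    def on_tick(self, now):
        """Periodic check; trips the breaker when the timeout has elapsed."""
        if self.status == "Running" and now - self.last_seen > self.input_timeout:
            self.status = "Degraded"
            return "InputClosed"
        return None
```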

Cascading Error Tracking

pub struct CascadingErrorCauses {
    pub caused_by: BTreeMap<NodeId, NodeId>,
}

Tracks which node failure caused downstream failures, enabling root-cause analysis.
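
Root-cause lookup then reduces to following the map until it ends. The Python sketch below assumes each key is an affected node and each value is the node that caused its failure. That direction is an assumption about the map, and `root_cause` is a hypothetical helper.

```python
def root_cause(caused_by, node_id):
    """Follow caused_by links from node_id until a node with no recorded cause."""
    seen = {node_id}
    current = node_id
    while current in caused_by:
        current = caused_by[current]
        if current in seen:  # guard against a malformed cyclic map
            raise ValueError("cycle in caused_by")
        seen.add(current)
    return current
```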

Fault Tolerance Metrics

pub struct FaultToleranceSnapshot {
    pub restarts: u64,
    pub health_check_kills: u64,
    pub input_timeouts: u64,
    pub circuit_breaker_recoveries: u64,
}

Reported per daemon via heartbeat events. Visible via dora inspect top.

Distributed Deployment

Multi-Daemon Architecture

  ┌──────────┐       Zenoh        ┌──────────┐
  │ Daemon A │◄──────────────────►│ Daemon B │
  │ Machine 1│    pub/sub         │ Machine 2│
  │          │                    │          │
  │ Node 1   │                    │ Node 3   │
  │ Node 2   │                    │ Node 4   │
  └────┬─────┘                    └────┬─────┘
       │ WS                            │ WS
       └──────────┐  ┌────────────────┘
                  ▼  ▼
             ┌───────────┐
             │Coordinator│
             │  :6013    │
             └───────────┘

Zenoh Topic Naming

dora/{network_id}/{dataflow_id}/output/{node_id}/{output_id}

network_id isolates separate Dora clusters (default: "default")
Zenoh router port: 7447, peer port: 5456
Routing mode: linkstate

Build Distribution

Three strategies via DistributeStrategy:

Local — each daemon builds from source (default)
Scp — CLI pushes built binaries via SSH/SCP
Http — daemons pull from coordinator’s /api/artifacts endpoint

Machine Labels

Nodes can target specific machines via labels:

_unstable_deploy:
  labels:
    gpu: "true"
    arch: "arm64"

Recording and Replay

.drec Binary Format

[HEADER]
├─ MAGIC: 8 bytes ("DORAREC")
├─ version: u16 LE (currently 1)
├─ start_nanos: u64 LE (Unix epoch nanoseconds)
├─ dataflow_id: 16 bytes (UUID)
├─ yaml_len: u32 LE
└─ descriptor_yaml: [u8; yaml_len]

[ENTRIES] (repeated)
├─ record_len: u32 LE
├─ node_id_len: u16 LE
├─ node_id: [u8; node_id_len]
├─ output_id_len: u16 LE
├─ output_id: [u8; output_id_len]
├─ timestamp_offset_nanos: u64 LE
├─ event_bytes_len: u32 LE
└─ event_bytes: [u8; event_bytes_len]    (bincode InterDaemonEvent)

[FOOTER] (optional, written on clean finish)
├─ FOOTER_MAGIC: 8 bytes ("DORAEND")
├─ total_messages: u64 LE
└─ total_bytes: u64 LE
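
The entry layout can be exercised with Python's struct module. The sketch below is not the dora-recording implementation. It assumes that record_len counts the bytes following the record_len field and that IDs are UTF-8, and it treats event_bytes as opaque. Both function names are hypothetical.

```python
import struct

def encode_entry(node_id, output_id, timestamp_offset_nanos, event_bytes):
    """Serialize one entry using the little-endian field layout listed above."""
    node, output = node_id.encode(), output_id.encode()
    body = (
        struct.pack("<H", len(node)) + node
        + struct.pack("<H", len(output)) + output
        + struct.pack("<Q", timestamp_offset_nanos)
        + struct.pack("<I", len(event_bytes)) + event_bytes
    )
    # Assumption: record_len counts the bytes that follow the record_len field.
    return struct.pack("<I", len(body)) + body

def decode_entry(data):
    """Parse one entry; returns (node_id, output_id, offset_nanos, event_bytes, consumed)."""
    (record_len,) = struct.unpack_from("<I", data, 0)
    pos = 4
    (node_len,) = struct.unpack_from("<H", data, pos)
    pos += 2
    node_id = data[pos:pos + node_len].decode()
    pos += node_len
    (output_len,) = struct.unpack_from("<H", data, pos)
    pos += 2
    output_id = data[pos:pos + output_len].decode()
    pos += output_len
    offset_nanos, event_len = struct.unpack_from("<QI", data, pos)
    pos += 12  # 8-byte offset plus 4-byte length
    event_bytes = data[pos:pos + event_len]
    return node_id, output_id, offset_nanos, event_bytes, 4 + record_len
```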

Writer/Reader API

pub struct RecordingWriter<W: Write> { /* ... */ }
impl<W: Write> RecordingWriter<W> {
    pub fn new(inner: W, header: &RecordingHeader) -> Result<Self>;
    pub fn write_entry(&mut self, entry: &RecordEntry) -> Result<()>;
    pub fn finish(self) -> Result<RecordingFooter>;
}

pub struct RecordingReader<R: Read> { /* ... */ }
impl<R: Read> RecordingReader<R> {
    pub fn open(inner: R) -> Result<Self>;
    pub fn header(&self) -> &RecordingHeader;
    pub fn next_entry(&mut self) -> Result<Option<RecordEntry>>;
}

Extensions

Telemetry

Distributed Tracing (dora-tracing):

OpenTelemetry with OTLP exporter (compatible with Jaeger, Zipkin, Tempo)
Context propagation across nodes
Setup: set_up_tracing(name: &str)

Metrics (dora-metrics):

System metrics via sysinfo (CPU, memory, disk)
OpenTelemetry meter with OTLP exporter
Async process observer: run_metrics_monitor(meter_id)

ROS2 Bridge

Declarative YAML-based ROS2 integration supporting:

Topics — subscribe (ROS2 → Dora) or publish (Dora → ROS2):

ros2:
  topic: /camera/image
  message_type: sensor_msgs/Image
  direction: subscribe

Services — client or server role:

ros2:
  service: /add_two_ints
  service_type: example_interfaces/AddTwoInts
  role: client

Actions — goal/feedback/result lifecycle:

ros2:
  action: /fibonacci
  action_type: example_interfaces/Fibonacci
  role: client

QoS configuration:

qos:
  reliable: true
  durability: transient_local
  keep_last: 10

Download

File download utility for fetching operator/node binaries from HTTP URLs. Sanitizes filenames, sets executable permissions on Unix.

Key Constants and Defaults

Constant	Value	Description
`DORA_COORDINATOR_PORT_WS_DEFAULT`	6013	Coordinator WebSocket port
`DORA_DAEMON_LOCAL_LISTEN_PORT_DEFAULT`	53291	Daemon TCP listener port
`ZERO_COPY_THRESHOLD`	4096 bytes	Shared memory activation
`MAX_MESSAGE_BYTES`	64 MiB	Max TCP/bincode message
`MAX_CONTROL_MESSAGE_BYTES`	1 MiB	Max control plane JSON message
`TCP_READ_TIMEOUT`	30 seconds	Socket read timeout
`WS_PING_INTERVAL`	10 seconds	WebSocket keepalive
`MAX_WS_CONNECTIONS`	256	Concurrent WebSocket limit
`MAX_CONNECTIONS_PER_IP`	20 / 60s	Rate limiting
`MAX_TOPICS_PER_SUBSCRIBE`	64	Topic batch limit
`MAX_SUBSCRIPTIONS_PER_CONNECTION`	16	Per-connection limit
`MAX_BINARY_PAYLOAD_BYTES`	64 MiB	Topic data frame limit
`WATCHDOG_INTERVAL`	5 seconds	Heartbeat to coordinator
`METRICS_INTERVAL`	2 seconds	Metrics collection
`HEALTH_CHECK_INTERVAL`	5 seconds	Default node health check
`MAX_BUFFERED_LOG_MESSAGES`	10,000	Log buffer capacity
`MAX_PENDING_REPLIES`	256	Pending coordinator replies
`MAX_ERROR_BYTES`	4096	Max error message size
Default input queue size	10	Per-input message buffer

Identifiers and Data Structures

ID Types

Type	Underlying	Validation
`DataflowId`	`uuid::Uuid`	Assigned on dataflow start
`SessionId`	`uuid::Uuid` (v7)	Per CLI session
`BuildId`	`uuid::Uuid` (v7)	Per build operation
`DaemonId`	`{ machine_id: Option<String>, uuid: Uuid (v7) }`	Persisted in `.daemon-id`
`NodeId`	`String`	Validated: `[a-zA-Z0-9_.-]`, non-empty
`DataId`	`String`	Same validation as `NodeId`
`OperatorId`	`String`	No validation
`DropToken`	`Uuid` (v7)	Per shared-memory message
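
The NodeId and DataId rule in the table can be illustrated with a small check. This is a minimal sketch in Python, not Dora's own validator: the function name is hypothetical, and the pattern is simply the character class quoted in the table.

```python
import re

# Character class copied from the "Validation" column above.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_node_id(value: str) -> bool:
    """Return True if value is non-empty and uses only the allowed characters."""
    return _ID_PATTERN.fullmatch(value) is not None
```

Under this rule, camera-driver and ml.inference_1 pass, while an empty string or an ID containing / or a space is rejected.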

Authentication

pub struct AuthToken(String);  // 64 hex chars (32 bytes)

Generated via cryptographically random bytes
Stored at <working_dir>/.dora-token
Constant-time comparison to prevent timing attacks
Applied to all WebSocket routes
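
The token properties listed above (32 random bytes, hex encoding, constant-time comparison) can be sketched with the Python standard library. The function names are hypothetical and this is not Dora's implementation, which is written in Rust.

```python
import hmac
import secrets


def generate_token() -> str:
    """Return 32 cryptographically random bytes as 64 hex characters."""
    return secrets.token_hex(32)


def tokens_match(expected: str, presented: str) -> bool:
    """Compare in constant time so timing does not reveal where a mismatch occurs."""
    return hmac.compare_digest(expected.encode(), presented.encode())
```

A plain == comparison can return as soon as the first differing character is found, which is the timing side channel that the constant-time comparison avoids.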

Node Status

pub enum NodeStatus {
    Running,     // healthy
    Restarting,  // restart in progress
    Degraded,    // circuit breaker open (input timeout)
    Failed,      // terminal failure
}

Serialization Summary

Channel	Format	Notes
CLI ↔ Coordinator	JSON text frames	Preserves u128 for HLC timestamps
Coordinator ↔ Daemon	JSON text frames	Direct string serialization
Daemon ↔ Node (TCP)	bincode over length-prefixed frames	8-byte LE length prefix
Daemon ↔ Node (shmem)	bincode via shared memory	Atomic synchronization
Daemon ↔ Daemon	bincode over Zenoh	Apache Arrow data format
Recording	bincode entries in .drec	Custom binary container
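
To make the "8-byte LE length prefix" framing concrete, here is a minimal Python sketch of writing and reading one frame. The payload is treated as opaque bytes (in Dora it would be bincode-encoded), the function names are hypothetical, and the stream is assumed to be a blocking file-like object whose read(n) returns n bytes unless the stream ends.

```python
import io
import struct

MAX_MESSAGE_BYTES = 64 * 1024 * 1024  # 64 MiB, from the constants table


def write_frame(stream, payload: bytes) -> None:
    """Write an 8-byte little-endian length followed by the payload."""
    stream.write(struct.pack("<Q", len(payload)))
    stream.write(payload)


def read_frame(stream, max_len: int = MAX_MESSAGE_BYTES) -> bytes:
    """Read one frame, rejecting truncated or oversized messages."""
    header = stream.read(8)
    if len(header) != 8:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack("<Q", header)
    if length > max_len:
        raise ValueError(f"frame of {length} bytes exceeds limit {max_len}")
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    return payload
```

Checking the declared length against a limit before reading the payload keeps a malformed prefix from triggering a very large allocation.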

Dataflow YAML Specification

Dataflows are defined in YAML files. Each file describes a graph of nodes, their inputs/outputs, and execution parameters.

A JSON Schema is available at the repo root (dora-schema.json) for editor autocompletion and validation.

Quick Start

nodes:
  - id: sender
    path: sender.py
    outputs:
      - message

  - id: receiver
    path: receiver.py
    inputs:
      message: sender/message

Run with dora run dataflow.yml (local mode) or dora up && dora start dataflow.yml (networked mode).

Editor Setup

Add a schema comment at the top of your YAML file for VS Code autocompletion (requires the YAML extension):

# yaml-language-server: $schema=https://raw.githubusercontent.com/dora-rs/dora/main/dora-schema.json
nodes:
  - id: my-node
    # ... autocompletion works here

Root-Level Fields

Field	Type	Default	Description
`nodes`	list	required	List of node configurations
`strict_types`	bool	`false`	Treat type warnings as errors in `validate` and `build`
`type_rules`	list	`[]`	User-defined type compatibility rules (see Type Annotations)
`health_check_interval`	float	`5.0`	Seconds between daemon health check sweeps. For each node with `health_check_timeout` set, the daemon checks whether the node has communicated within its timeout; if not, the node is killed and its `restart_policy` is evaluated
`_unstable_deploy`	object	--	Root-level deployment config (see Deployment)
`_unstable_debug`	object	--	Debug options (see Debug)

Node Configuration

Every node requires an id. All other fields are optional (though most nodes need at least path or operator/operators).

Identity

Field	Type	Description
`id`	string	Required. Unique identifier. Must not contain `/`. Whitespace is discouraged
`name`	string	Human-readable display name (metadata only, used in tooling and logs)
`description`	string	Documentation string (metadata only, not used at runtime)

Source

A node’s executable comes from a local path, a git repository, a module reference, or is implicit (operator/ROS2 nodes).

Field	Type	Description
`path`	string	Path to executable or script. Can also be a URL (legacy)
`module`	string	Path to a module definition file (mutually exclusive with `path`). See Modules Guide
`git`	string	Git repo URL. `dora build` clones it and uses the clone dir as working directory
`branch`	string	Branch to checkout (requires `git`, mutually exclusive with `tag`/`rev`)
`tag`	string	Tag to checkout (requires `git`, mutually exclusive with `branch`/`rev`)
`rev`	string	Commit hash to checkout (requires `git`, mutually exclusive with `branch`/`tag`)
`build`	string	Build commands run during `dora build`. Each line runs separately. `pip`/`pip3` lines use `uv` when `--uv` is passed
`args`	string	Command-line arguments (space-separated)

Example with git source:

- id: rust-node
  git: https://github.com/dora-rs/dora.git
  branch: main
  build: cargo build -p example-node --release
  path: target/release/example-node

Data I/O

Inputs

Inputs subscribe to another node’s output using the format <node-id>/<output-id>:

inputs:
  # Short form
  image: camera/frames
  tick: dora/timer/millis/100

  # Long form with options
  sensor_data:
    source: sensor/frames
    queue_size: 10
    queue_policy: drop_oldest
    input_timeout: 5.0

  # Lossless input (buffers up to 10x queue_size instead of dropping)
  commands:
    source: controller/cmd
    queue_size: 100
    queue_policy: backpressure

Input option	Type	Default	Description
`source`	string	required	`<node-id>/<output-id>` or timer path
`queue_size`	integer	`10`	Input buffer size
`queue_policy`	string	`drop_oldest`	`drop_oldest`: drops oldest message when full. `backpressure`: buffers up to 10x `queue_size` without dropping (drops with ERROR log at hard cap)
`input_timeout`	float	--	Circuit breaker timeout in seconds. If no message arrives within this period, the daemon closes the input and the node receives an `InputClosed` event for graceful degradation
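
The two queue policies can be modeled in a few lines. This Python sketch is only an illustration of the semantics in the table, not the daemon's code: the class is hypothetical, and dropping the oldest message once the backpressure hard cap is reached is an assumption (the table says only that a drop occurs there).

```python
from collections import deque


class InputQueue:
    """Toy model of a per-input buffer with a queue policy."""

    def __init__(self, queue_size: int = 10, policy: str = "drop_oldest"):
        if policy not in ("drop_oldest", "backpressure"):
            raise ValueError(f"unknown queue_policy: {policy}")
        # backpressure buffers up to 10x queue_size before it drops anything
        self.capacity = queue_size * 10 if policy == "backpressure" else queue_size
        self.items = deque()
        self.dropped = 0

    def push(self, message) -> None:
        if len(self.items) >= self.capacity:
            self.items.popleft()  # assumption: the oldest message is the one dropped
            self.dropped += 1
        self.items.append(message)
```

With queue_size 2 and drop_oldest, pushing five messages leaves only the last two. With backpressure the same queue holds 20 messages before the first drop.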

Built-in Timers

Timers are virtual nodes that emit ticks at a fixed interval:

inputs:
  tick: dora/timer/millis/100   # every 100ms
  slow: dora/timer/millis/1000  # every 1s
  fast: dora/timer/hz/30        # 30 Hz (~33ms)
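
The timer paths above map to tick intervals as follows. This is a small Python sketch with a hypothetical helper name. It covers only the two forms shown (millis and hz) and assumes integer values.

```python
def timer_interval_seconds(source: str) -> float:
    """Convert a dora/timer/<unit>/<value> source into a tick interval in seconds."""
    parts = source.split("/")
    if len(parts) != 4 or parts[:2] != ["dora", "timer"]:
        raise ValueError(f"not a timer source: {source!r}")
    unit, value = parts[2], int(parts[3])
    if value <= 0:
        raise ValueError("timer value must be positive")
    if unit == "millis":
        return value / 1000.0
    if unit == "hz":
        return 1.0 / value
    raise ValueError(f"unknown timer unit: {unit!r}")
```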

Built-in Log Aggregation

Subscribe to structured log messages from all (or filtered) nodes:

inputs:
  all_logs: dora/logs               # all nodes, all levels
  errors:   dora/logs/error         # error+ from all nodes
  sensor:   dora/logs/info/sensor   # info+ from specific node

Each message arrives as a JSON-encoded LogMessage string. See Logging for details.

Outputs

A list of output identifiers the node produces:

outputs:
  - processed_image
  - metadata

Type Annotations

Optional type annotations for inputs and outputs. Types are never required – unannotated ports remain fully dynamic.

- id: camera
  path: camera.py
  outputs:
    - image
    - depth
  output_types:
    image: std/media/v1/Image
    depth: std/media/v1/Image

- id: detector
  path: detect.py
  inputs:
    image: camera/image
  input_types:
    image: std/media/v1/Image
  outputs:
    - bbox
  output_types:
    bbox: std/vision/v1/BoundingBox

Field	Type	Default	Description
`output_types`	object	`{}`	Maps output IDs to type URNs. Keys must match entries in `outputs`
`input_types`	object	`{}`	Maps input IDs to expected type URNs. Keys must match entries in `inputs`
`output_metadata`	object	`{}`	Maps output IDs to lists of required metadata keys
`pattern`	string	--	Communication pattern shorthand: `service-server`, `service-client`, `action-server`, `action-client`

Type URNs use the format std/<category>/v<version>/<TypeName> and support parameters (e.g. std/media/v1/AudioFrame[sample_type=f32]). See the Type Annotations Guide for the full standard type library, parameterized types, compatibility rules, and user-defined types.

Run dora validate <file> to check type annotations statically. For runtime checking, set DORA_RUNTIME_TYPE_CHECK=warn or error:

dora validate dataflow.yml
DORA_RUNTIME_TYPE_CHECK=warn dora run dataflow.yml

Types also appear on dora graph edge labels when annotated.

Arrow IPC Framing

Overrides the wire framing for individual outputs. The default is raw (Arrow buffer layout); set an output to arrow-ipc to use the self-describing Arrow IPC stream format.

Field	Type	Default	Description
`output_framing`	map	`{}`	Per-output framing: `raw` (default) or `arrow-ipc`

- id: sensor
  path: ./sensor
  outputs:
    - image
  output_framing:
    image: arrow-ipc

Module Parameters

When using module:, pass configuration values via the params: field:

- id: fast_pipeline
  module: modules/transform.module.yml
  inputs:
    data: sender/value
  params:
    speed: "2.0"
    mode: turbo

Inside the module, params are available as $PARAM_<UPPERCASE_KEY> in args: and as environment variables. See the Modules Guide for full documentation.

Environment

env:
  MY_VAR: "value"          # string
  DEBUG: true               # boolean
  PORT: 8080                # integer
  RATE: 1.5                 # float
  FROM_HOST:
    __dora_env: HOST_VAR   # read from host environment at runtime

Environment variables apply to both build commands and node execution. Values support $VAR expansion syntax.

Logging

Field	Type	Default	Description
`send_stdout_as`	string	--	Route raw stdout/stderr lines as a data output. Each line is sent as a separate Arrow message
`send_logs_as`	string	--	Route structured log entries as a data output. Each entry is a JSON string with fields: `timestamp`, `level`, `node_id`, `message`, `target`, `fields`
`min_log_level`	string	--	Suppress logs below this level from file output, coordinator forwarding, and `send_logs_as`. Levels from most to least verbose: `stdout` (all output including raw stdout), `trace`, `debug`, `info`, `warn`, `error`
`max_log_size`	string	--	Rotate log file at this size (e.g. `"50MB"`, `"1GB"`)
`max_rotated_files`	integer	`5`	Number of rotated log files to keep

Example:

- id: sensor
  path: ./sensor
  min_log_level: info
  send_stdout_as: raw_output
  send_logs_as: log_entries
  max_log_size: "100MB"
  max_rotated_files: 3
  outputs:
    - data
    - raw_output
    - log_entries

When using send_stdout_as or send_logs_as, include the output name in the outputs list so downstream nodes can subscribe to it.

For a complete guide to all logging features, see Logging.

Fault Tolerance

Field	Type	Default	Description
`restart_policy`	string	`never`	`never`, `on-failure`, or `always`
`max_restarts`	integer	`0`	Max restart attempts. 0 = unlimited
`restart_delay`	float	--	Initial backoff in seconds. Doubles each attempt
`max_restart_delay`	float	--	Cap for exponential backoff
`restart_window`	float	--	Time window for counting restarts. The counter resets after this many seconds since the first restart in the current window. Enables “N restarts per M seconds” semantics with `max_restarts`
`health_check_timeout`	float	--	If the node does not communicate with the daemon (send outputs, subscribe, etc.) for this many seconds, the daemon kills the process and evaluates the `restart_policy`

Restart policies:

never (default): no automatic restart
on-failure: restart only on non-zero exit code
always: restart on any exit, except when stopped by user or all inputs closed with success

Example with exponential backoff:

- id: sensor
  path: ./sensor
  restart_policy: on-failure
  max_restarts: 5
  restart_delay: 1.0         # 1s, 2s, 4s, 8s, 16s
  max_restart_delay: 30.0    # capped at 30s
  restart_window: 300.0      # 5 restarts per 5 minutes
  health_check_timeout: 30.0
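
The backoff schedule in the comments above follows from doubling restart_delay on each attempt and capping it at max_restart_delay. A minimal Python sketch, with a hypothetical function name:

```python
def restart_delays(restart_delay: float, max_restart_delay: float, attempts: int) -> list:
    """Return the delay before each restart attempt: doubling, then capped."""
    delays = []
    delay = restart_delay
    for _ in range(attempts):
        delays.append(min(delay, max_restart_delay))
        delay *= 2
    return delays


print(restart_delays(1.0, 30.0, 5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(restart_delays(1.0, 30.0, 7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With max_restarts: 5 the cap is never reached. It only matters from the sixth attempt onward, where the uncapped delay would be 32 seconds.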

Deployment

Use _unstable_deploy to assign nodes to specific machines:

- id: camera-driver
  _unstable_deploy:
    machine: robot-arm
  path: ./target/debug/camera
  outputs:
    - frames

- id: ml-inference
  _unstable_deploy:
    machine: gpu-server
    labels:
      gpu: "true"
    distribute: scp
  path: ./target/debug/inference
  inputs:
    frames: camera-driver/frames

Deploy field	Type	Default	Description
`machine`	string	--	Target machine/daemon ID. The coordinator routes the node to the daemon registered with this ID
`working_dir`	string	--	Working directory on the target machine
`labels`	object	--	Key-value labels for scheduling. The coordinator matches these against labels reported by each daemon at registration
`distribute`	string	`local`	How built binaries reach the target daemon: `local` – each daemon builds from source independently; `scp` – CLI pushes the built binary via SSH/SCP before spawn; `http` – daemon pulls the binary from the coordinator’s HTTP artifact store

When nodes are on different machines, communication automatically switches from shared memory to Zenoh pub/sub.

CPU Affinity

Pin a node’s process to specific CPU cores (Linux only).

Field	Type	Default	Description
`cpu_affinity`	list of int	--	CPU core indices to pin the node process to

- id: controller
  path: ./controller
  cpu_affinity: [0, 1]

Operator Nodes

Operators run in-process inside a shared runtime (no separate process). Use operator for a single operator or operators for multiple.

Single Operator

The id field is optional for single operators (defaults to the node id):

- id: detector
  operator:
    python: detect.py
    build: pip install -r requirements.txt
    inputs:
      image: camera/frames
    outputs:
      - bbox

Multiple Operators

Each operator in operators requires a unique id:

- id: runtime-node
  operators:
    - id: preprocessor
      shared-library: ../../target/debug/libpreprocess
      inputs:
        raw: sensor/data
      outputs:
        - processed
    - id: analyzer
      shared-library: ../../target/debug/libanalyze
      inputs:
        data: runtime-node/preprocessor/processed
      outputs:
        - result

Operator Source Types

Field	Description
`python`	Python script path, or `{source: "script.py", conda_env: "myenv"}`
`shared-library`	Path to a shared library (`.so`/`.dylib`/`.dll`)

Operators also support inputs, outputs, build, send_stdout_as, send_logs_as, min_log_level, max_log_size, and max_rotated_files with the same semantics as node-level fields.

ROS2 Bridge

Declare a node as a ROS2 bridge to automatically convert between ROS2 DDS messages and Dora’s Arrow format. No custom code needed.

Single Topic

- id: camera_bridge
  ros2:
    topic: /camera/image_raw
    message_type: sensor_msgs/Image
    direction: subscribe
  outputs:
    - image

Multiple Topics

- id: robot_bridge
  ros2:
    topics:
      - topic: /camera/image_raw
        message_type: sensor_msgs/Image
        direction: subscribe
        output: image
      - topic: /cmd_vel
        message_type: geometry_msgs/Twist
        direction: publish
        input: velocity
    qos:
      reliable: true
  inputs:
    velocity: planner/cmd_vel
  outputs:
    - image

Service Bridge

- id: add_service
  ros2:
    service: /add_two_ints
    service_type: example_interfaces/AddTwoInts
    role: server
  inputs:
    request: client_node/request
  outputs:
    - response

Action Bridge

- id: nav_action
  ros2:
    action: /navigate
    action_type: nav2_msgs/NavigateToPose
    role: client
  inputs:
    goal: planner/goal
  outputs:
    - feedback
    - result

QoS Configuration

QoS can be set at the bridge level (applies to all topics) or per-topic:

QoS field	Type	Default	Description
`reliable`	bool	`false`	Reliable vs best-effort transport
`durability`	string	`volatile`	`volatile` or `transient_local`
`liveliness`	string	`automatic`	`automatic`, `manual_by_participant`, `manual_by_topic`
`lease_duration`	float	infinity	Lease duration in seconds
`max_blocking_time`	float	--	Max blocking time for reliable transport
`keep_last`	integer	`1`	History depth (KeepLast policy)
`keep_all`	bool	`false`	Use KeepAll history instead of KeepLast

Other ROS2 Fields

Field	Type	Default	Description
`namespace`	string	`/`	ROS2 namespace
`node_name`	string	node `id`	ROS2 node name

Debug

_unstable_debug:
  enable_debug_inspection: true

Required for dora topic echo, dora topic hz, and dora topic info commands.

Communication Patterns

Dora supports four communication patterns built on top of the dataflow:

Topic (default): pub/sub dataflow
Service: request/reply via request_id metadata
Action: goal/feedback/result via goal_id/goal_status metadata, with cancellation support
Streaming: session/segment/chunk via session_id/segment_id/seq/fin/flush metadata, with queue flush for interruption

See Communication Patterns for details and examples.

Full Example

health_check_interval: 10.0

_unstable_debug:
  enable_debug_inspection: true

nodes:
  - id: webcam
    operator:
      python: webcam.py
      inputs:
        tick: dora/timer/millis/100
      outputs:
        - image

  - id: detector
    operator:
      python: detect.py
      build: pip install ultralytics
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: plotter
    operator:
      python: plot.py
      inputs:
        image: webcam/image
        bbox: detector/bbox

  - id: logger
    path: ./logger
    inputs:
      bbox: detector/bbox
    send_stdout_as: logs
    min_log_level: info
    restart_policy: on-failure
    max_restarts: 3
    outputs:
      - logs

Type Annotations

Optional type annotations on dataflow inputs and outputs. Types are never required – unannotated ports remain fully dynamic. Type checking runs at build time and validate time (no runtime overhead by default).

Quick Start

nodes:
  - id: camera
    path: camera.py
    outputs:
      - image
    output_types:
      image: std/media/v1/Image

  - id: detector
    path: detect.py
    inputs:
      image: camera/image
    input_types:
      image: std/media/v1/Image
    outputs:
      - bbox
    output_types:
      bbox: std/vision/v1/BoundingBox

Validate with:

dora validate dataflow.yml

# Fail with non-zero exit code on warnings (for CI)
dora validate --strict-types dataflow.yml

# Type checks also run during build
dora build dataflow.yml --strict-types

You can also set strict_types: true at the top level of the YAML to enable strict mode without the CLI flag:

strict_types: true
nodes:
  # ...

Type URN Format

Type URNs follow the pattern std/<category>/v<version>/<TypeName>:

std/core/v1/Float32
std/media/v1/Image
std/vision/v1/BoundingBox

Parameterized Types

Some struct types accept parameters to distinguish variants:

std/media/v1/AudioFrame[sample_type=f32]
std/media/v1/AudioFrame[sample_type=f32,channels=2]

Matching rules:

Same base + same params -> compatible
Same base + one side unparameterized -> compatible (wildcard)
Same base + different param values -> mismatch

# These are compatible (wildcard):
output_types:
  audio: std/media/v1/AudioFrame[sample_type=f32]
input_types:
  audio: std/media/v1/AudioFrame

# These are a mismatch:
output_types:
  audio: std/media/v1/AudioFrame[sample_type=f32]
input_types:
  audio: std/media/v1/AudioFrame[sample_type=i16]
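
The matching rules for parameterized types can be sketched as a parser plus a comparison. This Python sketch uses hypothetical function names and is not the validator's code. It looks only at the parameter rules (widening and user-defined rules are out of scope), and comparing only the keys that both sides declare is an assumption, since the rules above do not cover partially overlapping parameter sets.

```python
def parse_type_urn(urn: str):
    """Split 'base[key=value,...]' into (base, params)."""
    base, bracket, rest = urn.partition("[")
    params = {}
    if bracket:
        if not rest.endswith("]"):
            raise ValueError(f"unterminated parameter list in {urn!r}")
        for pair in rest[:-1].split(","):
            key, _, value = pair.partition("=")
            params[key.strip()] = value.strip()
    return base, params


def params_compatible(output_urn: str, input_urn: str) -> bool:
    """Apply the wildcard and mismatch rules to two URNs."""
    out_base, out_params = parse_type_urn(output_urn)
    in_base, in_params = parse_type_urn(input_urn)
    if out_base != in_base:
        return False
    if not out_params or not in_params:
        return True  # one side unparameterized: wildcard
    shared = out_params.keys() & in_params.keys()
    return all(out_params[key] == in_params[key] for key in shared)
```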

Standard Type Library

`std/core/v1`

Type	Arrow Type	Description
`Float32`	Float32	32-bit float
`Float64`	Float64	64-bit float
`Int32`	Int32	32-bit signed integer
`Int64`	Int64	64-bit signed integer
`UInt8`	UInt8	8-bit unsigned integer
`UInt32`	UInt32	32-bit unsigned integer
`UInt64`	UInt64	64-bit unsigned integer
`String`	Utf8	UTF-8 string
`Bytes`	LargeBinary	Raw bytes (universal sink – any type is compatible)
`Bool`	Boolean	Boolean

`std/math/v1`

Type	Arrow Type	Fields	Description
`Vector3`	Struct	x, y, z (Float64)	3D vector
`Quaternion`	Struct	x, y, z, w (Float64)	Quaternion
`Pose`	Struct	position, orientation	6-DOF pose
`Transform`	Struct	translation, rotation	Coordinate transform

`std/control/v1`

Type	Arrow Type	Description
`Twist`	Struct	Linear and angular velocity
`JointState`	Struct	Joint positions, velocities, efforts
`Odometry`	Struct	Pose + Twist in a reference frame

`std/media/v1`

Type	Arrow Type	Parameters	Description
`Image`	Struct	`encoding`	Raw image (width, height, encoding, data)
`CompressedImage`	LargeBinary	`format`	JPEG/PNG compressed image
`PointCloud`	Struct	`point_type`	3D point cloud
`AudioFrame`	Struct	`sample_type` (default: f32)	Audio samples

`std/vision/v1`

Type	Arrow Type	Description
`BoundingBox`	Struct	2D bounding box with confidence and label
`Detection`	Struct	Object detection result (list of BoundingBox)
`Segmentation`	Struct	Pixel-level segmentation mask

Validation Rules

dora validate and dora build check:

Key existence: output_types keys must appear in outputs, input_types keys must appear in inputs
URN resolution: All type URNs must exist in the standard or user-defined type library. Typos get “did you mean?” suggestions.
Edge compatibility: Connected edges must have compatible types (exact match, implicit widening, or user-defined rules)
Timer auto-typing: Timer inputs (dora/timer/*) are automatically typed as std/core/v1/UInt64
Type inference: When only the upstream side annotates a type, it is inferred on the downstream input and reported
Parameterized types: Parameter mismatches are detected (see above)
Metadata patterns: output_metadata keys and pattern shorthands are validated (see below)
Schema compatibility: Struct types are checked at the field level – missing fields or wrong field types are flagged

All checks produce warnings (non-fatal by default). Use --strict-types to treat warnings as errors for CI pipelines.

Type warnings:
  - node "camera": output_types key "framez" not found in outputs list
  - node "detector": unknown type "std/vision/v1/BoundingBx" on output "bbox"
    (did you mean "std/vision/v1/BoundingBox"?)
  - node "detector": type mismatch on input "image": upstream camera/image
    declares "std/core/v1/Bytes", but expected "std/media/v1/Image"

Inferred types:
  inferred std/core/v1/Float64 on processor/reading (from sensor/reading)

Type Compatibility Rules

Beyond exact matching, the type checker supports implicit widening conversions:

From	To
`UInt8`	`UInt32`
`UInt32`	`UInt64`
`Int32`	`Int64`
`Float32`	`Float64`
Any type	`Bytes` (universal sink)

Widening is transitive up to depth 3 (e.g. UInt8 -> UInt32 -> UInt64 works, but chains of 4+ do not).
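
The widening table and its depth limit can be illustrated with a short lookup loop. This Python sketch is not the type checker itself: the names are hypothetical, and reading "depth 3" as at most three widening steps is an assumption (the built-in rules never chain more than two steps, so either reading gives the same answers here).

```python
# Built-in widening rules from the table above (short type names for brevity).
WIDENING = {
    "UInt8": "UInt32",
    "UInt32": "UInt64",
    "Int32": "Int64",
    "Float32": "Float64",
}
MAX_DEPTH = 3  # assumption: counted in widening steps


def can_widen(src: str, dst: str) -> bool:
    """Return True if src is accepted where dst is expected."""
    if src == dst or dst == "Bytes":  # exact match, or the universal sink
        return True
    current = src
    for _ in range(MAX_DEPTH):
        current = WIDENING.get(current)
        if current is None:
            return False
        if current == dst:
            return True
    return False
```

Each built-in type has at most one widening target, so a plain dictionary is enough. User-defined rules with several targets per type would need a bounded graph search instead.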

User-Defined Compatibility Rules

Add custom rules in the dataflow YAML:

type_rules:
  - from: myproject/SensorV1
    to: myproject/SensorV2

nodes:
  # ...

Metadata Patterns

Nodes that implement communication patterns (services, actions) can declare required metadata keys on their outputs.

Explicit metadata

- id: server
  path: server.py
  outputs:
    - response
  output_metadata:
    response: [request_id]

Pattern shorthand

Use the pattern field to auto-imply required metadata keys:

- id: server
  path: server.py
  pattern: service-server
  outputs:
    - response

Pattern	Required metadata keys
`service-server`	`request_id`
`service-client`	`request_id`
`action-server`	`goal_id`, `goal_status`
`action-client`	`goal_id`
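
The shorthand table amounts to a lookup from pattern name to required keys. The Python sketch below shows how a message's metadata could be checked against it. The helper is hypothetical and only illustrates the mapping, not how dora validate applies it.

```python
REQUIRED_KEYS = {
    "service-server": ["request_id"],
    "service-client": ["request_id"],
    "action-server": ["goal_id", "goal_status"],
    "action-client": ["goal_id"],
}


def missing_metadata(pattern: str, metadata: dict) -> list:
    """Return the required keys for the pattern that are absent from metadata."""
    if pattern not in REQUIRED_KEYS:
        raise ValueError(f"unknown pattern: {pattern!r}")
    return [key for key in REQUIRED_KEYS[pattern] if key not in metadata]
```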

User-Defined Types

Projects can define custom types in a types/ directory next to the dataflow. The directory structure determines the URN prefix:

project/
  dataflow.yml
  types/
    myproject/
      sensors/
        v1.yml    # URN prefix: myproject/sensors/v1

Type YAML files use the same format as the standard library:

types:
  MySensor:
    arrow: Struct
    description: Custom sensor reading
    fields:
      - name: temperature
        type: Float32
      - name: humidity
        type: Float32

This creates the URN myproject/sensors/v1/MySensor.

The std/ prefix is reserved and cannot be used for user types.

User types are loaded automatically by dora validate and dora build when a types/ directory exists.
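
The mapping from directory layout to URN can be sketched as a path computation. This Python sketch uses a hypothetical function name and takes the layout and the reserved std/ prefix from the description above; it is not the loader's implementation.

```python
from pathlib import PurePosixPath


def type_urn(types_dir: str, type_file: str, type_name: str) -> str:
    """Derive '<dirs>/<file stem>/<TypeName>' from a file under the types/ directory."""
    relative = PurePosixPath(type_file).relative_to(types_dir)
    prefix = relative.with_suffix("")  # drop the .yml extension
    if prefix.parts[0] == "std":
        raise ValueError("the std/ prefix is reserved for the standard library")
    return f"{prefix.as_posix()}/{type_name}"


print(type_urn("project/types", "project/types/myproject/sensors/v1.yml", "MySensor"))
# myproject/sensors/v1/MySensor
```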

Runtime Type Checking

In addition to static validation, Dora supports optional runtime type checking on send_output(). When enabled, the actual Arrow data type is compared against the declared output_types at send time.

Enable via environment variable:

# Warn on mismatches (log and continue)
DORA_RUNTIME_TYPE_CHECK=warn dora run dataflow.yml

# Error on mismatches (node returns error)
DORA_RUNTIME_TYPE_CHECK=error dora run dataflow.yml

Valid values: 1, warn, true (warn mode), error (error mode). Unset or any other value disables checking (zero overhead).

Scope:

Validates output_types on the sender side (send_output() calls). input_types are checked statically by dora validate but not enforced at runtime
Covers all languages that send Arrow arrays (Rust, Python, C++ Arrow path)
Raw byte sends (send_output_bytes, C nodes) are untyped and skip checking
Complex types (Struct-based: Image, Vector3, etc.) are skipped – only primitive types, String, Bytes, and Bool are validated at runtime
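
The accepted values of DORA_RUNTIME_TYPE_CHECK can be summarized as a small parser. This Python sketch is illustrative only: the function name is hypothetical, and treating the values as case-sensitive is an assumption.

```python
import os


def runtime_type_check_mode(environ=None):
    """Return 'warn', 'error', or None (disabled) for the current environment."""
    environ = os.environ if environ is None else environ
    value = environ.get("DORA_RUNTIME_TYPE_CHECK")
    if value in ("1", "warn", "true"):
        return "warn"
    if value == "error":
        return "error"
    return None  # unset or any other value: checking is disabled
```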

Graph Visualization

When outputs have type annotations, dora graph shows the type on edge labels:

dora graph dataflow.yml --open

Edges display as output_name [TypeName] (e.g. image [Image]).

Operators

Operators support the same output_types, input_types, output_metadata, and pattern fields:

- id: runtime-node
  operators:
    - id: preprocessor
      python: preprocess.py
      inputs:
        raw: sensor/data
      input_types:
        raw: std/core/v1/Bytes
      outputs:
        - processed
      output_types:
        processed: std/media/v1/Image

Modules (Reusable Sub-Dataflows)

Modules let you define reusable sub-graphs of nodes in separate YAML files and compose them into larger dataflows. Modules are expanded at compile time – the runtime never sees them.

Quick Start

Module file (modules/transform_module.yml):

module:
  name: transform_pipeline
  inputs: [raw_data]
  outputs: [filtered]

nodes:
  - id: doubler
    path: doubler.py
    inputs:
      data: _mod/raw_data
    outputs:
      - doubled

  - id: filter
    path: filter_even.py
    inputs:
      data: doubler/doubled
    outputs:
      - filtered

Dataflow file (dataflow.yml):

nodes:
  - id: sender
    path: sender.py
    outputs:
      - value

  - id: pipeline
    module: modules/transform_module.yml
    inputs:
      raw_data: sender/value

  - id: receiver
    path: receiver.py
    inputs:
      filtered: pipeline/filtered

After expansion, pipeline becomes two nodes: pipeline.doubler and pipeline.filter, with all wiring resolved automatically.

Module Definition File

A module file has two sections:

`module:` header

Field	Type	Required	Description
`name`	string	yes	Module name (metadata only)
`inputs`	list	no	Required input port names
`inputs_optional`	list	no	Optional input ports (silently skipped if not wired)
`outputs`	list	no	Output port names exposed to the parent dataflow

`nodes:` list

Standard node definitions, with one special syntax: _mod/port_name references a module input port. When expanded, _mod/port_name is replaced with whatever the parent wired to that port.

module:
  name: my_module
  inputs: [camera_feed]
  outputs: [detections]

nodes:
  - id: detector
    path: detect.py
    inputs:
      image: _mod/camera_feed    # resolved to parent's wiring
    outputs:
      - detections

Module-level build

Modules can have a top-level build: command that runs before any inner node builds:

module:
  name: ml_pipeline
  inputs: [image]
  outputs: [result]

build: pip install -r requirements.txt

nodes:
  - id: model
    path: model.py
    inputs:
      image: _mod/image
    outputs:
      - result

Using Modules

Reference a module in a dataflow node by using the module: field instead of the path: field:

- id: nav_stack
  module: modules/navigation.module.yml
  inputs:
    goal_pose: localization/goal

The module node’s inputs: map wires parent outputs to module input ports. External nodes reference module outputs as <module_id>/<output_name> (e.g., nav_stack/cmd_vel).

Parameters

Pass configuration values to modules via the params: field:

- id: fast_pipeline
  module: modules/transform_module.yml
  inputs:
    raw_data: sender/value
  params:
    speed: "2.0"
    mode: turbo

Inside the module, reference params in args: using $PARAM_<UPPERCASE_KEY>:

nodes:
  - id: processor
    path: processor.py
    args: --speed $PARAM_SPEED --mode $PARAM_MODE
    inputs:
      data: _mod/raw_data
    outputs:
      - result

Parameters are also injected as environment variables (PARAM_SPEED, PARAM_MODE) into every node inside the module.
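
The substitution described above can be imitated with the standard library's string.Template, whose $NAME placeholder syntax matches the $PARAM_<UPPERCASE_KEY> form. This Python sketch uses hypothetical helper names and only mirrors the behavior described here, not Dora's expansion code.

```python
from string import Template


def param_env(params: dict) -> dict:
    """Map each params key to its PARAM_<UPPERCASE_KEY> environment variable."""
    return {f"PARAM_{key.upper()}": str(value) for key, value in params.items()}


def expand_args(args: str, params: dict) -> str:
    """Replace $PARAM_* placeholders in an args string; leave unknown ones untouched."""
    return Template(args).safe_substitute(param_env(params))


print(expand_args("--speed $PARAM_SPEED --mode $PARAM_MODE", {"speed": "2.0", "mode": "turbo"}))
# --speed 2.0 --mode turbo
```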

Expansion Rules

Load the module YAML file and validate its header
Prefix all internal node IDs with the module node's ID followed by a dot, i.e. {module_id}.<node_id> (e.g., nav_stack.planner)
Replace _mod/port_name references with the actual sources from the parent’s input map
Rewrite internal cross-references (e.g., planner/path becomes nav_stack.planner/path)
Map module-declared outputs to internal node outputs, so nav_stack/cmd_vel resolves to nav_stack.controller/cmd_vel
Replace the module node with the expanded flat nodes
Substitute params: values in args: fields and inject as env vars

Use dora expand to see the result:

dora expand dataflow.yml
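
The ID prefixing and reference rewriting steps can be sketched on plain dictionaries. This Python sketch is a simplified illustration with a hypothetical function name. It covers only the _mod/ replacement, internal cross-reference rewriting, and ID prefixing, and leaves out output mapping, parameters, and nested modules.

```python
def expand_module(module_id: str, wiring: dict, nodes: list, optional=()) -> list:
    """Flatten a module's nodes into the parent dataflow's namespace.

    wiring maps module input ports to parent sources (the module node's inputs: map).
    """
    internal_ids = {node["id"] for node in nodes}
    expanded = []
    for node in nodes:
        inputs = {}
        for name, source in node.get("inputs", {}).items():
            head, _, rest = source.partition("/")
            if head == "_mod":
                if rest in wiring:
                    inputs[name] = wiring[rest]  # use the parent's source
                elif rest not in optional:
                    raise KeyError(f"required module input {rest!r} is not wired")
                # unwired optional inputs are omitted
            elif head in internal_ids:
                inputs[name] = f"{module_id}.{head}/{rest}"  # internal cross-reference
            else:
                inputs[name] = source
        expanded.append({**node, "id": f"{module_id}.{node['id']}", "inputs": inputs})
    return expanded
```

Applied to the Quick Start module with pipeline as the module ID and raw_data wired to sender/value, this yields pipeline.doubler reading sender/value and pipeline.filter reading pipeline.doubler/doubled.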

Nested Modules

Modules can reference other modules. The expansion is recursive with a depth limit of 8 levels:

# outer_module.yml
module:
  name: outer
  inputs: [data]
  outputs: [result]

nodes:
  - id: inner
    module: inner_module.yml
    inputs:
      raw: _mod/data

  - id: postprocess
    path: postprocess.py
    inputs:
      data: inner/processed
    outputs:
      - result

After expansion, node IDs are fully qualified: outer.inner.some_node.

Optional Inputs

Declare inputs as optional when a module should work with or without certain connections:

module:
  name: flexible_processor
  inputs: [data]
  inputs_optional: [config]
  outputs: [result]

nodes:
  - id: processor
    path: processor.py
    inputs:
      data: _mod/data
      config: _mod/config    # silently dropped if not wired
    outputs:
      - result

When the parent doesn’t wire config, the input is simply omitted from the expanded node.

Visualization

dora graph renders module boundaries as Mermaid subgraphs, making it easy to see which nodes came from which module:

dora graph dataflow.yml --open

Validation

Validate a standalone module file without a full dataflow:

dora expand --module modules/transform_module.yml

This checks:

Valid YAML structure
Module header is present with name, inputs, outputs
All _mod/ references correspond to declared inputs or optional inputs
No duplicate node IDs
Internal wiring is consistent

Security

Path confinement: Module file paths must resolve within the dataflow’s base directory. Absolute paths and directory traversal (../) outside the base are rejected.
File size limit: Module files are capped at 1 MB.
Depth limit: Recursive nesting is capped at 8 levels.
Param key validation: Parameter keys must be alphanumeric with underscores only.
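
Path confinement and param key validation can be approximated with pathlib and a regular expression. This Python sketch (3.9+ for is_relative_to) uses hypothetical function names. It rejects any path that resolves outside the base directory, whereas a stricter implementation might reject every absolute path outright.

```python
import re
from pathlib import Path


def resolve_module_path(base_dir: str, module_path: str) -> Path:
    """Resolve module_path against base_dir and refuse anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / module_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"module path escapes base directory: {module_path!r}")
    return candidate


def is_valid_param_key(key: str) -> bool:
    """Alphanumeric characters and underscores only, non-empty."""
    return re.fullmatch(r"[A-Za-z0-9_]+", key) is not None
```

Resolving both paths before comparing them is what catches ../ segments and symlinks that a plain string prefix check would miss.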

Example

See examples/module-dataflow/ for a complete working example with a sender, transform module (doubler + filter), and receiver.

dora run examples/module-dataflow/dataflow.yml

Communication Patterns

Dora is a dataflow framework based on pub/sub message passing. On top of basic topics, the framework supports service (request/reply), action (goal/feedback/result), and streaming (session/segment/chunk) patterns using well-known metadata keys. No changes to the daemon, coordinator, or YAML syntax are required – the patterns are implemented as conventions at the node API level.

1. Topic (pub/sub)

The default pattern. A node publishes data on an output, and any node that subscribes to that output receives it.

nodes:
  - id: publisher
    outputs:
      - data
  - id: subscriber
    inputs:
      data: publisher/data

Use when: streaming sensor data, periodic status, fire-and-forget events.

2. Service (request/reply)

A client sends a request and expects exactly one response, correlated by a request_id metadata key.

Well-known metadata keys

Key	Constant	Description
`request_id`	`dora_node_api::REQUEST_ID`	UUID v7 correlating request and response

YAML

nodes:
  - id: client
    inputs:
      tick: dora/timer/millis/500
      response: server/response
    outputs:
      - request

  - id: server
    inputs:
      request: client/request
    outputs:
      - response

Node API helpers

// Client: send request with auto-generated request_id
let rid = node.send_service_request("request".into(), params, data)?;

// Server: pass through metadata.parameters (includes request_id)
node.send_service_response("response".into(), metadata.parameters, result)?;

The server MUST pass through the request_id from the incoming request’s metadata parameters into the response. The client matches responses to requests using this key.
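
The correlation rule can be modeled without the node API. The Python sketch below is a toy model, not Dora code: the class and function names are hypothetical, metadata is a plain dictionary, and uuid4 stands in for the UUID v7 that the table specifies.

```python
import uuid


class PendingRequests:
    """Client side: remember outstanding requests by request_id."""

    def __init__(self):
        self._pending = {}

    def new_request(self, payload) -> dict:
        request_id = str(uuid.uuid4())  # stand-in for a UUID v7
        self._pending[request_id] = payload
        return {"request_id": request_id}

    def match_response(self, metadata: dict):
        """Return the original payload for this response, or None if unknown."""
        return self._pending.pop(metadata.get("request_id"), None)


def handle_request(metadata: dict, payload: int):
    """Server side: copy the incoming metadata through so request_id is preserved."""
    return dict(metadata), payload * 2
```

If the server dropped or regenerated request_id, match_response would return None and the client could not tell which request the response answers.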

Example: examples/service-example/

3. Action (goal/feedback/result)

A client sends a goal and receives periodic feedback plus a final result. Actions support cancellation.

Well-known metadata keys

Key	Constant	Description
`goal_id`	`dora_node_api::GOAL_ID`	UUID v7 identifying the goal
`goal_status`	`dora_node_api::GOAL_STATUS`	Final status of the goal

Goal status values:

Value	Constant	Meaning
`succeeded`	`GOAL_STATUS_SUCCEEDED`	Goal completed successfully
`aborted`	`GOAL_STATUS_ABORTED`	Goal aborted by server
`canceled`	`GOAL_STATUS_CANCELED`	Goal canceled by client

YAML

nodes:
  - id: client
    inputs:
      tick: dora/timer/millis/2000
      feedback: server/feedback
      result: server/result
    outputs:
      - goal
      - cancel

  - id: server
    inputs:
      goal: client/goal
      cancel: client/cancel
    outputs:
      - feedback
      - result

Cancel pattern

The client sends a message on the cancel output with goal_id in the metadata. The server checks for cancel requests between processing steps and sends a result with goal_status = "canceled".
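
A minimal sketch of that server-side check, in plain Python rather than the dora API. `run_goal` and `cancel_requests` are hypothetical names; in a real node the cancel set would be filled from events arriving on the cancel input.

```python
def run_goal(goal_id, steps, cancel_requests):
    """Yield (output_id, metadata, payload) tuples for one goal.

    `cancel_requests` is a set of goal IDs for which a cancel message has
    arrived; it is checked between processing steps.
    """
    for done in range(steps):
        if goal_id in cancel_requests:
            yield "result", {"goal_id": goal_id, "goal_status": "canceled"}, None
            return
        # ... one unit of real work would happen here ...
        yield "feedback", {"goal_id": goal_id}, (done + 1) / steps
    yield "result", {"goal_id": goal_id, "goal_status": "succeeded"}, "done"


cancels = set()
outputs = []
for message in run_goal("goal-1", steps=4, cancel_requests=cancels):
    outputs.append(message)
    if len(outputs) == 2:
        cancels.add("goal-1")  # the client cancels after two feedback messages
```

Because the check runs only between steps, a long step delays cancellation; keeping steps short keeps the action responsive.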

Example: examples/action-example/

4. Streaming (session/segment/chunk)

For real-time pipelines (voice, video, sensor streams) where a user can interrupt mid-stream and queued data must be discarded.

Well-known metadata keys

Key	Type	Constant	Description
`session_id`	String	`SESSION_ID`	Identifies the conversation/session
`segment_id`	Integer	`SEGMENT_ID`	Logical unit within a session (e.g. one utterance)
`seq`	Integer	`SEQ`	Chunk sequence number within a segment
`fin`	Bool	`FIN`	`true` on the last chunk of a segment
`flush`	Bool	`FLUSH`	`true` to discard older queued messages on this input
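
To show how these keys relate to one another, here is a hedged sketch of a segment builder in plain Python. `SegmentSketch` is a hypothetical stand-in, not the `StreamSegment` type; in particular, the assumption that `flush()` starts a new segment with `seq` reset to 0 is an illustration, not confirmed behavior.

```python
import uuid


class SegmentSketch:
    """Hypothetical stand-in for a stream segment builder (not the dora type)."""

    def __init__(self):
        self.session_id = str(uuid.uuid4())
        self.segment_id = 0
        self.seq = 0

    def chunk(self, fin=False):
        """Metadata for the next chunk; seq auto-increments within a segment."""
        params = {
            "session_id": self.session_id,
            "segment_id": self.segment_id,
            "seq": self.seq,
            "fin": fin,
        }
        self.seq += 1
        if fin:
            self._next_segment()
        return params

    def flush(self):
        """Abandon the current segment and open a new one with flush=True."""
        self._next_segment()
        params = self.chunk()
        params["flush"] = True
        return params

    def _next_segment(self):
        self.segment_id += 1
        self.seq = 0


seg = SegmentSketch()
first = seg.chunk()
second = seg.chunk()
interrupt = seg.flush()      # user interrupted: segment 0 never sees fin=True
after = seg.chunk(fin=True)  # segment 1 ends normally
```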

YAML

nodes:
  - id: asr
    inputs:
      mic: mic-source/audio
    outputs:
      - text

  - id: llm
    inputs:
      text: asr/text
    outputs:
      - tokens

  - id: tts
    inputs:
      tokens: llm/tokens
    outputs:
      - audio

Node API

use dora_node_api::{StreamSegment, DoraNode};

let mut seg = StreamSegment::new();

// Send chunks with auto-incrementing seq (e.g. inside an ASR node)
node.send_stream_chunk("text".into(), &mut seg, false, chunk_data)?;
// Mark final chunk of a segment
node.send_stream_chunk("text".into(), &mut seg, true, last_chunk)?;

// On user interruption: flush downstream queues and start a new segment.
// The prior segment ends without a fin=true signal -- old data is discarded.
let flush_params = seg.flush();
node.send_output("text".into(), flush_params, empty_data)?;

Queue flush behavior

When a message arrives with flush: true in its metadata, the receiver’s input queue is cleared of all older messages before the flush message is delivered. This enables instant interruption in voice pipelines – when the user speaks over TTS output, the ASR node sends a new segment with flush: true, and the TTS node immediately discards any queued audio chunks from the previous response.

Note: flush discards all queued messages on the input regardless of session_id. Do not multiplex independent sessions on a single input when using flush.
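
The flush rule can be sketched with a plain queue. This is an illustration of the documented behavior, not the daemon's implementation; `deliver` and `tts_input` are hypothetical names.

```python
from collections import deque


def deliver(queue, message):
    """Append `message` to an input queue, honouring the flush flag."""
    if message["metadata"].get("flush"):
        queue.clear()  # every older message goes, whatever its session_id
    queue.append(message)


tts_input = deque()
for seq in range(3):
    deliver(tts_input, {"metadata": {"segment_id": 1, "seq": seq}, "data": f"old-{seq}"})
# The user speaks over the output: the next segment arrives with flush set.
deliver(tts_input, {"metadata": {"segment_id": 2, "seq": 0, "flush": True}, "data": "new-0"})
remaining = [m["data"] for m in tts_input]
```

Only the flush message itself survives, which is why independent sessions sharing one input would lose each other's data.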

Python

# Streaming metadata is a plain dict
params = {
    "session_id": session_id,
    "segment_id": 1,
    "seq": 0,
    "fin": False,
    "flush": True,  # flush older queued messages
}
node.send_output("text", data, metadata=params)

5. Choosing a pattern

Need a response?	Long-running?	Cancelable?	Real-time stream?	Pattern
No	-	-	No	Topic
Yes	No	No	No	Service
Yes	Yes	Optional	No	Action
No	Yes	Via flush	Yes	Streaming

6. Important details

goal_status matching is case-sensitive. Always use the exact lowercase values: "succeeded", "aborted", "canceled". The ROS2 bridge defaults to Aborted for unrecognised values.

7. Python compatibility

Python nodes use the same metadata conventions. Parameters are plain dicts with string keys:

import uuid

# Service client (uuid7 for time-ordered IDs, matching Rust API)
params = {"request_id": str(uuid.uuid7())}
node.send_output("request", data, metadata=params)

# Service server -- pass through parameters
node.send_output("response", result, metadata=event["metadata"])

Note: uuid.uuid7() requires Python 3.13+. On older versions, use the uuid_utils package or uuid.uuid4() (random v4 also works for correlation, but loses time-ordering).

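Rust API Reference

This document covers the two main Rust crates for building Dora dataflow components:

dora-node-api – for standalone node executables
dora-operator-api – for in-process operators managed by the Dora runtime

Node API (`dora-node-api`)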

Add to your Cargo.toml:

[dependencies]
dora-node-api = { workspace = true }

DoraNode

The primary struct for sending outputs and retrieving node information. Obtained through one of the initialization functions below.

Initialization

#![allow(unused)]
fn main() {
// Recommended: auto-detect environment (daemon, testing, or interactive).
pub fn init_from_env() -> NodeResult<(Self, EventStream)>

// Same as init_from_env but errors instead of falling back to interactive mode.
pub fn init_from_env_force() -> NodeResult<(Self, EventStream)>

// For dynamic nodes: connect to the daemon by node ID.
pub fn init_from_node_id(node_id: NodeId) -> NodeResult<(Self, EventStream)>

// Try init_from_env first; fall back to init_from_node_id.
pub fn init_flexible(node_id: NodeId) -> NodeResult<(Self, EventStream)>

// Standalone interactive mode (prompts for inputs on the terminal).
pub fn init_interactive() -> NodeResult<(Self, EventStream)>

// Integration test mode with synthetic inputs/outputs.
pub fn init_testing(
    input: TestingInput,
    output: TestingOutput,
    options: TestingOptions,
) -> NodeResult<(Self, EventStream)>
}

init_from_env is the recommended entry point. It checks, in order:

Thread-local testing state set by setup_integration_testing
DORA_NODE_CONFIG environment variable (set by the daemon)
DORA_TEST_WITH_INPUTS environment variable (file-based integration testing)
Interactive terminal fallback (only if stdin is a TTY)
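
The detection order above can be sketched as a simple decision function. This is an illustration in Python, not the crate's code; `detect_mode` and the returned mode names are hypothetical.

```python
def detect_mode(has_thread_local_testing, env, stdin_is_tty):
    """Return which initialization path would be taken, in documented order."""
    if has_thread_local_testing:
        return "integration-testing"
    if "DORA_NODE_CONFIG" in env:
        return "daemon"
    if "DORA_TEST_WITH_INPUTS" in env:
        return "file-based-testing"
    if stdin_is_tty:
        return "interactive"
    raise RuntimeError("no way to initialize the node")


# A node spawned by the daemon takes the daemon path even when the
# file-based testing variable is also present.
mode = detect_mode(
    False,
    {"DORA_NODE_CONFIG": "...", "DORA_TEST_WITH_INPUTS": "in.json"},
    False,
)
```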

Sending Outputs

All send methods silently ignore output IDs not declared in the dataflow YAML.

#![allow(unused)]
fn main() {
// Send an Arrow array. Copies data into shared memory when beneficial.
pub fn send_output(
    &mut self,
    output_id: DataId,
    parameters: MetadataParameters,
    data: impl Array,
) -> NodeResult<()>

// Send raw bytes. Copies into shared memory when beneficial.
pub fn send_output_bytes(
    &mut self,
    output_id: DataId,
    parameters: MetadataParameters,
    data_len: usize,
    data: &[u8],
) -> NodeResult<()>

// Send raw bytes via a closure for zero-copy writing.
pub fn send_output_raw<F>(
    &mut self,
    output_id: DataId,
    parameters: MetadataParameters,
    data_len: usize,
    data: F,
) -> NodeResult<()>
where
    F: FnOnce(&mut [u8])

// Send raw bytes with explicit Arrow type information.
pub fn send_typed_output<F>(
    &mut self,
    output_id: DataId,
    type_info: ArrowTypeInfo,
    parameters: MetadataParameters,
    data_len: usize,
    data: F,
) -> NodeResult<()>
where
    F: FnOnce(&mut [u8])

// Send a pre-allocated DataSample with type information.
pub fn send_output_sample(
    &mut self,
    output_id: DataId,
    type_info: ArrowTypeInfo,
    parameters: MetadataParameters,
    sample: Option<DataSample>,
) -> NodeResult<()>

// Report output IDs as closed. No further sends allowed for those IDs.
pub fn close_outputs(&mut self, outputs_ids: Vec<DataId>) -> NodeResult<()>
}

Service, Action, and Streaming Helpers

Higher-level methods for the communication patterns. These use well-known metadata keys to correlate requests, goals, responses, and streaming segments.

#![allow(unused)]
fn main() {
// Generate a unique, time-ordered ID (UUID v7) for correlation.
pub fn new_request_id() -> String
pub fn new_goal_id() -> String   // alias for new_request_id

// Send a service request. Injects a `request_id` into parameters and returns it.
pub fn send_service_request(
    &mut self,
    output_id: DataId,
    parameters: MetadataParameters,
    data: impl Array,
) -> NodeResult<String>

// Send a service response. Semantic alias for send_output.
// Caller must pass through the request_id from the incoming request's metadata.
pub fn send_service_response(
    &mut self,
    output_id: DataId,
    parameters: MetadataParameters,
    data: impl Array,
) -> NodeResult<()>
}

Service example (client sends request, server replies):

// Client: auto-generates and injects request_id
let rid = node.send_service_request("request".into(), params, data)?;

// Server: pass through metadata.parameters (includes request_id)
node.send_service_response("response".into(), metadata.parameters, result)?;

Action example (client sends goal, server streams feedback + result):

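use dora_node_api::{GOAL_ID, GOAL_STATUS, GOAL_STATUS_SUCCEEDED, Parameter};

// Client: generate goal_id, attach to params
let goal_id = DoraNode::new_goal_id();
params.insert(GOAL_ID.to_string(), Parameter::String(goal_id));
node.send_output("goal".into(), params, data)?;

// Server: extract goal_id, send feedback/result with goal_status
let gid = get_string_param(&metadata.parameters, GOAL_ID);
// Pass the incoming parameters through so the result carries the same goal_id.
let mut result_params = metadata.parameters.clone();
result_params.insert(
    GOAL_STATUS.to_string(),
    Parameter::String(GOAL_STATUS_SUCCEEDED.to_string()),
);
node.send_output("result".into(), result_params, result)?;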

Streaming example (real-time voice/video pipeline with interruption):

use dora_node_api::StreamSegment;

// Create a streaming segment builder (auto-generates session_id)
let mut seg = StreamSegment::new();

// Send chunks with auto-incrementing seq
node.send_stream_chunk("text".into(), &mut seg, false, chunk_data)?;
// Mark final chunk of a segment
node.send_stream_chunk("text".into(), &mut seg, true, last_chunk)?;

// On user interruption: flush downstream queues and start a new segment
let flush_params = seg.flush();
node.send_output("text".into(), flush_params, empty_data)?;

See patterns.md for the full guide and examples/service-example and examples/action-example for working code.

Data Allocation

// Allocate a DataSample of the given size.
// Uses shared memory for data >= ZERO_COPY_THRESHOLD (4096 bytes).
pub fn allocate_data_sample(&mut self, data_len: usize) -> NodeResult<DataSample>

Node Information

#![allow(unused)]
fn main() {
// Node ID from the dataflow YAML.
pub fn id(&self) -> &NodeId

// Unique identifier for this dataflow run.
pub fn dataflow_id(&self) -> &DataflowId

// Input/output configuration for this node.
pub fn node_config(&self) -> &NodeRunConfig

// True if this node was restarted after a previous exit or failure.
pub fn is_restart(&self) -> bool

// Number of times this node has been restarted (0 on first run).
pub fn restart_count(&self) -> u32

// Parsed dataflow YAML descriptor.
pub fn dataflow_descriptor(&self) -> NodeResult<&Descriptor>
}

Logging

Rust nodes have two ways to emit structured logs. Both produce identical structured log entries in the daemon.

Option 1: Node API (recommended for most cases)

All log methods emit structured JSONL to stdout, which the daemon parses automatically. Works with min_log_level filtering, send_logs_as routing, and dora/logs subscribers.

#![allow(unused)]
fn main() {
// General structured log. Level: "error", "warn", "info", "debug", "trace".
pub fn log(&self, level: &str, message: &str, target: Option<&str>)

// Structured log with additional key-value fields.
pub fn log_with_fields(
    &self,
    level: &str,
    message: &str,
    target: Option<&str>,
    fields: Option<&BTreeMap<String, String>>,
)

// Convenience methods (no target parameter).
pub fn log_error(&self, message: &str)
pub fn log_warn(&self, message: &str)
pub fn log_info(&self, message: &str)
pub fn log_debug(&self, message: &str)
pub fn log_trace(&self, message: &str)
}

Option 2: Rust tracing crate

When dora’s tracing subscriber is initialized (via init_tracing() or the default feature), tracing::info!() etc. output structured JSON to stdout that the daemon parses identically:

tracing::info!("Sensor started");
tracing::warn!(sensor_id = "temp-01", "High temperature");

Use tracing when you want ecosystem integration (spans, instrumentation, OpenTelemetry). Use node.log_*() when you want explicit control or structured fields as BTreeMap.

Method	Structured?	Fields?	OpenTelemetry?	Best for
`node.log_info(msg)`	Yes	No	No	Quick one-liner
`node.log_with_fields(...)`	Yes	Yes (BTreeMap)	No	Structured key-value context
`tracing::info!(key = val, msg)`	Yes	Yes (spans)	Yes	Ecosystem integration, OTel
`println!()`	No (`stdout` level)	No	No	Quick debugging

EventStream

Asynchronous iterator over incoming events destined for this node. Implements the futures::Stream trait.

The event stream closes itself after a Stop event is received. Nodes should exit once the stream ends.

#![allow(unused)]
fn main() {
// Block until the next event arrives. Returns None when the stream closes.
// Uses an internal EventScheduler that may reorder events for fairness.
pub fn recv(&mut self) -> Option<Event>

// Block with a timeout. Returns an Event::Error on timeout.
pub fn recv_timeout(&mut self, dur: Duration) -> Option<Event>

// Async receive with EventScheduler reordering.
pub async fn recv_async(&mut self) -> Option<Event>

// Async receive with a timeout. Returns Event::Error on timeout.
pub async fn recv_async_timeout(&mut self, dur: Duration) -> Option<Event>

// Non-blocking receive. Returns TryRecvError::Empty if nothing is ready.
pub fn try_recv(&mut self) -> Result<Event, TryRecvError>

// Drain all buffered events without blocking.
// Returns Some(Vec::new()) if nothing is ready; None if the stream is closed.
pub fn drain(&mut self) -> Option<Vec<Event>>

// True if no events are buffered in the scheduler or receiver.
pub fn is_empty(&self) -> bool

// Returns and resets accumulated drop counts per input ID.
// For `drop_oldest` inputs, drops happen at `queue_size`.
// For `backpressure` inputs, drops happen at 10x `queue_size` (hard safety cap).
pub fn drain_drop_counts(&mut self) -> HashMap<DataId, u64>
}
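
The drop-count comments above describe two limits: `queue_size` for `drop_oldest` inputs and 10x `queue_size` for `backpressure` inputs. The sketch below models them in plain Python. `InputQueueSketch` is hypothetical, and discarding the oldest message at the backpressure cap is an assumption made for illustration, not documented behavior.

```python
from collections import Counter, deque


class InputQueueSketch:
    """Hypothetical bounded input queue that counts dropped messages."""

    def __init__(self, queue_size, policy="drop_oldest"):
        # Documented limits: queue_size, or 10x queue_size for backpressure.
        self.limit = queue_size if policy == "drop_oldest" else 10 * queue_size
        self.items = deque()
        self.dropped = Counter()

    def push(self, input_id, message):
        if len(self.items) >= self.limit:
            self.items.popleft()  # assumption: the oldest message is discarded
            self.dropped[input_id] += 1
        self.items.append(message)

    def drain_drop_counts(self):
        """Return the accumulated counts and reset them."""
        counts, self.dropped = dict(self.dropped), Counter()
        return counts


queue = InputQueueSketch(queue_size=2)
for frame in range(5):
    queue.push("image", frame)
first_report = queue.drain_drop_counts()
second_report = queue.drain_drop_counts()  # counts were reset by the first call
```

Polling the counts periodically and logging non-empty reports is one way to notice a consumer that cannot keep up.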

EventStream also implements futures::Stream<Item = Event>, so it can be used with StreamExt::next() and other combinators. Unlike recv/recv_async, the Stream implementation does not use the EventScheduler, preserving chronological event order.

Event

Represents an incoming event. This enum is #[non_exhaustive] – ignore unknown variants to stay forward-compatible.

#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum Event {
    // An input was received from another node.
    Input {
        id: DataId,           // input ID from the YAML (not the sender's output ID)
        metadata: Metadata,   // timestamp and type information
        data: ArrowData,      // Apache Arrow data
    },

    // The sender mapped to this input exited; no more data will arrive.
    InputClosed { id: DataId },

    // A previously closed input recovered (e.g., upstream node came back after timeout).
    InputRecovered { id: DataId },

    // An upstream node has restarted. Useful for resetting caches or state.
    NodeRestarted { id: NodeId },

    // The event stream is about to close. See StopCause for the reason.
    Stop(StopCause),

    // Instructs the node to reload an operator (used internally by the runtime).
    Reload { operator_id: Option<OperatorId> },

    // An unexpected internal error. Log it for debugging.
    Error(String),
}
}

StopCause

#[non_exhaustive]
pub enum StopCause {
    // Explicit stop via `dora stop` or Ctrl-C. Exit promptly or be killed.
    Manual,

    // All inputs were closed (upstream nodes exited). Only sent if the node has inputs.
    AllInputsClosed,
}

Supporting Types

DataSample

A data region suitable for sending as an output message. Uses shared memory for data >= ZERO_COPY_THRESHOLD to enable zero-copy transfer.

Implements Deref<Target = [u8]> and DerefMut for reading and writing the underlying bytes.

Metadata and MetadataParameters

#![allow(unused)]
fn main() {
// Full metadata attached to every input event.
pub struct Metadata {
    // Contains timestamp, Arrow type info, and user-defined parameters.
}

// User-controlled metadata fields attached when sending outputs.
// Type alias for BTreeMap<String, Parameter>.
// Default is empty. Pass metadata.parameters from an input to forward metadata.
pub type MetadataParameters = BTreeMap<String, Parameter>;

// A single metadata parameter value.
pub enum Parameter {
    Bool(bool), Integer(i64), Float(f64), String(String),
    ListInt(Vec<i64>), ListFloat(Vec<f64>), ListString(Vec<String>),
    Timestamp(DateTime<Utc>),
}

// Extract typed parameters, returning None if missing or wrong type.
pub fn get_string_param<'a>(params: &'a MetadataParameters, key: &str) -> Option<&'a str>
pub fn get_integer_param(params: &MetadataParameters, key: &str) -> Option<i64>
pub fn get_bool_param(params: &MetadataParameters, key: &str) -> Option<bool>
}

Well-known metadata keys (for communication patterns):

Constant	Value	Used by
`REQUEST_ID`	`"request_id"`	Service request/response correlation
`GOAL_ID`	`"goal_id"`	Action goal identification
`GOAL_STATUS`	`"goal_status"`	Action result status
`GOAL_STATUS_SUCCEEDED`	`"succeeded"`	Goal completed successfully
`GOAL_STATUS_ABORTED`	`"aborted"`	Goal aborted by server
`GOAL_STATUS_CANCELED`	`"canceled"`	Goal canceled by client
`SESSION_ID`	`"session_id"`	Streaming session identifier
`SEGMENT_ID`	`"segment_id"`	Streaming segment within a session
`SEQ`	`"seq"`	Streaming chunk sequence number
`FIN`	`"fin"`	Last chunk of a streaming segment
`FLUSH`	`"flush"`	Discard older queued messages on input

All constants are re-exported from dora_node_api.

Identity Types

// Unique identifier for a running dataflow instance (UUID v4).
pub struct DataflowId(/* ... */);

// Node identifier, as defined in the dataflow YAML.
pub struct NodeId(/* ... */);

// Input/output identifier, as defined in the dataflow YAML.
pub struct DataId(/* ... */);

Error Types

#[derive(Debug, Error)]
pub enum NodeError {
    Init(String),        // config parsing, env vars, daemon handshake
    Connection(String),  // daemon connection lost
    Output(String),      // send or close failure
    Data(String),        // allocation or descriptor parsing
    Internal(eyre::Report),  // catch-all for unexpected errors
}

pub type NodeResult<T> = Result<T, NodeError>;

TryRecvError

pub enum TryRecvError {
    Empty,   // no event available right now
    Closed,  // event stream has been closed
}

ZERO_COPY_THRESHOLD

pub const ZERO_COPY_THRESHOLD: usize = 4096;

Messages smaller than this threshold are sent via TCP. Messages at or above this size use shared memory for zero-copy transfer.
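
The boundary is inclusive on the shared-memory side. A small Python sketch of the rule as stated (the function name `transport_for` is hypothetical; the real selection happens inside the node API):

```python
ZERO_COPY_THRESHOLD = 4096  # bytes, mirroring the Rust constant


def transport_for(data_len):
    """Return the documented transport for a payload of `data_len` bytes."""
    return "shared_memory" if data_len >= ZERO_COPY_THRESHOLD else "tcp"


small = transport_for(4095)     # one byte under the threshold
boundary = transport_for(4096)  # exactly at the threshold
```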

ArrowData

// Wrapper around arrow::array::ArrayRef. Implements Deref to the inner ArrayRef.
pub struct ArrowData(pub arrow::array::ArrayRef);

Data from Event::Input arrives as ArrowData. Use TryFrom conversions or Arrow APIs to extract typed values.

InputTracker

Helper for tracking input health and caching the last received value per input. Useful for graceful degradation when upstream nodes time out.

#![allow(unused)]
fn main() {
pub struct InputTracker { /* ... */ }

impl InputTracker {
    pub fn new() -> Self

    // Update state from an event. Returns true if the event was relevant.
    pub fn process_event(&mut self, event: &Event) -> bool

    // Current state of an input (Healthy or Closed), if tracked.
    pub fn state(&self, id: &DataId) -> Option<InputState>

    // True if the input is currently closed.
    pub fn is_closed(&self, id: &DataId) -> bool

    // Last received value for an input. Available even when closed.
    pub fn last_value(&self, id: &DataId) -> Option<&ArrowData>

    // All inputs currently in Closed state.
    pub fn closed_inputs(&self) -> Vec<&DataId>

    // True if any tracked input is closed.
    pub fn any_closed(&self) -> bool
}

pub enum InputState {
    Healthy,  // receiving data normally
    Closed,   // upstream exited or timed out
}
}
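
To make the graceful-degradation idea concrete, here is a hedged Python analogue of the tracker. `InputTrackerSketch` is hypothetical, and so are the event type strings (`"INPUT"`, `"INPUT_CLOSED"`, `"INPUT_RECOVERED"`) and the assumption that a recovered input returns to the healthy state.

```python
class InputTrackerSketch:
    """Hypothetical Python analogue of the tracker described above."""

    def __init__(self):
        self.states = {}       # input id -> "healthy" | "closed"
        self.last_values = {}  # input id -> last payload seen

    def process_event(self, event):
        """Update state from an event dict; return True if it was relevant."""
        kind, input_id = event.get("type"), event.get("id")
        if kind == "INPUT":
            self.states[input_id] = "healthy"
            self.last_values[input_id] = event["value"]
        elif kind == "INPUT_CLOSED":
            self.states[input_id] = "closed"
        elif kind == "INPUT_RECOVERED":
            self.states[input_id] = "healthy"
        else:
            return False
        return True

    def is_closed(self, input_id):
        return self.states.get(input_id) == "closed"

    def last_value(self, input_id):
        return self.last_values.get(input_id)


tracker = InputTrackerSketch()
tracker.process_event({"type": "INPUT", "id": "lidar", "value": [1.0, 2.0]})
tracker.process_event({"type": "INPUT_CLOSED", "id": "lidar"})
# Degrade gracefully: keep using the cached scan while the input is closed.
scan = tracker.last_value("lidar") if tracker.is_closed("lidar") else None
```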

Integration Testing

The integration_testing module provides tools for testing nodes without a running daemon.

setup_integration_testing

Sets up thread-local state so that the next call to DoraNode::init_from_env on the same thread initializes in test mode.

pub fn setup_integration_testing(
    input: TestingInput,
    output: TestingOutput,
    options: TestingOptions,
)

TestingInput

pub enum TestingInput {
    // Load events from a JSON file (must deserialize to IntegrationTestInput).
    FromJsonFile(PathBuf),

    // Provide events directly.
    Input(IntegrationTestInput),
}

TestingOutput

pub enum TestingOutput {
    // Write outputs to a JSONL file (created or overwritten).
    ToFile(PathBuf),

    // Write outputs as JSONL to any writer.
    ToWriter(Box<dyn std::io::Write + Send>),

    // Send each output as a JSON object to a flume channel.
    ToChannel(flume::Sender<serde_json::Map<String, serde_json::Value>>),
}

TestingOptions

#[derive(Debug, Clone, Default)]
pub struct TestingOptions {
    // Skip time offsets in outputs for deterministic comparison.
    pub skip_output_time_offsets: bool,
}

Environment Variable Testing

Nodes using init_from_env also support file-based testing via environment variables:

Variable	Description
`DORA_TEST_WITH_INPUTS`	Path to a JSON input file (`IntegrationTestInput` format)
`DORA_TEST_WRITE_OUTPUTS_TO`	Path for the output JSONL file (default: `outputs.jsonl` next to inputs)
`DORA_TEST_NO_OUTPUT_TIME_OFFSET`	If set, omit time offsets for deterministic outputs
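
A test harness might assemble these variables before launching a node binary. The helper below is a sketch: `build_test_env` and `expected_outputs_path` are hypothetical names, the value `"1"` is arbitrary (the table only says the variable must be set), and the node path in the usage comment is a placeholder.

```python
import os
from pathlib import Path


def build_test_env(inputs, outputs=None, deterministic=False, base=None):
    """Return an environment dict for file-based node testing."""
    env = dict(os.environ if base is None else base)
    env["DORA_TEST_WITH_INPUTS"] = str(inputs)
    if outputs is not None:
        env["DORA_TEST_WRITE_OUTPUTS_TO"] = str(outputs)
    if deterministic:
        env["DORA_TEST_NO_OUTPUT_TIME_OFFSET"] = "1"
    return env


def expected_outputs_path(inputs, outputs=None):
    """Where the JSONL output should appear, applying the documented default."""
    if outputs is not None:
        return Path(outputs)
    return Path(inputs).parent / "outputs.jsonl"


env = build_test_env("tests/inputs.json", deterministic=True, base={})
out_path = expected_outputs_path("tests/inputs.json")
# Usage (placeholder binary path):
#   subprocess.run(["./target/debug/my-node"], env=env, check=True)
```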

Operator API (`dora-operator-api`)

Operators are in-process components managed by the Dora runtime. They are compiled as shared libraries (.so/.dylib/.dll) and loaded by the runtime.

Add to your Cargo.toml:

[dependencies]
dora-operator-api = { workspace = true }

[lib]
crate-type = ["cdylib"]

DoraOperator Trait

pub trait DoraOperator: Default {
    fn on_event(
        &mut self,
        event: &Event,
        output_sender: &mut DoraOutputSender,
    ) -> Result<DoraStatus, String>;
}

Implement this trait to define your operator’s behavior. The runtime calls on_event for each incoming event. Return DoraStatus to control execution flow.

Event (Operator)

The operator Event enum is simpler than the node Event and uses &str for IDs.

#[non_exhaustive]
pub enum Event<'a> {
    // An input was received.
    Input { id: &'a str, data: ArrowData },

    // Failed to parse the input data as an Arrow array.
    InputParseError { id: &'a str, error: String },

    // An input was closed by the sender.
    InputClosed { id: &'a str },

    // The operator should stop.
    Stop,
}

DoraOutputSender

pub struct DoraOutputSender<'a>(/* ... */);

impl DoraOutputSender<'_> {
    // Send an output. `id` is the output ID from your dataflow YAML.
    pub fn send(&mut self, id: String, data: impl Array) -> Result<(), String>
}

DoraStatus

Returned from on_event to control the operator lifecycle.

pub enum DoraStatus {
    Continue,  // keep running, wait for the next event
    Stop,      // stop this operator
    StopAll,   // stop the entire dataflow
}

register_operator! Macro

Generates the FFI entry points required by the Dora runtime to load and call your operator.

use dora_operator_api::register_operator;

register_operator!(MyOperator);

This must be called exactly once per crate, at the top level, with the type that implements DoraOperator.

Quick Start Example: Node

A minimal node that receives tick inputs and sends a random number as output.

use dora_node_api::{DoraNode, Event, IntoArrow, dora_core::config::DataId};

fn main() -> eyre::Result<()> {
    let (mut node, mut events) = DoraNode::init_from_env()?;

    let output = DataId::from("random".to_owned());

    while let Some(event) = events.recv() {
        match event {
            Event::Input { id, metadata, data } => {
                if id.as_str() == "tick" {
                    let value: u64 = fastrand::u64(..);
                    node.send_output(
                        output.clone(),
                        metadata.parameters,
                        value.into_arrow(),
                    )?;
                }
            }
            Event::Stop(_) => {}
            _ => {}
        }
    }

    Ok(())
}

Corresponding dataflow YAML:

nodes:
  - id: my-node
    path: ./target/debug/my-node
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - random

  - id: sink
    path: ./target/debug/sink
    inputs:
      data: my-node/random

Quick Start Example: Operator

A minimal operator that counts ticks and forwards formatted messages.

#![warn(unsafe_op_in_unsafe_fn)]

use dora_operator_api::{
    DoraOperator, DoraOutputSender, DoraStatus, Event, IntoArrow, register_operator,
};

register_operator!(MyOperator);

#[derive(Debug, Default)]
struct MyOperator {
    ticks: usize,
}

impl DoraOperator for MyOperator {
    fn on_event(
        &mut self,
        event: &Event,
        output_sender: &mut DoraOutputSender,
    ) -> Result<DoraStatus, String> {
        match event {
            Event::Input { id, data } => match *id {
                "tick" => {
                    self.ticks += 1;
                    let msg = format!("tick count: {}", self.ticks);
                    output_sender.send("status".into(), msg.into_arrow())?;
                }
                other => eprintln!("ignoring unexpected input {other}"),
            },
            Event::InputClosed { id } => {
                if *id == "tick" {
                    return Ok(DoraStatus::Stop);
                }
            }
            Event::Stop => {}
            other => {
                eprintln!("received unknown event {other:?}");
            }
        }

        Ok(DoraStatus::Continue)
    }
}

Corresponding dataflow YAML:

nodes:
  - id: runtime-node
    operator:
      shared_library: ./target/debug/libmy_operator
      inputs:
        tick: dora/timer/millis/500
      outputs:
        - status

Python API Reference

This document covers the Python APIs for building dora nodes, operators, and dataflows. Install with:

pip install dora-rs

Node API

from dora import Node

The Node class is the primary interface for custom nodes. It connects to a running dataflow, receives input events, and sends outputs.

Node class

`__init__(node_id=None)`

Create a new node and connect to the running dataflow.

# Standard: node ID is read from environment variables set by the daemon
node = Node()

# Dynamic: connect to a running dataflow by explicit node ID
node = Node(node_id="my-dynamic-node")

Parameters:

node_id (str, optional) – Explicit node ID for dynamic nodes. When omitted, the node reads its identity from environment variables set by the dora daemon.

Raises: RuntimeError if the node cannot connect to the dataflow.

`next(timeout=None)`

Retrieve the next event from the event stream. Blocks until an event is available or the timeout expires.

event = node.next()              # block indefinitely
event = node.next(timeout=2.0)   # block up to 2 seconds

Parameters:

timeout (float, optional) – Maximum wait time in seconds.

Returns: dict | None – An event dictionary, or None if all senders have been dropped or the timeout expired.

`drain()`

Retrieve all buffered events without blocking.

events = node.drain()
for event in events:
    print(event["type"])

Returns: list[dict] – A list of event dictionaries. Returns an empty list if no events are buffered.

`try_recv()`

Non-blocking receive. Returns the next buffered event if one is available.

event = node.try_recv()
if event is not None:
    print(event["type"])

Returns: dict | None – An event dictionary, or None if no event is buffered.

`recv_async(timeout=None)`

Asynchronous receive. For use with asyncio.

event = await node.recv_async()
event = await node.recv_async(timeout=5.0)

Parameters:

timeout (float, optional) – Maximum wait time in seconds. Returns an error if the timeout is reached.

Returns: dict | None – An event dictionary, or None if all senders have been dropped.

Note: This method is experimental. The pyo3 async (Rust-Python FFI) integration is still in development.

`is_empty()`

Check whether there are any buffered events in the event stream.

if not node.is_empty():
    event = node.try_recv()

Returns: bool

`send_output(output_id, data, metadata=None)`

Send data on an output channel.

import pyarrow as pa

# Send raw bytes
node.send_output("status", b"OK")

# Send an Apache Arrow array (zero-copy capable)
node.send_output("values", pa.array([1, 2, 3]))

# Send with metadata
node.send_output("image", pa.array(pixels), {"camera_id": "front"})

Parameters:

output_id (str) – The output name as declared in the dataflow YAML.
data (bytes | pyarrow.Array) – The payload. Use bytes for simple data or pyarrow.Array for zero-copy shared-memory transport.
metadata (dict, optional) – Key-value pairs attached to the message. Supported value types: bool, int, float, str, list[int], list[float], list[str], datetime.datetime.

Raises: RuntimeError if data is neither bytes nor a pyarrow.Array.
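
Because only a fixed set of metadata value types is supported, it can help to check a dict before sending. The helper below is a sketch in plain Python, not part of the dora package: `unsupported_metadata_keys` is a hypothetical name, and treating lists as valid only when every element has the same type (`int`, `float`, or `str`) is an assumption based on the list above.

```python
import datetime

_SCALARS = (bool, int, float, str, datetime.datetime)
_LIST_ITEMS = (int, float, str)


def unsupported_metadata_keys(metadata):
    """Return the keys whose values fall outside the documented types."""
    bad = []
    for key, value in metadata.items():
        if isinstance(value, _SCALARS):
            continue
        # A list qualifies only if all of its items share one allowed type.
        if isinstance(value, list) and any(
            all(type(item) is t for item in value) for t in _LIST_ITEMS
        ):
            continue
        bad.append(key)
    return bad


meta = {
    "camera_id": "front",
    "exposure": 0.5,
    "roi": [0, 0, 640, 480],
    "nested": {"a": 1},  # dicts are not in the supported list
    "mixed": [1, "a"],   # neither list[int] nor list[str]
}
problems = unsupported_metadata_keys(meta)
```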

Service, action, and streaming patterns

Python nodes use the same metadata key conventions as Rust for communication patterns. Parameters are plain dicts with string keys.

Well-known metadata keys:

Key	Description
`"request_id"`	Service request/response correlation (UUID v7)
`"goal_id"`	Action goal identification (UUID v7)
`"goal_status"`	Action result status: `"succeeded"`, `"aborted"`, or `"canceled"`
`"session_id"`	Streaming session identifier
`"segment_id"`	Streaming segment within a session (integer)
`"seq"`	Streaming chunk sequence number (integer)
`"fin"`	Last chunk of a streaming segment (bool)
`"flush"`	Discard older queued messages on input (bool)

Service client example:

import uuid

# Send a request with a unique request_id
request_id = str(uuid.uuid7())  # Python 3.13+; use uuid_utils or uuid.uuid4() on older versions
node.send_output("request", data, {"request_id": request_id})

Service server example:

# Pass through the metadata (includes request_id) from the incoming request
node.send_output("response", result, event["metadata"])

Action client example:

goal_id = str(uuid.uuid7())
node.send_output("goal", data, {"goal_id": goal_id})

Streaming example (flush downstream queues on user interruption):

params = {
    "session_id": session_id,
    "segment_id": 1,
    "seq": 0,
    "fin": False,
    "flush": True,
}
node.send_output("text", data, metadata=params)

See patterns.md for the full guide.

Logging

Python nodes can log using either Python’s built-in logging module (recommended) or the explicit node API.

Python logging module (auto-bridged):

When Node() is created, it automatically installs a handler that routes Python’s logging module through the dora daemon. No configuration needed:

import logging
from dora import Node

node = Node()  # Installs the logging bridge

logging.info("Sensor initialized")       # -> structured "info" log entry
logging.warning("High temperature")      # -> structured "warn" log entry
logging.debug("Raw bytes: %s", data)     # -> structured "debug" log entry

These log entries are captured with full metadata (level, message, file path, line number) and work with min_log_level filtering, send_logs_as routing, and dora/logs subscribers.
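
For intuition about what such a bridge does, the sketch below builds a standalone `logging.Handler` that turns records into structured entries. It is not the handler that `Node()` installs: `BridgeSketch` and the `records` list are hypothetical, and the level mapping is inferred from the level names used in this document.

```python
import logging

records = []  # stands in for the daemon's log sink


class BridgeSketch(logging.Handler):
    """Hypothetical handler that forwards records as structured entries."""

    LEVELS = {
        logging.CRITICAL: "error",
        logging.ERROR: "error",
        logging.WARNING: "warn",
        logging.INFO: "info",
        logging.DEBUG: "debug",
    }

    def emit(self, record):
        records.append({
            "level": self.LEVELS.get(record.levelno, "info"),
            "message": record.getMessage(),  # applies the %-style arguments
            "file": record.pathname,
            "line": record.lineno,
        })


logger = logging.getLogger("bridge-sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the sketch isolated from the root logger
logger.addHandler(BridgeSketch())

logger.warning("High temperature")
logger.debug("Raw bytes: %s", b"\x01\x02")
```

A handler installed this way coexists poorly with a second handler added by `logging.basicConfig()`, which is one plausible reason for the ordering note below.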

Note: Do not call logging.basicConfig() before creating Node(). The constructor sets up the bridge; calling basicConfig() first may install a conflicting handler.

Explicit node API:

`log(level, message, target=None, fields=None)`

Emit a structured log message with optional target and key-value fields.

node.log("info", "Processing frame", target="vision")
node.log("error", "Sensor timeout", fields={"sensor": "lidar", "retry": "3"})

Parameters:

level (str) – Log level: "error", "warn", "info", "debug", or "trace".
message (str) – The log message.
target (str, optional) – Target module or subsystem name.
fields (dict[str, str], optional) – Structured key-value context fields.

Works with the daemon’s min_log_level filtering, send_logs_as routing, and dora/logs subscribers.
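Since fields is documented as dict[str, str], values such as integers or booleans need converting before they are passed in. A minimal sketch of one way to do that follows; stringify_fields is a hypothetical helper, not a dora function.

```python
def stringify_fields(fields):
    """Convert arbitrary keys and values to the dict[str, str] shape fields expects."""
    return {str(key): str(value) for key, value in fields.items()}


fields = stringify_fields({"sensor": "lidar", "retry": 3, "ok": False})
# In a node: node.log("error", "Sensor timeout", fields=fields)
```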

`log_error(message)`, `log_warn(message)`, `log_info(message)`, `log_debug(message)`, `log_trace(message)`

Convenience methods for common log levels:

node.log_error("Connection failed")
node.log_warn("Temperature elevated")
node.log_info("Sensor initialized")
node.log_debug("Raw bytes received")
node.log_trace("Entering loop iteration")

Each is equivalent to node.log(level, message).

When to use which:

Method	Structured?	Fields?	Best for
`logging.info()`	Yes	No	General-purpose logging
`node.log("info", msg, fields={...})`	Yes	Yes	Structured context (sensor_id, etc.)
`node.log_info(msg)`	Yes	No	Quick one-liner
`print()`	No	No	Legacy code, quick debugging

`dataflow_descriptor()`

Return the full dataflow descriptor (the parsed dataflow YAML) as a Python dictionary.

descriptor = node.dataflow_descriptor()
print(descriptor["nodes"])

Returns: dict

`node_config()`

Return the configuration block for this node from the dataflow descriptor.

config = node.node_config()
model_path = config.get("model", "default.pt")

Returns: dict

`dataflow_id()`

Return the unique identifier of the running dataflow.

print(node.dataflow_id())  # e.g. "a1b2c3d4-..."

Returns: str

`is_restart()`

Check whether this node was restarted after a previous exit or failure. Useful for deciding whether to restore saved state or start fresh.

if node.is_restart():
    restore_checkpoint()

Returns: bool

`restart_count()`

Return how many times this node has been restarted. Returns 0 on the first run, 1 after the first restart, and so on.

print(f"Restart #{node.restart_count()}")

Returns: int
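One way to use is_restart() is to gate checkpoint loading on it. The sketch below assumes a JSON checkpoint file; load_state, save_state, the file name, and the frames_seen field are all hypothetical, and the boolean argument stands in for node.is_restart() so the example runs without a dataflow.

```python
import json
import os
import tempfile


def load_state(is_restart, path):
    """Return saved state after a restart, or a fresh state on the first run."""
    if is_restart and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"frames_seen": 0}


def save_state(state, path):
    """Write to a temp file and rename, so a crash mid-write leaves the old file intact."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


checkpoint = os.path.join(tempfile.mkdtemp(), "state.json")
fresh = load_state(False, checkpoint)        # first run: nothing to restore
save_state({"frames_seen": 42}, checkpoint)
restored = load_state(True, checkpoint)      # in a node: load_state(node.is_restart(), checkpoint)
```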

`merge_external_events(subscription)`

Merge a ROS2 subscription stream into the node’s main event loop. After calling this method, ROS2 messages arrive as events with kind set to "external".

from dora import Node, Ros2Context, Ros2Node, Ros2NodeOptions, Ros2Topic

node = Node()
ros2_context = Ros2Context()
ros2_node = ros2_context.new_node("listener", Ros2NodeOptions())
topic = Ros2Topic("/chatter", "std_msgs/String", ros2_node)
subscription = ros2_node.create_subscription(topic)

node.merge_external_events(subscription)

for event in node:
    if event["kind"] == "external":
        print("ROS2:", event["value"])
    elif event["type"] == "INPUT":
        print("Dora:", event["id"])

Parameters:

subscription (dora.Ros2Subscription) – A ROS2 subscription created via the dora ROS2 bridge.

Iteration support

The Node class implements __iter__ and __next__, so you can iterate directly:

for event in node:
    match event["type"]:
        case "INPUT":
            process(event["value"])
        case "STOP":
            break

The iterator calls next() with no timeout on each iteration. When the event stream is closed, next() returns None and the iterator stops, which ends the loop.

Event dictionary

Events are returned as plain Python dictionaries. The structure depends on the event type.

INPUT

An input message arrived from another node.

{
    "type": "INPUT",
    "id": "camera_image",          # input ID as declared in the dataflow YAML
    "kind": "dora",               # "dora" for dataflow events, "external" for ROS2
    "value": <pyarrow.Array>,      # the payload as an Apache Arrow array
    "metadata": {
        "timestamp": datetime,     # UTC-aware datetime.datetime
        "open_telemetry_context": "...",  # tracing context (if enabled)
        ...                        # any user-supplied metadata
    },
}

Access the data:

values = event["value"].to_pylist()     # convert to Python list
array = event["value"].to_numpy()       # convert to NumPy array

INPUT_CLOSED

An input channel was closed (the upstream node finished).

{
    "type": "INPUT_CLOSED",
    "id": "camera_image",
    "kind": "dora",
}

STOP

The dataflow is shutting down.

{
    "type": "STOP",
    "id": "MANUAL" | "ALL_INPUTS_CLOSED",   # stop cause
    "kind": "dora",
}

ERROR

An error occurred in the runtime.

{
    "type": "ERROR",
    "error": "description of the error",
    "kind": "dora",
}

External (ROS2)

When using merge_external_events, ROS2 messages arrive as:

{
    "kind": "external",
    "value": <pyarrow.Array>,   # the ROS2 message as an Arrow array
}
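Because the event shapes above differ in which keys they carry, a dispatcher has to check them in a safe order. The following sketch works on plain dictionaries shaped like the ones documented above; classify and its return labels are hypothetical names chosen for illustration.

```python
def classify(event):
    """Return a short label describing what an event dictionary represents."""
    # External (ROS2) events are shown without a "type" key, so check "kind" first.
    if event.get("kind") == "external":
        return "ros2"
    event_type = event.get("type")
    if event_type == "INPUT":
        return f"input:{event['id']}"
    if event_type == "INPUT_CLOSED":
        return f"closed:{event['id']}"
    if event_type == "STOP":
        return f"stop:{event['id']}"
    if event_type == "ERROR":
        return f"error:{event['error']}"
    return "unknown"


labels = [
    classify({"type": "INPUT", "id": "camera_image", "kind": "dora",
              "value": None, "metadata": {}}),
    classify({"type": "INPUT_CLOSED", "id": "camera_image", "kind": "dora"}),
    classify({"type": "STOP", "id": "MANUAL", "kind": "dora"}),
    classify({"type": "ERROR", "error": "boom", "kind": "dora"}),
    classify({"kind": "external", "value": None}),
]
```

Using event.get() rather than indexing keeps the dispatcher from raising KeyError on event shapes that omit a key.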

DoraStatus enum

Used as the return value from operator on_event methods to control the event loop.

from dora import DoraStatus

Value	Meaning
`DoraStatus.CONTINUE`	Continue processing events (value `0`)
`DoraStatus.STOP`	Stop this operator (value `1`)
`DoraStatus.STOP_ALL`	Stop the entire dataflow (value `2`)

Operator API

Operators run inside the dora runtime process (no separate OS process). They are defined as a Python class named Operator with an on_event method.

Operator class (user-defined)

Create a Python file with an Operator class:

from dora import DoraStatus

class Operator:
    def __init__(self):
        # Initialize state here
        self.count = 0

    def on_event(self, dora_event, send_output) -> DoraStatus:
        if dora_event["type"] == "INPUT":
            self.count += 1
            # Process the input and optionally send output
            send_output("result", b"processed", dora_event["metadata"])
        return DoraStatus.CONTINUE

Methods:

__init__(self) – Called once when the operator is loaded. Initialize any state or models here.
on_event(self, dora_event, send_output) -> DoraStatus – Called for every incoming event. Must return a DoraStatus value.

Parameters of on_event:

dora_event (dict) – An event dictionary.
send_output (callable) – Callback to send output data (see below).

The runtime also sets self.dataflow_descriptor on the operator instance with the parsed dataflow YAML as a dictionary.

send_output callback

The send_output callback is passed to on_event for sending data from an operator.

send_output(output_id, data, metadata=None)

Parameters:

output_id (str) – The output name as declared in the dataflow YAML.
data (bytes | pyarrow.Array) – The payload.
metadata (dict, optional) – Metadata to attach. Pass dora_event["metadata"] to propagate tracing context.

Example:

import pyarrow as pa
from dora import DoraStatus

class Operator:
    def on_event(self, dora_event, send_output) -> DoraStatus:
        if dora_event["type"] == "INPUT":
            result = pa.array([42], type=pa.int64())
            send_output("output", result, dora_event["metadata"])
        return DoraStatus.CONTINUE
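Since on_event only depends on an event dictionary and a callable, an operator can be exercised without the runtime by passing a recording callback. The sketch below does that; the local DoraStatus enum is a stand-in defined only so the example runs without dora installed (real code should import it from dora), and fake_send_output is a hypothetical test double.

```python
from enum import IntEnum


class DoraStatus(IntEnum):
    """Local stand-in mirroring the documented values; not the real dora class."""
    CONTINUE = 0
    STOP = 1
    STOP_ALL = 2


class Operator:
    def __init__(self):
        self.count = 0

    def on_event(self, dora_event, send_output) -> DoraStatus:
        if dora_event["type"] == "INPUT":
            self.count += 1
            send_output("result", b"processed", dora_event["metadata"])
        return DoraStatus.CONTINUE


sent = []


def fake_send_output(output_id, data, metadata=None):
    """Record outputs instead of sending them into a dataflow."""
    sent.append((output_id, data, metadata))


op = Operator()
event = {"type": "INPUT", "id": "tick", "value": None, "metadata": {}}
status = op.on_event(event, fake_send_output)
```

The same pattern extends to feeding a STOP event and checking that the operator returns the status you expect.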

DataflowBuilder

from dora.builder import DataflowBuilder, Node, Operator, Output

Build dataflow YAML programmatically in Python.

DataflowBuilder class

`__init__(name="dora-dataflow")`

Create a new dataflow builder.

flow = DataflowBuilder("my-robot")

Parameters:

name (str, optional) – Name of the dataflow. Defaults to "dora-dataflow".

`add_node(id, **kwargs) -> Node`

Add a node to the dataflow. Returns a Node object for further configuration.

sender = flow.add_node("sender")

Parameters:

id (str) – Unique node identifier.
**kwargs – Additional node configuration passed through to the YAML.

Returns: Node (builder)

`to_yaml(path=None) -> str | None`

Generate the YAML representation of the dataflow. If path is given, writes to file and returns None. Otherwise returns the YAML string.

# Write to file
flow.to_yaml("dataflow.yml")

# Get as string
yaml_str = flow.to_yaml()

Parameters:

path (str, optional) – File path to write the YAML.

Returns: str | None

Context manager

DataflowBuilder supports the with statement:

with DataflowBuilder("my-flow") as flow:
    flow.add_node("sender").path("sender.py")
    flow.to_yaml("dataflow.yml")

Node class (builder)

Returned by DataflowBuilder.add_node(). All setter methods return self for chaining.

`path(path) -> Node`

Set the path to the node’s executable or script.

node.path("my_node.py")

`args(args) -> Node`

Set command-line arguments for the node.

node.args("--verbose --port 8080")

`env(env) -> Node`

Set environment variables for the node.

node.env({"MODEL_PATH": "/models/yolo.pt"})

`build(command) -> Node`

Set the build command for the node (run before starting).

node.build("pip install -r requirements.txt")

`git(url, branch=None, tag=None, rev=None) -> Node`

Set a Git repository as the source for the node.

node.git("https://github.com/org/repo.git", branch="main")

`add_operator(operator) -> Node`

Attach an Operator to this node.

op = Operator("detector", python="object_detection.py")
node.add_operator(op)

`add_output(output_id) -> Output`

Declare an output on this node and return an Output reference for use as an input source.

output = sender.add_output("data")

`add_input(input_id, source, queue_size=None, queue_policy=None) -> Node`

Subscribe this node to an output from another node.

# Using an Output object
output = sender.add_output("data")
receiver.add_input("data", output)

# Using a string reference
receiver.add_input("tick", "dora/timer/millis/100")

# With a custom queue size
receiver.add_input("images", camera_output, queue_size=2)

# Lossless input (blocks sender when full)
receiver.add_input("commands", cmd_output, queue_size=100, queue_policy="backpressure")

Parameters:

input_id (str) – Name of the input on this node.
source (str | Output) – Either a string ("node_id/output_id") or an Output object.
queue_size (int, optional) – Maximum number of buffered messages for this input.
queue_policy (str, optional) – "drop_oldest" (default) or "backpressure" (buffers up to 10x queue_size before dropping).
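The effect of drop_oldest can be modeled with a bounded deque: once the queue is full, each new message evicts the oldest one. This is only an illustration of the policy's semantics with no consumer draining the queue, not dora's implementation; simulate_drop_oldest is a hypothetical name.

```python
from collections import deque


def simulate_drop_oldest(messages, queue_size):
    """Feed messages into a bounded queue with no consumer; return (kept, dropped)."""
    queue = deque(maxlen=queue_size)
    dropped = 0
    for message in messages:
        if len(queue) == queue_size:
            dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        queue.append(message)
    return list(queue), dropped


kept, dropped = simulate_drop_oldest(range(5), queue_size=2)
```

With five messages and a queue of two, only the two newest messages survive, which is why a small queue_size suits inputs such as camera frames where stale data has little value.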

`to_dict() -> dict`

Return the dictionary representation of the node for YAML serialization.

Output class (builder)

Returned by Node.add_output(). Represents a reference to a node’s output, used as a source in add_input().

output = sender.add_output("data")
receiver.add_input("sensor_data", output)
str(output)  # "sender/data"
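Since an Output renders as a "node_id/output_id" string, tooling that inspects a dataflow can treat both source forms uniformly. The sketch below splits such a string at the first slash; parse_source is a hypothetical helper, and treating built-in sources like dora/timer/millis/100 as node "dora" with the rest as the output ID is an assumption of this example.

```python
def parse_source(source):
    """Split an input source into (node_id, output_id) at the first slash."""
    text = str(source)  # builder Output objects render as "node_id/output_id"
    node_id, sep, output_id = text.partition("/")
    if not sep or not node_id or not output_id:
        raise ValueError(f"expected 'node_id/output_id', got {text!r}")
    return node_id, output_id


plain = parse_source("sender/data")
timer = parse_source("dora/timer/millis/100")
```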

Operator class (builder)

Defines an operator for embedding in a node’s YAML configuration.

`__init__(id, name=None, description=None, build=None, python=None, shared_library=None, send_stdout_as=None)`

op = Operator(
    id="detector",
    python="object_detection.py",
    send_stdout_as="detection_text",
)

Parameters:

id (str) – Unique operator identifier.
name (str, optional) – Display name.
description (str, optional) – Human-readable description.
build (str, optional) – Build command to run before loading.
python (str, optional) – Path to the Python operator file.
shared_library (str, optional) – Path to a shared library operator.
send_stdout_as (str, optional) – Route the operator’s stdout as an output with this ID.

`to_dict() -> dict`

Return the dictionary representation for YAML serialization.

CUDA Module

from dora.cuda import torch_to_ipc_buffer, ipc_buffer_to_ipc_handle, open_ipc_handle

Utilities for zero-copy GPU tensor sharing between nodes via CUDA IPC. Requires PyTorch with CUDA and Numba with CUDA support.

`torch_to_ipc_buffer(tensor) -> tuple[pyarrow.Array, dict]`

Convert a PyTorch CUDA tensor into an Arrow array containing the CUDA IPC handle, plus a metadata dictionary. Send both through the dataflow to share GPU memory without copying.

import torch
import pyarrow as pa
from dora import Node
from dora.cuda import torch_to_ipc_buffer

node = Node()
tensor = torch.randn(1024, 768, device="cuda")
ipc_buffer, metadata = torch_to_ipc_buffer(tensor)
node.send_output("gpu_data", ipc_buffer, metadata)

Parameters:

tensor (torch.Tensor) – A CUDA tensor.

Returns: tuple[pyarrow.Array, dict] – The IPC handle as an int8 Arrow array, and metadata with shape, strides, dtype, size, offset, and source info.

`ipc_buffer_to_ipc_handle(handle_buffer, metadata) -> IpcHandle`

Reconstruct a CUDA IPC handle from a received Arrow buffer and metadata.

from dora.cuda import ipc_buffer_to_ipc_handle

event = node.next()
ipc_handle = ipc_buffer_to_ipc_handle(event["value"], event["metadata"])

Parameters:

handle_buffer (pyarrow.Array) – The Arrow array from event["value"].
metadata (dict) – The metadata from event["metadata"].

Returns: dora.cuda.IpcHandle (a lightweight wrapper around cudaIpcMemHandle_t; call .open() to map the handle into the current process and get a device pointer, .close() to release it).

`open_ipc_handle(ipc_handle, metadata) -> ContextManager[torch.Tensor]`

Open a CUDA IPC handle and yield a PyTorch tensor. Use as a context manager to ensure proper cleanup.

from dora.cuda import ipc_buffer_to_ipc_handle, open_ipc_handle

event = node.next()
ipc_handle = ipc_buffer_to_ipc_handle(event["value"], event["metadata"])

with open_ipc_handle(ipc_handle, event["metadata"]) as tensor:
    result = tensor * 2  # use the GPU tensor directly

Parameters:

ipc_handle (IpcHandle) – Handle from ipc_buffer_to_ipc_handle.
metadata (dict) – The metadata dictionary with shape, strides, and dtype info.

Returns: Context manager yielding a torch.Tensor on CUDA.

Quick Start Example

A complete node that receives messages, transforms them, and sends results:

#!/usr/bin/env python3
"""Example node: receives messages, transforms them, and sends output."""

import logging

import pyarrow as pa
from dora import Node


def main():
    node = Node()

    for event in node:
        if event["type"] == "INPUT":
            input_id = event["id"]

            if input_id == "message":
                values = event["value"].to_pylist()
                number = values[0]

                # Create a struct array with multiple fields
                result = pa.StructArray.from_arrays(
                    [
                        pa.array([number * 2]),
                        pa.array([f"Message #{number}"]),
                    ],
                    names=["doubled", "description"],
                )
                node.send_output("transformed", result)
                logging.info("Transformed message %d", number)

        elif event["type"] == "STOP":
            logging.info("Node stopping")
            break


if __name__ == "__main__":
    main()

Run with:

dora run dataflow.yml

DataflowBuilder Example

Build a dataflow programmatically instead of writing YAML by hand:

#!/usr/bin/env python3
"""Build a simple sender -> receiver dataflow."""

from dora.builder import DataflowBuilder, Operator

flow = DataflowBuilder("example-flow")

# Add a sender node that publishes a "message" output
sender = flow.add_node("sender")
sender.path("sender.py")
message_output = sender.add_output("message")

# Add a receiver that subscribes to the sender
receiver = flow.add_node("receiver")
receiver.path("receiver.py")
receiver.add_input("message", message_output)

# Add a node with a timer input
timed_node = flow.add_node("periodic")
timed_node.path("periodic.py")
timed_node.add_input("tick", "dora/timer/millis/100")

# Add a node with an operator
runtime_node = flow.add_node("runtime-node")
op = Operator("detector", python="object_detection.py")
runtime_node.add_operator(op)
runtime_node.add_input("image", "camera/image")

# Write or print the YAML
flow.to_yaml("dataflow.yml")
print(flow.to_yaml())

C API Reference

This document covers the two C APIs provided by the Dora framework: the Node API for standalone C processes and the Operator API for shared-library operators loaded by the Dora runtime.

Node API (dora-node-api-c)

Header: apis/c/node/node_api.h
Crate: dora-node-api-c (builds as staticlib)

The Node API is used by standalone C executables that participate in a Dora dataflow as external processes. The daemon spawns the process and sets environment variables that the node reads during initialization.

Initialization

`init_dora_context_from_env`

void *init_dora_context_from_env();

Initializes a Dora node context from environment variables set by the daemon. Returns an opaque pointer to the context on success, or NULL on failure.

The returned pointer must be passed to all subsequent Node API calls that expect a context argument. When the node is finished, free it with free_dora_context.

`free_dora_context`

void free_dora_context(void *dora_context);

Frees a context previously created by init_dora_context_from_env. Each context must be freed exactly once. After freeing, the pointer must not be used again.

Event Loop

`dora_next_event`

void *dora_next_event(void *dora_context);

Blocks until the next event is available for this node. Returns an opaque pointer to the event, or NULL when all event streams have closed (indicating the node should exit).

The returned pointer must not be dereferenced directly. Use the read_dora_* functions to extract the event type and payload. Free the event with free_dora_event when done.

`free_dora_event`

void free_dora_event(void *dora_event);

Frees an event previously returned by dora_next_event. Each event must be freed exactly once. After freeing, the event pointer and all derived pointers (from read_dora_input_id, read_dora_input_data) become invalid.

Event Inspection

`read_dora_event_type`

enum DoraEventType read_dora_event_type(void *dora_event);

Returns the type of the given event. See DoraEventType for possible values.

`read_dora_input_id`

void read_dora_input_id(void *dora_event, char **out_ptr, size_t *out_len);

Reads the input ID from a DoraEventType_Input event. Writes the string start pointer to *out_ptr and its byte length to *out_len. The string is valid UTF-8 but not null-terminated; use out_len to determine its bounds.

If the event is not an input event, sets *out_ptr = NULL and *out_len = 0.

The returned pointer borrows from the event. It becomes invalid after free_dora_event is called.

`read_dora_input_data`

void read_dora_input_data(void *dora_event, char **out_ptr, size_t *out_len);

Reads the raw data bytes from a DoraEventType_Input event. Writes the data start pointer to *out_ptr and its byte length to *out_len.

Sets *out_ptr = NULL and *out_len = 0 if the event is not an input event or the input carries no data.

Currently only UInt8 Arrow arrays are supported. Other Arrow data types will cause a runtime panic. Future versions will use the Arrow C Data Interface for full type support.

The returned pointer borrows from the event. It becomes invalid after free_dora_event is called.

`read_dora_input_timestamp`

unsigned long long read_dora_input_timestamp(void *dora_event);

Returns the hybrid logical clock timestamp from an input event’s metadata as a uint64 value. Returns 0 if the event is not an input event.

Output

`dora_send_output`

int dora_send_output(
    void *dora_context,
    const char *id_ptr,
    size_t id_len,
    const char *data_ptr,
    size_t data_len
);

Sends output data to all downstream subscribers. The output ID (id_ptr/id_len) must be a valid UTF-8 string matching one of the node’s declared outputs in the dataflow YAML. The data (data_ptr/data_len) is sent as raw bytes (UInt8 Arrow array).

Returns 0 on success, -1 on error. Errors are logged via tracing.

Returns -1 immediately if any pointer argument is NULL.

Logging

`dora_log`

int dora_log(
    void *dora_context,
    const char *level_ptr,
    size_t level_len,
    const char *msg_ptr,
    size_t msg_len
);

Sends a structured log message through the Dora logging pipeline. Both level and msg must be valid UTF-8 strings.

Valid log levels: "error", "warn", "info", "debug", "trace".

Returns 0 on success, -1 on error. Returns -1 immediately if any pointer argument is NULL.

Enums

`DoraEventType`

enum DoraEventType {
    DoraEventType_Stop,        // Graceful shutdown requested
    DoraEventType_Input,       // New input data available
    DoraEventType_InputClosed, // An input stream was closed
    DoraEventType_Error,       // An error occurred
    DoraEventType_Unknown,     // Unrecognized event type
};

Operator API (dora-operator-api-c)

Headers: apis/c/operator/operator_api.h, apis/c/operator/operator_types.h
Crate: dora-operator-api-c

The Operator API is used by shared libraries (.so/.dylib/.dll) loaded into the Dora runtime process. Unlike nodes, operators do not have their own main function. Instead, they export three functions that the runtime calls at the appropriate lifecycle points.

The operator_types.h header is auto-generated by safer-ffi and defines all C-compatible struct and enum types.

Lifecycle Functions

`dora_init_operator`

DoraInitResult_t dora_init_operator(void);

Called once when the runtime loads the operator. Allocate and initialize any operator state, then return it via the operator_context field. The runtime passes this pointer back on every subsequent call.

Return a DoraInitResult_t with .result.error = NULL on success.

`dora_drop_operator`

DoraResult_t dora_drop_operator(void *operator_context);

Called once when the operator is being unloaded. Free all resources associated with operator_context.

Return a DoraResult_t with .error = NULL on success.

Event Handling

`dora_on_event`

OnEventResult_t dora_on_event(
    RawEvent_t *event,
    const SendOutput_t *send_output,
    void *operator_context
);

Called by the runtime each time an event arrives for this operator. Inspect the event fields to determine the event type:

Field	Meaning
`event->input != NULL`	New input available
`event->stop == true`	Graceful shutdown requested
`event->error.ptr != NULL`	An error occurred (UTF-8 string in `error.ptr`/`error.len`)
`event->input_closed.ptr != NULL`	An input stream closed (input ID in `input_closed.ptr`/`input_closed.len`)

Use send_output to emit data to downstream nodes (see dora_send_operator_output). Return an OnEventResult_t with the appropriate DoraStatus_t to control the operator lifecycle.

Input Reading

`dora_read_input_id`

char *dora_read_input_id(const Input_t *input);

Returns a newly allocated null-terminated string containing the input ID. The caller must free it with dora_free_input_id.

`dora_read_data`

Vec_uint8_t dora_read_data(Input_t *input);

Reads the input data as a byte array. Consumes the underlying Arrow array from the input (the data can only be read once per event). Returns a Vec_uint8_t with .ptr = NULL if the input has no data or the data has already been consumed.

The caller must free the returned data with dora_free_data.

Output Sending

`dora_send_operator_output`

DoraResult_t dora_send_operator_output(
    const SendOutput_t *send_output,
    const char *id,
    const uint8_t *data_ptr,
    size_t data_len
);

Sends output data to downstream subscribers. The id must be a null-terminated string matching one of the operator’s declared outputs. The data (data_ptr/data_len) is converted to a UInt8 Arrow array internally.

Returns a DoraResult_t with .error = NULL on success.

Memory Management

The Operator API allocates memory that the caller must free using the corresponding functions:

Allocation source	Free function
`dora_read_input_id`	`dora_free_input_id`
`dora_read_data`	`dora_free_data`

void dora_free_input_id(char *input_id);
void dora_free_data(Vec_uint8_t data);

Failing to call these functions will leak memory. Do not use free() on these allocations – they are allocated by the Rust runtime and must be freed through the API.

Structs

`Vec_uint8_t`

typedef struct Vec_uint8 {
    uint8_t *ptr;
    size_t len;
    size_t cap;
} Vec_uint8_t;

A Rust-allocated byte vector. Access len bytes starting at ptr. Do not modify cap. Free with dora_free_data.

`DoraResult_t`

typedef struct DoraResult {
    Vec_uint8_t *error;  // NULL on success, points to error string on failure
} DoraResult_t;

Generic result type. A NULL error pointer indicates success. When non-NULL, the error pointer contains a UTF-8 error message.

`DoraInitResult_t`

typedef struct DoraInitResult {
    DoraResult_t result;
    void *operator_context;  // opaque pointer to operator state
} DoraInitResult_t;

Returned by dora_init_operator. On success, result.error is NULL and operator_context holds the operator state pointer.

`OnEventResult_t`

typedef struct OnEventResult {
    DoraResult_t result;
    DoraStatus_t status;
} OnEventResult_t;

Returned by dora_on_event. Contains both an error/success result and a status code controlling the operator lifecycle.

`RawEvent_t`

typedef struct RawEvent {
    Input_t *input;           // non-NULL when this is an input event
    Vec_uint8_t input_closed; // non-empty when an input stream closed
    bool stop;                // true when shutdown is requested
    Vec_uint8_t error;        // non-empty on error
} RawEvent_t;

Represents an event delivered to the operator. Multiple fields may be set simultaneously; check them in order of priority.

`Input_t`

typedef struct Input Input_t;  // opaque

Opaque type representing an input event’s data. Use dora_read_input_id and dora_read_data to extract its contents.

`Output_t`

typedef struct Output Output_t;  // opaque

Opaque type used internally by dora_send_operator_output. Not created directly by user code.

`SendOutput_t`

typedef struct SendOutput {
    ArcDynFn1_DoraResult_Output_t send_output;
} SendOutput_t;

Callback handle passed to dora_on_event. Pass it to dora_send_operator_output to emit data. Do not store it beyond the scope of the current dora_on_event call.

`Metadata_t`

typedef struct Metadata {
    Vec_uint8_t open_telemetry_context;
} Metadata_t;

Event metadata containing an OpenTelemetry trace context string.

Operator Enums

`DoraStatus_t`

enum DoraStatus {
    DORA_STATUS_CONTINUE = 0,  // Keep running
    DORA_STATUS_STOP     = 1,  // Stop this operator
    DORA_STATUS_STOP_ALL = 2,  // Stop the entire dataflow
};
typedef uint8_t DoraStatus_t;

Returned in OnEventResult_t to control operator lifecycle after processing an event.

Node Example

A complete C node that receives timer ticks and sends output messages:

#include <stdio.h>
#include <string.h>
#include "node_api.h"

int main() {
    void *ctx = init_dora_context_from_env();
    if (ctx == NULL) {
        fprintf(stderr, "failed to init dora context\n");
        return 1;
    }

    for (int i = 0; i < 100; i++) {
        void *event = dora_next_event(ctx);
        if (event == NULL)
            break;  // all streams closed

        enum DoraEventType ty = read_dora_event_type(event);

        if (ty == DoraEventType_Input) {
            char *id;
            size_t id_len;
            read_dora_input_id(event, &id, &id_len);

            // Send a response
            char out_id[] = "message";
            char out_data[64];
            int out_len = snprintf(out_data, sizeof(out_data),
                                   "iteration %d", i);

            dora_send_output(ctx, out_id, strlen(out_id),
                              out_data, out_len);
        } else if (ty == DoraEventType_Stop) {
            free_dora_event(event);
            break;
        }

        free_dora_event(event);
    }

    free_dora_context(ctx);
    return 0;
}

Dataflow YAML for the node:

nodes:
  - id: c_node
    path: build/c_node
    inputs:
      timer: dora/timer/millis/100
    outputs:
      - message

Operator Example

A complete C operator that reads input, maintains state, and sends output:

#include "operator_api.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

DoraInitResult_t dora_init_operator(void) {
    // Allocate operator state (a simple counter)
    int *counter = (int *)calloc(1, sizeof(int));

    DoraInitResult_t result = {.operator_context = counter};
    return result;
}

DoraResult_t dora_drop_operator(void *operator_context) {
    free(operator_context);
    DoraResult_t result = {.error = NULL};
    return result;
}

OnEventResult_t dora_on_event(
    RawEvent_t *event,
    const SendOutput_t *send_output,
    void *operator_context)
{
    OnEventResult_t result = {.status = DORA_STATUS_CONTINUE};
    int *counter = (int *)operator_context;

    if (event->input != NULL) {
        char *id = dora_read_input_id(event->input);
        Vec_uint8_t data = dora_read_data(event->input);

        if (data.ptr != NULL) {
            *counter += 1;
            printf("received input '%s', counter: %d\n", id, *counter);

            // Send counter value as string
            char buf[64];
            int len = snprintf(buf, sizeof(buf), "count=%d", *counter);
            result.result = dora_send_operator_output(
                send_output, "counter", (uint8_t *)buf, len);

            dora_free_data(data);
        }

        dora_free_input_id(id);
    }

    if (event->stop) {
        result.status = DORA_STATUS_STOP;
    }

    return result;
}

Dataflow YAML for the operator:

nodes:
  - id: runtime-node
    operators:
      - id: c_operator
        shared-library: build/operator
        inputs:
          data: source_node/output
        outputs:
          - counter

Building and Linking

Node (static library)

C nodes link against dora-node-api-c, which builds as a static library.

Step 1: Build the static library

cargo build -p dora-node-api-c --release

This produces target/release/libdora_node_api_c.a (or .lib on Windows).

Step 2: Compile and link

clang node.c -ldora_node_api_c -L ../../target/release -o build/c_node <FLAGS>

Platform-specific linker flags:

Platform	Flags
Linux	`-lm -lrt -ldl -pthread`
macOS	`-framework CoreServices -framework Security -lSystem -lresolv -lpthread -lc -lm`
Windows	`-ladvapi32 -luserenv -lkernel32 -lws2_32 -lbcrypt -lncrypt -lschannel -lntdll -liphlpapi -lcfgmgr32 -lcredui -lcrypt32 -lcryptnet -lfwpuclnt -lgdi32 -lmsimg32 -lmswsock -lole32 -lopengl32 -lsecur32 -lshell32 -lsynchronization -luser32 -lwinspool -Wl,-nodefaultlib:libcmt -D_DLL -lmsvcrt`

On Windows, add the .exe extension to the output file.

Operator (shared library)

C operators are compiled into shared libraries that the Dora runtime loads at startup.

Step 1: Compile to object file

clang -c operator.c -o build/operator.o -fdeclspec -fPIC

Omit -fPIC on Windows.

Step 2: Link as shared library

# Linux
clang -shared build/operator.o -o build/liboperator.so

# macOS
clang -shared build/operator.o -o build/liboperator.dylib

# Windows
clang -shared build/operator.o -o build/operator.dll

Step 3: Reference in dataflow YAML

operators:
  - id: c_operator
    shared-library: build/operator   # without lib prefix or extension
    inputs:
      data: source/output
    outputs:
      - result

The shared-library path omits the platform-specific prefix (lib) and extension (.so/.dylib/.dll). The runtime resolves the correct file for the current platform.
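The naming convention can be sketched as a small function that maps the platform-neutral path to the file names produced by the link step above. This illustrates the convention only and is not the runtime's resolution code; resolve_shared_library is a hypothetical name, and the platform strings follow Python's sys.platform values.

```python
import sys
from pathlib import PurePosixPath


def resolve_shared_library(path, platform=sys.platform):
    """Map a platform-neutral shared-library path to a platform-specific file name."""
    p = PurePosixPath(path)
    if platform.startswith("win"):
        name = f"{p.name}.dll"          # Windows: no lib prefix
    elif platform == "darwin":
        name = f"lib{p.name}.dylib"     # macOS
    else:
        name = f"lib{p.name}.so"        # Linux and other Unix-like systems
    return str(p.with_name(name))


linux_path = resolve_shared_library("build/operator", "linux")
macos_path = resolve_shared_library("build/operator", "darwin")
windows_path = resolve_shared_library("build/operator", "win32")
```

A helper like this can be handy in build scripts for checking that the expected library file exists before starting the dataflow.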

Include Paths

The Node API header is at apis/c/node/node_api.h. The Operator API headers are at apis/c/operator/operator_api.h and apis/c/operator/operator_types.h. Adjust your include paths accordingly:

# Node
clang -I path/to/dora/apis/c/node node.c ...

# Operator
clang -I path/to/dora/apis/c/operator operator.c ...

C++ Compatibility

The operator headers include extern "C" guards, and the node header uses only C-compatible declarations, so both can be included directly from C++ source files.

C++ API Reference

Dora provides C++ bindings for both standalone nodes and in-process operators via CXX (Rust-C++ interop). The CXX bridge generates type-safe C++ headers from Rust definitions – no raw FFI or manual extern "C" declarations are needed.

Two crates provide the C++ surface:

Crate	Library	Use case
`dora-node-api-cxx`	`libdora_node_api_cxx.a`	Standalone node executable
`dora-operator-api-cxx`	`libdora_operator_api_cxx.a`	Shared-library operator loaded by the runtime

Generated headers: dora-node-api.h and dora-operator-api.h.

Node API (`dora-node-api-cxx`)

Initialization

#include "dora-node-api.h"

// Initialize a node from environment variables set by the Dora daemon.
// Returns a DoraNode struct containing the event stream and output sender.
// Throws on failure.
DoraNode init_dora_node();

DoraNode

Returned by init_dora_node(). Owns the event stream and the output sender for the lifetime of the node.

struct DoraNode {
    rust::Box<Events>        events;       // event stream (blocking receiver)
    rust::Box<OutputSender>  send_output;  // output sender
};

Events

Opaque Rust type exposed to C++. Provides blocking iteration over the node’s incoming events.

// Member function -- call on the boxed object directly.
rust::Box<DoraEvent> Events::next();

// Free function form -- equivalent to events->next().
rust::Box<DoraEvent> next_event(rust::Box<Events>& events);

Both forms block until the next event arrives and return an owned DoraEvent.

DoraEvent

Opaque Rust type. Inspect its kind with event_type(), then downcast with event_as_input() or event_as_arrow_input().

// Determine the event kind.
DoraEventType event_type(const rust::Box<DoraEvent>& event);

// Downcast to a raw-byte input. Throws if the event is not Input.
DoraInput event_as_input(rust::Box<DoraEvent> event);

// Downcast to an Arrow FFI input (writes Arrow C Data Interface structs).
// out_array and out_schema must point to valid ArrowArray / ArrowSchema structs.
// Returns DoraResult with empty error on success.
DoraResult event_as_arrow_input(
    rust::Box<DoraEvent> event,
    uint8_t* out_array,
    uint8_t* out_schema);

// Same as above, but also returns the input ID and metadata.
ArrowInputInfo event_as_arrow_input_with_info(
    rust::Box<DoraEvent> event,
    uint8_t* out_array,
    uint8_t* out_schema);

DoraEventType

enum class DoraEventType : uint8_t {
    Stop,             // graceful shutdown requested
    Input,            // new data arrived on an input
    InputClosed,      // a single input was closed
    Error,            // an error occurred
    Unknown,          // unrecognized event variant
    AllInputsClosed,  // all inputs closed (stream ended)
};

DoraInput

Returned by event_as_input(). Contains raw bytes.

struct DoraInput {
    rust::String     id;    // input identifier (e.g. "tick", "image")
    rust::Vec<uint8_t> data;  // raw payload bytes
};

ArrowInputInfo

Returned by event_as_arrow_input_with_info(). Contains the input ID, metadata, and an error string.

struct ArrowInputInfo {
    rust::String       id;        // input identifier
    rust::Box<Metadata> metadata; // attached metadata
    rust::String       error;     // empty on success
};

DoraResult

Returned by the output-sending functions, log_message(), and event_as_arrow_input(). Check the error field – empty means success.

struct DoraResult {
    rust::String error;  // empty string on success
};

OutputSender

Opaque Rust type. The functions below are free functions that take a reference to the rust::Box<OutputSender> (the send_output field of DoraNode) as their first argument.

send_output

Send raw bytes on a named output.

DoraResult send_output(
    rust::Box<OutputSender>& sender,
    rust::String id,
    rust::Slice<const uint8_t> data);

send_output_with_metadata

Send raw bytes with attached metadata.

DoraResult send_output_with_metadata(
    rust::Box<OutputSender>& sender,
    rust::String id,
    rust::Slice<const uint8_t> data,
    rust::Box<Metadata> metadata);

send_arrow_output

Send an Arrow array via the C Data Interface. The pointers must reference valid ArrowArray and ArrowSchema structs. Ownership of the Arrow data transfers to Rust on success.

DoraResult send_arrow_output(
    rust::Box<OutputSender>& sender,
    rust::String id,
    uint8_t* array_ptr,
    uint8_t* schema_ptr);

// Overload with metadata (same C++ name via cxx_name attribute).
DoraResult send_arrow_output(
    rust::Box<OutputSender>& sender,
    rust::String id,
    uint8_t* array_ptr,
    uint8_t* schema_ptr,
    rust::Box<Metadata> metadata);

log_message

Send a log message through the Dora logging system.

DoraResult log_message(
    const rust::Box<OutputSender>& sender,
    rust::String level,    // e.g. "info", "warn", "error"
    rust::String message);

Metadata

Opaque Rust type for attaching typed key-value pairs to outputs.

Construction

rust::Box<Metadata> new_metadata();

Reading

uint64_t     Metadata::timestamp() const;

bool         Metadata::get_bool(const rust::Str key) const;        // throws on missing/wrong type
int64_t      Metadata::get_int(const rust::Str key) const;
double       Metadata::get_float(const rust::Str key) const;
rust::String Metadata::get_str(const rust::Str key) const;

rust::Vec<int64_t>      Metadata::get_list_int(const rust::Str key) const;
rust::Vec<double>       Metadata::get_list_float(const rust::Str key) const;
rust::Vec<rust::String> Metadata::get_list_string(const rust::Str key) const;

int64_t      Metadata::get_timestamp(const rust::Str key) const;   // nanoseconds since epoch
rust::String Metadata::get_json(const rust::Str key) const;        // single value as JSON string

Writing

All setters throw on failure.

void Metadata::set_bool(const rust::Str key, bool value);
void Metadata::set_int(const rust::Str key, int64_t value);
void Metadata::set_float(const rust::Str key, double value);
void Metadata::set_string(const rust::Str key, rust::String value);

void Metadata::set_list_int(const rust::Str key, rust::Vec<int64_t> value);
void Metadata::set_list_float(const rust::Str key, rust::Vec<double> value);
void Metadata::set_list_string(const rust::Str key, rust::Vec<rust::String> value);

void Metadata::set_timestamp(const rust::Str key, int64_t nanos);  // nanoseconds since epoch

Introspection

MetadataValueType Metadata::type(const rust::Str key) const;  // throws if key missing
rust::String      Metadata::to_json() const;                   // full metadata as JSON
rust::Vec<rust::String> Metadata::list_keys() const;

MetadataValueType

enum class MetadataValueType : uint8_t {
    Bool,
    Integer,
    Float,
    String,
    ListInt,
    ListFloat,
    ListString,
    Timestamp,
};

Service, Action, and Streaming Patterns

C++ nodes can implement communication patterns using the metadata API. The well-known metadata keys are:

Key	Description
`"request_id"`	Service request/response correlation (UUID v7)
`"goal_id"`	Action goal identification (UUID v7)
`"goal_status"`	Action result status: `"succeeded"`, `"aborted"`, or `"canceled"`
`"session_id"`	Streaming session identifier
`"segment_id"`	Streaming segment within a session (integer)
`"seq"`	Streaming chunk sequence number (integer)
`"fin"`	Last chunk of a streaming segment (bool)
`"flush"`	Discard older queued messages on input (bool)

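// Service server: pass through request_id from input metadata.
// c_array / c_schema are caller-owned ArrowArray / ArrowSchema structs.
auto info = event_as_arrow_input_with_info(
    std::move(event),
    reinterpret_cast<uint8_t*>(&c_array),
    reinterpret_cast<uint8_t*>(&c_schema));
send_output_with_metadata(sender, "response", result, std::move(info.metadata));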

// Action server: set goal_id and goal_status on result
auto meta = new_metadata();
meta->set_string("goal_id", goal_id);
meta->set_string("goal_status", "succeeded");
send_output_with_metadata(sender, "result", result_data, std::move(meta));
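The streaming keys in the table (`session_id`, `segment_id`, `seq`, `fin`) can be illustrated with a self-contained sketch that splits one segment into chunks. `ChunkMeta`, `Chunk`, and `split_segment` are hypothetical stand-ins, not part of the Dora API; in a real node each field would be written into a Metadata object with set_string / set_int / set_bool. Starting `seq` at 0 is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the values a node would attach as metadata to each chunk.
struct ChunkMeta {
    std::string session_id;   // "session_id"
    std::int64_t segment_id;  // "segment_id"
    std::int64_t seq;         // "seq"
    bool fin;                 // "fin": true only on the last chunk
};

struct Chunk {
    ChunkMeta meta;
    std::vector<std::uint8_t> data;
};

// Split one segment's payload into chunks of at most `chunk_size` bytes.
// An empty payload still yields a single empty chunk with fin = true.
std::vector<Chunk> split_segment(const std::string& session_id,
                                 std::int64_t segment_id,
                                 const std::vector<std::uint8_t>& payload,
                                 std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<Chunk> chunks;
    std::size_t offset = 0;
    std::int64_t seq = 0;
    do {
        const std::size_t len = std::min(chunk_size, payload.size() - offset);
        const bool last = (offset + len == payload.size());
        const auto first = payload.begin() + static_cast<std::ptrdiff_t>(offset);
        const auto end = first + static_cast<std::ptrdiff_t>(len);
        chunks.push_back(Chunk{ChunkMeta{session_id, segment_id, seq, last},
                               std::vector<std::uint8_t>(first, end)});
        offset += len;
        ++seq;
    } while (offset < payload.size());
    return chunks;
}
```

A receiver can then reassemble a segment by ordering chunks on `seq` and treating the chunk with `fin` set as the end of the segment.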

CombinedEvents (ROS2 integration)

When using the optional ros2-bridge feature, node events and ROS2 subscription events can be merged into a single stream.

// Convert Dora events into a combined stream.
CombinedEvents dora_events_into_combined(rust::Box<Events> events);

// Create an empty combined stream (for ROS2-only nodes).
CombinedEvents empty_combined_events();

CombinedEvents struct

struct CombinedEvents {
    rust::Box<MergedEvents> events;

    CombinedEvent next();  // blocking -- returns the next merged event
};

CombinedEvent struct

struct CombinedEvent {
    rust::Box<MergedDoraEvent> event;

    bool is_dora() const;  // true if this is a standard Dora event
};

// Downcast a combined event back to a DoraEvent. Throws if not a Dora event.
rust::Box<DoraEvent> downcast_dora(CombinedEvent event);

ROS2 subscriptions add their own events to the merged stream. Use subscription->matches(event) and subscription->downcast(event) to handle ROS2-specific events (see the ROS2 Bridge docs).

Operator API (`dora-operator-api-cxx`)

Operators are shared libraries loaded by the Dora runtime. The C++ side implements two functions that the CXX bridge calls into.

Required C++ interface

You must provide a header operator.h and an implementation file. The header declares an Operator class and two free functions:

// operator.h
#pragma once
#include <memory>
#include "dora-operator-api.h"

class Operator {
public:
    Operator();
    // Add any state your operator needs.
};

std::unique_ptr<Operator> new_operator();

DoraOnInputResult on_input(
    Operator& op,
    rust::Str id,
    rust::Slice<const uint8_t> data,
    OutputSender& output_sender);

new_operator() – called once at startup; returns the operator instance.
on_input() – called for every input event; process data and optionally send outputs.

OutputSender (operator)

Available inside on_input(). Sends data on a named output.

DoraSendOutputResult send_output(
    OutputSender& sender,
    rust::Str id,
    rust::Slice<const uint8_t> data);

Result types

struct DoraOnInputResult {
    rust::String error;  // empty on success
    bool         stop;   // true to request graceful shutdown
};

struct DoraSendOutputResult {
    rust::String error;  // empty on success
};

Quick Start: Node Example

A minimal node that receives timer ticks and sends a counter.

#include "dora-node-api.h"
#include <iostream>
#include <vector>

int main() {
    auto dora_node = init_dora_node();
    unsigned char counter = 0;

    for (;;) {
        auto event = next_event(dora_node.events);
        auto ty = event_type(event);

        if (ty == DoraEventType::AllInputsClosed) {
            break;
        }
        if (ty == DoraEventType::Stop) {
            break;
        }
        if (ty == DoraEventType::Input) {
            auto input = event_as_input(std::move(event));
            counter += 1;

            std::cout << "Input: " << std::string(input.id)
                      << " counter=" << (int)counter << std::endl;

            std::vector<unsigned char> out{counter};
            rust::Slice<const uint8_t> slice{out.data(), out.size()};
            auto result = send_output(dora_node.send_output, "counter", slice);
            if (!result.error.empty()) {
                std::cerr << "Send error: " << std::string(result.error) << std::endl;
                return 1;
            }
        }
    }
    return 0;
}

Dataflow YAML:

nodes:
  - id: cxx-node
    path: build/my_node
    inputs:
      tick: dora/timer/millis/300
    outputs:
      - counter

Quick Start: Arrow Node Example

A node that receives and sends Arrow arrays via the C Data Interface, with metadata.

#include "dora-node-api.h"
#include <arrow/api.h>
#include <arrow/c/bridge.h>
#include <iostream>

int main() {
    auto dora_node = init_dora_node();

    for (int i = 0; i < 10; i++) {
        auto event = dora_node.events->next();
        auto ty = event_type(event);

        if (ty == DoraEventType::AllInputsClosed || ty == DoraEventType::Stop) {
            break;
        }
        if (ty == DoraEventType::Input) {
            // Receive Arrow input with metadata
            struct ArrowArray c_array;
            struct ArrowSchema c_schema;
            auto info = event_as_arrow_input_with_info(
                std::move(event),
                reinterpret_cast<uint8_t*>(&c_array),
                reinterpret_cast<uint8_t*>(&c_schema));

            if (!info.error.empty()) {
                std::cerr << std::string(info.error) << std::endl;
                continue;
            }

            std::cout << "Input: " << std::string(info.id)
                      << " ts=" << info.metadata->timestamp() << std::endl;

            auto imported = arrow::ImportArray(&c_array, &c_schema);
            auto array = imported.ValueOrDie();
            std::cout << "Arrow: " << array->ToString() << std::endl;

            // Build an output Arrow array
            arrow::Int32Builder builder;
            builder.Append(i * 10);
            std::shared_ptr<arrow::Array> out_array;
            builder.Finish(&out_array);

            // Export and send with metadata
            struct ArrowArray out_c_array;
            struct ArrowSchema out_c_schema;
            arrow::ExportArray(*out_array, &out_c_array, &out_c_schema);

            auto meta = new_metadata();
            meta->set_string("source", "cpp-arrow-node");
            meta->set_int("iteration", i);

            auto result = send_arrow_output(
                dora_node.send_output, "counter",
                reinterpret_cast<uint8_t*>(&out_c_array),
                reinterpret_cast<uint8_t*>(&out_c_schema),
                std::move(meta));

            if (!result.error.empty()) {
                std::cerr << "Send error: " << std::string(result.error) << std::endl;
            }
        }
    }
    return 0;
}

Quick Start: Operator Example

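A minimal operator shared library. The code assumes that the Operator class in operator.h declares an unsigned char counter member initialized to 0.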

// operator.cc
#include "operator.h"
#include <iostream>
#include <vector>

Operator::Operator() {}

std::unique_ptr<Operator> new_operator() {
    return std::make_unique<Operator>();
}

DoraOnInputResult on_input(
    Operator& op,
    rust::Str id,
    rust::Slice<const uint8_t> data,
    OutputSender& output_sender)
{
    op.counter += 1;

    std::vector<unsigned char> out{op.counter};
    rust::Slice<const uint8_t> slice{out.data(), out.size()};
    auto send_result = send_output(output_sender, rust::Str("status"), slice);

    return DoraOnInputResult{send_result.error, false};
}

Dataflow YAML:

nodes:
  - id: runtime-node
    operators:
      - id: my-operator
        shared-library: build/my_operator
        inputs:
          data: some-node/output
        outputs:
          - status

Build Integration (CMake)

The recommended build approach uses CMake with the DoraTargets.cmake helper (see examples/cmake-dataflow/).

Project structure

my-project/
  CMakeLists.txt
  DoraTargets.cmake       # copied from examples/cmake-dataflow/
  node/main.cc
  operator/operator.h
  operator/operator.cc
  dataflow.yml

CMakeLists.txt

cmake_minimum_required(VERSION 3.21)
project(my-dataflow LANGUAGES C CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_FLAGS "-fPIC")

include(DoraTargets.cmake)
link_directories(${dora_link_dirs})

# Standalone node (executable)
add_executable(my_node node/main.cc ${node_bridge})
add_dependencies(my_node Dora_cxx)
target_include_directories(my_node PRIVATE ${dora_cxx_include_dir})
target_link_libraries(my_node dora_node_api_cxx)

# Operator (shared library)
add_library(my_operator SHARED
    operator/operator.cc ${operator_bridge})
add_dependencies(my_operator Dora_cxx)
target_include_directories(my_operator PRIVATE
    ${dora_cxx_include_dir} ${dora_c_include_dir}
    ${CMAKE_CURRENT_SOURCE_DIR}/operator)
target_link_libraries(my_operator dora_operator_api_cxx)

install(TARGETS my_node DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/bin)
install(TARGETS my_operator DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/lib)

What DoraTargets.cmake provides

Variable	Description
`dora_cxx_include_dir`	Path to generated CXX headers (`dora-node-api.h`, `dora-operator-api.h`)
`dora_c_include_dir`	Path to C API headers (for mixed C/C++ projects)
`dora_link_dirs`	Library search path for `libdora_node_api_cxx.a` / `libdora_operator_api_cxx.a`
`node_bridge`	Generated CXX bridge source file for nodes (`node_bridge.cc`)
`operator_bridge`	Generated CXX bridge source file for operators (`operator_bridge.cc`)
`Dora_cxx`	CMake target dependency that builds the CXX crates

Build steps

# Option A: Build against local Dora source
mkdir build && cd build
cmake .. -DDORA_ROOT_DIR=/path/to/dora
cmake --build .

# Option B: Build against Dora from GitHub (cloned automatically)
mkdir build && cd build
cmake ..
cmake --build .

Requirements

C++20 compiler
Rust toolchain (for building the Dora static libraries via Cargo)
CMake 3.21+
For Arrow integration: Apache Arrow C++ library

CXX Bridge Notes

All Rust opaque types (Events, OutputSender, DoraEvent, Metadata, MergedEvents, MergedDoraEvent) are accessed through rust::Box<T>.
rust::String, rust::Vec<T>, and rust::Slice<const T> are CXX bridge types that interoperate with their C++ standard library counterparts. See the CXX type reference.
Functions that return Result<T> in Rust throw C++ exceptions on the error path.
Arrow FFI functions (event_as_arrow_input, send_arrow_output) are unsafe on the Rust side. The caller must pass valid pointers to ArrowArray / ArrowSchema structs cast to uint8_t*.
The node library is a static archive (staticlib). Link it into your executable with -ldora_node_api_cxx.
The operator library is also a static archive. Link it into your shared library with -ldora_operator_api_cxx.

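Dora CLI Reference

Dora (AI-Dora, Dataflow-Oriented Robotic Architecture) is a 100% Rust framework for real-time robotics and AI applications. This document covers the dora CLI from both the end-user and the developer perspective.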

Quick Start
Installation
Core Concepts
Dataflow Descriptor
Command Reference
- Lifecycle
- Monitoring
- Debugging
- Setup
- Utility
- Self-Management
Environment Variables
Architecture Guide
Writing Nodes
Writing Operators
Distributed Deployments – see also Distributed Deployment Guide for cluster management, scheduling, and operations
Troubleshooting
Debugging and Observability – standalone guide covering record/replay, topic inspection, log analysis, and resource monitoring
API References: Rust | Python | C | C++

Quick Start

# Create a new project
dora new my-robot --kind dataflow --lang rust

# Run locally (no coordinator/daemon needed)
dora run dataflow.yml

# Or use coordinator/daemon for production
dora up
dora start dataflow.yml --attach
# Ctrl-C to stop
dora down

Installation

From crates.io (recommended)

cargo install dora-cli

From source

cargo install --path binaries/cli --locked

Verify

dora --version
dora status

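Core Concepts

Dataflow

A dataflow is a directed graph of nodes connected by typed data channels. Nodes produce outputs that other nodes consume as inputs. The framework handles data routing, serialization (Apache Arrow), and lifecycle management.

Execution Modes

Mode	Command	Infrastructure	Use case
Local	`dora run`	None	Development, testing, single machine
Distributed	`dora up` + `dora start`	Coordinator + Daemon(s)	Production, multi-machine

Component Roles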

CLI  -->  Coordinator  -->  Daemon(s)  -->  Nodes / Operators
              (control plane)  (per machine)    (user code)

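CLI: the user interface. Sends commands and displays logs.
Coordinator: orchestrates dataflow lifecycles across machines.
Daemon: spawns node processes, manages IPC, and collects metrics.
Node: a standalone process that produces and consumes Arrow data.
Operator: in-process code that runs inside a shared runtime (lower latency than a node).

Data Format

All data flows through the system as Apache Arrow columnar arrays. This enables zero-copy shared-memory transfer between nodes on the same machine and zero serialization overhead.

Dataflow Descriptor

Dataflows are defined in YAML files. A minimal example comes first, followed by the complete schema.

Minimal Example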

nodes:
  - id: sender
    path: sender.py
    outputs:
      - message

  - id: receiver
    path: receiver.py
    inputs:
      message: sender/message

Full Schema

# Dataflow-level settings
health_check_interval: 5.0    # health check sweep interval in seconds (default: 5.0)

nodes:
  - id: my-node                 # unique identifier (required)
    name: "My Node"             # human-readable name (optional)
    description: "..."          # description (optional)

    # --- Source (pick one) ---
    path: ./target/debug/my-node          # local executable
    # path: https://example.com/node.zip  # download from URL
    # git: https://github.com/org/repo.git  # build from git
    #   branch: main            # git branch (mutually exclusive with tag/rev)
    #   tag: v1.0               # git tag
    #   rev: abc123             # git commit hash

    # --- Build ---
    build: cargo build -p my-node   # shell command to build (optional)

    # --- Inputs ---
    inputs:
      # Short form: source_node/output_id
      tick: dora/timer/millis/100
      data: other-node/output

      # Long form with options
      sensor_data:
        source: sensor/frames
        queue_size: 10            # input buffer size (default: 10)
        queue_policy: drop_oldest # or "backpressure" (buffers up to 10x queue_size)
        input_timeout: 5.0        # circuit breaker timeout in seconds

    # --- Outputs ---
    outputs:
      - processed
      - status

    # --- Environment ---
    env:
      MY_VAR: "value"
      FROM_ENV:
        __dora_env: HOST_VAR     # read from host environment
    args: "--verbose"             # command-line arguments

    # --- Fault tolerance ---
    restart_policy: on-failure    # never (default) | on-failure | always
    max_restarts: 5               # 0 = unlimited
    restart_delay: 1.0            # initial backoff in seconds
    max_restart_delay: 30.0       # backoff cap in seconds
    restart_window: 300.0         # reset counter after N seconds
    health_check_timeout: 30.0    # kill if no activity for N seconds

    # --- Logging ---
    min_log_level: info           # source-level filter (daemon-side)
    send_stdout_as: raw_output    # route raw stdout as data output
    send_logs_as: log_entries     # route structured logs as data output
    max_log_size: "50MB"          # rotate log files at this size
    max_rotated_files: 5          # number of rotated files to keep (1-100)

    # --- Deployment ---
    _unstable_deploy:
      machine: A                  # target machine/daemon ID

# Debug settings
_unstable_debug:
  enable_debug_inspection: true   # required for topic echo/hz/info
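To make the `queue_size` and `queue_policy: drop_oldest` fields in the schema concrete, here is a minimal sketch of a bounded queue with drop-oldest semantics and a dropped-message counter. `DropOldestQueue` is a hypothetical illustration, not Dora's internal queue; the real implementation may differ in threading and bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

// Bounded input queue: when full, the oldest pending message is discarded
// to make room for the new one, and the drop is counted.
template <typename T>
class DropOldestQueue {
public:
    explicit DropOldestQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void push(T value) {
        if (items_.size() == capacity_) {
            items_.pop_front();  // discard the oldest message
            ++dropped_;
        }
        items_.push_back(std::move(value));
    }

    // Move the oldest pending message into `out`; returns false if empty.
    bool pop(T& out) {
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

    std::size_t size() const { return items_.size(); }
    std::uint64_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    std::deque<T> items_;
    std::uint64_t dropped_ = 0;
};
```

With a capacity of 10, pushing 12 messages before the consumer reads leaves the 10 newest in the queue and records 2 drops. This favors fresh data over completeness, which is why it suits sensor streams.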

Built-in Timer Nodes

Timers are virtual nodes that emit ticks at a fixed interval:

inputs:
  tick: dora/timer/millis/100   # every 100ms
  slow: dora/timer/millis/1000  # every 1s
  fast: dora/timer/hz/30        # 30 Hz (~33ms)
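The two timer forms map to a tick interval as follows. This is a sketch under assumptions: `timer_interval` and `parse_digits` are hypothetical helpers, the sketch rejects a rate of zero, and it uses integer division for the `hz` form. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>

// Parse a non-empty run of ASCII digits (at most 9, to stay well inside int64).
static std::optional<std::int64_t> parse_digits(const std::string& text) {
    if (text.empty() || text.size() > 9) return std::nullopt;
    std::int64_t value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + (c - '0');
    }
    return value;
}

// Return the tick interval for a built-in timer source, or nullopt if the
// string is not a timer (for example "sensor/frames") or has a zero rate.
std::optional<std::chrono::nanoseconds> timer_interval(const std::string& source) {
    static const std::string millis_prefix = "dora/timer/millis/";
    static const std::string hz_prefix = "dora/timer/hz/";

    if (source.compare(0, millis_prefix.size(), millis_prefix) == 0) {
        const auto ms = parse_digits(source.substr(millis_prefix.size()));
        if (!ms || *ms == 0) return std::nullopt;
        return std::chrono::nanoseconds(std::chrono::milliseconds(*ms));
    }
    if (source.compare(0, hz_prefix.size(), hz_prefix) == 0) {
        const auto hz = parse_digits(source.substr(hz_prefix.size()));
        if (!hz || *hz == 0) return std::nullopt;
        return std::chrono::nanoseconds(1'000'000'000 / *hz);
    }
    return std::nullopt;
}
```

For `dora/timer/hz/30` the sketch yields 33,333,333 ns, matching the "~33ms" comment in the YAML.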

Operator Nodes

Operators run in-process inside a shared runtime (no separate process):

nodes:
  # Single operator (shorthand)
  - id: detector
    operator:
      python: detect.py
      build: pip install -r requirements.txt
      inputs:
        image: camera/frames
      outputs:
        - bbox

  # Multiple operators sharing a runtime
  - id: runtime-node
    operators:
      - id: preprocessor
        shared-library: ../../target/debug/libpreprocess
        inputs:
          raw: sensor/data
        outputs:
          - processed
      - id: analyzer
        shared-library: ../../target/debug/libanalyze
        inputs:
          data: runtime-node/preprocessor/processed
        outputs:
          - result

Distributed Deployment

Use _unstable_deploy to assign nodes to specific machines:

nodes:
  - id: camera-driver
    _unstable_deploy:
      machine: robot-arm
    path: ./target/debug/camera
    outputs:
      - frames

  - id: ml-inference
    _unstable_deploy:
      machine: gpu-server
    path: ./target/debug/inference
    inputs:
      frames: camera-driver/frames
    outputs:
      - predictions

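When nodes are on different machines, communication automatically switches from shared memory to Zenoh pub/sub.

Command Reference

Lifecycle Commands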

`dora run`

Runs a dataflow locally, without a coordinator or daemon. Suited to development and testing.

dora run <PATH> [OPTIONS]

Argument/Flag	Default	Description
`<PATH>`	required	Path to dataflow descriptor YAML
`--stop-after <DURATION>`		Auto-stop after duration (e.g., `30s`, `5m`)
`--uv`	false	Use `uv` for Python node management
`--debug`	false	Enable debug topics (equivalent to `enable_debug_inspection: true`)
`--allow-shell-nodes`	false	Enable shell-based node execution
`--log-level <LEVEL>`	`stdout`	Min display level: `error|warn|info|debug|trace|stdout`
`--log-format <FORMAT>`	`pretty`	Output format: `pretty|json|compact`
`--log-filter <FILTER>`		Per-node level overrides: `"node1=debug,node2=warn"`
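The `--log-filter` value is a comma-separated list of `node=level` pairs. The sketch below shows one way such a string could be split into per-node overrides. `parse_log_filter` is a hypothetical helper, and skipping malformed entries is an assumption; the CLI may instead reject them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Split "node1=debug,node2=warn" into {node1: debug, node2: warn}.
// Entries without '=' or with an empty node or level are skipped.
std::map<std::string, std::string> parse_log_filter(const std::string& filter) {
    std::map<std::string, std::string> levels;
    std::istringstream stream(filter);
    std::string entry;
    while (std::getline(stream, entry, ',')) {
        const std::size_t eq = entry.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == entry.size()) {
            continue;  // malformed entry
        }
        levels[entry.substr(0, eq)] = entry.substr(eq + 1);
    }
    return levels;
}
```

A node that appears in the map uses its own level. All other nodes fall back to the global `--log-level`.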

Examples:

# Basic run
dora run dataflow.yml

# Stop after 10 seconds, only show warnings
dora run dataflow.yml --stop-after 10s --log-level warn

# Python dataflow with uv
dora run dataflow.yml --uv

# Debug one node, silence others
dora run dataflow.yml --log-level warn --log-filter "sensor=debug"

# JSON output for CI pipelines
dora run dataflow.yml --log-format json --stop-after 30s 2>test.json

`dora up`

Starts the coordinator and a daemon in local mode.

dora up

Spawns dora coordinator and dora daemon as background processes. Waits for both to be ready before returning. Idempotent: if already running, does nothing.

`dora down` (alias: `dora destroy`)

Tears down the coordinator and daemon. Stops all running dataflows first.

dora down [OPTIONS]

Flag	Default	Description
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

`dora build`

Runs the build commands defined in the dataflow descriptor.

dora build <PATH> [OPTIONS]

Flag	Default	Description
`<PATH>`	required	Dataflow descriptor path
`--uv`	false	Use `uv` for Python builds
`--local`	false	Force local build (skip coordinator)
`--strict-types`	false	Treat type warnings as errors (non-zero exit code)

Type checking: After expanding modules, build runs the same type checks as validate. Warnings are printed by default; use --strict-types (or set strict_types: true in the YAML) to fail the build on type mismatches. User-defined types in a types/ directory next to the dataflow are loaded automatically.

Build strategy: If nodes have _unstable_deploy sections and a coordinator is reachable, builds are distributed to target machines. Otherwise, builds run locally.

Git sources: Nodes with a git: field are cloned/updated before building. The build command runs from the git repository root.

`dora start`

Starts a dataflow on a running coordinator.

dora start <PATH> [OPTIONS]

Flag	Default	Description
`<PATH>`	required	Dataflow descriptor path
`--name <NAME>`, `-n`		Assign a name to the dataflow
`--attach`	auto	Attach to log stream and wait for completion
`--detach`	auto	Return immediately after spawn
`--debug`	false	Enable debug topics (equivalent to `enable_debug_inspection: true`)
`--hot-reload`	false	Watch Python files and reload on change
`--uv`	false	Use `uv` for Python nodes
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

If neither --attach nor --detach is specified: attaches if running in a TTY, detaches otherwise.

Attach mode: Streams logs, handles Ctrl-C gracefully (first = stop, second = force kill).

Hot reload: Watches Python operator source files. On change, sends a reload request to the coordinator which propagates to the daemon.

`dora stop`

Stops a running dataflow.

dora stop [UUID_OR_NAME] [OPTIONS]

Flag	Default	Description
`[UUID_OR_NAME]`	interactive	Dataflow UUID or name
`--name <NAME>`, `-n`		Alternative name specification
`--grace-duration <DURATION>`		Graceful shutdown timeout
`--force`, `-f`	false	Immediate termination
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

If no identifier is given and running in a TTY, presents an interactive picker.

Stop sequence: Send Event::Stop -> wait grace duration -> SIGTERM -> hard kill.

`dora restart`

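Restarts a running dataflow (stop, then start again from the stored descriptor). No YAML path is needed: the coordinator keeps the original descriptor.

dora restart [UUID] [OPTIONS]

Flag	Default	Description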
`[UUID]`		Dataflow UUID
`--name <NAME>`, `-n`		Restart by name instead of UUID
`--grace-duration <DURATION>`		Graceful shutdown timeout for the stop phase
`--force`, `-f`	false	Force kill before restart
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Examples:

# Restart by name
dora restart --name my-app

# Restart by UUID with forced stop
dora restart a1b2c3d4-... --force

`dora record`

Record dataflow messages to a .drec file for offline replay. See Debugging Guide for full workflows.

dora record <DATAFLOW_YAML> [OPTIONS]

Flag	Default	Description
`<DATAFLOW_YAML>`	required	Path to dataflow descriptor
`-o, --output <PATH>`	`recording_{timestamp}.drec`	Output file path
`--topics <TOPICS>`	all	Comma-separated `node/output` topics to record
`--proxy`	false	Stream via WebSocket instead of recording on target
`--output-yaml <PATH>`		Write modified YAML without running (dry run)

Default mode injects a record node into the dataflow. --proxy mode requires a running dataflow and enable_debug_inspection: true.

`dora replay`

Replay a recorded .drec file by replacing source nodes with replay nodes. See Debugging Guide for full workflows.

dora replay <FILE> [OPTIONS]

Flag	Default	Description
`<FILE>`	required	Path to `.drec` recording
`--speed <FLOAT>`	`1.0`	Playback speed (0 = max speed)
`--loop`	false	Loop the recording
`--replace <NODE_IDS>`	all recorded	Comma-separated nodes to replace
`--output-yaml <PATH>`		Write modified YAML without running (dry run)
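The `--speed` flag scales the gaps between recorded messages. A minimal sketch of that scaling, assuming the replay node waits `recorded_gap / speed` between messages and that 0 means "no waiting"; `replay_delay` is a hypothetical helper, not the actual replay implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Map the gap between two recorded messages to the wait before replaying
// the second one. speed = 1.0 keeps the original timing, 2.0 halves the
// wait, and 0 (or any non-positive value, by assumption) replays at max speed.
std::chrono::nanoseconds replay_delay(std::chrono::nanoseconds recorded_gap,
                                      double speed) {
    assert(recorded_gap.count() >= 0);
    if (speed <= 0.0) {
        return std::chrono::nanoseconds(0);
    }
    const double scaled = static_cast<double>(recorded_gap.count()) / speed;
    return std::chrono::nanoseconds(static_cast<std::int64_t>(scaled));
}
```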

Monitoring Commands

`dora list` (alias: `dora ps`)

Lists running dataflows with their metrics.

dora list [OPTIONS]

Flag	Default	Description
`--format <FMT>`, `-f`	`table`	Output format: `table|json`
`--status <STATUS>`		Filter: `running|finished|failed`
`--name <PATTERN>`		Filter by name (case-insensitive substring)
`--sort-by <FIELD>`		Sort by: `cpu|memory`
`--quiet`, `-q`	false	Print only UUIDs
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Output columns: UUID, Name, Status, Nodes, CPU, Memory

`dora logs`

Shows and follows logs for dataflows and nodes.

dora logs [UUID_OR_NAME] [NODE] [OPTIONS]

Flag	Default	Description
`[UUID_OR_NAME]`		Dataflow UUID or name
`[NODE]`		Node name (required unless `--all-nodes`)
`--all-nodes`	false	Merge logs from all nodes by timestamp
`--tail <N>`	all	Show last N lines
`--follow`, `-f`	false	Stream new log entries
`--local`	false	Read from local `out/` directory
`--since <DURATION>`		Show logs newer than duration ago
`--until <DURATION>`		Show logs older than duration ago
`--level <LEVEL>`	`stdout`	Min log level
`--log-format <FORMAT>`	`pretty`	Output format
`--log-filter <FILTER>`		Per-node level overrides
`--grep <PATTERN>`		Case-insensitive text search
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Filter pipeline: Read/Parse -> Time filters -> Grep -> Tail -> Display
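The order of the stages matters: because tail runs after grep, `--grep timeout --tail 2` returns the last two matching entries rather than the matches among the last two lines. The sketch below models the stages on already-parsed entries. `LogEntry`, `LogQuery`, and `filter_logs` are hypothetical names, timestamps are simplified to whole seconds, and the level filter is left out because the pipeline line above does not say where it applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct LogEntry {
    std::int64_t timestamp_s;  // seconds since epoch (simplified)
    std::string message;
};

struct LogQuery {
    std::optional<std::int64_t> since_s;  // keep entries at or after this time
    std::optional<std::int64_t> until_s;  // keep entries at or before this time
    std::optional<std::string> grep;      // case-insensitive substring
    std::optional<std::size_t> tail;      // keep only the last N survivors
};

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Parsed entries -> time filters -> grep -> tail.
std::vector<LogEntry> filter_logs(const std::vector<LogEntry>& parsed,
                                  const LogQuery& q) {
    assert(!q.since_s || !q.until_s || *q.since_s <= *q.until_s);
    const std::string needle = q.grep ? to_lower(*q.grep) : std::string();
    std::vector<LogEntry> out;
    for (const LogEntry& e : parsed) {
        if (q.since_s && e.timestamp_s < *q.since_s) continue;
        if (q.until_s && e.timestamp_s > *q.until_s) continue;
        if (q.grep && to_lower(e.message).find(needle) == std::string::npos) continue;
        out.push_back(e);
    }
    if (q.tail && out.size() > *q.tail) {
        out.erase(out.begin(), out.end() - static_cast<std::ptrdiff_t>(*q.tail));
    }
    return out;
}
```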

Examples:

# Follow all nodes live
dora logs my-dataflow --all-nodes --follow

# Last 50 errors from a specific node
dora logs my-dataflow sensor --level error --tail 50

# Search logs from last 5 minutes
dora logs my-dataflow --all-nodes --since 5m --grep "timeout"

# Read local files (no coordinator needed)
dora logs --local --all-nodes --tail 100

# Post-mortem analysis: errors in time window
dora logs --local sensor --since 1h --until 30m --level error

Duration formats: 30 (seconds), 30s, 5m, 1h, 2d
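The listed duration formats can be parsed with a few lines of code. This sketch covers only the five forms shown above (bare seconds, `s`, `m`, `h`, `d`). `parse_duration_seconds` is a hypothetical helper, and rejecting anything else is an assumption about how strict the CLI is. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Convert "30", "30s", "5m", "1h", or "2d" to a number of seconds.
// Returns nullopt for anything else (no digits, unknown unit, too long).
std::optional<std::int64_t> parse_duration_seconds(const std::string& text) {
    std::size_t i = 0;
    std::int64_t value = 0;
    while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
        if (i >= 12) return std::nullopt;  // keep value * 86400 inside int64
        value = value * 10 + (text[i] - '0');
        ++i;
    }
    if (i == 0) return std::nullopt;  // no leading digits

    const std::string unit = text.substr(i);
    if (unit.empty() || unit == "s") return value;
    if (unit == "m") return value * 60;
    if (unit == "h") return value * 3600;
    if (unit == "d") return value * 86400;
    return std::nullopt;
}
```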

`dora inspect top` (alias: `dora top`)

Real-time TUI monitor for node resource usage (like top).

dora inspect top [OPTIONS]
dora top [OPTIONS]

Flag	Default	Description
`--refresh-interval <SECONDS>`	`2`	Update interval (min: 1)
`--once`	false	Print a single JSON snapshot and exit (for scripting/CI)
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Requires an interactive terminal (unless --once is used).

Key	Action
`q` / `Esc`	Quit
`Up` / `k`	Select previous node
`Down` / `j`	Select next node
`n`	Sort by node name
`c`	Sort by CPU
`m`	Sort by memory
`r`	Force refresh

Columns: NODE, STATUS, DATAFLOW, PID, CPU%, MEMORY (MB), RESTARTS, QUEUE, NET TX, NET RX, I/O READ (MB/s), I/O WRITE (MB/s)

STATUS: Running, Restarting, Degraded (broken inputs), or Failed
RESTARTS: Current restart count per node
QUEUE: Pending messages in the node’s input queue
NET TX/RX: Cumulative cross-daemon network bytes sent/received via Zenoh

CPU values are per-core (can exceed 100% with multiple cores). Metrics come from daemons, so this works for distributed deployments.

Scripting example:

# JSON snapshot for CI/monitoring pipelines
dora top --once | jq '.[].cpu_usage'

`dora topic list`

List all topics (outputs) in a running dataflow.

dora topic list [OPTIONS]

Flag	Default	Description
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name
`--format <FMT>`	`table`	Output format: `table|json`

`dora topic echo`

Subscribe to topics and display messages in real-time.

dora topic echo [OPTIONS] [DATA...]

Flag	Default	Description
`-d <DATAFLOW>`, `--dataflow`	required	Dataflow UUID or name
`[DATA...]`	all outputs	Topics to echo (e.g., `node1/output`)
`--format <FMT>`	`table`	Output format: `table|json`

Requires _unstable_debug.enable_debug_inspection: true in the descriptor.

`dora topic hz`

Measure topic publish frequency with a TUI dashboard.

dora topic hz [OPTIONS] [DATA...]

Flag	Default	Description
`-d <DATAFLOW>`, `--dataflow`	required	Dataflow UUID or name
`[DATA...]`	all outputs	Topics to measure
`--window <SECONDS>`	`10`	Sliding window (min: 1)

Requires an interactive terminal. Displays: Avg (ms), Avg (Hz), Min (ms), Max (ms), Std (ms), plus a rate sparkline and histogram for the selected topic.
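The displayed statistics can all be derived from the message arrival times inside the sliding window. The sketch below is an assumption about how such values could be computed, not the dashboard's actual code: `RateStats` and `rate_stats` are hypothetical names, and the standard deviation is the population form.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

struct RateStats {
    double avg_ms;  // mean gap between consecutive messages
    double avg_hz;  // 1000 / avg_ms
    double min_ms;
    double max_ms;
    double std_ms;  // population standard deviation of the gaps
};

// arrival_ms: arrival times in milliseconds, ascending, already restricted
// to the sliding window. At least two arrivals are needed to form a gap.
std::optional<RateStats> rate_stats(const std::vector<double>& arrival_ms) {
    assert(std::is_sorted(arrival_ms.begin(), arrival_ms.end()));
    if (arrival_ms.size() < 2) return std::nullopt;

    std::vector<double> gaps;
    gaps.reserve(arrival_ms.size() - 1);
    for (std::size_t i = 1; i < arrival_ms.size(); ++i) {
        gaps.push_back(arrival_ms[i] - arrival_ms[i - 1]);
    }

    double sum = 0.0;
    for (double g : gaps) sum += g;
    const double avg = sum / static_cast<double>(gaps.size());

    double squared = 0.0;
    for (double g : gaps) squared += (g - avg) * (g - avg);
    const double std_dev = std::sqrt(squared / static_cast<double>(gaps.size()));

    const auto [min_it, max_it] = std::minmax_element(gaps.begin(), gaps.end());
    const double hz = avg > 0.0 ? 1000.0 / avg : 0.0;
    return RateStats{avg, hz, *min_it, *max_it, std_dev};
}
```

For arrivals at 0, 50, and 150 ms the gaps are 50 and 100 ms, so the sketch reports an average of 75 ms (about 13.3 Hz), a minimum of 50 ms, a maximum of 100 ms, and a standard deviation of 25 ms.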

`dora topic info`

Show detailed metadata of a single topic.

dora topic info [OPTIONS] DATA

Flag	Default	Description
`-d <DATAFLOW>`, `--dataflow`	required	Dataflow UUID or name
`DATA`	required	Single topic (e.g., `camera/image`)
`--duration <SECONDS>`	`5`	Collection duration (min: 1)

Subscribes to the topic for the specified duration and reports: type (Arrow schema), publisher, subscribers, message count, bandwidth.

`dora node`

Manage and inspect dataflow nodes.

`dora node list`

dora node list [OPTIONS]

Lists nodes in a running dataflow with their status, CPU, memory, and restart count.

Columns: NODE, STATUS, PID, CPU%, MEMORY (MB), RESTARTS, DATAFLOW

`dora node info`

Show detailed information about a specific node including status, inputs, outputs, and metrics.

dora node info <NODE> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID to inspect
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name
`-f <FORMAT>`, `--format`	`table`	Output format: `table|json`

`dora node restart`

Restart a single node within a running dataflow. The daemon stops the node process and respawns it.

dora node restart <NODE> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID to restart
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name
`--grace <DURATION>`		Grace period before force-killing the node

`dora node stop`

Stop a single node within a running dataflow without stopping the entire dataflow.

dora node stop <NODE> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID to stop
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name
`--grace <DURATION>`		Grace period before force-killing the node

`dora topic pub`

Publish JSON data to a topic in a running dataflow. Requires enable_debug_inspection: true.

dora topic pub <TOPIC> [DATA] [OPTIONS]

Flag	Default	Description
`<TOPIC>`	required	Topic to publish to (format: `node_id/output_id`)
`[DATA]`		JSON data to publish (required unless `--file`)
`--file <PATH>`		Read data from a JSON file instead of command line
`--count <N>`	`1`	Number of messages to publish
`-d <DATAFLOW>`, `--dataflow`	required	Dataflow UUID or name

Examples:

# Publish a single value
dora topic pub -d my-app sensor/threshold '[42]'

# Publish from file, 10 times
dora topic pub -d my-app sensor/config --file config.json --count 10

`dora param`

Manage runtime parameters for nodes. Parameters are persisted in the coordinator store and optionally forwarded to running nodes.

`dora param list`

List all runtime parameters for a node.

dora param list <NODE> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name
`--format <FMT>`	`table`	Output format: `table|json`

`dora param get`

Get a single runtime parameter value.

dora param get <NODE> <KEY> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID
`<KEY>`	required	Parameter key
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name

`dora param set`

Set a runtime parameter. The value is JSON. The parameter is stored in the coordinator and forwarded to the node if it is running.

dora param set <NODE> <KEY> <VALUE> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID
`<KEY>`	required	Parameter key (max 256 bytes)
`<VALUE>`	required	Parameter value as JSON (max 64KB serialized)
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name

Examples:

# Set a numeric parameter
dora param set -d my-app sensor threshold 42

# Set a string parameter
dora param set -d my-app camera resolution '"1080p"'

# Set a complex parameter
dora param set -d my-app detector config '{"confidence": 0.8, "nms": 0.5}'
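
When parameters are set from a script, it can help to JSON-encode and shell-quote the value programmatically. The sketch below builds the command line and checks the documented size limits first. It assumes that "64KB" means 64 * 1024 bytes and that both limits apply to the UTF-8 encoding; `param_set_command` is a hypothetical helper, not part of Dora.

```python
import json
import shlex

MAX_KEY_BYTES = 256          # key limit from the table above
MAX_VALUE_BYTES = 64 * 1024  # "64KB serialized"; assumed to mean 64 * 1024 bytes

def param_set_command(dataflow, node, key, value):
    """Build a shell-quoted `dora param set` command line, checking size limits first."""
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        raise ValueError("parameter key exceeds 256 bytes")
    encoded = json.dumps(value)
    if len(encoded.encode("utf-8")) > MAX_VALUE_BYTES:
        raise ValueError("serialized parameter value exceeds 64KB")
    args = ["dora", "param", "set", "-d", dataflow, node, key, encoded]
    return " ".join(shlex.quote(arg) for arg in args)

print(param_set_command("my-app", "camera", "resolution", "1080p"))
print(param_set_command("my-app", "detector", "config", {"confidence": 0.8, "nms": 0.5}))
```

The two printed lines reproduce the string and complex examples above, including the nested quoting that a JSON string value needs on the shell.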

`dora param delete`

Delete a runtime parameter.

dora param delete <NODE> <KEY> [OPTIONS]

Flag	Default	Description
`<NODE>`	required	Node ID
`<KEY>`	required	Parameter key
`-d <DATAFLOW>`, `--dataflow`	interactive	Dataflow UUID or name

`dora doctor`

Diagnose environment, coordinator/daemon connectivity, and optionally validate a dataflow YAML.

dora doctor [OPTIONS]

Flag	Default	Description
`--dataflow <PATH>`		Path to a dataflow YAML to validate

Checks performed:

Coordinator reachability
Daemon connectivity
Active dataflow status
Dataflow YAML validation (if --dataflow provided)

Examples:

# Basic health check
dora doctor

# Check environment + validate a dataflow
dora doctor --dataflow dataflow.yml

`dora trace list`

List recent traces captured by the coordinator. The coordinator captures spans from dora_coordinator and dora_core crates in-memory (up to 4096 spans). No external tracing infrastructure required.

dora trace list [OPTIONS]

Flag	Default	Description
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Output columns: TRACE ID (first 12 chars), ROOT SPAN, SPANS, STARTED, DURATION

Example:

dora trace list

TRACE ID      ROOT SPAN          SPANS  STARTED              DURATION
a1b2c3d4e5f6  spawn_dataflow     12     2026-03-01 10:30:05  1.234s
f8e7d6c5b4a3  build_dataflow     5      2026-03-01 10:29:58  0.500s

`dora trace view`

View spans for a specific trace as an indented tree. Supports prefix matching on trace IDs.

dora trace view <TRACE_ID> [OPTIONS]

Argument/Flag	Default	Description
`<TRACE_ID>`	required	Full trace ID or unique prefix
`--coordinator-addr <IP>`	`127.0.0.1`	Coordinator address
`--coordinator-port <PORT>`	`6013`	Coordinator port

Example:

dora trace view a1b2c3d4

spawn_dataflow [INFO 1.234s] {build_id="abc", session_id="def"}
  build_dataflow [INFO 0.500s]
    download_node [DEBUG 0.200s] {url="..."}
  start_inner [INFO 0.734s]
    spawn_node [INFO 0.100s] {node_id="camera"}
    spawn_node [INFO 0.080s] {node_id="detector"}

Trace IDs are prefix-matched: if the prefix uniquely identifies a trace, it resolves automatically. If ambiguous, you’ll be prompted to use a longer prefix.
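
The prefix rule can be sketched as follows. This is an illustration of the matching behavior described above, not the CLI's code; the function name, error messages, and the full-length trace IDs are hypothetical.

```python
def resolve_trace_id(prefix, trace_ids):
    """Resolve a trace ID prefix; raise if it matches zero or several traces."""
    matches = [trace_id for trace_id in trace_ids if trace_id.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no trace matches prefix {prefix!r}")
    raise LookupError(
        f"prefix {prefix!r} is ambiguous ({len(matches)} traces); use a longer prefix"
    )

# Hypothetical full trace IDs.
known = [
    "a1b2c3d4e5f60718293a4b5c6d7e8f90",
    "a1b2ffff000011112222333344445555",
    "f8e7d6c5b4a392817060504030201000",
]
print(resolve_trace_id("a1b2c3", known))
```

With these IDs, `a1b2c3` resolves to the first trace, while the shorter `a1b2` matches two traces and is rejected.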

Setup Commands

`dora status` (alias: `dora check`)

Check system health and connectivity.

dora status [OPTIONS]

Reports coordinator connectivity, daemon status, and active dataflow count.

`dora new`

Generate a new project or node from a template.

dora new <NAME> [OPTIONS]

Flag	Default	Description
`<NAME>`	required	Project or node name
`--kind <KIND>`	`dataflow`	`dataflow|node`
`--lang <LANG>`	`rust`	`rust|python|c|cxx`

`dora expand`

Expand module references in a dataflow and print the resulting flat YAML. Useful for debugging module composition.

dora expand <PATH> [OPTIONS]

Flag	Default	Description
`<PATH>`	required	Dataflow descriptor (or module file with `--module`)
`--module`	false	Validate a standalone module file instead of a full dataflow

Examples:

# Expand a dataflow with modules
dora expand dataflow.yml

# Validate a module file
dora expand --module modules/navigation.module.yml

See the Modules Guide for full documentation on module composition.

`dora graph`

Visualize a dataflow as a graph.

dora graph <PATH> [OPTIONS]

Flag	Default	Description
`<PATH>`	required	Dataflow descriptor path
`--mermaid`	false	Output Mermaid diagram text
`--open`	false	Open HTML in browser

Without --mermaid, generates an interactive HTML file using mermaid.js. When outputs have type annotations, edge labels include the type name (e.g. image [Image]).

# Generate HTML
dora graph dataflow.yml --open

# Generate Mermaid for GitHub markdown
dora graph dataflow.yml --mermaid

`dora validate`

Validate a dataflow YAML file and check type annotations.

dora validate <PATH> [OPTIONS]

Flag	Default	Description
`<PATH>`	required	Dataflow descriptor path
`--strict-types`	false	Treat warnings as errors (non-zero exit code for CI)

Checks:

Key existence: output_types/input_types keys exist in the corresponding outputs/inputs lists
URN resolution: All type URNs resolve in the standard or user-defined type library
Edge compatibility: Connected edges have compatible types (exact match, widening, or user-defined rules)
Parameterized types: Parameter mismatches (e.g. AudioFrame[sample_type=f32] vs AudioFrame[sample_type=i16])
Timer auto-typing: Timer inputs are automatically typed as std/core/v1/UInt64
Type inference: When only upstream annotates a type, it is inferred on the downstream input
Metadata patterns: output_metadata keys and pattern shorthands are validated
Schema compatibility: Struct types are checked at the field level (missing/wrong fields)

User-defined types in a types/ directory next to the dataflow are loaded automatically.

# Validate with warnings
dora validate dataflow.yml

# Strict mode for CI (exit 1 on warnings)
dora validate --strict-types dataflow.yml

See the Type Annotations Guide for the full type library and usage details.

Utility Commands

`dora completion`

Generate shell completion scripts.

dora completion [SHELL]

Shell is auto-detected if omitted. Supported: bash, zsh, fish, elvish, powershell.

# Bash
eval "$(dora completion bash)"
echo 'eval "$(dora completion bash)"' >> ~/.bashrc

# Zsh
eval "$(dora completion zsh)"
echo 'eval "$(dora completion zsh)"' >> ~/.zshrc

# Fish
dora completion fish > ~/.config/fish/completions/dora.fish

`dora system`

System management commands.

dora system status [OPTIONS]

Currently provides status as a subcommand (equivalent to dora status).

Self-Management Commands

`dora self update`

Check for and install CLI updates.

dora self update [--check-only]

Downloads from GitHub releases (dora-rs/dora).

`dora self uninstall`

Remove the CLI from the system.

dora self uninstall [--force]

Without --force, prompts for confirmation (requires a TTY). Tries uv pip uninstall first, then pip uninstall, then binary self-delete.

Environment Variables

All environment variables serve as fallbacks. CLI flags always take precedence.

Variable	Default	Commands	Description
`DORA_COORDINATOR_ADDR`	`127.0.0.1`	All coordinator commands	Coordinator IP address
`DORA_COORDINATOR_PORT`	`6013`	All coordinator commands	Coordinator WebSocket port
`DORA_LOG_LEVEL`	`stdout`	`run`, `logs`	Default minimum log level
`DORA_LOG_FORMAT`	`pretty`	`run`, `logs`	Default output format
`DORA_LOG_FILTER`		`run`, `logs`	Default per-node level overrides
`DORA_ALLOW_SHELL_NODES`		`run`	Enable shell node execution
`DORA_RUNTIME_TYPE_CHECK`		`run`, `start`	Runtime type checking: `warn` (log mismatches) or `error` (fail on mismatch). See Type Annotations

# Set defaults for a development session
export DORA_COORDINATOR_ADDR=192.168.1.10
export DORA_LOG_LEVEL=info
export DORA_LOG_FORMAT=compact

Architecture Guide

This section is for developers who want to understand the framework internals, extend it, or debug issues.

Communication Stack

                    ┌─────────────────────────────────────┐
                    │           CLI (dora)                │
                    │   WebSocket (JSON request/reply)     │
                    └─────────────┬───────────────────────┘
                                  │
                    ┌─────────────▼───────────────────────┐
                    │        Coordinator                   │
                    │   WebSocket control + daemon mgmt    │
                    │   State: InMemoryStore | RedbStore   │
                    └──┬──────────────────────────────┬───┘
                       │                              │
          ┌────────────▼──────────┐     ┌─────────────▼──────────┐
          │     Daemon A          │     │     Daemon B           │
          │  (machine: robot)     │     │  (machine: gpu-server) │
          │                       │     │                        │
          │  ┌─────┐  ┌─────┐    │     │  ┌──────┐  ┌───────┐  │
          │  │Node1│  │Node2│    │     │  │Node3 │  │Node4  │  │
          │  └──┬──┘  └──┬──┘    │     │  └──┬───┘  └───┬───┘  │
          │     │shmem    │shmem  │     │     │shmem      │shmem │
          │     └────┬────┘       │     │     └─────┬─────┘      │
          └──────────┼────────────┘     └───────────┼────────────┘
                     │                              │
                     └──────── Zenoh pub/sub ────────┘
                              (cross-machine)

Protocol Layers

Layer	Transport	Format	Use
CLI <-> Coordinator	WebSocket	JSON (ControlRequest/Reply)	Commands, log streaming
Coordinator <-> Daemon	WebSocket	JSON (DaemonCoordinatorEvent)	Node lifecycle, metrics
Daemon <-> Node (small)	TCP / Unix socket	Custom binary	Control messages, small data
Daemon <-> Node (large)	Shared memory	Zero-copy Arrow	Data messages > 4KB
Daemon <-> Daemon	Zenoh pub/sub	Arrow + metadata	Cross-machine data routing

Coordinator Internals

The coordinator is an event-driven async server:

Event Sources:
  - CLI WebSocket connections (ControlRequest)
  - Daemon WebSocket connections (DaemonEvent)
  - Heartbeat timer (3s interval)
  - External events (for embedding)

Event Loop:
  merge_all(cli_events, daemon_events, heartbeat, external)
    -> handle_event()
    -> update state
    -> persist to store (if redb)
    -> send replies

Key types:

// State
RunningDataflow { uuid, name, descriptor, daemons, node_metrics, ... }
RunningBuild    { build_id, errors, log_subscribers, pending_results, ... }
DaemonConnection { sender, pending_replies, last_heartbeat }

// Store trait
trait CoordinatorStore: Send + Sync {
    fn put_dataflow(&self, record: &DataflowRecord) -> Result<()>;
    fn get_dataflow(&self, uuid: &Uuid) -> Result<Option<DataflowRecord>>;
    fn list_dataflows(&self) -> Result<Vec<DataflowRecord>>;
    // ... daemon and build methods
}

Store backends:

memory (default): In-memory, lost on restart.
redb: Persistent to disk (~/.dora/coordinator.redb). Survives crashes. Requires redb-backend feature.

dora coordinator --store redb
dora coordinator --store redb:/custom/path.redb

Daemon Internals

The daemon manages node processes on a single machine:

Per Node:
  1. Build (if build command specified)
  2. Spawn process with DORA_NODE_CONFIG env var
  3. Node registers via TCP/shmem handshake
  4. Route inputs/outputs between nodes
  5. Collect metrics (CPU, memory, I/O)
  6. Handle restart policy on exit
  7. Forward logs to coordinator

Communication:
  - Shared memory for messages > 4KB (zero-copy)
  - TCP for control messages and small data
  - flume channels for internal event routing

Metrics collection:

struct NodeMetrics {
    pid: u32,
    cpu_usage: f32,      // per-core percentage
    memory_mb: f64,
    disk_read_mb_s: Option<f64>,
    disk_write_mb_s: Option<f64>,
    status: NodeStatus,  // Running | Restarting | Degraded | Failed
    restart_count: u32,
    pending_messages: u64,
}

Message Types

All inter-component messages are defined in libraries/message/:

// Node identification
struct NodeId(String);      // [a-zA-Z0-9_.-]
struct DataId(String);      // same validation
type DataflowId = uuid::Uuid;

// Data metadata
struct Metadata {
    timestamp: uhlc::Timestamp,    // hybrid logical clock
    type_info: ArrowTypeInfo,      // Arrow schema
    parameters: MetadataParameters, // custom key-value pairs
}

// Node events (daemon -> node)
enum NodeEvent {
    Stop,
    Reload { operator_id },
    Input { id, metadata, data },
    InputClosed { id },
    InputRecovered { id },
    NodeRestarted { id },
    AllInputsClosed,
}

Timestamps

Dora uses a Unique Hybrid Logical Clock (UHLC) for distributed causality. Every message carries a uhlc::Timestamp that preserves causal ordering across machines without synchronized clocks.
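
To illustrate how a hybrid logical clock keeps causal order when wall clocks disagree, here is a simplified sketch of the general technique. It is not the uhlc crate's algorithm or data layout: timestamps are plain `(physical, counter)` tuples and the class and method names are made up for the example.

```python
class HybridLogicalClock:
    """Simplified hybrid logical clock: (physical, counter) pairs ordered lexicographically."""

    def __init__(self, physical_clock):
        self._now = physical_clock   # callable returning an integer wall-clock reading
        self.physical = 0
        self.counter = 0

    def tick(self):
        """Timestamp a local or send event."""
        wall = self._now()
        if wall > self.physical:
            self.physical, self.counter = wall, 0
        else:
            self.counter += 1
        return (self.physical, self.counter)

    def update(self, remote):
        """Merge a received timestamp so later local timestamps sort after it."""
        remote_physical, remote_counter = remote
        wall = self._now()
        latest = max(wall, self.physical, remote_physical)
        if latest == self.physical and latest == remote_physical:
            self.counter = max(self.counter, remote_counter) + 1
        elif latest == self.physical:
            self.counter += 1
        elif latest == remote_physical:
            self.counter = remote_counter + 1
        else:
            self.counter = 0
        self.physical = latest
        return (self.physical, self.counter)

# Two machines whose wall clocks disagree by 50 units (fixed readings for the demo).
robot = HybridLogicalClock(lambda: 1000)
gpu_server = HybridLogicalClock(lambda: 950)

sent = robot.tick()                 # robot stamps an outgoing message
received = gpu_server.update(sent)  # gpu_server merges it on receipt
reply = gpu_server.tick()           # a later event on gpu_server
print(sent, received, reply)
```

Even though the receiver's wall clock is behind the sender's, the three timestamps sort in causal order (`sent < received < reply`).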

Zero-Copy Shared Memory

For large messages (> 4KB), the daemon uses shared memory regions:

Sender node requests a shared memory slot from daemon
Daemon allocates a region and returns the ID
Sender writes Arrow data directly into shared memory
Daemon notifies receiver node of the region ID
Receiver reads directly from shared memory (zero-copy)
Receiver sends a drop token when done

This achieves 10-17x lower latency than ROS2 Python for large payloads.
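
The handshake above can be mimicked in a single process to show the ownership flow. The sketch below is only an analogy: it uses a `bytearray` and `memoryview` in place of OS shared memory, and `ToyDaemon` and its methods are hypothetical names, not Dora APIs.

```python
class ToyDaemon:
    """In-process stand-in for the daemon's shared memory bookkeeping."""

    def __init__(self):
        self._regions = {}
        self._next_id = 0

    def allocate(self, size):
        """Allocate a region of the requested size and return its ID."""
        region_id = self._next_id
        self._next_id += 1
        self._regions[region_id] = bytearray(size)
        return region_id

    def view(self, region_id):
        """Return a memoryview over the region; no bytes are copied."""
        return memoryview(self._regions[region_id])

    def drop(self, region_id):
        """Handle the receiver's drop token by reclaiming the region."""
        del self._regions[region_id]

    def live_regions(self):
        return len(self._regions)

daemon = ToyDaemon()
payload = b"x" * 8192                      # larger than the 4KB threshold

region = daemon.allocate(len(payload))     # sender requests a slot
with daemon.view(region) as writer:        # sender writes in place
    writer[:] = payload

with daemon.view(region) as reader:        # receiver reads the same buffer
    first_bytes = bytes(reader[:4])        # copy out only what is needed
daemon.drop(region)                        # receiver signals it is done

print(first_bytes, daemon.live_regions())
```

The point of the drop token is visible in the last step: the region can only be reclaimed once the receiver has released its view.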

Writing Nodes

Rust Node

// DataId comes from the dora_core re-export, so dora-node-api is the only Dora dependency needed
use dora_node_api::{dora_core::config::DataId, DoraNode, Event, IntoArrow};

fn main() -> eyre::Result<()> {
    let (mut node, mut events) = DoraNode::init_from_env()?;

    let output = DataId::from("result".to_owned());

    while let Some(event) = events.recv() {
        match event {
            Event::Input { id, metadata, data } => {
                // Process input data (Arrow array)
                let result: u64 = 42;
                node.send_output(
                    output.clone(),
                    metadata.parameters,
                    result.into_arrow(),
                )?;
            }
            Event::Stop(_) => break,
            Event::InputClosed { id } => {
                eprintln!("input {id} closed");
            }
            Event::InputRecovered { id } => {
                eprintln!("input {id} recovered");
            }
            _ => {}
        }
    }
    Ok(())
}

Cargo.toml:

[dependencies]
dora-node-api = { workspace = true }
eyre = "0.6"

Python Node

import pyarrow as pa
from dora import Node

node = Node()

for event in node:
    if event["type"] == "INPUT":
        # event["value"] is a PyArrow array
        values = event["value"].to_pylist()
        result = pa.array([sum(values)])
        node.send_output("result", result)
    elif event["type"] == "STOP":
        break

C Node

#include "node_api.h"

int main() {
    void *ctx = init_dora_context_from_env();
    // ... event loop using dora_next_event / dora_send_output
    free_dora_context(ctx);
    return 0;
}

Node Logging

Nodes can emit structured logs:

Rust:

// Via tracing (recommended)
tracing::info!("processing frame {}", frame_id);

// Via node API
node.log_info("processing complete");
node.log_with_fields("info", "reading", None, Some(&fields));

Python:

import logging
logging.info("processing frame %d", frame_id)

# Or via node API
node.log("info", "processing complete")

Writing Operators

Operators run in-process inside a shared runtime, avoiding process spawn overhead.

Rust Operator

use dora_operator_api::{register_operator, DoraOperator, DoraOutputSender, DoraStatus, Event};

#[register_operator]
#[derive(Default)]
pub struct MyOperator {
    counter: u32,
}

impl DoraOperator for MyOperator {
    fn on_event(
        &mut self,
        event: &Event,
        output_sender: &mut DoraOutputSender,
    ) -> Result<DoraStatus, String> {
        match event {
            Event::Input { id, data } => {
                self.counter += 1;
                output_sender.send(
                    "count".to_string(),
                    arrow::array::UInt32Array::from(vec![self.counter]),
                )?;
                Ok(DoraStatus::Continue)
            }
            Event::Stop => Ok(DoraStatus::Stop),
            _ => Ok(DoraStatus::Continue),
        }
    }
}

Cargo.toml:

[lib]
crate-type = ["cdylib"]

[dependencies]
dora-operator-api = { workspace = true }
arrow = "53"

Python Operator

nodes:
  - id: my-node
    operator:
      python: my_operator.py
      inputs:
        data: source/output
      outputs:
        - result

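# my_operator.py
import pyarrow as pa
from dora import DoraStatus


class Operator:
    def __init__(self):
        self.counter = 0

    def on_event(self, event, send_output):
        if event["type"] == "INPUT":
            self.counter += 1
            # Emit the running count as a one-element Arrow array
            send_output("result", pa.array([self.counter]))
        return DoraStatus.CONTINUE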

Distributed Deployment

Setup

# Machine A (coordinator + daemon)
dora up

# Machine B (daemon only, pointing to coordinator on Machine A)
dora daemon --interface 0.0.0.0 --coordinator-addr 192.168.1.10 --machine-id B

# Machine C (same)
dora daemon --interface 0.0.0.0 --coordinator-addr 192.168.1.10 --machine-id C

Dataflow with Machine Assignment

nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames

  - id: inference
    _unstable_deploy:
      machine: gpu-server
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions

  - id: actuator
    _unstable_deploy:
      machine: robot
    path: ./actuator-driver
    inputs:
      commands: inference/predictions

Build and Start

# From any machine with coordinator access
dora build dataflow.yml       # distributed build on target machines
dora start dataflow.yml --name my-robot --attach

Monitoring

# Resource usage across all machines
dora top

# Logs from any node regardless of machine
dora logs my-robot inference --follow

# List all dataflows
dora list

Coordinator Persistence

For production, use the redb store backend so the coordinator survives restarts:

dora coordinator --store redb

State is persisted to ~/.dora/coordinator.redb. On restart, stale dataflows are marked as failed and the coordinator resumes normal operation.

For managed cluster deployments (cluster.yml, SSH-based lifecycle, label scheduling, systemd services, rolling upgrades), see the Distributed Deployment Guide.

Troubleshooting

For a comprehensive debugging guide covering record/replay workflows, topic inspection, resource monitoring, and end-to-end debugging scenarios, see the Debugging and Observability Guide.

Common Issues

“Could not connect to dora-coordinator”

Run dora up first, or check DORA_COORDINATOR_ADDR/DORA_COORDINATOR_PORT
Verify with dora status

“enable_debug_inspection not enabled”

Use --debug flag: dora start dataflow.yml --debug or dora run dataflow.yml --debug

Or add to your dataflow YAML:

_unstable_debug:
  enable_debug_inspection: true

Required for topic echo, topic hz, topic info, and topic pub

“dora top requires an interactive terminal”

TUI commands need a real terminal (not piped output); for scripting, dora top --once prints a JSON snapshot instead
dora topic hz has the same interactive-terminal requirement

Node not receiving inputs

Check that output names match: source_node/output_id
Verify the source node lists the output in its outputs: array
Check dora topic list for available topics

Logs not appearing

Check --log-level setting (default stdout shows everything)
Check min_log_level in YAML (filters at source)
For distributed: verify coordinator/daemon connectivity

Build fails with git source

Verify git: URL is accessible
Check that branch, tag, or rev exists
Build command runs from the git repo root, not the dataflow directory

Debugging Workflow

# 1. Full environment diagnosis
dora doctor --dataflow dataflow.yml

# 2. Start with verbose logging and debug topics
dora run dataflow.yml --log-level trace --debug

# 3. Inspect a specific node
dora node info -d my-dataflow problem-node

# 4. Monitor specific node logs
dora logs my-dataflow problem-node --follow --level debug

# 5. Check resource usage
dora top

# 6. Inspect topic data
dora topic echo -d my-dataflow problem-node/output

# 7. Publish test data to a topic
dora topic pub -d my-dataflow problem-node/input '[1, 2, 3]'

# 8. Measure frequencies
dora topic hz -d my-dataflow --window 5

# 9. View/modify runtime parameters
dora param list -d my-dataflow problem-node
dora param set -d my-dataflow problem-node threshold 42

# 10. Restart a misbehaving node without stopping the dataflow
dora node restart -d my-dataflow problem-node

# 11. View coordinator traces (no external infra needed)
dora trace list
dora trace view <trace-id-prefix>

# 12. Visualize dataflow graph
dora graph dataflow.yml --open

Log File Locations

out/
  <dataflow-uuid>/
    log_<node-id>.jsonl          # current log
    log_<node-id>.1.jsonl        # rotated (previous)
    log_<node-id>.2.jsonl        # rotated (older)

Read directly with:

dora logs --local --all-nodes
dora logs --local <node-name> --tail 50

Logging

Dora provides a structured logging system for real-time robotics and AI dataflows. Logs are captured per-node as structured JSONL files, forwarded to the coordinator for live streaming, and optionally routed through the dataflow graph as data messages.

Which Logging Approach Should I Use?

Start here if you’re unsure which approach fits your use case.

I want to…	Approach	Configuration
Log from Python	Use Python’s `logging` module (auto-bridged)	Nothing – just `import logging`
Log from Rust	Use `node.log_info()` / `node.log_error()` etc.	Nothing – works out of the box
Log from C/C++	Use `dora_log()` / `log_message()`	Nothing – works out of the box
Filter noisy nodes	Set `min_log_level` in YAML	Per-node YAML field
Watch all logs in one place	Subscribe to `dora/logs` virtual input	`inputs: logs: dora/logs`
Process one node’s logs as data	Use `send_logs_as` on that node	Per-node YAML + wire the output
Rotate log files	Set `max_log_size` in YAML	Per-node YAML field
Build a custom log sink	Use `dora-log-utils` crate	Rust dependency
Filter CLI display	Use `--log-level` / `--log-filter` flags	CLI flags or env vars

Language-Specific Quick Start

Python – the simplest path is Python’s built-in logging module:

import logging
from dora import Node

node = Node()  # Automatically bridges Python logging -> dora

logging.info("Sensor started")       # Captured as structured "info" log
logging.warning("High temp: 42C")    # Captured as structured "warn" log
print("raw debug output")            # Captured as "stdout" level

When Node() is created, it installs a handler that routes all Python logging calls through Rust’s tracing system. The daemon parses these as structured log entries with level, message, file, and line number. No extra configuration needed.

You can also use the explicit API for structured fields:

node.log_info("Reading acquired")
node.log("info", "Reading acquired", fields={"sensor_id": "temp-01"})

Rust – use the node API convenience methods:

use std::collections::BTreeMap;
use dora_node_api::DoraNode;

let (node, mut events) = DoraNode::init_from_env()?;

// Convenience methods (recommended for most cases)
node.log_info("Sensor started");
node.log_warn("High temperature");

// With structured fields
let mut fields = BTreeMap::new();
fields.insert("sensor_id".into(), "temp-01".into());
node.log_with_fields("info", "Reading acquired", None, Some(&fields));

Alternatively, Rust nodes can use the tracing crate. When dora’s tracing subscriber is initialized (via init_tracing()), tracing::info!() etc. output structured JSON to stdout, which the daemon parses automatically:

// Also works: parsed as structured logs by the daemon
tracing::info!("Sensor started");
tracing::warn!(sensor_id = "temp-01", "High temperature");

Use node.log_*() when you want explicit control over the log format. Use tracing::*!() when you want ecosystem integration (spans, instrumentation, OpenTelemetry). Both produce identical structured log entries in the daemon.

C – use the dora_log() function:

dora_log(ctx, "info", 4, "Sensor started", 14);

C++ – use the log_message() function:

log_message(node.send_output, "info", "Sensor started");

Features at a Glance

Feature	Scope	Configuration
Log level filtering	CLI display	`--log-level`, `DORA_LOG_LEVEL`
Output formats	CLI display	`--log-format`, `DORA_LOG_FORMAT`
Per-node level overrides	CLI display	`--log-filter`, `DORA_LOG_FILTER`
Source-level filtering	Per-node YAML	`min_log_level`
Stdout-as-data routing	Per-node YAML	`send_stdout_as`
Structured log routing	Per-node YAML	`send_logs_as`
Log file rotation	Per-node YAML	`max_log_size`
Rotation file limit	Per-node YAML	`max_rotated_files`
Node log API	Rust/Python/C/C++ node	`node.log()`, `dora_log()`, etc.
Log utilities library	Rust crate	`dora-log-utils`
Log aggregation	Dataflow input	`dora/logs` virtual input
Time-range filtering	`dora logs`	`--since`, `--until`
Live log streaming	`dora logs`	`--follow`
Text search	`dora logs`	`--grep`
Local log reading	`dora logs`	`--local`, `--all-nodes`

Log File Format

Each node produces a JSONL file (one JSON object per line) at:

<working_dir>/out/<dataflow_uuid>/log_<node_id>.jsonl

Each line has this structure:

{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "info",
  "node_id": "sensor",
  "message": "Starting sensor...",
  "target": "sensor::module",
  "fields": { "key": "value" }
}

Field	Type	Description
`timestamp`	string	RFC3339 timestamp with millisecond precision
`level`	string	`"error"`, `"warn"`, `"info"`, `"debug"`, `"trace"`, or `"stdout"`
`node_id`	string	Node ID
`message`	string	The log message text
`target`	string?	Rust module target (e.g. `"sensor::module"`), null if absent
`fields`	object?	Structured key-value fields from the logging framework. Trust model: fields originate from node stdout and are passed through without sanitization. In mixed-trust environments, log consumers should validate field contents before acting on them

How Node Output Becomes Log Entries

The daemon captures each line of stdout/stderr from a node process and attempts to parse it as a structured log message (JSON with level, message, timestamp, and optional fields). If parsing succeeds, the structured fields are preserved. If parsing fails, the raw line becomes a "stdout"-level entry.

This means nodes using Rust’s tracing or log crate with JSON output get full structured logging automatically. Nodes that simply println! produce "stdout"-level entries.
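
The parse-or-fallback rule can be sketched like this. It is a simplified illustration, not the daemon's parser: the criterion "a JSON object with `level` and `message` keys" is an assumption, and the fallback entry leaves the timestamp empty where the daemon would record one.

```python
import json

def to_log_entry(node_id, line):
    """Parse one captured output line; fall back to a stdout-level entry."""
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        parsed = None
    # Treat only JSON objects carrying level and message as structured (assumed criterion).
    if isinstance(parsed, dict) and "level" in parsed and "message" in parsed:
        return {
            "node_id": node_id,
            "level": str(parsed["level"]).lower(),
            "message": parsed["message"],
            "timestamp": parsed.get("timestamp"),
            "fields": parsed.get("fields"),
        }
    return {"node_id": node_id, "level": "stdout", "message": line,
            "timestamp": None, "fields": None}

structured = to_log_entry("sensor", '{"level": "INFO", "message": "Starting sensor..."}')
raw = to_log_entry("sensor", "raw output line")
print(structured["level"], raw["level"])
```

A line that happens to be valid JSON but is not a log object (for example a bare number) still ends up as a "stdout"-level entry in this sketch.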

Viewing Logs: `dora run`

When running a dataflow with dora run, logs from all nodes are displayed in real-time on the terminal.

Flags

dora run dataflow.yml [OPTIONS]

Flag	Default	Env Var	Description
`--log-level LEVEL`	`stdout`	`DORA_LOG_LEVEL`	Minimum level to display
`--log-format FORMAT`	`pretty`	`DORA_LOG_FORMAT`	Output format: `pretty`, `json`, `compact`
`--log-filter FILTER`	none	`DORA_LOG_FILTER`	Per-node level overrides

Log Levels

From most to least verbose:

Level	Description
`stdout`	Everything including raw stdout from nodes (default)
`trace`	Fine-grained diagnostic messages
`debug`	Developer-level diagnostic messages
`info`	General informational messages
`warn`	Warning conditions
`error`	Error conditions only

Setting --log-level info hides stdout, trace, and debug messages. The stdout level is a special catch-all that passes everything.

Level Filtering Logic

The level filter uses LogLevelOrStdout::passes():

Message level    Filter level    Displayed?
─────────────    ────────────    ──────────
stdout           stdout          yes
stdout           info            no       (stdout only passes stdout filter)
info             stdout          yes      (any log level passes stdout filter)
debug            info            no       (debug is more verbose than info)
error            info            yes      (error is less verbose than info)
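
The table can be restated as a small predicate. This is a Python paraphrase of the rules shown above, not the Rust implementation of LogLevelOrStdout::passes(); the `LEVELS` list and the function signature are assumptions made for the example.

```python
# Severity order, least to most verbose; "stdout" is handled separately.
LEVELS = ["error", "warn", "info", "debug", "trace"]

def passes(message_level, filter_level):
    """Return True if a message at message_level is displayed under filter_level."""
    if filter_level == "stdout":
        return True                      # the stdout filter passes everything
    if message_level == "stdout":
        return False                     # raw stdout only passes the stdout filter
    return LEVELS.index(message_level) <= LEVELS.index(filter_level)

for message, level_filter in [("stdout", "stdout"), ("stdout", "info"), ("info", "stdout"),
                              ("debug", "info"), ("error", "info")]:
    print(message, level_filter, passes(message, level_filter))
```

The loop walks the same five rows as the table, so its output can be compared against the "Displayed?" column line by line.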

Per-Node Overrides

The --log-filter flag lets you set different levels for different nodes:

dora run dataflow.yml --log-level info --log-filter "sensor=debug,planner=warn"

This shows info and above for all nodes, except sensor (shows debug and above) and planner (shows warn and above).

Format: "node1=level,node2=level" (comma-separated name=level pairs).
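
A sketch of how such a filter string maps to per-node levels is shown below. The helper names are hypothetical, and how the CLI itself treats whitespace or malformed pairs is an assumption here.

```python
def parse_log_filter(spec):
    """Parse "node1=level,node2=level" into a {node: level} dict."""
    overrides = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        node, sep, level = pair.partition("=")
        if not sep or not node or not level:
            raise ValueError(f"expected name=level, got {pair!r}")
        overrides[node.strip()] = level.strip().lower()
    return overrides

def effective_level(node, default_level, overrides):
    """Per-node override if present, otherwise the global --log-level."""
    return overrides.get(node, default_level)

overrides = parse_log_filter("sensor=debug,planner=warn")
print(effective_level("sensor", "info", overrides))
print(effective_level("camera", "info", overrides))
```

Nodes without an entry, such as `camera` here, simply fall back to the global level.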

Output Formats

Pretty (default) – colored, human-readable:

10:30:00 INFO   sensor: Starting sensor...

10:30:01 INFO   [dora]: spawning node processor

10:30:01 stdout sensor: raw output line

Timestamp in local timezone (HH:MM:SS)
Level colored: ERROR (red), WARN (yellow), INFO (green), DEBUG (blue), TRACE (dimmed), stdout (italic dimmed blue)
Node name in bold with a unique color based on the name
System messages prefixed with [dora]
Lifecycle messages (spawning, node finished, stopping) get visual separation with blank lines

JSON – the full LogMessage struct as JSON, one object per line:

{"build_id":null,"dataflow_id":"abc-123","node_id":"sensor","level":"INFO","message":"Starting...","timestamp":"2024-01-15T10:30:00Z",...}

Useful for piping to jq or ingesting into log aggregation systems.

Compact – minimal, no color:

10:30:00 INFO sensor: Starting sensor...

Useful for CI/CD environments and log files.

Viewing Logs: `dora logs`

Read historical logs or stream live logs from a running dataflow.

Basic Usage

# Read logs for a specific node (via coordinator)
dora logs <dataflow_uuid> <node_name>

# Read local log files directly
dora logs --local <node_name>
dora logs --local --all-nodes

# Stream live logs
dora logs <dataflow_uuid> <node_name> --follow
dora logs --local <node_name> --follow

Flags

Flag	Short	Default	Description
`--local`		false	Read from local `out/` directory instead of coordinator
`--all-nodes`		false	Merge logs from all nodes, sorted by timestamp
`--tail N`	`-n`	all	Show only the last N lines
`--follow`	`-f`	false	Stream new log entries as they arrive
`--since DURATION`		none	Only show logs newer than this duration ago
`--until DURATION`		none	Only show logs older than this duration ago
`--level LEVEL`		`stdout`	Minimum log level (env: `DORA_LOG_LEVEL`)
`--grep PATTERN`		none	Case-insensitive text search
`--coordinator-addr IP`		`127.0.0.1`	Coordinator address
`--coordinator-port PORT`		default	Coordinator control port

Time Filters

--since and --until accept duration strings relative to now:

# Logs from the last 5 minutes
dora logs --local sensor --since 5m

# Logs from 1 hour ago to 30 minutes ago
dora logs --local sensor --since 1h --until 30m

# Last 10 errors from the past hour
dora logs --local sensor --since 1h --level error --tail 10

Supported duration formats: 30 (seconds), 30s, 5m, 1h, 2d.
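
For scripts that need to compute the same cutoffs, the listed formats can be converted to seconds as in this sketch. It covers only the forms named above; whether the CLI accepts others (for example combined values like `1h30m`) is not assumed.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(text):
    """Convert "30", "30s", "5m", "1h" or "2d" into a number of seconds."""
    text = text.strip()
    if text and text[-1] in UNIT_SECONDS:
        number, unit = text[:-1], text[-1]
    else:
        number, unit = text, "s"   # a bare number means seconds
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"unsupported duration: {text!r}")
    return int(number) * UNIT_SECONDS[unit]

for example in ["30", "30s", "5m", "1h", "2d"]:
    print(example, parse_duration(example))
```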

Text Search

--grep performs case-insensitive substring matching against:

The log message text
The node ID
The module target

# Find all timeout-related messages
dora logs --local --all-nodes --grep "timeout"

# Find errors from a specific module
dora logs --local sensor --grep "camera::driver" --level error

Filter Pipeline

All filters are applied in this order:

Read/Parse -> Time Filters -> Grep -> Tail -> Display

When --since, --until, or --grep are used in coordinator mode, the CLI fetches all logs from the server (ignoring --tail server-side) and applies all filters client-side. This ensures correct results when combining filters.
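
The ordering matters most for --tail: because it runs last, it counts entries that survived the other filters. The sketch below applies the stages in the documented order to in-memory entries; the entry layout, the inclusive time boundaries, and the sample data are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def filter_logs(entries, now, since=None, until=None, grep=None, tail=None):
    """Apply time filters, then grep, then tail, in the documented order."""
    result = list(entries)
    if since is not None:   # keep entries newer than `since` ago
        result = [e for e in result if e["timestamp"] >= now - since]
    if until is not None:   # keep entries older than `until` ago
        result = [e for e in result if e["timestamp"] <= now - until]
    if grep is not None:    # case-insensitive match on message, node ID, or target
        needle = grep.lower()
        result = [
            e for e in result
            if any(needle in (e.get(key) or "").lower()
                   for key in ("message", "node_id", "target"))
        ]
    if tail is not None:    # tail runs last, so it counts matching entries only
        result = result[-tail:] if tail > 0 else []
    return result

now = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
entries = [
    {"timestamp": now - timedelta(minutes=90), "node_id": "sensor", "message": "timeout on camera"},
    {"timestamp": now - timedelta(minutes=45), "node_id": "sensor", "message": "timeout, retrying"},
    {"timestamp": now - timedelta(minutes=20), "node_id": "sensor", "message": "frame ok"},
    {"timestamp": now - timedelta(minutes=5), "node_id": "sensor", "message": "Timeout again"},
]
recent = filter_logs(entries, now, since=timedelta(hours=1), grep="timeout", tail=1)
print([e["message"] for e in recent])
```

If tail were applied before grep, the last entry of the time window would be selected first and could then be discarded by the search, which is why the CLI fetches everything and filters client-side in this case.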

Local vs Coordinator Mode

Local mode (--local) reads JSONL files directly from the out/ directory in the current working directory. No coordinator or daemon needs to be running. If --all-nodes is used or no node name is given, all log files are merged and sorted by timestamp.

Coordinator mode (default) connects to a running coordinator via WebSocket. The coordinator reads log files from the daemon’s working directory and streams them back. This works for both local and distributed deployments.

Follow Mode

Local follow (--local --follow): Polls log files every 200ms for new content. New lines are parsed, filtered by --grep, and printed. Time/tail filters only apply to the initial historical output.

Coordinator follow (--follow): Opens a WebSocket subscription to the coordinator. The coordinator forwards log messages from the daemon in real-time. Level filtering is applied server-side for efficiency. --grep and --since are applied client-side on the stream.
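
Local follow can be pictured as a loop that remembers a byte offset and re-reads the file on each poll. The sketch below is illustrative only: the `LogFollower` class is hypothetical, it greps raw lines rather than parsed fields, and it starts at the end of the file on the assumption that historical output has already been printed separately.

```python
import os
import time


class LogFollower:
    """Return lines appended to a file since the previous poll."""

    def __init__(self, path, grep=None):
        self.path = path
        self.needle = grep.lower() if grep else None
        self.offset = os.path.getsize(path)  # skip content that already exists

    def poll(self):
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
        # Consume complete lines only; a partial last line waits for the next poll.
        end = chunk.rfind(b"\n") + 1
        self.offset += end
        lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
        return [l for l in lines if self.needle is None or self.needle in l.lower()]


def follow_forever(path, grep=None, interval=0.2):
    """Print new matching lines every `interval` seconds (200 ms, as in the text)."""
    follower = LogFollower(path, grep)
    while True:
        for line in follower.poll():
            print(line)
        time.sleep(interval)
```

Tracking the offset by complete lines avoids printing half-written JSONL records when a poll lands in the middle of a write.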

Environment Variables

All environment variables serve as fallbacks – CLI flags always take precedence.

Variable	Used By	Values	Description
`DORA_LOG_LEVEL`	`dora run`, `dora logs`	`error`, `warn`, `info`, `debug`, `trace`, `stdout`	Default minimum log level
`DORA_LOG_FORMAT`	`dora run`	`pretty`, `json`, `compact`	Default output format
`DORA_LOG_FILTER`	`dora run`	`"node1=level,node2=level"`	Default per-node overrides
`DORA_QUIET`	daemon	any value	Suppress log forwarding to display (file writing continues)

Example:

# Set defaults for a development session
export DORA_LOG_LEVEL=info
export DORA_LOG_FORMAT=pretty
export DORA_LOG_FILTER="sensor=debug"

# These are equivalent:
dora run dataflow.yml
dora run dataflow.yml --log-level info --log-format pretty --log-filter "sensor=debug"

# CLI flag overrides env var:
dora run dataflow.yml --log-level debug   # overrides DORA_LOG_LEVEL=info

YAML Configuration

`min_log_level`

Filter logs at the source (daemon-side) before they reach log files, the coordinator, or send_logs_as routing.

nodes:
  - id: noisy-sensor
    path: ./target/debug/sensor
    min_log_level: info    # suppress debug/trace/stdout from this node

Valid values: error, warn, info, debug, trace, stdout.

When set, the daemon drops log messages below this level immediately after parsing. This reduces disk I/O, network traffic, and log file size. The filtering uses the same passes() logic as the CLI display filter.
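
The level comparison can be sketched in a few lines of Python. This is not the daemon's `passes()` implementation; the function name `level_passes` is hypothetical, and the ordering (with `stdout` as the most verbose level) is inferred from the comment that `min_log_level: info` suppresses debug, trace, and stdout.

```python
# Most severe first; "stdout" is assumed to rank below "trace".
LEVELS = ["error", "warn", "info", "debug", "trace", "stdout"]


def level_passes(message_level: str, min_level: str) -> bool:
    """Return True if a message at `message_level` survives a `min_level` filter."""
    return LEVELS.index(message_level) <= LEVELS.index(min_level)


if __name__ == "__main__":
    for level in LEVELS:
        verdict = "kept" if level_passes(level, "info") else "dropped"
        print(f"{level:>6} with min_log_level=info: {verdict}")
```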

`send_stdout_as`

Route raw stdout/stderr lines as dataflow output messages.

nodes:
  - id: legacy-node
    path: ./legacy-script.py
    send_stdout_as: raw_output
    outputs:
      - raw_output
      - data

  - id: log-consumer
    inputs:
      logs: legacy-node/raw_output

Each stdout/stderr line is sent as an Arrow-encoded string. This is useful for integrating legacy nodes that output data on stdout (e.g., Python scripts using print()).

Both send_stdout_as and normal log file writing happen – stdout routing does not suppress log files.

`send_logs_as`

Route parsed structured log entries as dataflow output messages.

nodes:
  - id: sensor
    path: ./target/debug/sensor
    send_logs_as: log_entries
    outputs:
      - data
      - log_entries

  - id: log-aggregator
    inputs:
      sensor_logs: sensor/log_entries

Unlike send_stdout_as, this only sends lines that were successfully parsed as structured logs (not raw stdout). Each entry is serialized as a full JSON LogMessage string. The min_log_level filter applies before routing – suppressed messages are not sent.

Use this to build log aggregation, alerting, or monitoring nodes within the dataflow itself.

`dora/logs` – Automatic Log Aggregation

Subscribe to logs from all nodes with a single input line – no manual wiring needed:

nodes:
  - id: sensor
    path: sensor.py
    inputs:
      tick: dora/timer/millis/200
    outputs:
      - reading

  - id: processor
    path: processor.py
    inputs:
      reading: sensor/reading
    outputs:
      - result

  - id: log-viewer
    path: log_viewer.py
    inputs:
      logs: dora/logs              # all nodes, all levels
      errors: dora/logs/error      # only error+ from all nodes
      sensor: dora/logs/info/sensor  # info+ from one node

The dora/logs virtual input works like dora/timer – the daemon handles subscription internally. Each log message arrives as a JSON-encoded LogMessage string in an Arrow array. To prevent infinite loops, a node never receives its own log messages.

Syntax:

Input	Description
`dora/logs`	All logs from all nodes
`dora/logs/<level>`	Logs at `<level>` or above from all nodes
`dora/logs/<level>/<node-id>`	Logs at `<level>` or above from a specific node

Levels: stdout, error, warn, info, debug, trace.
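
To make the three input forms concrete, here is a small Python sketch that splits a `dora/logs` input string into its optional level and node ID. The function `parse_logs_input` is hypothetical and only mirrors the syntax table; how the daemon actually parses these inputs is not shown in this document.

```python
LEVELS = ("stdout", "error", "warn", "info", "debug", "trace")


def parse_logs_input(spec: str):
    """Split 'dora/logs[/<level>[/<node-id>]]' into (level, node_id).

    None means "no restriction" (all levels or all nodes).
    """
    parts = spec.split("/")
    if parts[:2] != ["dora", "logs"] or len(parts) > 4:
        raise ValueError(f"not a dora/logs input: {spec!r}")
    level = parts[2] if len(parts) > 2 else None
    node_id = parts[3] if len(parts) > 3 else None
    if level is not None and level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if node_id == "":
        raise ValueError("empty node id")
    return level, node_id


if __name__ == "__main__":
    for spec in ("dora/logs", "dora/logs/error", "dora/logs/info/sensor"):
        print(spec, "->", parse_logs_input(spec))
```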

When to use dora/logs vs send_logs_as:

	`dora/logs`	`send_logs_as`
Scope	All nodes at once	One node at a time
YAML changes	Only the consumer	Each source node
Adding a node	Zero wiring changes	Must update consumer
Use case	Dashboard, monitoring	Per-node log processing

See examples/log-aggregator/ for a complete working example.

`max_log_size`

Enable size-based log file rotation.

nodes:
  - id: sensor
    path: ./target/debug/sensor
    max_log_size: "50MB"

Value	Bytes
`"1KB"` or `"1K"`	1,024
`"50MB"` or `"50M"`	52,428,800
`"1GB"` or `"1G"`	1,073,741,824
`"1000"`	1,000 (plain number = bytes)
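
The table implies binary (1024-based) multipliers. A minimal Python sketch of such a parser follows; the name `parse_size` is hypothetical, and accepting lowercase suffixes is an assumption that the table does not confirm.

```python
import re

_MULTIPLIERS = {
    "": 1,
    "K": 1024, "KB": 1024,
    "M": 1024 ** 2, "MB": 1024 ** 2,
    "G": 1024 ** 3, "GB": 1024 ** 3,
}


def parse_size(text: str) -> int:
    """Convert '1KB', '50M', '1GB' or a plain number into a byte count."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text.strip())
    if match is None or match.group(2).upper() not in _MULTIPLIERS:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _MULTIPLIERS[match.group(2).upper()]


if __name__ == "__main__":
    for example in ("1KB", "50MB", "1G", "1000"):
        print(example, "->", parse_size(example), "bytes")
```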

When the active log file exceeds the configured size, the daemon:

Flushes and closes the current file
Renames existing rotated files: .4.jsonl -> .5.jsonl, .3.jsonl -> .4.jsonl, etc.
Renames the current file: log_sensor.jsonl -> log_sensor.1.jsonl
Creates a fresh log_sensor.jsonl
Deletes any file beyond the rotation limit (default 5, configurable via max_rotated_files)

Naming convention:

log_sensor.jsonl       # current (active)
log_sensor.1.jsonl     # previous
log_sensor.2.jsonl     # older
log_sensor.3.jsonl
log_sensor.4.jsonl
log_sensor.5.jsonl     # oldest (deleted on next rotation)
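
The rotation steps and naming convention above can be sketched in Python as follows. This is an illustration, not the daemon's code: the `rotate` function is hypothetical, and it deletes the oldest file before shifting (rather than after) so that no rename ever lands on an existing file.

```python
from pathlib import Path


def rotate(log_dir: Path, stem: str = "log_sensor", max_rotated_files: int = 5) -> None:
    """Shift <stem>.N.jsonl to <stem>.(N+1).jsonl and start a fresh active file."""

    def rotated(n: int) -> Path:
        return log_dir / f"{stem}.{n}.jsonl"

    # Drop the oldest file so the shift below stays within the limit.
    rotated(max_rotated_files).unlink(missing_ok=True)
    # Shift starting from the oldest remaining file so nothing is overwritten.
    for n in range(max_rotated_files - 1, 0, -1):
        if rotated(n).exists():
            rotated(n).rename(rotated(n + 1))
    active = log_dir / f"{stem}.jsonl"
    if active.exists():
        active.rename(rotated(1))
    active.touch()  # fresh, empty active file
```

After enough rotations the directory holds one active file plus at most `max_rotated_files` rotated files, which is where the disk-usage formula below comes from.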

Maximum disk usage per node: max_log_size * (1 + max_rotated_files) (1 active + N rotated).

Without max_log_size, log files grow unbounded. For long-running dataflows, always set this.

The dora logs --local command automatically reads all rotated files for a node and merges them in chronological order (oldest rotated file first, current file last).

`max_rotated_files`

Control how many rotated log files to keep (default: 5, range: 1-100).

nodes:
  - id: sensor
    path: ./target/debug/sensor
    max_log_size: "50MB"
    max_rotated_files: 10    # keep 10 rotated files instead of 5

With max_rotated_files: 10 and max_log_size: "50MB", maximum disk usage is 50MB * 11 = 550MB per node. Lower values save disk space; higher values preserve more history.

Runtime Node Restrictions

For runtime nodes (operators), only one of each logging field is allowed per runtime:

# OK -- single operator
nodes:
  - id: runtime-node
    operator:
      python: process.py
      send_logs_as: logs
      min_log_level: info
      max_log_size: "100MB"

# ERROR -- multiple operators with conflicting configs
nodes:
  - id: runtime-node
    operators:
      - id: op1
        python: a.py
        send_logs_as: logs1
      - id: op2
        python: b.py
        send_logs_as: logs2    # Error: multiple send_logs_as

When a single operator in a runtime sets these fields, the output name is prefixed with the operator ID (e.g., op1/logs).

Node Log API

Nodes can emit structured log messages programmatically using the node API. These are equivalent to writing JSON-formatted log lines to stdout – the daemon parses them identically.

Rust

use dora_node_api::DoraNode;
use std::collections::BTreeMap;

fn main() -> eyre::Result<()> {
    // `_events` is unused here; a real node would poll it for inputs
    let (node, _events) = DoraNode::init_from_env()?;

    // General log with level string and optional target
    node.log("info", "sensor initialized", Some("sensor::init"));

    // Convenience methods (no target parameter)
    node.log_error("connection failed");
    node.log_warn("temperature elevated");
    node.log_info("reading acquired");
    node.log_debug("raw bytes received");
    node.log_trace("entering loop iteration");

    // Structured fields (key-value context preserved through send_logs_as)
    let mut fields = BTreeMap::new();
    fields.insert("sensor_id".to_string(), "temp-01".to_string());
    fields.insert("reading".to_string(), "42.5".to_string());
    node.log_with_fields("info", "reading acquired", None, Some(&fields));

    Ok(())
}

The level parameter accepts "error", "warn" (or "warning"), "info", "debug", "trace". Unknown levels default to "info". Fields are capped at 60 KB total to match the downstream 64 KB parse limit.

Python

Python nodes have three ways to log, all producing structured log entries:

from dora import Node
import logging

node = Node()

# Option 1: Python's logging module (recommended -- auto-bridged by Node())
logging.info("sensor initialized")
logging.warning("temperature elevated")
logging.debug("raw bytes: %s", data)

# Option 2: Explicit dora API with level string
node.log("info", "sensor initialized", target="sensor.init")
node.log("info", "reading acquired", fields={"sensor_id": "temp-01", "reading": "42.5"})

# Option 3: Convenience methods
node.log_error("connection failed")
node.log_warn("temperature elevated")
node.log_info("reading acquired")
node.log_debug("raw bytes received")
node.log_trace("entering loop iteration")

# This also works but produces "stdout"-level entries (no structure):
print("raw output")

How the Python logging bridge works: When Node() is created, it installs a custom logging.Handler that routes all Python logging calls through Rust’s tracing system. The daemon parses these as structured log entries with level, message, file path, and line number. This happens automatically – no configuration needed.

Method	Structured?	Fields support?	When to use
`logging.info()`	Yes	No (use `extra=` for custom formatters)	General-purpose logging
`node.log("info", msg, fields={...})`	Yes	Yes	When you need structured key-value context
`node.log_info(msg)`	Yes	No	Quick one-liner, same as `node.log("info", msg)`
`print()`	No (`stdout` level)	No	Legacy code, quick debugging

Common pitfall: Do not call logging.basicConfig() before creating Node(). The node constructor sets up the logging bridge; calling basicConfig() first may install a conflicting handler. If you need custom formatters, configure them after Node() creation.
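
To show the general mechanism behind the logging bridge, the sketch below installs a custom `logging.Handler` that turns each record into a structured dictionary. It is only an analogy: the real bridge forwards to Rust's tracing system, whereas this hypothetical `ForwardingHandler` hands records to an arbitrary Python callable.

```python
import logging


class ForwardingHandler(logging.Handler):
    """Forward each log record to `sink` as a structured dictionary."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink  # any callable accepting one dict

    def emit(self, record: logging.LogRecord) -> None:
        self.sink({
            "level": record.levelname.lower(),
            "message": record.getMessage(),  # applies %-style arguments
            "file": record.pathname,
            "line": record.lineno,
        })


if __name__ == "__main__":
    captured = []
    logger = logging.getLogger("bridge-demo")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(ForwardingHandler(captured.append))
    logger.warning("temperature %d", 42)
    print(captured[0]["level"], captured[0]["message"])
```

Because the handler is attached to a logger, a handler installed earlier by `logging.basicConfig()` would run alongside it, which is one way the conflict described in the pitfall can arise.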

C

#include <string.h>   // strlen
#include "node_api.h"

void *ctx = init_dora_context_from_env();
const char *level = "info";
const char *msg = "sensor initialized";
// Level and message are passed as pointer + length pairs
dora_log(ctx, level, strlen(level), msg, strlen(msg));

C++

// Via the cxx bridge
auto node = init_dora_node();
log_message(node.send_output, "info", "sensor initialized");

Log Utilities Library (`dora-log-utils`)

The dora-log-utils crate provides parsing, merging, filtering, and formatting utilities for working with LogMessage entries in custom sink nodes. Use it when building nodes that consume log data via send_logs_as.

API

// Parse a LogMessage from JSON (as received from send_logs_as)
let log = dora_log_utils::parse_log(json_str)?;

// Parse directly from Arrow input data (convenience for event handlers)
let log = dora_log_utils::parse_log_from_arrow(&data)?;

// Merge multiple log streams into a single timeline
let merged = dora_log_utils::merge_by_timestamp(vec![stream_a, stream_b]);

// Filter by minimum level
let errors = dora_log_utils::filter_by_level(&logs, &min_level);

// Format as JSON (one line, no trailing newline)
let json = dora_log_utils::format_json(&log);

// Format as compact single-line: "<timestamp> <node> <LEVEL>: <message>"
let compact = dora_log_utils::format_compact(&log);

// Format as pretty: "[<timestamp>][<LEVEL>][<node>] <message>"
let pretty = dora_log_utils::format_pretty(&log);

Dependency

Add to your sink node’s Cargo.toml:

[dependencies]
dora-log-utils = { workspace = true }

Log Sink Examples

Three example sink nodes demonstrate how to consume logs routed via send_logs_as (or, in the alert router example, send_stdout_as) and forward them to external destinations.

File Sink (`examples/log-sink-file/`)

Merges log streams from multiple nodes into a single JSONL file. Useful for unified log collection.

nodes:
  - id: sensor
    path: sensor.py
    send_logs_as: log_entries
    inputs:
      tick: dora/timer/millis/200
    outputs:
      - reading
      - log_entries

  - id: processor
    path: processor.py
    send_logs_as: log_entries
    inputs:
      reading: sensor/reading
    outputs:
      - result
      - log_entries

  - id: file_sink
    path: log-sink-file
    inputs:
      sensor_logs: sensor/log_entries
      processor_logs: processor/log_entries
    env:
      LOG_FILE: "./combined.jsonl"

The file sink reads LOG_FILE from the environment (default ./combined.jsonl), parses each incoming Arrow message with dora_log_utils::parse_log_from_arrow(), formats it as JSON, and appends it to the file.

TCP Sink (`examples/log-sink-tcp/`)

Forwards log entries over a TCP socket to a remote log collector. Useful for embedded systems that lack local filesystems and need to stream logs off-device.

nodes:
  - id: source
    path: source.py
    send_logs_as: log_entries
    inputs:
      tick: dora/timer/millis/500
    outputs:
      - data
      - log_entries

  - id: tcp_sink
    path: log-sink-tcp
    inputs:
      logs: source/log_entries
    env:
      SINK_ADDR: "127.0.0.1:9876"

The TCP sink reads SINK_ADDR from the environment (default 127.0.0.1:9876), connects to the server on startup, and sends each log entry as a JSON line. It reconnects automatically on write failure.

Alert Router (`examples/log-sink-alert/`)

Splits incoming log entries by severity. All logs are forwarded to the all_logs output; only error and warn logs are forwarded to the alerts output. This enables downstream nodes to handle alerts differently (e.g., trigger notifications, write to a dedicated file).

nodes:
  - id: source
    path: my_node.py
    send_stdout_as: log_entries
    inputs:
      tick: dora/timer/millis/200
    outputs:
      - log_entries

  - id: alert_router
    path: log-sink-alert
    inputs:
      logs: source/log_entries
    outputs:
      - all_logs
      - alerts

The source node uses send_stdout_as to route its stdout lines as Arrow string data. The router parses each log entry with dora_log_utils::parse_log_from_arrow(), checks the level, and uses node.send_output() to forward data to the appropriate outputs. Nodes using the node API can alternatively use send_logs_as to route structured logs from node.log().

Building a Custom Sink

To build your own sink node, follow this pattern:

use dora_node_api::{DoraNode, Event};

fn main() -> eyre::Result<()> {
    let (_node, mut events) = DoraNode::init_from_env()?;

    while let Some(event) = events.recv() {
        match event {
            Event::Input { data, .. } => {
                let log = dora_log_utils::parse_log_from_arrow(&data)?;
                // Process the log entry: write to file, send over network, etc.
                let json = dora_log_utils::format_json(&log);
                println!("{json}");
            }
            Event::Stop(_) => break,
            _ => {}
        }
    }
    Ok(())
}

How the Daemon Processes Logs

Understanding the internal pipeline helps with debugging and tuning. For each node, the daemon runs a dedicated async task that processes log lines in order:

Node Process (stdout/stderr)
    |
    v
[1] Capture: lines buffered in mpsc channel (capacity 100)
    |
    v
[2] send_stdout_as: raw line -> Arrow data -> dataflow output
    |
    v
[3] Parse: try JSON structured log, fall back to Stdout-level
    |
    v
[4] min_log_level filter: drop messages below threshold
    |
    v
[5] send_logs_as: LogMessage -> JSON -> Arrow data -> dataflow output
    |
    v
[6] Write JSONL: compact format to log file, track bytes written
    |
    v
[7] Rotation check: if bytes_written >= max_log_size, rotate files
    |
    v
[8] Forward: send LogMessage to display channel (unless DORA_QUIET)
    |
    v
[9] Sync: fsync log file to disk

Key details:

Step 2 happens before parsing, so send_stdout_as captures every line including non-structured output
Step 4 happens before Steps 5-8, so min_log_level suppresses messages from all downstream processing
Step 5 only fires for successfully parsed structured logs (Step 3 success path)
Step 8 sends to either a flume channel (dora run direct mode) or the coordinator (distributed mode)
Step 9 calls sync_all() after every write, ensuring durability at the cost of some I/O overhead
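
The ordering of steps 2 through 6 can be illustrated with a simplified Python model. Everything here is a sketch under assumptions: `process_line` is hypothetical, plain lists stand in for dataflow outputs and the log file, the stdout routing is always on (in dora it requires send_stdout_as), and a line counts as structured only if it is a JSON object with a known `level` and a `message`.

```python
import json

# Most severe first; "stdout" is assumed to be the most verbose level.
LEVELS = ["error", "warn", "info", "debug", "trace", "stdout"]


def process_line(line, min_level, stdout_out, logs_out, log_file):
    """Run one captured line through steps 2-6; return the record or None."""
    stdout_out.append(line)  # [2] raw line is routed before any parsing
    try:  # [3] try structured JSON, else fall back to stdout level
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    structured = (
        isinstance(record, dict)
        and str(record.get("level", "")).lower() in LEVELS[:-1]
        and "message" in record
    )
    if structured:
        record = dict(record, level=str(record["level"]).lower())
    else:
        record = {"level": "stdout", "message": line}
    if LEVELS.index(record["level"]) > LEVELS.index(min_level):
        return None  # [4] dropped: nothing downstream sees it
    if structured:
        logs_out.append(json.dumps(record))  # [5] only parsed entries are routed
    log_file.append(json.dumps(record))  # [6] one JSON object per line
    return record
```

Feeding a debug-level JSON line with `min_level="info"` adds it to `stdout_out` but not to `logs_out` or `log_file`, which matches the first two key details above.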

Structured Log Parsing

When a node emits JSON-formatted log output (e.g., from tracing-subscriber with JSON formatting), the daemon extracts:

level: log severity
message: the log text
target: module path
timestamp: when the log was emitted
fields: arbitrary key-value pairs
build_id, dataflow_id, node_id, daemon_id: extracted from fields as fallback

The daemon also sets dataflow_id, node_id, and daemon_id on all messages to ensure they are always present in the log file.

Coordinator Log Streaming Protocol

When a daemon runs under a coordinator (distributed mode), log forwarding works via WebSocket:

Daemon -> Coordinator: Each LogMessage is wrapped in DaemonEvent::Log(message) and sent over the daemon’s WebSocket connection
Coordinator storage: The coordinator stores/forwards logs
CLI subscription: The CLI sends ControlRequest::LogSubscribe { dataflow_id, level } over its WebSocket connection
Server-side filtering: The coordinator only forwards messages where msg_level <= subscription_level. This reduces network traffic for filtered subscriptions
CLI receive: Messages arrive as serialized LogMessage structs

The --level flag maps to log::LevelFilter:

stdout -> LevelFilter::Trace (most permissive, receives everything)
info -> LevelFilter::Info (receives Error, Warn, Info)
etc.

Complete YAML Reference

nodes:
  - id: sensor
    path: ./target/debug/sensor
    outputs:
      - data
      - raw_output       # for send_stdout_as
      - log_entries       # for send_logs_as

    # Source-level log filtering (daemon-side)
    min_log_level: info          # suppress debug/trace/stdout

    # Route stdout to dataflow
    send_stdout_as: raw_output   # every stdout line becomes a data message

    # Route structured logs to dataflow
    send_logs_as: log_entries    # parsed log entries become data messages

    # Log file rotation
    max_log_size: "50MB"         # rotate when file exceeds 50MB
    max_rotated_files: 5         # keep 5 rotated files (default, range 1-100)

    inputs:
      tick: dora/timer/millis/100

Complete Example

The examples/python-logging/ directory contains a runnable three-node pipeline that exercises every logging feature:

sensor (noisy, high-volume) --> processor (structured logs) --> monitor (log aggregator)

Dataflow configuration highlights:

nodes:
  - id: sensor
    path: sensor.py
    min_log_level: info       # suppress debug noise at source
    max_log_size: "1KB"       # small for demo (triggers rotation quickly)
    inputs:
      tick: dora/timer/millis/50
    outputs:
      - reading

  - id: processor
    path: processor.py
    send_logs_as: log_entries  # route structured logs as data
    inputs:
      reading: sensor/reading
    outputs:
      - result
      - log_entries

  - id: monitor
    path: monitor.py
    inputs:
      logs: processor/log_entries
      reading: sensor/reading

What each node demonstrates:

sensor – Mixes print() (raw stdout), logging.info(), logging.debug(), and logging.warning(). With min_log_level: info, debug messages are dropped by the daemon before reaching log files. With max_log_size: "1KB", log rotation kicks in after a few seconds.
processor – Uses send_logs_as: log_entries to route its structured log entries as dataflow data. Raw print() output is not routed (only parsed structured entries are).
monitor – Subscribes to processor/log_entries and counts warnings/errors, demonstrating in-dataflow log aggregation.

Direct mode (dora run – single process, good for quick testing):

# Basic run
dora run examples/python-logging/dataflow.yml --stop-after 5s

# Only warnings and above
dora run examples/python-logging/dataflow.yml --log-level warn --stop-after 5s

# Per-node overrides
dora run examples/python-logging/dataflow.yml --log-filter "monitor=debug,sensor=warn" --stop-after 5s

# JSON output for machine parsing
dora run examples/python-logging/dataflow.yml --log-format json --stop-after 3s

# Environment variable control
DORA_LOG_LEVEL=warn dora run examples/python-logging/dataflow.yml --stop-after 5s

Distributed mode (dora up + dora start – coordinator/daemon architecture, required for multi-machine deployments):

# Start infrastructure
dora up

# Start attached (live log stream)
dora start examples/python-logging/dataflow.yml --attach

# Or start detached and query logs separately
dora start examples/python-logging/dataflow.yml
dora logs <dataflow-id> sensor --follow                    # stream one node
dora logs <dataflow-id> sensor --follow --level warn       # only warnings
dora logs <dataflow-id> --all-nodes --tail 20              # last 20 lines
dora logs <dataflow-id> processor --grep "error" --since 5m  # targeted search

In distributed mode, logs flow Node -> Daemon -> Coordinator -> CLI over WebSocket. The coordinator buffers log messages until a subscriber connects, so you won’t miss logs even if you attach late. YAML-level settings (min_log_level, send_logs_as, max_log_size) work identically since they are applied at the daemon.

	`dora run`	`dora start`
Display filtering	`--log-level`, `--log-format`, `--log-filter`	`--level` on `dora logs`
Per-node overrides	`--log-filter "sensor=debug"`	Separate `dora logs` per node
Remote nodes	No	Yes
Live streaming	Always attached	`--attach` or `dora logs --follow`

Post-run log analysis (works the same for both modes):

# Read all local logs
dora logs --local --all-nodes --tail 20

# Search for warnings in sensor logs
dora logs --local sensor --grep "high temp"

# Check that rotation created multiple files
ls -la out/*/log_sensor*.jsonl

Use Case Scenarios

1. Debugging a Noisy Sensor Pipeline

A camera sensor node floods the logs with debug messages, making it hard to see errors from other nodes.

nodes:
  - id: camera
    path: ./target/debug/camera
    min_log_level: warn          # suppress info/debug/trace/stdout at the source
    max_log_size: "10MB"         # limit disk usage

  - id: detector
    path: ./target/debug/detector

  - id: planner
    path: ./target/debug/planner

# During development: see everything from detector, only warnings from camera
dora run dataflow.yml --log-level debug --log-filter "camera=warn,detector=debug"

# In production: only errors
export DORA_LOG_LEVEL=error
dora run dataflow.yml

What happens:

Camera node’s debug/info messages are dropped by the daemon before reaching the log file (min_log_level: warn)
The CLI further filters display based on --log-filter
Log files rotate at 10MB, keeping at most 60MB on disk for the camera node

2. Log Aggregation Within the Dataflow

Build an in-dataflow log monitoring node that watches for errors across multiple nodes and sends alerts.

nodes:
  - id: camera
    path: ./target/debug/camera
    send_logs_as: logs
    outputs:
      - frames
      - logs

  - id: detector
    path: ./target/debug/detector
    send_logs_as: logs
    outputs:
      - detections
      - logs

  - id: log-monitor
    path: ./target/debug/log-monitor
    inputs:
      camera_logs: camera/logs
      detector_logs: detector/logs
    outputs:
      - alerts

Node-side handling in the log monitor (using dora-log-utils):

use dora_message::common::{LogLevel, LogLevelOrStdout};
use dora_node_api::{DoraNode, Event};

fn main() -> eyre::Result<()> {
    let (mut node, mut events) = DoraNode::init_from_env()?;
    while let Some(event) = events.recv() {
        match event {
            Event::Input { data, .. } => {
                let log = dora_log_utils::parse_log_from_arrow(&data)?;

                // Alert on error-level entries or any message mentioning a timeout
                let is_error = matches!(log.level,
                    LogLevelOrStdout::LogLevel(LogLevel::Error));

                if is_error || log.message.contains("timeout") {
                    // Send alert downstream
                    node.send_output("alerts", /* ... */)?;
                }
            }
            Event::Stop(_) => break,
            _ => {}
        }
    }
    Ok(())
}

See also the Log Sink Examples section for complete runnable examples.

3. Post-Mortem Debugging of a Crash

After a dataflow crashes, investigate what happened in the last few minutes.

# Find available dataflows
ls out/

# Read the last 50 lines from all nodes around the crash
dora logs --local --all-nodes --tail 50

# Focus on errors in the last 5 minutes
dora logs --local --all-nodes --since 5m --level error

# Search for a specific error pattern
dora logs --local --all-nodes --grep "out of memory"

# Drill into a specific node
dora logs --local detector --since 2m

# Re-run with JSON output to capture logs for external analysis
dora run dataflow.yml --log-format json 2>logs.json

4. Long-Running Production Dataflow

A dataflow runs for days or weeks. Without log rotation, disk space fills up.

nodes:
  - id: ingest
    path: ./target/debug/ingest
    min_log_level: info        # no debug noise in production
    max_log_size: "100MB"      # ~600MB max per node (100MB * 6)
    restart_policy: always
    inputs:
      tick: dora/timer/millis/1000
    outputs:
      - data

  - id: processor
    path: ./target/debug/processor
    min_log_level: warn        # only warnings and errors
    max_log_size: "50MB"
    restart_policy: on-failure
    inputs:
      data: ingest/data
    outputs:
      - results

  - id: writer
    path: ./target/debug/writer
    min_log_level: error       # minimal logging
    max_log_size: "20MB"
    inputs:
      results: processor/results

Disk budget:

ingest: up to 600MB (100MB x 6 files)
processor: up to 300MB (50MB x 6 files)
writer: up to 120MB (20MB x 6 files)
Total: ~1GB maximum disk usage for all logs
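
The disk budget follows directly from the formula max_log_size * (1 + max_rotated_files). A short Python check of the arithmetic, using a hypothetical helper `max_disk_usage_mb` and the default of 5 rotated files:

```python
def max_disk_usage_mb(max_log_size_mb: int, max_rotated_files: int = 5) -> int:
    """Worst case: one active file plus `max_rotated_files` rotated files."""
    return max_log_size_mb * (1 + max_rotated_files)


# max_log_size per node, in MB, taken from the YAML above
node_sizes = {"ingest": 100, "processor": 50, "writer": 20}
budget = {name: max_disk_usage_mb(size) for name, size in node_sizes.items()}
total = sum(budget.values())

print(budget)  # {'ingest': 600, 'processor': 300, 'writer': 120}
print(total)   # 1020
```

The exact total is 1020MB, which the text rounds to roughly 1GB.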

5. Live Monitoring of a Distributed Deployment

Multiple daemons running on different machines, monitored from a central workstation.

# Start infrastructure (coordinator + local daemon)
dora up

# On remote machines, start a daemon pointing to the coordinator:
#   dora daemon --coordinator-addr 192.168.1.10

# Start the dataflow (detached)
dora start dataflow.yml

# Open targeted log streams in separate terminals:

# Terminal 1: all sensor warnings
dora logs <dataflow-id> sensor --follow --level warn

# Terminal 2: processor errors with text search
dora logs <dataflow-id> processor --follow --level error --grep "timeout"

# Terminal 3: all nodes merged
dora logs <dataflow-id> --all-nodes --follow

# Terminal 4: historical + live (errors from the last hour, then stream)
dora logs <dataflow-id> processor --since 1h --level error --follow

# Monitor a remote coordinator from another machine:
dora logs <dataflow-id> sensor --follow --coordinator-addr 192.168.1.10

How it works internally:

CLI connects to the coordinator (default localhost:6013, or --coordinator-addr)
For historical logs: request-reply with filters applied client-side (--since, --grep, --tail)
For --follow: opens a WebSocket subscription to the coordinator
Coordinator filters by --level server-side before forwarding (reduces network traffic)
CLI applies --grep and --since client-side on the live stream
Coordinator buffers log messages until a subscriber connects, so late-joining subscribers see recent history

6. CI/CD Pipeline with Structured Logging

In CI, use JSON format for machine-parseable output and compact format for readable logs.

# Machine-parseable logs for CI tooling
dora run dataflow.yml --log-format json --stop-after 30s 2>test-logs.json

# Compact logs for CI console output
dora run dataflow.yml --log-format compact --log-level info --stop-after 30s

# Post-run analysis: count error lines across all nodes
dora logs --local --all-nodes --level error | wc -l

With JSON format, each line is a complete LogMessage that can be processed by jq, log aggregators, or custom scripts:

# Extract error messages with jq
cat test-logs.json | jq -r 'select(.level == "ERROR") | "\(.node_id): \(.message)"'
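
For pipelines where jq is not available, the same extraction can be sketched in Python. The field names (`level`, `node_id`, `message`) and the uppercase `"ERROR"` value are taken from the jq filter above; the function name `error_summaries` and the decision to skip non-JSON lines are assumptions.

```python
import json


def error_summaries(lines):
    """Yield '<node_id>: <message>' for each JSON log line at level ERROR."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines mixed into the stream
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            yield f"{entry.get('node_id')}: {entry.get('message')}"


if __name__ == "__main__":
    import sys

    # Usage: python extract_errors.py < test-logs.json
    for summary in error_summaries(sys.stdin):
        print(summary)
```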

Performance Considerations

Logging adds I/O overhead proportional to log volume. Here’s how to tune it:

min_log_level is the most impactful setting. It filters at the daemon before any I/O: no log file write, no coordinator forwarding, no send_logs_as routing. A node emitting 1000 debug lines/sec at min_log_level: info generates zero overhead for those lines.

send_logs_as adds a dataflow message per log line. Each parsed log entry is serialized to JSON, converted to Arrow, and sent through the dataflow. For high-volume nodes, this can consume significant bandwidth. Use min_log_level to limit what gets routed.

dora/logs subscribers share a single serialization. The daemon converts each log line to Arrow once and clones the result for each subscriber, so serialization cost depends on log volume alone and each additional subscriber adds only a clone per message rather than a full re-serialization. For most dataflows (1-3 log subscribers), this is negligible.

Log line size is capped at 1 MB. Lines longer than 1 MB from node stdout/stderr are truncated to prevent heap exhaustion. This protects against buggy nodes that dump large binary data to stdout.

Log file rotation is recommended for long-running dataflows. Without max_log_size, log files grow unbounded. A node emitting 100 lines/sec at ~200 bytes/line fills 1 GB in ~14 hours.
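
The 14-hour estimate can be reproduced with a one-line calculation. The helper `hours_to_fill` is hypothetical, and the figure assumes a decimal gigabyte (10^9 bytes):

```python
def hours_to_fill(target_bytes: int, lines_per_sec: float, bytes_per_line: float) -> float:
    """Hours until a log written at a steady rate reaches `target_bytes`."""
    bytes_per_sec = lines_per_sec * bytes_per_line
    return target_bytes / bytes_per_sec / 3600


# 100 lines/sec at ~200 bytes/line is 20,000 bytes/sec
print(round(hours_to_fill(10**9, 100, 200), 1))  # 13.9
```

Plugging in your own node's rate and line size gives a quick way to choose a max_log_size that matches your retention window.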

Recommended production settings:

nodes:
  - id: my-node
    path: ./my-node
    min_log_level: info        # drop debug/trace at source
    max_log_size: "50MB"       # rotate at 50MB
    max_rotated_files: 5       # keep 5 rotated files (300MB max)

Best Practices

Set min_log_level in production. Source-level filtering at the daemon prevents debug noise from reaching log files and the network. This is the most effective way to reduce log volume since it filters before any I/O.

Always set max_log_size for long-running dataflows. Without rotation, a single noisy node can fill the disk. Start with "50MB" (300MB total per node with rotation) and adjust based on your storage budget. Use max_rotated_files to tune how much history to keep (default 5, range 1-100).

Use environment variables for team defaults. Set DORA_LOG_LEVEL and DORA_LOG_FORMAT in your shell profile or CI configuration. Individual developers can override with CLI flags.

Use --log-filter during development. Instead of changing YAML config, use per-node display overrides to focus on the node you’re debugging: --log-filter "my-node=debug".

Use send_logs_as for operational monitoring. Build monitoring nodes that watch for error patterns, compute error rates, or forward alerts. This keeps monitoring logic within the dataflow graph. Use dora-log-utils to parse and format log entries in custom sink nodes (see examples/log-sink-file/ and examples/log-sink-tcp/).

Prefer send_logs_as over send_stdout_as for structured data. send_stdout_as captures every stdout line (including raw prints), while send_logs_as only captures parsed structured log entries with full metadata.

Use --local for post-mortem debugging. After a crash, dora logs --local --all-nodes works without a running coordinator and merges all node logs chronologically.

Combine --since with --grep for targeted debugging. Instead of scrolling through thousands of lines, narrow the window: dora logs --local sensor --since 5m --grep "error".

Use JSON format for log pipelines. When feeding logs to external systems (ELK, Grafana Loki, Datadog), use --log-format json for structured ingestion.

Debugging and Observability Guide

This guide covers how to debug, record, replay, and monitor dora dataflows. It is written for new users who want to understand what went wrong in a dataflow, measure performance, or reproduce issues offline.

Prerequisites

Before using topic inspection commands (topic echo, topic hz, topic info), enable debug message publishing using either approach:

Option 1: CLI flag (recommended)

dora start dataflow.yml --debug
dora run dataflow.yml --debug

Option 2: YAML descriptor

_unstable_debug:
  enable_debug_inspection: true

This tells the daemon to publish all inter-node messages to Zenoh, where the coordinator can proxy them to CLI clients via WebSocket. Without this flag, topic inspection commands will return an error.

The record (default mode), replay, logs, list, top, graph, node info/restart/stop, param, and doctor commands do not require this flag. The topic pub command and dora record --proxy do require it.

Quick Debugging Checklist

When something goes wrong, follow this sequence:

# 1. Run full environment diagnosis
dora doctor --dataflow dataflow.yml

# 2. What dataflows are active?
dora list

# 3. Inspect the problem node
dora node info -d my-dataflow problem-node

# 4. Check node resource usage
dora top

# 5. Stream logs from the problem node
dora logs my-dataflow problem-node --follow --level debug

# 6. Is the node producing output?
dora topic echo -d my-dataflow problem-node/output

# 7. Inject test data
dora topic pub -d my-dataflow problem-node/input '[1, 2, 3]'

# 8. Is it publishing at the expected rate?
dora topic hz -d my-dataflow --window 5

# 9. Check/modify runtime parameters
dora param list -d my-dataflow problem-node
dora param set -d my-dataflow problem-node debug_level 2

# 10. Restart a misbehaving node (without stopping the dataflow)
dora node restart -d my-dataflow problem-node

# 11. View coordinator traces (no external infra needed)
dora trace list
dora trace view <trace-id-prefix>

# 12. Visualize the dataflow graph
dora graph dataflow.yml --open

# 13. Record for offline analysis
dora record dataflow.yml -o debug-capture.drec

Record and Replay

Record captures live dataflow messages to a file. Replay substitutes source nodes with recorded data, letting you reproduce behavior without hardware.

Recording a Dataflow

# Record all topics (default output: recording_{timestamp}.drec)
dora record dataflow.yml

# Specify output file
dora record dataflow.yml -o my-capture.drec

This injects a hidden __dora_record__ node into the dataflow that subscribes to all node outputs and writes them to a .drec file. The record node binary (dora-record-node) is auto-built on first use.

The recording runs until you press Ctrl-C or the dataflow stops.

Recording Specific Topics

# Only record camera and lidar
dora record dataflow.yml --topics sensor/image,lidar/points

Topic names use the format node_id/output_id. Available topics can be discovered with dora topic list -d <dataflow>.

Proxy Recording (Remote / Diskless)

When the target machine has no local disk or you want to record on your local machine:

# Start the dataflow first (detached)
dora start dataflow.yml --detach

# Record via WebSocket proxy -- data streams through coordinator to CLI
dora record dataflow.yml --proxy -o capture.drec

# Record specific topics via proxy
dora record dataflow.yml --proxy --topics sensor/image,lidar/points

How proxy mode works:

The dataflow must already be running (dora start --detach)
The CLI connects to the coordinator via WebSocket
The coordinator subscribes to Zenoh on the CLI’s behalf
Message data streams through WebSocket binary frames to the CLI
The CLI writes the .drec file locally

Proxy mode requires debug inspection to be enabled, either with the --debug flag or with enable_debug_inspection: true in the descriptor.

When to use --proxy:

Embedded targets with no local disk
Remote machines where you want the recording on your workstation
When you only have WebSocket connectivity (no direct Zenoh access)

When to use default mode (no --proxy):

Same machine or shared filesystem
High-throughput scenarios (no WebSocket overhead)
No need for enable_debug_inspection

Replaying a Recording

# Replay at original speed
dora replay recording.drec

# Replay at 2x speed
dora replay recording.drec --speed 2.0

# Replay as fast as possible (speed 0)
dora replay recording.drec --speed 0

Replay works by:

Reading the .drec file header to get the original dataflow descriptor
Identifying which nodes produced the recorded data
Replacing those source nodes with dora-replay-node instances
Running the modified dataflow – downstream nodes receive replayed data identically to live data

The replay node binary (dora-replay-node) is auto-built on first use.

Replay Options

Flag	Default	Description
`--speed <FLOAT>`	`1.0`	Playback speed multiplier. `2.0` = 2x, `0.5` = half speed, `0` = as fast as possible
`--loop`	off	Loop the recording continuously
`--replace <NODES>`	all recorded	Comma-separated list of nodes to replace
`--output-yaml <PATH>`	-	Write modified descriptor YAML without running
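The effect of the speed multiplier on timing can be sketched as follows. This is only an illustration of the arithmetic, assuming each recorded entry carries an offset from the start of the recording (as timestamp_offset_nanos does in the .drec format); the function name replay_offset is hypothetical and not part of Dora.

```rust
use std::time::Duration;

/// Time after replay start at which an entry recorded `offset_nanos`
/// into the capture would be emitted.
/// Assumption: a speed of 0 disables pacing entirely.
fn replay_offset(offset_nanos: u64, speed: f64) -> Duration {
    if speed <= 0.0 {
        // "As fast as possible": never wait.
        return Duration::ZERO;
    }
    // Scale the recorded offset by the inverse of the speed multiplier.
    Duration::from_secs_f64(offset_nanos as f64 / 1e9 / speed)
}
```

For an entry recorded one second into the capture, replay_offset(1_000_000_000, 2.0) gives 500 ms and replay_offset(1_000_000_000, 0.5) gives 2 s.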

Selective Replay

Replace only specific source nodes while keeping others live:

# Only replace the sensor node, keep camera live
dora replay recording.drec --replace sensor

# Replace sensor and lidar, keep everything else live
dora replay recording.drec --replace sensor,lidar

This is useful when you want to debug a specific processing pipeline with known input data while keeping other parts of the system live.

Dry Run (Output YAML)

Both record and replay support --output-yaml to see the modified descriptor without running:

# See what the record-injected descriptor looks like
dora record dataflow.yml --output-yaml record-modified.yml

# See what the replay-modified descriptor looks like
dora replay recording.drec --output-yaml replay-modified.yml

Recording File Format

The .drec format is a simple binary file:

┌──────────────────────────────────┐
│ Header (bincode)                 │
│   version: u32                   │
│   start_nanos: u64               │
│   dataflow_id: Uuid              │
│   descriptor_yaml: Vec<u8>       │
├──────────────────────────────────┤
│ Entry 1 (bincode)                │
│   node_id: String                │
│   output_id: String              │
│   timestamp_offset_nanos: u64    │
│   event_bytes: Vec<u8>           │
├──────────────────────────────────┤
│ Entry 2 ...                      │
├──────────────────────────────────┤
│ ...                              │
├──────────────────────────────────┤
│ Footer (bincode)                 │
│   total_messages: u64            │
│   total_bytes: u64               │
└──────────────────────────────────┘

The event_bytes field contains the raw Timestamped<InterDaemonEvent> bincode payload – the same format used on the wire between daemons. The descriptor_yaml in the header stores the original dataflow descriptor so replay can reconstruct the dataflow.

Node Management

Node Info

Get detailed information about a specific node including its status, inputs, outputs, metrics, and restart count:

dora node info -d my-dataflow camera

# JSON output
dora node info -d my-dataflow camera --format json

Node Restart

Restart a single node without stopping the entire dataflow. Useful for recovering a misbehaving node or picking up configuration changes:

# Restart with default grace period
dora node restart -d my-dataflow camera

# Restart with custom grace period
dora node restart -d my-dataflow camera --grace 10s

The daemon sends a stop event, waits for the grace period, then respawns the node process.

Node Stop

Stop a single node without stopping the entire dataflow:

dora node stop -d my-dataflow camera

# With custom grace period
dora node stop -d my-dataflow camera --grace 5s

Topic Inspection

Topic inspection commands subscribe to live dataflow messages via the coordinator’s WebSocket proxy. Except for topic list, they require the --debug flag or enable_debug_inspection: true.

Listing Topics

# List all topics in a running dataflow
dora topic list -d my-dataflow

# JSON output
dora topic list -d my-dataflow --format json

Shows each output, which node publishes it, and which nodes subscribe to it. This command reads from the descriptor and does not require enable_debug_inspection.

Echoing Topic Data

Stream live topic data to the terminal:

# Echo a single topic
dora topic echo -d my-dataflow camera_node/image

# Echo multiple topics
dora topic echo -d my-dataflow robot1/pose robot2/vel

# JSON output (useful for piping to jq or other tools)
dora topic echo -d my-dataflow robot1/pose --format json

# Echo all topics
dora topic echo -d my-dataflow

Each line shows the topic name, Arrow data content, and metadata parameters. Use --format json for machine-readable output:

{"timestamp":1709000000000,"name":"robot1/pose","data":[1.0,2.0,3.0],"metadata":null}

Measuring Frequency

Interactive TUI showing per-topic publish frequency:

# All topics with 10-second sliding window
dora topic hz -d my-dataflow --window 10

# Specific topics with 5-second window
dora topic hz -d my-dataflow robot1/pose robot2/vel --window 5

The TUI displays:

Average frequency (Hz)
Average, min, max interval
Standard deviation
Sparkline showing recent activity

Press q or Ctrl-C to exit. Requires an interactive terminal.
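The figures shown by the TUI can be derived from message arrival times alone. The sketch below shows one way to compute them over a sliding window; it is not Dora's implementation, and the HzStats type and hz_stats function are hypothetical names introduced for illustration.

```rust
/// Per-topic statistics derived from message arrival times.
#[derive(Debug)]
struct HzStats {
    avg_hz: f64,
    avg_interval: f64,
    min_interval: f64,
    max_interval: f64,
    std_dev: f64,
}

/// `arrivals` are timestamps in seconds, sorted ascending. Only messages
/// within `window` seconds of the newest one are considered.
fn hz_stats(arrivals: &[f64], window: f64) -> Option<HzStats> {
    let newest = *arrivals.last()?;
    let recent: Vec<f64> = arrivals
        .iter()
        .copied()
        .filter(|t| newest - t <= window)
        .collect();
    // Intervals between consecutive messages inside the window.
    let intervals: Vec<f64> = recent.windows(2).map(|w| w[1] - w[0]).collect();
    if intervals.is_empty() {
        return None; // at least two messages are needed to measure a rate
    }
    let n = intervals.len() as f64;
    let avg = intervals.iter().sum::<f64>() / n;
    let variance = intervals.iter().map(|i| (i - avg).powi(2)).sum::<f64>() / n;
    Some(HzStats {
        avg_hz: 1.0 / avg,
        avg_interval: avg,
        min_interval: intervals.iter().copied().fold(f64::INFINITY, f64::min),
        max_interval: intervals.iter().copied().fold(f64::NEG_INFINITY, f64::max),
        std_dev: variance.sqrt(),
    })
}
```

A steady 2 Hz publisher yields a standard deviation near zero; a large gap between min and max interval usually points at jitter or dropped messages upstream.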

Publishing Test Data

Inject data into a running dataflow for testing. Requires enable_debug_inspection: true.

# Publish a single Arrow array
dora topic pub -d my-dataflow sensor/threshold '[42]'

# Publish from a JSON file
dora topic pub -d my-dataflow sensor/config --file test-config.json

# Publish multiple messages
dora topic pub -d my-dataflow sensor/trigger '[1]' --count 10

This is useful for:

Testing node behavior with known input data
Triggering specific code paths in downstream nodes
Simulating sensor inputs without hardware

Topic Metadata and Stats

One-shot statistics collection:

# Collect stats for 5 seconds (default)
dora topic info -d my-dataflow camera_node/image

# Collect for 10 seconds
dora topic info -d my-dataflow camera_node/image --duration 10

Reports:

Arrow data type
Publisher node
Subscriber nodes (from descriptor)
Message count and bandwidth
Publishing frequency

Runtime Parameters

Runtime parameters let you read and modify node configuration while a dataflow is running, without restarting. Parameters are stored in the coordinator and optionally forwarded to running nodes.

# List all parameters for a node
dora param list -d my-dataflow detector

# Get a single parameter
dora param get -d my-dataflow detector confidence

# Set a parameter (value is JSON)
dora param set -d my-dataflow detector confidence 0.8
dora param set -d my-dataflow detector config '{"nms": 0.5, "classes": ["car", "person"]}'

# Delete a parameter
dora param delete -d my-dataflow detector confidence

Parameters are persisted in the coordinator store (in-memory or redb). When a node is running, param set also forwards the new value to the node’s daemon. Nodes can read parameters through the node event stream.

Limits: Keys max 256 bytes, values max 64KB serialized.
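A client-side pre-check of these limits might look like the sketch below. It assumes that 64KB means 65,536 bytes and that the serialized value is the JSON text passed on the command line; check_param is a hypothetical helper, not a Dora API.

```rust
const MAX_KEY_BYTES: usize = 256;
const MAX_VALUE_BYTES: usize = 64 * 1024; // assumption: 64KB = 65,536 bytes

/// Reject parameters that would exceed the documented size limits.
fn check_param(key: &str, json_value: &str) -> Result<(), String> {
    if key.len() > MAX_KEY_BYTES {
        return Err(format!("key is {} bytes, limit is {MAX_KEY_BYTES}", key.len()));
    }
    if json_value.len() > MAX_VALUE_BYTES {
        return Err(format!(
            "value is {} bytes, limit is {MAX_VALUE_BYTES}",
            json_value.len()
        ));
    }
    Ok(())
}
```

Large blobs such as model weights or lookup tables will not fit in a parameter; passing a file path as the value is one way around the limit.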

Environment Diagnosis

dora doctor performs a comprehensive health check of your environment:

# Basic diagnosis
dora doctor

# Diagnosis + dataflow validation
dora doctor --dataflow dataflow.yml

Checks performed:

Coordinator reachability
Connected daemon status
Active dataflow health
Dataflow YAML validation (if --dataflow provided)

Use this as a first step when debugging any issue, or in CI to validate the environment before running tests.

Trace Inspection

The coordinator captures tracing spans in-memory from dora_coordinator and dora_core crates (up to 4096 spans in a ring buffer). You can view these traces without any external tracing infrastructure (no Jaeger, Tempo, etc. required).
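The practical consequence of a fixed-size ring buffer is that old spans are overwritten once the buffer is full. The sketch below shows the eviction behavior with a generic bounded queue; SpanRing is a hypothetical stand-in, not the coordinator's actual type.

```rust
use std::collections::VecDeque;

/// A bounded buffer that drops the oldest entry when full.
struct SpanRing<T> {
    capacity: usize,
    spans: VecDeque<T>,
}

impl<T> SpanRing<T> {
    /// `capacity` is assumed to be greater than zero.
    fn new(capacity: usize) -> Self {
        Self { capacity, spans: VecDeque::with_capacity(capacity) }
    }

    fn push(&mut self, span: T) {
        if self.spans.len() == self.capacity {
            // Buffer full: evict the oldest span to make room.
            self.spans.pop_front();
        }
        self.spans.push_back(span);
    }
}
```

With a capacity of 4096, pushing 5000 spans leaves only the most recent 4096, so on a busy coordinator it is worth running dora trace list soon after the operation you want to inspect.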

Listing Traces

dora trace list

Shows all captured traces with their root span name, span count, start time, and total duration:

TRACE ID      ROOT SPAN          SPANS  STARTED              DURATION
a1b2c3d4e5f6  spawn_dataflow     12     2026-03-01 10:30:05  1.234s
f8e7d6c5b4a3  build_dataflow     5      2026-03-01 10:29:58  0.500s

Viewing a Trace

# Full trace ID
dora trace view a1b2c3d4-e5f6-7890-abcd-1234567890ab

# Or use a unique prefix
dora trace view a1b2c3d4

Displays spans as an indented tree showing parent-child relationships, log levels, durations, and span fields:

spawn_dataflow [INFO 1.234s] {build_id="abc", session_id="def"}
  build_dataflow [INFO 0.500s]
    download_node [DEBUG 0.200s] {url="..."}
  start_inner [INFO 0.734s]
    spawn_node [INFO 0.100s] {node_id="camera"}
    spawn_node [INFO 0.080s] {node_id="detector"}

When to Use Trace Inspection

Quick debugging – see what the coordinator did during a start, stop, or build without setting up Jaeger/Tempo
Performance analysis – identify slow spans in dataflow lifecycle operations
Deployment troubleshooting – understand the sequence and timing of coordinator operations

For full distributed tracing across daemons and nodes, set DORA_OTLP_ENDPOINT and use an OTLP-compatible backend.

Resource Monitoring

dora top (also dora inspect top) provides a real-time TUI showing per-node resource usage:

# Default 2-second refresh
dora top

# Custom refresh interval
dora top --refresh-interval 5

# JSON snapshot for scripting/CI
dora top --once | jq .

Displays for each node:

CPU usage (% of a single core)
Memory (RSS)
Node status (Running, Restarting, Degraded, Failed)
Restart count
Queue depth (pending messages)
Network TX/RX (cross-daemon bytes via Zenoh)
Disk I/O read/write

Metrics are collected by daemons and reported to the coordinator, so this works for distributed dataflows across multiple machines. Press q or Ctrl-C to exit.

Use --once to print a single JSON snapshot and exit, useful for CI pipelines and monitoring integrations.

Note: CPU percentages are per-core, so values can exceed 100% for multi-threaded nodes. Nodes on different machines may have different CPUs, so percentages are not directly comparable across machines.

Log Analysis

Live Log Streaming

# Stream logs from a specific node
dora logs my-dataflow sensor-node --follow

# Stream logs from all nodes
dora logs my-dataflow --all-nodes --follow

# Filter by log level
dora logs my-dataflow sensor-node --follow --level debug

# Stream with grep filter
dora logs my-dataflow --all-nodes --follow --grep "error"

Without --follow, the command reads from local log files. With --follow, it streams live from the coordinator via WebSocket.

Local Log Files

Logs are stored in the out/ directory:

out/
  <dataflow-uuid>/
    log_<node-id>.jsonl          # current log
    log_<node-id>.1.jsonl        # rotated (previous)
    log_<node-id>.2.jsonl        # rotated (older)

Read directly:

# All nodes, local files
dora logs --local --all-nodes

# Specific node, last 50 lines
dora logs --local sensor-node --tail 50

Filtering and Searching

Flag	Example	Description
`--level <LEVEL>`	`--level debug`	Minimum level: error, warn, info, debug, trace, stdout
`--log-filter <FILTER>`	`--log-filter "sensor=debug,processor=warn"`	Per-node level filter
`--grep <PATTERN>`	`--grep "timeout"`	Case-insensitive substring match
`--since <DURATION>`	`--since 5m`	Only logs newer than this
`--until <DURATION>`	`--until 1h`	Only logs older than this
`--tail <N>`	`--tail 100`	Show last N lines
`--log-format <FMT>`	`--log-format json`	Output format: pretty (default) or json
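The --log-filter value is a comma-separated list of node=level pairs. A minimal parser for that syntax could look like the sketch below; parse_log_filter is a hypothetical helper, and accepting the same level names as --level is an assumption rather than documented behavior.

```rust
use std::collections::HashMap;

/// Parse "sensor=debug,processor=warn" into a node -> level map.
fn parse_log_filter(filter: &str) -> Result<HashMap<String, String>, String> {
    // Assumption: the per-node filter accepts the same names as --level.
    const LEVELS: [&str; 6] = ["error", "warn", "info", "debug", "trace", "stdout"];
    let mut levels = HashMap::new();
    for part in filter.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        let (node, level) = part
            .split_once('=')
            .ok_or_else(|| format!("expected node=level, got `{part}`"))?;
        if !LEVELS.contains(&level) {
            return Err(format!("unknown level `{level}` for node `{node}`"));
        }
        levels.insert(node.to_string(), level.to_string());
    }
    Ok(levels)
}
```

Parsing the example value from the table gives two entries: sensor mapped to debug and processor mapped to warn.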

Environment variables:

DORA_LOG_LEVEL – default log level
DORA_LOG_FORMAT – default log format
DORA_LOG_FILTER – default per-node filter

Dataflow Visualization

Generate a visual graph of your dataflow:

# Generate HTML and open in browser
dora graph dataflow.yml --open

# Generate Mermaid diagram text
dora graph dataflow.yml --mermaid

The Mermaid output can be pasted into mermaid.live or used in GitHub markdown:

```mermaid
graph TD
    sensor --> processor
    processor --> controller
```

The HTML mode generates a self-contained file with an interactive mermaid.js diagram.

Monitoring Running Dataflows

# Full environment diagnosis
dora doctor

# List all dataflows (active and completed)
dora list

# List nodes in a specific dataflow
dora node list -d my-dataflow

# Get detailed info on a specific node
dora node info -d my-dataflow camera

# Check coordinator/daemon status
dora status

# View/modify runtime parameters
dora param list -d my-dataflow detector
dora param set -d my-dataflow detector threshold 0.5

dora list shows each dataflow’s UUID, name, status, and node count. Use -d <name> with other commands to target a specific dataflow.

End-to-End Debugging Workflows

Workflow 1: Node Not Producing Output

# 1. Verify the node is running
dora list
dora top

# 2. Check its logs
dora logs my-dataflow problem-node --follow --level trace

# 3. Check if upstream nodes are publishing
dora topic echo -d my-dataflow upstream-node/output

# 4. Verify topic wiring
dora topic list -d my-dataflow
dora graph dataflow.yml --open

Workflow 2: Unexpected Data or Wrong Values

# 1. Echo the topic to see raw data
dora topic echo -d my-dataflow node/output --format json

# 2. Record for offline analysis
dora record dataflow.yml -o debug.drec

# 3. Replay with known input to isolate the issue
dora replay debug.drec --replace sensor --speed 0

Workflow 3: Performance Issues

# 1. Check CPU/memory per node
dora top

# 2. Measure publish frequencies
dora topic hz -d my-dataflow --window 10

# 3. Get bandwidth stats for suspected bottleneck
dora topic info -d my-dataflow heavy-node/output --duration 10

# 4. Record and replay at max speed to find throughput limits
dora record dataflow.yml -o perf.drec
dora replay perf.drec --speed 0

Workflow 4: Reproducing a Field Issue

# On the robot / target machine:
dora start dataflow.yml --detach
dora record dataflow.yml --proxy -o field-capture.drec

# Transfer the .drec file to your workstation, then:
dora replay field-capture.drec
dora replay field-capture.drec --speed 0.5  # slow motion
dora replay field-capture.drec --loop        # continuous replay

Workflow 5: Remote Debugging (No Direct Access)

When you only have WebSocket connectivity to the coordinator:

# All these commands work over WebSocket -- no Zenoh needed
dora list
dora top
dora logs my-dataflow --all-nodes --follow
dora topic echo -d my-dataflow node/output
dora topic hz -d my-dataflow
dora record dataflow.yml --proxy -o remote-capture.drec

See Also

CLI Reference – complete command reference
WebSocket Control Plane – how CLI communicates with coordinator
WebSocket Topic Data Channel – how topic data is proxied
Testing Guide – running smoke tests

Fault Tolerance

Dora provides built-in fault tolerance for robotic and AI dataflows. Nodes can automatically restart on failure, detect stale upstream connections, gracefully degrade when inputs are unavailable, and the coordinator can persist state to disk so it survives crashes and restarts.

Feature Overview

Feature	Scope	Configuration
Restart policies	Per-node	`restart_policy`, `max_restarts`, `restart_delay`, …
Health monitoring	Per-node	`health_check_timeout`, `health_check_interval` (dataflow-level)
Input timeouts	Per-input	`input_timeout`
Circuit breaker	Automatic	Triggered by `input_timeout`, auto-recovers
NodeRestarted event	Downstream nodes	Automatic when upstream restarts
InputTracker API	Rust nodes	`dora_node_api::InputTracker`
Observability	Daemon-wide	Atomic counters logged periodically
Distributed health	Multi-daemon	Coordinator heartbeat monitoring
Coordinator state persistence	Coordinator	`--store redb` (requires `redb-backend` feature)

Restart Policies

Control what happens when a node exits or crashes.

Configuration

nodes:
  - id: my-node
    path: ./target/debug/my-node
    restart_policy: on-failure  # never | on-failure | always
    max_restarts: 5             # 0 = unlimited (default: 0)
    restart_delay: 1.0          # initial delay in seconds
    max_restart_delay: 30.0     # cap for exponential backoff
    restart_window: 300.0       # reset counter after this many seconds

Policy Types

never (default) – Node is not restarted. Failure propagates normally.

on-failure – Restart only when the node exits with a non-zero exit code. Clean exits (code 0) are not restarted.

always – Restart on any exit, except:

The dataflow was stopped by the user (dora stop or Ctrl-C)
All inputs were closed and the node exited with a non-zero code

How Restarts Work Internally

When a node process exits, the daemon evaluates the restart decision in this order:

Policy check: Does the restart policy allow it?
- Never -> no restart
- OnFailure -> restart only if exit code != 0
- Always -> restart
Disable check: Has disable_restart been set? (set when all inputs close or during manual stop via stop_all)
Window check: If restart_window is set and the window has elapsed since the first restart, reset the counter to 0
Limit check: If max_restarts > 0 and the window counter exceeds it, give up permanently
Backoff: If restart_delay is set, sleep for the computed delay (re-checking disable_restart after waking)
Respawn: The node process is spawned fresh with the same configuration

The daemon tracks restart state per node instance in the spawn/prepared.rs lifecycle loop. Each node runs in its own tokio task, so restarts don’t block other nodes.
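The decision order above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in spawn/prepared.rs: the type and field names are hypothetical, time is passed in as seconds, the limit check treats max_restarts as "at most N restarts per window", and backoff and respawn are left to the caller.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum RestartPolicy { Never, OnFailure, Always }

#[derive(Debug, PartialEq)]
enum Decision {
    Restart,
    GiveUp(&'static str), // reason, for logging
}

struct RestartState {
    policy: RestartPolicy,
    max_restarts: u32,           // 0 = unlimited
    restart_window: Option<f64>, // seconds
    disable_restart: bool,       // set on shutdown or when all inputs close
    count: u32,                  // restarts in the current window
    window_start: Option<f64>,   // time of the first restart in the window
}

impl RestartState {
    fn new(policy: RestartPolicy, max_restarts: u32, restart_window: Option<f64>) -> Self {
        Self { policy, max_restarts, restart_window, disable_restart: false, count: 0, window_start: None }
    }

    fn on_exit(&mut self, exit_code: i32, now: f64) -> Decision {
        // 1. Policy check
        let allowed = match self.policy {
            RestartPolicy::Never => false,
            RestartPolicy::OnFailure => exit_code != 0,
            RestartPolicy::Always => true,
        };
        if !allowed {
            return Decision::GiveUp("policy");
        }
        // 2. Disable check
        if self.disable_restart {
            return Decision::GiveUp("disabled");
        }
        // 3. Window check: reset the counter once the window has elapsed
        if let (Some(window), Some(start)) = (self.restart_window, self.window_start) {
            if now - start >= window {
                self.count = 0;
                self.window_start = None;
            }
        }
        // 4. Limit check
        if self.max_restarts > 0 && self.count >= self.max_restarts {
            return Decision::GiveUp("max_restarts");
        }
        // 5-6. The caller sleeps for the backoff delay, then respawns.
        if self.window_start.is_none() {
            self.window_start = Some(now);
        }
        self.count += 1;
        Decision::Restart
    }
}
```

With max_restarts: 2 and restart_window: 300.0, failures at t=0 and t=10 are restarted, a failure at t=20 is not, and a failure at t=310 would be restarted again because the window has elapsed.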

Backoff

When restart_delay is set, the daemon waits before restarting. The delay doubles on each attempt (exponential backoff) and is capped by max_restart_delay.

The backoff exponent is capped at 16 internally to prevent overflow (2^16 = 65536x multiplier).

Example with restart_delay: 1.0 and max_restart_delay: 10.0:

Attempt 1: wait 1s    (1.0 * 2^0)
Attempt 2: wait 2s    (1.0 * 2^1)
Attempt 3: wait 4s    (1.0 * 2^2)
Attempt 4: wait 8s    (1.0 * 2^3)
Attempt 5: wait 10s   (capped at max_restart_delay)
Attempt 6: wait 10s   (capped)
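The schedule above follows from a one-line formula. The sketch below reproduces it, assuming attempts are numbered from 1 and delays are expressed in seconds; backoff_delay is a hypothetical name, not a Dora function.

```rust
/// Delay before restart attempt `attempt` (1-based), in seconds.
fn backoff_delay(restart_delay: f64, max_restart_delay: f64, attempt: u32) -> f64 {
    // The exponent is capped at 16, so the multiplier never exceeds 65536.
    let exponent = attempt.saturating_sub(1).min(16);
    (restart_delay * 2f64.powi(exponent as i32)).min(max_restart_delay)
}
```

Calling backoff_delay(1.0, 10.0, n) for n from 1 to 6 yields 1, 2, 4, 8, 10, 10, matching the example.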

During the backoff sleep, the daemon continuously monitors the disable_restart flag. If all inputs close while the node is waiting to restart, the restart is cancelled with the log message: “restart cancelled: inputs closed during backoff wait”.

Restart Window

When restart_window is set, the restart counter resets after the window elapses (measured from the first restart in the current window). This enables “N restarts per M seconds” semantics.

Example: max_restarts: 5, restart_window: 300.0 means “at most 5 restarts per 5 minutes”. If the window elapses without hitting the limit, the counter resets and the node gets another 5 attempts.

Restart Disable During Shutdown

When the daemon stops a dataflow (via stop_all), it calls disable_restart() on every node before sending Stop events. This prevents the restart mechanism from fighting the shutdown process. The disable_restart flag is an Arc<AtomicBool> shared between the daemon event loop and the node’s spawn lifecycle task.

NodeRestarted Event

When a node restarts, the daemon sends a NodeRestarted event to all downstream nodes that consume its outputs. This allows downstream nodes to:

Reset internal state or caches
Log the upstream recovery
Re-initialize connections or sessions

The event carries the NodeId of the restarting node. Downstream nodes receive it automatically via the event stream:

match event {
    Event::NodeRestarted { id } => {
        println!("upstream node {id} restarted, resetting state");
        // Clear any cached state from the old node instance
    }
    _ => {}
}

The daemon finds downstream nodes via dataflow.mappings, which maps each node’s outputs to all subscribing (receiver_node, input_id) pairs. Each unique receiver gets one NodeRestarted event per restart.

Health Monitoring

Passive monitoring detects hung nodes that stop communicating with the daemon.

health_check_interval: 2.0  # seconds (default: 5.0, dataflow-level)
nodes:
  - id: my-node
    path: ./target/debug/my-node
    health_check_timeout: 30.0  # seconds (per-node)
    restart_policy: on-failure

Configurable Health Check Interval

The health_check_interval is a dataflow-level setting that controls how often the daemon checks node health. Default is 5.0 seconds. Lower values detect hung nodes faster but add more overhead. Set this at the top level of your dataflow YAML, not per-node.

How It Works Internally

The daemon runs a health check sweep at the configured health_check_interval (via a tokio interval stream emitting Event::NodeHealthCheckInterval).

Each RunningNode has a last_activity: Arc<AtomicU64> field storing the timestamp (milliseconds since epoch) of the last communication. This is updated atomically by the node’s communication handler (node_communication/mod.rs) every time the node sends any request to the daemon (event subscriptions, output sends, etc.).

The health check function (check_node_health) iterates all running nodes:

Skip nodes without health_check_timeout set
Skip nodes with last_activity == 0 (not yet connected)
Compute elapsed_ms = now - last_activity
If elapsed_ms > timeout_ms, log a warning and kill the node process

After killing, the normal exit handling runs, which evaluates the restart policy. This means health_check_timeout combined with restart_policy: on-failure automatically recovers hung nodes.
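The sweep can be sketched as a pure function over a snapshot of node state. This is not the daemon's check_node_health; NodeHealth and nodes_to_kill are hypothetical names, and timestamps are plain milliseconds rather than atomics.

```rust
struct NodeHealth {
    id: &'static str,
    health_check_timeout_ms: Option<u64>, // None = health checks disabled
    last_activity_ms: u64,                // 0 = node has not connected yet
}

/// Returns the ids of nodes whose last activity is older than their timeout.
fn nodes_to_kill(nodes: &[NodeHealth], now_ms: u64) -> Vec<&'static str> {
    nodes
        .iter()
        .filter_map(|node| {
            // Skip nodes without a configured timeout.
            let timeout_ms = node.health_check_timeout_ms?;
            // Skip nodes that have not connected yet.
            if node.last_activity_ms == 0 {
                return None;
            }
            let elapsed_ms = now_ms.saturating_sub(node.last_activity_ms);
            (elapsed_ms > timeout_ms).then_some(node.id)
        })
        .collect()
}
```

Because a node is only checked once per sweep, the worst-case detection time is roughly health_check_timeout plus one health_check_interval.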

What Counts as “Activity”

Any message from the node to the daemon counts:

Event subscription requests
Output data sends (via shared memory or TCP)
Timer tick acknowledgments

Normal input data received from other nodes does not reset the timer – the node must actively communicate with the daemon.

Input Timeouts and Circuit Breaker

Per-input timeouts detect when an upstream node stops producing data.

Configuration

nodes:
  - id: downstream-node
    path: ./target/debug/downstream
    inputs:
      sensor_data:
        source: camera-node/frames
        input_timeout: 5.0  # seconds

The input_timeout is set per input, not per node. Different inputs can have different timeouts.

How It Works Internally

The daemon maintains an InputDeadline for each input with a timeout:

struct InputDeadline {
    timeout: Duration,        // configured timeout
    last_received: Instant,   // last time data arrived
}

These are stored in RunningDataflow.input_deadlines keyed by (NodeId, DataId).

Timeout detection runs on the same health check interval (5 seconds by default). The check_input_timeouts function:

Scans all input_deadlines entries
If last_received.elapsed() > timeout, the input is “broken”
The (node_id, input_id) pair is moved from input_deadlines to broken_inputs
The daemon calls break_input() which sends InputClosed { id } to the downstream node
If all of a node’s inputs are now closed (and none are broken/recoverable), AllInputsClosed is sent and the node’s restart is disabled

Deadline reset: Every time data arrives on an input, its last_received is reset to Instant::now().

Circuit Breaker: Auto-Recovery

The circuit breaker tracks broken inputs in RunningDataflow.broken_inputs. When new data arrives on a broken input:

The data is delivered to the node normally
The broken_inputs entry is removed
The input is re-added to open_inputs
A new InputDeadline is created (restarting the timeout)
An InputRecovered { id } event is sent to the node
The circuit_breaker_recoveries counter is incremented

This means recovery is fully automatic. If the upstream node restarts (via restart policy) and begins producing data again, downstream nodes seamlessly resume receiving it.
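The trip and recovery steps can be modeled with two maps, mirroring input_deadlines and broken_inputs. The sketch below is a simplified illustration rather than the daemon's code: it uses seconds as f64 instead of Instant, omits open_inputs and AllInputsClosed handling, and the Inputs, Deadline, and Notice types are hypothetical.

```rust
use std::collections::HashMap;

type InputKey = (&'static str, &'static str); // (node_id, input_id)

struct Deadline { timeout: f64, last_received: f64 }

#[derive(Debug, PartialEq)]
enum Notice { InputClosed(InputKey), InputRecovered(InputKey) }

#[derive(Default)]
struct Inputs {
    input_deadlines: HashMap<InputKey, Deadline>,
    broken_inputs: HashMap<InputKey, f64>, // keeps the timeout for recovery
    circuit_breaker_recoveries: u64,
}

impl Inputs {
    fn watch(&mut self, key: InputKey, timeout: f64, now: f64) {
        self.input_deadlines.insert(key, Deadline { timeout, last_received: now });
    }

    /// Trip the breaker for every input whose deadline has passed.
    fn check_input_timeouts(&mut self, now: f64) -> Vec<Notice> {
        let expired: Vec<InputKey> = self
            .input_deadlines
            .iter()
            .filter(|(_, d)| now - d.last_received > d.timeout)
            .map(|(key, _)| *key)
            .collect();
        let mut notices = Vec::new();
        for key in expired {
            if let Some(deadline) = self.input_deadlines.remove(&key) {
                self.broken_inputs.insert(key, deadline.timeout);
                notices.push(Notice::InputClosed(key));
            }
        }
        notices
    }

    /// Reset the deadline, or recover the input if it was broken.
    fn on_data(&mut self, key: InputKey, now: f64) -> Option<Notice> {
        if let Some(timeout) = self.broken_inputs.remove(&key) {
            self.watch(key, timeout, now); // fresh deadline restarts the timeout
            self.circuit_breaker_recoveries += 1;
            return Some(Notice::InputRecovered(key));
        }
        if let Some(deadline) = self.input_deadlines.get_mut(&key) {
            deadline.last_received = now;
        }
        None
    }
}
```

An input with a 5-second timeout that last received data at t=4 trips at the first sweep after t=9 and recovers on the next message, with no action required from the node.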

Node-Side Handling

In Rust nodes, handle these events in your event loop:

use dora_node_api::{DoraNode, Event};

// Returning a Result lets `?` propagate initialization errors.
fn main() -> eyre::Result<()> {
    let (mut node, mut events) = DoraNode::init_from_env()?;
    while let Some(event) = events.recv() {
        match event {
            Event::Input { id, data, .. } => {
                // Normal processing
            }
            Event::InputClosed { id } => {
                // Upstream stopped producing on this input.
                // You can: use cached data, skip processing, alert operator, etc.
            }
            Event::InputRecovered { id } => {
                // Upstream is back online for this input.
                // Resume normal processing.
            }
            Event::Stop(_) => break,
            _ => {}
        }
    }
    Ok(())
}

InputTracker API (Rust)

The InputTracker helper tracks input health and caches the last received value per input, making graceful degradation easy.

use dora_node_api::{DoraNode, Event, InputTracker, InputState};

// Returning a Result lets `?` propagate initialization errors.
fn main() -> eyre::Result<()> {
    let (mut node, mut events) = DoraNode::init_from_env()?;
    let mut tracker = InputTracker::new();

    while let Some(event) = events.recv() {
        // Update the tracker before handling the event
        tracker.process_event(&event);

        match event {
            Event::Input { id, data, .. } => {
                // Fresh data available
            }
            Event::InputClosed { id } => {
                // Input timed out -- fall back to cached data
                if let Some(stale_data) = tracker.last_value(&id) {
                    // Use stale_data as fallback
                }
            }
            Event::Stop(_) => break,
            _ => {}
        }

        // Check overall health
        if tracker.any_closed() {
            let closed: Vec<_> = tracker.closed_inputs();
            // Log or adjust behavior
        }
    }
    Ok(())
}

Internal Design

InputTracker maintains two HashMaps:

states: HashMap<DataId, InputState> – current state per input (Healthy or Closed)
cache: HashMap<DataId, ArrowData> – last received value per input

On Event::Input, both maps are updated (state = Healthy, cache = data clone). On Event::InputClosed, only state changes (cache is preserved). On Event::InputRecovered, state is set back to Healthy. The cache is never cleared, so last_value() always returns the most recent data even after the input closes.

Note: ArrowData wraps Arc<dyn arrow::array::Array>, so the cache clone is reference-counted (cheap).
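The two-map design can be illustrated with a stripped-down stand-in. MiniTracker below is not the real InputTracker: it uses String ids instead of DataId, Arc<[u8]> instead of ArrowData, and a local Event enum, all of which are assumptions made to keep the sketch self-contained.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Clone, Copy, Debug, PartialEq)]
enum InputState { Healthy, Closed }

// Simplified stand-in for the node event type.
enum Event {
    Input { id: String, data: Arc<[u8]> },
    InputClosed { id: String },
    InputRecovered { id: String },
}

#[derive(Default)]
struct MiniTracker {
    states: HashMap<String, InputState>,
    cache: HashMap<String, Arc<[u8]>>,
}

impl MiniTracker {
    fn process_event(&mut self, event: &Event) {
        match event {
            Event::Input { id, data } => {
                self.states.insert(id.clone(), InputState::Healthy);
                // Cloning an Arc only bumps a reference count.
                self.cache.insert(id.clone(), Arc::clone(data));
            }
            Event::InputClosed { id } => {
                // Only the state changes; the cached value is kept.
                self.states.insert(id.clone(), InputState::Closed);
            }
            Event::InputRecovered { id } => {
                self.states.insert(id.clone(), InputState::Healthy);
            }
        }
    }

    fn is_closed(&self, id: &str) -> bool {
        self.states.get(id) == Some(&InputState::Closed)
    }

    fn last_value(&self, id: &str) -> Option<&Arc<[u8]>> {
        self.cache.get(id)
    }
}
```

Keeping state and cache in separate maps is what allows last_value to keep returning data after an input closes.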

API Reference

Method	Returns	Description
`new()`	`InputTracker`	Create empty tracker
`process_event(&Event)`	`bool`	Update state. Returns true if event was relevant
`state(&DataId)`	`Option<InputState>`	Current state (Healthy or Closed)
`is_closed(&DataId)`	`bool`	Check if input is closed
`last_value(&DataId)`	`Option<&ArrowData>`	Last received value (available even when closed)
`closed_inputs()`	`Vec<&DataId>`	All currently closed inputs
`any_closed()`	`bool`	True if any tracked input is closed

Observability

The daemon tracks fault tolerance events with atomic counters (FaultToleranceStats) and logs a summary at each health check interval (every 5 seconds by default).

Counters

Counter	Type	Incremented when
`restarts`	`AtomicU64`	A node restart is initiated (in spawn lifecycle)
`health_check_kills`	`AtomicU64`	A node is killed by the health check (unresponsive)
`input_timeouts`	`AtomicU64`	An input timeout fires (circuit breaker trips)
`circuit_breaker_recoveries`	`AtomicU64`	Data arrives on a broken input (auto-recovery)

All counters use Ordering::Relaxed since they are informational and don’t need strict ordering guarantees.

Log Output

When any counter is non-zero, the daemon emits a structured log line:

INFO fault tolerance stats restarts=3 health_kills=0 input_timeouts=1 cb_recoveries=1

These counters are cumulative for the lifetime of the daemon process. They are not reset between dataflows.
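A counter set of this shape is straightforward to model. The sketch below reuses the counter names from the table and the field names from the log line, but the struct layout and the summary method are assumptions for illustration, not the daemon's definition.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct FaultToleranceStats {
    restarts: AtomicU64,
    health_check_kills: AtomicU64,
    input_timeouts: AtomicU64,
    circuit_breaker_recoveries: AtomicU64,
}

impl FaultToleranceStats {
    /// Returns the summary line, or None while every counter is zero.
    fn summary(&self) -> Option<String> {
        // Relaxed is sufficient: the values are informational only.
        let restarts = self.restarts.load(Ordering::Relaxed);
        let kills = self.health_check_kills.load(Ordering::Relaxed);
        let timeouts = self.input_timeouts.load(Ordering::Relaxed);
        let recoveries = self.circuit_breaker_recoveries.load(Ordering::Relaxed);
        if restarts + kills + timeouts + recoveries == 0 {
            return None;
        }
        Some(format!(
            "fault tolerance stats restarts={restarts} health_kills={kills} \
             input_timeouts={timeouts} cb_recoveries={recoveries}"
        ))
    }
}
```

Since the counters are never reset, alerting on the difference between two consecutive summaries is more useful than alerting on absolute values.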

Distributed Health

In multi-daemon deployments, the coordinator monitors daemon heartbeats.

Protocol

Heartbeat interval: 3 seconds (coordinator sends heartbeat to each daemon)
Disconnect threshold: 30 seconds without a response
Detection: On each heartbeat sweep, the coordinator removes daemons that haven’t responded within the threshold
Notification: The coordinator broadcasts PeerDaemonDisconnected { daemon_id } to all remaining daemons
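The detection step amounts to a sweep over the last-seen time of each daemon. The sketch below is a simplified model, assuming times are tracked as Duration values since coordinator start; the sweep function is hypothetical, and the broadcast of PeerDaemonDisconnected is left to the caller.

```rust
use std::collections::HashMap;
use std::time::Duration;

const DISCONNECT_THRESHOLD: Duration = Duration::from_secs(30);

/// Remove daemons that have been silent for longer than the threshold and
/// return their ids (sorted), so the caller can notify the remaining daemons.
fn sweep(last_seen: &mut HashMap<String, Duration>, now: Duration) -> Vec<String> {
    let mut disconnected: Vec<String> = last_seen
        .iter()
        .filter(|(_, seen)| now.saturating_sub(**seen) > DISCONNECT_THRESHOLD)
        .map(|(id, _)| id.clone())
        .collect();
    disconnected.sort();
    for id in &disconnected {
        last_seen.remove(id);
    }
    disconnected
}
```

With a 3-second heartbeat and a 30-second threshold, a daemon has to miss roughly ten consecutive heartbeats before it is removed.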

DaemonInfo

The ConnectedMachines CLI query returns Vec<DaemonInfo>:

pub struct DaemonInfo {
    pub daemon_id: DaemonId,
    pub last_heartbeat_ago_ms: u64,  // milliseconds since last heartbeat
}

This allows monitoring tools to detect daemons that are alive but slow to respond.

Daemon-Side Handling

When a daemon receives PeerDaemonDisconnected, it logs a structured warning:

WARN peer daemon disconnected daemon_id=machine-B

Currently this is informational. Future work may include automatic migration of nodes from the disconnected daemon.

Coordinator State Persistence

By default the coordinator holds all state in memory. If the coordinator process crashes or is restarted, all knowledge of running dataflows is lost – daemons continue running but become orphaned, and users must manually re-run dataflows.

The redb store backend solves this by persisting coordinator state to a single file on disk using redb, a pure-Rust embedded key-value store with copy-on-write B-trees that are crash-safe by design.

Design: Stateless Coordinator with Stateful Backend

The coordinator itself remains stateless in the Kubernetes sense: it can be stopped and restarted at any time. All durable state lives in the store backend behind the CoordinatorStore trait:

Coordinator (stateless process)
    |
    v
CoordinatorStore trait
    |
    +-- InMemoryStore (default, no persistence)
    +-- RedbStore     (persists to ~/.dora/coordinator.redb)

This separation means:

The coordinator event loop never reads from the filesystem during normal operation (only at startup recovery)
All state mutations are written to the store at well-defined persistence points
The store can be swapped without changing coordinator logic

Enabling Persistence

# Use default path (~/.dora/coordinator.redb)
dora coordinator --store redb

# Use custom path
dora coordinator --store redb:/path/to/coordinator.redb

# Default: in-memory only (no persistence)
dora coordinator --store memory

The redb backend requires the redb-backend Cargo feature, which is enabled in the default CLI build.

What Is Persisted

The store tracks three record types:

Record	Key	Persisted Fields
`DataflowRecord`	UUID (16 bytes)	uuid, name, descriptor (JSON), status, daemon IDs, generation counter, created/updated timestamps
`BuildRecord`	UUID (16 bytes)	build ID, status, errors, created/updated timestamps
`DaemonInfo`	DaemonId (bincode)	daemon ID, machine ID

Records are serialized with bincode for compact, fast encoding.

Dataflow Status Lifecycle

The coordinator persists dataflow status at every state transition:

Start command     -->  Pending
All daemons ready -->  Running
Stop command      -->  Stopping
All nodes finish  -->  Succeeded  or  Failed { error }
Spawn failure     -->  Failed { error: "spawn failed: ..." }

Each persist call increments the record’s generation counter, providing a monotonic version for conflict detection.
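
The lifecycle and generation counter can be modelled roughly as below. This is a sketch under assumptions: the record is reduced to three fields, the generation is assumed to start at 0 before the first transition, and `is_stale` is a hypothetical illustration of how a monotonic generation could support conflict detection, not the coordinator's actual check.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataflowStatus {
    Pending,
    Running,
    Stopping,
    Succeeded,
    Failed { error: String },
}

/// Reduced record: the real one also stores the descriptor, daemon IDs, etc.
#[derive(Debug, Clone)]
pub struct DataflowRecord {
    pub name: String,
    pub status: DataflowStatus,
    pub generation: u64,
}

impl DataflowRecord {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), status: DataflowStatus::Pending, generation: 0 }
    }

    /// Apply a state transition and bump the generation, mirroring
    /// "each persist call increments the generation counter".
    pub fn transition(&mut self, next: DataflowStatus) {
        self.status = next;
        self.generation += 1;
    }
}

/// Hypothetical check: an incoming write is stale if the store already holds
/// the same or a newer generation.
pub fn is_stale(stored_generation: u64, incoming: &DataflowRecord) -> bool {
    incoming.generation <= stored_generation
}
```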

Persistence Points

The coordinator writes to the store at these moments in the event loop:

Dataflow started (ControlRequest::Start) – record created with status Pending
Dataflow spawned (DataflowSpawnResult success from all daemons) – updated to Running
Spawn failed (DataflowSpawnResult error) – updated to Failed with the actual error message
Stop requested (ControlRequest::Stop or StopByName) – updated to Stopping
All nodes finished (DataflowFinishedOnDaemon) – updated to Succeeded or Failed with per-node error details
Graceful shutdown (Ctrl-C or Destroy command) – all running dataflows marked Stopping before stop messages are sent

If a store write fails, the coordinator logs a warning and continues operating with in-memory state. This prevents a store failure from blocking the dataflow lifecycle.

Startup Recovery

When the coordinator starts with a redb store that contains data from a previous run, it performs recovery:

Read all persisted dataflow records via store.list_dataflows()
For any record with a non-terminal status (Pending, Running, Stopping):
- Mark it as Failed { error: "coordinator restarted" }
- Increment the generation counter
- Write the updated record back to the store
Terminal records (Succeeded, Failed) are left unchanged

This ensures that stale dataflows from a crashed coordinator are not confused with actively running ones. The daemons that were running those dataflows will detect the coordinator disconnect independently.
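
The recovery pass can be sketched as a single loop over the persisted records. The types below are simplified stand-ins and `recover_stale` is a hypothetical helper; in the coordinator each modified record would additionally be written back to the store.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataflowStatus {
    Pending,
    Running,
    Stopping,
    Succeeded,
    Failed { error: String },
}

impl DataflowStatus {
    /// Succeeded and Failed are terminal; everything else is in flight.
    pub fn is_terminal(&self) -> bool {
        matches!(self, DataflowStatus::Succeeded | DataflowStatus::Failed { .. })
    }
}

#[derive(Debug, Clone)]
pub struct DataflowRecord {
    pub name: String,
    pub status: DataflowStatus,
    pub generation: u64,
}

/// Marks every non-terminal record as failed and bumps its generation.
/// Returns how many records were rewritten.
pub fn recover_stale(records: &mut [DataflowRecord]) -> usize {
    let mut recovered = 0;
    for record in records.iter_mut().filter(|r| !r.status.is_terminal()) {
        record.status = DataflowStatus::Failed { error: "coordinator restarted".to_string() };
        record.generation += 1;
        recovered += 1;
    }
    recovered
}
```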

Error Detail Preservation

When a dataflow fails, the Failed status includes the actual per-node error messages rather than a generic string:

Failed { error: "node-1: exited with code 137; node-2: failed to spawn node: binary not found" }

Errors are collected from DataflowDaemonResult.node_results across all daemons, formatted as node_id: error_message, and joined with ; .
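
The formatting rule is small enough to show directly. `format_node_errors` is a hypothetical helper that takes already-collected `(node_id, error_message)` pairs; how the pairs are gathered from `DataflowDaemonResult.node_results` is not shown.

```rust
/// Joins per-node errors as `node_id: error_message`, separated by `; `.
pub fn format_node_errors(errors: &[(&str, &str)]) -> String {
    errors
        .iter()
        .map(|(node_id, message)| format!("{node_id}: {message}"))
        .collect::<Vec<_>>()
        .join("; ")
}
```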

Schema Versioning

The redb database includes a meta table with a schema_version key. On open:

If no version exists (fresh database), the current version is written
If the stored version matches the binary’s version, the database opens normally
If there is a mismatch, the database is rejected with an error

This prevents silent data corruption when the serialization format of stored records changes between Dora versions. The current schema version is 1.
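
The three cases map onto a simple match. The function name, the `Option<u32>` representation of the stored version, and the error text are assumptions made for this sketch; only the three outcomes and the version number 1 come from the description above.

```rust
/// Current schema version, as stated in the text.
pub const SCHEMA_VERSION: u32 = 1;

/// `stored` is the value read from the meta table, or `None` for a fresh
/// database. Returns the version the database should carry after opening.
pub fn check_schema_version(stored: Option<u32>) -> Result<u32, String> {
    match stored {
        // Fresh database: the caller writes the current version.
        None => Ok(SCHEMA_VERSION),
        // Matching version: open normally.
        Some(v) if v == SCHEMA_VERSION => Ok(v),
        // Mismatch: refuse to open rather than misread records.
        Some(v) => Err(format!(
            "schema version mismatch: database has {}, binary expects {}",
            v, SCHEMA_VERSION
        )),
    }
}
```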

File Security

On Unix systems:

The database file is set to 0600 (owner read/write only) after creation
The default directory (~/.dora/) is set to 0700 (owner only)
Custom paths provided via redb:/path are validated to reject .. components
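
A sketch of the path check and the permission tightening, using only the standard library. `parse_store_path` and `restrict_to_owner` are hypothetical helper names and the error messages are invented for illustration; the actual validation in Dora may differ in detail.

```rust
use std::path::{Component, Path, PathBuf};

/// Parses a `redb:/path` store spec and rejects any `..` component.
pub fn parse_store_path(spec: &str) -> Result<PathBuf, String> {
    let raw = spec
        .strip_prefix("redb:")
        .ok_or_else(|| format!("expected `redb:<path>`, got `{spec}`"))?;
    let path = Path::new(raw);
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err(format!("store path `{raw}` must not contain `..`"));
    }
    Ok(path.to_path_buf())
}

/// Restricts an existing file to owner read/write (0600). Unix only.
#[cfg(unix)]
pub fn restrict_to_owner(path: &Path) -> std::io::Result<()> {
    use std::os::unix::fs::PermissionsExt;
    std::fs::set_permissions(path, std::fs::Permissions::from_mode(0o600))
}
```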

Internal Architecture

// Store trait (libraries/coordinator-store/src/lib.rs)
pub trait CoordinatorStore: Send + Sync {
    fn put_dataflow(&self, record: &DataflowRecord) -> Result<()>;
    fn get_dataflow(&self, uuid: &Uuid) -> Result<Option<DataflowRecord>>;
    fn list_dataflows(&self) -> Result<Vec<DataflowRecord>>;
    fn delete_dataflow(&self, uuid: &Uuid) -> Result<()>;
    // ... daemon and build methods
}

The RedbStore implementation uses three redb tables (daemons, dataflows, builds) with binary keys (16-byte UUIDs for dataflows and builds, bincode-encoded DaemonId for daemons) and bincode-serialized values. All operations are synchronous (redb is a synchronous library); the coordinator calls them directly from the async event loop since they are fast in-process operations.

A bincode deserialization limit of 64 MiB guards against corrupted data that could encode huge allocation sizes in length prefixes.
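
To illustrate how a backend plugs in behind the trait, here is a minimal in-memory implementation of the four dataflow methods. It is a sketch, not Dora's `InMemoryStore`: `Uuid` is replaced by a 16-byte array, `DataflowRecord` is reduced to three fields, and `StoreResult` is a stand-in for the crate's `Result` alias.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

pub type Uuid = [u8; 16]; // stand-in for the real UUID type
pub type StoreResult<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, PartialEq)]
pub struct DataflowRecord {
    pub uuid: Uuid,
    pub name: String,
    pub generation: u64,
}

pub trait CoordinatorStore: Send + Sync {
    fn put_dataflow(&self, record: &DataflowRecord) -> StoreResult<()>;
    fn get_dataflow(&self, uuid: &Uuid) -> StoreResult<Option<DataflowRecord>>;
    fn list_dataflows(&self) -> StoreResult<Vec<DataflowRecord>>;
    fn delete_dataflow(&self, uuid: &Uuid) -> StoreResult<()>;
}

/// Keeps everything in a mutex-guarded map; nothing survives a restart.
#[derive(Default)]
pub struct InMemoryStore {
    dataflows: Mutex<HashMap<Uuid, DataflowRecord>>,
}

impl InMemoryStore {
    fn lock(&self) -> StoreResult<MutexGuard<'_, HashMap<Uuid, DataflowRecord>>> {
        self.dataflows.lock().map_err(|e| e.to_string())
    }
}

impl CoordinatorStore for InMemoryStore {
    fn put_dataflow(&self, record: &DataflowRecord) -> StoreResult<()> {
        self.lock()?.insert(record.uuid, record.clone());
        Ok(())
    }
    fn get_dataflow(&self, uuid: &Uuid) -> StoreResult<Option<DataflowRecord>> {
        Ok(self.lock()?.get(uuid).cloned())
    }
    fn list_dataflows(&self) -> StoreResult<Vec<DataflowRecord>> {
        Ok(self.lock()?.values().cloned().collect())
    }
    fn delete_dataflow(&self, uuid: &Uuid) -> StoreResult<()> {
        self.lock()?.remove(uuid);
        Ok(())
    }
}
```

Because the methods take `&self` and the trait is `Send + Sync`, the store can be shared behind an `Arc` without changing coordinator logic.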

Complete YAML Reference

# Dataflow-level settings
health_check_interval: 2.0    # health check sweep interval (default: 5.0s)

nodes:
  - id: sensor-node
    path: ./target/debug/sensor
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - frames

  - id: processor
    path: ./target/debug/processor

    # Restart policy
    restart_policy: on-failure    # never | on-failure | always
    max_restarts: 5               # 0 = unlimited
    restart_delay: 1.0            # initial backoff delay (seconds)
    max_restart_delay: 30.0       # max backoff cap (seconds)
    restart_window: 300.0         # reset counter after N seconds

    # Health monitoring
    health_check_timeout: 30.0    # kill if no activity for N seconds

    inputs:
      frames:
        source: sensor-node/frames
        input_timeout: 5.0        # circuit breaker timeout (seconds)
        queue_size: 10            # input buffer size (default: 10)
    outputs:
      - result

Use Case Scenarios

1. Camera Pipeline with Intermittent Hardware Failures

A camera driver node occasionally crashes due to USB disconnects. The processing pipeline should survive these outages and resume when the camera reconnects.

nodes:
  - id: camera-driver
    path: ./target/debug/camera-driver
    restart_policy: on-failure
    max_restarts: 0               # unlimited -- hardware failures are expected
    restart_delay: 2.0            # wait for USB to re-enumerate
    max_restart_delay: 30.0
    inputs:
      tick: dora/timer/millis/33  # ~30 FPS
    outputs:
      - frames

  - id: object-detector
    path: ./target/debug/detector
    inputs:
      frames:
        source: camera-driver/frames
        input_timeout: 5.0        # tolerate 5s camera outage
    outputs:
      - detections

  - id: planner
    path: ./target/debug/planner
    inputs:
      detections:
        source: object-detector/detections
        input_timeout: 10.0       # longer tolerance -- can plan with stale data
      lidar:
        source: lidar-driver/points
        input_timeout: 3.0

What happens when the camera crashes:

camera-driver exits with non-zero code
Daemon evaluates on-failure policy -> restart after 2s backoff
During the outage, object-detector receives InputClosed { id: "frames" } after 5s
planner receives InputClosed { id: "detections" } after 10s
Camera restarts, begins producing frames
object-detector receives new frame data + InputRecovered { id: "frames" } (circuit breaker recovers)
planner receives detections + InputRecovered { id: "detections" }

Node-side handling in the planner:

use dora_node_api::{DoraNode, Event, InputTracker};

let (mut node, mut events) = DoraNode::init_from_env()?;
let mut tracker = InputTracker::new();

while let Some(event) = events.recv() {
    tracker.process_event(&event);

    match event {
        Event::Input { id, data, .. } => match id.as_ref() {
            "detections" => plan_with_detections(&data),
            "lidar" => update_lidar_map(&data),
            _ => {}
        },
        Event::InputClosed { id } => match id.as_ref() {
            "detections" => {
                // Camera pipeline down -- plan with lidar only
                plan_lidar_only();
            }
            "lidar" => {
                // LiDAR down -- use last known detection data
                if let Some(stale) = tracker.last_value(&"detections".into()) {
                    plan_with_stale_detections(stale);
                }
            }
            _ => {}
        },
        Event::Stop(_) => break,
        _ => {}
    }
}

2. ML Inference Node with OOM Crashes

An ML inference node occasionally runs out of memory on large inputs. It should restart quickly but give up after repeated failures (indicating a systemic issue).

nodes:
  - id: ml-inference
    path: ./target/debug/ml-inference
    restart_policy: on-failure
    max_restarts: 3
    restart_delay: 0.5
    restart_window: 60.0          # 3 restarts per minute
    health_check_timeout: 60.0    # ML inference can be slow
    inputs:
      images:
        source: preprocessor/images
    outputs:
      - predictions

Behavior:

Node crashes from OOM -> restarts after 0.5s
Crashes again on another large input -> restarts after 1.0s
Crashes a third time -> restarts after 2.0s
Crashes a fourth time within 60s -> max_restarts exceeded, node fails permanently
If the node runs stably for 60s after the first crash, the restart window resets and it gets 3 more chances
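
The delays in this scenario follow a doubling schedule capped at max_restart_delay. The function below is a sketch of that arithmetic, assuming a factor of 2 per attempt (which matches the 0.5s, 1.0s, 2.0s sequence above); `restart_delay_secs` is a hypothetical name, not a Dora API.

```rust
/// Delay before restart number `attempt` (0-based): the initial delay doubled
/// once per previous restart, capped at `max_delay`.
pub fn restart_delay_secs(initial: f64, max_delay: f64, attempt: u32) -> f64 {
    // Clamp the exponent so the cast cannot overflow; very large powers
    // saturate to infinity and are then capped by `min`.
    let factor = 2f64.powi(attempt.min(64) as i32);
    (initial * factor).min(max_delay)
}
```

With `initial == max_delay` the cap applies from the first restart onward, which is how a constant delay between restarts can be configured.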

3. Multi-Sensor Fusion with Graceful Degradation

A robot fuses data from multiple sensors. Individual sensors may fail, but the system should continue operating with reduced capability.

nodes:
  - id: sensor-fusion
    path: ./target/debug/sensor-fusion
    inputs:
      camera:
        source: camera-node/frames
        input_timeout: 3.0
      lidar:
        source: lidar-node/points
        input_timeout: 3.0
      imu:
        source: imu-node/readings
        input_timeout: 1.0        # IMU is critical, short timeout
      gps:
        source: gps-node/fix
        input_timeout: 10.0       # GPS can be intermittent
    outputs:
      - fused-state

Node-side with InputTracker:

use dora_node_api::{DoraNode, Event, InputTracker};

let (mut node, mut events) = DoraNode::init_from_env()?;
let mut tracker = InputTracker::new();

while let Some(event) = events.recv() {
    tracker.process_event(&event);

    match event {
        Event::Input { id, data, .. } => {
            // Process fresh data from any sensor
            update_sensor(&id, &data);
            compute_and_send_fusion(&mut node, &tracker);
        }
        Event::InputClosed { id } => {
            // Sensor went offline -- adjust fusion weights
            eprintln!("sensor {id} offline, degrading");
            compute_and_send_fusion(&mut node, &tracker);
        }
        Event::InputRecovered { id } => {
            // Sensor back online
            eprintln!("sensor {id} recovered");
        }
        Event::Stop(_) => break,
        _ => {}
    }
}

fn compute_and_send_fusion(node: &mut DoraNode, tracker: &InputTracker) {
    // Use fresh data where available, stale cache for degraded sensors
    let camera = tracker.last_value(&"camera".into());
    let lidar = tracker.last_value(&"lidar".into());
    let imu = tracker.last_value(&"imu".into());

    if tracker.is_closed(&"imu".into()) {
        // IMU is critical -- switch to emergency mode
        emergency_stop(node);
        return;
    }

    // Fuse available sensors, weighting active ones higher
    let closed = tracker.closed_inputs();
    let active_count = 4 - closed.len();
    // ... fusion logic using active_count for confidence weighting
}

4. Long-Running Data Processing Pipeline

A batch processing pipeline runs continuously. The processing node occasionally hangs due to a third-party library bug. Health monitoring detects and recovers from these hangs.

nodes:
  - id: data-ingest
    path: ./target/debug/ingest
    restart_policy: always        # always restart -- this is a long-running service
    max_restarts: 0               # unlimited
    restart_delay: 1.0
    inputs:
      tick: dora/timer/millis/1000
    outputs:
      - records

  - id: processor
    path: ./target/debug/processor
    restart_policy: on-failure
    max_restarts: 10
    restart_delay: 0.5
    restart_window: 600.0         # 10 restarts per 10 minutes
    health_check_timeout: 30.0    # kill if hung for 30s
    inputs:
      records: data-ingest/records
    outputs:
      - results

  - id: writer
    path: ./target/debug/writer
    restart_policy: on-failure
    max_restarts: 5
    restart_delay: 2.0            # give DB time to recover
    max_restart_delay: 60.0
    inputs:
      results:
        source: processor/results
        input_timeout: 60.0       # processor may be slow

What happens when the processor hangs:

Processor stops communicating with daemon
After 30s, health check detects the hang and kills the process
health_check_kills counter increments
Daemon evaluates on-failure -> restart after 0.5s
New processor instance starts, resumes consuming from data-ingest
writer may have received InputClosed during the 60s timeout – or may not if the restart was fast enough
If writer did receive InputClosed, it gets InputRecovered when new results arrive

5. Distributed Deployment with Daemon Failure Detection

A multi-machine deployment where the coordinator monitors daemon health.

Machine A (coordinator + daemon):  camera-driver, preprocessor
Machine B (daemon):                ml-inference, postprocessor
Machine C (daemon):                planner, actuator-driver

What happens when Machine B loses network:

Coordinator’s heartbeat to Machine B fails
After 30s without response, coordinator removes Machine B from active daemons
Coordinator broadcasts PeerDaemonDisconnected { daemon_id: "machine-B" } to Machine A and Machine C
Daemons on A and C log: WARN peer daemon disconnected daemon_id=machine-B
Nodes on A and C with inputs from Machine B’s nodes receive InputClosed events (via their input timeouts)
CLI queries to ConnectedMachines show only A and C with their last_heartbeat_ago_ms

6. Coordinator Crash Recovery with redb Persistence

A long-running multi-daemon deployment where the coordinator must survive restarts without losing track of dataflow history.

# Start coordinator with persistent store
dora coordinator --store redb

# In another terminal, start a dataflow
dora start examples/rust-dataflow/dataflow.yml --name my-pipeline --detach

# Coordinator crashes or is killed (e.g., OOM, hardware failure)
# ... time passes ...

# Restart coordinator with the same store
dora coordinator --store redb

What happens on restart:

Coordinator opens ~/.dora/coordinator.redb and reads persisted dataflow records
Finds my-pipeline with status Running
Marks it as Failed { error: "coordinator restarted" }, increments generation
Logs: INFO recovering stale dataflow <uuid> ("my-pipeline") -> marking as Failed
dora list now shows my-pipeline with its final status and timestamps
Daemons detect the coordinator disconnect independently and stop their nodes
User can start a fresh dataflow – the coordinator is fully operational

The key benefit: the coordinator retains a complete history of dataflow lifecycle events across restarts. Without --store redb, all state would be lost and the operator would have no record of what was running before the crash.

7. Periodic Batch Job with Always-Restart

A node that processes batches and exits when done. It should restart to process the next batch.

nodes:
  - id: batch-processor
    path: ./target/debug/batch-proc
    restart_policy: always        # restart even on clean exit
    max_restarts: 0               # unlimited
    restart_delay: 10.0           # wait 10s between batches
    max_restart_delay: 10.0       # no exponential growth
    inputs:
      trigger: dora/timer/millis/1  # immediate first trigger
    outputs:
      - batch-result

The node processes one batch and exits with code 0; the daemon then waits 10s and restarts it to process the next batch. The always policy ensures restarts even on success. Setting restart_delay == max_restart_delay gives a constant delay.

Best Practices

Start with on-failure. Use always only for nodes that are expected to exit and restart (e.g., periodic batch jobs).

Set max_restarts. Unlimited restarts can mask bugs. Start with 3-5 and increase if needed. Use max_restarts: 0 only for nodes where crashes are expected and unavoidable (hardware drivers, external API clients).

Use restart_window. Prevents permanent restart loops. A window of 60-300 seconds is typical. Without a window, a node that crashes at startup will exhaust its restart budget immediately.

Tune restart_delay. Start with 0.5-1.0 seconds. Too short causes thrashing; too long delays recovery. Match the delay to your node’s typical startup time and the root cause of failures:

USB/hardware reconnection: 2-5s
Network service reconnection: 1-3s
OOM/transient bugs: 0.5-1.0s

Set health_check_timeout generously. Should be at least 2-3x your node’s longest expected processing time. ML inference nodes may need 60s+. If too short, healthy nodes get killed during normal processing.

Set input_timeout per input. Not all inputs need the same timeout. Use shorter timeouts for high-frequency inputs (IMU, camera) and longer timeouts for slow/bursty sources (GPS, batch results). A good starting point is 3-5x the expected publish interval.

Use InputTracker for critical paths. When a node must keep running even with degraded inputs, use InputTracker to fall back to cached data. This is essential for sensor fusion, planning, and control nodes.

Use --store redb for production deployments. The redb backend ensures the coordinator retains dataflow history across crashes and restarts. The in-memory default is fine for development but loses all state on exit. The redb file is small (proportional to the number of dataflow records) and adds negligible overhead.

Known limitation: when the coordinator disconnects, the daemon currently kills running node processes before reconnecting (#260). Dataflow records survive coordinator restart (via redb), but running processes are restarted from scratch. Seamless process reclaim across reconnect is planned.

Combine features for defense in depth:

restart_policy + restart_delay -> recover from node crashes
health_check_timeout -> recover from hung nodes
input_timeout -> detect stale upstream data
InputTracker -> graceful degradation in node code
--store redb -> survive coordinator crashes

Distributed Deployment Guide

Dora supports deploying dataflows across multiple machines for multi-robot fleets, edge AI pipelines, and distributed robotics systems. This guide covers cluster management, node scheduling, binary distribution, auto-recovery, and operational best practices.

Overview

Dora’s distributed architecture has three tiers:

CLI  -->  Coordinator  -->  Daemon(s)  -->  Nodes / Operators
              (one)          (per machine)     (user code)

CLI sends control commands (build, start, stop) to the coordinator.
Coordinator orchestrates daemons, resolves node placement, and manages dataflow lifecycle.
Daemons run on each machine, spawning and supervising node processes.
Nodes communicate via shared memory (same machine) or Zenoh pub-sub (cross-machine).

There are two paths to distributed deployment:

Ad-hoc – manually start dora daemon on each machine, then use the coordinator for control. Good for development and testing. See Distributed Deployments in the CLI reference.

Managed (cluster.yml) – define your cluster topology in a YAML file, then use dora cluster commands for SSH-based lifecycle management. This guide focuses on the managed path.

Quick Start

Create a cluster.yml:

coordinator:
  addr: 10.0.0.1
machines:
  - id: robot
    host: 10.0.0.2
    user: ubuntu
  - id: gpu-server
    host: 10.0.0.3
    user: ubuntu

Bring up the cluster:

dora cluster up cluster.yml

Start a dataflow:

dora start dataflow.yml --name my-app --attach

Check cluster health:

dora cluster status

Tear down:

dora cluster down

Feature Overview

Feature	Command / Config	Description
Cluster lifecycle	`dora cluster up/status/down`	SSH-based daemon management from a single machine
Label scheduling	`_unstable_deploy.labels`	Route nodes to daemons by key-value labels
Binary distribution	`_unstable_deploy.distribute`	local, scp, or http strategies
systemd services	`dora cluster install/uninstall`	Persistent daemon services that survive reboots
Auto-recovery	Automatic	Re-spawn nodes when a daemon reconnects
Rolling upgrade	`dora cluster upgrade`	SCP binary + restart per-machine sequentially
Dataflow restart	`dora cluster restart`	Restart a running dataflow by name or UUID

Cluster Configuration Reference

A cluster.yml file defines the coordinator address and the set of machines in the cluster.

Full Schema

coordinator:
  addr: 10.0.0.1            # IP address the coordinator binds to (required)
  port: 6013                 # WebSocket port (default: 6013)

machines:
  - id: edge-01              # Unique machine identifier (required)
    host: 10.0.0.2           # SSH-reachable hostname or IP (required)
    user: ubuntu              # SSH user (optional, defaults to current user)
    labels:                   # Key-value labels for scheduling (optional)
      gpu: "true"
      arch: arm64

  - id: edge-02
    host: 10.0.0.3
    labels:
      arch: arm64

Fields

coordinator

Field	Type	Default	Description
`addr`	IP address	(required)	Address the coordinator binds to
`port`	u16	`6013`	WebSocket port

machines[]

Field	Type	Default	Description
`id`	string	(required)	Unique machine identifier, used in `_unstable_deploy.machine`
`host`	string	(required)	SSH-reachable hostname or IP address
`user`	string	current user	SSH username
`labels`	map	empty	Key-value pairs for label-based scheduling

Validation Rules

At least one machine must be defined.
Machine IDs must be non-empty and unique.
Machine hosts must be non-empty.
Unknown fields are rejected (deny_unknown_fields).
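
The first three rules can be expressed as a short validation pass. This is a sketch over a reduced `Machine` struct with hypothetical error messages; rejecting unknown fields happens at parse time (the text refers to `deny_unknown_fields`) and is therefore not modelled here.

```rust
use std::collections::HashSet;

/// Reduced machine entry: only the fields the rules below refer to.
pub struct Machine {
    pub id: String,
    pub host: String,
}

/// Checks: at least one machine, non-empty unique ids, non-empty hosts.
pub fn validate_machines(machines: &[Machine]) -> Result<(), String> {
    if machines.is_empty() {
        return Err("at least one machine must be defined".to_string());
    }
    let mut seen = HashSet::new();
    for machine in machines {
        if machine.id.is_empty() {
            return Err("machine id must be non-empty".to_string());
        }
        if machine.host.is_empty() {
            return Err(format!("machine `{}` has an empty host", machine.id));
        }
        // `insert` returns false when the id was already present.
        if !seen.insert(machine.id.as_str()) {
            return Err(format!("duplicate machine id `{}`", machine.id));
        }
    }
    Ok(())
}
```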

Example: 3-Machine GPU Cluster

coordinator:
  addr: 192.168.1.1

machines:
  - id: coordinator-host
    host: 192.168.1.1
    labels:
      role: control

  - id: gpu-a100
    host: 192.168.1.10
    user: ml
    labels:
      gpu: a100
      arch: x86_64

  - id: jetson-01
    host: 192.168.1.20
    user: nvidia
    labels:
      gpu: jetson
      arch: arm64

Cluster Commands Reference

All dora cluster commands operate on a cluster.yml file and use SSH to manage remote machines.

SSH options used: BatchMode=yes, ConnectTimeout=10, StrictHostKeyChecking=accept-new.

dora cluster up

Bring up a multi-machine cluster from a cluster.yml file. Starts the coordinator locally, then SSH-es into each machine to start a daemon.

dora cluster up <PATH>

Arguments:

Argument	Description
`PATH`	Path to the cluster configuration file

Behavior:

Loads and validates the cluster config.
Starts the coordinator locally on addr:port.
For each machine, SSH-es in and runs nohup dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> [--labels k1=v1,k2=v2] --quiet.
Polls until all expected daemons register with the coordinator (30s timeout).
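
The per-machine daemon command in step 3 could be assembled as follows. `daemon_args` is a hypothetical helper; the flags are the ones listed above, and the labels are emitted in sorted key order only because this sketch stores them in a `BTreeMap`.

```rust
use std::collections::BTreeMap;

/// Builds the daemon command line for one machine. `--labels` is added only
/// when the machine has labels, matching the optional `[--labels ...]` part.
pub fn daemon_args(
    machine_id: &str,
    coordinator_addr: &str,
    coordinator_port: u16,
    labels: &BTreeMap<String, String>,
) -> Vec<String> {
    let mut args: Vec<String> = vec![
        "dora".to_string(),
        "daemon".to_string(),
        "--machine-id".to_string(),
        machine_id.to_string(),
        "--coordinator-addr".to_string(),
        coordinator_addr.to_string(),
        "--coordinator-port".to_string(),
        coordinator_port.to_string(),
    ];
    if !labels.is_empty() {
        let joined = labels
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect::<Vec<_>>()
            .join(",");
        args.push("--labels".to_string());
        args.push(joined);
    }
    args.push("--quiet".to_string());
    args
}
```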

Example:

$ dora cluster up cluster.yml
Starting coordinator on 10.0.0.1:6013...
Starting daemon on robot (ubuntu@10.0.0.2)... OK
Starting daemon on gpu-server (ubuntu@10.0.0.3)... OK
All 2 daemons connected.

dora cluster status

Show the current status of the cluster. Displays connected daemons and active dataflow count.

dora cluster status [--coordinator-addr ADDR] [--coordinator-port PORT]

Flags:

Flag	Default	Description
`--coordinator-addr`	`localhost`	Coordinator hostname or IP
`--coordinator-port`	`6013`	Coordinator WebSocket port

Example:

$ dora cluster status
DAEMON ID      LAST HEARTBEAT
robot          2s ago
gpu-server     1s ago

Active dataflows: 1

dora cluster down

Tear down the cluster (coordinator and all daemons).

dora cluster down [--coordinator-addr ADDR] [--coordinator-port PORT]

Terminates all daemons and the coordinator process.

dora cluster install

Install dora-daemon as a systemd service on each machine. SSH-es into each machine, writes a systemd unit file, and enables the service.

dora cluster install <PATH>

Arguments:

Argument	Description
`PATH`	Path to the cluster configuration file

Behavior:

For each machine, creates and enables a systemd service named dora-daemon-<id>. The unit file:

[Unit]
Description=Dora Daemon (<id>)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=dora daemon --machine-id <id> --coordinator-addr <addr> --coordinator-port <port> --labels k1=v1,k2=v2 --quiet
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Example:

$ dora cluster install cluster.yml
Installing dora-daemon-robot on ubuntu@10.0.0.2... OK
Installing dora-daemon-gpu-server on ubuntu@10.0.0.3... OK
2/2 succeeded.

dora cluster uninstall

Uninstall dora-daemon systemd services from each machine. Stops, disables, and removes the systemd unit.

dora cluster uninstall <PATH>

Behavior:

For each machine, runs:

sudo systemctl stop dora-daemon-<id>
sudo systemctl disable dora-daemon-<id>
sudo rm -f /etc/systemd/system/dora-daemon-<id>.service
sudo systemctl daemon-reload

dora cluster upgrade

Rolling upgrade: SCP the local dora binary to each machine and restart daemons. Processes machines sequentially to maintain availability.

dora cluster upgrade <PATH>

Behavior:

For each machine sequentially:

SCP the local dora binary to /usr/local/bin/dora on the target machine.
Restart the systemd service via sudo systemctl restart dora-daemon-<id>.
Poll the coordinator until the daemon reconnects (30s timeout, 500ms intervals).

Nodes on other machines continue running while each machine is being upgraded.
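
The reconnect wait in step 3 is a poll-with-timeout loop. A generic sketch, where `wait_until` is a hypothetical helper and the closure stands in for "ask the coordinator whether the daemon is connected":

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Calls `check` until it returns true or `timeout` elapses, sleeping
/// `interval` between attempts. Returns the elapsed time on success.
pub fn wait_until<F: FnMut() -> bool>(
    mut check: F,
    timeout: Duration,
    interval: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if check() {
            return Some(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return None;
        }
        thread::sleep(interval);
    }
}
```

For the upgrade described above, the call would use a 30-second timeout and a 500 ms interval.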

Example:

$ dora cluster upgrade cluster.yml
Upgrading robot (ubuntu@10.0.0.2)...
  SCP binary... OK
  Restart service... OK
  Waiting for reconnect... OK (3.2s)
Upgrading gpu-server (ubuntu@10.0.0.3)...
  SCP binary... OK
  Restart service... OK
  Waiting for reconnect... OK (2.8s)
2/2 succeeded.

dora cluster restart

Restart a running dataflow by name or UUID. Stops the dataflow and immediately re-starts it using the stored descriptor (no YAML path needed).

dora cluster restart <DATAFLOW>

Arguments:

Argument	Description
`DATAFLOW`	Name or UUID of the dataflow to restart

Example:

$ dora cluster restart my-app
Restarting dataflow `my-app`
dataflow restarted: a1b2c3d4-... -> e5f6a7b8-...

Node Scheduling

When the coordinator receives a dataflow, it decides which daemon runs each node based on the _unstable_deploy section in the dataflow YAML. Resolution priority: machine > labels > unnamed.

Machine-based scheduling

Assign a node to a specific machine by its id from cluster.yml:

nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames

The coordinator looks up the daemon whose machine-id matches. If no matching daemon is connected, the deployment fails with: no matching daemon for machine id "robot".

Label-based scheduling

Assign a node by requiring specific labels on the target daemon:

nodes:
  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions

The coordinator finds the first connected daemon whose labels are a superset of the required labels. All required key-value pairs must match exactly. If no daemon satisfies the requirements, deployment fails with: no daemon matches labels {"gpu": "true"}.

Unassigned nodes

Nodes without an _unstable_deploy section (or with an empty one) are assigned to the first unnamed daemon – one that connected without a --machine-id flag.

How resolve_daemon() works internally

The coordinator resolves node placement in coordinator/run/mod.rs:

resolve_daemon(connections, deploy) -> DaemonId
  1. If deploy.machine is Some(id):
       -> look up daemon by machine-id
  2. Else if deploy.labels is non-empty:
       -> find first daemon where all required labels match
  3. Else:
       -> pick first unnamed daemon

The label matching function iterates over all connected daemons and checks that every required key-value pair exists in the daemon’s label set (conn.labels.get(k) == Some(v)). This is a superset check: a daemon with {gpu: "true", arch: "arm64", role: "edge"} satisfies the requirement {gpu: "true"}.
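
The pseudocode above translates to Rust roughly as follows. The `DaemonConn` and `Deploy` structs are simplified stand-ins for the coordinator's types, and the error text for the unnamed case is invented; the other two messages follow the ones quoted earlier in this section.

```rust
use std::collections::BTreeMap;

/// Simplified view of a connected daemon.
pub struct DaemonConn {
    pub daemon_id: String,
    pub machine_id: Option<String>,
    pub labels: BTreeMap<String, String>,
}

/// Simplified `_unstable_deploy` section.
#[derive(Default)]
pub struct Deploy {
    pub machine: Option<String>,
    pub labels: BTreeMap<String, String>,
}

/// Resolution priority: machine > labels > first unnamed daemon.
pub fn resolve_daemon<'a>(
    connections: &'a [DaemonConn],
    deploy: &Deploy,
) -> Result<&'a str, String> {
    let found = if let Some(machine) = &deploy.machine {
        connections
            .iter()
            .find(|c| c.machine_id.as_deref() == Some(machine.as_str()))
            .ok_or_else(|| format!("no matching daemon for machine id {machine:?}"))?
    } else if !deploy.labels.is_empty() {
        // Superset check: every required pair must be present on the daemon.
        connections
            .iter()
            .find(|c| deploy.labels.iter().all(|(k, v)| c.labels.get(k) == Some(v)))
            .ok_or_else(|| format!("no daemon matches labels {:?}", deploy.labels))?
    } else {
        connections
            .iter()
            .find(|c| c.machine_id.is_none())
            .ok_or_else(|| "no unnamed daemon connected".to_string())?
    };
    Ok(found.daemon_id.as_str())
}
```

Because the label branch takes the first match, the order in which daemons connected can decide placement when several daemons satisfy the same requirement.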

Binary Distribution

Control how node binaries are delivered to remote daemons via the distribute field.

Local (default)

Each daemon builds from source on its own machine. This is the current default behavior.

nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: local
    path: ./my-node

SCP mode

The CLI pushes the locally-built binary to the target machine via SSH/SCP before spawning.

nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: scp
    path: ./my-node

HTTP mode

The coordinator runs an artifact store. Daemons pull binaries from the coordinator via HTTP before spawning.

nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01
      distribute: http
    path: ./my-node

Artifacts are served from GET /api/artifacts/{build_id}/{node_id} on the coordinator’s WebSocket port. The endpoint requires authentication (Bearer token) and sanitizes node IDs to prevent path traversal.

When to use each strategy

Strategy	Best for	Tradeoffs
`local`	Homogeneous clusters, CI builds	Requires build toolchain on every machine
`scp`	Heterogeneous clusters, cross-compiled binaries	Requires SSH access from CLI to all machines
`http`	Air-gapped daemons, firewalled networks	Requires coordinator reachability from all daemons

systemd Service Management

For production deployments, install daemons as systemd services so they survive reboots and auto-restart on failure.

Install

dora cluster install cluster.yml

Creates a systemd unit file on each machine (see dora cluster install for the full unit template). Key properties:

Restart=on-failure with RestartSec=5: daemon auto-restarts if it crashes.
After=network-online.target: waits for network before starting.
WantedBy=multi-user.target: starts on boot.

Uninstall

dora cluster uninstall cluster.yml

Stops, disables, and removes the unit file from each machine, then reloads the systemd daemon.

Verifying service status

After install, check services directly:

ssh ubuntu@10.0.0.2 sudo systemctl status dora-daemon-robot

Auto-Recovery

When a daemon disconnects and reconnects (e.g., after a network blip, machine reboot, or service restart), the coordinator automatically re-spawns any missing dataflows on that daemon.

How it works

Daemon reconnects and sends a StatusReport listing its currently running dataflows.
Coordinator compares the report against its expected state (dataflows that should have nodes on this daemon).
For each running dataflow with nodes assigned to this daemon that the daemon did not report, the coordinator sends a SpawnDataflowNodes command to re-spawn the missing nodes.
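
The comparison in steps 2 and 3 is a set difference. A small sketch, with dataflows identified by plain strings and `dataflows_to_respawn` as a hypothetical helper name:

```rust
use std::collections::BTreeSet;

/// Dataflows the coordinator expects on a daemon but the daemon did not
/// report as running: these are the ones to re-spawn.
pub fn dataflows_to_respawn(
    expected: &BTreeSet<String>,
    reported: &BTreeSet<String>,
) -> Vec<String> {
    expected.difference(reported).cloned().collect()
}
```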

30-second backoff

To prevent crash loops (e.g., a node that immediately crashes on spawn), recovery uses a per-daemon, per-dataflow backoff:

After a recovery attempt, the coordinator records the timestamp.
Subsequent recovery for the same daemon/dataflow pair is skipped until 30 seconds have elapsed.
The backoff clears when the daemon reports the dataflow as running again.

This means a node that crashes immediately will only be re-spawned once every 30 seconds, not in a tight loop.
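
The backoff bookkeeping can be sketched with a map keyed by the daemon/dataflow pair. `RecoveryBackoff`, `try_begin`, and `clear` are hypothetical names; the 30-second constant and the clearing rule come from the list above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub const RECOVERY_BACKOFF: Duration = Duration::from_secs(30);

/// Tracks the last recovery attempt per (daemon, dataflow) pair.
#[derive(Default)]
pub struct RecoveryBackoff {
    last_attempt: HashMap<(String, String), Instant>,
}

impl RecoveryBackoff {
    /// Returns true and records the attempt if recovery may run now;
    /// returns false while the pair is still inside its backoff period.
    pub fn try_begin(&mut self, daemon: &str, dataflow: &str, now: Instant) -> bool {
        let key = (daemon.to_string(), dataflow.to_string());
        if let Some(last) = self.last_attempt.get(&key) {
            if now.saturating_duration_since(*last) < RECOVERY_BACKOFF {
                return false;
            }
        }
        self.last_attempt.insert(key, now);
        true
    }

    /// Called when the daemon reports the dataflow as running again.
    pub fn clear(&mut self, daemon: &str, dataflow: &str) {
        self.last_attempt
            .remove(&(daemon.to_string(), dataflow.to_string()));
    }
}
```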

Limitations

Auto-recovery only applies to dataflows started via dora start (coordinator-managed). Local dora run dataflows are not tracked by the coordinator.
Recovery re-spawns all nodes assigned to the reconnecting daemon, not individual nodes. For per-node restart on crash, use restart policies.
Known issue (#260): when the daemon’s WebSocket connection to the coordinator drops, the daemon currently kills all running node processes before reconnecting. This means the coordinator’s auto-recovery path re-spawns the nodes from scratch rather than reclaiming still-running processes. The net effect is a brief disruption (nodes restart) rather than seamless continuity. A fix to preserve running processes across reconnect cycles is planned.

Rolling Upgrade

Upgrade the dora binary on all cluster machines one machine at a time, so the rest of the cluster keeps running. Nodes on the machine being upgraded restart when its daemon restarts.

Process

dora cluster upgrade cluster.yml

For each machine, sequentially:

SCP the local dora binary to /usr/local/bin/dora on the target.
Restart the systemd service (systemctl restart dora-daemon-<id>).
Poll the coordinator until the daemon reconnects (30s timeout).

Because machines are upgraded one at a time, nodes on other machines continue running. After the daemon reconnects, auto-recovery re-spawns any dataflow nodes that were running on that machine.
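
The sequential loop can be modeled as follows. This is a hedged Python sketch: the three callables stand in for the SCP copy, the systemctl restart, and the coordinator query, and all names are hypothetical. What the real command does when a daemon fails to reconnect within 30 seconds is not stated here; this sketch simply stops.

```python
import time


def rolling_upgrade(machines, copy_binary, restart_daemon, is_connected,
                    timeout_secs=30.0, poll_secs=1.0, sleep=time.sleep):
    """Upgrade machines strictly one after another; return those that reconnected."""
    upgraded = []
    for machine in machines:
        copy_binary(machine)      # stand-in for the SCP step
        restart_daemon(machine)   # stand-in for the systemctl restart step
        waited = 0.0
        while not is_connected(machine):
            if waited >= timeout_secs:
                # Assumption: abort so the remaining machines keep the old binary.
                raise TimeoutError(f"{machine} did not reconnect within {timeout_secs}s")
            sleep(poll_secs)
            waited += poll_secs
        upgraded.append(machine)
    return upgraded
```

Because the next machine is only touched after the previous daemon has reconnected, at most one machine's nodes are down at any time.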

Prerequisites

Daemons must be installed as systemd services (dora cluster install).
The local dora binary must be compatible with the cluster’s coordinator version.
SSH access with sudo permissions on all target machines.

Use Cases

1. Edge AI Pipeline (Robot + GPU Server)

A camera node runs on the robot, sends frames to a GPU server for inference, and results flow back to an actuator on the robot.

cluster.yml:

coordinator:
  addr: 192.168.1.1

machines:
  - id: robot
    host: 192.168.1.10
    user: ubuntu
    labels:
      role: edge
  - id: gpu-server
    host: 192.168.1.20
    user: ml
    labels:
      gpu: "true"

dataflow.yml:

nodes:
  - id: camera
    _unstable_deploy:
      machine: robot
    path: ./camera-driver
    outputs:
      - frames

  - id: inference
    _unstable_deploy:
      labels:
        gpu: "true"
    path: ./ml-model
    inputs:
      frames: camera/frames
    outputs:
      - predictions

  - id: actuator
    _unstable_deploy:
      machine: robot
    path: ./actuator-driver
    inputs:
      commands: inference/predictions

2. Multi-Robot Fleet

A central coordinator manages N robots with heterogeneous hardware. Label scheduling routes nodes to the right machines without hardcoding machine IDs.

cluster.yml:

coordinator:
  addr: 10.0.0.1

machines:
  - id: bot-01
    host: 10.0.0.11
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"

  - id: bot-02
    host: 10.0.0.12
    user: robot
    labels:
      fleet: warehouse
      camera: rgbd

  - id: bot-03
    host: 10.0.0.13
    user: robot
    labels:
      fleet: warehouse
      lidar: "true"
      camera: rgbd

dataflow.yml:

nodes:
  - id: lidar-driver
    _unstable_deploy:
      labels:
        lidar: "true"
    path: ./lidar-driver
    outputs:
      - scans

  - id: camera-driver
    _unstable_deploy:
      labels:
        camera: rgbd
    path: ./camera-driver
    outputs:
      - frames

With this configuration, lidar-driver runs on bot-01 or bot-03, and camera-driver runs on bot-02 or bot-03.

3. CI/CD Pipeline for Robotics

Automate cluster management in CI:

# Setup
dora cluster install cluster.yml

# Deploy new version
dora cluster upgrade cluster.yml

# Run integration tests
dora start test-dataflow.yml --name integration-test --attach

# Monitor
dora cluster status
dora top

# Cleanup
dora stop integration-test

4. Development to Production

Stage	Approach	Command
Local dev	Single-process, no coordinator	`dora run dataflow.yml`
Staging	Ad-hoc daemons, manual setup	`dora up` + `dora daemon` on each machine
Production	Managed cluster, systemd services	`dora cluster install cluster.yml`

Operations Runbook

Initial Setup Checklist

SSH keys: Distribute SSH keys so the CLI machine can reach all cluster machines without a password (BatchMode=yes).
Dora binary: Install the dora binary on all machines (same version).
Network: Ensure coordinator port (default 6013) is reachable from all machines. Ensure Zenoh ports are open between daemons for cross-machine node communication.
cluster.yml: Create the cluster configuration with correct IPs, users, and labels.

Day-to-Day Operations

# Start a dataflow
dora start dataflow.yml --name my-app --attach

# List running dataflows
dora list

# Monitor resource usage
dora top

# View node logs
dora logs my-app <node-id> --follow

# Stop a dataflow
dora stop my-app

# Check cluster health
dora cluster status

Upgrading

Build or download the new dora binary locally.
Run dora cluster upgrade cluster.yml.
Verify with dora cluster status that all daemons reconnected.
Running dataflows are automatically re-spawned via auto-recovery.

Troubleshooting

Daemon not connecting

Verify the coordinator is running and reachable: curl http://<addr>:6013/api/health (or check coordinator logs).
Check daemon logs: journalctl -u dora-daemon-<id> -f (systemd) or the daemon’s stderr output (ad-hoc).
Confirm the --coordinator-addr and --coordinator-port match the coordinator’s actual bind address.

SSH failures during cluster commands

Ensure ssh -o BatchMode=yes <user>@<host> echo ok works from the CLI machine.
Check that StrictHostKeyChecking=accept-new is acceptable for your environment (first connection auto-accepts the host key).
Verify the user field in cluster.yml matches a valid SSH user on the target.

Label mismatch errors

Error: no daemon matches labels {"gpu": "true"}.
Check that the daemon was started with the correct --labels flag.
Run dora cluster status to see connected daemons. Labels are set at daemon startup from cluster.yml and cannot be changed at runtime.

Auto-recovery not triggering

Auto-recovery only applies to coordinator-managed dataflows (dora start), not dora run.
Check coordinator logs for auto-recovery: re-spawning messages.
If the node crashes immediately, recovery is throttled to once every 30 seconds per daemon per dataflow.

Deployment YAML Reference

The _unstable_deploy section on each node controls placement and distribution. All fields are optional.

nodes:
  - id: my-node
    _unstable_deploy:
      machine: edge-01                # Target machine ID from cluster.yml
      labels:                          # Label requirements (superset match)
        gpu: "true"
        arch: arm64
      distribute: local                # local | scp | http
      working_dir: /opt/my-app         # Working directory on the target machine
    path: ./my-node

Fields

Field	Type	Default	Description
`machine`	string	none	Target machine ID. Takes priority over labels.
`labels`	map	empty	Required daemon labels. All key-value pairs must match.
`distribute`	string	`local`	Binary distribution strategy: `local`, `scp`, or `http`.
`working_dir`	path	none	Working directory on the target machine.

Resolution priority

machine – if set, the node is assigned to the daemon with that machine ID.
labels – if set (and machine is not), the node is assigned to the first daemon whose labels are a superset of the required labels.
Fallback – if neither is set, the node is assigned to the first unnamed (no machine-id) daemon.
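
The resolution order can be expressed as a short function. This is a sketch under stated assumptions: daemons are modeled as dicts with a `machine_id` (or `None`) and a `labels` map, listed in the order the coordinator would consider them. The function name and data shapes are hypothetical.

```python
def resolve_daemon(deploy, daemons):
    """Pick a daemon for a node: machine first, then labels, then the unnamed fallback."""
    machine = deploy.get("machine")
    if machine is not None:
        for daemon in daemons:
            if daemon["machine_id"] == machine:
                return daemon
        raise LookupError(f"no daemon with machine id {machine!r}")

    required = deploy.get("labels") or {}
    if required:
        for daemon in daemons:
            # Superset match: every required key/value pair must be present.
            if required.items() <= daemon["labels"].items():
                return daemon
        raise LookupError(f"no daemon matches labels {required}")

    for daemon in daemons:
        if daemon["machine_id"] is None:
            return daemon
    raise LookupError("no unnamed daemon available")


fleet = [
    {"machine_id": "bot-01", "labels": {"fleet": "warehouse", "lidar": "true"}},
    {"machine_id": "bot-02", "labels": {"fleet": "warehouse", "camera": "rgbd"}},
    {"machine_id": "bot-03", "labels": {"fleet": "warehouse", "lidar": "true", "camera": "rgbd"}},
]
print(resolve_daemon({"labels": {"camera": "rgbd"}}, fleet)["machine_id"])  # bot-02
```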

Best Practices

Use labels over machine IDs for flexibility. Labels decouple your dataflow from specific machines, making it easier to add, remove, or replace hardware.
Use systemd install for production. Daemon services survive reboots and auto-restart on failure with Restart=on-failure.
Use coordinator persistence (dora coordinator --store redb) with clusters so the coordinator survives restarts. See Coordinator State Persistence.
Set restart policies on nodes for per-node resilience. Combine with auto-recovery for defense in depth. See Restart Policies.
Monitor with multiple tools: dora cluster status for daemon health, dora top for resource usage, dora logs for node output.
Test locally first. Develop with dora run dataflow.yml, then deploy to a cluster. The same dataflow YAML works in both modes – _unstable_deploy fields are ignored in local mode.
Use rolling upgrades instead of stopping the entire cluster. dora cluster upgrade processes one machine at a time to maintain availability.
Keep cluster.yml in version control alongside your dataflow definitions.

Performance

Dora achieves 10-17x lower latency than ROS2 Python through zero-copy shared memory IPC, Apache Arrow columnar format, and 100% Rust internals. This document covers methodology, reproduction, and tuning.

Architecture Advantages

Layer	Dora	ROS2 (rclpy)
Runtime	Rust async (tokio)	Python + C++ middleware
IPC (>4KB)	Zenoh SHM zero-copy	DDS serialization + copy
IPC (<4KB)	TCP with bincode	DDS serialization + copy
Data format	Apache Arrow (zero-serde)	CDR serialization
Threading	Lock-free channels (flume)	GIL-bound callbacks
Fan-out	Arc-wrapped (O(1) per receiver)	Per-receiver copy

Benchmark Suite

Internal benchmarks (`examples/benchmark/`)

Measures Dora’s own latency and throughput across 10 payload sizes (0B to 4MB).

cd examples/benchmark
./compare.sh          # Rust vs Python sender comparison

Metrics reported: avg, p50, p95, p99, p99.9, min, max latency; msg/s throughput.
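
For readers post-processing raw samples, the reported latency statistics can be computed as below. This is a sketch, not the benchmark's own code: it assumes the nearest-rank percentile definition, which may differ from the method the benchmark uses, and the function name is hypothetical.

```python
def summarize(latencies_ns):
    """Compute avg, p50, p95, p99, p99.9, min, and max from latency samples."""
    data = sorted(latencies_ns)
    n = len(data)
    if n == 0:
        raise ValueError("no samples")

    def pct(permille):
        # Nearest-rank percentile using integer math: rank = ceil(permille * n / 1000).
        rank = max(1, (permille * n + 999) // 1000)
        return data[rank - 1]

    return {
        "avg": sum(data) / n,
        "p50": pct(500),
        "p95": pct(950),
        "p99": pct(990),
        "p99.9": pct(999),
        "min": data[0],
        "max": data[-1],
    }


stats = summarize(range(1, 101))  # synthetic samples 1..100 ns
print(stats["p50"], stats["p99"], stats["max"])  # 50 99 100
```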

ROS2 comparison (`examples/ros2-comparison/`)

Apples-to-apples comparison using identical Python workloads on both frameworks.

cd examples/ros2-comparison
./run_comparison.sh   # Requires ROS2 Humble+

Both sides embed a time.perf_counter_ns() timestamp in the first 8 bytes of the payload. Using the same message count, payload sizes, and sleep intervals keeps the results comparable.
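
A minimal sketch of the timestamp-in-payload technique follows. The byte order and the helper names are assumptions, not taken from the benchmark scripts. Comparing timestamps across two processes also assumes both read the same system-wide monotonic clock, which holds only on one machine.

```python
import struct
import time

HEADER = struct.Struct("<Q")  # assumed layout: little-endian unsigned 64-bit


def make_payload(size):
    """Build a payload of `size` bytes whose first 8 bytes hold the send timestamp."""
    size = max(size, HEADER.size)  # the header needs at least 8 bytes
    return HEADER.pack(time.perf_counter_ns()) + bytes(size - HEADER.size)


def latency_ns(payload):
    """Receiver side: read the embedded timestamp and return the elapsed time."""
    (sent_ns,) = HEADER.unpack_from(payload)
    return time.perf_counter_ns() - sent_ns


payload = make_payload(4096)
print(len(payload), latency_ns(payload) >= 0)
```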

Criterion micro-benchmarks

Isolated benchmarks for internal hot paths:

# Daemon message routing (fan-out x payload size matrix)
cargo bench -p dora-daemon

# Message serialization/deserialization
cargo bench -p dora-message

CI tracks these via benchmark-action/github-action-benchmark with 120% alert threshold.

Reproducing Results

Requirements

Linux or macOS (shared memory IPC)
Rust 1.85+ with release profile
Python 3.10+ with numpy, pyarrow
ROS2 Humble+ (for comparison only)

Steps

Build Dora:

cargo install --path binaries/cli --locked

Run internal benchmark:

cd examples/benchmark
BENCH_CSV=results/rust.csv dora run dataflow.yml

Run ROS2 comparison:

cd examples/ros2-comparison
./run_comparison.sh

Environment Notes

Close background applications to reduce variance
Use taskset or cpuset to pin processes for consistent results
Run at least 3 iterations and report median
Shared memory benefits appear at payloads >4KB

Performance Tuning

Queue sizes

Default queue size is 10. For inputs fed by high-throughput outputs, increase it:

inputs:
  data:
    source: producer/output
    queue_size: 1000
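
To see why the queue size matters, here is a toy Python model of a bounded input queue. It assumes the drop_oldest policy named as the default in the project overview; it is not Dora's implementation, and `make_input_queue` is a hypothetical helper.

```python
from collections import deque


def make_input_queue(queue_size=10):
    """A bounded queue that silently discards the oldest entry when full."""
    return deque(maxlen=queue_size)


queue = make_input_queue()  # default size of 10
for seq in range(15):       # producer sends 15 messages before the consumer reads
    queue.append(seq)
print(list(queue))  # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```

With a slow consumer and the default size, the five oldest messages are lost. A larger queue_size trades memory and staleness for fewer drops.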

Payload size

Dora automatically uses shared memory for messages >4KB, avoiding copies. Structure data to exceed this threshold when low latency matters.

Arrow format

Use Arrow arrays directly instead of converting to/from Python lists:

import pyarrow as pa

# Fast: pass Arrow array directly
node.send_output("out", pa.array(data, type=pa.uint8()))

# Slow: convert through Python list
node.send_output("out", pa.array(list(data), type=pa.uint8()))

Operator vs Node

Operators run in-process with the runtime (zero IPC overhead) but share the GIL in Python. Use Rust operators for compute-heavy work, Python operators for glue logic.

Distributed deployment

For cross-machine communication, Dora uses Zenoh pub-sub. Latency depends on network quality. Use local deployment (single-machine) when sub-millisecond latency is required.

CSV Output Format

All benchmarks support BENCH_CSV environment variable for machine-readable output:

latency,<bytes>,<label>,<n>,<avg_ns>,<p50_ns>,<p95_ns>,<p99_ns>,<p999_ns>,<min_ns>,<max_ns>
throughput,<bytes>,<label>,<n>,<msg_per_sec>,<elapsed_ns>,0,0,0,0,0
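
The two row layouts can be read back with the standard library. This is a sketch: the sample values are made up for illustration, the function name is hypothetical, and rows with an unrecognized first column (for example a header line, if one is emitted) are skipped.

```python
import csv
import io

LATENCY_FIELDS = ["avg_ns", "p50_ns", "p95_ns", "p99_ns", "p999_ns", "min_ns", "max_ns"]


def parse_bench_csv(text):
    """Parse BENCH_CSV output into a list of dicts, one per row."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if len(rec) < 6 or rec[0] not in ("latency", "throughput"):
            continue
        kind, size, label, n, *values = rec
        row = {"kind": kind, "bytes": int(size), "label": label, "n": int(n)}
        if kind == "latency":
            row.update(zip(LATENCY_FIELDS, map(float, values)))
        else:
            # Throughput rows carry two values followed by zero padding.
            row["msg_per_sec"] = float(values[0])
            row["elapsed_ns"] = float(values[1])
        rows.append(row)
    return rows


# Hypothetical sample data, not real benchmark results.
sample = (
    "latency,4096,rust,1000,1200,1100,1800,2500,4000,900,5200\n"
    "throughput,4096,rust,1000,250000,4000000,0,0,0,0,0\n"
)
for row in parse_bench_csv(sample):
    print(row["kind"], row["bytes"], row["label"])
```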

Real-Time Tuning

Dora provides optional real-time features for latency-sensitive robotics deployments.

Quick Start

# Start daemon with real-time profile (mlockall + SCHED_FIFO)
sudo dora daemon --rt

# Control worker threads
dora daemon --worker-threads 4

# Pin to specific CPU cores
taskset -c 2,3 dora daemon --rt

What `--rt` Does

mlockall(MCL_CURRENT | MCL_FUTURE) — pins all memory, prevents page faults
SCHED_FIFO priority 50 (Linux only) — real-time scheduling for the main thread
Requires CAP_SYS_NICE + CAP_IPC_LOCK capabilities

Per-Node CPU Affinity

Instead of pinning the entire daemon with taskset, you can pin individual nodes to specific CPU cores via the dataflow YAML:

- id: controller
  path: ./controller
  cpu_affinity: [2, 3]

- id: sensor
  path: ./sensor
  cpu_affinity: [4, 5]

The daemon calls sched_setaffinity on the spawned process. This is Linux only; on other platforms the field is silently ignored.
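
The same system call is reachable from Python's standard library, which makes the behavior easy to try by hand. This is only an illustration of sched_setaffinity, not the daemon's code. Intersecting the request with the currently allowed cores is a choice made for this sketch, and `pin_process` is a hypothetical helper.

```python
import os


def pin_process(pid, cores):
    """Pin `pid` to the requested cores; returns the resulting core set, or None off Linux."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # non-Linux: nothing happens, mirroring the behavior described above
    allowed = os.sched_getaffinity(pid)
    target = set(cores) & allowed
    if not target:
        return allowed  # none of the requested cores are usable; leave affinity unchanged
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


# pid 0 means "the calling process"; requesting every core is a no-op.
print(pin_process(0, range(1024)))
```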

Combine with --rt for best results: the daemon gets real-time scheduling while each node is pinned to dedicated cores, avoiding contention.

Full Guide

See the comprehensive Real-Time Tuning Guide for:

Linux kernel tuning (CPU governor, PREEMPT_RT, boot parameters)
Process-level tuning (CPU affinity, memory locking, thread priority)
Systemd service configuration
Docker and Kubernetes deployment
Zenoh transport tuning
Benchmarking tips

Dynamic Topology

Add and remove nodes from running dataflows without restarting.

CLI Commands

# Add a node from a YAML definition
dora node add --from-yaml new-node.yml --dataflow my-app

# Remove a node (stops process + cleans up mappings)
dora node remove my-app filter-node

# Connect two nodes (add a live mapping)
dora node connect --dataflow my-app sender/value filter/input

# Disconnect two nodes (remove a mapping)
dora node disconnect --dataflow my-app sender/value filter/input

Node YAML Definition

Dynamic nodes are defined in standalone YAML files with the same format as a single entry in the nodes: list:

# filter-node.yml
id: filter
path: filter.py
outputs:
  - output

After adding, wire inputs explicitly with dora node connect.

Examples

dynamic-add-remove — basic add/remove/connect pipeline
dynamic-agent-tools — AI agent with dynamically-added tools

Current Limitations

Daemon-side node spawning for AddNode is pending (coordinator dispatch works, daemon logs a warning)
Cross-daemon dynamic topology not yet supported
Dynamic nodes are not persisted across dataflow restart

See the Dynamic Topology Plan for the full design.

ROS2 Bridge

Dora provides a declarative YAML-based ROS2 bridge that lets any Dora node communicate with ROS2 topics, services, and actions without importing ROS2 libraries. You define the bridge in your dataflow YAML using the ros2: key, and the framework automatically spawns a bridge binary that converts between Apache Arrow (Dora’s native format) and ROS2 CDR/DDS. Your user nodes stay ROS2-free – they send and receive pure Arrow StructArray data.

Feature Overview

Feature	Configuration	Description
Topic subscribe	`topic` + `direction: subscribe`	Receive from ROS2, forward as Arrow
Topic publish	`topic` + `direction: publish`	Receive Arrow, publish to ROS2
Multi-topic	`topics`	Multiple topics on a single ROS2 node
Service client	`service` + `role: client`	Send requests, receive responses
Service server	`service` + `role: server`	Receive requests, send responses
Action client	`action` + `role: client`	Send goals, receive feedback + result
Action server	`action` + `role: server`	Receive goals, send feedback + result
QoS policies	`qos`	Reliability, durability, history, liveliness
Auto-spawn	Automatic	Bridge binary spawned by daemon as a Custom node

Architecture

When the Dora descriptor resolver encounters a ros2: key on a node, it converts it into a Custom node pointing to the dora-ros2-bridge-node binary. The bridge config is serialized as JSON into the DORA_ROS2_BRIDGE_CONFIG environment variable.

User Node <--(Arrow/SharedMem)--> Bridge Binary <--(CDR/DDS)--> ROS2

The bridge binary:

Reads AMENT_PREFIX_PATH to locate installed ROS2 message packages
Parses message/service/action definitions at startup
Creates a ros2_client node and the appropriate publishers, subscribers, clients, or servers
Converts incoming ROS2 CDR messages to Arrow StructArray (subscribe/response/feedback)
Converts incoming Arrow StructArray to ROS2 CDR messages (publish/request/goal)

Your user nodes never link against ROS2 – all ROS2 communication is isolated in the bridge binary.

Prerequisites

ROS2 environment sourced: AMENT_PREFIX_PATH must be set and point to a workspace containing the required message packages
Message packages installed: e.g., turtlesim, geometry_msgs, example_interfaces
For service client: A ROS2 service server must be running (or use a companion server dataflow)
For action client: A ROS2 action server must be running before starting the dataflow (no wait_for_action_server mechanism)
For action server: A ROS2 action client sends goals to the bridge (e.g., ros2 action send_goal)

Topic Bridge

Single Topic (Subscribe)

Subscribe to a ROS2 topic and forward messages as Arrow data to downstream Dora nodes.

nodes:
  - id: pose_bridge
    ros2:
      topic: /turtle1/pose
      message_type: turtlesim/Pose
      direction: subscribe       # default, can be omitted
    outputs:
      - pose

The bridge creates a ROS2 subscription on /turtle1/pose, deserializes each incoming turtlesim/Pose message into an Arrow StructArray, and sends it on the pose output.

Single Topic (Publish)

Receive Arrow data from Dora nodes and publish to a ROS2 topic.

nodes:
  - id: cmd_bridge
    ros2:
      topic: /turtle1/cmd_vel
      message_type: geometry_msgs/Twist
      direction: publish
    inputs:
      cmd_vel: planner/cmd_vel

The bridge receives Arrow data on the cmd_vel input, serializes it to geometry_msgs/Twist CDR, and publishes to /turtle1/cmd_vel.

Multi-Topic

Bridge multiple topics on a single ROS2 node context, mixing subscribe and publish directions.

nodes:
  - id: turtle_bridge
    ros2:
      topics:
        - topic: /turtle1/pose
          message_type: turtlesim/Pose
          direction: subscribe
          output: pose
        - topic: /turtle1/cmd_vel
          message_type: geometry_msgs/Twist
          direction: publish
          input: velocity
      qos:
        reliable: true
        keep_last: 10
    inputs:
      velocity: planner/cmd_vel
    outputs:
      - pose

Multi-topic mode supports up to 64 topics per bridge node.

Input/Output ID Mapping

By default, topic names are converted to Dora IDs by stripping the leading / and replacing remaining / with _:

ROS2 Topic	Default Dora ID
`/turtle1/pose`	`turtle1_pose`
`/camera/image_raw`	`camera_image_raw`

In multi-topic mode, you can override this with explicit output (for subscribe) or input (for publish) fields. In single-topic mode, the node’s declared outputs or inputs are used directly.
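
The default mapping rule is simple enough to state as code. This is a Python sketch of the rule as described above; `default_dora_id` is a hypothetical name, not a function exported by the bridge.

```python
def default_dora_id(topic):
    """Strip the leading '/' and replace the remaining '/' with '_'."""
    if topic.startswith("/"):
        topic = topic[1:]
    return topic.replace("/", "_")


print(default_dora_id("/turtle1/pose"))      # turtle1_pose
print(default_dora_id("/camera/image_raw"))  # camera_image_raw
```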

Service Bridge

Service Client

Send requests from Dora to an external ROS2 service and receive responses.

nodes:
  - id: add_client
    ros2:
      service: /add_two_ints
      service_type: example_interfaces/AddTwoInts
      role: client
    inputs:
      request: requester/data
    outputs:
      - response

The bridge waits for the service to become available (up to 10 retries, 2 seconds each), then for each Arrow input it receives:

Serializes the Arrow data as an AddTwoInts_Request CDR message
Sends the request to the ROS2 service
Waits for a response (30-second timeout)
Deserializes the response into Arrow and sends it on the response output

Service Server

Expose a Dora handler node as a ROS2 service that external ROS2 clients can call.

nodes:
  - id: add_server
    ros2:
      service: /dora_add_two_ints
      service_type: example_interfaces/AddTwoInts
      role: server
    inputs:
      response: handler/result
    outputs:
      - request

  - id: handler
    path: path/to/handler-node
    inputs:
      request: add_server/request
    outputs:
      - result

The bridge receives ROS2 service requests, assigns each a unique request_id (UUID v7), forwards the request data as Arrow on the request output with request_id in metadata, and waits for the handler node to send a response back on the response input with the same request_id. The response is then returned to the correct ROS2 client.

See examples/ros2-bridge/yaml-bridge-service/ for a working example.

Request ID Correlation

Each incoming ROS2 request is assigned a request_id metadata parameter. The handler node must include the same request_id in metadata when sending the response. The simplest approach is to pass through metadata.parameters:

Event::Input { id, metadata, data } => {
    // metadata.parameters contains request_id
    let result = compute(data);
    node.send_service_response("response".into(), metadata.parameters, result)?;
}

Responses can arrive in any order – the bridge correlates them by request_id, not by arrival order. Stale pending requests are evicted after 30 seconds. The maximum pending request queue is 64 – additional requests are dropped when full.
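
The correlation, eviction, and capacity rules can be modeled as a small table keyed by request_id. This Python sketch illustrates the described behavior only: the class, its methods, and the opaque `reply_handle` are hypothetical, and the bridge's actual data structure is not shown in this document.

```python
import time
from collections import OrderedDict

MAX_PENDING = 64         # pending request limit stated above
STALE_AFTER_SECS = 30.0  # eviction age stated above


class PendingRequests:
    """Maps request_id to a reply handle so responses can arrive in any order."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = OrderedDict()  # request_id -> (created_at, reply_handle)

    def add(self, request_id, reply_handle):
        """Register an incoming request; returns False if it is dropped because the table is full."""
        self._evict_stale()
        if len(self._pending) >= MAX_PENDING:
            return False
        self._pending[request_id] = (self._clock(), reply_handle)
        return True

    def take(self, request_id):
        """Look up the handle for a handler response; None if unknown or already evicted."""
        self._evict_stale()
        entry = self._pending.pop(request_id, None)
        return None if entry is None else entry[1]

    def _evict_stale(self):
        # Entries are in insertion order, so the oldest one is always first.
        now = self._clock()
        while self._pending:
            request_id, (created_at, _) = next(iter(self._pending.items()))
            if now - created_at < STALE_AFTER_SECS:
                break
            del self._pending[request_id]
```

A handler that forgets to copy request_id into its response metadata would hit the `None` branch of `take`, and the ROS2 client would eventually time out.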

Service Wait and Timeouts

Behavior	Value
Service client: wait for availability	10 retries, 2s each (20s total)
Service client: response timeout	30 seconds
Service server: pending request limit	64

Action Bridge

Action Client

Send goals from Dora to an external ROS2 action server, receiving feedback and results.

nodes:
  - id: fib_client
    ros2:
      action: /fibonacci
      action_type: example_interfaces/Fibonacci
      role: client
    inputs:
      goal: goal_sender/goal
    outputs:
      - feedback
      - result

For each Arrow goal input:

Serializes the Arrow data as a Fibonacci_Goal CDR message
Sends the goal to the action server (30-second timeout)
If accepted, spawns background threads for feedback and result
Feedback messages arrive on the feedback output as they stream in
The final result arrives on the result output (5-minute timeout)

Feedback and Result Streams

The action bridge sends feedback and results on separate outputs:

feedback: Streamed as each feedback message arrives from the action server. Contains the action’s feedback message as Arrow (e.g., {partial_sequence: int32[]} for Fibonacci)
result: Sent once when the action completes. Contains the action’s result message as Arrow (e.g., {sequence: int32[]} for Fibonacci)

Concurrent Goals

The bridge supports up to 8 concurrent in-flight goals (MAX_CONCURRENT_GOALS). Additional goals are dropped with a warning. Each goal spawns dedicated feedback and result reader threads.

Timeouts

Behavior	Value
Goal send timeout	30 seconds
Result retrieval timeout	5 minutes
Feedback	No timeout (streams until action completes)

Action Server

Expose a Dora handler node as a ROS2 action server that external ROS2 clients can call.

nodes:
  - id: fib_server
    ros2:
      action: /fibonacci
      action_type: example_interfaces/Fibonacci
      role: server
    inputs:
      feedback: handler/feedback
      result: handler/result
    outputs:
      - goal

  - id: handler
    path: path/to/handler-node
    inputs:
      goal: fib_server/goal
    outputs:
      - feedback
      - result

The bridge receives goals from ROS2 clients, auto-accepts them, and forwards the goal data on the goal output. The handler computes feedback and results and sends them back on the feedback and result inputs.

See examples/ros2-bridge/yaml-bridge-action-server/ for a working Fibonacci example.

Goal ID Metadata

Each goal is identified by a UUID string passed as a goal_id metadata parameter. The bridge sets goal_id on every goal output. The handler must include the same goal_id in metadata when sending feedback and result so the bridge can correlate them to the correct goal.

The simplest approach is to pass through metadata.parameters from the goal event:

Event::Input { id, metadata, data } => match id.as_str() {
    "goal" => {
        let params = metadata.parameters; // contains goal_id
        // ... compute ...
        node.send_output("feedback".into(), params.clone(), feedback)?;
        node.send_output("result".into(), params, result)?;
    }
    // ...
}

Action Server Lifecycle

ROS2 client sends a goal request
Bridge auto-accepts the goal and starts executing
Bridge sends goal data on goal output with goal_id in metadata
Handler sends feedback (zero or more times) with same goal_id
Handler sends result (once) with same goal_id; bridge returns it to the ROS2 client
Result send times out after 5 minutes if the client never requests it

Goals that contain no data or cannot be forwarded to the handler are automatically aborted – the bridge sends Aborted status back to the ROS2 client so it does not hang indefinitely.

Goal Status

By default, results are returned with Succeeded status. The handler can override this by setting a goal_status metadata parameter on the result output:

`goal_status` value	ROS2 Status	Use Case
`"succeeded"` (or omitted)	`Succeeded`	Goal completed successfully
`"aborted"`	`Aborted`	Goal failed during execution
`"canceled"`	`Canceled`	Goal was canceled by the handler

Unrecognized goal_status values default to Aborted with a warning logged. Omitting goal_status entirely defaults to Succeeded.
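
The defaulting rules in the table can be summarized as a lookup with two fallbacks. This is a Python sketch of the described behavior, assuming an exact, case-sensitive match on the string values; the function name is hypothetical.

```python
import logging

STATUS_BY_NAME = {
    "succeeded": "Succeeded",
    "aborted": "Aborted",
    "canceled": "Canceled",
}


def resolve_goal_status(metadata):
    """Map the optional goal_status metadata parameter to a ROS2 status name."""
    value = metadata.get("goal_status")
    if value is None:
        return "Succeeded"  # omitted: default to success
    status = STATUS_BY_NAME.get(value)
    if status is None:
        logging.warning("unrecognized goal_status %r, treating as Aborted", value)
        return "Aborted"
    return status


print(resolve_goal_status({"goal_id": "abc"}))                             # Succeeded
print(resolve_goal_status({"goal_id": "abc", "goal_status": "canceled"}))  # Canceled
```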

Rust example:

use dora_node_api::{GOAL_STATUS, GOAL_STATUS_ABORTED, Parameter};

let mut params = metadata.parameters; // contains goal_id
params.insert(GOAL_STATUS.to_string(), Parameter::String(GOAL_STATUS_ABORTED.to_string()));
node.send_output("result".into(), params, error_result)?;

Action Server Limits

Behavior	Value
Max concurrent goals	8 (additional goals receive `Aborted` status)
Auto-accept	All goals are auto-accepted
Result send timeout	5 minutes

Python Action Server Handler

Python nodes receive goal data as PyArrow arrays with goal_id in the metadata dictionary. Pass it through on feedback/result outputs:

for event in node:
    if event["type"] == "INPUT" and event["id"] == "goal":
        goal_id = event["metadata"]["goal_id"]
        order = event["value"]["order"][0].as_py()

        # Send feedback
        node.send_output("feedback", feedback_array, {"goal_id": goal_id})

        # Send result (with optional status)
        node.send_output("result", result_array, {
            "goal_id": goal_id,
            "goal_status": "succeeded",  # or "aborted", "canceled"
        })

C++ Action Server Handler

C++ nodes access goal_id via type-safe metadata accessors:

auto goal_id = metadata->get_str("goal_id");

// Send feedback with goal_id
auto fb_metadata = new_metadata();
fb_metadata->set_string("goal_id", goal_id);
send_arrow_output_with_metadata("feedback", feedback_data, fb_metadata);

// Send result with goal_id
auto res_metadata = new_metadata();
res_metadata->set_string("goal_id", goal_id);
send_arrow_output_with_metadata("result", result_data, res_metadata);

Quality of Service (QoS)

Configuration

Set QoS at the bridge level (applies to all topics/channels) or per-topic in multi-topic mode.

nodes:
  - id: my_bridge
    ros2:
      topic: /sensor/data
      message_type: sensor_msgs/LaserScan
      qos:
        reliable: true
        durability: transient_local
        keep_last: 10
        liveliness: automatic
        lease_duration: 5.0
        max_blocking_time: 0.5

Defaults

Field	Default
`reliable`	`false` (best effort)
`durability`	`volatile`
`liveliness`	`automatic`
`lease_duration`	infinity
`max_blocking_time`	100ms (only applies when `reliable: true`)
`keep_last`	`1`
`keep_all`	`false`

Per-Topic QoS Override

In multi-topic mode, each topic can override the bridge-level QoS:

ros2:
  topics:
    - topic: /fast_sensor
      message_type: sensor_msgs/Imu
      direction: subscribe
      qos:
        reliable: false          # override: best effort for this topic
        keep_last: 1
    - topic: /cmd
      message_type: geometry_msgs/Twist
      direction: publish
      # inherits bridge-level QoS (reliable: true)
  qos:
    reliable: true               # default for all topics
    keep_last: 10

Validation Rules

Field	Valid Values
`reliable`	`true`, `false`
`durability`	`"volatile"`, `"transient_local"`
`liveliness`	`"automatic"`, `"manual_by_participant"`, `"manual_by_topic"`
`keep_last`	`1` to `10000`
`keep_all`	`true`, `false` (intended to be mutually exclusive with `keep_last`)
`lease_duration`	Finite non-negative float (seconds)
`max_blocking_time`	Finite non-negative float (seconds)
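
To check a qos block before starting a dataflow, the rules in the table can be applied in a few lines of Python. This is a standalone sketch, not the bridge's validator: the function name is hypothetical, and rejecting unknown keys is an assumption of this example.

```python
import math

DURABILITY = {"volatile", "transient_local"}
LIVELINESS = {"automatic", "manual_by_participant", "manual_by_topic"}


def _is_bool(value):
    return isinstance(value, bool)


def _is_duration(value):
    # Finite, non-negative number of seconds; bool is excluded explicitly.
    return (isinstance(value, (int, float)) and not isinstance(value, bool)
            and math.isfinite(value) and value >= 0)


def _is_keep_last(value):
    return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 10000


CHECKS = {
    "reliable": _is_bool,
    "keep_all": _is_bool,
    "durability": lambda v: isinstance(v, str) and v in DURABILITY,
    "liveliness": lambda v: isinstance(v, str) and v in LIVELINESS,
    "keep_last": _is_keep_last,
    "lease_duration": _is_duration,
    "max_blocking_time": _is_duration,
}


def validate_qos(qos):
    """Return a list of problems found in a qos mapping; empty means valid."""
    errors = []
    for key, value in qos.items():
        check = CHECKS.get(key)
        if check is None:
            errors.append(f"unknown field: {key}")  # assumption: unknown keys are errors
        elif not check(value):
            errors.append(f"invalid value for {key}: {value!r}")
    return errors


print(validate_qos({"reliable": True, "keep_last": 10}))  # []
print(validate_qos({"keep_last": 0}))
```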

Data Format: Arrow Structs

All data exchanged between your nodes and the bridge uses Arrow StructArray with a single row. Each field in the ROS2 message becomes a column in the struct.

How to Build Arrow Messages

Rust example: building an AddTwoInts_Request ({a: i64, b: i64}):

use std::sync::Arc;
use arrow::array::{Array, Int64Array, StructArray};
use arrow::datatypes::{DataType, Field};

fn make_add_request(a: i64, b: i64) -> StructArray {
    let fields = vec![
        Arc::new(Field::new("a", DataType::Int64, false)),
        Arc::new(Field::new("b", DataType::Int64, false)),
    ];
    let arrays: Vec<Arc<dyn Array>> = vec![
        Arc::new(Int64Array::from(vec![a])),
        Arc::new(Int64Array::from(vec![b])),
    ];
    StructArray::try_new(fields.into(), arrays, None)
        .expect("failed to create struct array")
}

Reading a response ({sum: i64}):

use arrow::array::{Int64Array, StructArray};

fn read_response(data: &dyn arrow::array::Array) -> i64 {
    let struct_array = data
        .as_any()
        .downcast_ref::<StructArray>()
        .expect("expected struct array");
    struct_array
        .column_by_name("sum")
        .expect("missing 'sum' field")
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("expected Int64Array")
        .value(0)
}

Mapping ROS2 Types to Arrow Types

ROS2 Type	Arrow Type	Rust Arrow Array
`bool`	`Boolean`	`BooleanArray`
`int8`	`Int8`	`Int8Array`
`int16`	`Int16`	`Int16Array`
`int32`	`Int32`	`Int32Array`
`int64`	`Int64`	`Int64Array`
`uint8` / `byte` / `char`	`UInt8`	`UInt8Array`
`uint16`	`UInt16`	`UInt16Array`
`uint32`	`UInt32`	`UInt32Array`
`uint64`	`UInt64`	`UInt64Array`
`float32`	`Float32`	`Float32Array`
`float64`	`Float64`	`Float64Array`
`string`	`Utf8`	`StringArray`
`wstring`	`Utf8` (encoded as UTF-16 on CDR side)	`StringArray`
Nested message	`Struct`	`StructArray`

Sequences and Arrays

ROS2 Type	Arrow Type	Rust Arrow Array
Variable-length sequence (`int32[]`)	`List`	`ListArray`
Bounded sequence (`int32[<=10]`)	`List` (length validated)	`ListArray`
Fixed-size array (`int32[3]`)	`FixedSizeList`	`FixedSizeListArray`
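
The two tables combine into one lookup: resolve the element type first, then wrap it according to the array suffix. This Python sketch returns descriptive strings rather than real Arrow types, and `arrow_type_for` is a hypothetical helper. Bounded strings such as `string<=10` are not handled, and any name not in the primitive table is assumed to be a nested message.

```python
import re

PRIMITIVES = {
    "bool": "Boolean",
    "int8": "Int8", "int16": "Int16", "int32": "Int32", "int64": "Int64",
    "uint8": "UInt8", "byte": "UInt8", "char": "UInt8",
    "uint16": "UInt16", "uint32": "UInt32", "uint64": "UInt64",
    "float32": "Float32", "float64": "Float64",
    "string": "Utf8", "wstring": "Utf8",
}

# Matches "base[spec]" where spec is empty, "<=N", or "N".
_ARRAY_SUFFIX = re.compile(r"^(?P<base>[^\[\]]+)\[(?P<spec>[^\[\]]*)\]$")


def arrow_type_for(ros_type):
    """Describe the Arrow type a ROS2 field type maps to, per the tables above."""
    match = _ARRAY_SUFFIX.match(ros_type)
    if match:
        inner = arrow_type_for(match["base"])
        spec = match["spec"]
        if spec == "" or spec.startswith("<="):
            return f"List<{inner}>"  # unbounded or bounded sequence
        return f"FixedSizeList<{inner}, {int(spec)}>"
    return PRIMITIVES.get(ros_type, "Struct")  # unknown names: nested message


print(arrow_type_for("int32[]"))     # List<Int32>
print(arrow_type_for("float64[3]"))  # FixedSizeList<Float64, 3>
```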

Example: reading a ListArray from Fibonacci feedback ({partial_sequence: int32[]}):

use arrow::array::{Int32Array, ListArray, StructArray};

let struct_array = data.as_any().downcast_ref::<StructArray>().unwrap();
let list = struct_array
    .column_by_name("partial_sequence")
    .unwrap()
    .as_any()
    .downcast_ref::<ListArray>()
    .unwrap();
let values = list
    .value(0)
    .as_any()
    .downcast_ref::<Int32Array>()
    .unwrap()
    .values()
    .to_vec();

Complete YAML Reference

nodes:
  - id: my_bridge
    ros2:
      # --- Mode (exactly one required) ---

      # Single topic mode
      topic: /topic_name               # ROS2 topic name
      message_type: package/TypeName    # ROS2 message type
      direction: subscribe             # subscribe (default) | publish

      # Multi-topic mode (mutually exclusive with topic)
      topics:
        - topic: /topic_a
          message_type: package/TypeA
          direction: subscribe
          output: custom_output_id     # override default ID mapping
          qos:                         # per-topic QoS override
            reliable: true
        - topic: /topic_b
          message_type: package/TypeB
          direction: publish
          input: custom_input_id       # override default ID mapping

      # Service mode (mutually exclusive with topic/topics/action)
      service: /service_name           # ROS2 service name
      service_type: package/TypeName   # ROS2 service type
      role: client                     # client | server

      # Action mode (mutually exclusive with topic/topics/service)
      action: /action_name             # ROS2 action name
      action_type: package/TypeName    # ROS2 action type
      role: client                     # client | server

      # --- QoS (optional, applies to all channels) ---
      qos:
        reliable: false                # true | false (default: false = best effort)
        durability: volatile           # volatile (default) | transient_local
        liveliness: automatic          # automatic | manual_by_participant | manual_by_topic
        lease_duration: 5.0            # seconds (default: infinity)
        max_blocking_time: 0.1         # seconds (default: 0.1, reliable only)
        keep_last: 1                   # 1-10000 (default: 1)
        keep_all: false                # true | false (default: false)

      # --- Optional ROS2 node config ---
      namespace: /                     # ROS2 namespace (default: "/")
      node_name: my_ros_node           # ROS2 node name (default: dora node id)

    # --- Standard Dora node fields ---
    inputs:
      input_id: source_node/output_id
    outputs:
      - output_id
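
The "exactly one required" rule for the mode keys can be pictured as a small validation step. The following is only a sketch under stated assumptions: `Ros2Config`, `BridgeMode`, and `bridge_mode` are hypothetical names introduced for illustration, not the bridge's actual types, and only the presence of each key is modelled.

```rust
/// Hypothetical, simplified view of the `ros2:` section.
/// Only the presence of each mode key matters for this check.
#[derive(Default)]
pub struct Ros2Config {
    pub topic: Option<String>,
    pub topics: Option<Vec<String>>,
    pub service: Option<String>,
    pub action: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum BridgeMode {
    SingleTopic,
    MultiTopic,
    Service,
    Action,
}

/// Returns the selected mode, or an error if zero or several modes are set.
pub fn bridge_mode(cfg: &Ros2Config) -> Result<BridgeMode, String> {
    let mut modes = Vec::new();
    if cfg.topic.is_some() {
        modes.push(BridgeMode::SingleTopic);
    }
    if cfg.topics.is_some() {
        modes.push(BridgeMode::MultiTopic);
    }
    if cfg.service.is_some() {
        modes.push(BridgeMode::Service);
    }
    if cfg.action.is_some() {
        modes.push(BridgeMode::Action);
    }
    if modes.len() > 1 {
        return Err(format!("{} modes set, expected exactly one", modes.len()));
    }
    modes
        .pop()
        .ok_or_else(|| "no mode set: need topic, topics, service, or action".to_string())
}
```

A config with only `topic` set yields `BridgeMode::SingleTopic`; setting both `topic` and `service`, or none of the four keys, yields an error.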

Use Case Scenarios

1. Subscribe to a ROS2 Topic

nodes:
  - id: pose_bridge
    ros2:
      topic: /turtle1/pose
      message_type: turtlesim/Pose
    outputs:
      - pose

  - id: my_processor
    path: ./target/debug/my-processor
    inputs:
      pose: pose_bridge/pose

// In my_processor: receive turtlesim/Pose as Arrow
Event::Input { id, data, .. } if id.as_str() == "pose" => {
    let s = data.as_any().downcast_ref::<StructArray>().unwrap();
    let x = s.column_by_name("x").unwrap()
        .as_any().downcast_ref::<Float32Array>().unwrap().value(0);
    let y = s.column_by_name("y").unwrap()
        .as_any().downcast_ref::<Float32Array>().unwrap().value(0);
    println!("Turtle at ({x}, {y})");
}

2. Publish Velocity Commands

nodes:
  - id: planner
    path: ./target/debug/planner
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - cmd_vel

  - id: cmd_bridge
    ros2:
      topic: /turtle1/cmd_vel
      message_type: geometry_msgs/Twist
      direction: publish
    inputs:
      cmd_vel: planner/cmd_vel

use std::sync::Arc;

use arrow::array::{Array, Float64Array, StructArray};
use arrow::datatypes::{DataType, Field};

// In planner: send geometry_msgs/Twist as Arrow
// Twist has nested Vector3 fields: linear {x,y,z} and angular {x,y,z}
fn make_twist(linear_x: f64, angular_z: f64) -> StructArray {
    // Both Vector3 structs share the same three non-nullable f64 fields.
    let vec3_fields = vec![
        Arc::new(Field::new("x", DataType::Float64, false)),
        Arc::new(Field::new("y", DataType::Float64, false)),
        Arc::new(Field::new("z", DataType::Float64, false)),
    ];
    let linear = StructArray::try_new(
        vec3_fields.clone().into(),
        vec![
            Arc::new(Float64Array::from(vec![linear_x])) as _,
            Arc::new(Float64Array::from(vec![0.0])) as _,
            Arc::new(Float64Array::from(vec![0.0])) as _,
        ],
        None,
    ).unwrap();
    let angular = StructArray::try_new(
        vec3_fields.into(),
        vec![
            Arc::new(Float64Array::from(vec![0.0])) as _,
            Arc::new(Float64Array::from(vec![0.0])) as _,
            Arc::new(Float64Array::from(vec![angular_z])) as _,
        ],
        None,
    ).unwrap();

    // Wrap the two Vector3 structs in the outer Twist struct.
    let fields = vec![
        Arc::new(Field::new("linear", linear.data_type().clone(), false)),
        Arc::new(Field::new("angular", angular.data_type().clone(), false)),
    ];
    StructArray::try_new(
        fields.into(),
        vec![Arc::new(linear) as _, Arc::new(angular) as _],
        None,
    ).unwrap()
}

3. Multi-Topic Bidirectional Bridge

Subscribe to pose and publish velocity on a single ROS2 node.

nodes:
  - id: turtle_bridge
    ros2:
      topics:
        - topic: /turtle1/pose
          message_type: turtlesim/Pose
          direction: subscribe
          output: pose
        - topic: /turtle1/cmd_vel
          message_type: geometry_msgs/Twist
          direction: publish
          input: velocity
      qos:
        reliable: true
        keep_last: 10
    inputs:
      velocity: planner/cmd_vel
    outputs:
      - pose

  - id: planner
    path: ./target/debug/planner
    inputs:
      pose: turtle_bridge/pose
      tick: dora/timer/millis/100
    outputs:
      - cmd_vel

4. Service Client: Call an External ROS2 Service

nodes:
  - id: requester
    path: ./target/debug/requester
    inputs:
      tick: dora/timer/millis/1000
      response: add_client/response
    outputs:
      - request

  - id: add_client
    ros2:
      service: /add_two_ints
      service_type: example_interfaces/AddTwoInts
      role: client
    inputs:
      request: requester/request
    outputs:
      - response

Prerequisite: start the ROS2 service server before the dataflow:

ros2 run examples_rclcpp_minimal_service service_main

5. Service Server: Expose a Dora Handler as ROS2 Service

nodes:
  - id: add_server
    ros2:
      service: /add_two_ints
      service_type: example_interfaces/AddTwoInts
      role: server
    inputs:
      response: handler/response
    outputs:
      - request

  - id: handler
    path: ./target/debug/handler
    inputs:
      request: add_server/request
    outputs:
      - response

The handler receives {a: i64, b: i64} as Arrow, computes the result, and sends {sum: i64} back. External ROS2 clients can call this service:

ros2 service call /add_two_ints example_interfaces/srv/AddTwoInts "{a: 3, b: 5}"

6. Action Client: Long-Running Fibonacci Goal

nodes:
  - id: goal_sender
    path: ./target/debug/goal-sender
    inputs:
      tick: dora/timer/millis/5000
      feedback: fib_client/feedback
      result: fib_client/result
    outputs:
      - goal

  - id: fib_client
    ros2:
      action: /fibonacci
      action_type: example_interfaces/Fibonacci
      role: client
    inputs:
      goal: goal_sender/goal
    outputs:
      - feedback
      - result

Prerequisites: start the action server before the dataflow:

ros2 run examples_rclcpp_action_server fibonacci_action_server

The goal_sender node sends {order: int32}, receives streamed {partial_sequence: int32[]} feedback, and then receives a final {sequence: int32[]} result.

7. Action Server: Expose a Dora Handler as ROS2 Action

nodes:
  - id: fib_server
    ros2:
      action: /fibonacci
      action_type: example_interfaces/Fibonacci
      role: server
    inputs:
      feedback: handler/feedback
      result: handler/result
    outputs:
      - goal

  - id: handler
    path: ./target/debug/handler
    inputs:
      goal: fib_server/goal
    outputs:
      - feedback
      - result

The handler receives {order: int32} goals with a goal_id in metadata, sends {partial_sequence: int32[]} feedback, and a final {sequence: int32[]} result – all with the same goal_id in metadata. External ROS2 clients can send goals:

ros2 action send_goal /fibonacci example_interfaces/action/Fibonacci "{order: 10}"

Limitations and Known Constraints

Action server auto-accept: All incoming goals are automatically accepted. The handler cannot reject goals before execution starts.
No action cancel support: Neither client nor server handles ROS2 cancel requests.
No wait_for_action_server: The ros2_client library does not provide this API. Start the action server before the dataflow. The first goal will time out (30s) if the server is unavailable.
Single-flight service client: The service client processes requests sequentially – each request blocks until the response arrives (or times out at 30s).
QoS uniform for service/action channels: The qos config applies to all service/action sub-channels (goal, result, cancel, feedback, status). Per-channel QoS is not configurable.
AMENT_PREFIX_PATH required: The bridge fails at startup if no ROS2 message definitions are found.
Max 64 topics: Multi-topic mode supports at most 64 topics per bridge node.
Max 8 concurrent action goals: Additional goals receive Aborted status when the limit is reached.
Max 64 pending service requests (server): Requests are dropped when the queue is full.

Best Practices

Source your ROS2 environment before running. Ensure AMENT_PREFIX_PATH is set and includes all required message packages. The bridge logs an error if no definitions are found.

Start action servers before the dataflow. There is no wait mechanism for action servers. If the server is not ready, the first goal send will time out after 30 seconds.

Use multi-topic mode for related topics. Bridging /turtle1/pose (subscribe) and /turtle1/cmd_vel (publish) on the same bridge node reduces resource usage compared to two separate bridge nodes.

Match Arrow field names exactly. The bridge validates that Arrow struct field names match the ROS2 message definition. Missing fields fall back to default values (zero for numbers, an empty string for strings). Extra fields cause an error.

Use explicit output/input in multi-topic mode. The default ID mapping (stripping the leading /, then replacing each remaining / with _) can be confusing for deep topic names. Explicit IDs make the dataflow YAML self-documenting.
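
To see why deep topic names become hard to read, the default mapping can be sketched as below. This assumes the rule means "drop the leading slash, then replace every remaining slash with an underscore"; `default_id` is a hypothetical helper written for illustration, not the bridge's actual function.

```rust
/// Assumed reading of the default ID mapping: drop leading slashes,
/// then replace each remaining '/' with '_'.
pub fn default_id(topic: &str) -> String {
    topic.trim_start_matches('/').replace('/', "_")
}
```

Under that assumption `/turtle1/pose` maps to `turtle1_pose`, while a deeper name such as `/robot/arm/joint_states` maps to `robot_arm_joint_states`, where the original topic hierarchy is no longer distinguishable from underscores that were part of a name.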

Set QoS to match the ROS2 publisher/subscriber. QoS mismatches (e.g., reliable subscriber with best-effort publisher) cause silent communication failures. Check with ros2 topic info -v /topic_name to see the existing QoS settings.

Pass through request_id in service responses. The bridge correlates responses to requests using the request_id metadata parameter. If the handler does not include request_id in the response metadata, the bridge cannot match the response to the original ROS2 request.

WebSocket Control Plane

Dora’s control plane uses WebSocket connections for all communication between the CLI, coordinator, and daemons. A single Axum server exposes three routes on one port, replacing the previous multi-port TCP design. JSON text frames carry a UUID-correlated request-reply protocol with fire-and-forget events for log streaming.

At a Glance

Feature	Details
Routes	`/api/control` (CLI), `/api/daemon` (daemons), `/health`
Wire format	JSON text frames + binary frames for topic data
Protocol	UUID-correlated request-reply + fire-and-forget events
Message size limit	1 MiB (`MAX_CONTROL_MESSAGE_BYTES`)
Concurrency limit	256 connections (`MAX_WS_CONNECTIONS`)
Server framework	Axum + Tower middleware
Client library	`tokio-tungstenite` (integration tests, daemon), custom `WsSession` (CLI)
Security	Re-register guard, daemon ID verification, machine ID length limit

Architecture

                        Single Axum server (one port)
                       ┌────────────────────────────┐
                       │  /api/control   (CLI)       │
  CLI ──── WS ────────>│  /api/daemon    (Daemons)   │
                       │  /health        (HTTP GET)  │
  Daemon ── WS ───────>│                             │
                       └──────────┬─────────────────┘
                                  │ mpsc::Sender<Event>
                                  v
                            Coordinator
                          (event loop)

The coordinator binds a single TcpListener and serves an Axum router. Each WebSocket upgrade spawns a handler task that communicates with the coordinator’s main event loop through an mpsc::Sender<Event> channel.

Key source files

File	Role
`binaries/coordinator/src/ws_server.rs`	Router, `serve()`, constants, `ShutdownTrigger`
`binaries/coordinator/src/ws_control.rs`	`/api/control` handler
`binaries/coordinator/src/ws_daemon.rs`	`/api/daemon` handler, security, event translation
`binaries/cli/src/ws_client.rs`	`WsSession` synchronous client wrapper
`libraries/message/src/ws_protocol.rs`	`WsRequest`, `WsResponse`, `WsEvent`, `WsMessage` types

Wire Protocol

All control-plane messages are JSON text frames (topic data travels separately as binary frames). Three message shapes exist:

WsRequest (client -> server)

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "method": "control",
  "params": { "List": null }
}

Field	Type	Description
`id`	UUID	Unique request identifier for reply correlation
`method`	string	`"control"` for CLI requests, `"daemon_event"` / `"daemon_command"` for daemon
`params`	object	Serialized `ControlRequest` or `Timestamped<CoordinatorRequest>`

WsResponse (server -> client)

Success:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "result": { "DataflowList": [] }
}

Error:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "error": "no running dataflow with id ..."
}

Field	Type	Description
`id`	UUID	Matches the originating request `id`
`result`	object?	Present on success (serialized `ControlRequestReply`)
`error`	string?	Present on failure
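
Because `result` and `error` are each optional, a client has to fold the two fields into a single outcome. A minimal sketch of that step follows, using simplified stand-ins: this `WsResponse` keeps `id` as a string and `result` as raw JSON text, and `into_result` is a hypothetical helper rather than part of the Dora API.

```rust
/// Simplified stand-in for the wire type (the real one lives in ws_protocol.rs).
pub struct WsResponse {
    pub id: String,
    /// Raw JSON text of the reply, present on success.
    pub result: Option<String>,
    /// Error message, present on failure.
    pub error: Option<String>,
}

/// Folds the two optional fields into one `Result`.
/// An error always wins; a frame with neither field is treated as malformed.
pub fn into_result(resp: WsResponse) -> Result<String, String> {
    match (resp.result, resp.error) {
        (_, Some(err)) => Err(err),
        (Some(raw), None) => Ok(raw),
        (None, None) => Err(format!(
            "response {} has neither result nor error",
            resp.id
        )),
    }
}
```

Treating a frame with neither field as an error is a design choice of this sketch; the source does not say how such a frame is handled.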

WsEvent (either direction)

{
  "event": "log",
  "payload": { "message": "sensor started", "level": "info" }
}

Used for log streaming after a LogSubscribe/BuildLogSubscribe is acknowledged.

Dispatch

Each handler parses incoming frames with its own strategy to preserve u128 fidelity (see u128 serialization workaround):

CLI (ws_client.rs): Uses a flat IncomingFrame struct with serde_json::value::RawValue for the result/payload fields, avoiding serde_json::Value entirely. Discriminates by presence of event (log push) or id (response).
Coordinator control handler (ws_control.rs): Parses as WsRequest (always a request from CLI).
Coordinator daemon handler (ws_daemon.rs): Checks for "method" key to distinguish requests vs responses. Uses DaemonWsRequestRaw helper for requests.
Daemon (coordinator.rs): Uses CoordinatorCommandRaw / RegisterReplyRaw helper structs to parse directly from raw JSON text.

A WsMessage untagged enum is defined in ws_protocol.rs for generic dispatch but is not used by the production handlers:

#[serde(untagged)]
pub enum WsMessage {
    Request(WsRequest),
    Response(WsResponse),
    Event(WsEvent),
}

CLI Control Plane (`/api/control`)

The CLI connects to /api/control to send ControlRequest commands and receive ControlRequestReply responses.

Connection lifecycle

Connect – HTTP upgrade to WebSocket
Request-reply – CLI sends WsRequest, coordinator processes the ControlRequest, sends WsResponse
Log subscribe (optional) – CLI sends LogSubscribe/BuildLogSubscribe, coordinator acks with WsResponse, then pushes WsEvent{event:"log"} frames
Close – CLI sends Close frame or drops connection

Supported ControlRequest variants

Variant	Description
`List`	List all running dataflows
`Build`	Trigger a dataflow build
`WaitForBuild`	Block until build completes
`Start`	Start a dataflow
`WaitForSpawn`	Block until nodes are spawned
`Stop` / `StopByName`	Stop a running dataflow
`Reload`	Hot-reload a node/operator
`Check`	Check dataflow status
`Destroy`	Tear down all daemons
`Logs`	Retrieve historical logs
`Info`	Get dataflow details
`DaemonConnected`	Check if any daemon is connected
`ConnectedMachines`	List connected daemons
`LogSubscribe`	Subscribe to live dataflow logs
`BuildLogSubscribe`	Subscribe to live build logs
`CliAndDefaultDaemonOnSameMachine`	Check co-location
`GetNodeInfo`	Get node metadata
`TopicSubscribe`	Subscribe to live topic data via binary WS frames (see WebSocket Topic Data Channel)
`TopicUnsubscribe`	Cancel a topic subscription

Log subscription flow

CLI                         Coordinator
 │                              │
 │─── WsRequest{LogSubscribe} ─>│
 │                              │  (check dataflow exists)
 │<── WsResponse{subscribed} ───│
 │                              │
 │<── WsEvent{event:"log"} ────│  (repeated)
 │<── WsEvent{event:"log"} ────│
 │                              │
 │─── Close ───────────────────>│  (log_subscribers dropped)

If the dataflow is not found, the coordinator returns WsResponse with an error and no events are sent.

WsSession (CLI client)

WsSession is a synchronous wrapper that bridges blocking CLI code to the async WebSocket connection. It creates an internal tokio::runtime::Runtime (current-thread) and spawns an async session_loop task.

CLI thread (sync)                       session_loop (async)
     │                                        │
     │── SessionCommand::Request ────────────>│── WsRequest ──> server
     │                                        │<── WsResponse ──
     │<── oneshot reply ─────────────────────│
     │                                        │
     │── SessionCommand::SubscribeLogs ──────>│── WsRequest ──> server
     │                                        │<── WsResponse (ack)
     │<── oneshot ack ───────────────────────│
     │<── std_mpsc log events ───────────────│<── WsEvent ──

The session loop maintains:

pending_requests: HashMap<Uuid, oneshot::Sender> – for request-reply correlation
pending_subscribes: HashMap<Uuid, (ack_tx, log_tx)> – for subscribe ack routing
log_subscribers: Vec<std_mpsc::Sender> – for broadcasting log events
pending_topic_subscribes: HashMap<Uuid, (ack_tx, data_tx)> – for topic subscribe ack routing
topic_subscribers: HashMap<Uuid, std_mpsc::Sender> – for binary frame dispatch by subscription UUID

Binary WS frames (topic data) are dispatched separately from text frames. See WebSocket Topic Data Channel for details.

On disconnect, all pending requests receive an error via their oneshot channels.
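
The request-reply bookkeeping can be illustrated with a small standard-library sketch. It makes two substitutions that are assumptions of this example, not the CLI's code: `u128` stands in for `Uuid`, and `std::sync::mpsc` channels stand in for the oneshot channels. `Pending` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

type Reply = Result<String, String>;

/// Tracks in-flight requests by ID so replies reach the right caller.
#[derive(Default)]
pub struct Pending {
    requests: HashMap<u128, Sender<Reply>>,
}

impl Pending {
    /// Registers a request and returns the receiver its reply will arrive on.
    pub fn register(&mut self, id: u128) -> Receiver<Reply> {
        let (tx, rx) = channel();
        self.requests.insert(id, tx);
        rx
    }

    /// Routes a reply to its waiter. Returns false for an orphan reply
    /// whose ID matches no pending request.
    pub fn complete(&mut self, id: u128, reply: Reply) -> bool {
        match self.requests.remove(&id) {
            Some(tx) => {
                // The waiter may have given up; a failed send is ignored.
                let _ = tx.send(reply);
                true
            }
            None => false,
        }
    }

    /// On disconnect, every remaining waiter receives an error.
    pub fn fail_all(&mut self, reason: &str) {
        for (_, tx) in self.requests.drain() {
            let _ = tx.send(Err(reason.to_string()));
        }
    }
}
```

Removing the entry on completion keeps the map bounded by the number of in-flight requests, and draining it on disconnect means no caller blocks forever on a reply that can no longer arrive.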

Daemon Plane (`/api/daemon`)

Daemons connect to /api/daemon for registration, event reporting, and receiving coordinator commands.

Registration flow

Daemon                       Coordinator
  │                              │
  │── WsRequest{Register} ─────>│
  │                              │  (validate, assign daemon_id)
  │                              │  (track connection + cmd channel)
  │                              │
  │── WsRequest{Event{...}} ───>│  (subsequent events)

Daemon sends a Register request containing DaemonRegisterRequest (version + machine ID)
Coordinator validates version compatibility and machine ID length
Coordinator assigns a DaemonId and stores the DaemonConnection (includes cmd_tx channel for sending commands back to the daemon)
The connection is tracked via tracked_daemon_id for cleanup on disconnect

Event translation

Daemon events are translated into coordinator-internal Event variants:

DaemonEvent	Coordinator Event
`AllNodesReady`	`Event::Dataflow { ReadyOnDaemon }`
`AllNodesFinished`	`Event::Dataflow { DataflowFinishedOnDaemon }`
`Heartbeat`	`Event::DaemonHeartbeat`
`Log(message)`	`Event::Log(message)`
`Exit`	`Event::DaemonExit`
`NodeMetrics`	`Event::NodeMetrics`
`BuildResult`	`Event::DataflowBuildResult`
`SpawnResult`	`Event::DataflowSpawnResult`

Bidirectional communication

The coordinator can send commands back to daemons via the cmd_tx channel stored in DaemonConnection. The daemon handler maintains a pending_replies: HashMap<Uuid, oneshot::Sender> to correlate daemon responses to coordinator-initiated requests.

Message routing on the daemon handler:

Frame has "method" key -> daemon request (registration or event)
Frame lacks "method" key -> daemon response to a coordinator command

u128 serialization workaround

uhlc::ID contains a NonZeroU128 which exceeds serde_json::Value::Number range (i64/u64/f64 only). Using serde_json::to_value() errors with “number out of range”, and serde_json::from_slice::<Value>() silently loses precision by storing as f64.

All production code bypasses serde_json::Value for data containing uhlc::Timestamp:

Component	Serialization	Deserialization
Daemon (`coordinator.rs`)	`to_string` + `format!`	Helper structs (`RegisterReplyRaw`, `CoordinatorCommandRaw`) + `from_str`
Coordinator control (`ws_control.rs`)	`to_string` + `format!` for replies	N/A (CLI requests don’t contain u128)
Coordinator daemon (`ws_daemon.rs`)	N/A	`DaemonWsRequestRaw` + `from_str`
Coordinator state (`state.rs`)	`str::from_utf8` + `format!` (raw bytes embedding)	N/A
CLI (`ws_client.rs`)	N/A (requests don’t contain u128)	`IncomingFrame` with `serde_json::value::RawValue`

Integration tests similarly construct WsRequest JSON strings manually via format!() + serde_json::to_string() (not to_value()) to match the real wire format.
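
The precision problem itself can be reproduced without serde_json. The sketch below uses plain casts to show what happens when a 128-bit value passes through `f64`, and why a text encoding is safe; both function names are hypothetical and exist only for this illustration.

```rust
/// True if `id` is unchanged after passing through f64, which has only
/// 53 bits of mantissa.
pub fn survives_f64_round_trip(id: u128) -> bool {
    (id as f64) as u128 == id
}

/// True if `id` is unchanged after being written as decimal text and
/// parsed back, which is what serializing straight to a string relies on.
pub fn survives_text_round_trip(id: u128) -> bool {
    id.to_string().parse::<u128>().ok() == Some(id)
}
```

A value such as 2^100 + 1 fails the first check, because the low bit is rounded away, but passes the second.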

Security

Re-register guard

Each daemon WebSocket connection allows exactly one Register request. If a connection attempts a second registration, the coordinator logs a warning and closes the connection:

daemon attempted re-register on same connection, rejecting

Daemon ID verification

After registration, every Event message must include a daemon_id matching the one assigned during registration. Mismatched IDs cause connection termination:

daemon sent event with mismatched id: expected `X`, got `Y` -- closing connection

Machine ID length validation

The machine_id field in DaemonRegisterRequest is limited to 256 bytes. Oversized values cause connection termination.
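
Taken together, the three checks behave like a small per-connection state machine. The sketch below is illustrative only: `DaemonMsg`, `DaemonConn`, `MAX_MACHINE_ID_BYTES`, and the `daemon-<machine_id>` ID format are hypothetical, and rejecting an event that arrives before registration is an assumption of this example.

```rust
/// Hypothetical constant name; the 256-byte limit comes from the text above.
pub const MAX_MACHINE_ID_BYTES: usize = 256;

pub enum DaemonMsg {
    Register { machine_id: String },
    Event { daemon_id: String },
}

/// Per-connection state: the ID assigned at registration, if any.
#[derive(Default)]
pub struct DaemonConn {
    assigned_id: Option<String>,
}

impl DaemonConn {
    /// `Ok(())` keeps the connection open; `Err(reason)` means close it.
    pub fn check(&mut self, msg: &DaemonMsg) -> Result<(), String> {
        match msg {
            DaemonMsg::Register { machine_id } => {
                if self.assigned_id.is_some() {
                    return Err("daemon attempted re-register on same connection".into());
                }
                // `String::len` counts bytes, matching the byte limit.
                if machine_id.len() > MAX_MACHINE_ID_BYTES {
                    return Err("machine_id exceeds 256 bytes".into());
                }
                // Hypothetical ID format, for illustration only.
                self.assigned_id = Some(format!("daemon-{machine_id}"));
                Ok(())
            }
            DaemonMsg::Event { daemon_id } => match &self.assigned_id {
                Some(expected) if expected == daemon_id => Ok(()),
                Some(expected) => Err(format!(
                    "mismatched id: expected `{expected}`, got `{daemon_id}`"
                )),
                // Assumption of this sketch: events need a prior registration.
                None => Err("event received before registration".into()),
            },
        }
    }
}
```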

Connection and message limits

Limit	Value	Enforced by
Max message size	1 MiB	`WebSocketUpgrade::max_message_size`
Max concurrent connections	256	Tower `ConcurrencyLimitLayer`

Connection Lifecycle & Keepalive

Establishment

Both /api/control and /api/daemon use standard HTTP/1.1 WebSocket upgrade. The Axum WebSocketUpgrade extractor handles the handshake.

Ping/pong

Both handlers respond to Ping frames with Pong frames containing the same payload:

Ok(Message::Ping(data)) => {
    let _ = ws_tx.send(Message::Pong(data)).await;
    continue;
}

Graceful close

When a Close frame is received:

Control handler: breaks the handler loop, dropping log subscriber channels
Daemon handler: breaks the loop, then emits Event::DaemonExit { daemon_id } for immediate cleanup

Cleanup on disconnect

Control connections:

log_tx channel is dropped, stopping log forwarding to that client
No coordinator state to clean up (control connections are stateless)

Daemon connections:

DaemonExit event is emitted if a daemon_id was tracked
cmd_tx and pending_replies are dropped
Coordinator removes the daemon from its connection map

WsSession (CLI client):

All entries in pending_requests receive Err("WS connection closed")
All entries in pending_subscribes receive Err("WS connection closed")

Message Flow Examples

CLI lists dataflows

CLI                          WsSession                    Coordinator
 │                              │                              │
 │── request(&List) ───────────>│                              │
 │                              │── WsRequest ────────────────>│
 │                              │   id: "abc-123"              │
 │                              │   method: "control"          │
 │                              │   params: "List"             │
 │                              │                              │
 │                              │                    ControlEvent::IncomingRequest
 │                              │                    reply via oneshot
 │                              │                              │
 │                              │<── WsResponse ──────────────│
 │                              │   id: "abc-123"              │
 │                              │   result: {DataflowList:[]}  │
 │                              │                              │
 │<── ControlRequestReply ─────│                              │

Daemon registration

Daemon                                    Coordinator
  │                                           │
  │── WsRequest ─────────────────────────────>│
  │   method: "daemon_event"                  │
  │   params: {inner: Register{...},          │
  │            timestamp: ...}                │
  │                                           │  validate version
  │                                           │  validate machine_id
  │                                           │  assign daemon_id
  │                                           │  store DaemonConnection
  │                                           │
  │── WsRequest{Event{Heartbeat}} ──────────>│
  │                                           │  Event::DaemonHeartbeat
  │                                           │
  │                        (on WS close) ────>│  Event::DaemonExit

Log subscription lifecycle

CLI                    WsSession              Coordinator
 │                        │                        │
 │── subscribe_logs() ───>│                        │
 │                        │── WsRequest ──────────>│
 │                        │   params: LogSubscribe │
 │                        │                        │  find dataflow
 │                        │<── WsResponse ────────│  {subscribed: true}
 │<── ack (Ok) ──────────│                        │
 │                        │                        │
 │                        │<── WsEvent{log} ──────│  (node produces log)
 │<── log_rx.recv() ─────│                        │
 │                        │<── WsEvent{log} ──────│
 │<── log_rx.recv() ─────│                        │
 │                        │                        │
 │   (drop session) ─────>│── Close ─────────────>│  (log_subscribers dropped)

Test Coverage

Test tiers

Tier	Location	Tests	What’s covered
Unit (protocol)	`libraries/message/src/ws_protocol.rs`	10	Roundtrip serialization, untagged dispatch, error cases
Unit (client)	`binaries/cli/src/ws_client.rs`	6	Response routing, subscribe ack, topic subscribe ack, orphan handling, disconnect
Integration (control)	`binaries/coordinator/tests/ws_control_tests.rs`	11	Health check, List, invalid JSON/params, Destroy, DaemonConnected, ping/pong, concurrent requests, connection close, log subscribe
Integration (daemon)	`binaries/coordinator/tests/ws_daemon_tests.rs`	4	Register, register-then-status, disconnect cleanup, ping/pong
E2E (WsSession)	`tests/ws-cli-e2e.rs`	4	WsSession + coordinator: list, status, stop, multi-request
Total		35

Key test patterns

Poll-with-timeout: Integration tests poll coordinator state (e.g., DaemonConnected) with a 2-second deadline and 20ms sleep intervals, avoiding flaky timing assumptions.
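
A poll-with-timeout helper of this kind might look like the following standard-library sketch. `poll_until` is a hypothetical name, and the real tests may structure the loop differently (for example with async sleeps).

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `ready` repeatedly until it returns true or `timeout` elapses.
/// Returns whether the condition was met before the deadline.
pub fn poll_until(
    timeout: Duration,
    interval: Duration,
    mut ready: impl FnMut() -> bool,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

With the values quoted above, a test would call `poll_until(Duration::from_secs(2), Duration::from_millis(20), ...)` and assert on the returned flag, so a slow machine only makes the test take longer instead of making it fail.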

No nested runtimes: E2E tests run the coordinator on a background std::thread with its own tokio runtime, while WsSession (which creates its own current-thread runtime) runs on the test’s main thread. This avoids the “cannot start a runtime from within a runtime” panic.

u128 workaround in tests: Daemon test helpers construct WsRequest JSON strings manually via format!() + serde_json::to_string() (not serde_json::to_value()) to preserve uhlc::ID u128 values on the wire.

Test coordinator setup: Both integration and E2E tests use dora_coordinator::start_testing() which binds to port 0 (OS-assigned) and accepts an empty external event stream.

Configuration Reference

Constants

Constant	Value	File	Purpose
`MAX_CONTROL_MESSAGE_BYTES`	1 MiB (1,048,576)	`ws_server.rs`	Max WebSocket message size
`MAX_WS_CONNECTIONS`	256	`ws_server.rs`	Tower concurrency limit

Server setup

// Production: called by coordinator's main startup
let (port, shutdown, future) = ws_server::serve(bind_addr, event_tx, clock).await?;
tokio::spawn(future);
// ...
shutdown.shutdown(); // graceful stop

Test setup

// Binds to port 0, returns (port, future)
let (port, future) = dora_coordinator::start_testing(
    "127.0.0.1:0".parse().unwrap(),
    futures::stream::empty(),
).await?;

Shutdown

ShutdownTrigger wraps a oneshot::Sender<()>. Calling .shutdown() sends the signal, which the Axum server receives via with_graceful_shutdown. In-flight requests complete; new connections are rejected.

WebSocket Topic Data Channel

The topic data channel extends the WebSocket control plane to proxy live dataflow messages from the coordinator to CLI clients. Instead of requiring direct Zenoh network access, CLI commands like topic echo, topic hz, and topic info receive message data over the existing WebSocket connection as binary frames.

Motivation

Scenario	Before (Zenoh direct)	After (WS proxy)
CLI on same machine as daemon	Works	Works
CLI remote, Zenoh reachable	Works	Works
CLI remote, no Zenoh access	Fails	Works
Browser-based web UI	Impossible	Possible
Embedded target, no local disk	Cannot record locally	`--proxy` streams to CLI

The key insight: CLI and future web UIs connect to the coordinator via WebSocket. By having the coordinator subscribe to Zenoh on their behalf and forward messages as binary frames, topic inspection works anywhere the WebSocket connection reaches.

Architecture

CLI  ──── WS (binary frames) ────>  Coordinator  ──── Zenoh sub ────>  Daemon
                                    (Zenoh proxy)                      (debug publish)

The coordinator acts as a Zenoh proxy:

CLI sends a TopicSubscribe request over the existing text-frame WS protocol
Coordinator validates the dataflow and opens Zenoh subscribers
Coordinator forwards each Zenoh sample as a binary WS frame back to the CLI
CLI dispatches binary frames by subscription UUID to the appropriate consumer

Key source files

File	Role
`libraries/message/src/cli_to_coordinator.rs`	`TopicSubscribe`, `TopicUnsubscribe` request variants
`libraries/message/src/coordinator_to_cli.rs`	`TopicSubscribed` reply variant
`binaries/coordinator/src/ws_control.rs`	Zenoh proxy: subscribe, forward binary frames
`binaries/coordinator/src/control.rs`	`ControlEvent::TopicSubscribe` for validation
`binaries/cli/src/ws_client.rs`	`WsSession::subscribe_topics()`, binary frame dispatch
`binaries/cli/src/command/topic/echo.rs`	Topic echo via WS
`binaries/cli/src/command/topic/hz.rs`	Topic frequency measurement via WS
`binaries/cli/src/command/topic/info.rs`	Topic metadata/stats via WS
`binaries/cli/src/command/record.rs`	`--proxy` flag for WS-based recording

Wire Protocol

Subscription handshake (JSON text frames)

The subscription uses the existing UUID-correlated request-reply protocol:

Request (CLI -> Coordinator):

{
  "id": "abc-123",
  "method": "control",
  "params": {
    "TopicSubscribe": {
      "dataflow_id": "550e8400-...",
      "topics": [["camera_node", "image"], ["lidar_node", "points"]]
    }
  }
}

Response (Coordinator -> CLI):

{
  "id": "abc-123",
  "result": {
    "TopicSubscribed": {
      "subscription_id": "7f1b3a00-..."
    }
  }
}

Unsubscribe (CLI -> Coordinator):

{
  "id": "def-456",
  "method": "control",
  "params": {
    "TopicUnsubscribe": {
      "subscription_id": "7f1b3a00-..."
    }
  }
}

Binary data frames

After the handshake, the coordinator pushes binary WS frames. Each frame has a fixed-size header:

 0                   16                              N
 ├───────────────────┼──────────────────────────────┤
 │  subscription_id  │  Timestamped<InterDaemonEvent>│
 │  (16 bytes UUID)  │  (bincode serialized)         │
 └───────────────────┴──────────────────────────────┘

Field	Size	Description
`subscription_id`	16 bytes	UUID matching the `TopicSubscribed` ack, for multiplexing
payload	variable	Raw `Timestamped<InterDaemonEvent>` bincode bytes from Zenoh

The 16-byte UUID prefix allows multiplexing multiple subscriptions on a single WS connection without additional framing overhead.
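
Splitting a binary frame into its two parts takes only a few lines. The sketch below returns the raw 16 ID bytes instead of a `Uuid` so that it needs no external crate; `split_topic_frame` is a hypothetical helper, and decoding the bincode payload is left out.

```rust
/// Splits a binary topic frame into the 16-byte subscription ID and the
/// remaining payload. Returns `None` if the frame is shorter than the header.
pub fn split_topic_frame(frame: &[u8]) -> Option<([u8; 16], &[u8])> {
    if frame.len() < 16 {
        return None;
    }
    let (head, payload) = frame.split_at(16);
    let mut id = [0u8; 16];
    id.copy_from_slice(head);
    Some((id, payload))
}
```

The payload slice borrows from the original frame, so dispatching it to the matching subscriber does not copy the message body until the receiver decides to own it.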

Data Flow

CLI                         WsSession                     Coordinator
 │                              │                              │
 │── subscribe_topics() ───────>│                              │
 │                              │── WsRequest{TopicSubscribe} >│
 │                              │                              │ validate dataflow
 │                              │                              │ open Zenoh session (lazy)
 │                              │                              │ spawn subscriber tasks
 │                              │<── WsResponse{TopicSubscribed}│
 │<── (sub_id, data_rx) ───────│                              │
 │                              │                              │
 │                              │       ┌── Zenoh sample ──────│ Daemon publishes
 │                              │<──────│ Binary frame         │
 │<── data_rx.recv() ──────────│       │ (sub_id + payload)   │
 │                              │       │                      │
 │                              │<──────│ Binary frame         │
 │<── data_rx.recv() ──────────│       │                      │
 │                              │       └                      │
 │                              │                              │
 │   (drop session) ───────────>│── Close ────────────────────>│ abort subscriber tasks

Coordinator internals

Validation: ControlEvent::TopicSubscribe is sent to the coordinator event loop, which checks that the dataflow exists and has enable_debug_inspection: true set
Lazy Zenoh: The coordinator’s Zenoh session is opened on the first TopicSubscribe request and reused for subsequent subscriptions on the same WS connection
Per-topic tasks: Each (node_id, data_id) pair spawns a tokio task that subscribes to the corresponding Zenoh topic and forwards samples to the binary frame channel
Backpressure: The binary frame channel has capacity 64. try_send is used – if the channel is full (slow consumer), samples are silently dropped rather than blocking the Zenoh subscriber
Cleanup: When the WS connection closes, all subscriber tasks are aborted
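
The drop-on-full behaviour in the Backpressure item can be sketched with the standard library's bounded channel. This is only an approximation: the coordinator runs async tasks, whereas this sketch substitutes `std::sync::mpsc::sync_channel`, and the names `Forward`, `frame_channel`, and `forward_sample` are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Capacity of the binary frame channel, as stated above.
const FRAME_CHANNEL_CAPACITY: usize = 64;

/// Outcome of a non-blocking forward attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Forward {
    Sent,
    DroppedFull,
    ReceiverGone,
}

/// Create the bounded channel that carries binary frames.
fn frame_channel() -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) {
    sync_channel(FRAME_CHANNEL_CAPACITY)
}

/// Try to enqueue a frame without ever blocking the producer.
/// A full channel means the sample is discarded, not retried.
fn forward_sample(tx: &SyncSender<Vec<u8>>, frame: Vec<u8>) -> Forward {
    match tx.try_send(frame) {
        Ok(()) => Forward::Sent,
        Err(TrySendError::Full(_)) => Forward::DroppedFull,
        Err(TrySendError::Disconnected(_)) => Forward::ReceiverGone,
    }
}
```

The trade-off is that a slow consumer loses samples instead of stalling the subscriber, which keeps the data that does arrive fresh.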

WsSession (CLI side)

The WsSession::subscribe_topics() method:

Serializes a TopicSubscribe request
Sends SessionCommand::SubscribeTopics through the internal command channel
The async session_loop wraps it as a WsRequest and sends it
On receiving the TopicSubscribed ack, registers the data_tx sender in topic_subscribers keyed by subscription_id
Binary frames are dispatched by extracting the first 16 bytes as UUID and sending the remainder to the matching data_tx

State maintained in session_loop:

pending_topic_subscribes: HashMap<Uuid, (ack_tx, data_tx)> – awaiting ack
topic_subscribers: HashMap<Uuid, Sender> – active subscriptions receiving binary data
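
A simplified model of these two maps, assuming standard-library channels and a `[u8; 16]` in place of `Uuid`, might look like the sketch below. `TopicState` and its methods are hypothetical, and the `ack_tx` half of the pending entry is omitted for brevity.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in for a UUID; the real maps are keyed by `Uuid`.
type SubId = [u8; 16];

/// Hypothetical, simplified model of the state kept by `session_loop`.
#[derive(Default)]
struct TopicState {
    /// Requests sent but not yet acknowledged.
    pending: HashMap<SubId, Sender<Vec<u8>>>,
    /// Acknowledged subscriptions that receive binary payloads.
    active: HashMap<SubId, Sender<Vec<u8>>>,
}

impl TopicState {
    /// Record a subscribe request and hand back the receiving end.
    fn subscribe(&mut self, id: SubId) -> Receiver<Vec<u8>> {
        let (data_tx, data_rx) = channel();
        self.pending.insert(id, data_tx);
        data_rx
    }

    /// On ack, move the sender from `pending` to `active`.
    /// Returns false if no request with this ID was pending.
    fn on_ack(&mut self, id: SubId) -> bool {
        match self.pending.remove(&id) {
            Some(tx) => {
                self.active.insert(id, tx);
                true
            }
            None => false,
        }
    }

    /// Route a binary frame; returns true only if it was delivered.
    /// Short frames and unknown IDs are dropped.
    fn dispatch(&self, frame: &[u8]) -> bool {
        if frame.len() < 16 {
            return false;
        }
        let (prefix, payload) = frame.split_at(16);
        let mut id = [0u8; 16];
        id.copy_from_slice(prefix);
        match self.active.get(&id) {
            Some(tx) => tx.send(payload.to_vec()).is_ok(),
            None => false,
        }
    }
}
```

Keeping unacknowledged requests in a separate map means a frame can never be routed to a subscription the coordinator has not confirmed.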

Prerequisites

The dataflow descriptor must enable debug message publishing:

_unstable_debug:
  enable_debug_inspection: true

Without this, the coordinator rejects the TopicSubscribe with:

dataflow {id} not found or enable_debug_inspection not enabled
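
The check behind this message can be pictured as a single lookup that treats a missing dataflow and a disabled flag the same way. The sketch below is an assumption about the shape of that logic, not the coordinator's code; `DataflowInfo` and `validate_subscribe` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical view of what the coordinator knows about a dataflow.
struct DataflowInfo {
    enable_debug_inspection: bool,
}

/// Accept the subscription only if the dataflow exists and has
/// debug inspection enabled; both failures share one error message.
fn validate_subscribe(
    dataflows: &HashMap<String, DataflowInfo>,
    id: &str,
) -> Result<(), String> {
    match dataflows.get(id) {
        Some(info) if info.enable_debug_inspection => Ok(()),
        _ => Err(format!(
            "dataflow {id} not found or enable_debug_inspection not enabled"
        )),
    }
}
```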

CLI Commands

`dora topic echo`

Stream topic data to the terminal in real-time.

# Echo a single topic
dora topic echo -d my-dataflow camera_node/image

# Echo multiple topics
dora topic echo -d my-dataflow robot1/pose robot2/vel

# JSON output for piping
dora topic echo -d my-dataflow robot1/pose --format json

Internally: calls session.subscribe_topics(), receives Timestamped<InterDaemonEvent> from the data_rx channel, deserializes Arrow data, and renders as table or JSON.

`dora topic hz`

Interactive TUI displaying per-topic publish frequency statistics.

# All topics
dora topic hz -d my-dataflow --window 10

# Specific topics
dora topic hz -d my-dataflow robot1/pose robot2/vel --window 5

Uses ratatui for the TUI. A background std::thread receives events from data_rx and dispatches to per-topic HzStats trackers via a BTreeMap<(node_id, data_id), index> lookup.
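
To make the per-topic bookkeeping concrete, here is a minimal sliding-window rate tracker together with the index lookup. It is a sketch under assumptions: `RateWindow` and `build_index` are hypothetical names, not the real `HzStats`, and arrival times are passed in as `Duration` offsets so the logic stays deterministic.

```rust
use std::collections::{BTreeMap, VecDeque};
use std::time::Duration;

/// Hypothetical sliding-window rate tracker (not the real `HzStats`).
struct RateWindow {
    window: Duration,
    /// Arrival times as offsets from an arbitrary start, oldest first.
    arrivals: VecDeque<Duration>,
}

impl RateWindow {
    fn new(window: Duration) -> Self {
        Self { window, arrivals: VecDeque::new() }
    }

    /// Record an arrival and evict samples older than the window.
    fn record(&mut self, now: Duration) {
        self.arrivals.push_back(now);
        while let Some(&oldest) = self.arrivals.front() {
            if now.saturating_sub(oldest) > self.window {
                self.arrivals.pop_front();
            } else {
                break;
            }
        }
    }

    /// Mean frequency in Hz over the retained samples:
    /// (n - 1) intervals divided by the time they span.
    fn hz(&self) -> Option<f64> {
        let first = *self.arrivals.front()?;
        let last = *self.arrivals.back()?;
        let span = last.saturating_sub(first).as_secs_f64();
        if self.arrivals.len() < 2 || span == 0.0 {
            return None;
        }
        Some((self.arrivals.len() - 1) as f64 / span)
    }
}

/// Build the (node_id, data_id) -> tracker index lookup.
fn build_index(topics: &[(&str, &str)]) -> BTreeMap<(String, String), usize> {
    topics
        .iter()
        .enumerate()
        .map(|(i, (node, data))| ((node.to_string(), data.to_string()), i))
        .collect()
}
```

A receiving thread would look up the index for each incoming event and call `record` on the matching tracker.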

`dora topic info`

One-shot topic metadata and statistics.

dora topic info -d my-dataflow camera_node/image --duration 5

Collects messages for --duration seconds, then displays type information, publisher, subscribers (from descriptor), message count, and bandwidth.

`dora record --proxy`

Stream dataflow data through WebSocket for local recording.

# Start dataflow first
dora start dataflow.yml --detach

# Record via proxy (data streams through coordinator to CLI)
dora record dataflow.yml --proxy -o capture.drec

# Record specific topics
dora record dataflow.yml --proxy --topics sensor/image,lidar/points

Use case: the target machine (running the daemon) has no local disk or limited storage. The --proxy flag routes data through the coordinator WebSocket to the CLI machine, where the .drec file is written locally.

Without --proxy (default), a record node is injected into the dataflow and records directly on the daemon’s machine.

Zenoh Topic Format

The coordinator subscribes to Zenoh topics using the format from dora_core::topics::zenoh_output_publish_topic():

dora/{dataflow_id}/{node_id}/{data_id}

Each topic carries Timestamped<InterDaemonEvent> as its payload, serialized with bincode. The coordinator forwards these bytes as-is (prepended with subscription UUID) – no re-serialization.
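
For illustration, the key pattern and the CLI's `node_id/data_id` arguments could be handled as in the sketch below. Both helpers are hypothetical (the real key builder is the `dora_core::topics` function named above), and splitting the argument at the first `/` is an assumption about how the CLI interprets it.

```rust
/// Build a Zenoh key following the documented pattern.
/// Hypothetical helper; it only mirrors the format shown above.
fn output_topic_key(dataflow_id: &str, node_id: &str, data_id: &str) -> String {
    format!("dora/{dataflow_id}/{node_id}/{data_id}")
}

/// Split a CLI-style `node_id/data_id` argument at the first `/`.
/// Returns `None` if either part is missing.
fn parse_topic_arg(arg: &str) -> Option<(&str, &str)> {
    let (node_id, data_id) = arg.split_once('/')?;
    if node_id.is_empty() || data_id.is_empty() {
        return None;
    }
    Some((node_id, data_id))
}
```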

Backpressure and Performance

Parameter	Value	Rationale
Binary frame channel capacity	64	Balance between latency and memory
Drop policy	Drop on full	Prefer freshness over completeness
Binary format	Raw bincode (no base64)	Avoid 33% overhead for large payloads

For high-throughput topics (camera images, point clouds), the binary frame channel may fill up if the WS connection is slow. Dropped samples are silent – the CLI will show reduced frequency in topic hz but won’t stall.
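
The 33% figure in the table follows from how base64 works: every 3 input bytes become 4 output characters. The small sketch below (helper names are hypothetical) reproduces that arithmetic for standard padded base64.

```rust
/// Length of standard padded base64 output for `n` input bytes:
/// every group of 3 bytes (rounded up) becomes 4 ASCII characters.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Relative size increase of base64 over raw bytes, in percent.
/// Not meaningful for `n == 0`.
fn base64_overhead_percent(n: usize) -> f64 {
    (base64_len(n) as f64 / n as f64 - 1.0) * 100.0
}
```

For a 3,000,000-byte image this gives 4,000,000 characters, so sending raw bytes in a binary frame saves about a third of the payload size.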

Error Handling

Error	Source	Response
Dataflow not found	Coordinator validation	WsResponse with error message
`enable_debug_inspection` not enabled	Coordinator validation	WsResponse with error message
Zenoh session open failure	Coordinator	WsResponse with error message
Zenoh subscriber failure	Per-topic task	Warning log, task exits
Binary frame too short (<16 bytes)	CLI session_loop	Warning log, frame dropped
Unknown subscription UUID	CLI session_loop	Frame dropped silently
WS connection closed	Either side	All tasks aborted, pending acks get error

Test Coverage

Tier	Location	What’s covered
Unit (client)	`binaries/cli/src/ws_client.rs`	`handle_response_topic_subscribe_ack` – verifies ack routing and subscriber registration
Unit (all existing)	`binaries/cli/src/ws_client.rs`	Updated to pass topic subscribe state through `handle_response`

The TopicSubscribe / binary frame path is primarily validated via integration testing with a running coordinator and Zenoh session. See Testing Guide for smoke test instructions.

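Dora Testing Guide

This guide covers how to run, write, and troubleshoot tests in the Dora workspace.

Quick Start (5-minute verification)

Run these three commands to verify that the workspace is healthy: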

# 1. Format check (~5s)
cargo fmt --all -- --check

# 2. Lint (~60s first run, cached after)
cargo clippy --all \
  --exclude dora-node-api-python \
  --exclude dora-operator-api-python \
  --exclude dora-ros2-bridge-python \
  -- -D warnings

# 3. Unit + integration tests (~90s first run)
cargo test --all \
  --exclude dora-node-api-python \
  --exclude dora-operator-api-python \
  --exclude dora-ros2-bridge-python

All three must pass before opening a PR. Python packages are excluded because they require maturin.

Test Tiers

Tier	What it covers	Command	Speed
Format	Code style	`cargo fmt --all -- --check`	~5s
Lint	Warnings, correctness	`cargo clippy --all ...`	~60s
Unit	Individual functions	`cargo test --all ...`	~90s
CLI	Command parsing, validation	`cargo test -p dora-cli`	~5s
Integration	Node I/O via env vars	`cargo test --test example-tests`	~30s
Smoke	Full CLI lifecycle	`cargo test --test example-smoke -- --test-threads=1`	~3min
E2E	Multi-dataflow scenarios	`cargo test --test ws-cli-e2e -- --ignored --test-threads=1`	~2min
Fault tolerance	Restart policies, timeouts	`cargo test --test fault-tolerance-e2e`	~45s
Typos	Spelling	Install typos-cli, then `typos`	~2s

Tier Details

Unit Tests

Unit tests live alongside the code they test using #[cfg(test)] modules. Key crates with tests:

Crate	Test count	What’s tested
dora-arrow-convert	~26	Round-trip Arrow type conversions
dora-cli	~96	Command parsing, value parsers, log grep/filtering, JSON parsing, WebSocket client, cluster config
dora-coordinator	~24	WS control/daemon plane, health check, concurrent requests, artifact store, rate limiter, error sanitization
dora-coordinator-store	~10	In-memory and redb CRUD, schema versioning, persistence
dora-core	~8	Dataflow descriptor validation
dora-daemon	~2	Shlex argument parsing
dora-node-api	~10	Input tracking, service/action helpers (ID generation, send_service_request/response)
dora-log-utils	~11	Log parsing utilities
dora-message	~36	Common types, WS protocol, node/data IDs, metadata, auth tokens
ros2-bridge	~30	ROS2 message/service/action parsing

Run a single crate’s tests:

cargo test -p dora-cli
cargo test -p dora-core
cargo test -p dora-arrow-convert

CLI Tests

CLI tests verify command parsing, argument validation, and value parsers without running any commands. They live in #[cfg(test)] modules inside the CLI crate.

What’s tested:

Clap schema validation (Args::command().debug_assert())
Parsing of every subcommand (run, up, down, start, stop, list, logs, build, graph, new, status, inspect top, topic list/hz/echo, node list)
Rejection of unknown subcommands
--help and --version exit codes
Value parsers: parse_store_spec (coordinator store backend), parse_window (topic hz window)
Utility functions: parse_version_from_pip_show

How to run:

cargo test -p dora-cli

How to add new tests:

When adding a new CLI subcommand or value parser, add a corresponding test in the #[cfg(test)] module of the same file. For subcommand parsing, add a parse_ok call in binaries/cli/src/command/mod.rs. For value parsers, add tests in the file that defines the parser function.

Integration Tests (Node I/O)

File: tests/example-tests.rs

These tests run compiled node executables with pre-recorded inputs and compare outputs against expected baselines. No coordinator or daemon is needed.

cargo test --test example-tests

How it works:

Builds and runs a node crate (e.g., rust-dataflow-example-node)
Sets DORA_TEST_WITH_INPUTS to a JSON file with timed events
Sets DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 for deterministic output
Compares JSONL output against tests/sample-inputs/expected-outputs-*.jsonl

Sample input/output files live in tests/sample-inputs/.
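
The baseline comparison in the last step can be thought of as a line-by-line diff of two JSONL documents. The sketch below is a generic illustration, not the helper used in tests/example-tests.rs; `first_mismatch` is a hypothetical name, and it assumes lines are compared as text after trimming trailing whitespace.

```rust
/// Return the 1-based number of the first line where the two JSONL
/// documents differ, or `None` if they match line for line.
/// A missing line on either side counts as a mismatch.
fn first_mismatch(actual: &str, expected: &str) -> Option<usize> {
    let mut actual_lines = actual.lines();
    let mut expected_lines = expected.lines();
    let mut line_no = 0;
    loop {
        line_no += 1;
        match (actual_lines.next(), expected_lines.next()) {
            (None, None) => return None,
            (Some(a), Some(e)) if a.trim_end() == e.trim_end() => continue,
            _ => return Some(line_no),
        }
    }
}
```

Reporting the first differing line number makes it easier to see which recorded event produced an unexpected output.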

Smoke Tests

File: tests/example-smoke.rs

Two execution modes are tested for each applicable example:

Networked (dora up + dora start --detach + poll + dora stop + dora down): exercises the full coordinator/daemon WS control plane.
Local (dora run --stop-after): runs everything in-process, testing the single-process dataflow path.

# Must run single-threaded (shared coordinator port)
cargo test --test example-smoke -- --test-threads=1

# Run a subset by test-name filter (smoke_rust matches the Rust networked tests, smoke_local matches the local tests)
cargo test --test example-smoke smoke_rust -- --test-threads=1
cargo test --test example-smoke smoke_local -- --test-threads=1

A bash script is also available for quick local validation:

./scripts/smoke-all.sh              # all examples
./scripts/smoke-all.sh --rust-only  # Rust examples only
./scripts/smoke-all.sh --python-only # Python examples only

Networked tests (17):

Test	Example	Timeout
`smoke_rust_dataflow`	rust-dataflow/dataflow.yml	30s
`smoke_rust_dataflow_dynamic`	rust-dataflow/dataflow_dynamic.yml	30s
`smoke_rust_dataflow_url`	rust-dataflow-url/dataflow.yml	30s
`smoke_benchmark`	benchmark/dataflow.yml	30s
`smoke_log_sink_file`	log-sink-file/dataflow.yml	30s
`smoke_log_sink_alert`	log-sink-alert/dataflow.yml	30s
`smoke_log_sink_tcp`	log-sink-tcp/dataflow.yml	30s
`smoke_python_dataflow`	python-dataflow/dataflow.yml	30s
`smoke_python_async`	python-async/dataflow.yaml	15s
`smoke_python_drain`	python-drain/dataflow.yaml	15s
`smoke_python_log`	python-log/dataflow.yaml	15s
`smoke_python_logging`	python-logging/dataflow.yml	15s
`smoke_python_multiple_arrays`	python-multiple-arrays/dataflow.yml	15s
`smoke_python_concurrent_rw`	python-concurrent-rw/dataflow.yml	15s
`smoke_service_example`	service-example/dataflow.yml	30s
`smoke_action_example`	action-example/dataflow.yml	30s

Local tests (9):

Test	Example	stop-after
`smoke_local_python_dataflow`	python-dataflow/dataflow.yml	30s
`smoke_local_python_async`	python-async/dataflow.yaml	10s
`smoke_local_python_drain`	python-drain/dataflow.yaml	10s
`smoke_local_python_log`	python-log/dataflow.yaml	10s
`smoke_local_python_logging`	python-logging/dataflow.yml	10s
`smoke_local_python_multiple_arrays`	python-multiple-arrays/dataflow.yml	10s
`smoke_local_python_concurrent_rw`	python-concurrent-rw/dataflow.yml	10s
`smoke_local_service_example`	service-example/dataflow.yml	10s
`smoke_local_action_example`	action-example/dataflow.yml	10s

Examples requiring special dependencies (webcam, CUDA, ROS2, C/C++ toolchain, multi-machine deploy) are not included in smoke tests.

E2E Tests (WebSocket CLI)

File: tests/ws-cli-e2e.rs

Two groups:

Non-ignored (fast): Start an in-process coordinator and test WsSession directly:

cargo test --test ws-cli-e2e

cli_list_empty – empty dataflow listing
cli_status_no_daemon – daemon connectivity check
cli_stop_nonexistent – error for missing dataflows
cli_multiple_requests_same_session – session reuse

Ignored (full stack): Use dora up with real nodes:

cargo test --test ws-cli-e2e -- --ignored --test-threads=1

e2e_start_list_stop – start, list, stop lifecycle
e2e_sequential_dataflows – two dataflows in sequence

Fault Tolerance Tests

File: tests/fault-tolerance-e2e.rs

These test restart policies and input timeouts using Daemon::run_dataflow directly (no CLI needed).

cargo test --test fault-tolerance-e2e

Tests:

restart_recovers_from_failure – node with restart_policy: on-failure survives panics (15s)
max_restarts_limit_reached – node exhausts max_restarts: 2 budget (15s)
input_timeout_closes_stale_input – input_timeout: 2.0s fires when upstream stops (10s)

Dataflow YAMLs for these tests live in tests/dataflows/.
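
The behaviour these tests exercise can be summarised as a small decision function. This is a sketch under assumptions rather than the daemon's logic: `should_restart` is a hypothetical name, and it assumes `max_restarts` caps the total number of restarts regardless of policy.

```rust
/// Restart policies named in the documentation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RestartPolicy {
    Never,
    OnFailure,
    Always,
}

/// Decide whether a node that just exited should be restarted.
/// Assumption: once `restarts_so_far` reaches `max_restarts`,
/// no further restart is attempted.
fn should_restart(
    policy: RestartPolicy,
    exited_ok: bool,
    restarts_so_far: u32,
    max_restarts: u32,
) -> bool {
    if restarts_so_far >= max_restarts {
        return false;
    }
    match policy {
        RestartPolicy::Never => false,
        RestartPolicy::OnFailure => !exited_ok,
        RestartPolicy::Always => true,
    }
}
```

Under this model, a node with `max_restarts: 2` that keeps failing is restarted twice and then left stopped, which matches the scenario in max_restarts_limit_reached.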

Coordinator Integration Tests

Files: binaries/coordinator/tests/ws_control_tests.rs, binaries/coordinator/tests/ws_daemon_tests.rs

These start an in-process coordinator and test the WebSocket control/daemon planes.

cargo test -p dora-coordinator

Topics covered: health check, list/stop/destroy requests, invalid JSON/params, concurrent requests, ping/pong, daemon registration, disconnect cleanup, error sanitization (no internal chain leaks), artifact store cleanup on drop.

CI Pipeline

CI runs on push/PR to main. See .github/workflows/ci.yml.

fmt  ──────────────┐
clippy ────────────┤ (all run in parallel)
test ──────────────┤
typos ─────────────┘
                   │
              e2e (depends on test)

Job	Runner	What runs
fmt	ubuntu-latest	`cargo fmt --all -- --check`
clippy	ubuntu-latest	`cargo clippy --all ... -- -D warnings`
test	ubuntu-latest	`cargo test --all ...` (excl. Python + dora-examples)
e2e	ubuntu-latest	example-tests, fault-tolerance, smoke tests, WS E2E
typos	ubuntu-latest	`crate-ci/typos@master`

The e2e job only runs after test passes. All other jobs run in parallel.

Writing New Tests

Unit tests

Add a #[cfg(test)] module in the same file as the code under test:

#[cfg(test)]
mod tests {
    use super::*;

    // `parse` and `expected` are placeholders for the function under
    // test and the value it should return.
    #[test]
    fn parses_valid_input() {
        let result = parse("valid");
        assert_eq!(result, expected);
    }
}

Integration tests for nodes

Use the integration testing framework in dora-node-api. Three approaches:

1. setup_integration_testing (recommended)

Call before the node’s main function to inject inputs and capture outputs:

#[test]
fn test_main_function() -> eyre::Result<()> {
    let events = vec![
        TimedIncomingEvent {
            time_offset_secs: 0.01,
            event: IncomingEvent::Input {
                id: "tick".into(),
                metadata: None,
                data: None,
            },
        },
        TimedIncomingEvent {
            time_offset_secs: 0.055,
            event: IncomingEvent::Stop,
        },
    ];
    let inputs = TestingInput::Input(
        IntegrationTestInput::new("node_id".parse().unwrap(), events),
    );
    let (tx, rx) = flume::unbounded();
    let outputs = TestingOutput::ToChannel(tx);
    let options = TestingOptions { skip_output_time_offsets: true };

    integration_testing::setup_integration_testing(inputs, outputs, options);
    crate::main()?;

    let outputs = rx.try_iter().collect::<Vec<_>>();
    assert_eq!(outputs, expected_outputs);
    Ok(())
}

2. Environment variable mode

Test the compiled executable directly, closest to production behavior:

DORA_TEST_WITH_INPUTS=path/to/inputs.json \
DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 \
DORA_TEST_WRITE_OUTPUTS_TO=/tmp/out.jsonl \
cargo run -p my-node

3. DoraNode::init_testing

For testing node logic without going through main:

let (node, events) = DoraNode::init_testing(inputs, outputs, Default::default())?;

Generating test input files

Record real dataflow events by setting DORA_WRITE_EVENTS_TO:

DORA_WRITE_EVENTS_TO=/tmp/recorded-events dora run examples/rust-dataflow/dataflow.yml

This writes inputs-{node_id}.json files that can be used directly with DORA_TEST_WITH_INPUTS.

Workspace-level integration tests

Add new test files in the tests/ directory. For tests that need the full CLI stack, follow the patterns in tests/example-smoke.rs:

Networked pattern (exercises coordinator + daemon):

Build nodes with Once guards (avoid rebuilding per test)
Clean up stale processes with dora down
Start cluster with dora up
Run dataflow with dora start --detach
Poll dora list --json for completion
Clean up with dora stop --all and dora down

Local pattern (single-process, in-process coordinator):

Build CLI with Once guard
Run dora run <yaml> --stop-after <duration>
Assert exit code is success
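
Two building blocks from the patterns above, the one-time build guard and the completion poll, can be sketched with the standard library alone. The helper names `ensure_built` and `poll_until` are hypothetical; a real test would run the build command and the `dora list --json` check inside the closures.

```rust
use std::sync::Once;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Guard so that an expensive build step runs once per test binary.
static BUILD: Once = Once::new();

/// Run `build` the first time this is called; later calls are no-ops.
fn ensure_built(build: impl FnOnce()) {
    BUILD.call_once(build);
}

/// Call `check` every `interval` until it returns true or `timeout`
/// elapses. Returns whether the condition was met in time.
fn poll_until(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> bool,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

Polling with a deadline keeps a hung dataflow from blocking the whole test run; the test fails with a clear timeout instead.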

Conventions

Use assert2::assert! for better error messages (available as dev-dependency)
Use tempfile::NamedTempFile for temporary output files
E2E tests that need exclusive port access should be #[ignore] and run with --test-threads=1
Async tests use #[tokio::test(flavor = "multi_thread")]
Fault tolerance test dataflows go in tests/dataflows/
Sample input/output baselines go in tests/sample-inputs/

Troubleshooting

`cargo test` fails to compile Python packages

Always exclude Python packages:

cargo test --all \
  --exclude dora-node-api-python \
  --exclude dora-operator-api-python \
  --exclude dora-ros2-bridge-python

Smoke/E2E tests fail with “address already in use”

A stale coordinator or daemon is still running. Clean up:

dora down
# or kill processes manually:
pkill -f dora-coordinator
pkill -f dora-daemon

Smoke tests hang or time out

Increase the timeout in the test if your machine is slow (look for Duration::from_secs(...))

Check that example nodes build successfully:

cargo build -p rust-dataflow-example-node -p rust-dataflow-example-status-node \
  -p rust-dataflow-example-sink -p rust-dataflow-example-sink-dynamic
cargo build -p log-sink-file -p log-sink-alert -p log-sink-tcp
cargo build --release -p benchmark-example-node -p benchmark-example-sink

For Python smoke tests, ensure pyarrow and numpy are installed

E2E tests fail when run in parallel

Smoke and ignored E2E tests must run single-threaded:

cargo test --test example-smoke -- --test-threads=1
cargo test --test ws-cli-e2e -- --ignored --test-threads=1

Integration test output doesn’t match expected

Check that DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 is set (time offsets vary per machine)

Regenerate baselines if the node’s behavior intentionally changed:

DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json \
DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 \
DORA_TEST_WRITE_OUTPUTS_TO=tests/sample-inputs/expected-outputs-rust-node.jsonl \
cargo run -p rust-dataflow-example-node

Typos check fails

The typos config is in _typos.toml. To add a false-positive exclusion:

[default.extend-identifiers]
MyCustomIdent = "MyCustomIdent"

Tests pass locally but fail in CI

CI runs on Ubuntu; check for platform-specific assumptions (paths, process signals)
CI uses rust-cache so dependency versions may differ from your local lockfile
Ensure cargo fmt --all -- --check passes (CI enforces this)


Dora User Guide