
Dataflow Specification

Dora dataflows are specified through a YAML file. This dataflow configuration file specifies the nodes of the dataflow and their inputs and outputs. It also allows configuring communication parameters and enabling debug options.

This article provides an introduction to the dataflow file format and its most important fields. For a complete reference of all available fields and their behavior, check out the documentation of the Descriptor and Node structs.

Defining Nodes

The most important field in a dataflow configuration file is the nodes field, which lists the nodes of the dataflow. Each node is identified by a unique id:

nodes:
  - id: foo
    path: path/to/the/executable
    # ... (see below)
  - id: bar
    path: path/to/another/executable
    # ... (see below)

For each node, you need to specify the path of the executable or script that Dora should run when starting the node. Most of the other node fields are optional, but you typically want to specify at least some inputs and/or outputs.

Inputs and Outputs

Nodes can send output messages that can be received by other nodes as input. All inputs and outputs need to be specified in the dataflow configuration file.

For each node, list all output IDs that it sends under the outputs key. Only the specified output IDs are valid to be used in output sending functions such as send_output.

Receiving nodes can subscribe to outputs by listing them in their inputs field. The inputs field should be a key-value map of the following format: input_id: source_node_id/source_node_output_id

The components are defined as follows:

  • input_id is the local identifier that should be used for this input. This will map to the id field of Event::Input events sent to the node event loop.
  • source_node_id should be the id field of the node that sends the output that we want to subscribe to.
  • source_node_output_id should be the identifier of the output that we want to subscribe to.

Input/Output Example

nodes:
  - id: example-node
    outputs:
      - one
      - two
  - id: receiver
    inputs:
      my_input: example-node/two
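
In node code, these subscriptions show up as input events. Below is a minimal Python sketch of the receiver node from the example above, assuming the dora Python API; on the sending side, example-node would call send_output with one of its declared output IDs ("one" or "two").

from dora import Node

node = Node()

for event in node:
    # The event id matches the input_id declared in the YAML file,
    # i.e. "my_input" for messages coming from example-node/two.
    if event["type"] == "INPUT" and event["id"] == "my_input":
        print("received:", event["value"])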

Fields Controlling Node Execution

Use the following fields to define how a node is executed, including command-line arguments and environment variables.

path (required)

Specifies the path of the executable or script that Dora should run when starting the dataflow. This can point to a normal executable (e.g. when using a compiled language such as Rust) or a Python script.

nodes:
  - id: rust-example
    path: target/release/rust-node
  - id: python-example
    path: ./receive_data.py

See the path field documentation for details.

args and env

Use the args field to specify command-line arguments that should be passed to the executable/script specified in path. Use the env field for setting environment variables.

nodes:
  - id: example
    path: example-node
    args: -v --some-flag foo
    env:
      IMAGE_WIDTH: 640
      IMAGE_HEIGHT: 480
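
If the node itself were a Python script (an assumption for illustration; the example above uses a compiled executable), it could read these values through the usual process mechanisms:

import os
import sys

# Command-line arguments from the args field.
verbose = "-v" in sys.argv

# Environment variables from the env field (always passed as strings).
width = int(os.environ["IMAGE_WIDTH"])
height = int(os.environ["IMAGE_HEIGHT"])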

Fields Controlling Node Build

Use the following fields to define how a node is set up and built when running dora build. All build fields are optional.

build

The build field specifies the command that should be invoked for building the node.

- id: build-example
  build: cargo build -p receive_data --release
  path: target/release/receive_data
- id: multi-line-example
  build: |
    pip install flash-attn
    pip install -e ../../node-hub/dora-phi4
  path: dora-phi4

Special treatment of pip: if the --uv argument is passed to the dora build command, all build lines that start with pip or pip3 are run through the uv package manager.
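
For example, a command along the lines of the following (the dataflow file name is a placeholder) would run the pip install lines of multi-line-example through uv:

dora build dataflow.yml --uv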

git

The git field allows downloading nodes from git repositories. This can be especially useful for distributed dataflows.

When a git key is specified, dora build automatically clones the given repository (or reuses an existing clone). It then checks out the specified branch, tag, or rev, falling back to the default branch if none of them is given. Afterwards, it runs the build command, if specified.

nodes:
  - id: rust-node
    git: https://github.com/dora-rs/dora.git
    branch: main
    build: cargo build -p rust-dataflow-example-node
    path: target/debug/rust-dataflow-example-node

Operators

Operators are an experimental, lightweight alternative to nodes. Instead of running as a separate process, operators are linked into a runtime process. This allows multiple operators to share a single address space (currently not supported for Python).

Operators are defined as part of the node list, as children of a runtime node. A runtime node is a special node that specifies no path field, but contains an operators field instead.
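
A runtime node might look roughly like the following sketch, assuming a Python operator specified via a python key; the operator id, script path, and input source are placeholders for illustration:

nodes:
  - id: runtime-node
    operators:
      - id: processing-op
        python: ./processing_op.py
        inputs:
          image: example-node/image
        outputs:
          - processed-image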

Other Dataflow Fields

See the Descriptor struct for a full list of supported fields.