数据流规范
Dora dataflows are specified through a YAML file. This dataflow configuration file specifies the nodes of the dataflow and their inputs and outputs. It also allows configuring communication parameters and enabling debug options.
This article provides an introduction to the dataflow file format and its most important fields. For a complete reference of all available fields their behavior check ouf the documentation of the Descriptor
and Node
structs.
Defining Nodes
The most important field in a dataflow configuration file is the nodes
field, which lists the nodes of the dataflow. Each node is identified by a unique id
:
nodes:
- id: foo
path: path/to/the/executable
# ... (see below)
- id: bar
path: path/to/another/executable
# ... (see below)
For each node, you need to specify the path
of the executable or script that Dora should run when starting the node. Most of the other node fields are optional, but you typically want to specify at least some inputs
and/or outputs
.
输入和输出
Nodes can send output messages that can be received by other nodes as input. All inputs and outputs need to be specified in the dataflow configuration file.
For each node, list all output IDs that it sends under the outputs
key. Only the specified output IDs are valid to be used in output sending functions such as send_output
.
Receiving nodes can subscribe to outputs by listing them in their inputs
field. The inputs
field should be a key-value map of the following format: input_id: source_node_id/source_node_output_id
The components are defined as follows:
input_id
is the local identifier that should be used for this input. This will map to theid
field ofEvent::Input
events sent to the node event loop.source_node_id
should be theid
field of the node that sends the output that we want to subscribe tosource_node_output_id
should be the identifier of the output that that we want to subscribe to
Input/Output Example
nodes:
- id: example-node
outputs:
- one
- two
- id: receiver
inputs:
my_input: example-node/two
Fields Controlling Node Execution
Use the following fields to define how a node is executed, including command-line arguments and environment variables.
path
(required)
Specifies the path of the executable or script that Dora should run when starting the dataflow. This can point to a normal executable (e.g. when using a compiled language such as Rust) or a Python script.
nodes:
- id: rust-example
path: target/release/rust-node
- id: python-example
path: ./receive_data.py
See the path
field documentation for details.
args
and env
Use the args
field to specify command-line arguments that should be passed to the executable/script specified in path
. Use the env
field for setting environment variables.
nodes:
- id: example
path: example-node
args: -v --some-flag foo
env:
IMAGE_WIDTH: 640
IMAGE_HEIGHT: 480
Fields Controlling Node Build
Use build fields define how a node is set up and built on dora build
. All build fields are optional.
build
The build
field specifies the command that should be invoked for building the node.
- id: build-example
build: cargo build -p receive_data --release
path: target/release/receive_data
- id: multi-line-example
build: |
pip install flash-attn
pip install -e ../../node-hub/dora-phi4
path: dora-phi4
Special treatment of pip
: Build lines that start with pip
or pip3
are treated in a special way: If the --uv
argument is passed to the dora build
command, all pip
/pip3
commands are run through the uv
package manager.
git
The git
field allows downloading nodes from git repositories. This can be especially useful for distributed dataflows.
When a git
key is specified, dora build
automatically clones the specified repository (or reuse an existing clone). Then it checks out the specified branch
, tag
, or rev
, or the default branch if none of them are specified. Afterwards it runs the build
command if specified.
nodes:
- id: rust-node
git: https://github.com/dora-rs/dora.git
branch: main
build: cargo build -p rust-dataflow-example-node
path: target/debug/rust-dataflow-example-node
Operators
Operators are an experimental, lightweight alternative to nodes. Instead of running as a separate process, operators are linked into a runtime process. This allows running multiple operators to share a single address space (not supported for Python currently).
Operators are defined as part of the node list, as children of a runtime node. A runtime node is a special node that specifies no path
field, but contains an operators
field instead.
Other Dataflow Fields
See the Descriptor
struct for a full list of supported fields.