
Tydi: an open specification for complex data structures over hardware streams

1. Introduction

Tydi is an open-source project that aims to standardize interfaces for hardware components used in modern streaming dataflow designs.

The project consists of two parts: the Tydi specification and a set of reference implementations of useful tools.

Specification

The Tydi (Typed dataflow interface) specification is at the core of the project: it defines what an interface should look like and constrains its behavior, based on a constructively defined, low-level data type. Start reading here.

Tools

Based on the Tydi specification, reference implementations of tools are developed and open-sourced.

  1. Generators (under construction)
  2. Compositors (under construction)

Generators create HDL templates that users may fill in with their desired behavior. Tydi aims to impose no restrictions on which hardware description languages are used to work with the specification: it aims to provide generators for established HDLs, such as VHDL and Verilog, but also for modern hardware construction languages, such as Chisel.

Compositors are tools that combine streamlets into larger designs.

Tydi specification

The Tydi specification, short for Typed dataflow interface specification, aims to provide a standardized way in which complex data types can be transferred from one circuit to another, within the context of FPGAs or ASICs.

How to read this document

The specification consists of the following sections.

  • Introduction: defines the motivation and scope of this specification.

  • Notation: defines the mathematical notation used by the remainder of the specification.

  • Physical stream specification. Physical streams are hardware streams with their own valid/ready handshaking interface, transporting elementary data and dimensionality information. Their exact bit-level representation and transfer behavior is defined through five parameters. These parameters are normally derived from logical streams.

  • Logical stream specification. Logical streams are bundles of one or multiple physical streams of some type from the Tydi type system. Types expressed in this type system determine which physical streams with which parameters make up the logical stream. This section also introduces streamlets, the Tydi name for components that have logical streams as inputs and/or outputs.

The two specification sections are written such that the first paragraph of each section defines a new concept, and the subsequent paragraphs constrain it.

Additional information, such as examples, a more intuitive description, or motivation, is written in these kinds of blocks.

Introduction

This section describes the motivation and scope of the Tydi specification.

Background and motivation

As FPGAs become faster and larger, they are increasingly used within a data center context to accelerate computational workloads. FPGAs can already achieve greater compute density and power efficiency than CPUs and GPUs for certain types of workloads, particularly those that rely on high streaming throughput and branch-heavy computation. Examples of this are decompression, SQL query acceleration, and pattern matching. The major disadvantage of FPGAs is that they are more difficult to program than CPUs and GPUs, and that algorithms expressed imperatively in the CPU/GPU domain often need to be redesigned from the ground up to achieve competitive performance. Both the advantages and disadvantages are due to the spatial nature of an FPGA: instead of programming a number of processors, an FPGA designer "programs" millions of basic computational elements not much more complex than a single logic gate that all work in parallel, and describes how these elements are interconnected. This extreme programmability comes at a cost of roughly an order of magnitude in area and an order of magnitude in performance compared to the custom silicon used to make CPUs and GPUs. Therefore, while imperative algorithms can indeed be mapped to FPGAs more or less directly through high-level synthesis (HLS) techniques or the use of softcores, an algorithm typically needs at least two orders of magnitude of acceleration through clever use of the spatial nature of the FPGA to be competitive.

Unfortunately, the industry-standard toolchains needed to program FPGAs only take VHDL, Verilog, SystemC, (more recently through HLS) a subset of C++, and/or visually designed data flow graphs as their input. The first three provide few abstractions above the level of a single wire: while they do allow the use of data types such as integers and structures to represent bundles of wires, all control of what the voltages on those bundles of wires represent at a certain point in time (if anything) remains up to the programmer. The latter two techniques raise the bar slightly by presenting the designer with streams and memory. However, they are vendor-specific, often require additional licensing fees over the more traditional design entry methods, and in some cases are even specific to a certain version of the vendor tool and/or a certain FPGA device family.

This situation has given rise to a number of open-source projects that take higher-level languages and transform them to vendor-agnostic VHDL or Verilog. Examples of this are Chisel/FIRRTL and Clash, using generative Scala code and a Haskell-like functional programming language as their respective inputs. Both tools come with their own standard libraries of hardware components that can be used to compose accelerators out of smaller primitives, similar to the data flow design method described earlier, but with textual input and the advantage of being vendor and device agnostic.

With the advent of these data flow composition tools, it is increasingly important to have a common interface standard to allow the primitive blocks to connect to each other. The de-facto standard for this has become the AMBA AXI4 interface, designed by ARM for their microcontrollers and processors. Roughly speaking, AXI4 specifies an interface for device-to-memory connections (AXI4 and AXI4-lite), and a streaming interface (AXI4-stream) for intra-device connections.

While AXI4 and AXI4-lite are of primary importance to processors due to their memory-oriented nature, AXI4-stream is much more important for FPGAs due to their spatial nature. However, because AXI4-stream is not originally designed for FPGAs, parts of the specifications are awkward for this purpose. For instance, AXI4-stream is byte oriented: it requires its data signal to be divided into one or more byte lanes, and specifies (optional) control signals that indicate the significance of each lane. Since FPGA designs are not at all restricted to operating on byte elements, this part of the specification is often ignored, and as such, any stream with a valid, ready, and one or more data signals of any width has become the de-facto streaming standard. This is reflected for instance by Chisel's built-in Decoupled interface type.

Within a single design this is of course not an issue — as long as both the stream source and sink blocks agree on the same noncompliant interface, the design will work. However, bearing in mind that there is an increasing number of independently developed data flow oriented tools, each with their own standard library, interoperability becomes an issue: whenever a designer needs to use components from different vendors, they must first ensure that the interfaces match, and if not, insert the requisite glue logic in between.

A similar issue exists in the software domain, where different programming languages use different runtime environments and calling conventions. For instance, efficiently connecting a component written in Java to a component written in Python requires considerable effort. The keyword here is "efficiently:" because Java and Python have very different ways of representing abstract data in memory, one fundamentally has to convert from one representation to another for any communication between the two components. This serialization and deserialization overhead can and often does cost more CPU time than the execution of the algorithms themselves.

The Apache Arrow project attempts to solve this problem by standardizing a way to represent this abstract data in memory, and providing libraries for popular programming languages to interact with this data format. The goal is to make transferring data between two processes as efficient as sharing a pool of Arrow-formatted memory between them. Arrow also specifies efficient ways of serializing Arrow data structures for (temporary) storage in files or streaming structures over a network, and can also be used by GPUs through CUDA. However, FPGA-based acceleration is at the time of writing missing from the core project. The Fletcher project attempts to bridge this gap, by providing an interface layer between the Arrow in-memory format and FPGA accelerators, presenting the memory to the accelerator in an abstract, tabular form.

In order to represent the complex, nested data types supported by Arrow, the Fletcher project had to devise its own data streaming format on top of the de-facto subset of AXI4-stream. Originally, this format was simply a means to an end, and therefore, not much thought was put into it. Particularly, as only Arrow-to-device interfaces (and back) were needed for the project, an interface designed specifically for intra-device streaming is lacking; in fact, depending on configuration, the reader and writer interfaces do not even match completely. Clearly, a standard for streaming complex data types between components is needed, both within the context of the Fletcher project, and outside of it.

As far as the writers are aware, no such standard yet exists. Defining such a standard in an open, royalty-free way is the primary purpose of this document.

Goals

  • Defining a streaming format for complex data types in the context of FPGAs and, potentially, ASICs, where "complex data types" include:

    • multi-dimensional sequences of undefined length;
    • unions (a.k.a. variants);
    • structures such as tuples or records.
  • Doing the above in as broad a way as possible, without imposing unnecessary burdens on the development of simple components.

  • Allowing for minimization of area and complexity through well-defined contracts between source and sink on top of the signal specification itself.

  • Extensibility: this specification should be as usable as possible, even to those with use cases not foreseen by this specification.

Non-goals

  • In this specification, a "streaming format" refers to the way in which the voltages on a bundle of wires are used to communicate data. We expressly do NOT mean streaming over existing communication formats such as Ethernet, and certainly not over abstract streams such as POSIX pipes or other inter-process communication paradigms. If you're looking for the former, have a look at Arrow Flight. The latter is part of Apache Arrow itself through their IPC format specification.

  • We do not intend to compete with the AXI4(-stream) specification. AXI4-stream is designed for streaming unstructured byte-oriented data; Tydi streams are for streaming structured, complex data types.

  • Tydi streams have no notion of multi-endpoint network-on-chip-like structures. Adding source and destination addressing or other routing information can be done through the user signal.

  • The primitive data type in Tydi is a group of bits. We leave the mapping from these bits to higher-order types such as numbers, characters, or timestamps to existing specifications.

Notation

The Tydi specification uses a flavor of mathematical notation to formally represent the defined concepts and operations thereupon. The nonstandard part of this notation is established in this section.

Nodes

The structures needed by the specification are complex to the point that simply using tuples becomes overly confusing. Therefore, we introduce a form of named tuple, which we will call a node (the term "type" would be more appropriate, but would be confusing within the context of this specification).

The notation for a node is as follows:

\[\textrm{NodeName}(...)\]

where \(\textrm{NodeName}\) is a title case identifier unique within the specification, and \(...\) is a number of field names in some pattern. For example:

\[\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)\]

A node carries with it implicit requirements on what the fields may be. Whenever an operation constructs a node, the operation is written in a way that these requirements are guaranteed by the preconditions of the operation.

The field names of nodes are significant. When an operation depends on one of the fields of the node, the fields may be "accessed" by subscripting the field name. For instance, if \(F\) is a \(\textrm{Fields}\) node, \(F_n\) represents the value for \(n\), and \(F_N\) represents the tuple consisting of \(N_1, N_2, ..., N_n\) of the node.

When the field being indexed is clear from context, the field itself is usually omitted; for instance, using just \(N\) implies the same thing as \(F_N\) if there is only one \(\textrm{Fields}\) node being discussed, and nothing else is using \(N\) for anything.

Nodes may also be unpacked into differently named values using the following notation:

Unpack \(F'\) into \(\textrm{Fields}(N'_1 : b'_1, N'_2 : b'_2, ..., N'_n : b'_n)\).

The following discussion may then use for instance \(N'\) to refer to \(F'_N\).

In some cases, only some of the fields of a node need to have a certain value for the following discussion to apply. We will then write only those fields within the parentheses, prefixing the values with the name and an equals sign, followed by an ellipsis to signify the unconstrained fields. For example, \(\textrm{Fields}(n=2, ...)\) signifies a \(\textrm{Fields}\) node with exactly two \(N\) and two \(b\), but the values thereof are unconstrained.

Functions

Due to the highly generic nature of the specification, parts of it are described algorithmically. The algorithms are referred to as functions. Like functions in a typical programming language, they have a signature, written as follows:

\[\textrm{functionName}(...) \rightarrow \textrm{ReturnType}\]

where \(\textrm{functionName}\) is a camelcase identifier unique within the specification, \(...\) is a number of argument names in some pattern, and \(\textrm{ReturnType}\) specifies what is returned, usually in the form of a node name.

A function carries with it implicit requirements on what its parameters may be. Whenever an operation uses a function, the operation is written in a way that these requirements are guaranteed by the preconditions of the operation.

Physical streams

A Tydi physical stream carries a stream of elements, dimensionality information for said elements, and (optionally) user-defined transfer information from a source to a sink. The contents of these three groups and to some extent the way in which they are transferred are defined in a parameterized way, to provide the necessary flexibility for the higher levels of this specification, and to give hardware developers the freedom to optimize their designs. Based on these parameters, this specification defines which signals must exist on the interface, and the significance of these signals.

Parameters

A physical stream is parameterized as \(\textrm{PhysicalStream}(E, N, D, C, U)\). The significance of these parameters is defined in the following subsections.

The parameters are defined on the interfaces of the source and the sink rather than on the stream between them, because the complexity parameter need not be the same on the source and the sink; the other parameters must match.

Element content (E) and user/transfer content (U)

\(E\) and \(U\) are of the form \(\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)\), where \(N\) are names, \(b\) are positive integers representing bit counts, and \(n\) is a nonnegative integer, describing a list of \(n\) named bit vector signals.

The element type can be given zero fields to make a "null" stream. Such streams can still be useful, as physical streams also carry metadata. The bit counts, however, are not allowed to be zero, because many synthesis tools do not handle null ranges well.

The difference between the element content and the user/transfer content of a stream is that a transfer may encode zero or more elements, but always encodes exactly one instance of the user/transfer content.

The user/transfer signals are intended for adding lower-level user-defined protocol signals to the stream, similar to the use of the *USER signals in AXI4. For instance, physical streams could be transferred over a network-on-chip-like structure by attaching routing information such as source and destination addresses through this method.

The name of each field is a string consisting of letters, numbers, and/or underscores. It may be empty, represented in this specification as \(\varnothing\).

The name cannot start or end with an underscore.

The name cannot start with a digit.

It is illegal to start or end with an underscore, or to start with a digit, to prevent confusion when the name is prefixed to form the signal name, and for compatibility with VHDL.

The name must be case-insensitively unique within the set of named fields.

The identifier is case-insensitive because compatibility with VHDL is desired.

\(|\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)|\) is a shorthand defined to equal \(\sum_{i=1}^{n} b_i\); that is, the sum of the field bit count over all fields in the element.
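As an illustrative sketch (not part of the specification), the shorthand can be computed as follows; representing the field list as (name, bit count) pairs is an assumption of this example:

```python
def fields_width(fields):
    """|Fields(N1 : b1, ..., Nn : bn)|: the sum of the bit counts over all fields.

    `fields` is assumed to be a list of (name, bit_count) pairs."""
    return sum(bits for _name, bits in fields)

# A hypothetical element with a 2-bit tag and an 8-bit value is 10 bits wide.
print(fields_width([("tag", 2), ("value", 8)]))  # -> 10
```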

Number of element lanes (N)

\(N\) must be an integer greater than or equal to one. The signals used for describing a data element are replicated this many times, allowing multiple elements to be transferred per stream handshake. Each replication is called a lane.

Dimensionality (D)

\(D\) must be an integer greater than or equal to zero. It specifies the number of last bits needed to represent the data.

Intuitively, each sequence nesting level adds one to this number. For instance, to stream two-dimensional sequences, two last bits are needed: one to mark the boundaries of the inner sequence, and one to mark the boundary of each two-dimensional sequence.

Complexity (C)

\(C\) must be a nonempty list of nonnegative integers. It encodes the guarantees a source makes about how elements are transferred. Equivalently, it encodes the assumptions a sink can safely make.

Higher C                           Lower C
Fewer guarantees made by source    More guarantees made by source
Source is easier to implement      Source is harder to implement
Sink can make fewer assumptions    Sink can make more assumptions
Sink is harder to implement        Sink is easier to implement

C is usually specified as a period-separated list, like a version number or paragraph/section number, because, like those, it is orderable.

The complexity number carries the following intuitive significance.

\(C\)    Description
 < 8     Only one sequence can be terminated per transfer.
 < 7     The indices of the active data lanes can be described with a simple range.
 < 6     The range of active data lanes must always start with lane zero.
 < 5     All lanes must be active for all but the last transfer of the innermost sequence.
 < 4     The last flag cannot be postponed until after the transfer of the last element.
 < 3     Innermost sequences must be transferred in consecutive cycles.
 < 2     Whole outermost instances must be transferred in consecutive cycles.

The exact requirements imposed for each \(C\) can be found along with the unconditional signal requirements in later sections.

The complexity levels and signals are defined such that a source with complexity \(C_a\) can be connected to a sink with complexity \(C_b \ge C_a\). A complexity number is higher than another when the leftmost integer of either is greater, and lower when the leftmost integer is lower. If the leftmost integer is equal, the next integer is checked. If one complexity number has more entries than another, the shorter number is padded with zeros on the right.

That is, it works a bit like a version number. The idea behind this is to allow future version of this specification (or any other kind of addition later) to insert new complexity levels between the ones defined here. For instance, 3 < 3.1 < 3.1.1 < 3.2 < 4.
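The padding-and-compare rule can be sketched in Python; this is an illustrative helper, not part of the specification, with complexity numbers modeled as lists of nonnegative integers:

```python
def compare_complexity(a, b):
    """Compare two complexity numbers, given as lists of nonnegative integers.

    The shorter number is padded with zeros on the right; entries are then
    compared left to right. Returns -1, 0, or 1 (a < b, a == b, a > b)."""
    length = max(len(a), len(b))
    a = a + [0] * (length - len(a))
    b = b + [0] * (length - len(b))
    return (a > b) - (a < b)

# 3 < 3.1 < 3.1.1 < 3.2 < 4, as in the example above.
assert compare_complexity([3], [3, 1]) == -1
assert compare_complexity([3, 1], [3, 1, 1]) == -1
assert compare_complexity([3, 1, 1], [3, 2]) == -1
assert compare_complexity([3, 2], [4]) == -1
```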

A stream is considered to be normalized when \(C < 4\). At this complexity level, there is a one-to-one mapping between the transferred data and the actual transfers for any given \(N\). This is called the canonical representation of the data for a given \(N\).

Signals

A physical stream consists of the following signals.

Name     Origin    Purpose
valid    Source    Stalling the data stream due to the source not being ready.
ready    Sink      Stalling the data stream due to the sink not being ready.
data     Source    Data transfer of \(N\) \(|E|\)-bit elements.
last     Source    Indicating the last transfer for \(D\) levels of nested sequences.
stai     Source    Start index; encodes the index of the first valid lane.
endi     Source    End index; encodes the index of the last valid lane.
strb     Source    Strobe; encodes individual lane validity for \(C \ge 8\), empty sequences otherwise.
user     Source    Additional control information carried along with the stream.

The valid and ready signals are scalars, while all the other signals are bit vectors with the following widths.

Name     Width
valid    scalar
ready    scalar
data     \(N \times |E|\)
last     \(N \times D\)
stai     \(\lceil \log_2{N} \rceil\)
endi     \(\lceil \log_2{N} \rceil\)
strb     \(N\)
user     \(|U|\)
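The widths above can be computed for a concrete parameterization. The following Python helper is an illustrative sketch, with \(|E|\) and \(|U|\) passed in as precomputed bit counts:

```python
import math

def signal_widths(N, D, element_bits, user_bits):
    """Vector signal widths of PhysicalStream(E, N, D, C, U).

    element_bits = |E| and user_bits = |U|. The valid and ready signals
    are scalars and are therefore omitted here."""
    index_width = math.ceil(math.log2(N))  # 0 when N = 1
    return {
        "data": N * element_bits,
        "last": N * D,
        "stai": index_width,
        "endi": index_width,
        "strb": N,
        "user": user_bits,
    }

# For example, N = 6 lanes of 8-bit elements with D = 2 and no user fields:
print(signal_widths(6, 2, 8, 0))
# -> {'data': 48, 'last': 12, 'stai': 3, 'endi': 3, 'strb': 6, 'user': 0}
```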

Clock

All signals are synchronous to a single clock signal, shared between the source and sink.

It is up to the user to manage clock domains and clock domain crossing where needed, and specify the edge sensitivity of the clock(s).

Reset

This specification places no constraints on the reset signal. The requirements on the valid and ready signals ensure that no transfers occur when either the source or the sink is held under reset, so the source, synchronicity, and sensitivity of the reset signal(s) can be chosen by the user.

Detailed signal description

valid signal description

The valid signal has the same semantics as the TVALID AXI4-stream signal. These semantics are repeated here.

The valid signal indicates the presence of a valid transfer payload on the other downstream signals. It is active-high.

The state of the valid signal is always significant.

That is, if the source is in some undefined or "uninitialized" state, valid must nonetheless be low.

The downstream signals, including valid, must be stable (maintain their state) from the first active clock edge in which valid is asserted until acknowledgement of the transfer. Note that the validating and acknowledging clock edge can be one and the same if ready is already asserted during the first clock cycle; in this case there are no stability requirements.

In other words, a transfer may not be cancelled or modified after validation. This allows a sink to start doing some sort of multicycle processing as soon as valid is asserted, only asserting ready when it is done, without having to worry about cancellation or changes in the transfer content. This prevents the need for extra holding registers or latches otherwise needed to keep the original transfer.

When a transfer is acknowledged, either the next transfer is presented on the downstream signals and valid remains asserted, or valid is released.

[\(C < 3\)] valid may only be released when lane \(N - 1\) of the last signal in the acknowledged transfer is nonzero.

[\(C < 2\)] valid may only be released when lane \(N - 1\) of the last signal in the acknowledged transfer is all ones.

valid must be released while the source is being reset.

This prevents spurious transfers when the sink has a different reset source.

ready signal description

The ready signal has the same semantics as the TREADY AXI4-stream signal. These semantics are repeated here.

The ready signal indicates acknowledgement of a transfer initiated through valid. It is active-high.

The state of the ready signal is significant only while valid is asserted.

That is, a source must never wait for ready to assert before it asserts valid. Doing so can lead to deadlock. However, a sink may wait for valid to assert before asserting ready.

A transfer is considered "handshaked" when both valid and ready are asserted during the active clock edge of the clock domain common to the source and the sink.

ready must be released while the sink is being reset.

This prevents transfers from being lost when the source has a different reset source.

data signal description

The data signal consists of \(N\) concatenated lanes, each carrying a number of data element fields, totaling to \(|E|\) bits per lane.

The state of lane \(i\) in the data signal is significant only while valid is asserted and data lane \(i\) is active. Data lane \(i\) is considered active if and only if

  • bit \(i\) of strb is asserted,
  • the unsigned integer interpretation of endi is greater than or equal to \(i\), and
  • the unsigned integer interpretation of stai is less than or equal to \(i\).
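The three conditions can be expressed directly as a predicate. This is an illustrative model only, with strb assumed to be a list of bits indexed by lane:

```python
def lane_active(i, strb, stai, endi):
    """Data lane i is active iff bit i of strb is asserted and the
    unsigned values of stai and endi satisfy stai <= i <= endi."""
    return bool(strb[i]) and stai <= i <= endi

# N = 4, all strobe bits set, stai = 1, endi = 2: lanes 1 and 2 are active.
print([lane_active(i, [1, 1, 1, 1], 1, 2) for i in range(4)])
# -> [False, True, True, False]
```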

The redundancy of the endi and stai signals given the presence of the strb signal (only applicable for \(C \ge 7\)) has to do with the rule that it must always be possible to connect a source with low complexity to a sink with high complexity. That is, while a source that desires individual control over the lanes and thus has \(C \ge 7\) would probably always drive endi to \(N-1\) and stai to 0, a sink with complexity \(C \ge 7\) still needs to accept input from sources that do use endi and stai.

The overhead incurred by this should be negligible in practice. First of all, a stream with \(C \ge 7\) and significant \(N\) will be quite wide to begin with, so managing two additional signals of size \(\lceil \log_2{N} \rceil\) probably doesn't matter much. Secondly, in most cases, constant propagation will remove the unneeded endi and stai signals whenever both source and sink have \(C \ge 7\). Finally, even if the above fails, going from stai + endi + strb to the internal lane enable signals in a sink is trivial for modern FPGAs and realistic \(N\), as the less-than-equal and greater-than-equal comparisons fit in a single 6-LUT up to and including \(N = 64\), resulting in just three decoding LUTs per lane (one for stai, one for endi, and one 3-input AND gate to merge the outputs of the first two and the strb bit).

The significant lanes are interpreted in order of increasing index for each transfer.

last signal description

The last signal consists of N concatenated lanes, each carrying a termination bit for each nested sequence represented by the physical stream. The last bit for lane \(i\) and dimension \(j\) (i.e., bit \(i \cdot D + j\)) being asserted indicates completion of a sequence with nesting level \(D - j - 1\), containing all elements streamed between the previous lane/transfer this occurred for dimension \(j' \ge j\) (or system reset, if never) exclusively and this point inclusively, where nesting level is defined to be 0 for the outermost sequence, 1 for its inner sequence, up to \(D - 1\) for the innermost sequence. It is active-high.
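As a sketch of the indexing only (assuming the last vector is modeled as an unsigned integer with bit 0 as the least significant bit), bit \(i \cdot D + j\) can be extracted as follows:

```python
def last_bit(last, i, j, D):
    """Bit i*D + j of the last vector: when set, it terminates at lane i the
    sequence at nesting level D - j - 1 (level 0 being the outermost)."""
    return (last >> (i * D + j)) & 1

# Transfer B of the example below has last = "000011000000" (bit 11 leftmost):
# lane 3 terminates both dimension 0 (bit 6) and dimension 1 (bit 7).
print(last_bit(0b000011000000, 3, 0, 2), last_bit(0b000011000000, 3, 1, 2))
# -> 1 1
```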

For example, one way to represent the value ["Hello", "World"], ["Tydi", "is", "nice"], [""], [] with \(N = 6, C \ge 8\) is as follows. Note that \(D = 2\) follows from the data type, \(|E|\) depends on the character encoding, and the example does not depend on \(U\).

reset released here
        |                      transfer -->
        v      A              B              C              D

last    "000100000000" "000011000000" "000001000100" "000010010000"
data       "WolleH"       "yTdlro"       "insiid"       "----ec"
strb       "111111"       "111111"       "111111"       "000011"

        .------------. .------------. .------------. .------------.
      0 | data:  'H' | | data:  'o' | | data:  'd' | | data:  'c' |
        | last:  0 0 | | last:  0 0 | | last:  0 0 | | last:  0 0 |
        |------------| |------------| |------------| |------------|
      1 | data:  'e' | | data:  'r' | | data:  'i' | | data:  'e' |
        | last:  0 0 | | last:  0 0 | | last:  0 1 | | last:  0 0 |
        |------------| |------------| |------------| |------------|
lane  2 | data:  'l' | | data:  'l' | | data:  'i' | | data:   -  |
index   | last:  0 0 | | last:  0 0 | | last:  0 0 | | last:  0 1 |
  |     |------------| |------------| |------------| |------------|
  v   3 | data:  'l' | | data:  'd' | | data:  's' | | data:   -  |
        | last:  0 0 | | last:  1 1 | | last:  0 1 | | last:  1 0 |
        |------------| |------------| |------------| |------------|
      4 | data:  'o' | | data:  'T' | | data:  'n' | | data:   -  |
        | last:  0 1 | | last:  0 0 | | last:  0 0 | | last:  1 1 |
        |------------| |------------| |------------| |------------|
      5 | data:  'W' | | data:  'y' | | data:  'i' | | data:   -  |
        | last:  0 0 | | last:  0 0 | | last:  0 0 | | last:  1 0 |
        '------------' '------------' '------------' '------------'

"Hello" is delimited by the reset condition and the asserted last bit in transfer A, lane 4, dimension 0 (innermost). "World" is delimited by the aforementioned and the last bit in transfer B, lane 3, also for dimension 0. The last bit for B3/dimension 1 is also asserted, which delimits ["Hello", "World"] along with the reset condition; both the "Hello" and "World" sequence are fully contained by the delimited portion of the stream, so the outer sequence has those as its two entries. "Tydi" is delimited by B3 and C1, "is" by C1 and C3, and "nice" by C3 and D2 for dimension 0. Note that the D2 is postponed beyond the last element (D1) by means of deactivating data lane D2. The last flag for the surrounding sequence is similarly postponed to D3; note though that data lane D3 must be inactive, or the last bit for D3/dimension 0 must be asserted for the data stream to be legal. The next "" is delimited by D3/dimension 1 and D4/dimension 0, and [""] is delimited by D3 and D4 for dimension 1. The final [] is delimited by D4 and D5 for dimension 1. It does not contain an inner sequence item because D5/dimension 0 is not asserted, so the outer sequence is empty.
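The delimiting rules walked through above can be captured by a small reference decoder. This is an illustrative model only: it assumes transfers are given as per-lane data, last, and strb lists, and it ignores stai and endi (all-lanes-enabled, \(C \ge 8\) style):

```python
def decode(transfers, N, D):
    """Reconstruct the nested-sequence value carried by a physical stream.

    transfers: list of (data, last, strb) tuples, each component a list
    indexed by lane; last[i][j] is the last bit for lane i, dimension j."""
    # stack[0] collects completed outermost sequences; stack[D] collects
    # elements of the innermost sequence currently being received.
    stack = [[] for _ in range(D + 1)]
    for data, last, strb in transfers:
        for i in range(N):
            if strb[i]:  # lane i is active: receive one element
                stack[D].append(data[i])
            for j in range(D):  # dimension j closes nesting level D - j - 1
                if last[i][j]:
                    level = D - j - 1
                    stack[level].append(stack[level + 1])
                    stack[level + 1] = []
    return stack[0]

# The four transfers A-D from the example above, last bits given as [dim0, dim1].
z = [0, 0]
A = (list("HelloW"), [z, z, z, z, [1, 0], z], [1] * 6)
B = (list("orldTy"), [z, z, z, [1, 1], z, z], [1] * 6)
C = (list("diisni"), [z, [1, 0], z, [1, 0], z, z], [1] * 6)
D_ = (list("ce----"), [z, z, [1, 0], [0, 1], [1, 1], [0, 1]], [1, 1, 0, 0, 0, 0])
outer = decode([A, B, C, D_], N=6, D=2)
print([["".join(s) for s in seq] for seq in outer])
# -> [['Hello', 'World'], ['Tydi', 'is', 'nice'], [''], []]
```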

The state of lane \(i\) in the last signal is significant only while valid is asserted.

Note that the validity of the last lanes is thus not controlled by which data lanes are active. This allows sequences to be terminated without sending an element; this is in fact required in order to send empty sequences, but also allows postponing the end-of-sequence marker until after the last element (if complexity allows) as the previous example shows.

It is illegal to assert a last bit for dimension \(j\) without first terminating all dimensions \(j' < j\), if any data lanes were active since the previous assertion of a last bit for dimension 0.

For example, the following is illegal because of the above:

        .------------.
      0 | data:  '1' |
        | last:  0 0 |
        |------------|
      1 | data:  '2' |
        | last:  0 1 |
        |------------|
lane  2 | data:  '3' |
index   | last:  0 0 |
  |     |------------|
  v   3 | data:  '4' |
        | last:  1 0 |
        |------------|
      4 | data:  '5' |
        | last:  0 0 |
        |------------|
      5 | data:  '6' |
        | last:  1 1 |
        '------------'

An argument could be made that this encodes [[1, 2], 3, 4], [[5, 6]], but allowing this interpretation would make the stream significantly less type-safe, and there would be no way to encode non-sequence elements before the last inner sequence in the outer sequence without introducing a start signal as well.

An argument could also be made that this encodes [[1, 2]], [[3, 4, 5, 6]], because the last flag delimiting [1, 2] falls within the first outer sequence and the last flag delimiting [3, 4, 5, 6] falls within the second. However, this would unnecessarily complicate sink logic, make manual interpretation of the data structure more context-sensitive, and so on.

One could also argue that it encodes [[1, 2]], [[5, 6]], because 3 and 4 are not enclosed by any inner sequence, but then it makes more sense to just deactivate the lanes.

Ultimately, this representation is ambiguous or at least leads to confusion, and is therefore illegal.

[\(C < 8\)] All last bits for lanes 0 to \(N - 2\) inclusive must be driven low by the source, and may be ignored by the sink.

The above rule ultimately means that last information is transferred on a per-transfer basis instead of a per-element basis, similar to AXI4-Stream. This can significantly decrease decoding complexity, but allows only one innermost sequence to be transferred per cycle, regardless of \(N\) and the length of the sequence.

[\(C < 4\)] It is illegal to assert a last bit for dimension \(j\) without also asserting the last bits for dimensions \(j' < j\) in the same lane.

[\(C < 4\)] It is illegal to assert the last bit for dimension 0 when the respective data lane is inactive, except for empty sequences.

The above two rules mean that the last flags cannot be postponed.
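The rule that higher-dimension last bits require the lower ones can be checked mechanically. Below is a minimal, illustrative sketch (function name and representation are hypothetical), assuming the last bits of one lane are given as a list of \(D\) bits with index 0 being dimension 0 (innermost):

```python
def last_bits_well_nested(last_lanes):
    """Check that, within each lane, asserting the last bit for
    dimension j implies the last bits for all dimensions j' < j are
    also asserted (the C < 4 rule; a sufficient condition for the
    general nesting rule when no postponement is used).

    last_lanes: list of per-lane bit lists, index 0 = dimension 0."""
    for lane in last_lanes:
        seen_zero = False
        for bit in lane:           # walk from dimension 0 upward
            if bit and seen_zero:  # higher dimension set above a cleared one
                return False
            if not bit:
                seen_zero = True
    return True

# Terminating dimension 1 together with dimension 0 is fine;
# terminating dimension 1 alone is not.
assert last_bits_well_nested([[0, 0], [1, 1]]) is True
assert last_bits_well_nested([[0, 1]]) is False
```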

stai signal description

The stai signal (start index) is used to deactivate data lanes with an index lower than the encoded binary number. It is a bit vector of size \(\lceil \log_2{N} \rceil\) to encode the full index range.

The state of the stai signal is significant only while valid is asserted.

It is illegal to drive stai to the value \(N\) or greater.

This implies that stai cannot be used to disable the last lane.

[\(C < 6\)] stai must always be driven to 0 by the source, and may be ignored by the sink.

endi signal description

The endi signal (end index) is used to deactivate data lanes with an index greater than the encoded binary number. It is a bit vector of size \(\lceil \log_2{N} \rceil\) to encode the full index range.

Note that endi cannot be used to disable the first lane.

The state of the endi signal is significant only while valid is asserted.

It is illegal to drive endi to the value \(N\) or greater.

This would semantically not be different from driving \(N - 1\), so there is no reason to burden the sink logic by allowing this case.

It is illegal to drive endi to a value less than stai.

This would essentially encode a VHDL null range, which, if allowed, could be used to encode empty sequences. However, detecting this condition is relatively hard on FPGA resources (timing in particular), typically requiring a carry chain for \(N > 8\). Instead, use strb for this purpose; at lower complexities all strb bits must be equal, so only a single bit has to be checked to determine whether a transfer carries any data.

[\(C < 5\)] endi must be driven to \(N - 1\) by the source when last is zero, and may be ignored by the sink in this case.

Together with the other complexity rules up to this level, this means that all lanes must be used for all but the last transfer containing elements for the innermost sequence. This furthermore implies that the element with innermost index \(i\) is always transferred on data lane \(i \mod N\).

strb signal description

The strb signal (strobe) is used to deactivate individual data lanes. It is a bit vector of size \(N\). When a strb bit is low, the associated data lane is deactivated.

Note that the opposite (strb high -> activated) is not necessarily true due to endi and stai.

The state of the strb signal is significant only while valid is asserted.

[\(C < 8\)] All strb bits must be driven to the same value by the source. The sink only needs to interpret one of the bits.

This effectively reduces the strb signal to a single bit indicating whether the transfer is empty (low) or not (high), as it is illegal to deactivate all data lanes by driving endi < stai.
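Putting the three lane-control signals together: a data lane is active only if its index lies within the stai/endi range and its strb bit is set. A minimal sketch of this predicate (function name hypothetical):

```python
def active_lanes(n, stai, endi, strb):
    """Return the list of active data lane indices for one transfer.

    A lane is active iff stai <= index <= endi and its strb bit is set
    (strb is a list of n bits, index = lane)."""
    assert 0 <= stai < n and 0 <= endi < n  # driving N or greater is illegal
    assert endi >= stai                     # endi < stai is illegal
    return [i for i in range(n) if stai <= i <= endi and strb[i]]

# All four lanes carry data:
assert active_lanes(4, 0, 3, [1, 1, 1, 1]) == [0, 1, 2, 3]
# endi = 1 deactivates the upper two lanes:
assert active_lanes(4, 0, 1, [1, 1, 1, 1]) == [0, 1]
# An all-zero strb encodes an empty transfer (at C < 8 all strb bits
# must be equal, so a single bit distinguishes this case):
assert active_lanes(4, 0, 3, [0, 0, 0, 0]) == []
```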

user signal description

The user signal is the concatenation of a number of user-specified fields, carrying additional transfer-oriented information. The significance of this is user-defined.

The state of the user signal is not significant when valid is not asserted.

The opposite is not necessarily true; it is up to the user's specifications if (part of) the user signal can be insignificant despite valid being asserted.

Signal omission

Not all signals are always needed. The following table shows the condition for a signal to be relevant. If the condition is false, the signal may be omitted on the interface. When two interfaces with differing but compatible complexities are connected together, the default value specified in the table below must be driven for the omitted signals.

Name   Condition                                  Default
-----  -----------------------------------------  --------
valid  see below                                  '1'
ready  see below                                  '1'
data   \(|E| > 0\)                                all '0'
last   \(D \ge 1\)                                all '1'
stai   \(C \ge 6 \wedge N > 1\)                   0
endi   \((C \ge 5 \vee D \ge 1) \wedge N > 1\)    \(N-1\)
strb   \(C \ge 7 \vee D \ge 1\)                   all '1'
user   \(|U| > 0\)                                all '0'

valid may be omitted for sources that are always valid.

For example, a constant or status signal source (the latter must be valid during reset as well for this to apply).

ready may be omitted for sinks that never block.

For example, a sink that voids all incoming data in order to measure performance of the source.
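The omission table can be encoded directly as a predicate over the stream parameters. A sketch (function name hypothetical), assuming \(C\), \(D\), \(N\) and the data/user widths are known:

```python
def required_signals(c, d, n, data_bits, user_bits):
    """Return the set of signals that must be present on the interface,
    per the signal omission table (valid/ready are handled separately,
    as their omission depends on source/sink behavior)."""
    present = {
        "data": data_bits > 0,
        "last": d >= 1,
        "stai": c >= 6 and n > 1,
        "endi": (c >= 5 or d >= 1) and n > 1,
        "strb": c >= 7 or d >= 1,
        "user": user_bits > 0,
    }
    return {name for name, p in present.items() if p}

# A simple C = 1, D = 0, N = 1 stream needs only the data signal:
assert required_signals(1, 0, 1, 8, 0) == {"data"}
# A C = 8, D = 1, N = 4 stream with user bits needs everything:
assert required_signals(8, 1, 4, 8, 2) == {
    "data", "last", "stai", "endi", "strb", "user"}
```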

Interface conventions

Streams may be named, in order to prevent name conflicts due to multiple streams existing within the same namespace. Such a name is to be prefixed to the signal names using a double underscore.

Double underscores are used as a form of unambiguous hierarchy separation to allow user-specified field names in the logical stream types (defined later) to contain (non-consecutive) underscores without risk of name conflicts.

The canonical representation for the data and user signals is the LSB-first concatenation of the contained fields. For \(N > 1\), the lanes are concatenated LSB first, such that the lane index is major and the field index is minor.

Where applicable, the signals must be listed in the following order: dn, up, valid, ready, data, last, stai, endi, strb, user (dn and up are part of the alternative representation defined below).

This is mostly just a consistency thing, primarily because it helps to get used to a single ordering when interpreting simulation waveforms. It may also simplify tooling.

The signal names must be lowercase.

This is for interoperability between languages that differ in case sensitivity.

Alternative representation

To improve code readability in hardware definition languages supporting array and aggregate constructs (record, struct, ...), the following changes are permissible. However, it is recommended to fall back to the conventions above for interoperability with other streamlets on the "outer" interfaces of an IP block.

  • valid, data, last, stai, endi, strb, and user may be bundled in an aggregate type named <stream-name>__dn__type with signal name <stream-name>__dn (dn is short for "downstream").

  • ready may be "bundled" in an aggregate type named <stream-name>__up__type with signal name <stream-name>__up for symmetry (up is short for "upstream").

  • If the language allows for signal direction reversal within a bundle, all stream signals may also be bundled into a single type named <stream-name>__type, with ready in the reverse direction.

  • The data and user fields may be bundled in aggregate types named <stream-name>__data__type and <stream-name>__user__type respectively. The data signal becomes an array of <stream-name>__data__types from 0 to \(N - 1\) if \(N > 1\) or when otherwise desirable.

  • Data and user fields consisting of a single bit may be interpreted as either a bit vector of size one or a scalar bit depending on context.

  • Fields with a common double-underscore-delimited prefix may be aggregated recursively using <stream-name>__data__<common-prefix>__type, in such a way that the double underscores in the canonical signal name are essentially replaced with the hierarchy separator of the hardware definition language.

Arrays, vectors, and concatenations

Where applicable, the bitrange for bit vector signals is always n-1 downto 0, where n is the number of bits in the signal.

Concatenations of bit vector signals are done LSB-first.

That is, lower indexed entries use lower bit indices and thus occur on the right-hand side when the bit vector is written as a binary number. Note that this results in the inverse order when using the concatenation operator in VHDL (and possibly other languages). LSB-first order is chosen because it is more intuitive when indexing the vector. Ultimately this is just a convention.
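LSB-first concatenation can be illustrated with a small sketch (function name hypothetical): earlier fields land in the lower bits of the resulting vector.

```python
def concat_lsb_first(fields):
    """Concatenate (value, width) fields LSB-first: the first field
    occupies the lowest bit indices of the result."""
    result, offset = 0, 0
    for value, width in fields:
        assert 0 <= value < (1 << width)
        result |= value << offset
        offset += width
    return result, offset

# Field a = 0b01 (2 bits) followed by b = 0b110 (3 bits):
# a occupies bits 1..0, b occupies bits 4..2, giving 0b110_01.
value, width = concat_lsb_first([(0b01, 2), (0b110, 3)])
assert (value, width) == (0b11001, 5)
```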

Where applicable, the range for arrays is always 0 to n-1, where n is the number of entries in the array.

This is the more natural order for array-like structures, actually putting the first entry on the left-hand side.

Logical streams

A Tydi logical stream consists of a bundle of physical streams and a group of user-defined signals, parameterized by way of a logical stream type. The logical stream type, in turn, represents some representation of an abstract data type (in the same way that for instance std::vector and std::list both represent a sequence in C++). Conceptually, the resulting logical stream can then be used to stream instances of its corresponding abstract data type from the source to the sink. The sink can only receive or operate on one instance at a time and only in sequence (hence the name "stream"), but the logical stream type can be designed such that the sink has random-access capability within the current instance.

Because of the degree of flexibility in specifying Tydi logical streams, much of this specification consists of the formal definitions of the structures needed to describe a Tydi logical stream, and operations thereupon. The resulting constraints on the signal lists and behavior of the signals are then described based on these.

Logical stream type

This structure is at the heart of the logical stream specification. It is used both to specify the type of a logical stream and internally for the process of lowering the recursive structure down to physical streams and signals.

Structure

The logical stream type is defined recursively by means of a number of node types. Two classes of nodes are defined: stream-manipulating nodes, and element-manipulating nodes. Only one stream-manipulating node exists (\(\textrm{Stream}\)), which is defined in the first subsection. The remainder of the subsections describe the element-manipulating nodes.

Stream

The \(\textrm{Stream}\) node is used to define a new physical stream. It is defined as \(\textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), where:

  • \(T_e\) is any logical stream type, representing the type of element carried by the logical stream;

    This may include additional \(\textrm{Stream}\) nodes to handle cases such as structures containing sequences, handled in Tydi by transferring the two different dimensions over two different physical streams, as well as to encode request-response patterns.

  • \(t\) is a positive real number, representing the minimum number of elements that should be transferable on the child stream per element in the parent stream, or, if there is no parent stream, the minimum number of elements that should be transferable per clock cycle;

    \(t\) is short for throughput ratio. As we'll see later, the \(N\) parameter for the resulting physical stream equals \(\left\lceil\prod t\right\rceil\) for all ancestral \(\textrm{Stream}\) nodes, including this node.

    Its significance is as follows. Let's say that you have a structure like \(\textrm{Stream}(T_e = \textrm{Group}(\textrm{a}: \textrm{Bits}(16), \textrm{b}: \textrm{Stream}(d = 1, T_e = \textrm{Bits}(8), ...)), ...)\), you're expecting that you'll be transferring about one of those groups every three cycles, and you're expecting that the list size of b will be about 8 on average. In this case, \(t = 1/3\) for the outer stream node, and \(t = 8\) for the inner stream node. That'll give you \(\lceil 1/3\rceil = 1\) element lane on the outer stream (used at ~33% efficiency) and \(\lceil 1/3 \cdot 8\rceil = 3\) element lanes for the inner b stream (used at ~88% efficiency), such that neither stream will on average be slowing down the design.

  • \(d\) is a nonnegative integer, representing the dimensionality of the child stream with respect to the parent stream;

    As we'll see later, the \(D\) parameter for the resulting physical stream equals \(\sum d\) for all ancestral \(\textrm{Stream}\) nodes, including this node, up to and including the closest ancestor (or self) for which \(s = \textrm{Flatten}\).

  • \(s\) must be one of \(\textrm{Sync}\), \(\textrm{Flatten}\), \(\textrm{Desync}\), or \(\textrm{FlatDesync}\), representing the relation between the dimensionality information of the parent stream and the child stream;

    The most common relation is \(\textrm{Sync}\), which enforces that for each element in the parent stream, there must be one \(d\)-dimensional sequence in the child stream (if \(d\) is 0, that just means one element). Furthermore, it means that the last flags from the parent stream are repeated in the child stream. This repetition is technically redundant; since the sink will have access to both streams, it only needs this information once to decode the data.

    If this redundant repetition is not necessary for the sink or hard to generate for a source, \(\textrm{Flatten}\) may be used instead. It is defined in the same way, except the redundant information is not repeated for the child stream.

    Consider for example the data [(1, [2, 3]), (4, [5])], [(6, [7])]. If encoded as \(\textrm{Stream}(d=1, T_e=\textrm{Group}(\textrm{Bits}(8), \textrm{Stream}(d=1, s=\textrm{Sync}, ...)), ...)\), the first physical stream will transfer [1, 4], [6], and the second physical stream will transfer [[2, 3], [5]], [[7]]. If \(\textrm{Flatten}\) is used for the inner \(\textrm{Stream}\) instead, the data transferred by the second physical stream reduces to [2, 3], [5], [7]; the structure of the outer sequence is implied by the structure transferred by the first physical stream.

    \(\textrm{Desync}\) may be used if the relation between the elements in the child and parent stream is dependent on context rather than the last flags in either stream. For example, for the type \(\textrm{Stream}(d=1, T_e=\textrm{Group}(\textrm{Bits}(8), \textrm{Stream}(s=\textrm{Desync}, ...)), ...)\), if the first stream carries [1, 4], [6] as before, the child stream may carry anything of the form [...], [...]. That is, the dimensionality information (if any) must still match, but the number of elements in the innermost sequence is unrelated or only related by some user-defined context.

    \(\textrm{FlatDesync}\), finally, does the same thing as \(\textrm{Desync}\), but also strips the dimensionality information from the parent. This means that the relation between the two streams, if any, is fully user-defined.

  • \(c\) is a complexity number, as defined in the physical stream specification, representing the complexity for the corresponding physical stream interface;

  • \(r\) must be one of \(\textrm{Forward}\) or \(\textrm{Reverse}\), representing the direction of the child stream with respect to the parent stream, or if there is no parent stream, with respect to the direction of the logical stream (source to sink);

    \(\textrm{Forward}\) indicates that the child stream flows in the same direction as its parent, complementing the data of its parent in some way. Conversely, \(\textrm{Reverse}\) indicates that the child stream acts as a response channel for the parent stream. If there is no parent stream, \(\textrm{Forward}\) indicates that the stream flows in the natural source to sink direction of the logical stream, while \(\textrm{Reverse}\) indicates a control channel in the opposite direction. The latter may occur for instance when doing random read access to a memory; the first stream carrying the read commands then flows in the sink to source direction.

  • \(T_u\) is a logical stream type consisting of only element-manipulating nodes, representing the user signals of the physical stream; that is, transfer-oriented data (versus element-oriented data); and

    Note that while \(T_u\) is called a "logical stream type" for consistency, the \(\textrm{Stream}\) node being illegal implies that no additional streams can be specified within this type.

  • \(x\) is a boolean, representing whether the stream carries "extra" information beyond the data and user signal payloads.

    \(x\) is normally false, which implies that the \(\textrm{Stream}\) node will not result in a physical stream if both its data and user signals would be empty according to the rest of this specification; it is effectively optimized away. Setting \(x\) to true simply overrides this behavior.

    This optimization handles a fairly common case when the logical stream type is recursively generated by tooling outside the scope of this layer of the specification. It arises for instance for nested lists, which may result in something of the form \(\textrm{Stream}(d=1, T_e=\textrm{Stream}(d=1, s=\textrm{Sync}, T_e=...))\). The outer stream is not necessary here; all the dimensionality information is carried by the inner stream.

    It could be argued that the above example should be reduced to \(\textrm{Stream}(d=2, s=\textrm{Sync}, T_e=...)\) by said external tool. Indeed, this description is equivalent. This reduction is however very difficult to define or specify in a way that it handles all cases intuitively, hence this more pragmatic solution was chosen.

As the \(\textrm{Stream}\) node carries many parameters and can therefore be hard to read, some abbreviations are in order:

  • \(\textrm{Dim}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 1, \textrm{Sync}, c, \textrm{Forward}, T_u, \textrm{false})\)
  • \(\textrm{New}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Sync}, c, \textrm{Forward}, T_u, \textrm{false})\)
  • \(\textrm{Des}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Desync}, c, \textrm{Forward}, T_u, \textrm{false})\)
  • \(\textrm{Flat}(T_e, t, c, T_u)\ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Flatten}, c, \textrm{Forward}, T_u, \textrm{false})\)
  • \(\textrm{Rev}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Sync}, c, \textrm{Reverse}, T_u, \textrm{false})\)

For the above abbreviations, the following defaults apply in addition:

  • \(T_u = \textrm{Null}\);
  • \(c =\) complexity of parent stream (cannot be omitted if there is no parent);
  • \(t = 1\).
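The \(\textrm{Stream}\) node and its abbreviations can be sketched as plain constructors. This is an illustrative encoding only (class and parameter names are hypothetical; complexity inheritance from the parent is modeled by a None default):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Stream:
    # Parameters in specification order: Stream(Te, t, d, s, c, r, Tu, x)
    Te: Any                  # element type
    t: float = 1.0           # throughput ratio
    d: int = 0               # dimensionality w.r.t. the parent
    s: str = "Sync"          # Sync | Flatten | Desync | FlatDesync
    c: Optional[int] = None  # complexity (None = inherit from parent)
    r: str = "Forward"       # Forward | Reverse
    Tu: Any = None           # user type (None = Null)
    x: bool = False          # keep stream even if data/user are null

# The abbreviations from the text, with their defaults:
def Dim(Te, t=1.0, c=None, Tu=None):  return Stream(Te, t, 1, "Sync", c, "Forward", Tu, False)
def New(Te, t=1.0, c=None, Tu=None):  return Stream(Te, t, 0, "Sync", c, "Forward", Tu, False)
def Des(Te, t=1.0, c=None, Tu=None):  return Stream(Te, t, 0, "Desync", c, "Forward", Tu, False)
def Flat(Te, t=1.0, c=None, Tu=None): return Stream(Te, t, 0, "Flatten", c, "Forward", Tu, False)
def Rev(Te, t=1.0, c=None, Tu=None):  return Stream(Te, t, 0, "Sync", c, "Reverse", Tu, False)

s = Dim("Bits(8)", t=3.5)
assert (s.d, s.s, s.r) == (1, "Sync", "Forward")
```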

Null

The \(\textrm{Null}\) leaf node indicates the transference of one-valued data: the only valid value is \(\varnothing\) (null).

This is useful primarily when used within a \(\textrm{Union}\), where the data type itself carries information. Furthermore, \(\textrm{Stream}\) nodes carrying a \(\textrm{Null}\) will always result in a physical stream (whereas \(\textrm{Stream}\) nodes carrying only other streams will be "optimized out"), which may still carry dimensionality information, user data, or even just transfer timing information.

Bits

The \(\textrm{Bits}\) leaf node, defined as \(\textrm{Bits}(b)\), indicates the transference of \(2^b\)-valued data carried by means of a group of \(b\) bits, where \(b\) is a positive integer.

Tydi does not specify how the bits should be interpreted: such interpretation is explicitly out of scope.

Group

The \(\textrm{Group}\) node acts as a product type (composition). It is defined as \(\textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), where:

  • \(N_i\) is a string consisting of letters, numbers, and/or underscores, representing the name of the \(i\)th field;
  • \(T_i\) is the logical stream type for the \(i\)th field;
  • \(n\) is a nonnegative integer, representing the number of fields.

Intuitively, each instance of type \(\textrm{Group}(...)\) consists of instances of all the contained types. This corresponds to a struct or record in most programming languages.

The names cannot contain two or more consecutive underscores.

Double underscores are used by Tydi as a hierarchy separator when flattening structures for the canonical representation. If not for the above rule, something of the form \(\textrm{Group}(\textrm{'a'}:\textrm{Group}(\textrm{'b__c'}:...),\textrm{'a__b'}:\textrm{Group}(\textrm{'c'}:...))\) would result in a name conflict: both signals would be named a__b__c.

The names cannot start or end with an underscore.

The names cannot start with a digit.

The names cannot be empty.

The names must be case-insensitively unique within the group.

The above requirements on the name mirror the requirements on the field names for the physical streams.
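The name rules above translate into a simple check. A sketch (function names hypothetical):

```python
import re

def valid_field_name(name):
    """Check one Group/Union field name: letters, digits and
    underscores only; non-empty; no leading digit; no leading or
    trailing underscore; no double underscore (reserved as the
    hierarchy separator)."""
    return bool(re.fullmatch(r"[A-Za-z0-9_]+", name)) \
        and not name[0].isdigit() \
        and not name.startswith("_") and not name.endswith("_") \
        and "__" not in name

def valid_field_names(names):
    """All names valid and case-insensitively unique."""
    lowered = [n.lower() for n in names]
    return all(valid_field_name(n) for n in names) \
        and len(set(lowered)) == len(lowered)

assert valid_field_name("tag_1")
assert not valid_field_name("a__b")       # double underscore is reserved
assert not valid_field_name("_a")         # no leading underscore
assert not valid_field_names(["a", "A"])  # case-insensitive clash
```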

Union

The \(\textrm{Union}\) node acts as a sum type (exclusive disjunction). It is defined as \(\textrm{Union}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), where:

  • \(N_i\) is a string consisting of letters, numbers, and/or underscores, representing the name of the \(i\)th field;
  • \(T_i\) is the logical stream type for the \(i\)th field;
  • \(n\) is a positive integer, representing the number of fields.

Intuitively, each instance of type \(\textrm{Union}\) consists of an instance of one of the contained types (also known as variants), as well as a tag indicating which field that instance belongs to. This corresponds to a union or enum in most programming languages. Mathematically, Tydi unions represent tagged unions.

One might argue that unions could also be specified without the \(\textrm{Union}\) node, by simply grouping the variants together along with a manually specified \(\textrm{Bits}\) field for the union tag, and using don't-cares for the variants not selected through the tag field. However, there are a number of downsides to such a sparse union approach.

  • For unions with many variants, the data signal of the corresponding stream becomes very wide.
  • Any streams nested inside the union variants would need to be stuffed with don't-care transfers as well, which may result in additional control logic, delay, area, etc.

While such sparse unions have their place, dense unions can be more suitable for a given application. These could also be represented without a \(\textrm{Union}\) node, but not in a nicely recursive manner without loss of information about the variant types. Therefore, (dense) unions were made a first-class type through \(\textrm{Union}\).

\(\textrm{Union}\)s can also be used to elegantly describe nullable types. The pattern for that is simply \(\textrm{Union}(\textrm{Null}, T)\). In this case, the union's tag field acts like a validity bit.

The names cannot contain two or more consecutive underscores.

Double underscores are used by Tydi as a hierarchy separator when flattening structures for the canonical representation. If not for the above rule, something of the form \(\textrm{Union}(\textrm{'a'}:\textrm{Union}(\textrm{'b__c'}:\textrm{Stream}...),\textrm{'a__b'}:\textrm{Union}(\textrm{'c'}:\textrm{Stream}...))\) would result in a name conflict: both physical streams would be named a__b__c.

The names cannot start or end with an underscore.

The names cannot start with a digit.

The names cannot be empty.

The names must be case-insensitively unique within the union.

The above requirements on the name mirror the requirements on the field names for the physical streams.

Operations on logical stream types

This section defines some operations that may be performed on logical stream types, most importantly the \(\textrm{synthesize}()\) function, which returns the signals and physical streams for a given type.

Null detection function

This section defines the function \(\textrm{isNull}(T_{in}) \rightarrow \textrm{Bool}\), which returns true if and only if \(T_{in}\) does not result in any signals.

The function returns true if and only if one or more of the following rules match:

  • \(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), \(\textrm{isNull}(T_e)\) is true, \(\textrm{isNull}(T_u)\) is true, and \(x\) is false;

  • \(T_{in} = \textrm{Null}\);

  • \(T_{in} = \textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\) and \(\textrm{isNull}(T_i)\) is true for all \(i \in (1, 2, ..., n)\); or

  • \(T_{in} = \textrm{Union}(N_1: T_1)\) and \(\textrm{isNull}(T_1)\) is true.

    Note that any union with more than one field is not null regardless of its types, because the union tag also carries information in that case.
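The rules above map directly onto a recursive function. A minimal sketch (class and function names hypothetical, with None standing in for an omitted user type):

```python
class Null: pass
class Bits:
    def __init__(self, b): self.b = b
class Group:
    def __init__(self, **fields): self.fields = fields
class Union:
    def __init__(self, **fields): self.fields = fields
class Stream:
    def __init__(self, Te, Tu=None, x=False):
        self.Te, self.Tu, self.x = Te, Tu, x

def is_null(t):
    """isNull(T): true iff T results in no signals."""
    if t is None or isinstance(t, Null):
        return True
    if isinstance(t, Stream):
        return is_null(t.Te) and is_null(t.Tu) and not t.x
    if isinstance(t, Group):
        return all(is_null(f) for f in t.fields.values())
    if isinstance(t, Union):
        variants = list(t.fields.values())
        return len(variants) == 1 and is_null(variants[0])
    return False  # Bits always carries data

assert is_null(Group(a=Null(), b=Stream(Null())))
assert not is_null(Union(a=Null(), b=Null()))  # 2+ variants: tag carries info
assert not is_null(Bits(1))
```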

Split function

This section defines the function \(\textrm{split}(T_{in}) \rightarrow \textrm{SplitStreams}\), where \(T_{in}\) is any logical stream type. The \(\textrm{SplitStreams}\) node is defined as \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_n : T_n)\), where:

  • \(T_{signals}\) is a logical stream type consisting of only element-manipulating nodes;
  • all \(T_{1..n}\) are \(\textrm{Stream}\) nodes with the data and user subtypes consisting of only element-manipulating nodes;
  • all \(N\) are case-insensitively unique, emptyable strings consisting of letters, numbers, and/or underscores, not starting or ending in an underscore, and not starting with a digit; and
  • \(n\) is a nonnegative integer.

Intuitively, it splits a logical stream type up into simplified stream types, where \(T_{signals}\) contains only the information for the user-defined signals, and \(T_i\) contains only the information for the physical stream with index \(i\) and name \(N_i\).

The names can be empty, because if there are no \(Group\)s or \(Union\)s, the one existing stream or signal will not have an intrinsic name given by the type. Empty strings are represented here with \(\varnothing\). Note that the names here can include double underscores.

\(\textrm{split}(T_{in})\) is evaluated as follows.

  • If \(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), apply the following algorithm.

    • Initialize \(N\) and \(T\) to empty lists.
    • \(T_{element} := \textrm{split}(T_e)_{signals}\) (i.e., the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function)
    • If \(\textrm{isNull}(T_{element})\) is false, \(\textrm{isNull}(T_u)\) is false, or \(x\) is true:
      • Append \(\varnothing\) (an empty name) to the end of \(N\).
      • Append \(\textrm{Stream}(T_{element}, t, d, s, c, r, T_u, x)\) to the end of \(T\).
    • Extend \(N\) with \(\textrm{split}(T_e)_{names}\) (i.e., all stream names returned by the \(\textrm{split}\) function).
    • For all \(T' \in \textrm{split}(T_e)_{streams}\) (i.e., for all named streams returned by the \(\textrm{split}\) function):
      • Unpack \(T'\) into \(\textrm{Stream}(T'_d, t', d', s', c', r', T'_u, x')\). This is always possible due to the postconditions of \(\textrm{split}\).
      • If \(r = \textrm{Reverse}\), reverse the direction of \(r'\).
      • If \(s = \textrm{Flatten}\) or \(s = \textrm{FlatDesync}\), assign \(s' := \textrm{FlatDesync}\).
      • If \(s' \ne \textrm{Flatten}\) and \(s \ne \textrm{FlatDesync}\), assign \(d' := d' + d\).
      • \(t' := t' \cdot t\).
      • Append \(\textrm{Stream}(T'_d, t', d', s', c', r', T'_u, x')\) to \(T\).
    • Return \(\textrm{SplitStreams}(\textrm{Null}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).
  • If \(T_{in} = \textrm{Null}\) or \(T_{in} = \textrm{Bits}(b)\), return \(\textrm{SplitStreams}(T_{in})\).

  • If \(T_{in} = \textrm{Group}(N_{g,1} : T_{g,1}, N_{g,2} : T_{g,2}, ..., N_{g,n} : T_{g,n})\), apply the following algorithm.

    • \(T_{signals} := \textrm{Group}(N_{g,1} : \textrm{split}(T_{g,1})_{signals}, ..., N_{g,n} : \textrm{split}(T_{g,n})_{signals})\) (i.e., replace the types contained by the group with the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function applied to the original types).
    • Initialize \(N\) and \(T\) to empty lists.
    • For all \(i \in 1..n\):
      • For all \(\textrm{name} \in \textrm{split}(T_{g,i})_{names}\) (i.e., for all stream names returned by the \(\textrm{split}\) function),
        • If \(\textrm{name} = \varnothing\), append \(N_{g,i}\) to the end of \(N\).
        • Otherwise, append the concatenation \(N_{g,i}\) and \(\textrm{name}\) to the end of \(N\), separated by a double underscore.
      • Extend \(T\) with \(\textrm{split}(T_{g,i})_{streams}\) (i.e., all named streams returned by the \(\textrm{split}\) function).
    • Return \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).
  • If \(T_{in} = \textrm{Union}(N_{u,1} : T_{u,1}, N_{u,2} : T_{u,2}, ..., N_{u,n} : T_{u,n})\), apply the following algorithm.

    • \(T_{signals} := \textrm{Union}(N_{u,1} : \textrm{split}(T_{u,1})_{signals}, ..., N_{u,n} : \textrm{split}(T_{u,n})_{signals})\) (i.e., replace the types contained by the union with the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function applied to the original types).
    • Initialize \(N\) and \(T\) to empty lists.
    • For all \(i \in 1..n\):
      • For all \(\textrm{name} \in \textrm{split}(T_{u,i})_{names}\) (i.e., for all stream names returned by the \(\textrm{split}\) function),
        • If \(\textrm{name} = \varnothing\), append \(N_{u,i}\) to the end of \(N\).
        • Otherwise, append the concatenation \(N_{u,i}\) and \(\textrm{name}\) to the end of \(N\), separated by a double underscore.
      • Extend \(T\) with \(\textrm{split}(T_{u,i})_{streams}\) (i.e., all named streams returned by the \(\textrm{split}\) function).
    • Return \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).

Note that the algorithm for \(\textrm{Group}\) and \(\textrm{Union}\) is the same, aside from returning \(\textrm{Group}\) vs \(\textrm{Union}\) nodes for \(T_{signals}\).
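The whole algorithm fits in a short recursive function. The following is an illustrative Python sketch, not a normative implementation (type classes are hypothetical, the \(c\) parameter is carried through unchanged, and a name of None stands for the empty name):

```python
from dataclasses import dataclass, replace
from typing import Any, List, Tuple

@dataclass
class Null: pass
@dataclass
class Bits: b: int
@dataclass
class Group: fields: List[Tuple[str, Any]]
@dataclass
class Union: fields: List[Tuple[str, Any]]
@dataclass
class Stream:
    Te: Any; t: float = 1.0; d: int = 0; s: str = "Sync"
    c: int = 1; r: str = "Forward"; Tu: Any = None; x: bool = False

def is_null(t):
    if t is None or isinstance(t, Null):
        return True
    if isinstance(t, Stream):
        return is_null(t.Te) and is_null(t.Tu) and not t.x
    if isinstance(t, Group):
        return all(is_null(ft) for _, ft in t.fields)
    if isinstance(t, Union):
        return len(t.fields) == 1 and is_null(t.fields[0][1])
    return False  # Bits

def split(t):
    """split(T): returns (T_signals, [(name, Stream), ...])."""
    if isinstance(t, Stream):
        sig_elem, child = split(t.Te)
        out = []
        if not is_null(sig_elem) or not is_null(t.Tu) or t.x:
            out.append((None, replace(t, Te=sig_elem)))
        for name, st in child:
            s, r, d = st.s, st.r, st.d
            if t.r == "Reverse":                  # reverse child direction
                r = "Forward" if r == "Reverse" else "Reverse"
            if t.s in ("Flatten", "FlatDesync"):  # flatten child sync
                s = "FlatDesync"
            if s != "Flatten" and t.s != "FlatDesync":
                d += t.d                          # accumulate dimensionality
            out.append((name, replace(st, s=s, r=r, d=d, t=st.t * t.t)))
        return Null(), out
    if isinstance(t, (Group, Union)):
        sig, out = [], []
        for fname, ftype in t.fields:
            fsig, fstreams = split(ftype)
            sig.append((fname, fsig))
            for sname, st in fstreams:
                out.append((fname if sname is None
                            else fname + "__" + sname, st))
        return type(t)(sig), out
    return t, []  # Null or Bits

# The example type from the Stream-node section:
# Stream(d=1, t=1/3, Te=Group(a: Bits(16), b: Stream(d=1, t=8, Te=Bits(8))))
inner = Stream(Bits(8), t=8, d=1)
outer = Stream(Group([("a", Bits(16)), ("b", inner)]), t=1 / 3, d=1)
signals, streams = split(outer)
assert isinstance(signals, Null)
assert [n for n, _ in streams] == [None, "b"]
assert streams[1][1].d == 2          # dimensionality accumulates: 1 + 1
assert abs(streams[1][1].t - 8 / 3) < 1e-9
```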

Field conversion function

This section defines the function \(\textrm{fields}(T_{in}) \rightarrow \textrm{Fields}\), where \(T_{in}\) is any logical stream type, and the \(\textrm{Fields}\) node is as defined in the physical stream specification.

Intuitively, this function flattens a logical stream type consisting of \(\textrm{Null}\), \(\textrm{Bits}\), \(\textrm{Group}\) and \(\textrm{Union}\) nodes into a list of named bitfields. It is used for constructing the data and user field lists of the physical streams and the user-defined signal list from the logical stream types preprocessed by \(\textrm{split}()\). This function is normally only applied to logical stream types that only carry element-manipulating nodes.

The names can be empty, because if there are no \(Group\)s or \(Union\)s, the one existing stream or signal will not have an intrinsic name given by the type. Empty strings are represented here with \(\varnothing\). Note that the names here can include double underscores.

\(\textrm{fields}(T_{in})\) is evaluated as follows.

  • If \(T_{in} = \textrm{Null}\) or \(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), return \(\textrm{Fields}()\).

  • If \(T_{in} = \textrm{Bits}(b)\), return \(\textrm{Fields}(\varnothing : b)\).

  • If \(T_{in} = \textrm{Group}(N_{g,1} : T_{g,1}, N_{g,2} : T_{g,2}, ..., N_{g,n} : T_{g,n})\), apply the following algorithm.

    • Initialize \(N_o\) and \(b_o\) to empty lists.
    • For all \(i \in 1..n\):
      • Unpack \(\textrm{fields}(T_{g,i})\) into \(\textrm{Fields}(N_{e,1} : b_{e,1}, N_{e,2} : b_{e,2}, ..., N_{e,m} : b_{e,m})\).
      • For all \(j \in 1..m\):
        • If \(N_{e,j} = \varnothing\), append \(N_{g,i}\) to the end of \(N_o\). Otherwise, append the concatenation of \(N_{g,i}\) and \(N_{e,j}\), separated by a double underscore, to the end of \(N_o\).
        • Append \(b_{e,j}\) to the end of \(b_o\).
    • Return \(\textrm{Fields}(N_{o,1} : b_{o,1}, N_{o,2} : b_{o,2}, ..., N_{o,p} : b_{o,p})\), where \(p = |N_o| = |b_o|\).
  • If \(T_{in} = \textrm{Union}(N_{u,1} : T_{u,1}, N_{u,2} : T_{u,2}, ..., N_{u,n} : T_{u,n})\), apply the following algorithm.

    • Initialize \(N_o\) and \(b_o\) to empty lists.
    • If \(n > 1\):
      • Append "tag" to the end of \(N_o\).
      • Append \(\left\lceil\log_2 n\right\rceil\) to the end of \(b_o\).
    • \(b_d := 0\)
    • For all \(i \in 1..n\):
      • Unpack \(\textrm{fields}(T_{u,i})\) into \(\textrm{Fields}(N_{e,1} : b_{e,1}, N_{e,2} : b_{e,2}, ..., N_{e,m} : b_{e,m})\).
      • \(b_d := \max\left(b_d, \sum_{j=1}^m b_{e,j}\right)\)
    • If \(b_d > 0\):
      • Append "union" to the end of \(N_o\).
      • Append \(b_d\) to the end of \(b_o\).
    • Return \(\textrm{Fields}(N_{o,1} : b_{o,1}, N_{o,2} : b_{o,2}, ..., N_{o,p} : b_{o,p})\), where \(p = |N_o| = |b_o|\).
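
For illustration, the flattening algorithm above can be sketched in Python as follows. The node classes and their representation are hypothetical stand-ins for the type nodes of this specification, not a normative API.

```python
import math

# Illustrative stand-ins for the logical stream type nodes.
class Null:
    pass

class Bits:
    def __init__(self, b):
        self.b = b

class Group:
    def __init__(self, **members):
        self.members = members  # dict preserves declaration order

class Union:
    def __init__(self, **variants):
        self.variants = variants

class Stream:
    def __init__(self, *args, **kwargs):
        pass  # Stream parameters are irrelevant to fields()

def fields(t):
    """Flatten a logical stream type into an ordered list of
    (name, bit count) pairs; '' stands in for the empty name."""
    if isinstance(t, (Null, Stream)):
        return []
    if isinstance(t, Bits):
        return [("", t.b)]
    if isinstance(t, Group):
        out = []
        for gname, gtype in t.members.items():
            for ename, ebits in fields(gtype):
                # Concatenate names with a double underscore separator.
                name = gname if ename == "" else gname + "__" + ename
                out.append((name, ebits))
        return out
    if isinstance(t, Union):
        out = []
        n = len(t.variants)
        if n > 1:
            # "tag" field: zero-based variant index.
            out.append(("tag", math.ceil(math.log2(n))))
        # "union" field: as wide as the widest variant.
        b_d = max((sum(b for _, b in fields(v))
                   for v in t.variants.values()), default=0)
        if b_d > 0:
            out.append(("union", b_d))
        return out
    raise TypeError("not an element-manipulating node")
```

For example, `fields(Union(a=Bits(3), b=Group(x=Bits(2), y=Bits(2)), c=Stream()))` yields `[("tag", 2), ("union", 4)]`, matching the union example later in this section.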

Synthesis function

This section defines the function \(\textrm{synthesize}(T_{in}) \rightarrow \textrm{LogicalStream}\), where \(T_{in}\) is any logical stream type. The \(\textrm{LogicalStream}\) node is defined as \(\textrm{LogicalStream}(F_{signals}, N_1 : P_1, N_2 : P_2, ..., N_n : P_n)\), where:

  • \(F_{signals}\) is of the form \(\textrm{Fields}(N_{s,1} : b_{s,1}, N_{s,2} : b_{s,2}, ..., N_{s,m} : b_{s,m})\), as defined in the physical stream specification;

    This represents the list of user-defined signals flowing in parallel to the physical streams.

  • all \(P\) are of the form \(\textrm{PhysicalStream}(E, N, D, C, U)\), as defined in the physical stream specification;

  • all \(N\) are case-insensitively unique, emptyable strings consisting of letters, numbers, and/or underscores, not starting or ending in an underscore, and not starting with a digit; and

  • \(n\) is a nonnegative integer.

Intuitively, this function synthesizes a logical stream type to its physical representation.

\(\textrm{synthesize}(T_{in})\) is evaluated as follows.

  • Unpack \(\textrm{split}(T_{in})\) into \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_n : T_n)\).
  • \(F_{signals} := \textrm{fields}(T_{signals})\)
  • For all \(i \in (1, 2, ..., n)\):
    • Unpack \(T_i\) into \(\textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\).
    • \(P_i := \textrm{PhysicalStream}(\textrm{fields}(T_e), \lceil t\rceil, d, c, \textrm{fields}(T_u))\)
  • Return \(\textrm{LogicalStream}(F_{signals}, N_1 : P_1, N_2 : P_2, ..., N_n : P_n)\).
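
The per-stream mapping above admits a brief sketch. \(\textrm{split}()\) and \(\textrm{fields}()\) are the functions defined earlier in this specification; here their results are simply passed in as plain Python values, and all names are illustrative rather than normative.

```python
import math
from dataclasses import dataclass

@dataclass
class PhysicalStream:
    element_fields: list  # E = fields(T_e)
    lanes: int            # N = ceil(t)
    dim: int              # D = d
    complexity: int       # C = c
    user_fields: list     # U = fields(T_u)

def synthesize_stream(t, d, c, element_fields, user_fields):
    """Map one Stream(T_e, t, d, s, c, r, T_u, x) node onto
    PhysicalStream(fields(T_e), ceil(t), d, c, fields(T_u))."""
    return PhysicalStream(element_fields, math.ceil(t), d, c, user_fields)
```

Note that only the throughput ratio \(t\) is transformed (rounded up to a whole number of element lanes); \(d\) and \(c\) carry over unchanged, and \(s\), \(r\), and \(x\) do not appear in the physical stream parameters.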

Type compatibility function

This section defines the function \(\textrm{compatible}(T_{source}, T_{sink}) \rightarrow \textrm{Bool}\), where \(T_{source}\) and \(T_{sink}\) are both logical stream types. The function returns true if and only if a source of type \(T_{source}\) can be connected to a sink of type \(T_{sink}\) without insertion of conversion logic; this is the case if and only if one or more of the following rules match:

  • \(T_{source} = T_{sink}\);

  • \(T_{source} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), \(T_{sink} = \textrm{Stream}(T'_e, t, d, s, c', r, T_u, x)\), \(\textrm{compatible}(T_e, T'_e) = \textrm{true}\), and \(c < c'\);

  • \(T_{source} = \textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), \(T_{sink} = \textrm{Group}(N_1: T'_1, N_2: T'_2, ..., N_n: T'_n)\), and \(\textrm{compatible}(T_i, T'_i)\) is true for all \(i \in (1, 2, ..., n)\); or

  • \(T_{source} = \textrm{Union}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), \(T_{sink} = \textrm{Union}(N_1: T'_1, N_2: T'_2, ..., N_n: T'_n)\), and \(\textrm{compatible}(T_i, T'_i)\) is true for all \(i \in (1, 2, ..., n)\).

The name matching is to be done case sensitively.

While two streams whose names differ only in case may be considered compatible in a case-insensitive language, this would not hold for case-sensitive languages. Therefore, for two streams to be compatible in all contexts, the matching must be case sensitive.
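
For illustration, the rules above can be sketched as follows. Frozen dataclasses provide the structural equality needed for the first rule; the node shapes and names are illustrative, not normative.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Bits:
    b: int

@dataclass(frozen=True)
class Stream:
    element: Any
    complexity: int
    params: tuple = ()  # stands in for the remaining (t, d, s, r, T_u, x)

@dataclass(frozen=True)
class Group:
    members: tuple  # tuple of (name, type) pairs

@dataclass(frozen=True)
class Union:
    members: tuple

def compatible(src, sink):
    if src == sink:  # rule 1: identical types
        return True
    if isinstance(src, Stream) and isinstance(sink, Stream):
        # rule 2: equal remaining parameters, compatible element types,
        # and strictly lower source complexity
        return (src.params == sink.params
                and src.complexity < sink.complexity
                and compatible(src.element, sink.element))
    if type(src) is type(sink) and isinstance(src, (Group, Union)):
        # rules 3 and 4: names match case-sensitively, in order, and all
        # member types are pairwise compatible
        return (len(src.members) == len(sink.members)
                and all(n1 == n2 and compatible(t1, t2)
                        for (n1, t1), (n2, t2)
                        in zip(src.members, sink.members)))
    return False
```

For instance, a source `Stream(Bits(4), complexity=1)` is compatible with a sink `Stream(Bits(4), complexity=3)`, but not the other way around.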

Union semantics

When flattened into fields, a \(\textrm{Union}\) node consists of:

  • if there are two or more variants, a "tag" field of size \(\left\lceil\log_2{n}\right\rceil\), where \(n\) is the number of variants; and
  • if any variant carries data in the same stream as the union itself, a "union" field of size
    \(\max\limits_{\textrm{variants}} \sum\limits_{\textrm{subfields}} b_{\textrm{subfield}}\).

The "tag" field indicates which variant is represented, by means of a zero-based index encoded as an unsigned binary number.

It is illegal for a source to drive the "tag" field of a \(\textrm{Union}\) with \(n\) variants to \(n\) or higher.

The above would otherwise be possible if \(n\) is not a power of two.

The "union" field of a \(\textrm{Union}\) is used to represent variant data existing in the same stream as the union itself. It consists of the LSB-first, LSB-aligned concatenation of the fields of the variant.

No constraint is placed on the value that the source drives on the pad bits.

The physical streams represented by \(\textrm{Stream}\) nodes nested inside \(\textrm{Union}\) variant types behave as if any other variants carried by the parent stream do not exist, but are otherwise synchronized to the parent stream as defined by the \(s\) parameter of the child stream.

Consider the following example, keeping \(x\) generic for now:

\[B = \textrm{Group}(x: \textrm{Bits}(2), y: \textrm{Bits}(2))\\ C = \textrm{Stream}(d=1, s=x, T_e=\textrm{Bits}(4))\\ U = \textrm{Union}(\textrm{a} : \textrm{Bits}(3), \textrm{b} : B, \textrm{c} : C)\\ \textrm{Stream}(d=1, T_e=U)\]

This results in two physical streams:

  • the unnamed stream carrying the union with \(D = 1\), consisting of a two-bit field named tag and a four-bit field named union; and
  • a stream named c, consisting of a single unnamed four-bit field.

Let's say that we need to represent the data [a:0, b:(x:1, y:2)], [c:[3, 4, 5], a:6]. The unnamed stream then carries the following four transfers, regardless of the value for \(x\):

[ (tag: "00", union: "-000"),     (1)
  (tag: "01", union: "1001") ],   (2)
[ (tag: "10", union: "----"),     (3)
  (tag: "00", union: "-110") ]    (4)

In transfers 1 and 4, the MSB of the union is don't-care, because variant a uses only three bits. In transfer 2, x is represented by the two LSBs of the union field, while y is represented by the two MSBs. Finally, in transfer 3, all bits of union are don't-care, as stream c carries the data. Note that a tag value of "11" is illegal in this example.
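
The four transfers above can be reproduced with a small sketch of the tag/union packing. The function, its inputs, and its rendering of don't-care bits as `-` are illustrative only.

```python
import math

def encode_union(variants, tag, values):
    """Pack one union element into ("tag", "union") bit strings.

    variants maps variant names to lists of field widths (empty for a
    variant whose data lives on a child stream); values are the field
    values of the selected variant in declaration (LSB-first) order.
    """
    names = list(variants)
    tag_bits = math.ceil(math.log2(len(names))) if len(names) > 1 else 0
    union_bits = max(sum(ws) for ws in variants.values())
    tag_str = format(names.index(tag), "0{}b".format(tag_bits)) if tag_bits else ""
    # LSB-first, LSB-aligned concatenation of the variant's fields.
    packed, pos = 0, 0
    for width, value in zip(variants[tag], values):
        packed |= value << pos
        pos += width
    data = format(packed, "0{}b".format(pos)) if pos else ""
    union_str = "-" * (union_bits - pos) + data  # pad bits are don't-care
    return tag_str, union_str

variants = {"a": [3], "b": [2, 2], "c": []}
print(encode_union(variants, "a", [0]))     # ('00', '-000')  transfer 1
print(encode_union(variants, "b", [1, 2]))  # ('01', '1001')  transfer 2
print(encode_union(variants, "c", []))      # ('10', '----')  transfer 3
print(encode_union(variants, "a", [6]))     # ('00', '-110')  transfer 4
```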

When \(x = \textrm{Sync}\), stream c has \(D = 2\) and carries the following transfers:

[              ],   (1)
[ [ ("0011"),       (2)
    ("0100"),       (3)
    ("0101") ] ]    (4)

Transfer 1 carries an empty outer sequence, corresponding to [a:0, b:(x:1, y:2)] of the represented data. There are no inner sequences because variant c never occurs in the first sequence, but the outer sequence must still be represented to correctly copy the dimensionality information of the parent stream, as mandated by the \(\textrm{Sync}\) tag. The second outer sequence represents [c:[3, 4, 5], a:6]; it carries one inner sequence, because variant c occurs once. The inner sequence directly corresponds to [3, 4, 5].

When \(x = \textrm{Flatten}\), the outer sequence is flattened away. Therefore, stream c has \(D = 1\) and carries only the following transfers:

[ ("0011"),     (1)
  ("0100"),     (2)
  ("0101") ]    (3)

It is then up to the sink to recover the outer dimension if it needs it.

When \(x = \textrm{Desync}\), \(D = 2\) as before. For the example data, the transfers would be exactly the same as for \(x = \textrm{Sync}\), but the zero-or-more relationship defined by the \(\textrm{Desync}\) means that a c variant in the parent stream may correspond with any number of inner sequences on the c stream. That does however still mean that the first outer sequence must be empty, as follows:

[         ],
[ [ ... ]
    ...
  [ ... ] ]

After all, there are no c variants in the first sequence to apply the zero-or-more relationship to.

Finally, \(x = \textrm{FlatDesync}\) flattens the outer sequence away compared to \(x = \textrm{Desync}\), allowing for any number of transfers to occur on the c stream, all corresponding to the c variant in transfer 3 of the parent stream.

If you're confused as to how a sink is to determine which transfers on the c stream correspond to which c variant in the parent stream, recall that the purpose of \(\textrm{Desync}\) and \(\textrm{FlatDesync}\) is to allow the user of this layer of the specification to define this relation as they see fit. Usually, a pattern of the form \(\textrm{Group}(\textrm{len}: \textrm{Bits}(...), \textrm{data}: \textrm{Stream}(s=\textrm{Desync}, ...))\) would be used, in which case the length of the encoded inner sequence would be transferred using the union field of the parent stream.

Natural ordering and inter-stream dependencies

The natural ordering of a datum transferred over a logical stream is defined as follows:

  • for \(\textrm{Group}\) nodes, the elements are ordered by left-to-right group order; and

  • for \(\textrm{Stream}\) nodes with a parent stream, for every element transferred on the parent stream, transfer the parent element first, then transfer the corresponding data on the child stream. This data depends on the \(s\) and \(d\) parameters of the \(\textrm{Stream}\) node:

    • \(s \in (\textrm{Sync}, \textrm{Flatten})\), \(d = 0\): one element on the parent stream corresponds to one element on the child stream;
    • \(s = \textrm{Sync}\), \(d > 0\): one element on the parent stream corresponds to one \(d\)-dimensional sequence on the child stream, delimited by the most significant last bit added on top of the last bits replicated from the parent stream;
    • \(s = \textrm{Flatten}\), \(d > 0\): one element on the parent stream corresponds to one \(d\)-dimensional sequence on the child stream, delimited by the most significant last bit;
    • \(s \in (\textrm{Desync}, \textrm{FlatDesync})\), \(d = 0\): one element on the parent stream corresponds to a user/context-defined number of elements on the child stream (zero or more); and
    • \(s \in (\textrm{Desync}, \textrm{FlatDesync})\), \(d > 0\): one element on the parent stream corresponds to a user/context-defined number of \(d\)-dimensional sequences on the child stream (zero or more).

This corresponds to depth-first preorder traversal of the logical stream type tree. It also corresponds to the order in which you would naturally write the data down.

Sources may not assume that sinks will be able to handshake data in an order differing from the natural order defined above.

Sinks may not assume that sources will be able to handshake data in an order differing from the natural order defined above.

The above two rules, if adhered to by both source and sink, prevent inter-physical-stream deadlocks.

Consider the following simple example:

\[C = \textrm{Stream}(s = \textrm{Sync}, d = 0, ...)\\ \textrm{Group}(\textrm{a}: C, \textrm{b}: C)\]

[(a: 1, b: 2), (a: 3, b: 4)]

Let's say the source follows the natural ordering rule and can't buffer anything. That is, it drives 1 on physical stream a, waits for the transfer to be acknowledged before driving 2 on physical stream b, then waits for that to be acknowledged before driving 3, and so on. If the natural-ordering rule for sinks did not exist, the sink may try to wait for stream b to be validated before it will acknowledge a, which would in this case cause a deadlock.

The reasoning also works the other way around. That is, if a sink will only acknowledge in the natural order, but the source is allowed to drive 2 on b and wait for its acknowledgement before driving 1 on a, a similar deadlock will occur.

The above does not mean that a source isn't allowed to drive data on the two streams independently as fast as it can. The same thing goes for acknowledgement by a sink. If both source and sink support full out-of-order transferral of both streams, the elements can be transferred in any order. It just means that sources and sinks must be compatible with an opposite end that does not support out-of-order transferral.

The above reasoning is particularly important for streams flowing in opposing directions. In this case, for a physical stream x naturally-ordered before another stream y flowing in the opposing direction, x acts as a request stream and y acts as its response. That is, the source of x may not assume that the response will come before it sends the request.

User-defined signal constraints

Tydi places the following constraints on the user-defined signals.

The user-defined signals flow exclusively in the source to sink direction.

While physical streams can be reversed in order to transfer metadata in the sink to source direction, logical streams fundamentally describe dataflow in the source to sink direction. Therefore, if data needs to flow in the opposite direction as well, you should use a second logical stream.

The reasoning behind this design choice is mostly practical in nature. Firstly, with the way the logical stream type is currently defined, unidirectional signals emerge naturally before the first stream node is opened, but reverse-direction signals do not; additional nodes would be needed to describe them. Secondly, allowing signals to flow in both directions might be interpreted by users as a way to implement their own handshake method instead of, or in addition to, the handshakes of the Tydi physical streams. However, most handshake methods break when registers or clock domain crossings not specifically tailored to that particular handshake are inserted in between. With a purely unidirectional approach, latency bounds are not necessary for correctness, allowing tooling to automatically generate interconnect IP. Tydi does pose some constraints on such IP, however.

Interconnect components such as register slices and clock domain crossings may interpret the signals as being asynchronous. Thus, a sink may not rely on bit flips that the source drives in the same cycle arriving in the same cycle at the sink.

This allows clock domain crossings for the user-defined signals to reduce to simple synchronization registers.

The latency of any signal travelling from source to sink must be less than or equal to the latency of the streams.

This allows the user-defined signals to be used as a way to transfer quasi-constant values, such as the control register inputs for some kernel. As long as the constants are only mutated before the source sends the first stream transfer, the signals received by the sink are guaranteed to be stable by the time it receives the first transfer.

Handshaking in the reverse direction may be done using a reversed stream. When the sink is done processing, it sends a transfer on this stream. This will necessarily be received by the source at a later point in time, which is then free to modify the user signals again.

Note that slow-changing counter values (those that only increment or decrement slowly with respect to the involved clock domains) can be safely transferred using Gray code. Pulse/strobe signals can be safely transferred by the source toggling a user-defined signal whenever the strobe occurs, and the sink detecting the transitions in this signal.
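
Both techniques can be sketched briefly. Binary-reflected Gray code and the toggle-and-detect strobe scheme are standard techniques; the functions below are shown for illustration only and are not part of the specification.

```python
def to_gray(n):
    """Binary-reflected Gray code: consecutive values differ in exactly
    one bit, so a slowly changing counter can cross clock domains safely."""
    return n ^ (n >> 1)

def from_gray(g):
    """Inverse transform, as a sink would apply after synchronization."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def detect_strobes(samples):
    """Sink-side strobe recovery: the source toggles a user-defined
    signal once per strobe; the sink counts transitions in the sampled
    level, which is safe even if individual samples arrive late."""
    return sum(a != b for a, b in zip(samples, samples[1:]))
```

For example, `to_gray(5)` is `7` (binary `111`, one bit away from `to_gray(4)` = `110`), and `detect_strobes([0, 0, 1, 1, 0])` recovers two strobes from two observed toggles.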

Tools

(under construction)

Generators

(under construction)

Compositors

(under construction)

Frequently Asked Questions

References

Publications

  • J. Peltenburg, J. Van Straten, M. Brobbel, Z. Al-Ars and H. P. Hofstee
    Tydi: An Open Specification for Complex Data Structures Over Hardware Streams
    IEEE Micro, vol. 40, no. 4, pp. 120-130, 1 July-Aug. 2020, doi: 10.1109/MM.2020.2996373.

Other

An early implementation from which the more generalized Tydi was spawned is Fletcher. Resources on Fletcher can be found here: