Tydi: an open specification for complex data structures over hardware streams
1. Introduction
Tydi is an open-source project that aims to standardize interfaces for hardware components used in modern streaming dataflow designs.
The project consists of two parts: the Tydi specification and a set of reference implementations of useful tools.
Specification
The Tydi (Typed dataflow interface) specification is at the core of the project: it defines what an interface should look like and constrains its behavior, based on a constructively defined, low-level data type. Start reading here.
Tools
Using the Tydi specification, reference implementations of tools are developed and open-sourced.
- Generators (under construction)
- Compositors (under construction)
Generators create HDL templates that users may fill in with their desired behavior. Tydi aims to impose no restrictions on which hardware description languages are used to work with the specification: it aims to provide generators for established HDLs, such as VHDL and Verilog, as well as for more modern HDLs.
Compositors are tools that combine streamlets into larger designs.
Tydi specification
The Tydi specification, short for Typed dataflow interface specification, aims to provide a standardized way in which complex data types can be transferred from one circuit to another, within the context of FPGAs or ASICs.
How to read this document
The specification comprises the following sections.
-
Introduction: defines the motivation and scope of this specification.
-
Notation: defines the mathematical notation used by the remainder of the specification.
-
Physical stream specification. Physical streams are hardware streams with their own valid/ready handshaking interface, transporting elementary data and dimensionality information. Their exact bit-level representation and transfer behavior is defined through five parameters. These parameters are normally derived from logical streams.
-
Logical stream specification. Logical streams are bundles of one or multiple physical streams of some type from the Tydi type system. Types expressed in this type system determine which physical streams with which parameters make up the logical stream. This section also introduces streamlets, the Tydi name for components that have logical streams as inputs and/or outputs.
The two specification sections are written such that the first paragraph of each section defines a new concept, and the subsequent paragraphs constrain it.
Additional information, such as examples, a more intuitive description, or motivation, is written in these kinds of blocks.
Introduction
This section describes the motivation and scope of the Tydi specification.
Background and motivation
As FPGAs become faster and larger, they are increasingly used within a data center context to accelerate computational workloads. FPGAs can already achieve greater compute density and power efficiency than CPUs and GPUs for certain types of workloads, particularly those that rely on high streaming throughput and branch-heavy computation. Examples of this are decompression, SQL query acceleration, and pattern matching. The major disadvantage of FPGAs is that they are more difficult to program than CPUs and GPUs, and that algorithms expressed imperatively in the CPU/GPU domain often need to be redesigned from the ground up to achieve competitive performance. Both the advantages and disadvantages are due to the spatial nature of an FPGA: instead of programming a number of processors, an FPGA designer "programs" millions of basic computational elements not much more complex than a single logic gate that all work parallel to each other, and describes how these elements are interconnected. This extreme programmability comes at a cost of roughly an order of magnitude in area and an order of magnitude in performance compared to the custom silicon used to make CPUs and GPUs. Therefore, while imperative algorithms can indeed be mapped to FPGAs more or less directly through high-level synthesis (HLS) techniques or the use of softcores, typically an algorithm needs at least two orders of magnitude of acceleration through clever use of the spatial nature of the FPGA to be competitive.
Unfortunately, the industry-standard toolchains needed to program FPGAs only take VHDL, Verilog, SystemC, (more recently through HLS) a subset of C++, and/or visually designed data flow graphs as their input. The first three provide few abstractions above the level of a single wire: while they do allow the use of data types such as integers and structures to represent bundles of wires, all control of what the voltages on those bundles of wires represent at a certain point in time (if anything) remains up to the programmer. The latter two techniques raise the bar slightly by presenting the designer with streams and memory. However, they are vendor-specific, often require additional licensing fees over the more traditional design entry methods, and in some cases are even specific to a certain version of the vendor tool and/or a certain FPGA device family.
This situation has given rise to a number of open-source projects that take higher-level languages and transform them to vendor-agnostic VHDL or Verilog. Examples of this are Chisel/FIRRTL and Clash, using generative Scala code and a Haskell-like functional programming language as their respective inputs. Both tools come with their own standard libraries of hardware components that can be used to compose accelerators out of smaller primitives, similar to the data flow design method described earlier, but with textual input and the advantage of being vendor and device agnostic.
With the advent of these data flow composition tools, it is increasingly important to have a common interface standard to allow the primitive blocks to connect to each other. The de-facto standard for this has become the AMBA AXI4 interface, designed by ARM for their microcontrollers and processors. Roughly speaking, AXI4 specifies an interface for device-to-memory connections (AXI4 and AXI4-lite), and a streaming interface (AXI4-stream) for intra-device connections.
While AXI4 and AXI4-lite are of primary importance to processors due to their
memory-oriented nature, AXI4-stream is much more important for FPGAs due to
their spatial nature. However, because AXI4-stream is not originally designed
for FPGAs, parts of the specifications are awkward for this purpose. For
instance, AXI4-stream is byte oriented: it requires its data signal to be
divided into one or more byte lanes, and specifies (optional) control signals
that indicate the significance of each lane. Since FPGA designs are not at all
restricted to operating on byte elements, this part of the specification is
often ignored, and as such, any stream with a `valid`, `ready`, and one or more `data` signals of any width has become the de-facto streaming standard. This is reflected for instance by Chisel's built-in `Decoupled` interface type.
Within a single design this is of course not an issue — as long as both the stream source and sink blocks agree on the same noncompliant interface, the design will work. However, bearing in mind that there is an increasing number of independently developed data flow oriented tools, each with their own standard library, interoperability becomes an issue: whenever a designer needs to use components from different vendors, they must first ensure that the interfaces match, and if not, insert the requisite glue logic in between.
A similar issue exists in the software domain, where different programming languages use different runtime environments and calling conventions. For instance, efficiently connecting a component written in Java to a component written in Python requires considerable effort. The keyword here is "efficiently:" because Java and Python have very different ways of representing abstract data in memory, one fundamentally has to convert from one representation to another for any communication between the two components. This serialization and deserialization overhead can and often does cost more CPU time than the execution of the algorithms themselves.
The Apache Arrow project attempts to solve this problem by standardizing a way to represent this abstract data in memory, and providing libraries for popular programming languages to interact with this data format. The goal is to make transferring data between two processes as efficient as sharing a pool of Arrow-formatted memory between them. Arrow also specifies efficient ways of serializing Arrow data structures for (temporary) storage in files or streaming structures over a network, and can also be used by GPUs through CUDA. However, FPGA-based acceleration is at the time of writing missing from the core project. The Fletcher project attempts to bridge this gap, by providing an interface layer between the Arrow in-memory format and FPGA accelerators, presenting the memory to the accelerator in an abstract, tabular form.
In order to represent the complex, nested data types supported by Arrow, the Fletcher project had to devise its own data streaming format on top of the de-facto subset of AXI4-stream. Originally, this format was simply a means to an end, and therefore, not much thought was put into it. Particularly, as only Arrow-to-device interfaces (and back) were needed for the project, an interface designed specifically for intra-device streaming is lacking; in fact, depending on configuration, the reader and writer interfaces do not even match completely. Clearly, a standard for streaming complex data types between components is needed, both within the context of the Fletcher project, and outside of it.
As far as the writers are aware, no such standard exists as of yet. Defining such a standard in an open, royalty-free way is the primary purpose of this document.
Goals
-
Defining a streaming format for complex data types in the context of FPGAs and, potentially, ASICs, where "complex data types" include:
- multi-dimensional sequences of undefined length;
- unions (a.k.a. variants);
- structures such as tuples or records.
-
Doing the above in as broad of a way as possible, without imposing unnecessary burdens on the development of simple components.
-
Allowing for minimization of area and complexity through well-defined contracts between source and sink on top of the signal specification itself.
-
Extensibility: this specification should be as usable as possible, even to those with use cases not foreseen by this specification.
Non-goals
-
In this specification, a "streaming format" refers to the way in which the voltages on a bundle of wires are used to communicate data. We expressly do NOT mean streaming over existing communication formats such as Ethernet, and certainly not over abstract streams such as POSIX pipes or other inter-process communication paradigms. If you're looking for the former, have a look at Arrow Flight. The latter is part of Apache Arrow itself through their IPC format specification.
-
We do not intend to compete with the AXI4(-stream) specification. AXI4-stream is designed for streaming unstructured byte-oriented data; Tydi streams are for streaming structured, complex data types.
-
Tydi streams have no notion of multi-endpoint network-on-chip-like structures. Adding source and destination addressing or other routing information can be done through the `user` signal.
-
The primitive data type in Tydi is a group of bits. We leave the mapping from these bits to higher-order types such as numbers, characters, or timestamps to existing specifications.
Notation
The Tydi specification uses a flavor of mathematical notation to formally represent the defined concepts and operations thereupon. The nonstandard part of this notation is established in this section.
Nodes
The structures needed by the specification are complex to the point that simply using tuples becomes overly confusing. Therefore, we introduce a form of named tuple, which we will call a node (the term "type" would be more appropriate, but would be confusing within the context of this specification).
The notation for a node is as follows:
\[\textrm{NodeName}(...)\]
where \(\textrm{NodeName}\) is a title case identifier unique within the specification, and \(...\) is a number of field names in some pattern. For example:
\[\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)\]
A node carries with it implicit requirements on what the fields may be. Whenever an operation constructs a node, the operation is written in a way that these requirements are guaranteed by the preconditions of the operation.
The field names of nodes are significant. When an operation depends on one of the fields of the node, the fields may be "accessed" by subscripting the field name. For instance, if \(F\) is a \(\textrm{Fields}\) node, \(F_n\) represents the value for \(n\), and \(F_N\) represents the tuple consisting of \(N_1, N_2, ..., N_n\) of the node.
When the field being indexed is clear from context, the field itself is usually omitted; for instance, using just \(N\) implies the same thing as \(F_N\) if there is only one \(\textrm{Fields}\) node being discussed, and nothing else is using \(N\) for anything.
Nodes may also be unpacked into differently named values using the following notation:
Unpack \(F'\) into \(\textrm{Fields}(N'_1 : b'_1, N'_2 : b'_2, ..., N'_n : b'_n)\).
The following discussion may then use for instance \(N'\) to refer to \(F'_N\).
In some cases, only some of the fields of a node need to have a certain value for the following discussion to apply. We will then write only those fields within the parentheses, prefixing the values with the name and an equals sign, followed by an ellipsis to signify the unconstrained fields. For example, \(\textrm{Fields}(n=2, ...)\) signifies a \(\textrm{Fields}\) node with exactly two \(N\) and two \(b\), but the values thereof are unconstrained.
Functions
Due to the highly generic nature of the specification, parts of it are described algorithmically. The algorithms are referred to as functions. Like functions in a typical programming language, they have a signature, written as follows:
\[\textrm{functionName}(...) \rightarrow \textrm{ReturnType}\]
where \(\textrm{functionName}\) is a camelcase identifier unique within the specification, \(...\) is a number of argument names in some pattern, and \(\textrm{ReturnType}\) specifies what is returned, usually in the form of a node name.
A function carries with it implicit requirements on what its parameters may be. Whenever an operation uses a function, the operation is written in a way that these requirements are guaranteed by the preconditions of the operation.
Physical streams
A Tydi physical stream carries a stream of elements, dimensionality information for said elements, and (optionally) user-defined transfer information from a source to a sink. The contents of these three groups and to some extent the way in which they are transferred are defined in a parameterized way, to provide the necessary flexibility for the higher levels of this specification, and to give hardware developers the freedom to optimize their designs. Based on these parameters, this specification defines which signals must exist on the interface, and the significance of these signals.
Parameters
A physical stream is parameterized as \(\textrm{PhysicalStream}(E, N, D, C, U)\). The significance of these parameters is defined in the following subsections.
The parameters are defined on the interfaces of the source and the sink rather than on the stream between them, because the complexity parameter need not be the same for the source and the sink; the remaining parameters must match on both sides.
Element content (E) and user/transfer content (U)
\(E\) and \(U\) are of the form \(\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)\), where \(N\) are names, \(b\) are positive integers representing bit counts, and \(n\) is a nonnegative integer, describing a list of \(n\) named bit vector signals.
The element type can be given zero fields to make a "null" stream. Such streams can still be useful, as physical streams also carry metadata. The bit counts are however not allowed to be zero, because a lot of synthesis tools out there do not handle null ranges very well.
The difference between the element content and the user/transfer content of a stream is that a transfer may encode zero or more elements, but always encodes one instance of the user/transfer content.
The user/transfer signals are intended for adding lower-level user-defined protocol signals to the stream, similar to the use of the `*USER` signals in AXI4. For instance, physical streams could be transferred over a network-on-chip-like structure by attaching routing information such as source and destination addresses through this method.
The name of each field is a string consisting of letters, numbers, and/or underscores. It may be empty, represented in this specification as \(\varnothing\).
The name cannot start or end with an underscore.
The name cannot start with a digit.
It is illegal to start or end with an underscore or start with a number to prevent confusion when the name is prefixed to form the signal name, and for compatibility with VHDL.
The name must be case-insensitively unique within the set of named fields.
The identifier is case-insensitive because compatibility with VHDL is desired.
\(|\textrm{Fields}(N_1 : b_1, N_2 : b_2, ..., N_n : b_n)|\) is a shorthand defined to equal \(\sum_{i=1}^{n} b_i\); that is, the sum of the field bit count over all fields in the element.
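The following sketch, not part of the specification and using illustrative names, models a \(\textrm{Fields}\) node, checks the field name rules above, and computes the \(|\textrm{Fields}(...)|\) shorthand.

```rust
/// One named bit vector field of a Fields(N1: b1, ..., Nn: bn) node.
struct Field {
    name: String, // may be empty (written as the empty name in this specification)
    bits: u32,    // must be a positive integer
}

/// Checks the field name rules: letters, digits, and underscores only,
/// no leading digit, and no leading or trailing underscore. The empty name is
/// allowed. (Case-insensitive uniqueness must be checked across the whole set.)
fn name_is_valid(name: &str) -> bool {
    if name.is_empty() {
        return true;
    }
    name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        && !name.starts_with(|c: char| c.is_ascii_digit())
        && !name.starts_with('_')
        && !name.ends_with('_')
}

/// |Fields(...)|: the sum of the bit counts over all fields.
fn fields_width(fields: &[Field]) -> u32 {
    fields.iter().map(|f| f.bits).sum()
}
```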
Number of element lanes (N)
\(N\) must be an integer greater than or equal to one. The signals used for describing a data element are replicated this many times, allowing multiple elements to be transferred per stream handshake. Each replication is called a lane.
Dimensionality (D)
\(D\) must be an integer greater than or equal to zero. It specifies the number of `last` bits needed to represent the data.
Intuitively, each sequence nesting level adds one to this number. For instance, to stream two-dimensional sequences, two `last` bits are needed: one to mark the boundaries of the inner sequence, and one to mark the boundary of each two-dimensional sequence.
Complexity (C)
\(C\) must be a nonempty list of nonnegative integers. It encodes the guarantees a source makes about how elements are transferred. Equivalently, it encodes the assumptions a sink can safely make.
Higher C | Lower C |
---|---|
Fewer guarantees made by source | More guarantees made by source |
Source is easier to implement | Source is harder to implement |
Sink can make fewer assumptions | Sink can make more assumptions |
Sink is harder to implement | Sink is easier to implement |

C is usually specified as a period-separated list, like a version number or paragraph/section number, because like those things, it is orderable.
The complexity number carries the following intuitive significance.
\(C\) | Description |
---|---|
< 8 | Only one sequence can be terminated per transfer. |
< 7 | The indices of the active data lanes can be described with a simple range. |
< 6 | The range of active data lanes must always start with lane zero. |
< 5 | All lanes must be active for all but the last transfer of the innermost sequence. |
< 4 | The `last` flag cannot be postponed until after the transfer of the last element. |
< 3 | Innermost sequences must be transferred in consecutive cycles. |
< 2 | Whole outermost instances must be transferred in consecutive cycles. |

The exact requirements imposed for each \(C\) can be found along with the unconditional signal requirements in later sections.
The complexity levels and signals are defined such that a source with complexity \(C_a\) can be connected to a sink with complexity \(C_b \ge C_a\). A complexity number is higher than another when the leftmost integer of either is greater, and lower when the leftmost integer is lower. If the leftmost integer is equal, the next integer is checked. If one complexity number has more entries than another, the shorter number is padded with zeros on the right.
That is, it works a bit like a version number. The idea behind this is to allow future version of this specification (or any other kind of addition later) to insert new complexity levels between the ones defined here. For instance, 3 < 3.1 < 3.1.1 < 3.2 < 4.
A stream is considered to be normalized when \(C < 4\). At this complexity level, there is a one-to-one mapping between the transferred data and the actual transfers for any given \(N\). This is called the canonical representation of the data for a given \(N\).
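As an illustration of the ordering rule above (not part of the specification), complexity numbers can be compared entry by entry from the left, padding the shorter number with zeros on the right:

```rust
use std::cmp::Ordering;

/// Compares two complexity numbers, represented as lists of nonnegative
/// integers (e.g. [3, 1] for "3.1").
fn cmp_complexity(a: &[u32], b: &[u32]) -> Ordering {
    let len = a.len().max(b.len());
    for i in 0..len {
        // Pad the shorter number with zeros on the right.
        let ai = a.get(i).copied().unwrap_or(0);
        let bi = b.get(i).copied().unwrap_or(0);
        if ai != bi {
            return ai.cmp(&bi);
        }
    }
    Ordering::Equal
}

// With this ordering, 3 < 3.1 < 3.1.1 < 3.2 < 4, and a source with complexity
// C_a may be connected to a sink with complexity C_b whenever C_b >= C_a.
```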
Signals
A physical stream is comprised of the following signals.
Name | Origin | Purpose |
---|---|---|
valid | Source | Stalling the data stream due to the source not being ready. |
ready | Sink | Stalling the data stream due to the sink not being ready. |
data | Source | Data transfer of \(N\) \(|E|\)-bit elements. |
last | Source | Indicating the last transfer for \(D\) levels of nested sequences. |
stai | Source | Start index; encodes the index of the first valid lane. |
endi | Source | End index; encodes the index of the last valid lane. |
strb | Source | Strobe; encodes individual lane validity for \(C \ge 8\), empty sequences otherwise. |
user | Source | Additional control information carried along with the stream. |
The `valid` and `ready` signals are scalars, while all the other signals are bit vectors with the following widths.
Name | Width |
---|---|
valid | scalar |
ready | scalar |
data | \(N \times |E|\) |
last | \(N \times D\) |
stai | \(\lceil \log_2{N} \rceil\) |
endi | \(\lceil \log_2{N} \rceil\) |
strb | \(N\) |
user | \(|U|\) |
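The widths above follow directly from the parameters; the sketch below (illustrative names, not part of the specification) computes them from \(N\), \(D\), \(|E|\), and \(|U|\).

```rust
/// Physical stream parameters that determine the signal widths.
struct PhysicalStreamParams {
    element_bits: u32, // |E|
    lanes: u32,        // N, must be >= 1
    dim: u32,          // D
    user_bits: u32,    // |U|
}

/// ceil(log2(n)): the number of bits needed to index lanes 0..n-1.
fn index_width(n: u32) -> u32 {
    assert!(n >= 1);
    32 - (n - 1).leading_zeros()
}

impl PhysicalStreamParams {
    fn data_width(&self) -> u32 { self.lanes * self.element_bits } // N * |E|
    fn last_width(&self) -> u32 { self.lanes * self.dim }          // N * D
    fn stai_width(&self) -> u32 { index_width(self.lanes) }        // ceil(log2 N)
    fn endi_width(&self) -> u32 { index_width(self.lanes) }        // ceil(log2 N)
    fn strb_width(&self) -> u32 { self.lanes }                     // N
    fn user_width(&self) -> u32 { self.user_bits }                 // |U|
}
```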
Clock
All signals are synchronous to a single clock signal, shared between the source and sink.
It is up to the user to manage clock domains and clock domain crossing where needed, and specify the edge sensitivity of the clock(s).
Reset
This specification places no constraints on the reset signal. The requirements on the `valid` and `ready` signals ensure that no transfers occur when either the source or the sink is held under reset, so the source, synchronicity, and sensitivity of the reset signal(s) can be chosen by the user.
Detailed signal description
`valid` signal description
The `valid` signal has the same semantics as the `TVALID` AXI4-stream signal. These semantics are repeated here.
The `valid` signal indicates the presence of a valid transfer payload on the other downstream signals. It is active-high.
The state of the `valid` signal is always significant.
That is, if the source is in some undefined or "uninitialized" state, `valid` must nonetheless be low.
The downstream signals, including `valid`, must be stable (maintain their state) from the first active clock edge in which `valid` is asserted until acknowledgement of the transfer. Note that the validating and acknowledging clock edge can be one and the same if `ready` is already asserted during the first clock cycle; in this case there are no stability requirements.
In other words, a transfer may not be cancelled or modified after validation. This allows a sink to start doing some sort of multicycle processing as soon as `valid` is asserted, only asserting `ready` when it is done, without having to worry about cancellation or changes in the transfer content. This prevents the need for extra holding registers or latches otherwise needed to keep the original transfer.
When a transfer is acknowledged, either the next transfer is presented on the downstream signals and `valid` remains asserted, or `valid` is released.
[\(C < 3\)] `valid` may only be released when lane \(N - 1\) of the `last` signal in the acknowledged transfer is nonzero.
[\(C < 2\)] `valid` may only be released when lane \(N - 1\) of the `last` signal in the acknowledged transfer is all ones.
`valid` must be released while the source is being reset.
This prevents spurious transfers when the sink has a different reset source.
`ready` signal description
The `ready` signal has the same semantics as the `TREADY` AXI4-stream signal. These semantics are repeated here.
The `ready` signal indicates acknowledgement of a transfer initiated through `valid`. It is active-high.
The state of the `ready` signal is significant only while `valid` is asserted.
That is, a source must never wait for `ready` to assert before it asserts `valid`. Doing so can lead to deadlock. However, a sink may wait for `valid` to assert before asserting `ready`.
A transfer is considered "handshaked" when both `valid` and `ready` are asserted during the active clock edge of the clock domain common to the source and the sink.
`ready` must be released while the sink is being reset.
This prevents transfers from being lost when the source has a different reset source.
`data` signal description
The `data` signal consists of \(N\) concatenated lanes, each carrying a number of data element fields, totaling to \(|E|\) bits per lane.
The state of lane \(i\) in the `data` signal is significant only while `valid` is asserted and data lane \(i\) is active. Data lane \(i\) is considered active if and only if
- bit \(i\) of `strb` is asserted,
- the unsigned integer interpretation of `endi` is greater than or equal to \(i\), and
- the unsigned integer interpretation of `stai` is less than or equal to \(i\).
The redundancy of the `endi` and `stai` signals given the presence of the `strb` signal (only applicable for \(C \ge 7\)) has to do with the rule that it must always be possible to connect a source with low complexity to a sink with high complexity. That is, while a source that desires individual control over the lanes and thus has \(C \ge 7\) would probably always drive `endi` to \(N-1\) and `stai` to 0, a sink with complexity \(C \ge 7\) still needs to accept input from sources that do use `endi` and `stai`.
The overhead incurred by this should be negligible in practice. First of all, a stream with \(C \ge 7\) and significant \(N\) will be quite large to begin with, so managing two additional signals of size \(\lceil \log_2{N} \rceil\) probably doesn't matter much. Secondly, in most cases, constant propagation will remove the unneeded `endi` and `stai` signals whenever both source and sink have \(C \ge 7\). Finally, even if the above fails, going from `stai` + `endi` + `strb` to the internal lane enable signals in a sink is trivial for modern FPGAs and realistic \(N\), as the less-than-equal and greater-than-equal comparisons fit in a single 6-LUT up to and including \(N = 64\), resulting in just three decoding LUTs per lane (one for `stai`, one for `endi`, and one 3-input AND gate to merge the outputs of the first two and the `strb` bit).
The significant lanes are interpreted in order of increasing index for each transfer.
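A minimal sketch of the lane activation rule above (function and parameter names are illustrative, not part of the specification):

```rust
/// Returns whether data lane `i` is active in the current transfer, assuming
/// `valid` is asserted. Bit `i` of `strb` corresponds to lane `i`.
fn lane_active(i: u32, strb: u64, stai: u32, endi: u32) -> bool {
    let strb_bit = ((strb >> i) & 1) == 1;
    strb_bit && stai <= i && i <= endi
}

// Example: with N = 4, strb = 0b1111, stai = 1, and endi = 2, only lanes 1
// and 2 are active.
```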
`last` signal description
The `last` signal consists of \(N\) concatenated lanes, each carrying a termination bit for each nested sequence represented by the physical stream.
The `last` bit for lane \(i\) and dimension \(j\) (i.e., bit \(i \cdot D + j\)) being asserted indicates completion of a sequence with nesting level \(D - j - 1\), containing all elements streamed between the previous lane/transfer this occurred for dimension \(j' \ge j\) (or system reset, if never) exclusively and this point inclusively, where nesting level is defined to be 0 for the outermost sequence, 1 for its inner sequence, up to \(D - 1\) for the innermost sequence. It is active-high.
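A small sketch of the indexing rule above (illustrative, not part of the specification): the termination bit for lane \(i\) and dimension \(j\) is bit \(i \cdot D + j\) of the `last` signal.

```rust
/// Returns the `last` bit for lane `i`, dimension `j`, taking the full
/// N*D-bit `last` signal (LSB-first) and the dimensionality `d`.
fn last_bit(last: u128, d: u32, i: u32, j: u32) -> bool {
    ((last >> (i * d + j)) & 1) == 1
}
```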
For example, one way to represent the value `["Hello", "World"], ["Tydi", "is", "nice"], [""], []` with \(N = 6, C \ge 8\) is as follows. Note that \(D = 2\) follows from the data type, \(|E|\) depends on the character encoding, and the example does not depend on \(U\).

```
reset released here
 |    transfer -->
 v    A               B               C               D

last  "000100000000"  "000011000000"  "000001000100"  "000010010000"
data  "WolleH"        "yTdlro"        "insiid"        "----ec"
strb  "111111"        "111111"        "111111"        "000011"

        .------------.  .------------.  .------------.  .------------.
      0 | data: 'H'  |  | data: 'o'  |  | data: 'd'  |  | data: 'c'  |
        | last: 0 0  |  | last: 0 0  |  | last: 0 0  |  | last: 0 0  |
        |------------|  |------------|  |------------|  |------------|
      1 | data: 'e'  |  | data: 'r'  |  | data: 'i'  |  | data: 'e'  |
        | last: 0 0  |  | last: 0 0  |  | last: 0 1  |  | last: 0 0  |
        |------------|  |------------|  |------------|  |------------|
lane  2 | data: 'l'  |  | data: 'l'  |  | data: 'i'  |  | data: -    |
index   | last: 0 0  |  | last: 0 0  |  | last: 0 0  |  | last: 0 1  |
  |     |------------|  |------------|  |------------|  |------------|
  v   3 | data: 'l'  |  | data: 'd'  |  | data: 's'  |  | data: -    |
        | last: 0 0  |  | last: 1 1  |  | last: 0 1  |  | last: 1 0  |
        |------------|  |------------|  |------------|  |------------|
      4 | data: 'o'  |  | data: 'T'  |  | data: 'n'  |  | data: -    |
        | last: 0 1  |  | last: 0 0  |  | last: 0 0  |  | last: 1 1  |
        |------------|  |------------|  |------------|  |------------|
      5 | data: 'W'  |  | data: 'y'  |  | data: 'i'  |  | data: -    |
        | last: 0 0  |  | last: 0 0  |  | last: 0 0  |  | last: 1 0  |
        '------------'  '------------'  '------------'  '------------'
```
"Hello"
is delimited by the reset condition and the assertedlast
bit in transfer A, lane 4, dimension 0 (innermost)."World"
is delimited by the aforementioned and thelast
bit in transfer B, lane 3, also for dimension 0. Thelast
bit for B3/dimension 1 is also asserted, which delimits["Hello", "World"]
along with the reset condition; both the"Hello"
and"World"
sequence are fully contained by the delimited portion of the stream, so the outer sequence has those as its two entries."Tydi"
is delimited by B3 and C1,"is"
by C1 and C3, and"nice"
by C3 and D2 for dimension 0. Note that the D2 is postponed beyond the last element (D1) by means of deactivating data lane D2. Thelast
flag for the surrounding sequence is similarly postponed to D3; note though that data lane D3 must be inactive, or the last bit for D3/dimension 0 must be asserted for the data stream to be legal. The next""
is delimited by D3/dimension 1 and D4/dimension 0, and[""]
is delimited by D3 and D4 for dimension 1. The final[]
is delimited by D4 and D5 for dimension 1. It does not contain an inner sequence item because D5/dimension 0 is not asserted, so the outer sequence is empty.
The state of lane \(i\) in the `last` signal is significant only while `valid` is asserted.
Note that the validity of the `last` lanes is thus not controlled by which data lanes are active. This allows sequences to be terminated without sending an element; this is in fact required in order to send empty sequences, but also allows postponing the end-of-sequence marker until after the last element (if complexity allows) as the previous example shows.
It is illegal to assert a `last` bit for dimension \(j\) without first terminating all dimensions \(j' < j\), if any data lanes were active since the previous assertion of a `last` bit for dimension 0.
For example, the following is illegal because of the above:
```
       .------------.
     0 | data: '1'  |
       | last: 0 0  |
       |------------|
     1 | data: '2'  |
       | last: 0 1  |
       |------------|
lane 2 | data: '3'  |
index  | last: 0 0  |
  |    |------------|
  v  3 | data: '4'  |
       | last: 1 0  |
       |------------|
     4 | data: '5'  |
       | last: 0 0  |
       |------------|
     5 | data: '6'  |
       | last: 1 1  |
       '------------'
```
An argument could be made that this encodes `[[1, 2], 3, 4], [[5, 6]]`, but allowing this interpretation would make the stream significantly less type-safe, and there would be no way to encode non-sequence elements before the last inner sequence in the outer sequence without introducing a `start` signal as well.
An argument could also be made that this encodes `[[1, 2]], [[3, 4, 5, 6]]`, because the `last` flag delimiting `[1, 2]` falls within the first outer sequence and the `last` flag delimiting `[3, 4, 5, 6]` falls within the second. However, this would unnecessarily complicate sink logic, make manual interpretation of the data structure more context-sensitive, and so on.
One could also argue that it encodes `[[1, 2]], [[5, 6]]`, because `3` and `4` are not enclosed by any inner sequence, but then it makes more sense to just deactivate the lanes.
Ultimately, this representation is ambiguous or at least leads to confusion, and is therefore illegal.
[\(C < 8\)] All last bits for lanes 0 to \(N - 2\) inclusive must be driven low by the source, and may be ignored by the sink.
The above rule ultimately means that the `last` information is transferred on a transfer basis instead of on an element basis, similar to AXI4-stream. This can significantly decrease decoding complexity, but only allows one innermost sequence to be transferred per cycle regardless of \(N\) and the length of the sequence.
[\(C < 4\)] It is illegal to assert a `last` bit for dimension \(j\) without also asserting the `last` bits for dimensions \(j' < j\) in the same lane.
[\(C < 4\)] It is illegal to assert the `last` bit for dimension 0 when the respective data lane is inactive, except for empty sequences.
The above two rules mean that the `last` flags cannot be postponed.
`stai` signal description
The `stai` signal (start index) is used to deactivate data lanes with an index lower than the encoded binary number. It is a bit vector of size \(\lceil \log_2{N} \rceil\) to encode the full index range.
The state of the `stai` signal is significant only while `valid` is asserted.
It is illegal to drive `stai` to the value \(N\) or greater.
This implies that `stai` cannot be used to disable the last lane.
[\(C < 6\)] `stai` must always be driven to 0 by the source, and may be ignored by the sink.
`endi` signal description
The `endi` signal (end index) is used to deactivate data lanes with an index greater than the encoded binary number. It is a bit vector of size \(\lceil \log_2{N} \rceil\) to encode the full index range.
Note that `endi` cannot be used to disable the first lane.
The state of the `endi` signal is significant only while `valid` is asserted.
It is illegal to drive `endi` to the value \(N\) or greater.
This would semantically not be different from driving \(N - 1\), so there is no reason to burden the sink logic by allowing this case.
It is illegal to drive `endi` to a value less than `stai`.
This would essentially encode a VHDL null range, which, if allowed, could be used to encode empty sequences. However, detecting this condition is relatively hard on FPGA resources (timing in particular), typically requiring a carry chain for \(N > 8\). Instead, use `strb` for this purpose; at lower complexities all `strb` bits must be equal, so only a single bit has to be checked to determine whether a transfer carries any data.
[\(C < 5\)] `endi` must be driven to \(N - 1\) by the source when `last` is zero, and may be ignored by the sink in this case.
Together with the other complexity rules up to this level, this means that all lanes must be used for all but the last transfer containing elements for the innermost sequence. This furthermore implies that the element with innermost index \(i\) is always transferred on data lane \(i \mod N\).
`strb` signal description
The `strb` signal (strobe) is used to deactivate individual data lanes. It is a bit vector of size \(N\). When a `strb` bit is low, the associated data lane is deactivated.
Note that the opposite (`strb` high -> activated) is not necessarily true due to `endi` and `stai`.
The state of the `strb` signal is significant only while `valid` is asserted.
[\(C < 8\)] All `strb` bits must be driven to the same value by the source. The sink only needs to interpret one of the bits.
This effectively reduces the `strb` signal to a single bit indicating whether the transfer is empty (low) or not (high), as it is illegal to deactivate all data lanes by driving `endi` < `stai`.
`user` signal description
The `user` signal is the concatenation of a number of user-specified fields, carrying additional transfer-oriented information. The significance of this is user-defined.
The state of the `user` signal is not significant when `valid` is not asserted.
The opposite is not necessarily true; it is up to the user's specifications if (part of) the user signal can be insignificant despite `valid` being asserted.
Signal omission
Not all signals are always needed. The following table shows the condition for a signal to be relevant. If the condition is false, the signal may be omitted on the interface. When two interfaces with differing but compatible complexities are connected together, the default value specified in the table below must be driven for the omitted signals.
Name | Condition | Default |
---|---|---|
valid | see below | '1' |
ready | see below | '1' |
data | \(|E| > 0\) | all '0' |
last | \(D \ge 1\) | all '1' |
stai | \(C \ge 6 \wedge N > 1\) | 0 |
endi | \((C \ge 5 \vee D \ge 1) \wedge N > 1\) | \(N-1\) |
strb | \(C \ge 7 \vee D \ge 1\) | all '1' |
user | \(|U| > 0\) | all '0' |
`valid` may be omitted for sources that are always valid.
For example, a constant or status signal source (the latter must be valid during reset as well for this to apply).
`ready` may be omitted for sinks that never block.
For example, a sink that voids all incoming data in order to measure performance of the source.
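The conditions in the table translate directly into predicates. The sketch below uses illustrative names and is not part of the specification; for brevity it assumes the comparisons against \(C\) are done on the leftmost entry of the complexity number, which suffices for integer thresholds.

```rust
/// Parameters relevant to signal omission.
struct StreamParams {
    element_bits: u32, // |E|
    lanes: u32,        // N
    dim: u32,          // D
    c: u32,            // leftmost entry of the complexity number C
    user_bits: u32,    // |U|
}

impl StreamParams {
    fn has_data(&self) -> bool { self.element_bits > 0 }                        // |E| > 0
    fn has_last(&self) -> bool { self.dim >= 1 }                                // D >= 1
    fn has_stai(&self) -> bool { self.c >= 6 && self.lanes > 1 }                // C >= 6 and N > 1
    fn has_endi(&self) -> bool { (self.c >= 5 || self.dim >= 1) && self.lanes > 1 }
    fn has_strb(&self) -> bool { self.c >= 7 || self.dim >= 1 }                 // C >= 7 or D >= 1
    fn has_user(&self) -> bool { self.user_bits > 0 }                           // |U| > 0
}
```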
Interface conventions
Streams may be named, in order to prevent name conflicts due to multiple streams existing within the same namespace. Such a name is to be prefixed to the signal names using a double underscore.
Double underscores are used as a form of unambiguous hierarchy separation to allow user-specified field names in the logical stream types (defined later) to contain (non-consecutive) underscores without risk of name conflicts.
The canonical representation for the `data` and `user` signals is the LSB-first concatenation of the contained fields. For \(N > 1\), the lanes are concatenated LSB first, such that the lane index is major and the field index is minor.
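A sketch of this layout (illustrative, not part of the specification): fields are packed LSB-first within a lane, and lanes are packed LSB-first one after another, so the lane index is major.

```rust
/// Packs element fields into the `data` signal. `lanes[l][f]` is the value of
/// field `f` in lane `l`; `field_bits[f]` is the width of field `f`. Assumes
/// each field is narrower than 128 bits and the total fits in 128 bits.
fn pack_data(lanes: &[Vec<u64>], field_bits: &[u32]) -> u128 {
    let mut data: u128 = 0;
    let mut offset = 0u32;
    for lane in lanes {
        // Lane index is major: lane 0 occupies the least significant bits.
        for (value, bits) in lane.iter().zip(field_bits) {
            let mask = (1u128 << bits) - 1;
            data |= ((*value as u128) & mask) << offset;
            offset += bits; // fields are concatenated LSB-first within the lane
        }
    }
    data
}
```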
Where applicable, the signals must be listed in the following order: `dn`, `up`, `valid`, `ready`, `data`, `last`, `stai`, `endi`, `strb`, `user` (`dn` and `up` are part of the alternative representation defined below).
This is mostly just a consistency thing, primarily because it helps to get used to a single ordering when interpreting simulation waveforms. It may also simplify tooling.
The signal names must be lowercase.
This is for interoperability between languages that differ in case sensitivity.
Alternative representation
To improve code readability in hardware definition languages supporting array and aggregate constructs (record, struct, ...), the following changes are permissible. However, it is recommended to fall back to the conventions above for interoperability with other streamlets on the "outer" interfaces of an IP block.
-
`valid`, `data`, `last`, `stai`, `endi`, `strb`, and `user` may be bundled in an aggregate type named `<stream-name>__dn__type` with signal name `<stream-name>__dn` (`dn` is short for "downstream").
-
`ready` may be "bundled" in an aggregate type named `<stream-name>__up__type` with signal name `<stream-name>__up` for symmetry (`up` is short for "upstream").
-
If the language allows for signal direction reversal within a bundle, all stream signals may also be bundled into a single type named `<stream-name>__type`, with `ready` in the reverse direction.
-
The data and user fields may be bundled in aggregate types named `<stream-name>__data__type` and `<stream-name>__user__type` respectively. The `data` signal becomes an array of `<stream-name>__data__type`s from 0 to \(N - 1\) if \(N > 1\) or when otherwise desirable.
-
Data and user fields consisting of a single bit may be interpreted as either a bit vector of size one or a scalar bit depending on context.
-
Fields with a common double-underscore-delimited prefix may be aggregated recursively using `<stream-name>__data__<common-prefix>__type`, in such a way that the double underscores in the canonical signal name are essentially replaced with the hierarchy separator of the hardware definition language.
Arrays, vectors, and concatenations
Where applicable, the bitrange for bit vector signals is always `n-1 downto 0`, where `n` is the number of bits in the signal.
Concatenations of bit vector signals are done LSB-first.
That is, lower indexed entries use lower bit indices and thus occur on the right-hand side when the bit vector is written as a binary number. Note that this results in the inverse order when using the concatenation operator in VHDL (and possibly other languages). LSB-first order is chosen because it is more intuitive when indexing the vector. Ultimately this is just a convention.
Where applicable, the range for arrays is always `0 to n-1`, where `n` is the number of entries in the array.
This is the more natural order for array-like structures, actually putting the first entry on the left-hand side.
Logical streams
A Tydi logical stream consists of a bundle of physical streams and a group of user-defined signals, parameterized by way of a logical stream type. The logical stream type, in turn, represents some representation of an abstract data type (in the same way that for instance `std::vector` and `std::list` both represent a sequence in C++). Conceptually, the resulting logical stream can then be used to stream instances of its corresponding abstract data type from the source to the sink. The sink can only receive or operate on one instance at a time and only in sequence (hence the name "stream"), but the logical stream type can be designed such that the sink has random-access capability within the current instance.
Because of the degree of flexibility in specifying Tydi logical streams, much of this specification consists of the formal definitions of the structures needed to describe a Tydi logical stream, and operations thereupon. The resulting constraints on the signal lists and behavior of the signals are then described based on these.
Logical stream type
This structure is at the heart of the logical stream specification. It is used both to specify the type of a logical stream and internally for the process of lowering the recursive structure down to physical streams and signals.
Structure
The logical stream type is defined recursively by means of a number of node types. Two classes of nodes are defined: stream-manipulating nodes, and element-manipulating nodes. Only one stream-manipulating node exists (\(\textrm{Stream}\)), which is defined in the first subsection. The remainder of the subsections describe the element-manipulating nodes.
Stream
The \(\textrm{Stream}\) node is used to define a new physical stream. It is defined as \(\textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), where:
-
\(T_e\) is any logical stream type, representing the type of element carried by the logical stream;
This may include additional \(\textrm{Stream}\) nodes to handle cases such as structures containing sequences, handled in Tydi by transferring the two different dimensions over two different physical streams, as well as to encode request-response patterns.
-
\(t\) is a positive real number, representing the minimum number of elements that should be transferrable on the child stream per element in the parent stream, or if there is no parent stream, the minimum number of elements that should be transferrable per clock cycle;
\(t\) is short for throughput ratio. As we'll see later, the \(N\) parameter for the resulting physical stream equals \(\left\lceil\prod t\right\rceil\) for all ancestral \(\textrm{Stream}\) nodes, including this node.
The significance of this is as follows. Let's say that you have a structure like \(\textrm{Stream}(T_e = \textrm{Group}(\textrm{a}: \textrm{Bits}(16), \textrm{b}: \textrm{Stream}(d = 1, T_e = \textrm{Bits}(8), ...)), ...)\), you're expecting that you'll be transferring about one of those groups every three cycles, and you're expecting that the list size of `b` will be about 8 on average. In this case, \(t = 1/3\) for the outer stream node, and \(t = 8\) for the inner stream node. That'll give you \(\lceil 1/3\rceil = 1\) element lane on the outer stream (used at ~33% efficiency) and \(\lceil 1/3 \cdot 8\rceil = 3\) element lanes for the inner `b` stream (used at ~88% efficiency), such that neither stream will on average be slowing down the design. A small code sketch of this computation is given at the end of this section.
\(d\) is a nonnegative integer, representing the dimensionality of the child stream with respect to the parent stream;
As we'll see later, the \(D\) parameter for the resulting physical stream equals \(\sum d\) for all ancestral \(\textrm{Stream}\) nodes, including this node, up to and including the closest ancestor (or self) for which \(s = \textrm{Flatten}\).
-
\(s\) must be one of \(\textrm{Sync}\), \(\textrm{Flatten}\), \(\textrm{Desync}\), or \(\textrm{FlatDesync}\), representing the relation between the dimensionality information of the parent stream and the child stream;
The most common relation is \(\textrm{Sync}\), which enforces that for each element in the parent stream, there must be one \(d\)-dimensional sequence in the child stream (if \(d\) is 0, that just means one element). Furthermore, it means that the `last` flags from the parent stream are repeated in the child stream. This repetition is technically redundant; since the sink will have access to both streams, it only needs this information once to decode the data.
If this redundant repetition is not necessary for the sink or hard to generate for a source, \(\textrm{Flatten}\) may be used instead. It is defined in the same way, except the redundant information is not repeated for the child stream.
Consider for example the data `[(1, [2, 3]), (4, [5])], [(6, [7])]`. If encoded as \(\textrm{Stream}(d=1, T_e=\textrm{Group}(\textrm{Bits}(8), \textrm{Stream}(d=1, s=\textrm{Sync}, ...)), ...)\), the first physical stream will transfer `[1, 4], [6]`, and the second physical stream will transfer `[[2, 3], [5]], [[7]]`. If \(\textrm{Flatten}\) is used for the inner \(\textrm{Stream}\) instead, the data transferred by the second physical stream reduces to `[2, 3], [5], [7]`; the structure of the outer sequence is implied by the structure transferred by the first physical stream.
; the structure of the outer sequence is implied by the structure transferred by the first physical stream.\(\textrm{Desync}\) may be used if the relation between the elements in the child and parent stream is dependent on context rather than the
last
flags in either stream. For example, for the type \(\textrm{Stream}(d=1, T_e=\textrm{Group}(\textrm{Bits}(8), \textrm{Stream}(s=\textrm{Desync}, ...)), ...)\), if the first stream carries[1, 4], [6]
as before, the child stream may carry anything of the form[...], [...]
. That is, the dimensionality information (if any) must still match, but the number of elements in the innermost sequence is unrelated or only related by some user-defined context.\(\textrm{FlatDesync}\), finally, does the same thing as \(\textrm{Desync}\), but also strips the dimensionality information from the parent. This means there the relation between the two streams, if any, is fully user-defined.
-
\(c\) is a complexity number, as defined in the physical stream specification, representing the complexity for the corresponding physical stream interface;
-
\(r\) must be one of \(\textrm{Forward}\) or \(\textrm{Reverse}\), representing the direction of the child stream with respect to the parent stream, or if there is no parent stream, with respect to the direction of the logical stream (source to sink);
\(\textrm{Forward}\) indicates that the child stream flows in the same direction as its parent, complementing the data of its parent in some way. Conversely, \(\textrm{Reverse}\) indicates that the child stream acts as a response channel for the parent stream. If there is no parent stream, \(\textrm{Forward}\) indicates that the stream flows in the natural source to sink direction of the logical stream, while \(\textrm{Reverse}\) indicates a control channel in the opposite direction. The latter may occur for instance when doing random read access to a memory; the first stream carrying the read commands then flows in the sink to source direction.
-
\(T_u\) is a logical stream type consisting of only element-manipulating nodes, representing the
user
signals of the physical stream; that is, transfer-oriented data (versus element-oriented data); andNote that while \(T_u\) is called a "logical stream type" for consistency, the \(\textrm{Stream}\) node being illegal implies that no additional streams can be specified within this type.
-
\(x\) is a boolean, representing whether the stream carries "extra" information beyond the `data` and `user` signal payloads.
\(x\) is normally false, which implies that the \(\textrm{Stream}\) node will not result in a physical stream if both its `data` and `user` signals would be empty according to the rest of this specification; it is effectively optimized away. Setting \(x\) to true simply overrides this behavior.
This optimization handles a fairly common case when the logical stream type is recursively generated by tooling outside the scope of this layer of the specification. It arises for instance for nested lists, which may result in something of the form \(\textrm{Stream}(d=1, T_e=\textrm{Stream}(d=1, s=\textrm{Sync}, T_e=...))\). The outer stream is not necessary here; all the dimensionality information is carried by the inner stream.
It could be argued that the above example should be reduced to \(\textrm{Stream}(d=2, s=\textrm{Sync}, T_e=...)\) by said external tool. Indeed, this description is equivalent. This reduction is however very difficult to define or specify in a way that it handles all cases intuitively, hence this more pragmatic solution was chosen.
As the \(\textrm{Stream}\) node carries many parameters and can therefore be hard to read, some abbreviations are in order:
- \(\textrm{Dim}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 1, \textrm{Sync}, c, \textrm{Forward}, T_u, \textrm{false})\)
- \(\textrm{New}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Sync}, c, \textrm{Forward}, T_u, \textrm{false})\)
- \(\textrm{Des}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Desync}, c, \textrm{Forward}, T_u, \textrm{false})\)
- \(\textrm{Flat}(T_e, t, c, T_u)\ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Flatten}, c, \textrm{Forward}, T_u, \textrm{false})\)
- \(\textrm{Rev}(T_e, t, c, T_u) \ \rightarrow\ \textrm{Stream}(T_e, t, 0, \textrm{Sync}, c, \textrm{Reverse}, T_u, \textrm{false})\)
For the above abbreviations, the following defaults apply in addition:
- \(T_u = \textrm{Null}\);
- \(c =\) complexity of parent stream (cannot be omitted if there is no parent);
- \(t = 1\).
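As mentioned for \(t\) and \(d\) above, the physical stream parameters follow from the chain of ancestral \(\textrm{Stream}\) nodes. The sketch below (illustrative names, not part of the specification; it ignores the \(\textrm{Flatten}\) subtlety for \(D\)) computes \(N = \lceil\prod t\rceil\) and \(D = \sum d\).

```rust
/// N for a stream node: the ceiling of the product of the throughput ratios t
/// of the node and all of its ancestors, listed outermost first.
fn lane_count(ts: &[f64]) -> u32 {
    ts.iter().product::<f64>().ceil() as u32
}

/// D for a stream node: the sum of the d parameters over the same chain
/// (up to the closest ancestor with s = Flatten, which is not modeled here).
fn dimensionality(ds: &[u32]) -> u32 {
    ds.iter().sum()
}

// Example from the note on t above: an outer stream with t = 1/3 and an inner
// stream with t = 8 give lane_count(&[1.0 / 3.0]) == 1 for the outer stream
// and lane_count(&[1.0 / 3.0, 8.0]) == 3 for the inner one.
```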
Null
The \(\textrm{Null}\) leaf node indicates the transference of one-valued data: the only valid value is \(\varnothing\) (null).
This is useful primarily when used within a \(\textrm{Union}\), where the data type itself carries information. Furthermore, \(\textrm{Stream}\) nodes carrying a \(\textrm{Null}\) will always result in a physical stream (whereas \(\textrm{Stream}\) nodes carrying only other streams will be "optimized out"), which may still carry dimensionality information, user data, or even just transfer timing information.
Bits
The \(\textrm{Bits}\) leaf node, defined as \(\textrm{Bits}(b)\), indicates the transference of \(2^b\)-valued data carried by means of a group of \(b\) bits, where \(b\) is a positive integer.
Tydi does not specify how the bits should be interpreted: such interpretation is explicitly out of scope.
Group
The \(\textrm{Group}\) node acts as a product type (composition). It is defined as \(\textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), where:
- \(N_i\) is a string consisting of letters, numbers, and/or underscores, representing the name of the \(i\)th field;
- \(T_i\) is the logical stream type for the \(i\)th field;
- \(n\) is a nonnegative integer, representing the number of fields.
Intuitively, each instance of type \(\textrm{Group}(...)\) consists of instances of all the contained types. This corresponds to a `struct` or `record` in most programming languages.
The names cannot contain two or more consecutive underscores.
Double underscores are used by Tydi as a hierarchy separator when flattening structures for the canonical representation. If not for the above rule, something of the form \(\textrm{Group}(\textrm{'a'}:\textrm{Group}(\textrm{'b__c'}:...),\textrm{'a__b'}:\textrm{Group}(\textrm{'c'}:...))\) would result in a name conflict: both signals would be named `a__b__c`.
The names cannot start or end with an underscore.
The names cannot start with a digit.
The names cannot be empty.
The names must be case-insensitively unique within the group.
The above requirements on the name mirror the requirements on the field names for the physical streams.
Union
The \(\textrm{Union}\) node acts as a sum type (exclusive disjunction). It is defined as \(\textrm{Union}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), where:
- \(N_i\) is a string consisting of letters, numbers, and/or underscores, representing the name of the \(i\)th field;
- \(T_i\) is the logical stream type for the \(i\)th field;
- \(n\) is a positive integer, representing the number of fields.
Intuitively, each instance of type \(\textrm{Union}\) consists of an instance of one of the contained types (also known as variants), as well as a tag indicating which field that instance belongs to. This corresponds to a `union` or `enum` in most programming languages. Mathematically, Tydi unions represent tagged unions.
One might argue that unions could also be specified without the \(\textrm{Union}\) node, by simply grouping the variants together along with a manually specified \(\textrm{Bits}\) field for the union tag, and using don't-cares for the variants not selected through the tag field. However, there are a number of downsides to such a sparse union approach.
- For unions with many variants, the `data` signal of the corresponding stream becomes very wide.
- Any streams nested inside the union variants would need to be stuffed with don't-care transfers as well, which may result in additional control logic, delay, area, etc.
While such sparse unions have their place, dense unions can be more suitable for a given application. These could also be represented without a \(\textrm{Union}\) node, but not in a nicely recursive manner without loss of information about the variant types. Therefore, (dense) unions were made a first-class type through \(\textrm{Union}\).
\(\textrm{Union}\)s can also be used to elegantly describe nullable types. The pattern for that is simply \(\textrm{Union}(\textrm{Null}, T)\). In this case, the union's tag field acts like a validity bit.
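In software terms (a loose analogy, not part of the specification), a Tydi \(\textrm{Union}\) corresponds to a tagged union, and the nullable pattern \(\textrm{Union}(\textrm{Null}, T)\) corresponds to an optional value:

```rust
/// Union(a: Bits(8), b: Bits(16)): a tag plus the data of exactly one variant
/// per instance, like a Rust enum.
enum Example {
    A(u8),
    B(u16),
}

/// Union(Null, T): the tag alone distinguishes "no value" from "some value",
/// acting as a validity bit, like Option<T>.
type Nullable<T> = Option<T>;
```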
The names cannot contain two or more consecutive underscores.
Double underscores are used by Tydi as a hierarchy separator when flattening structures for the canonical representation. If not for the above rule, something of the form \(\textrm{Union}(\textrm{'a'}:\textrm{Union}(\textrm{'b__c'}:\textrm{Stream}...),\textrm{'a__b'}:\textrm{Union}(\textrm{'c'}:\textrm{Stream}...))\) would result in a name conflict: both physical streams would be named `a__b__c`.
The names cannot start or end with an underscore.
The names cannot start with a digit.
The names cannot be empty.
The names must be case-insensitively unique within the union.
The above requirements on the name mirror the requirements on the field names for the physical streams.
Operations on logical stream
This section defines some operations that may be performed on logical stream types, most importantly the \(\textrm{synthesize}()\) function, which returns the signals and physical streams for a given type.
Null detection function
This section defines the function \(\textrm{isNull}(T_{in}) \rightarrow \textrm{Bool}\), which returns true if and only if \(T_{in}\) does not result in any signals.
The function returns true if and only if one or more of the following rules match:
-
\(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), \(\textrm{isNull}(T_e)\) is true, \(\textrm{isNull}(T_u)\) is true, and \(x\) is false;
-
\(T_{in} = \textrm{Null}\);
-
\(T_{in} = \textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\) and \(\textrm{isNull}(T_i)\) is true for all \(i \in (1, 2, ..., n)\); or
-
\(T_{in} = \textrm{Union}(N_1: T_1)\) and \(\textrm{isNull}(T_1)\) is true.
Note that any union with more than one field is not null regardless of its types, because the union tag also carries information in that case.
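A sketch of these rules over a simplified logical stream type (type and function names are illustrative, not part of the specification):

```rust
enum LogicalType {
    Null,
    Bits(u32),
    Group(Vec<(String, LogicalType)>),
    Union(Vec<(String, LogicalType)>),
    Stream {
        element: Box<LogicalType>, // T_e
        user: Box<LogicalType>,    // T_u
        extra: bool,               // x
    },
}

/// True if and only if the type results in no signals.
fn is_null(t: &LogicalType) -> bool {
    match t {
        LogicalType::Null => true,
        LogicalType::Bits(_) => false,
        // A group is null iff all of its fields are null.
        LogicalType::Group(fields) => fields.iter().all(|(_, t)| is_null(t)),
        // A union is null only if it has exactly one field and that field is
        // null; with more fields, the tag itself carries information.
        LogicalType::Union(fields) => fields.len() == 1 && is_null(&fields[0].1),
        // A stream is null iff its element and user types are null and x is false.
        LogicalType::Stream { element, user, extra } => {
            !*extra && is_null(element) && is_null(user)
        }
    }
}
```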
Split function
This section defines the function \(\textrm{split}(T_{in}) \rightarrow \textrm{SplitStreams}\), where \(T_{in}\) is any logical stream type. The \(\textrm{SplitStreams}\) node is defined as \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_n : T_n)\), where:
- \(T_{signals}\) is a logical stream type consisting of only element-manipulating nodes;
- all \(T_{1..n}\) are \(\textrm{Stream}\) nodes with the data and user subtypes consisting of only element-manipulating nodes;
- all \(N\) are case-insensitively unique, emptyable strings consisting of letters, numbers, and/or underscores, not starting or ending in an underscore, and not starting with a digit; and
- \(n\) is a nonnegative integer.
Intuitively, it splits a logical stream type up into simplified stream types, where \(T_{signals}\) contains only the information for the user-defined signals, and \(T_i\) contains only the information for the physical stream with index \(i\) and name \(N_i\).
The names can be empty, because if there are no \(Group\)s or \(Union\)s, the one existing stream or signal will not have an intrinsic name given by the type. Empty strings are represented here with \(\varnothing\). Note that the names here can include double underscores.
\(\textrm{split}(T_{in})\) is evaluated as follows.
-
If \(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), apply the following algorithm.
- Initialize \(N\) and \(T\) to empty lists.
- \(T_{element} := \textrm{split}(T_e)_{signals}\) (i.e., the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function)
- If \(\textrm{isNull}(T_{element})\) is false, \(\textrm{isNull}(T_u)\) is false, or \(x\) is true:
- Append \(\varnothing\) (an empty name) to the end of \(N\).
- Append \(\textrm{Stream}(T_{element}, t, d, s, c, r, T_u, x)\) to the end of \(T\).
- Extend \(N\) with \(\textrm{split}(T_e)_{names}\) (i.e., all stream names returned by the \(\textrm{split}\) function).
- For all \(T' \in \textrm{split}(T_e)_{streams}\) (i.e., for all named streams returned by the \(\textrm{split}\) function):
- Unpack \(T'\) into \(\textrm{Stream}(T'_d, t', d', s', c', r', T'_u, x')\). This is always possible due to the postconditions of \(\textrm{split}\).
- If \(r = \textrm{Reverse}\), reverse the direction of \(r'\).
- If \(s = \textrm{Flatten}\) or \(s = \textrm{FlatDesync}\), assign \(s' := \textrm{FlatDesync}\).
- If \(s' \ne \textrm{Flatten}\) and \(s \ne \textrm{FlatDesync}\), assign \(d' := d' + d\).
- \(t' := t' \cdot t\).
- Append \(\textrm{Stream}(T'_d, t', d', s', c', r', T'_u, x')\) to \(T\).
- Return \(\textrm{SplitStreams}(\textrm{Null}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).
-
If \(T_{in} = \textrm{Null}\) or \(T_{in} = \textrm{Bits}(b)\), return \(\textrm{SplitStreams}(T_{in})\).
-
If \(T_{in} = \textrm{Group}(N_{g,1} : T_{g,1}, N_{g,2} : T_{g,2}, ..., N_{g,n} : T_{g,n})\), apply the following algorithm.
- \(T_{signals} := \textrm{Group}(N_{g,1} : \textrm{split}(T_{g,1})_{signals}, ..., N_{g,n} : \textrm{split}(T_{g,n})_{signals})\) (i.e., replace the types contained by the group with the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function applied to the original types).
- Initialize \(N\) and \(T\) to empty lists.
- For all \(i \in 1..n\):
- For all \(\textrm{name} \in \textrm{split}(T_{g,i})_{names}\) (i.e., for all stream names returned by the \(\textrm{split}\) function):
- If \(\textrm{name} = \varnothing\), append \(N_{g,i}\) to the end of \(N\).
- Otherwise, append the concatenation of \(N_{g,i}\) and \(\textrm{name}\), separated by a double underscore, to the end of \(N\).
- Extend \(T\) with \(\textrm{split}(T_{g,i})_{streams}\) (i.e., all named streams returned by the \(\textrm{split}\) function).
- Return \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).
-
If \(T_{in} = \textrm{Union}(N_{u,1} : T_{u,1}, N_{u,2} : T_{u,2}, ..., N_{u,n} : T_{u,n})\), apply the following algorithm.
- \(T_{signals} := \textrm{Union}(N_{u,1} : \textrm{split}(T_{u,1})_{signals}, ..., N_{u,n} : \textrm{split}(T_{u,n})_{signals})\) (i.e., replace the types contained by the union with the \(T_{signals}\) item of the tuple returned by the \(\textrm{split}\) function applied to the original types).
- Initialize \(N\) and \(T\) to empty lists.
- For all \(i \in 1..n\):
- For all \(\textrm{name} \in \textrm{split}(T_{u,i})_{names}\) (i.e., for all stream names returned by the \(\textrm{split}\) function):
- If \(\textrm{name} = \varnothing\), append \(N_{u,i}\) to the end of \(N\).
- Otherwise, append the concatenation of \(N_{u,i}\) and \(\textrm{name}\), separated by a double underscore, to the end of \(N\).
- Extend \(T\) with \(\textrm{split}(T_{u,i})_{streams}\) (i.e., all named streams returned by the \(\textrm{split}\) function).
- Return \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_m : T_m)\), where \(m = |N| = |T|\).
Note that the algorithm for \(\textrm{Group}\) and \(\textrm{Union}\) is the same, aside from returning \(\textrm{Group}\) vs \(\textrm{Union}\) nodes for \(T_{signals}\).
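The sketch below restates the algorithm in Python, under the same hypothetical tuple encoding as the earlier \(\textrm{isNull}\) sketch; \(\textrm{SplitStreams}\) is modelled as a pair of the signals type and a list of (name, stream type) entries, with `None` standing in for the empty name. It is a simplified illustration under stated assumptions, not a normative implementation.

```python
# Minimal sketch of split() under the hypothetical tuple encoding used above.
# SplitStreams is modelled as (signals_type, [(name, stream_type), ...]);
# None stands in for the empty name, "__" is the hierarchy separator.

def is_null(t):  # repeated here so the sketch is self-contained
    kind = t[0]
    if kind == "Null":
        return True
    if kind == "Bits":
        return False
    if kind in ("Group", "Union"):
        if kind == "Union" and len(t[1]) > 1:
            return False
        return all(is_null(sub) for _, sub in t[1])
    _, te, _t, _d, _s, _c, _r, tu, x = t
    return is_null(te) and is_null(tu) and not x

def split(t):
    kind = t[0]
    if kind in ("Null", "Bits"):          # element-only: signals, no streams
        return t, []
    if kind in ("Group", "Union"):
        signals, streams = [], []
        for fname, sub in t[1]:
            sub_signals, sub_streams = split(sub)
            signals.append((fname, sub_signals))
            for sname, stream in sub_streams:
                full = fname if sname is None else f"{fname}__{sname}"
                streams.append((full, stream))
        return (kind, signals), streams
    # Stream(Te, t, d, s, c, r, Tu, x)
    _, te, tp, d, s, c, r, tu, x = t
    streams = []
    e_signals, e_streams = split(te)
    if not is_null(e_signals) or not is_null(tu) or x:
        streams.append((None, ("Stream", e_signals, tp, d, s, c, r, tu, x)))
    for name, child in e_streams:
        _, cde, ct, cd, cs, cc, cr, ctu, cx = child
        if r == "Reverse":                 # parent reversal flips the child
            cr = "Forward" if cr == "Reverse" else "Reverse"
        if s in ("Flatten", "FlatDesync"):
            cs = "FlatDesync"
        if cs != "Flatten" and s != "FlatDesync":
            cd = cd + d
        ct = ct * tp                       # throughputs multiply
        streams.append((name, ("Stream", cde, ct, cd, cs, cc, cr, ctu, cx)))
    return ("Null",), streams

# Example: a Group of an 8-bit length plus a child data stream splits into the
# element-level signals type and one physical stream named "data".
T = ("Group", [
    ("len", ("Bits", 8)),
    ("data", ("Stream", ("Bits", 4), 1, 1, "Sync", 2, "Forward", ("Null",), False)),
])
signals, streams = split(T)
assert [name for name, _ in streams] == ["data"]
```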
Field conversion function
This section defines the function \(\textrm{fields}(T_{in}) \rightarrow \textrm{Fields}\), where \(T_{in}\) is any logical stream type, and the \(\textrm{Fields}\) node is as defined in the physical stream specification.
Intuitively, this function flattens a logical stream type consisting of \(\textrm{Null}\), \(\textrm{Bits}\), \(\textrm{Group}\) and \(\textrm{Union}\) nodes into a list of named bitfields. It is used for constructing the data and user field lists of the physical streams and the user-defined signal list from the logical stream types preprocessed by \(\textrm{split}()\). This function is normally only applied to logical stream types that only carry element-manipulating nodes.
The names can be empty, because if there are no \(\textrm{Group}\)s or \(\textrm{Union}\)s, the one existing stream or signal will not have an intrinsic name given by the type. Empty strings are represented here with \(\varnothing\). Note that the names here can include double underscores.
\(\textrm{fields}(T_{in})\) is evaluated as follows.
-
If \(T_{in} = \textrm{Null}\) or \(T_{in} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), return \(\textrm{Fields}()\).
-
If \(T_{in} = \textrm{Bits}(b)\), return \(\textrm{Fields}(\varnothing : b)\).
-
If \(T_{in} = \textrm{Group}(N_{g,1} : T_{g,1}, N_{g,2} : T_{g,2}, ..., N_{g,n} : T_{g,n})\), apply the following algorithm.
- Initialize \(N_o\) and \(b_o\) to empty lists.
- For all \(i \in 1..n\):
- Unpack \(\textrm{fields}(T_{g,i})\) into \(\textrm{Fields}(N_{e,1} : b_{e,1}, N_{e,2} : b_{e,2}, ..., N_{e,m} : b_{e,m})\).
- For all \(j \in 1..m\):
- If \(N_{e,j} = \varnothing\), append \(N_{g,i}\) to the end of \(N_o\). Otherwise, append the concatenation of \(N_{g,i}\) and \(N_{e,j}\), separated by a double underscore, to the end of \(N_o\).
- Append \(b_{e,j}\) to the end of \(b_o\).
- Return \(\textrm{Fields}(N_{o,1} : b_{o,1}, N_{o,2} : b_{o,2}, ..., N_{o,p} : b_{o,p})\), where \(p = |N_o| = |b_o|\).
-
If \(T_{in} = \textrm{Union}(N_{u,1} : T_{u,1}, N_{u,2} : T_{u,2}, ..., N_{u,n} : T_{u,n})\), apply the following algorithm.
- Initialize \(N_o\) and \(b_o\) to empty lists.
- If \(n > 1\):
- Append "tag" to the end of \(N_o\).
- Append \(\left\lceil\log_2 n\right\rceil\) to the end of \(b_o\).
- \(b_d := 0\)
- For all \(i \in 1..n\):
- Unpack \(\textrm{fields}(T_{u,i})\) into \(\textrm{Fields}(N_{e,1} : b_{e,1}, N_{e,2} : b_{e,2}, ..., N_{e,m} : b_{e,m})\).
- \(b_d := \max\left(b_d, \sum_{j=1}^m b_{e,j}\right)\)
- If \(b_d > 0\):
- Append "union" to the end of \(N_o\).
- Append \(b_d\) to the end of \(b_o\).
- Return \(\textrm{Fields}(N_{o,1} : b_{o,1}, N_{o,2} : b_{o,2}, ..., N_{o,p} : b_{o,p})\), where \(p = |N_o| = |b_o|\).
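The following is a minimal Python sketch of \(\textrm{fields}()\), using the same hypothetical tuple encoding as the earlier sketches; field lists are modelled as (name, width) pairs with `None` for the empty name. It is illustrative only.

```python
from math import ceil, log2

# Minimal sketch of fields() under the hypothetical tuple encoding used above.
# Field lists are modelled as [(name, width), ...]; None is the empty name.

def fields(t):
    kind = t[0]
    if kind in ("Null", "Stream"):
        return []
    if kind == "Bits":
        return [(None, t[1])]
    if kind == "Group":
        out = []
        for gname, sub in t[1]:
            for fname, width in fields(sub):
                name = gname if fname is None else f"{gname}__{fname}"
                out.append((name, width))
        return out
    if kind == "Union":
        variants = t[1]
        out = []
        if len(variants) > 1:
            out.append(("tag", ceil(log2(len(variants)))))
        data_width = max(sum(w for _, w in fields(sub)) for _, sub in variants)
        if data_width > 0:
            out.append(("union", data_width))
        return out
    raise ValueError(f"unknown node {kind}")

# Roughly the union from the "Union semantics" example further down
# (stream parameters are filled in arbitrarily; fields() ignores them):
B = ("Group", [("x", ("Bits", 2)), ("y", ("Bits", 2))])
C = ("Stream", ("Bits", 4), 1, 1, "Sync", 1, "Forward", ("Null",), False)
U = ("Union", [("a", ("Bits", 3)), ("b", B), ("c", C)])
assert fields(U) == [("tag", 2), ("union", 4)]
```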
Synthesis function
This section defines the function \(\textrm{synthesize}(T_{in}) \rightarrow \textrm{LogicalStream}\), where \(T_{in}\) is any logical stream type. The \(\textrm{LogicalStream}\) node is defined as \(\textrm{LogicalStream}(F_{signals}, N_1 : P_1, N_2 : P_2, ..., N_n : P_n)\), where:
-
\(F_{signals}\) is of the form \(\textrm{Fields}(N_{s,1} : b_{s,1}, N_{s,2} : b_{s,2}, ..., N_{s,m} : b_{s,m})\), as defined in the physical stream specification;
This represents the list of user-defined signals flowing in parallel to the physical streams.
-
all \(P\) are of the form \(\textrm{PhysicalStream}(E, N, D, C, U)\), as defined in the physical stream specification;
-
all \(N\) are case-insensitively unique, emptyable strings consisting of letters, numbers, and/or underscores, not starting or ending in an underscore, and not starting with a digit; and
-
\(n\) is a nonnegative integer.
Intuitively, this function synthesizes a logical stream type to its physical representation.
\(\textrm{synthesize}(T_{in})\) is evaluated as follows.
- Unpack \(\textrm{split}(T_{in})\) into \(\textrm{SplitStreams}(T_{signals}, N_1 : T_1, N_2 : T_2, ..., N_n : T_n)\).
- \(F_{signals} := \textrm{fields}(T_{signals})\)
- For all \(i \in (1, 2, ..., n)\):
- Unpack \(T_i\) into \(\textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\).
- \(P_i := \textrm{PhysicalStream}(\textrm{fields}(T_e), \lceil t\rceil, d, c, \textrm{fields}(T_u))\)
- Return \(\textrm{LogicalStream}(F_{signals}, N_1 : P_1, N_2 : P_2, ..., N_n : P_n)\).
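Under the same hypothetical encoding, the synthesis step is a thin layer over the two functions above. The sketch below assumes the `split()` and `fields()` sketches from the previous sections are in scope (they are not repeated here, so it only runs alongside them), and models a physical stream as a plain dictionary of its E, N, D, C, and U parameters.

```python
from math import ceil

def synthesize(t):
    """Sketch of synthesize(): map a logical stream type to its physical form."""
    signals_type, streams = split(t)           # split() as sketched above
    physical = []
    for name, stream in streams:
        _, te, tp, d, _s, c, _r, tu, _x = stream
        physical.append((name, {
            "E": fields(te),   # element fields          (fields() as above)
            "N": ceil(tp),     # number of element lanes
            "D": d,            # dimensionality
            "C": c,            # complexity
            "U": fields(tu),   # user fields
        }))
    return fields(signals_type), physical
```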
Type compatibility function
This section defines the function \(\textrm{compatible}(T_{source}, T_{sink}) \rightarrow \textrm{Bool}\), where \(T_{source}\) and \(T_{sink}\) are both logical stream types. The function returns true if and only if a source of type \(T_{source}\) can be connected to a sink of type \(T_{sink}\) without insertion of conversion logic. The function returns true if and only if one or more of the following rules match:
-
\(T_{source} = T_{sink}\);
-
\(T_{source} = \textrm{Stream}(T_e, t, d, s, c, r, T_u, x)\), \(T_{sink} = \textrm{Stream}(T'_d, t, d, s, c', r, T_u, x)\), \(\textrm{compatible}(T_e, T'_d) = \textrm{true}\), and \(c < c'\);
-
\(T_{source} = \textrm{Group}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), \(T_{sink} = \textrm{Group}(N_1: T'_1, N_2: T'_2, ..., N_n: T'_n)\), and \(\textrm{compatible}(T_i, T'_i)\) is true for all \(i \in (1, 2, ..., n)\); or
-
\(T_{source} = \textrm{Union}(N_1: T_1, N_2: T_2, ..., N_n: T_n)\), \(T_{sink} = \textrm{Union}(N_1: T'_1, N_2: T'_2, ..., N_n: T'_n)\), and \(\textrm{compatible}(T_i, T'_i)\) is true for all \(i \in (1, 2, ..., n)\).
The name matching is to be done case sensitively.
While two streams may be compatible in a case-insensitive language when only the case of one or more of the names differ, this would still not hold for case-sensitive languages. Therefore, for two streams to be compatible in all cases, the matching must be case sensitive.
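As a minimal sketch in Python (same hypothetical tuple encoding as before), the compatibility check can be written as a direct recursive comparison; names are compared as-is, i.e., case sensitively, as required above.

```python
# Minimal sketch of compatible() under the hypothetical tuple encoding above.

def compatible(src, sink):
    if src == sink:
        return True
    if src[0] == "Stream" and sink[0] == "Stream":
        _, te, tp, d, s, c, r, tu, x = src
        _, te2, tp2, d2, s2, c2, r2, tu2, x2 = sink
        return ((tp, d, s, r, tu, x) == (tp2, d2, s2, r2, tu2, x2)
                and compatible(te, te2)
                and c < c2)
    if src[0] == sink[0] and src[0] in ("Group", "Union"):
        if [n for n, _ in src[1]] != [n for n, _ in sink[1]]:
            return False
        return all(compatible(a, b)
                   for (_, a), (_, b) in zip(src[1], sink[1]))
    return False

# A source with complexity 2 may drive a sink that accepts complexity 4,
# but not the other way around:
src = ("Stream", ("Bits", 8), 1, 1, "Sync", 2, "Forward", ("Null",), False)
snk = ("Stream", ("Bits", 8), 1, 1, "Sync", 4, "Forward", ("Null",), False)
assert compatible(src, snk) and not compatible(snk, src)
```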
Union semantics
When flattened into fields, a \(\textrm{Union}\) node consists of:
- if there are two or more variants, a "tag" field of size \(\left\lceil\log_2{n}\right\rceil\), where \(n\) is the number of variants; and
- if any variant carries data in the same stream as the union itself, a "union" field of size \(\max\limits_{\textrm{variants}} \sum\limits_{\textrm{subfields}} b_{\textrm{subfield}}\).
The "tag"
field is used to represent which variant is being represented, by
means of a zero-based index encoded as an unsigned binary number.
It is illegal for a source to drive the "tag" field of a \(\textrm{Union}\) with \(n\) variants to \(n\) or higher.
The above would otherwise be possible if \(n\) is not a power of two.
The "union"
field of a \(\textrm{Union}\) is used to represent variant
data existing in the same stream as the union itself. It consists of the
LSB-first, LSB-aligned concatenation of the fields of the variant.
No constraint is placed on the value that the source drives on the pad bits.
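To illustrate the LSB-first, LSB-aligned packing, here is a minimal Python sketch; the helper name `pack_union` is hypothetical, and pad bits are simply left at zero here even though the specification places no constraint on them.

```python
# Hypothetical helper illustrating how one union element is packed into its
# "tag" and "union" fields: variant fields are concatenated LSB-first and
# LSB-aligned; pad bits are left 0 here (the spec does not constrain them).

def pack_union(tag_index, variant_fields):
    """variant_fields: list of (value, width) pairs in field order."""
    data, shift = 0, 0
    for value, width in variant_fields:
        data |= (value & ((1 << width) - 1)) << shift
        shift += width
    return tag_index, data

# Variant "b" = Group(x: Bits(2), y: Bits(2)) with x = 1, y = 2 packs to
# union = 0b1001, matching transfer 2 of the example below.
assert pack_union(1, [(1, 2), (2, 2)]) == (1, 0b1001)
```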
The physical streams represented by \(\textrm{Stream}\) nodes nested inside \(\textrm{Union}\) variant types behave as if any other variants carried by the parent stream do not exist, but are otherwise synchronized to the parent stream as defined by the \(s\) parameter of the child stream.
Consider the following example, keeping \(x\) generic for now:
\[B = \textrm{Group}(x: \textrm{Bits}(2), y: \textrm{Bits}(2))\\ C = \textrm{Stream}(d=1, s=x, T_e=\textrm{Bits}(4))\\ U = \textrm{Union}(\textrm{a} : \textrm{Bits}(3), \textrm{b} : \textrm{B}, \textrm{c} : C)\\ \textrm{Stream}(d=1, T_e=U)\]
This results in two physical streams:

- the unnamed stream carrying the union with \(D = 1\), consisting of a two-bit field named `tag` and a four-bit field named `union`; and
- a stream named `c`, consisting of a single unnamed four-bit field.

Let's say that we need to represent the data `[a:0, b:(x:1, y:2)], [c:[3, 4, 5], a:6]`. The unnamed stream then carries the following four transfers, regardless of the value for \(x\):

    [ (tag: "00", union: "-000"),   (1)
      (tag: "01", union: "1001") ], (2)
    [ (tag: "10", union: "----"),   (3)
      (tag: "00", union: "-110") ]  (4)
In transfers 1 and 4, the MSB of the `union` field is don't-care, because variant `a` uses only three bits. In transfer 2, `x` is represented by the two LSBs of the `union` field, while `y` is represented by the MSBs. Finally, in transfer 3, all bits of `union` are don't-care, as stream `c` is used to carry the data. Note that a `tag` value of `"11"` is illegal in this example.

When \(x = \textrm{Sync}\), stream `c` has \(D = 2\) and carries the following transfers:

    [ ],               (1)
    [ [ ("0011"),      (2)
        ("0100"),      (3)
        ("0101") ] ]   (4)
Transfer 1 carries an empty outer sequence, corresponding to `[a:0, b:(x:1, y:2)]` of the represented data. There are no inner sequences because variant `c` never occurs in the first sequence, but the outer sequence must still be represented to correctly copy the dimensionality information of the parent stream, as mandated by the \(\textrm{Sync}\) tag. The second outer sequence represents `[c:[3, 4, 5], a:6]`; it carries one inner sequence, because variant `c` occurs once. The inner sequence directly corresponds to `[3, 4, 5]`.

When \(x = \textrm{Flatten}\), the outer sequence is flattened away. Therefore, stream `c` has \(D = 1\) and carries only the following transfers:

    [ ("0011"),    (1)
      ("0100"),    (2)
      ("0101") ]   (3)
It is then up to the sink to recover the outer dimension if it needs it.
When \(x = \textrm{Desync}\), \(D = 2\) as before. For the example data, the transfers would be exactly the same as for \(x = \textrm{Sync}\), but the zero-or-more relationship defined by \(\textrm{Desync}\) means that a `c` variant in the parent stream may correspond with any number of inner sequences on the `c` stream. That does however still mean that the first outer sequence must be empty, as follows:

    [ ],
    [ [ ... ] ... [ ... ] ]
After all, there are no `c` variants in the first sequence to apply the zero-or-more relationship to.

Finally, \(x = \textrm{FlatDesync}\) flattens the outer sequence away compared to \(x = \textrm{Desync}\), allowing for any number of transfers to occur on the `c` stream, all corresponding to the `c` variant in transfer 3 of the parent stream.

If you're confused as to how a sink is to determine which transfers on the `c` stream correspond to which `c` variant in the parent stream, recall that the purpose of \(\textrm{Desync}\) and \(\textrm{FlatDesync}\) is to allow the user of this layer of the specification to define this relation as they see fit. Usually, a pattern of the form \(\textrm{Group}(\textrm{len}: \textrm{Bits}(...), \textrm{data}: \textrm{Stream}(s=\textrm{Desync}, ...))\) would be used, in which case the length of the encoded inner sequence would be transferred using the `union` field of the parent stream.
Natural ordering and inter-stream dependencies
The natural ordering of a datum transferred over a logical stream is defined as follows:
-
for \(\textrm{Group}\) nodes, the elements are ordered by left-to-right group order; and
-
for \(\textrm{Stream}\) nodes with a parent stream, for every element transferred on the parent stream, transfer the parent element first, then transfer the corresponding data on the child stream. This data depends on the \(s\) and \(d\) parameters of the \(\textrm{Stream}\) node:
- \(s \in (\textrm{Sync}, \textrm{Flatten})\), \(d = 0\): one element on the parent stream corresponds to one element on the child stream;
- \(s = \textrm{Sync}\), \(d > 0\): one element on the parent stream corresponds to one \(d\)-dimensional sequence on the child stream, delimited by the most significant `last` bit added on top of the `last` bits replicated from the parent stream;
- \(s = \textrm{Flatten}\), \(d > 0\): one element on the parent stream corresponds to one \(d\)-dimensional sequence on the child stream, delimited by the most significant `last` bit;
- \(s \in (\textrm{Desync}, \textrm{FlatDesync})\), \(d = 0\): one element on the parent stream corresponds to a user/context-defined number of elements on the child stream (zero or more); and
- \(s \in (\textrm{Desync}, \textrm{FlatDesync})\), \(d > 0\): one element on the parent stream corresponds to a user/context-defined number of \(d\)-dimensional sequences on the child stream (zero or more).
This corresponds to depth-first preorder traversal of the logical stream type tree. It also corresponds to the order in which you would naturally write the data down.
Sources may not assume that sinks will be able to handshake data in an order differing from the natural order defined above.
Sinks may not assume that sources will be able to handshake data in an order differing from the natural order defined above.
The above two rules, if adhered to by both source and sink, prevent inter-physical-stream deadlocks.
Consider the following simple example:
\[C = \textrm{Stream}(s = \textrm{Sync}, d = 0, ...)\\ \textrm{Group}(\textrm{a}: C, \textrm{b}: C)\]
    [(a: 1, b: 2), (a: 3, b: 4)]

Let's say the source follows the natural ordering rule and can't buffer anything. That is, it drives `1` on physical stream `a`, waits for the transfer to be acknowledged before driving `2` on physical stream `b`, then waits for that to be acknowledged before driving `3`, and so on. If the natural-ordering rule for sinks did not exist, the sink may try to wait for stream `b` to be validated before it will acknowledge `a`, which would in this case cause a deadlock.

The reasoning also works the other way around. That is, if a sink will only acknowledge in the natural order, but the source is allowed to drive `2` on `b` and wait for its acknowledgement before driving `1` on `a`, a similar deadlock will occur.

The above does not mean that a source isn't allowed to drive data on the two streams independently as fast as it can. The same thing goes for acknowledgement by a sink. If both source and sink support full out-of-order transferral of both streams, the elements can be transferred in any order. It just means that sources and sinks must be compatible with an opposite end that does not support out-of-order transferral.
The above reasoning is particularly important for streams flowing in opposing directions. In this case, for a physical stream `x` naturally ordered before another stream `y` flowing in the opposing direction, `x` acts as a request stream and `y` acts as its response. That is, the source of `x` may not assume that the response will come before it sends the request.
User-defined signal constraints
Tydi places the following constraints on the user-defined signals.
The user-defined signals flow exclusively in the source to sink direction.
While physical streams can be reversed in order to transfer metadata in the sink to source direction, logical streams fundamentally describe dataflow in the source to sink direction. Therefore, if data needs to flow in the opposite direction as well, you should use a second logical stream.
The reasoning behind this design choice is mostly practical in nature. Firstly, with the way the logical stream type is currently defined, unidirectional signals emerge naturally before the first stream node is opened, but reverse-direction signals do not. Therefore, additional nodes would be needed to describe this. Secondly, allowing signals to flow in both directions might be interpreted by users as a way for them to implement their own handshake method to use instead of or in addition to the handshakes of the Tydi physical streams. Most handshake methods break however when registers or clock domain crossings not specifically tailored for that particular handshake are inserted in between. With a purely unidirectional approach, latency bounds are not necessary for correctness, allowing tooling to automatically generate interconnect IP. Tydi does pose some constraints on such IP, however.
Interconnect components such as register slices and clock domain crossings may interpret the signals as being asynchronous. Thus, a sink may not rely on bit flips that are driven in the same cycle by the source also arriving in the same cycle at the sink.
This allows clock domain crossings for the user-defined signals to reduce to simple synchronization registers.
The latency of any signal travelling from source to sink must be less than or equal to the latency of the streams.
This allows the user-defined signals to be used as a way to transfer quasi-constant values, such as the control register inputs for some kernel. As long as the constants are only mutated before the source sends the first stream transfer, the signals received by the sink are guaranteed to be stable by the time it receives the first transfer.
Handshaking in the reverse direction may be done using a reversed stream. When the sink is done processing, it sends a transfer on this stream. This will necessarily be received by the source at a later point in time, which is then free to modify the user signals again.
Note that slow-changing counter values (those that only increment or decrement slowly with respect to the involved clock domains) can be safely transferred using Gray code. Pulse/strobe signals can be safely transferred by the source toggling a user-defined signal whenever the strobe occurs, and the sink detecting the transitions in this signal.
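As a small illustration of the Gray-code remark, a minimal Python sketch of the conversion is shown below (the hardware equivalent is a simple XOR network); this is illustrative only and not part of the specification.

```python
# Minimal sketch of the Gray-code trick: a counter that changes by at most one
# per cycle is safe to synchronize bit-by-bit, because consecutive Gray codes
# differ in exactly one bit.

def to_gray(value: int) -> int:
    """Binary to reflected-binary (Gray) code."""
    return value ^ (value >> 1)

def from_gray(gray: int) -> int:
    """Gray code back to binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value

# Consecutive values differ in exactly one Gray-coded bit, and the mapping
# round-trips.
assert all(bin(to_gray(i) ^ to_gray(i + 1)).count("1") == 1 for i in range(255))
assert all(from_gray(to_gray(i)) == i for i in range(256))
```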
Tools
(under construction)
Generators
(under construction)
Compositors
(under construction)
Frequently Asked Questions
References
Publications
- J. Peltenburg, J. Van Straten, M. Brobbel, Z. Al-Ars and H. P. Hofstee
Tydi: An Open Specification for Complex Data Structures Over Hardware Streams
IEEE Micro, vol. 40, no. 4, pp. 120-130, 1 July-Aug. 2020, doi: 10.1109/MM.2020.2996373.
Other
- Tydi playground: an interactive playground to visualize Tydi types
An early implementation from which the more generalized Tydi was spawned is Fletcher. Resources on Fletcher can be found here:
- Fletcher general paper: published in 2019 29th International Conference on Field Programmable Logic and Applications (FPL)
- Fletcher hardware design: published in ARC 2019: Applied Reconfigurable Computing.
- Fletcher: the Fletcher project repository