2022-08-11 20:06:33 -07:00
|
|
|
|
# MLIR Bytecode Format
|
|
|
|
|
|
|
|
|
|
|
|
This documents describes the MLIR bytecode format and its encoding.
|
|
|
|
|
|
|
|
|
|
|
|
[TOC]
|
|
|
|
|
|
|
|
|
|
|
|
## Magic Number
|
|
|
|
|
|
|
|
|
|
|
|
MLIR uses the following four-byte magic number to indicate bytecode files:
|
|
|
|
|
|
|
|
|
|
|
|
'\[‘M’<sub>8</sub>, ‘L’<sub>8</sub>, ‘ï’<sub>8</sub>, ‘R’<sub>8</sub>\]'
|
|
|
|
|
|
|
|
|
|
|
|
In hex:
|
|
|
|
|
|
|
|
|
|
|
|
'\[‘4D’<sub>8</sub>, ‘4C’<sub>8</sub>, ‘EF’<sub>8</sub>, ‘52’<sub>8</sub>\]'
|
|
|
|
|
|
|
|
|
|
|
|
## Format Overview
|
|
|
|
|
|
|
|
|
|
|
|
An MLIR Bytecode file is comprised of a byte stream, with a few simple
|
|
|
|
|
|
structural concepts layered on top.
|
|
|
|
|
|
|
|
|
|
|
|
### Primitives
|
|
|
|
|
|
|
|
|
|
|
|
#### Fixed-Width Integers
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
byte ::= `0x00`...`0xFF`
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Fixed width integers are unsigned integers of a known byte size. The values are
|
|
|
|
|
|
stored in little-endian byte order.
|
|
|
|
|
|
|
|
|
|
|
|
TODO: Add larger fixed width integers as necessary.
|
|
|
|
|
|
|
|
|
|
|
|
#### Variable-Width Integers
|
|
|
|
|
|
|
|
|
|
|
|
Variable width integers, or `VarInt`s, provide a compact representation for
|
|
|
|
|
|
integers. Each encoded VarInt consists of one to nine bytes, which together
|
|
|
|
|
|
represent a single 64-bit value. The MLIR bytecode utilizes the "PrefixVarInt"
|
|
|
|
|
|
encoding for VarInts. This encoding is a variant of the
|
|
|
|
|
|
[LEB128 ("Little-Endian Base 128")](https://en.wikipedia.org/wiki/LEB128)
|
|
|
|
|
|
encoding, where each byte of the encoding provides up to 7 bits for the value,
|
|
|
|
|
|
with the remaining bit used to store a tag indicating the number of bytes used
|
|
|
|
|
|
for the encoding. This means that small unsigned integers (less than 2^7) may be
|
|
|
|
|
|
stored in one byte, unsigned integers up to 2^14 may be stored in two bytes,
|
|
|
|
|
|
etc.
|
|
|
|
|
|
|
|
|
|
|
|
The first byte of the encoding includes a length prefix in the low bits. This
|
|
|
|
|
|
prefix is a bit sequence of '0's followed by a terminal '1', or the end of the
|
|
|
|
|
|
byte. The number of '0' bits indicate the number of _additional_ bytes, not
|
|
|
|
|
|
including the prefix byte, used to encode the value. All of the remaining bits
|
|
|
|
|
|
in the first byte, along with all of the bits in the additional bytes, provide
|
|
|
|
|
|
the value of the integer. Below are the various possible encodings of the prefix
|
|
|
|
|
|
byte:
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
xxxxxxx1: 7 value bits, the encoding uses 1 byte
|
|
|
|
|
|
xxxxxx10: 14 value bits, the encoding uses 2 bytes
|
|
|
|
|
|
xxxxx100: 21 value bits, the encoding uses 3 bytes
|
|
|
|
|
|
xxxx1000: 28 value bits, the encoding uses 4 bytes
|
|
|
|
|
|
xxx10000: 35 value bits, the encoding uses 5 bytes
|
|
|
|
|
|
xx100000: 42 value bits, the encoding uses 6 bytes
|
|
|
|
|
|
x1000000: 49 value bits, the encoding uses 7 bytes
|
|
|
|
|
|
10000000: 56 value bits, the encoding uses 8 bytes
|
|
|
|
|
|
00000000: 64 value bits, the encoding uses 9 bytes
|
|
|
|
|
|
```
|
|
|
|
|
|
|
2022-08-23 23:39:18 -07:00
|
|
|
|
##### Signed Variable-Width Integers
|
|
|
|
|
|
|
|
|
|
|
|
Signed variable width integer values are encoded in a similar fashion to
|
|
|
|
|
|
[varints](#variable-width-integers), but employ
|
|
|
|
|
|
[zigzag encoding](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding).
|
|
|
|
|
|
This encoding uses the low bit of the value to indicate the sign, which allows
|
|
|
|
|
|
for more efficiently encoding negative numbers. If a negative value were encoded
|
|
|
|
|
|
using a normal [varint](#variable-width-integers), it would be treated as an
|
|
|
|
|
|
extremely large unsigned value. Using zigzag encoding allows for a smaller
|
|
|
|
|
|
number of active bits in the value, leading to a smaller encoding. Below is the
|
|
|
|
|
|
basic computation for generating a zigzag encoding:
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
(value << 1) ^ (value >> 63)
|
|
|
|
|
|
```
|
|
|
|
|
|
|
2022-08-11 20:06:33 -07:00
|
|
|
|
#### Strings
|
|
|
|
|
|
|
|
|
|
|
|
Strings are blobs of characters with an associated length.
|
|
|
|
|
|
|
|
|
|
|
|
### Sections
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
section {
|
|
|
|
|
|
id: byte
|
|
|
|
|
|
length: varint
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Sections are a mechanism for grouping data within the bytecode. The enable
|
|
|
|
|
|
delayed processing, which is useful for out-of-order processing of data,
|
|
|
|
|
|
lazy-loading, and more. Each section contains a Section ID and a length (which
|
|
|
|
|
|
allowing for skipping over the section).
|
|
|
|
|
|
|
|
|
|
|
|
TODO: Sections should also carry an optional alignment. Add this when necessary.
|
|
|
|
|
|
|
|
|
|
|
|
## MLIR Encoding
|
|
|
|
|
|
|
|
|
|
|
|
Given the generic structure of MLIR, the bytecode encoding is actually fairly
|
|
|
|
|
|
simplistic. It effectively maps to the core components of MLIR.
|
|
|
|
|
|
|
|
|
|
|
|
### Top Level Structure
|
|
|
|
|
|
|
|
|
|
|
|
The top-level structure of the bytecode contains the 4-byte "magic number", a
|
|
|
|
|
|
version number, a null-terminated producer string, and a list of sections. Each
|
|
|
|
|
|
section is currently only expected to appear once within a bytecode file.
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
bytecode {
|
|
|
|
|
|
magic: "MLïR",
|
|
|
|
|
|
version: varint,
|
|
|
|
|
|
producer: string,
|
|
|
|
|
|
sections: section[]
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### String Section
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
strings {
|
|
|
|
|
|
numStrings: varint,
|
|
|
|
|
|
reverseStringLengths: varint[],
|
|
|
|
|
|
stringData: byte[]
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The string section contains a table of strings referenced within the bytecode,
|
|
|
|
|
|
more easily enabling string sharing. This section is encoded first with the
|
|
|
|
|
|
total number of strings, followed by the sizes of each of the individual strings
|
|
|
|
|
|
in reverse order. The remaining encoding contains a single blob containing all
|
|
|
|
|
|
of the strings concatenated together.
|
|
|
|
|
|
|
|
|
|
|
|
### Dialect Section
|
|
|
|
|
|
|
|
|
|
|
|
The dialect section of the bytecode contains all of the dialects referenced
|
|
|
|
|
|
within the encoded IR, and some information about the components of those
|
|
|
|
|
|
dialects that were also referenced.
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
dialect_section {
|
|
|
|
|
|
numDialects: varint,
|
|
|
|
|
|
dialectNames: varint[],
|
|
|
|
|
|
opNames: op_name_group[]
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
op_name_group {
|
|
|
|
|
|
dialect: varint,
|
|
|
|
|
|
numOpNames: varint,
|
|
|
|
|
|
opNames: varint[]
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Dialects are encoded as indexes to the name string within the string section.
|
|
|
|
|
|
Operation names are encoded in groups by dialect, with each group containing the
|
|
|
|
|
|
dialect, the number of operation names, and the array of indexes to each name
|
|
|
|
|
|
within the string section.
|
|
|
|
|
|
|
|
|
|
|
|
### Attribute/Type Sections
|
|
|
|
|
|
|
|
|
|
|
|
Attributes and types are encoded using two [sections](#sections), one section
|
|
|
|
|
|
(`attr_type_section`) containing the actual encoded representation, and another
|
|
|
|
|
|
section (`attr_type_offset_section`) containing the offsets of each encoded
|
|
|
|
|
|
attribute/type into the previous section. This structure allows for attributes
|
|
|
|
|
|
and types to always be lazily loaded on demand.
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
attr_type_section {
|
|
|
|
|
|
attrs: attribute[],
|
|
|
|
|
|
types: type[]
|
|
|
|
|
|
}
|
|
|
|
|
|
attr_type_offset_section {
|
|
|
|
|
|
numAttrs: varint,
|
|
|
|
|
|
numTypes: varint,
|
|
|
|
|
|
offsets: attr_type_offset_group[]
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
attr_type_offset_group {
|
|
|
|
|
|
dialect: varint,
|
|
|
|
|
|
numElements: varint,
|
|
|
|
|
|
offsets: varint[] // (offset << 1) | (hasCustomEncoding)
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
attribute {
|
|
|
|
|
|
encoding: ...
|
|
|
|
|
|
}
|
|
|
|
|
|
type {
|
|
|
|
|
|
encoding: ...
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Each `offset` in the `attr_type_offset_section` above is the size of the
|
|
|
|
|
|
encoding for the attribute or type and a flag indicating if the encoding uses
|
|
|
|
|
|
the textual assembly format, or a custom bytecode encoding. We avoid using the
|
|
|
|
|
|
direct offset into the `attr_type_section`, as a smaller relative offsets
|
|
|
|
|
|
provides more effective compression. Attributes and types are grouped by
|
|
|
|
|
|
dialect, with each `attr_type_offset_group` in the offset section containing the
|
|
|
|
|
|
corresponding parent dialect, number of elements, and offsets for each element
|
|
|
|
|
|
within the group.
|
|
|
|
|
|
|
|
|
|
|
|
#### Attribute/Type Encodings
|
|
|
|
|
|
|
|
|
|
|
|
In the abstract, an attribute/type is encoded in one of two possible ways: via
|
|
|
|
|
|
its assembly format, or via a custom dialect defined encoding.
|
|
|
|
|
|
|
|
|
|
|
|
##### Assembly Format Fallback
|
|
|
|
|
|
|
|
|
|
|
|
In the case where a dialect does not define a method for encoding the attribute
|
|
|
|
|
|
or type, the textual assembly format of that attribute or type is used as a
|
|
|
|
|
|
fallback. For example, a type of `!bytecode.type` would be encoded as the null
|
|
|
|
|
|
terminated string "!bytecode.type". This ensures that every attribute and type
|
|
|
|
|
|
may be encoded, even if the owning dialect has not yet opted in to a more
|
|
|
|
|
|
efficient serialization.
|
|
|
|
|
|
|
|
|
|
|
|
TODO: We shouldn't redundantly encode the dialect name here, we should use a
|
|
|
|
|
|
reference to the parent dialect instead.
|
|
|
|
|
|
|
|
|
|
|
|
##### Dialect Defined Encoding
|
|
|
|
|
|
|
2022-08-23 12:56:02 -07:00
|
|
|
|
In addition to the assembly format fallback, dialects may also provide a custom
|
|
|
|
|
|
encoding for their attributes and types. Custom encodings are very beneficial in
|
|
|
|
|
|
that they are significantly smaller and faster to read and write.
|
|
|
|
|
|
|
|
|
|
|
|
Dialects can opt-in to providing custom encodings by implementing the
|
|
|
|
|
|
`BytecodeDialectInterface`. This interface provides hooks, namely
|
|
|
|
|
|
`readAttribute`/`readType` and `writeAttribute`/`writeType`, that will be used
|
|
|
|
|
|
by the bytecode reader and writer. These hooks are provided a reader and writer
|
|
|
|
|
|
implementation that can be used to encode various constructs in the underlying
|
|
|
|
|
|
bytecode format. A unique feature of this interface is that dialects may choose
|
|
|
|
|
|
to only encode a subset of their attributes and types in a custom bytecode
|
|
|
|
|
|
format, which can simplify adding new or experimental components that aren't
|
|
|
|
|
|
fully baked.
|
|
|
|
|
|
|
|
|
|
|
|
When implementing the bytecode interface, dialects are responsible for all
|
|
|
|
|
|
aspects of the encoding. This includes the indicator for which kind of attribute
|
|
|
|
|
|
or type is being encoded; the bytecode reader will only know that it has
|
|
|
|
|
|
encountered an attribute or type of a given dialect, it doesn't encode any
|
|
|
|
|
|
further information. As such, a common encoding idiom is to use a leading
|
|
|
|
|
|
`varint` code to indicate how the attribute or type was encoded.
|
2022-08-11 20:06:33 -07:00
|
|
|
|
|
|
|
|
|
|
### IR Section
|
|
|
|
|
|
|
|
|
|
|
|
The IR section contains the encoded form of operations within the bytecode.
|
|
|
|
|
|
|
|
|
|
|
|
#### Operation Encoding
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
op {
|
|
|
|
|
|
name: varint,
|
|
|
|
|
|
encodingMask: byte,
|
|
|
|
|
|
location: varint,
|
|
|
|
|
|
|
|
|
|
|
|
attrDict: varint?,
|
|
|
|
|
|
|
|
|
|
|
|
numResults: varint?,
|
|
|
|
|
|
resultTypes: varint[],
|
|
|
|
|
|
|
|
|
|
|
|
numOperands: varint?,
|
|
|
|
|
|
operands: varint[],
|
|
|
|
|
|
|
|
|
|
|
|
numSuccessors: varint?,
|
|
|
|
|
|
successors: varint[],
|
|
|
|
|
|
|
|
|
|
|
|
regionEncoding: varint?, // (numRegions << 1) | (isIsolatedFromAbove)
|
|
|
|
|
|
regions: region[]
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The encoding of an operation is important because this is generally the most
|
|
|
|
|
|
commonly appearing structure in the bytecode. A single encoding is used for
|
|
|
|
|
|
every type of operation. Given this prevelance, many of the fields of an
|
|
|
|
|
|
operation are optional. The `encodingMask` field is a bitmask which indicates
|
|
|
|
|
|
which of the components of the operation are present.
|
|
|
|
|
|
|
|
|
|
|
|
##### Location
|
|
|
|
|
|
|
|
|
|
|
|
The location is encoded as the index of the location within the attribute table.
|
|
|
|
|
|
|
|
|
|
|
|
##### Attributes
|
|
|
|
|
|
|
|
|
|
|
|
If the operation has attribues, the index of the operation attribute dictionary
|
|
|
|
|
|
within the attribute table is encoded.
|
|
|
|
|
|
|
|
|
|
|
|
##### Results
|
|
|
|
|
|
|
|
|
|
|
|
If the operation has results, the number of results and the indexes of the
|
|
|
|
|
|
result types within the type table are encoded.
|
|
|
|
|
|
|
|
|
|
|
|
##### Operands
|
|
|
|
|
|
|
|
|
|
|
|
If the operation has operands, the number of operands and the value index of
|
|
|
|
|
|
each operand is encoded. This value index is the relative ordering of the
|
|
|
|
|
|
definition of that value from the start of the first ancestor isolated region.
|
|
|
|
|
|
|
|
|
|
|
|
##### Successors
|
|
|
|
|
|
|
|
|
|
|
|
If the operation has successors, the number of successors and the indexes of the
|
|
|
|
|
|
successor blocks within the parent region are encoded.
|
|
|
|
|
|
|
|
|
|
|
|
##### Regions
|
|
|
|
|
|
|
|
|
|
|
|
If the operation has regions, the number of regions and if the regions are
|
|
|
|
|
|
isolated from above are encoded together in a single varint. Afterwards, each
|
|
|
|
|
|
region is encoded inline.
|
|
|
|
|
|
|
|
|
|
|
|
#### Region Encoding
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
region {
|
|
|
|
|
|
numBlocks: varint,
|
|
|
|
|
|
|
|
|
|
|
|
numValues: varint?,
|
|
|
|
|
|
blocks: block[]
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
A region is encoded first with the number of blocks within. If the region is
|
|
|
|
|
|
non-empty, the number of values defined directly within the region are encoded,
|
|
|
|
|
|
followed by the blocks of the region.
|
|
|
|
|
|
|
|
|
|
|
|
#### Block Encoding
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
block {
|
|
|
|
|
|
encoding: varint, // (numOps << 1) | (hasBlockArgs)
|
|
|
|
|
|
arguments: block_arguments?, // Optional based on encoding
|
|
|
|
|
|
ops : op[]
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
block_arguments {
|
|
|
|
|
|
numArgs: varint?,
|
|
|
|
|
|
args: block_argument[]
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
block_argument {
|
|
|
|
|
|
typeIndex: varint,
|
|
|
|
|
|
location: varint
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
A block is encoded with an array of operations and block arguments. The first
|
|
|
|
|
|
field is an encoding that combines the number of operations in the block, with a
|
|
|
|
|
|
flag indicating if the block has arguments.
|