Add an option that does not bufferize the targeted op itself, but just materializes a buffer for the destination operands. This is useful for partial bufferization of complex ops such as `scf.forall`, which need special handling (and an analysis of the region).
Differential Revision: https://reviews.llvm.org/D155946
Applying the canonicalizer and CSE in an interleaved fashion is useful after bufferization (and maybe other transforms) to fold away self copies.
Differential Revision: https://reviews.llvm.org/D155933
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
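A sketch of intended usage (the exact assembly format and the descriptor type parameters here are illustrative assumptions, not verbatim from the patch):
```
// Create a TMA descriptor for 128x64 tiles of a global-memory tensor.
%desc = nvgpu.tma.create.descriptor %src box[%c128, %c64]
    : memref<*xf32> -> !nvgpu.tensormap.descriptor<
        tensor = memref<128x64xf32, 3>, swizzle = none,
        l2promo = none, oob = zero, interleave = none>
```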
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
RegionBranchOpInterface did not allow an operation with regions to
specify itself as a successor. This implied that control
is always transferred to a region before being transferred back to the
parent op. Since a region can only transfer control back to the
parent op from a terminator, this transitively implied that the first
block of any region of an op implementing RegionBranchOpInterface is
always executed until the terminator can transfer the control flow
back. This is trivially false for any conditional-like operation that
may or may not execute the region, as well as for loop-like operations
that may not execute the body.
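For instance, the region of an `scf.if` without an `else` may never execute, so control can transfer from the parent op directly back to itself:
```
// The "then" region is skipped entirely when %cond is false, so the
// parent op must be allowed to list itself as a successor.
scf.if %cond {
  "test.side_effecting_op"() : () -> ()
}
```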
Remove the restriction from the interface description and update the
only transform that relied on it.
See
https://discourse.llvm.org/t/rfc-region-control-flow-interfaces-should-encode-region-not-executed-correctly/72103.
Depends On: https://reviews.llvm.org/D155757
Reviewed By: Mogball, springerm
Differential Revision: https://reviews.llvm.org/D155822
Initial implementations of dense dataflow analyses feature special cases
for operations that have region- or call-based control flow by
leveraging the corresponding interfaces. This is not necessarily
sufficient, as these operations may influence the dataflow state by
themselves as well as through the control flow. For example,
`linalg.generic` and similar operations have region-based control flow
and memory effects of their own, so any memory-related analysis such as
last-writer requires processing `linalg.generic` directly instead of, or
in addition to, the region-based flow.
Provide hooks to customize the processing of operations with region-
and call-based control flow in forward and backward dense dataflow
analyses. These hooks are triggered when control flow is transferred
between the "main" operation, i.e. the call or the region owner, and
another region. Such an approach allows the analyses to update the
lattice before and/or after the regions. In the `linalg.generic`
example, the reads from memory are interpreted as happening before the
body region and the writes to memory as happening after
the body region. Using these hooks in a generic analysis may require
introducing additional interfaces, but for now assume that specific
analyses have special cases for the (rare) operations with call- and
region-based control flow that need additional processing.
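As a concrete illustration of the motivating case, consider `linalg.generic` on memrefs (a minimal sketch): the op itself reads and writes memory in addition to transferring control to its body region.
```
#id = affine_map<(d0) -> (d0)>
// The op reads %A (interpreted as happening before the body executes)
// and writes %B (interpreted as happening after the body executes),
// on top of its region-based control flow.
linalg.generic {indexing_maps = [#id, #id],
                iterator_types = ["parallel"]}
    ins(%A : memref<8xf32>) outs(%B : memref<8xf32>) {
^bb0(%a: f32, %b: f32):
  linalg.yield %a : f32
}
```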
Reviewed By: Mogball, phisiart
Differential Revision: https://reviews.llvm.org/D155757
This work adds the `nvgpu.tma.async.load` Op that requests a TMA load asynchronously using an mbarrier object.
It also creates the `nvgpu.tma.descriptor` type. The type is supposed to be created by the `cuTensorMapEncodeTiled` CUDA Driver API.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155453
Linalg structured ops do not implement control flow in the way expected
by RegionBranchOpInterface, and the interface implementation isn't
actually used anywhere. The presence of this interface without a correct
implementation is confusing for, e.g., dataflow analyses.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D155841
Using properties currently requires at the very least implementing four methods/code snippets:
* `convertToAttribute`
* `convertFromAttribute`
* `writeToMlirBytecode`
* `readFromMlirBytecode`
This makes replacing attributes with properties harder than it has to be: attributes by default do not require immediately defining a custom bytecode encoding.
This patch therefore adds opt-in implementations of `writeToMlirBytecode` and `readFromMlirBytecode` that work with the default implementations of `convertToAttribute` and `convertFromAttribute`. They are provided by `defvar`s in `OpBase.td` and can be used by adding:
```
let writeToMlirBytecode = writeMlirBytecodeWithConvertToAttribute;
let readFromMlirBytecode = readMlirBytecodeUsingConvertFromAttribute;
```
to one's TableGen definition.
While this bytecode encoding is almost certainly not ideal for a given property, it allows more incremental use of properties and getting something sane working before optimizing the bytecode format.
Differential Revision: https://reviews.llvm.org/D155286
Author `inferReturnTypeComponents` methods with the op adaptor by using the `InferShapedTypeOpAdaptor`.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D155243
Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate
matrix multiplication on RDNA3 architectures. LLVM already provides a
set of intrinsics to generate wmma instructions. This change uses those
intrinsics to enable the feature in MLIR.
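For illustration, a use of the new op might look like the following (the op name and vector shapes follow AMDGPU dialect conventions and are assumptions, not verbatim from this patch):
```
// 16x16x16 WMMA: f16 operands with an f16 accumulator (sketch).
%d = amdgpu.wmma %a * %b + %c
    : vector<16xf16>, vector<16xf16>, vector<8xf16>
```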
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D152451
Continue the work outlined in D155747 and split the main SPIR-V ops
implementation file into a few smaller and quicker to compile files.
This organization matches the op definition organization in `.td` files.
In this patch, extract atomic, cast/conversion, and group op
implementation into separate files.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D155777
This gives the region a more meaningful name. The topic came up in a
discussion on https://reviews.llvm.org/D155435, where the name `region`
would have led to a situation where a convenience accessor called
`region` (after the ODS name) would have returned a Block.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155810
This patch modifies the construction of the `OpenMPIRBuilder` in MLIR to
initialize the `IsGPU` flag using target triple information passed down from
the Flang frontend. If not present, it will default to `false`.
This replicates the behavior currently implemented in Clang, where the
`CodeGenModule::createOpenMPRuntime()` method creates a different
`CGOpenMPRuntime` instance depending on the target triple, which in turn has an
effect on the `IsGPU` flag of the `OpenMPIRBuilderConfig` object.
Differential Revision: https://reviews.llvm.org/D151903
Currently, inlining a function with a `noalias` parameter leads to a large loss of optimization potential as the `noalias` parameter, an important hint for alias analysis, is lost completely.
This patch fixes this with the same approach as LLVM by annotating all users of the `noalias` parameter with appropriate alias and noalias scope lists.
The implementation done here is not as sophisticated as LLVM's, which has more infrastructure related to escaping and captured pointers, but it should work in the majority of important cases.
Any deficiency can be addressed in future patches.
Related LLVM code: 27ade4b554/llvm/lib/Transforms/Utils/InlineFunction.cpp (L1090)
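A rough sketch of the effect in the LLVM dialect (the scope attribute contents are elided; the annotation shape is an assumption based on the dialect's alias metadata support):
```
// Before inlining: the callee's pointer parameter is marked noalias.
llvm.func @callee(%p: !llvm.ptr {llvm.noalias}) {
  %v = llvm.load %p : !llvm.ptr -> f32
  llvm.return
}
// After inlining into a caller, accesses derived from %p carry alias
// scope lists so the noalias fact survives (scope contents elided):
//   %v = llvm.load %p {alias_scopes = [...], noalias_scopes = [...]}
//       : !llvm.ptr -> f32
```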
Differential Revision: https://reviews.llvm.org/D155712
This work adds two Ops:
`mbarrier.arrive.expect_tx` performs an expect-tx arrive on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`.
`mbarrier.try_wait.parity` waits on an `mbarrier.barrier` and an `mbarrier.barrier.token`.
`mbarrier.arrive.expect_tx` is one of the requirements for enabling H100 TMA support.
Depends on D154074 D154076 D154059 D154060
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D154094
This revision adds a branch weight op interface for the call / branch
operations that support branch weights. It can be used in the LLVM IR
import and export to simplify the branch weight conversion. An
additional mapping between call operations and instructions ensures
the actual conversion can be done in the module translation itself,
rather than in the dialect translation interface. It also has the
benefit that downstream users can amend custom metadata to the call
operation during the export to LLVM IR.
Reviewed By: zero9178, definelicht
Differential Revision: https://reviews.llvm.org/D155702
Also update the documentation of `Operation::fold`, which did not take into account in-place foldings.
Differential Revision: https://reviews.llvm.org/D155691
Commutative ops were previously folded with a special rule in `OperationFolder`. This change turns the folding into a proper `OpTrait` folder.
Differential Revision: https://reviews.llvm.org/D155687
The main op implementation file for SPIR-V grew past 5k LOC. This makes it
take a long time to compile and index with LSPs like clangd.
Pull out the first few SPIR-V extension ops into their own `.cpp` files,
just like we do with `.td` op definitions. This includes the
KHR/NV/Intel coop matrix and the integer dot product extensions.
I plan to further split this in future revisions.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D155747
TL;DR the following API functions have been merged
```
void populateFoldUnitExtentDimsViaReshapesPatterns(RewritePatternSet &patterns);
void populateFoldUnitExtentDimsViaSlicesPatterns(RewritePatternSet &patterns);
```
into
```
void populateFoldUnitExtentDimsPatterns(RewritePatternSet &patterns,
ControlDropUnitDims &options);
```
To use the previous functionality use
```
ControlDropUnitDims options;
// By default options.rankReductionStrategy is
// ControlDropUnitDims::RankReductionStrategy::ReassociativeReshape.
populateFoldUnitExtentDimsPatterns(patterns, options);
```
and
```
ControlDropUnitDims options;
options.rankReductionStrategy = ControlDropUnitDims::RankReductionStrategy::ExtractInsertSlice;
populateFoldUnitExtentDimsPatterns(patterns, options);
```
This pass is quite old and needed to be updated based on the current
approach to transformations in Linalg:
- Instead of two patterns, one to just remove loop dimensions that are
unit extent (and using 0 in the indexing maps), and another to drop
the unit extents in the operand shapes, combine them into a single
transformation. This avoids creating an intermediate step with
indexing maps having 0s in the domain expressions.
- Expose the core transformation as a utility function and add a
pattern that calls this transformation.
This is a mostly NFC change, apart from the API change and dropping
the patterns/tests that only dropped the loops that are unit extents.
Differential Revision: https://reviews.llvm.org/D155518
Allow the init and combiner regions to have more
arguments to pass information.
Reviewed By: razvanlupusoru
Differential Revision: https://reviews.llvm.org/D155656
For variables in declare clauses, their producing operation should be
marked with the data clause for ease of lookup and consistency
verification. Thus, add an attribute that can be used for this purpose,
plus verification that the declare data operation matches the declare
data clause on the variable.
Reviewed By: clementval
Differential Revision: https://reviews.llvm.org/D155640
The current representation of TBAA is the very last in-tree user of the `llvm.metadata` operation.
Using ops to model metadata has a few disadvantages:
* Building a graph has to be done through some weakly typed indirection mechanism such as `SymbolRefAttr`
* Creating the metadata has to be done through a builder within a metadata op.
* It is not multithreading safe as operation insertion into the same block is not thread-safe
This patch therefore converts TBAA metadata into an attribute representation, in a similar manner as it has been done for alias groups and access groups in previous patches.
This additionally has the large benefit of giving us more "correctness by construction" as it makes things like cycles in a TBAA graph, or references to an incorrectly typed metadata node impossible.
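A small sketch of the attribute-based encoding (the parameter values are illustrative):
```
// TBAA graph expressed as attributes instead of llvm.metadata ops.
#root = #llvm.tbaa_root<id = "Simple C/C++ TBAA">
#int_desc = #llvm.tbaa_type_desc<id = "int", members = {<#root, 0>}>
#int_tag = #llvm.tbaa_tag<base_type = #int_desc,
                          access_type = #int_desc, offset = 0>

llvm.func @read(%p: !llvm.ptr) -> i32 {
  %v = llvm.load %p {tbaa = [#int_tag]} : !llvm.ptr -> i32
  llvm.return %v : i32
}
```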
Differential Revision: https://reviews.llvm.org/D155444
Adds a generic lowering that supports all cases of `bufferization.dealloc`
and one specialized, more efficient lowering for the simple case. Using
a helper function with for loops in the general case keeps the size of
the lowered code in O(|num_dealloc_memrefs| + |num_retain_memrefs|).
Depends on D155467
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D155468
This work introduces the `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned` Ops. They are used to synchronize warpgroup-level matrix multiply-accumulate instructions, also known as WGMMA.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155676
This is quite the niche intrinsic, whose whole purpose is to be able to essentially split an SSA value to be able to attach additional information to the new value.
It therefore really acts like a noop that just passes through its argument. It interestingly does not have any support in CodeGen and is therefore required to also be deleted by any pass creating it.
Differential Revision: https://reviews.llvm.org/D155678
`setListener` is dangerous because an already registered listener may accidentally be overwritten/replaced. (A `ForwardingListener` must be used in such cases.) This change updates a few trivial call sites of `setListener`, where no forwarding listener is needed.
Differential Revision: https://reviews.llvm.org/D155599
This is specified in the spec, but we just never really needed it. This
allows for users of the LSP libraries to inspect information about the
client that is connected to the server.
Differential Revision: https://reviews.llvm.org/D155566
Use EnumAttr instead of I64EnumAttr for the DataClause. This makes
tests more readable, since one can now use the format
`#acc<data_clause acc_copyin>` instead of just a number.
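Schematically (the values are illustrative):
```
// Before: dataClause = 1 : i64
// After:  dataClause = #acc<data_clause acc_copyin>
```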
Reviewed By: clementval, vzakhari
Differential Revision: https://reviews.llvm.org/D155605
Author `inferReturnTypes` methods with the op adaptor by using the `InferTypeOpAdaptor`.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D155115
This patch adds an interface, named AggregatedOpInterface, that decomposes
complex operations into simpler ones.
For now, make the interface specific to Linalg because although the concept
is general, the way to materialize it needs some maturing.
Use that interface with the softmax operator.
Differential Revision: https://reviews.llvm.org/D154363
This functionality has been replaced by TypeCasters (see D151840)
Depends on D154468
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D154469
A declare directive is used to specify the creation of a visible device
copy of a variable for the duration of the implicit data region as it
relates to the scope in which the variable is declared.
In order to support this, the following new operations were added:
1) `acc.global_ctor` and `acc.global_dtor`. These are used whenever the
declare directive applies to a global.
2) `acc.declare_enter` and `acc.declare_exit`. These operations are
modeled similarly to `acc.enter_data` and `acc.exit_data`. The reason
they are not modeled like `acc.data` is so that these operations can be
used both for globals and regions like functions.
3) `acc.declare_device_resident` and `acc.declare_link`. These
operations are modeled in a manner consistent with the previously
defined data entry operation model.
The `acc.getdeviceptr` operation was generalized so that it can be used
with `acc.declare_exit`.
Reviewed By: clementval, vzakhari
Differential Revision: https://reviews.llvm.org/D155322
This patch implements domain and range restriction for `PresburgerRelation`.
Reviewed By: Groverkss
Differential Revision: https://reviews.llvm.org/D154798
This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.
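Schematically, and as a sketch only (abbreviated op forms, not actual output from the transform):
```
// Before: global -> shared staging expressed as vector transfers.
%v = vector.transfer_read %global[%i, %j], %pad
    : memref<1024x1024xf32>, vector<4xf32>
vector.transfer_write %v, %shared[%i, %j]
    : vector<4xf32>, memref<128x32xf32, 3>

// After: an asynchronous device copy with explicit completion.
%token = nvgpu.device_async_copy %global[%i, %j], %shared[%i, %j], 4
    : memref<1024x1024xf32> to memref<128x32xf32, 3>
%group = nvgpu.device_async_create_group %token
nvgpu.device_async_wait %group
```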
Differential Revision: https://reviews.llvm.org/D155569
These two headers both contained a strange mix of definitions related to
both patterns and non-pattern transforms. Put patterns and "populate"
functions into Patterns.h and standalone transforms into Transforms.h.
Depends On: D155223
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155454
At the moment, SME-to-LLVM lowerings rely entirely on
`LLVMTypeConverter`. This patch introduces a dedicated `TypeConverter`
that inherits from `LLVMTypeConverter` (it will also be used when
lowering ArmSME Ops to LLVM).
The new type converter merely disables lowerings for `VectorType` to
prevent 2-d scalable vectors (common in the context of ArmSME), e.g.
`vector<[16]x[16]xi8>`,
from entering the LLVM type converter. LLVM does not support arrays of
scalable vectors, hence the need for specialisation. In the case of
SME, such types are effectively eliminated when emitting LLVM IR
intrinsics for SME.
Differential Revision: https://reviews.llvm.org/D155365
This patch adds a pass '-allocate-sme-tiles' to the ArmSME dialect that
implements allocation of SME ZA tiles.
It does this at the 'func.func' op level by replacing
'arm_sme.get_tile_id' ops with 'arith.constant' ops that represent the
tile number. The tiles in use in a given function are tracked by an
integer function attribute 'arm_sme.tiles_in_use' that is a 16-bit tile
mask with a bit for each 128-bit element tile (ZA0.Q-ZA15.Q), the
smallest ZA tile granule. This is initialized on the first
'arm_sme.get_tile_id' rewrite and updated on each subsequent rewrite.
Mixing of different element tile types is supported.
Section B2.3.2 of the SME spec [1] describes how the 128-bit element
tiles overlap with other element tiles.
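Schematically (a sketch; the tile-id type and the attribute rendering are illustrative):
```
// Before: an abstract "virtual tile" id.
%tile_id = arm_sme.get_tile_id : i8
// After -allocate-sme-tiles: a concrete tile number; the enclosing
// func.func carries the updated arm_sme.tiles_in_use mask attribute.
%tile_id = arith.constant 0 : i8
```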
Depends on D154941
[1] https://developer.arm.com/documentation/ddi0616/aa
Reviewed By: awarzynski
Differential Revision: https://reviews.llvm.org/D154955
At the moment, the lowering from the Vector dialect to SME looks like
this:
* Vector --> SME LLVM IR intrinsics
This patch introduces a new lowering layer between the Vector dialect
and the Arm SME extension:
* Vector --> ArmSME dialect (custom Ops) --> SME LLVM IR intrinsics.
This is motivated by 2 considerations:
1. Storing `ZA` to memory (e.g. `vector.transfer_write`) requires an
`scf.for` loop over all rows of `ZA`. Similar logic will apply to
"load to ZA from memory". This is a rather complex transformation and
a custom Op seems justified.
2. As discussed in [1], we need to prevent the LLVM type converter from
having to convert types unsupported in LLVM, e.g.
`vector<[16]x[16]xi8>`. A dedicated abstraction layer with custom Ops
opens a path to some fine tuning (e.g. custom type converters) that
will allow us to avoid this.
To facilitate this change, two new custom SME Ops are introduced:
* `TileStoreOp`, and
* `ZeroOp`.
Note that no new functionality is added - these Ops merely model what's
already supported. In particular, the following tile size is assumed
(dimension and element size are fixed):
* `vector<[16]x[16]xi8>`
The new lowering layer is introduced via a conversion pass between the
Vector and the SME dialects. You can use the `-convert-vector-to-sme`
flag to run it. The following function:
```
func.func @example(%arg0 : memref<?x?xi8>) {
// (...)
%cst = arith.constant dense<0> : vector<[16]x[16]xi8>
vector.transfer_write %cst, %arg0 : vector<[16]x[16]xi8>, memref<?x?xi8>
return
}
```
would be lowered to:
```
func.func @example(%arg0: memref<?x?xi8>) {
// (...)
%0 = arm_sme.zero : vector<[16]x[16]xi8>
arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, vector<[16]x[16]xi8>
return
}
```
Later, a mechanism will be introduced to guarantee that `arm_sme.zero`
and `arm_sme.tile_store` operate on the same virtual tile. For `i8`
elements this is not required as there is only one tile.
In order to lower the above output to LLVM, use
* `-convert-vector-to-llvm="enable-arm-sme"`.
[1] https://github.com/openxla/iree/issues/14294
Reviewed By: WanderAway
Differential Revision: https://reviews.llvm.org/D154867
This patch adds three new custom ops to the ArmSME dialect:
* arm_sme.get_tile_id - returns a scalar integer representing an SME
"virtual tile" that is not in use.
* arm_sme.cast_tile_to_vector - casts from a tile id to a 2-d scalable
vector type, which represents an SME "virtual tile".
* arm_sme.cast_vector_to_tile - casts from a 2-d scalable vector type,
which represents an SME "virtual tile", to a tile id.
The 'arm_sme.get_tile_id' op currently only supports tile 0, a follow-up
patch will implement proper tile allocation. A further follow-up patch
will demonstrate load/store to/from ZA using these ops.
See the op descriptions for further details and examples.
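A sketch of how the three ops compose (the types are illustrative):
```
// Acquire an unused virtual tile and view it as a 2-d scalable vector.
%tile_id = arm_sme.get_tile_id : i8
%tile = arm_sme.cast_tile_to_vector %tile_id : i8 to vector<[16]x[16]xi8>
// ... use %tile ...
// Recover the tile id from the vector when needed.
%id = arm_sme.cast_vector_to_tile %tile : vector<[16]x[16]xi8> to i8
```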
Thanks to @paulwalker-arm and @awarzynski for helping drive this.
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D154941
The dealloc operation deallocates each of the given memrefs if there is no alias
to that memref in the list of retained memrefs and the corresponding
condition value is set. This condition can be used to indicate and pass on
ownership of memref values (or in other words, the responsibility of
deallocating that memref). If two memrefs alias each other, only one will be
deallocated to avoid double free situations.
The memrefs to be deallocated must be the originally allocated memrefs,
however, the memrefs to be retained may be arbitrary memrefs.
Returns a list of conditions corresponding to the list of memrefs, which
indicates the new ownership: if a memref was deallocated, its
ownership is dropped (set to 'false'); otherwise, it is the same as the
input condition.
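A sketch per the description above (the assembly format is abbreviated):
```
// Deallocate %a0 and %a1 (guarded by %cond0 and %cond1) unless they
// alias a retained memref; results are the updated ownership bits.
%own:2 = bufferization.dealloc (%a0, %a1 : memref<2xf32>, memref<4xi32>)
             if (%cond0, %cond1)
             retain (%r0, %r1 : memref<2xf32>, memref<4xi32>)
```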
Differential Revision: https://reviews.llvm.org/D155467
Delete the backslash. It was there to make the TableGen file compile, but it looks like a space also works fine.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D155474
This revision fixes `hasTensorSemantics` and `hasBufferSemantics` for vector transfer ops, which may have a vector operand. `VectorType` implements `ShapedType` and such operands do not affect whether an op has tensor or buffer semantics. Also implement `DestinationStyleOpInterface` on `TransferReadOp` so that `hasTensorSemantics`/`hasBufferSemantics` can be called. (The op has no inits, but this makes it symmetric to `TransferWriteOp`.)
Differential Revision: https://reviews.llvm.org/D155469