This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
Currently, `ROCDL::GridDim*Op` is being translated to `__ockl_get_global_size`, however
to match the meaning of `gpu.grid_dim` it should instead be translated to
`__ockl_get_num_groups`. This change would also make it agree with the meaning
of `gridDimx.*` in HIP, see:
https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/include/hip/amd_detail/amd_hip_runtime.h#L257
Difference between the functions:
```
__ockl_get_global_size = blockDim * numBlocks
__ockl_get_num_groups = numBlocks
```
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D156009
`changed` was not updated correctly when it was already set to "true" before calling `applyPatternsAndFoldGreedily`.
Differential Revision: https://reviews.llvm.org/D155934
Add a transform that eliminates dead operations. This is useful after certain transforms (such as fusion) that create/clone new IR but leave the original IR in place.
Differential Revision: https://reviews.llvm.org/D155954
when operating integer type tensors, tosa elementwise multiplication
requires the element type of result to be a 32-bit integer rather
than the same type as inputs.
Change-Id: Ifd3d7ebd879be5c6b2c8e23aa6d7ef41f39c6d41
Reviewed By: mgehre-amd
Differential Revision: https://reviews.llvm.org/D154988
- Wrote complete documentation for the `Broadcastable` op trait. This is mostly meant as a thorough description of its previous behavior, with the exception of minor feature updates.
- Restricted legality criteria for a `Broadcastable` op in order to simplify current and future lowering passes and increase efficiency of code generated by those passes. New restriction are: 1) A dynamic dimension in an inferred result is not compatible with a static dimension in the actual result. 2) Broadcast semantics are restricted to input operands and not supported between inferred and actual result shapes.
- Implemented TOSA-to-Linalg lowering support for unary, binary, tertiary element-wise ops. This support is complete for all legal cases described in the `Broadcastable` trait documentation.
- Added unit tests for `tosa.abs`, `tosa.add`, and `tosa.select` as examples of unary, binary, and tertiary ops.
Reviewed By: eric-k256
Differential Revision: https://reviews.llvm.org/D153291
This commit adds a utility to implement liveness analysis using the
sparse backward data-flow analysis framework. Theoretically, liveness
analysis assigns liveness to each (value, program point) pair in the
program and it is thus a dense analysis. However, since values are
immutable in MLIR, a sparse analysis, which will assign liveness to
each value in the program, suffices here.
Liveness analysis has many applications. It can be used to avoid the
computation of extraneous operations that have no effect on the memory
or the final output of a program. It can also be used to optimize
register allocation. Both of these applications help achieve one very
important goal: reducing runtime.
A value is considered "live" iff it:
(1) has memory effects OR
(2) is returned by a public function OR
(3) is used to compute a value of type (1) or (2).
It is also to be noted that a value could be of multiple types (1/2/3) at
the same time.
A value "has memory effects" iff it:
(1.a) is an operand of an op with memory effects OR
(1.b) is a non-forwarded branch operand and a block where its op could
take the control has an op with memory effects.
A value `A` is said to be "used to compute" value `B` iff `B` cannot be
computed in the absence of `A`. Thus, in this implementation, we say that
value `A` is used to compute value `B` iff:
(3.a) `B` is a result of an op with operand `A` OR
(3.b) `A` is used to compute some value `C` and `C` is used to compute
`B`.
---
It is important to note that there already exists an MLIR liveness
utility here: llvm-project/mlir/include/mlir/Analysis/Liveness.h. So,
what is the need for this new liveness analysis utility being added by
this commit? That need is explained as follows:-
The similarities between these two utilities is that both use the
fixpoint iteration method to converge to the final result of liveness.
And, both have the same theoretical understanding of liveness as well.
However, the main difference between (a) the existing utility and (b)
the added utility is the "scope of the analysis". (a) is restricted to
analysing each block independently while (b) analyses blocks together,
i.e., it looks at how the control flows from one block to the other,
how a caller calls a callee, etc. The restriction in the former implies
that some potentially non-live values could be marked live and thus the
full potential of liveness analysis will not be realised.
This can be understood using the example below:
```
1 func.func private @private_dead_return_value_removal_0() -> (i32, i32) {
2 %0 = arith.constant 0 : i32
3 %1 = arith.addi %0, %0 : i32
4 return %0, %1 : i32, i32
5 }
6 func.func @public_dead_return_value_removal_0() -> (i32) {
7 %0:2 = func.call @private_dead_return_value_removal_0() : () -> (i32, i32)
8 return %0#0 : i32
9 }
```
Here, if we just restrict our analysis to a per-block basis like (a), we
will say that the %1 on line 3 is live because it is computed and then
returned outside its block by the function. But, if we perform a
backward data-flow analysis like (b) does, we will say that %0#1 of line
7 is not live because it isn't returned by the public function and thus,
%1 of line 3 is also not live. So, while (a) will be unable to suggest
any IR optimizations, (b) can enable this IR to convert to:-
```
1 func.func private @private_dead_return_value_removal_0() -> i32 {
2 %0 = arith.constant 0 : i32
3 return %0 : i32
4 }
5 func.func @public_dead_return_value_removal_0() -> i32 {
6 %0 = call @private_dead_return_value_removal_0() : () -> i32
7 return %0 : i32
8 }
```
One operation was removed and one unnecessary return value of the
function was removed and the function signature was modified. This is an
optimization that (b) can enable but (a) cannot. Such optimizations can
help remove a lot of extraneous computations that are currently being
done.
Signed-off-by: Srishti Srivastava <srishtisrivastava.ai@gmail.com>
Reviewed By: matthiaskramm, jcai19
Differential Revision: https://reviews.llvm.org/D153779
Move a bunch of lingering test cases from test/Transforms/ into
test/Dialect/Affine and MemRef.
Differential Revision: https://reviews.llvm.org/D155855
Preserve destination passing style (DPS) when decomposing
`linalg.Softmax`; instead of creating a new empty, which may materialize
as a new buffer after bufferization, use the result directly.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D155942
Add an option that does not bufferize the targeted op itself, but just materializes a buffer for the destination operands. This is useful for partial bufferization of complex ops such as `scf.forall`, which need special handling (and an analysis if the region).
Differential Revision: https://reviews.llvm.org/D155946
Applying the canonicalizer and CSE in an interleaved fashion is useful after bufferization (and maybe other transforms) to fold away self copies.
Differential Revision: https://reviews.llvm.org/D155933
To keep the pass simple, users should apply cleanup passes manually when necessary. In particular, `-cse -canonicalize` are often desireable to fold away self-copies that are created by the bufferization.
This addresses a comment in D120191.
Differential Revision: https://reviews.llvm.org/D155923
Allow the `names` argument in `MatchOp.match_op_names` to be of type
`str` in addition to `Sequence[str]`. In this case, the argument is
treated as a list with one name, i.e., it is possible to write
`MatchOp.match_op_names(..., "test.dummy")` instead of
`MatchOp.match_op_names(..., ["test.dummy"])`.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155807
This patch extends the diagnostic output of `FuseIntoContainingOp` when
it fails to find the next producer by also provided the location of the
affected transform op.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155803
The Op creates a tensor map descriptor object representing tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls calls CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
Similarly to when using `lli`, make sure that when using
`mlir-cpu-runner` with an emulator, a full path to `mlir-cpu-runner` is
used. Otherwise `mlir-cpu-runner` won't be found and you will get the
following error:
```
Error while loading mlir-cpu-runner: No such file or directory
```
This patch should fix:
* https://lab.llvm.org/buildbot/#/builders/179
The breakage was originally introduced in
https://reviews.llvm.org/D155405.
Differential Revision: https://reviews.llvm.org/D155920
RegionBranchOpInterface did not allow the operation with regions to
specify itself as successors. Therefore, this implied that the control
is always transferred to a region before being transferred back to the
parent op. Since the region can only transfer the control back to the
parent op from a terminator, this transitively implied that the first
block of any region with a RegionBranchOpInterface is always executed
until the terminator can transfer the control flow back. This is
trivially false for any conditional-like operation that may or may not
execute the region, as well as for loop-like operations that may not
execute the body.
Remove the restriction from the interface description and update the
only transform that relied on it.
See
https://discourse.llvm.org/t/rfc-region-control-flow-interfaces-should-encode-region-not-executed-correctly/72103.
Depends On: https://reviews.llvm.org/D155757
Reviewed By: Mogball, springerm
Differential Revision: https://reviews.llvm.org/D155822
Initial implementations of dense dataflow analyses feature special cases
for operations that have region- or call-based control flow by
leveraging the corresponding interfaces. This is not necessarily
sufficient as these operations may influence the dataflow state by
themselves as well we through the control flow. For example,
`linalg.generic` and similar operations have region-based control flow
and their proper memory effects, so any memory-related analyses such as
last-writer require processing `linalg.generic` directly instead of, or
in addition to, the region-based flow.
Provide hooks to customize the processing of operations with region-
cand call-based contol flow in forward and backward dense dataflow
analysis. These hooks are trigerred when control flow is transferred
between the "main" operation, i.e. the call or the region owner, and
another region. Such an apporach allows the analyses to update the
lattice before and/or after the regions. In the `linalg.generic`
example, the reads from memory are interpreted as happening before the
body region and the writes to memory are interpreted as happening after
the body region. Using these hooks in generic analysis may require
introducing additional interfaces, but for now assume that the specific
analysis have spceial cases for the (rare) operaitons with call- and
region-based control flow that need additional processing.
Reviewed By: Mogball, phisiart
Differential Revision: https://reviews.llvm.org/D155757
Current transformation expects module op to be two level higher, however, it is not always the case. This work searches module op in a while loop.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155825
This work adds `nvgpu.tma.async.load` Op that requests tma load asyncronusly using mbarrier object.
It also creates nvgpu.tma.descriptor type. The type is supposed be created by `cuTensorMapEncodeTiled` cuda drivers api.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155453
Linalg structure ops do not implement control flow in the way expected
by RegionBranchOpInterface, and the interface implementation isn't
actually used anywhere. The presence of this interface without correct
implementation is confusing for, e.g., dataflow analyses.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D155841
Using properties currently requires at the very least implementing four methods/code snippets:
* `convertToAttribute`
* `convertFromAttribute`
* `writeToMlirBytecode`
* `readFromMlirBytecode`
This makes replacing attributes with properties harder than it has to be: Attributes by default do not require immediately defining custom bytecode encoding.
This patch therefore adds opt-in implementations of `writeToMlirBytecode` and `readFromMlirBytecode` that work with the default implementations of `convertToAttribute` and `convertFromAttribute`. They are provided by `defvar`s in `OpBase.td` and can be used by adding:
```
let writeToMlirBytecode = writeMlirBytecodeWithConvertToAttribute;
let readFromMlirBytecode = readMlirBytecodeUsingConvertFromAttribute;
```
to ones TableGen definition.
While this bytecode encoding is almost certainly not ideal for a given property, it allows more incremental use of properties and getting something sane working before optimizing the bytecode format.
Differential Revision: https://reviews.llvm.org/D155286
This commit comprises a number of related changes:
(1) Reintroduces the semantic distinction between `parseVarUsage` vs `parseVarBinding`, adds documentation explaining the distinction, and adds commentary to the one place that violates the desired/intended semantics.
(2) Improves documentation/commentary about the forward-declaration of level-vars, and about the meaning of the `bool` parameter to `parseLvlSpec`.
(2) Removes the `VarEnv::addVars` method, and instead has `DimLvlMapParser` handle the conversion issues directly. In particular, the parser now stores and maintains the `{dims,lvls}AndSymbols` arrays, thereby avoiding the O(n^2) behavior of scanning through the entire `VarEnv` for each `parse{Dim,Lvl}Spec` call. Unfortunately there still remains another source of O(n^2) behavior, namely: the `AsmParser::parseAffineExpr` method will copy the `DimLvlMapParser::{dims,lvls}AndSymbols` arrays into `AffineParser::dimsAndSymbols` on each `parse{Dim,Lvl}Spec` call; but fixing that would require extensive changes to `AffineParser` itself.
Depends On D155532
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D155533
Continue to work outlined in D155747 and split the main SPIR-V ops
implementation file into a few smaller and quicker to compile files.
Move control flow and memory ops to their own implementation files.
Create new `.cpp` files for tablegened code.
After this change, the `SPIRVOps.cpp` is 2k LoC-long and takes a
reasonable amount of time to compile.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D155883
Author inferReturnTypeComponents methods with the Op Adaptor by using the InferShapedTypeOpAdaptor.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D155243
In https://reviews.llvm.org/D146917, MLIR's LIT configuration was
updated to allow us to use `mlir-cpu-runner` to run Arm SVE integration
tests. That update broke the following buildbot that doesn't support
SVE:
https://lab.llvm.org/buildbot/#/builders/179/builds/6704
While that bot doesn't support SVE, it can run SVE tests under
emulation. This patch makes sure that whenever an Arm emulator is set
(via `RM_EMULATOR_EXECUTABLE` CMake variable), it is used to run both
`lli` _and_ `mlir-cpu-runner`.
I am sending this without a review as it's a rather trivial change and I
want to quickly fix the spurious bot failure.
Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate
matrix multiplication on RDNA3 architectures. LLVM already provides a
set of intrinsics to generate wmma instructions. This change uses those
intrinsics to enable the feature in MLIR.
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D152451
Continue to work outlined in D155747 and split the main SPIR-V ops
implementation file into a few smaller and quicker to compile files.
This organization matches the op definition organizaion in `.td` files.
In this patch, extract atomic, cast/conversion, and group op
implementation into separate files.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D155777
This patch adds a mixin for ApplyPatternsOp to _transform_ops_ext.py
with syntactic sugar for construction such ops. Curiously, the op did
not have any constructors yet, probably because its tablegen definition
said to skip the default builders. The new constructor is thus quite
straightforward. The commit also adds a refined `region` property which
returns the first block of the single region.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155435
This gives the region a more meaningful name. The topic came up in a
discussion on https://reviews.llvm.org/D155435, where the name `region`
would have led to a situation where a convenience accessor called
`region` (after the ODS name) would have returned a Block.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155810
This patch adds a mix-in class for MapForallToBlocks with overloaded
constructors. This makes it optional to provide the return type of the
op, which is defaulte to `AnyOpType`.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D155717
This patch modifies the construction of the `OpenMPIRBuilder` in MLIR to
initialize the `IsGPU` flag using target triple information passed down from
the Flang frontend. If not present, it will default to `false`.
This replicates the behavior currently implemented in Clang, where the
`CodeGenModule::createOpenMPRuntime()` method creates a different
`CGOpenMPRuntime` instance depending on the target triple, which in turn has an
effect on the `IsGPU` flag of the `OpenMPIRBuilderConfig` object.
Differential Revision: https://reviews.llvm.org/D151903
Currently, inlining a function with a `noalias` parameter leads to a large loss of optimization potential as the `noalias` parameter, an important hint for alias analysis, is lost completely.
This patch fixes this with the same approach as LLVM by annotating all users of the `noalias` parameter with appropriate alias and noalias scope lists.
The implementation done here is not as sophisticated as LLVMs, which has more infrastructure related to escaping and captured pointers, but should work in the majority of important cases.
Any deficiency can be addressed in future patches.
Related LLVM code: 27ade4b554/llvm/lib/Transforms/Utils/InlineFunction.cpp (L1090)
Differential Revision: https://reviews.llvm.org/D155712
This work adds two Ops:
`mbarrier.arrive.expect_tx` performs expect_tx `mbarrier.barrier` returns `mbarrier.barrier.token`
`mbarrier.try_wait.parity` waits on `mbarrier.barrier` and `mbarrier.barrier.token`
`mbarrier.arrive.expect_tx` is one of the requirement to enable H100 TMA support.
Depends on D154074 D154076 D154059 D154060
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D154094
This revision adds a branch weight op interface for the call / branch
operations that support branch weights. It can be used in the LLVM IR
import and export to simplify the branch weight conversion. An
additional mapping between call operations and instructions ensures
the actual conversion can be done in the module translation itself,
rather than in the dialect translation interface. It also has the
benefit that downstream users can amend custom metadata to the call
operation during the export to LLVM IR.
Reviewed By: zero9178, definelicht
Differential Revision: https://reviews.llvm.org/D155702
Handling access groups is luckily rather trivial: Any access groups from the call instruction are simply appended to any memory operations.
This is similar to one of the steps when handling alias scopes.
This patch nevertheless implements it as a separate function purely for readability purposes as it uses a different interface than alias scopes.
Differential Revision: https://reviews.llvm.org/D155795
Also update the documentation of `Operation::fold`, which did not take into account in-place foldings.
Differential Revision: https://reviews.llvm.org/D155691