Two back-to-back transpose operations are folded into a single transpose whose permutation vector is the composition of the two original permutations.
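As a hedged before/after sketch (assuming the folding applies to `vector.transpose`; the shapes and permutations here are invented for illustration):
```
// Before: two back-to-back transposes.
%1 = vector.transpose %0, [1, 0, 2] : vector<2x3x4xf32> to vector<3x2x4xf32>
%2 = vector.transpose %1, [0, 2, 1] : vector<3x2x4xf32> to vector<3x4x2xf32>

// After: one transpose whose permutation is the composition [1, 2, 0].
%2 = vector.transpose %0, [1, 2, 0] : vector<2x3x4xf32> to vector<3x4x2xf32>
```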
Differential Revision: https://reviews.llvm.org/D77331
A number of EDSCs have a named form (e.g. `linalg.matmul`) and a generic form (e.g. `linalg.generic` with matmul traits).
Despite living in different namespaces, using the same name is confusing to clients.
Rename them to `linalg_matmul` and `linalg_generic_matmul`, respectively.
Add a method that, given an affine map, returns another map with just its unique
results. Use this to drop redundant bounds in max/min maps for affine.for. Update
affine.for's canonicalization pattern and createCanonicalizedForOp to use
this.
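As a hedged sketch (the map and bounds are invented for illustration), a `min` upper bound whose map repeats a result can now be simplified:
```
// Upper-bound map with a duplicated result expression.
#ub = affine_map<()[s0] -> (s0, s0, 1024)>
affine.for %i = 0 to min #ub()[%N] {
  "work"() : () -> ()
}

// After dropping the redundant result:
#ub_unique = affine_map<()[s0] -> (s0, 1024)>
affine.for %i = 0 to min #ub_unique()[%N] {
  "work"() : () -> ()
}
```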
Differential Revision: https://reviews.llvm.org/D77237
Summary:
This is to allow optimizations like loop-invariant code motion to work
on the ParallelOp.
Also includes small cleanups to the ForOp implementation of
LoopLikeInterface and to the loop-invariant-code-motion test file.
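For illustration, a minimal sketch (values invented) of what LICM can now do on a parallel loop:
```
// %inv does not depend on %i, so loop-invariant code motion can now
// hoist it above the loop.parallel.
loop.parallel (%i) = (%lb) to (%ub) step (%c1) {
  %inv = addf %a, %b : f32
  store %inv, %buf[%i] : memref<?xf32>
}
```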
Differential Revision: https://reviews.llvm.org/D77128
This revision adds support for generating utilities for passes, such as options/statistics/etc., that can be inferred from the tablegen definition. This removes additional boilerplate from the pass, and also makes it easier to remove the reliance on the pass registry to provide certain things (e.g., the pass argument).
Differential Revision: https://reviews.llvm.org/D76659
This generates a Passes.td for all of the dialects that have transformation passes. This removes the need for global registration for all of the dialect passes.
Differential Revision: https://reviews.llvm.org/D76657
Summary:
OpBuilder(Block) is specifically replaced with
OpBuilder::atBlockEnd(Block).
This makes the insertion behavior explicit, since there is no single
correct answer for where the default insertion point in a block
should be.
Differential Revision: https://reviews.llvm.org/D77060
Summary:
The RAW fusion happens only if the producer block dominates the consumer block.
The WAW pattern also works under the same precondition: if the producer
dominates the consumer, the two can be fused.
Since both ops are tilable, we can think of the pattern like this:
Input:
```
linalg_op1 view
tile_loop
  subview_2
  linalg_op2 subview_2
```
Tile the first Linalg op in the same way as the second one:
```
tile_loop
  subview_1
  linalg_op1 subview_1
tile_loop
  subview_2
  linalg_op2 subview_2
```
Since the first Linalg op is tilable in the same way and the tiles' computations
are independent, it is valid to fuse it with the second Linalg op:
```
tile_loop
  subview_1
  linalg_op1 subview_1
  linalg_op2 subview_2
```
In short, this patch includes:
- Handling both RAW and WAW patterns.
- Adding an interface method to get input and output buffers.
- Exposing a method to get a StringRef of a dependency type.
- Fixing existing WAW tests and adding one more use case: initializing the
buffer before a conv op.
Differential Revision: https://reviews.llvm.org/D76897
Summary:
Performs an N-D pooling operation similar to the description in the TF
documentation:
https://www.tensorflow.org/api_docs/python/tf/nn/pool
Unlike the TF operation, this operation doesn't operate on batch and
channel dimensions; it only takes tensors of rank `N`:
```
output[x[0], ..., x[N-1]] =
    REDUCE_{z[0], ..., z[N-1]}
      input[
        x[0] * strides[0] - pad_before[0] + dilation_rate[0] * z[0],
        ...
        x[N-1] * strides[N-1] - pad_before[N-1] + dilation_rate[N-1] * z[N-1]
      ],
```
The optional attributes are:
- strides: an i64 array specifying the stride (i.e., step) for window
loops.
- dilations: an i64 array specifying the filter upsampling/input
downsampling rate.
- padding: an i64 array of pairs (low, high) specifying the number of
elements to pad along a dimension.
If the strides or dilations attributes are missing, the default value is
one for each of the input dimensions. Similarly, padding values default to
zero for both low and high in each of the dimensions.
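A hedged usage sketch (the `linalg.pooling_max` name follows this revision's naming; the operand order, the window shape being conveyed by the second memref, and the exact types are assumptions):
```
// Max-pooling over a rank-2 input with stride 2 in each window loop;
// dilations and padding take their defaults.
linalg.pooling_max(%input, %window, %output) {strides = [2, 2]} :
  memref<?x?xf32>, memref<2x2xi32>, memref<?x?xf32>
```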
Differential Revision: https://reviews.llvm.org/D76414
The existing Linalg tiling implementation still works for tiling
the batch dimensions of the convolution op.
Differential Revision: https://reviews.llvm.org/D76637
Summary:
Add support for TupleGetOp folding through InsertSlicesOp and ExtractSlicesOp.
Vector-to-vector transformations for unrolling and lowering to hardware vectors
can generate chains of structured vector operations (InsertSlicesOp,
ExtractSlicesOp and ShapeCastOp) between the producer of a hardware vector
value and its consumer. Because InsertSlicesOp, ExtractSlicesOp and ShapeCastOp
are structured, we can track the location (tuple index and vector offsets) of
the consumer vector value through the chain of structured operations to the
producer, enabling much more powerful producer-consumer forwarding of values
through structured ops and tuples, which in turn enables a more powerful
TupleGetOp folding transformation.
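A small sketch of the kind of forwarding this enables (op syntax and slice sizes are assumptions based on the structured vector ops named above):
```
// %a, %b : vector<2xf32>, produced upstream.
%t  = vector.tuple %a, %b : vector<2xf32>, vector<2xf32>
%v  = vector.insert_slices %t, [2], [1]
        : tuple<vector<2xf32>, vector<2xf32>> into vector<4xf32>
%t2 = vector.extract_slices %v, [2], [1]
        : vector<4xf32> into tuple<vector<2xf32>, vector<2xf32>>
// Tracking (tuple index, vector offsets) through the chain lets this
// tuple_get fold directly to %b.
%s  = vector.tuple_get %t2, 1 : tuple<vector<2xf32>, vector<2xf32>>
```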
Reviewers: nicolasvasilache, aartbik
Reviewed By: aartbik
Subscribers: grosul1, mehdi_amini, rriddle, jpienaar, burmako, shauheen, antiagainst, arpith-jacob, mgester, lucyrfox, liufengdb, Joonsoo, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76889
Summary: This object file has grown beyond the default limit, and elsewhere in LLVM we seem to set this flag as a one-off, so we continue that practice here.
Reviewers: mravishankar, antiagainst
Subscribers: mgorny, mehdi_amini, rriddle, jpienaar, burmako, shauheen, antiagainst, nicolasvasilache, arpith-jacob, mgester, lucyrfox, liufengdb, Joonsoo, bader, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77002
This patch introduces a utility to separate full tiles from partial
tiles when tiling affine loop nests where trip counts are unknown or
where tile sizes don't divide trip counts. A conditional guard is
generated to separate out the full tile (with constant trip count loops)
into the then block of an 'affine.if' and the partial tile to the else
block. The separation allows the 'then' block (which has constant trip
count loops) to be optimized better subsequently: e.g., via unroll-and-jam
or register tiling, vectorized without generating cleanup code, or
offloaded to accelerators. Among techniques from the
literature, the if/else based separation leads to the most compact
cleanup code for multi-dimensional cases (because a single version is
used to model all partial tiles).
INPUT
```
affine.for %i0 = 0 to %M {
  affine.for %i1 = 0 to %N {
    "foo"() : () -> ()
  }
}
```
OUTPUT AFTER TILING W/O SEPARATION
```
#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0] -> (d0 + 32, s0)>
affine.for %arg2 = 0 to %M step 32 {
  affine.for %arg3 = 0 to %N step 32 {
    affine.for %arg4 = #map0(%arg2) to min #map1(%arg2)[%M] {
      affine.for %arg5 = #map0(%arg3) to min #map1(%arg3)[%N] {
        "foo"() : () -> ()
      }
    }
  }
}
```
OUTPUT AFTER TILING WITH SEPARATION
```
#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 32)>
#map2 = affine_map<(d0)[s0] -> (d0 + 32, s0)>
#set0 = affine_set<(d0, d1)[s0, s1] : (-d0 + s0 - 32 >= 0, -d1 + s1 - 32 >= 0)>
affine.for %arg2 = 0 to %M step 32 {
  affine.for %arg3 = 0 to %N step 32 {
    affine.if #set0(%arg2, %arg3)[%M, %N] {
      // Full tile.
      affine.for %arg4 = #map0(%arg2) to #map1(%arg2) {
        affine.for %arg5 = #map0(%arg3) to #map1(%arg3) {
          "foo"() : () -> ()
        }
      }
    } else {
      // Partial tile.
      affine.for %arg4 = #map0(%arg2) to min #map2(%arg2)[%M] {
        affine.for %arg5 = #map0(%arg3) to min #map2(%arg3)[%N] {
          "foo"() : () -> ()
        }
      }
    }
  }
}
```
The separation is tested via a command-line flag on the loop tiling pass.
The utility itself allows one to pass in any band of contiguously nested
loops, and can be used by other transforms/utilities. The current
implementation works for hyperrectangular loop nests.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D76700
- add `to_extent_tensor`
- rename `create_shape` to `from_extent_tensor` for symmetry
- add `split_at` and `concat` ops for basic shape manipulations
This set of ops is inspired by the requirements of lowering a dynamic-shape-aware batch matmul op. For such an op, the "matrix" dimensions aren't subject to broadcasting but the others are, and so we need to slice, broadcast, and reconstruct the final output shape. Furthermore, the actual broadcasting op used downstream uses a tensor of extents as its preferred shape interface for the actual op that does the broadcasting.
However, this functionality is quite general. It's obvious that `to_extent_tensor` is needed long-term to support many common patterns that involve computations on shapes. We can evolve the shape manipulation ops introduced here. The specific choices made here took into consideration the potentially unranked nature of the !shape.shape type, which means that a simple listing of dimensions to extract isn't possible in general.
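A hedged sketch of how these ops might compose for the batch-matmul case above (generic op form; the operand/result types and the helper values `%lhs`, `%bcast_batch`, and `%result_mat` are assumptions):
```
// Peel off the trailing [m, k] "matrix" dims; only the batch prefix
// participates in broadcasting.
%cm2 = constant -2 : i32
%lhs_batch, %lhs_mat = "shape.split_at"(%lhs, %cm2)
    : (!shape.shape, i32) -> (!shape.shape, !shape.shape)
// ... broadcast %lhs_batch against the rhs batch prefix (elided) ...
// Reattach the result's matrix dims and hand off a tensor of extents.
%out = "shape.concat"(%bcast_batch, %result_mat)
    : (!shape.shape, !shape.shape) -> !shape.shape
%extents = "shape.to_extent_tensor"(%out) : (!shape.shape) -> tensor<?xindex>
```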
Differential Revision: https://reviews.llvm.org/D76817
With commit 4d60f47 VectorOps was renamed to Vector and the naming of
the CMake target was adjusted. With commit 363dd3f QuantOps was
renamed to Quant, but the naming of the CMake target was left
untouched. This patch renames that CMake target accordingly.
- add a method to get back an integer set from flat affine constraints;
this allows a round trip
- use this to complete the simplification of integer sets in
-simplify-affine-structures
- update FlatAffineConstraints::removeTrivialRedundancy to also do GCD
tightening and normalize by GCD (while still keeping it linear time).
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
The current implementation of the lowering from loop.parallel to gpu.launch
uses a DictionaryAttr to specify the mapping. This patch changes the attribute
to be auto-generated from its specification as a StructAttr, which greatly
simplifies the logic of looking up and creating this attribute.
Differential Revision: https://reviews.llvm.org/D76165
The declarations for these were already part of transforms utils, but
the definitions were left in affine transforms. Move definitions to loop
transforms utils.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D76633
Move some of the affine transforms and their test cases to their
respective dialect directory. This patch does not complete the move, but
takes care of a good part.
Renames: add the 'affine' prefix to the affine loop tiling command-line
options; vectorize -> super-vectorize.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D76565
Summary:
Change the AffineOps dialect structure to better group both IR and Transforms. This includes extracting transforms directly related to AffineOps. Also move AffineOps to Affine.
Differential Revision: https://reviews.llvm.org/D76161
Summary:
After D75831 landed, both the generic op and the indexed_generic op can
handle the 0-D edge case. The previous patch only updated the generic op;
this patch updates the lowering to loops for the indexed_generic op. Since the
two lowerings are almost the same, the patch also refactors the common part.
Differential Revision: https://reviews.llvm.org/D76413
The Vector Dialect [document](https://mlir.llvm.org/docs/Dialects/Vector/) discusses the vector abstractions that MLIR supports and the various tradeoffs involved.
One of the layers currently missing in OSS is the Hardware Vector Ops (HWV) level.
This revision proposes a new Dialect/Targets/AVX512 dialect that directly targets AVX512-specific intrinsics.
At the moment, we rely too much on LLVM's peephole optimizer to do a good job on small insertelement/extractelement/shufflevector patterns. In the future, when possible, generic abstractions such as VP intrinsics should be preferred.
The revision will allow trading off HW-specific vs generic abstractions in MLIR.
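As a flavor of what an HWV-level op could look like here (generic op form; the operand list and types are assumptions mirroring the corresponding LLVM intrinsic, not the dialect's settled syntax):
```
// Masked round-scale on a 512-bit float vector.
%0 = "avx512.mask.rndscale"(%a, %imm, %src, %k, %rounding)
    : (vector<16xf32>, i32, vector<16xf32>, i16, i32) -> vector<16xf32>
```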
Differential Revision: https://reviews.llvm.org/D75987
Summary:
These are not supported by any of the code using `type_cast`. In the general
case, such casting would require memrefs to handle a non-contiguous vector
representation or misaligned vectors (e.g., if the offset of the source memref
is not divisible by the vector size, since the offset in the target memref is
expressed in numbers of elements).
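For reference, a sketch of a cast that remains supported (a contiguous, zero-offset memref):
```
// Contiguous memref with zero offset: casting to a memref of vectors is fine.
%vm = vector.type_cast %m : memref<8x8xf32> to memref<vector<8x8xf32>>
// A memref whose layout carries a non-zero offset or non-trivial strides
// (e.g. one produced by a subview) is now rejected.
```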
Differential Revision: https://reviews.llvm.org/D76349
Non-32-bit scalar types require special hardware support that may
not exist on all Vulkan-capable GPUs. This is reflected in SPIR-V as
non-32-bit scalar types requiring special capabilities or extensions
to be used.
This commit makes SPIRVTypeConverter aware of the target environment so
that it can properly convert standard types to what is accepted in the
target environment.
Right now, if a scalar type's bitwidth is not supported in the target
environment, we unconditionally use 32 bits. This requires the Vulkan
runtime to also feed in data with a matching bitwidth and layout,
especially for interface types. The Vulkan runtime can do that by
inspecting the SPIR-V module. Longer term, we might want to introduce
a way to control how such cases are handled, and to fail explicitly
if desired.
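A sketch of the assumed fallback at the type level (storage-class inference and layout decorations elided; the concrete stride/offset annotations are assumptions):
```
// With no Int8 capability in the target environment:
//   i8 -> i32, and e.g. memref<4xi8> becomes roughly
//   !spv.ptr<!spv.struct<!spv.array<4 x i32, stride=4> [0]>, StorageBuffer>
// The Vulkan runtime must then supply data with the widened layout.
```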
Differential Revision: https://reviews.llvm.org/D76244
Types should be checked via the type hierarchy. This results in a
better division of responsibility and a cleaner API surface.
Differential Revision: https://reviews.llvm.org/D76243
This commit unifies target environment queries into a new wrapper
class spirv::TargetEnv, shared across the various places needing
the functionality. We still create multiple instances of TargetEnv,
though, since the parent components (type converters, passes,
conversion targets) have different lifetimes.
In the meantime, LowerABIAttributesPass is updated to take the
target environment into consideration, which requires updating
tests to provide one.
Differential Revision: https://reviews.llvm.org/D76242
Previously we only considered the version/extension/capability requirements
on the op itself. This commit updates SPIRVConversionTarget to also
take the types of the op's values into consideration when deciding op legality.
Differential Revision: https://reviews.llvm.org/D75876
Previously, SPIRVTypeConverter always converted memref types to the
StorageBuffer storage class regardless of their memory space. This
commit fixes that by letting the conversion properly look into the
memory space. For this purpose, a mapping between SPIR-V storage
classes and memref memory spaces is introduced. The mapping is
arbitrarily decided at the moment; the hope is that we can later
leverage string memory spaces to make it clearer.
Now spv.interface_var_abi cannot contain a storage class unless it's
attached to a scalar value, where we need the storage class as
side-channel information. Verifications and tests are adjusted accordingly.
Differential Revision: https://reviews.llvm.org/D76241
Summary:
Although bool and i1 are sometimes the same, using a bool constant matches
the semantics better. In any case, we don't have to care about the type of
the condition if we remove the initial value; the type is determined
automatically by the return type of the logical operations.
Differential Revision: https://reviews.llvm.org/D76338