Currently, the lowering for vector.step lives under a folder. This is
not ideal if we want to transform it and defer the materialization of
the constants until much later. This commit adds a rewrite pattern that
can be applied via the
`transform.structured.vectorize_children_and_apply_patterns`
transform dialect operation.
Moreover, the vector.step rewrite is now also used in the
-convert-vector-to-llvm pass, where it handles scalable and
non-scalable types as LLVM expects them.
As a consequence of removing the vector.step lowering from its folder,
linalg vectorization will keep vector.step intact.
Previously, the pass only supported emulation of loading vector sizes
that are multiples of the emulated data type. This patch expands its
support to emulating sizes that are not multiples of byte sizes. In
such cases, the element values are packed back-to-back to preserve
memory space.
To give a concrete example: if an input has type `memref<3x3xi2>`, it
actually occupies 3 bytes in memory, with the first 18 bits storing the
values and the last 6 bits as padding. The slice of `vector<3xi2>` at
index `[2, 0]` is stored in memory from bit 12 to bit 18. To load the
elements from bit 12 to bit 18 from memory, first load byte 2 and
byte 3 and convert them to a vector of `i2` type; then extract bits 4
to 10 (element index 2-5) to form a `vector<3xi2>`.
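A minimal sketch of what the emulation could look like for this example
(illustrative only, not the pass's exact output; `%storage` is a
hypothetical `i8` view of the underlying 3 bytes):
```mlir
// Load the two bytes covering bits 8-23 (0-based byte index 1),
// reinterpret them as i2 elements, then extract the three elements
// that start at bit 12.
%c1 = arith.constant 1 : index
%bytes = vector.load %storage[%c1] : memref<3xi8>, vector<2xi8>
%bits  = vector.bitcast %bytes : vector<2xi8> to vector<8xi2>
%slice = vector.extract_strided_slice %bits
           {offsets = [2], sizes = [3], strides = [1]}
           : vector<8xi2> to vector<3xi2>
```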
A limitation of this patch is that the linearized index of the unaligned
vector has to be known at compile time. Extra code would need to be
emitted to handle the case where this condition does not hold.
The following ops are updated:
* `vector::LoadOp`
* `vector::TransferReadOp`
* `vector::MaskedLoadOp`
Since ddf2d62c7d, 0-d vectors are supported in VectorType. This patch
removes the 0-d vector handling with scalars for the
TransferOpReduceRank pattern. This pattern specifically introduces
tensor.extract_slice during vectorization, causing vectorization to not
fold transfer_read/transfer_write slices properly. The changes in the
vectorization test files reflect this.
There are other places where lowering patterns still side-step handling
0-d vectors properly by turning them into scalars, but this patch only
focuses on the vector.transfer_* patterns.
This is a reasonable canonicalization because `extract` is more
constrained than `extract_strided_slice`, so there is no loss of
semantics here; we are just lifting an op to a more constrained,
special-case op. The additional `shape_cast` merely adds back the
leading unit dims to match the original result type.
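A rough sketch of the rewrite being described (shapes and offsets are
illustrative, not taken from the actual tests):
```mlir
// Before: a strided slice that covers one full row.
%s = vector.extract_strided_slice %v
       {offsets = [1, 0], sizes = [1, 4], strides = [1, 1]}
       : vector<2x4xf32> to vector<1x4xf32>
// After: lift to the more constrained vector.extract and restore the
// leading unit dim with a shape_cast.
%e = vector.extract %v[1] : vector<4xf32> from vector<2x4xf32>
%s2 = vector.shape_cast %e : vector<4xf32> to vector<1x4xf32>
```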
Context: discussion on #111541. I wasn't sure how this would turn out,
but in the process of writing this PR, I discovered at least 2 bugs in
the pattern introduced in #111541, which shows the value of shared
canonicalization patterns that are exercised on a large number of test
cases.
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.
Patterns already take a `const` type converter. However, some
`populate...` functions not only add new patterns but also add
additional type conversion rules. That makes it difficult to find the
place in the code base where a type conversion was added. With this
change, all `populate...` functions that only populate patterns now
take a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.
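For illustration, a hypothetical signature (not an actual upstream
declaration) showing what the change conveys:
```cpp
// Before: the converter could also be mutated inside the function.
void populateFooToBarConversionPatterns(TypeConverter &typeConverter,
                                        RewritePatternSet &patterns);

// After: a `const` converter signals that the function only adds
// patterns and registers no new type conversion rules.
void populateFooToBarConversionPatterns(const TypeConverter &typeConverter,
                                        RewritePatternSet &patterns);
```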
Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
`vector.transfer_*` folding and forwarding currently do not take into
account reshaping view-like memref ops (expand and collapse shape),
leading to potentially invalid store folding or value forwarding. This
patch adds tracking for those (and other) view-like ops. It is still
possible to create operations that alias memrefs without being a view
(e.g. a memref in the iter_args of an `scf.for`), so these patterns may
still need revisiting in the future.
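A sketch of the kind of aliasing hazard being addressed (ops and shapes
are illustrative):
```mlir
// The write goes through a collapse_shape view of %m, so a later read
// from %m must not be forwarded from a value stored before the write.
%view = memref.collapse_shape %m [[0, 1]]
  : memref<4x4xf32> into memref<16xf32>
vector.transfer_write %v, %view[%c0] {in_bounds = [true]}
  : vector<16xf32>, memref<16xf32>
%r = vector.transfer_read %m[%c0, %c0], %f0 {in_bounds = [true, true]}
  : memref<4x4xf32>, vector<4x4xf32>
```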
Adds a new Transform Dialect Op that collects patterns for dropping unit
dims from various Ops:
* `transform.apply_patterns.vector.drop_unit_dims_with_shape_cast`.
It excludes patterns for vector.transfer Ops - these are collected
under:
* `apply_patterns.vector.rank_reducing_subview_patterns`,
which use ShapeCastOp _and_ SubviewOp to reduce the rank (and to
eliminate unit dims).
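A sketch of how the new op could be used in a transform script (handle
names and the enclosing sequence are illustrative):
```mlir
transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
  %func = transform.structured.match ops{["func.func"]} in %module
    : (!transform.any_op) -> !transform.any_op
  // Apply only the unit-dim-dropping patterns collected by the new op.
  transform.apply_patterns to %func {
    transform.apply_patterns.vector.drop_unit_dims_with_shape_cast
  } : !transform.any_op
}
```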
This new TD Op allows us to test the "ShapeCast folder" pattern in
isolation. I've extracted the only test that I could find for that
folder from "vector-transforms.mlir" and moved it to a dedicated file:
"shape-cast-folder.mlir". I also added a test case with scalable
vectors.
The changes in VectorTransforms.cpp are not needed (I added a comment
with a TODO and ordered the patterns alphabetically); I am including
them here to avoid a separate PR.
There are some spurious libraries that can be removed.
I'm trying to bundle MLIR/LLVM library dependencies for our own
libraries. We're utilizing a CMake function to recursively collect
MLIR/LLVM-related dependencies. In doing so, we identified certain
library dependencies as redundant and safe to remove.
This adds a pattern for dropping unit dims from the iter_args of scf.for
ops using vector.shape_cast. It composes with the other patterns for
dropping unit dims from elementwise ops and transposes.
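A rough before/after sketch under assumed shapes (not the exact pattern
output):
```mlir
// Before: the loop carries a vector with a leading unit dim.
%res = scf.for %i = %c0 to %c8 step %c1 iter_args(%acc = %init)
         -> (vector<1x4xf32>) {
  %next = arith.addf %acc, %acc : vector<1x4xf32>
  scf.yield %next : vector<1x4xf32>
}
// After: the unit dim is dropped via shape_casts at the loop boundary.
%cast = vector.shape_cast %init : vector<1x4xf32> to vector<4xf32>
%for = scf.for %i = %c0 to %c8 step %c1 iter_args(%acc = %cast)
         -> (vector<4xf32>) {
  %next = arith.addf %acc, %acc : vector<4xf32>
  scf.yield %next : vector<4xf32>
}
%res2 = vector.shape_cast %for : vector<4xf32> to vector<1x4xf32>
```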
Group all patterns that re-order vector.transpose and vector.broadcast
Ops (*) under `populateSinkVectorOpsPatterns`. These patterns are
normally used to "sink" redundant Vector Ops, hence they are grouped
together.
Example:
```mlir
%at = vector.transpose %a, [1, 0]: vector<4x2xf32> to vector<2x4xf32>
%bt = vector.transpose %b, [1, 0]: vector<4x2xf32> to vector<2x4xf32>
%r = arith.addf %at, %bt : vector<2x4xf32>
```
would get converted to:
```mlir
%0 = arith.addf %a, %b : vector<4x2xf32>
%r = vector.transpose %0, [1, 0] : vector<4x2xf32> to vector<2x4xf32>
```
This patch also moves all tests for these patterns so that all of them
are:
* run under one test-flag: `test-vector-sink-patterns`,
* located in one file: "vector-sink.mlir".
To facilitate this change:
* `-test-sink-vector-broadcast` is renamed as
`test-vector-sink-patterns`,
* "sink-vector-broadcast.mlir" is renamed as "vector-sink.mlir",
* tests for `ReorderCastOpsOnBroadcast` and
`ReorderElementwiseOpsOnTranspose` patterns are moved from
"vector-reduce-to-contract.mlir" to "vector-sink.mlir",
* `ReorderElementwiseOpsOnTranspose` patterns are removed from
`populateVectorReductionToContractPatterns` and added to (newly
created) `populateSinkVectorOpsPatterns`,
* `ReorderCastOpsOnBroadcast` patterns are removed from
`populateVectorReductionToContractPatterns` - these are already
present in `populateSinkVectorOpsPatterns`.
This should allow better layering and more straightforward testing. For
the latter, the goal is to make it easy to identify which pattern a
particular test is exercising (especially when it's a specific
pattern).
NOTES FOR DOWNSTREAM USERS
In order to preserve the current functionality, please make sure to add
* `populateSinkVectorOpsPatterns`
wherever you are using `populateVectorReductionToContractPatterns`.
Also, rename `populateSinkVectorBroadcastPatterns` to
`populateSinkVectorOpsPatterns`.
(*) I didn't notice any other re-order patterns.
Adds tests for scalable vectors in:
* sink-vector-broadcast.mlir
This test file exercises patterns grouped under
`populateSinkVectorBroadcastPatterns`, which include:
* `ReorderElementwiseOpsOnBroadcast`,
* `ReorderCastOpsOnBroadcast`.
Right now there are only tests for the former. However, I've noticed
that "vector-reduce-to-contract.mlir" contains tests for the latter,
and I've left a few TODOs to group these tests back together in one
file.
Additionally, added some helpful `notifyMatchFailure` messages in
`ReorderElementwiseOpsOnBroadcast`.
This adds a new transform `eliminateVectorMasks()`, which aims at
removing scalable `vector.create_mask` ops that will be all-true at
runtime. It attempts to do this by simply pattern-matching the mask
operands (similar to some canonicalizations); if that does not lead to
an answer (is it all-true? yes/no), then value bounds analysis is used
to find the lower bound of the unknown operands. If the lower bound is
>= the corresponding mask vector type dim, then that dimension of the
mask is all-true.
Note that the pattern matching prevents expensive value-bounds analysis
in cases where the mask won't be all-true.
For example:
```mlir
%mask = vector.create_mask %dynamicValue, %c2 : vector<8x4xi1>
```
From looking at `%c2` we can tell this is not going to be an all-true
mask, so we don't need to run the value-bounds analysis for
`%dynamicValue` (and can exit the transform early).
Note: Eliminating create_masks here means replacing them with all-true
constants (which will then lead to the masks folding away).
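For completeness, a sketch of a positive case under assumed IR (values
and surrounding context are illustrative):
```mlir
// The mask size is provably >= the scalable vector dim (4 x vscale),
// so the create_mask can be replaced with an all-true constant.
%c4 = arith.constant 4 : index
%vscale = vector.vscale
%size = arith.muli %c4, %vscale : index
%mask = vector.create_mask %size : vector<[4]xi1>
// After eliminateVectorMasks():
%all_true = arith.constant dense<true> : vector<[4]xi1>
```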
da8778e499 broke the lowering of vector.transpose in which all the
dimensions are unit dimensions. This revision fixes the issue and adds
a test.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Adds tests with scalable vectors for the Vector-To-LLVM conversion pass.
Covers the following Ops:
* vector.bitcast
* vector.broadcast
Note, this has uncovered some missing logic in `BroadcastOpLowering`.
This PR fixes the most basic cases where the scalable flags were dropped
and the generated code was incorrect. Also, the conditions in
`vector::isBroadcastableTo` are relaxed to allow cases like this:
```mlir
%0 = vector.broadcast %arg0 : vector<1xf32> to vector<[4]xf32>
```
The `BroadcastOpLowering` pattern is effectively disabled for scalable
vectors in more complex cases where an SCF loop would be required to
loop over the scalable dims, e.g.:
```mlir
%0 = vector.broadcast %arg0 : vector<[4]x1x2xf32> to vector<[4]x3x2xf32>
```
These cases are marked as "Stretch not at start" in the code. In those
cases, support for scalable vectors is left as a TODO.
Disables `ContractionOpToMatmulOpLowering` for scalable vectors. This
pattern is meant to enable lowering to `llvm.matrix.multiply` - I'm not
aware of any use of that in the context of scalable vectors.
Since the `in_bounds` attribute is mandatory, there's no need for logic
like this (`readOp.getInBounds()` is guaranteed to return a non-empty
ArrayRef):
```cpp
ArrayAttr inBoundsAttr =
    readOp.getInBounds()
        ? rewriter.getArrayAttr(
              readOp.getInBoundsAttr().getValue().drop_back(dimsToDrop))
        : ArrayAttr();
```
Instead, we can do this:
```cpp
ArrayAttr inBoundsAttr = rewriter.getArrayAttr(
    readOp.getInBoundsAttr().getValue().drop_back(dimsToDrop));
```
This is a small follow-up for #97049 - this change should've been
included there.
Generalizes DropUnitDimFromElementwiseOps to support inner unit
dimensions.
This change stems from improving the lowering of contraction ops for
Arm SME, where we end up with inner unit dimensions on MulOp,
BroadcastOp and TransposeOp, preventing the generation of
outerproducts. Discussed
[here](https://discourse.llvm.org/t/on-improving-arm-sme-lowering-resilience-in-mlir/78543/17?u=nujaa).
Fix after https://github.com/llvm/llvm-project/pull/97652 showed an
unhandled edge case when all dimensions are unit dimensions. The
generated target VectorType would be `vector<f32>`, which is not
supported by the mulf. In case all dimensions are dropped, the target
vector type is now `vector<1xf32>`.
---------
Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
At the moment, the in_bounds attribute has two confusing/contradicting
properties:
1. It is both optional _and_ has an effective default value.
2. The default value is "out-of-bounds" for non-broadcast dims, and
"in-bounds" for broadcast dims.
(see the `isDimInBounds` vector interface method for an example of this
"default" behaviour [1]).
This PR aims to clarify the logic surrounding the `in_bounds` attribute
by:
* making the attribute mandatory (i.e. it is always present),
* always setting the default value to "out of bounds" (that's
consistent with the current behaviour for the most common cases).
#### Broadcast dimensions in tests
As per [2], broadcast dimensions require the corresponding `in_bounds`
attribute to be `true`:
```
vector.transfer_read op requires broadcast dimensions to be in-bounds
```
The changes in this PR mean that we can no longer rely on the
default value in cases like the following (dim 0 is a broadcast dim):
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
Instead, the broadcast dimension has to be explicitly marked as "in
bounds":
```mlir
%read = vector.transfer_read %A[%base1, %base2], %f, %mask
{in_bounds = [true, false], permutation_map = affine_map<(d0, d1) -> (0, d1)>} :
memref<?x?xf32>, vector<4x9xf32>
```
All tests with broadcast dims are updated accordingly.
#### Changes in "SuperVectorize.cpp" and "Vectorization.cpp"
The following patterns in "Vectorization.cpp" are updated to explicitly
set the `in_bounds` attribute to `false`:
* `LinalgCopyVTRForwardingPattern` and `LinalgCopyVTWForwardingPattern`
Also, `vectorizeAffineLoad` (from "SuperVectorize.cpp") and
`vectorizeAsLinalgGeneric` (from "Vectorization.cpp") are updated to
make sure that xfer Ops created by these hooks mark the dimensions
corresponding to broadcast dims as "in bounds". Otherwise, the Op
verifier would complain.
Note that there is no mechanism to verify whether the corresponding
memory accesses are indeed in bounds. Still, this is consistent with
the current behaviour, where a broadcast dim would be implicitly
assumed to be "in bounds".
[1]
4145ad2bac/mlir/include/mlir/Interfaces/VectorInterfaces.td (L243-L246)
[2]
https://mlir.llvm.org/docs/Dialects/Vector/#vectortransfer_read-vectortransferreadop
Restrict `DropInnerMostUnitDimsTransfer{Read|Write}` so that it fails
when one of the indices to be dropped could be != 0 and "out of bounds":
```mlir
func.func @negative_example(%arg0: memref<16x1xf32>, %arg1: vector<8x1xf32>, %idx_1: index, %idx_2: index) {
vector.transfer_write %arg1, %arg0[%idx_1, %idx_2] {in_bounds = [true, false]} : vector<8x1xf32>, memref<16x1xf32>
return
}
```
This is an edge case that could represent an out-of-bounds access,
though that will depend on the actual value of `%idx_2`. Importantly,
without this change it would be transformed as follows:
```mlir
func.func @negative_example(%arg0: memref<16x1xf32>, %arg1: vector<8x1xf32>, %arg2: index, %arg3: index) {
%subview = memref.subview %arg0[0, 0] [16, 1] [1, 1] : memref<16x1xf32> to memref<16xf32, strided<[1]>>
%0 = vector.shape_cast %arg1 : vector<8x1xf32> to vector<8xf32>
vector.transfer_write %0, %subview[%arg2] {in_bounds = [true]} : vector<8xf32>, memref<16xf32, strided<[1]>>
return
}
```
This is incorrect - `%idx_2` is ignored and the "out of bounds" flag is
not propagated. Hence the extra restriction to avoid such cases.
NOTE: This is a follow-up for: #94904
Many state-of-the-art models and quantization operations now work
directly with vector.contract on integers.
This commit generalizes ext-contraction folding so that we can emit
more performant vector.contract ops in codegen pipelines.
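A hedged sketch of the kind of folding being generalized (shapes, maps
and the extension op are illustrative):
```mlir
// The sign-extensions feeding the contraction are absorbed, so the
// contract consumes the narrow i8 operands directly and accumulates
// into i32.
%lhs_ext = arith.extsi %lhs : vector<4x8xi8> to vector<4x8xi32>
%rhs_ext = arith.extsi %rhs : vector<8x4xi8> to vector<8x4xi32>
%res = vector.contract {
         indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                          affine_map<(d0, d1, d2) -> (d2, d1)>,
                          affine_map<(d0, d1, d2) -> (d0, d1)>],
         iterator_types = ["parallel", "parallel", "reduction"],
         kind = #vector.kind<add>}
       %lhs_ext, %rhs_ext, %acc
       : vector<4x8xi32>, vector<8x4xi32> into vector<4x4xi32>
// After folding, the same vector.contract takes %lhs and %rhs
// (vector<4x8xi8>, vector<8x4xi8>) directly, into vector<4x4xi32>.
```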
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
This pattern flattens vector.gather ops by unrolling the outermost
dimension for rank > 2 vectors. There are two issues with this pattern
for scalable vectors:
1. The unrolling doesn't take vscale into account. A constraint is
   added to disable this pattern for vectors with leading scalable
   dims.
2. The scalable dims are dropped when creating the new gather. Fixed
   by propagating the flags.
Depends on #96049.
1-D multi-reductions are lowered to arith, which can prevent some
optimisations. I propose `ElementwiseToOuterproduct`, which matches a
series of ops to generate `vector.outerproduct`.
As part of some `ElementwiseToVectorOpsPatterns`, it could allow fusing
other elementwise ops into the vector dialect.
Originally discussed at
https://discourse.llvm.org/t/on-improving-arm-sme-lowering-resilience-in-mlir/78543/24.
Quoting @MacDue:
```mlir
%lhsBcast = vector.broadcast %lhsCast : vector<[4]xf32> to vector<[4]x[4]xf32>
%lhsT = vector.transpose %lhsBcast, [1, 0] : vector<[4]x[4]xf32> to vector<[4]x[4]xf32>
%rhsBcast = vector.broadcast %rhs : vector<[4]xf32> to vector<[4]x[4]xf32>
%mul = arith.mulf %lhsT, %rhsBcast : vector<[4]x[4]xf32>
```
Can be rewritten as:
```mlir
%mul = vector.outerproduct %lhsCast, %rhs : vector<[4]xf32>, vector<[4]xf32>
```
---------
Co-authored-by: Han-Chung Wang <hanhan0912@gmail.com>
The main goal of this and subsequent PRs is to unify and categorize
tests in:
* vector-transfer-flatten.mlir
This should make it easier to identify the edge cases being tested (and
how they differ), remove duplicates, and add tests for scalable
vectors.
The main contributions of this PR:
* split tests that covered `xfer_read` + `xfer_write` into separate
  tests (the majority of the existing tests check _one_ xfer Op at a
  time),
* organise tests for `xfer_read` and `xfer_write` into separate
  groups (separated with a big bold comment).
Note, all tests (i.e. test cases) are preserved and some new tests are
added. Deletions that you will see in `git diff` correspond to
`xfer_write` and `xfer_read` Ops being extracted to separate functions
(so that there's one xfer Op per function). In particular, the number of
test functions has grown from 26 to 30.
In addition, this PR unifies the tests so that:
* input variable names are consistent (e.g. make sure that the input
memref is always `arg`)
* CHECK lines use similar indentations
* 2 x tabs are always used for function arguments, 1 x tab for
function body
Finally, changes in "VectorTransferOpTransforms.cpp" are merely meant to
unify comments and logic between
* `FlattenContiguousRowMajorTransferWritePattern` and
* `FlattenContiguousRowMajorTransferReadPattern`.
The main goal of this PR (and subsequent PRs) is to add more tests with
scalable vectors to:
* vector-transfer-collapse-inner-most-dims.mlir
There are quite a few cases to consider, hence this is split into
multiple PRs. In this PR, the very first test for
`vector.transfer_write` is complemented with all the possible
combinations of:
* scalable (rather than fixed) unit trailing dim,
* dynamic (rather than static) trailing dim in the source memref.
To this end, the following tests:
* `@leading_scalable_dimension_transfer_write` and
  `@trailing_scalable_one_dim_transfer_write`
are replaced with:
* `@drop_two_inner_most_dim_scalable_inner_dim` and
  `@negative_scalable_unit_dim`,
respectively. In addition:
* "_for_transfer_write" is removed from function names (to reduce
  noise).
In addition, to maintain consistency between the tests for `xfer_read`
and `xfer_write`, 2 negative tests for `xfer_read` are also renamed.
This follows a suggestion made during the review of this PR.
Extra comments in "VectorTransforms.cpp" are added to better document
the limitations related to scalable vectors and which of them the tests
added here exercise.
This is a follow-up for: #94490 and #94604
NOTE: This PR is limited to tests for `vector.transfer_write`.
Generalizes `DropUnitDimFromElementwiseOps` to support inner unit
dimensions.
This change stems from improving the lowering of contraction ops for
Arm SME, where we end up with inner unit dimensions on MulOp,
BroadcastOp and TransposeOp, preventing the generation of
outerproducts. Discussed
[here](https://discourse.llvm.org/t/on-improving-arm-sme-lowering-resilience-in-mlir/78543/17?u=nujaa).
---------
Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
Implements `TransferReadToVectorLoadLowering` and
`TransferWriteToVectorStoreLowering` as `MaskableOpRewritePattern`s.
This allows exiting gracefully when the pattern is run on an xferOp
located inside a `vector::MaskOp`, instead of breaking with `error:
'vector.mask' op expects only one operation to mask` because the
pattern generated multiple ops inside the MaskOp.
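For illustration, a sketch of the kind of IR (shapes and values are
illustrative) that previously made these patterns fail: the xfer Op is
nested in a `vector.mask` region, so any rewrite must produce a single
maskable op rather than several.
```mlir
%r = vector.mask %m {
  vector.transfer_read %A[%c0], %pad {in_bounds = [true]}
    : memref<8xf32>, vector<8xf32>
} : vector<8xi1> -> vector<8xf32>
```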
Split of https://github.com/llvm/llvm-project/pull/90835
Restrict `DropInnerMostUnitDimsTransferRead` so that it fails when one
of the indices to be dropped could be != 0, e.g.
```mlir
func.func @negative_example(%A: memref<16x1xf32>, %i:index, %j:index) -> (vector<8x1xf32>) {
%f0 = arith.constant 0.0 : f32
%1 = vector.transfer_read %A[%i, %j], %f0 : memref<16x1xf32>, vector<8x1xf32>
return %1 : vector<8x1xf32>
}
```
This is an edge case that could represent an out-of-bounds access,
though that will depend on the actual value of `%j`. Importantly,
_without this change_ it would be transformed as follows:
```mlir
func.func @negative_example(%arg0: memref<16x1xf32>, %arg1: index, %arg2: index) -> vector<8x1xf32> {
%cst = arith.constant 0.000000e+00 : f32
%subview = memref.subview %arg0[0, 0] [16, 1] [1, 1] : memref<16x1xf32> to memref<16xf32, strided<[1]>>
%0 = vector.transfer_read %subview[%arg1], %cst : memref<16xf32, strided<[1]>>, vector<8xf32>
%1 = vector.shape_cast %0 : vector<8xf32> to vector<8x1xf32>
return %1 : vector<8x1xf32>
}
```
This is incorrect - `%arg2` is ignored. Hence the extra restriction to
avoid such cases.
NOTE: This PR is limited to tests for `vector.transfer_read`.
Implements `TransferOpReduceRank` as a `MaskableOpRewritePattern`.
This allows exiting gracefully when the pattern is run on a
`vector.transfer_read` located inside a `vector::MaskOp`, instead of
generating `error: 'vector.mask' op expects only one operation to mask`
because the pattern generated multiple ops inside the MaskOp.
Split of https://github.com/llvm/llvm-project/pull/90835
We can flatten the transfer ops even when the collapsed indices are not
zeros - the new (linearized) index can be computed. This is already
supported for vector.transfer_read; this revision refactors the logic
and reuses it for the transfer_write cases.
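A rough sketch of the idea under assumed shapes (illustrative only, not
the exact output of the pattern):
```mlir
// Before: the dims being collapsed are indexed by a non-zero %i.
vector.transfer_write %v, %mem[%c0, %i, %c0] {in_bounds = [true, true]}
  : vector<2x4xf32>, memref<8x4x4xf32>
// After (roughly): the memref and vector are flattened and the new
// index is computed from the original one (%i * 4 here).
%flat = memref.collapse_shape %mem [[0, 1, 2]]
  : memref<8x4x4xf32> into memref<128xf32>
%vflat = vector.shape_cast %v : vector<2x4xf32> to vector<8xf32>
%idx = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%i]
vector.transfer_write %vflat, %flat[%idx] {in_bounds = [true]}
  : vector<8xf32>, memref<128xf32>
```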