Reland of #171544 due to fixup for integration test.
Friendlier wrapper for `transform.foreach`.
To facilitate that friendliness, makes it so that `OpResult.owner`
returns the relevant `OpView` instead of `Operation`. For good measure,
also changes `Value.owner` to return `OpView` instead of `Operation`,
thereby ensuring consistency. That is, makes it so that all
op-returning `.owner` accessors return `OpView` (and thereby give access
to all goodies available on registered `OpView`s).
While figuring out how to perform an atomic exchange on a memref, I
tried the generic atomic rmw with the yielded value captured from the
enclosing scope (instead of a plain atomic_rmw with
`arith::AtomicRMWKind::assign`). Instead of segfaulting, this PR changes
the pass to produce an error when the result is not found in the
region's IR map.
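A minimal sketch of the kind of IR that triggers the new error (hypothetical function and value names; the key point is that the yielded value is defined outside the atomic region):
```
func.func @atomic_exchange(%buf: memref<8xf32>, %i: index, %x: f32) -> f32 {
  %old = memref.generic_atomic_rmw %buf[%i] : memref<8xf32> {
  ^bb0(%current: f32):
    // %x is captured from the enclosing scope instead of being computed
    // from %current, so the lowering does not find it in the region's IR
    // map; it now reports an error here instead of segfaulting.
    memref.atomic_yield %x : f32
  }
  return %old : f32
}
```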
It might be more useful to give a suggestion to the user, but giving an
error message instead of a crash is at least an improvement, I think.
See: #172184
If we do not collect the diagnostics from the
CollectDiagnosticsToStringScope, even when the named_sequence applied
successfully, the Scope object's destructor will assert (with an
unhelpful message).
Enable `FailOnUnsupportedFP` for `ConvertToLLVMPattern` and set it to
`true` for all `math-to-llvm` patterns. This fixes various invalid
lowerings of `math` ops on `fp8`/`fp4` types.
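For illustration, one op affected by this (a small sketch; any `math` op on an 8-bit float type behaves similarly):
```
func.func @exp_fp8(%arg0: f8E4M3FN) -> f8E4M3FN {
  // With FailOnUnsupportedFP set, math-to-llvm now fails to lower this op
  // instead of emitting an invalid lowering for the fp8 type.
  %0 = math.exp %arg0 : f8E4M3FN
  return %0 : f8E4M3FN
}
```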
Since `FatRawBufferCastOp` preserves the shape of its source operand,
the result dimensions can be reified by querying the source's
dimensions.
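A rough sketch of what this enables (the assembly of the cast op is written from memory and may differ in detail; the point is that the result's dynamic dimension can be answered by querying %src):
```
func.func @dim_of_cast(%src: memref<?xf32>) -> index {
  %c0 = arith.constant 0 : index
  %cast = amdgpu.fat_raw_buffer_cast %src
      : memref<?xf32> to memref<?xf32, #amdgpu.address_space<fat_raw_buffer>>
  // Since the cast preserves the shape of %src, this can now be reified
  // as memref.dim %src, %c0.
  %dim = memref.dim %cast, %c0 : memref<?xf32, #amdgpu.address_space<fat_raw_buffer>>
  return %dim : index
}
```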
---------
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
This patch fixes the lowering of the newly
added mbarrier.arrive Op w.r.t. its return value.
(Follow-up of PR #170545)
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Added `ds.atomic.barrier.arrive.rtn.b64` and
`ds.atomic.async.barrier.arrive.b64` to ROCDL. These are part of the
LDS memory barrier concept in GFX1250. Also added alias analysis to
`global/flat` data prefetch ops. Extended rocdl tests.
To support the `-fvisibility=...` option in Flang, we need a pass to
rewrite all the global definitions in the LLVM dialect that have the
default visibility to have the specified visibility. This change adds
such a pass.
Note that I did not add an option for `visibility=default`; I believe
this makes sense for compiler drivers since users may want to tack an
option on at the end of a compile line to override earlier options, but
I don't think it makes sense for this pass to accept
`visibility=default`--it would just be an early exit IIUC.
This change adds dense and sparse MMA with block scaling intrinsics to
MLIR -> NVVM IR -> NVPTX flow. NVVM and NVPTX implementation is based on
PTX ISA 9.0.
This patch changes the transfer_write -> transfer_read load store
forwarding canonicalization pattern to work based on permutation maps
and less on ad hoc logic. The old logic couldn't canonicalize a simple
unit dim broadcast through transfer_write/transfer_read which is added
as a test in this patch.
This patch also documents in more detail what would be needed to support
cases that are not yet implemented.
The patch is motivated by the Tosa conformance test negate_32x45x49_i16_full failure.
The TosaToLinalg pass has an optimization that converts a Tosa Negate into a Sub when the zero points are zero. However, when the input value is the minimum negative number, this transformation overflows (e.g., for i16, negating -32768 yields 32768, which is not representable in i16). With the transformation removed, the zp = 0 case goes through the usual type promotion, which avoids the overflow.
The promoted type can range from int32 up to int48. The TOSA negate specification does not mention support for int48. Should we consider removing the promotion to int48 to stay aligned with the TOSA spec?
This bug was introduced by #108323, where the loc and ip were not
properly set. It may lead to errors when the operations are not linearly
inserted into the IR.
This PR adds unrolling for vector.constant_mask op based on the
targetShape. Each unrolled vector computes its local mask size in each
dimension (d) as:
min(max(originalMaskSize[d] - offset[d], 0), unrolledMaskSize[d]).
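A small worked example of that formula (hypothetical shapes, assuming a 1-D mask of size 8 unrolled with targetShape = [4]):
```
func.func @unroll_example() -> vector<8xi1> {
  // Original mask: first 3 lanes set.
  %mask = vector.constant_mask [3] : vector<8xi1>
  // Unrolled into two vector<4xi1> pieces:
  //   offset [0]: min(max(3 - 0, 0), 4) = 3  ->  vector.constant_mask [3] : vector<4xi1>
  //   offset [4]: min(max(3 - 4, 0), 4) = 0  ->  vector.constant_mask [0] : vector<4xi1>
  return %mask : vector<8xi1>
}
```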
* changes workgroup mask's type from i16 to vector<16xi1>
* changes pad_amount and pad_interval from Index to I32
* adds lit tests for padEnable, iteration and dynamic cases
* adds TODO for a future instrumentation pass to validate inputs
* adds descriptor groups 2 and 3
As mentioned on Discourse,
* https://discourse.llvm.org/t/psa-vector-standardise-operand-naming
I am removing the deprecated Vector hooks near the creation of the
release/22 branch. These hooks were introduced in #158258 (~September
'25, ~3 months ago), so I assume folks have had enough time to transition
away.
Enables adding instrumentation to the pass manager that can track/flag
invariants. This would be useful for cases where one has tighter
requirements than the general dialects impose, or for a phase of
conversion that has stricter invariants than elsewhere.
I believe it would also enable making verification just a regular
instrumentation, but that is a non-goal here, as verification is a
first-class concept and baseline for the ops and passes.
This would have enabled some of the requirements of
https://discourse.llvm.org/t/pre-verification-logic-before-running-conversion-pass-in-mlir/88318/10.
This patch adds `ReduceOp::verifyRegions` to ensure that the number of
reduction regions equals the number of operands (`getReductions().size()
== getOperands().size()`).
Additionally, `ParallelOp::verify` is updated to gracefully handle cases
where the number of reduce operands differs from the initial values,
preventing verification logic crashes and relying on `ReduceOp` to
report structural inconsistencies.
Fixes: #118768
The patch updates the lowering of the `id`-based pmevent to intrinsics
as well. The mask is simply (1 << event-id); e.g., event id 3 yields the
mask 0x8.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Add an additional bound for the induction variable of the scf.forall such
that:
%iv <= %lower_bound + (%trip_count - 1) * %step
Same as https://github.com/llvm/llvm-project/pull/126426 but for
scf.forall loop
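A small worked example of the extra constraint (hypothetical bounds):
```
func.func @forall_bound_example() {
  // Trip count = ceildiv(10 - 0, 4) = 3, so in addition to 0 <= %i < 10 the
  // analysis can now also derive %i <= 0 + (3 - 1) * 4 = 8.
  scf.forall (%i) = (0) to (10) step (4) {
  }
  return
}
```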
-- This commit is the fourth in the series of adding matchers
for linalg.*conv*/*pool*. Refer:
https://github.com/llvm/llvm-project/pull/163724
-- In this commit all variants of Conv2D convolution ops have been
added.
-- It also refactors the way these matchers work to make adding more
matchers concise.
Signed-off-by: Abhishek Varma <abhvarma@amd.com>
---------
Signed-off-by: Abhishek Varma <abhvarma@amd.com>
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Co-authored-by: hanhanW <hanhan0912@gmail.com>
This pass implements the OpenACC loop tiling transformation for acc.loop
operations that have the tile clause (OpenACC 3.4 spec, section 2.9.8).
The tile clause specifies that the iterations of the associated loops
should be divided into tiles (rectangular blocks). The pass transforms a
single or nested acc.loop with tile clauses into a structure of "tile
loops" (iterating over tiles) containing "element loops" (iterating
within tiles).
For example, tiling a 2-level nested loop with tile(T1, T2):
```
// Before tiling:
acc.loop tile(T1, T2) control(%i, %j) = ...
// After tiling:
acc.loop control(%i) step (s1*T1) {                          // tile loop 1
  acc.loop control(%j) step (s2*T2) {                        // tile loop 2
    acc.loop control(%ii) = (%i) to (min(ub1, %i+s1*T1)) {   // element loop 1
      acc.loop control(%jj) = (%j) to (min(ub2, %j+s2*T2)) { // element loop 2
        // loop body using %ii, %jj
      }
    }
  }
}
```
Key features:
- Handles constant tile sizes and wildcard tile sizes ('*') which use a
configurable default tile size
- Properly handles collapsed loops with tile counts exceeding collapse
count by uncollapsing loops before tiling
- Distributes gang/worker/vector attributes appropriately: gang -> tile
loops, vector -> element loops
- Validates that tile size types are not wider than loop IV types
- Emits optimization remarks for tiling decisions
Three test files are added:
- acc-loop-tiling.mlir: Tests single and nested loop tiling with
constant tile sizes, unknown tile sizes (*), and loops with collapse
attributes
- acc-loop-tiling-invalid.mlir: Tests error diagnostic when tile size
type is wider than the loop IV type
- acc-loop-tiling-remarks.mlir: Tests optimization remarks emitted for
tiling decisions including default tile size selection
Co-authored-by: Vijay Kandiah <vkandiah@nvidia.com>
`vector.extract_strided_slice` can have two forms when specifying
offsets.
Case 1:
```
%1 = vector.extract_strided_slice %0 { offsets = [8, 0], sizes = [8, 16], strides = [1, 1]}
: vector<24x16xf32> to vector<8x16xf32>
```
Case 2:
```
%1 = vector.extract_strided_slice %0 { offsets = [8], sizes = [8], strides = [1]}
: vector<24x16xf32> to vector<8x16xf32>
```
These two ops mean the same thing, but case 2 is syntactic sugar to
avoid specifying offsets for fully extracted dims. Currently case 2
fails in XeGPU SIMT distribution. This PR fixes this issue.
During the XeGPU-to-XeVM type conversion, a memref is lowered to its
base address. This PR extends the conversion to correctly handle memrefs
that include an offset, such as those generated by memref.subview.
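For illustration, the kind of memref this covers (a sketch; the subview result carries a static offset that the conversion now folds into the lowered base address):
```
func.func @with_offset(%src: memref<8x8xf32>) -> memref<4x8xf32, strided<[8, 1], offset: 16>> {
  // The subview starts at row 2, so the result type carries offset 2 * 8 = 16.
  %sub = memref.subview %src[2, 0] [4, 8] [1, 1]
      : memref<8x8xf32> to memref<4x8xf32, strided<[8, 1], offset: 16>>
  return %sub : memref<4x8xf32, strided<[8, 1], offset: 16>>
}
```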
Add a new API `isValidValueUse` to OpenACCSupport. This is used in
ACCImplicitData to check for values that are already legal in the OpenACC
region and do not require an implicit clause to be generated. An example
would be a CUDA Fortran device variable that is already on the GPU.
Enables verification of attributes that reference symbols, independent of the
op they are attached to (i.e., the validity is a property of the Attribute
itself, independent of the operation).
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>