This aims to implement most of the initial arguments for defaultmap. The
exceptions are firstprivate and none, along with some of the more recent
OpenMP 6 additions; these will come in subsequent updates (the OpenMP 6
variants need parsing/semantic support first).
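For reference, here is a minimal illustration of the clause itself in C/C++
OpenMP syntax; the behaviors and categories named below come from the OpenMP
spec, and which of them this change actually covers is described above.
```cpp
int main() {
  int s = 0;      // scalar
  int a[64] = {}; // aggregate
  // defaultmap(<implicit-behavior> : <variable-category>) controls how
  // variables without explicit map clauses are mapped on the target region.
  #pragma omp target defaultmap(tofrom : scalar) defaultmap(to : aggregate)
  {
    s += a[0]; // 's' is mapped tofrom and 'a' is mapped to, per the clauses above
  }
  return s;
}
```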
Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes
so that they can be dispatched to hardware instructions. This PR
marks the initial phase of a series dedicated to incorporating unroll
patterns for XeGPU operations. In this installment, we introduce
patterns for the following operations (see the sketch after the list for
how such patterns are typically registered):
1. createNd
2. updateNd
3. prefetchNd
4. loadNd
5. storeNd
6. dpas
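For comparison, this is a sketch of the existing vector-dialect unrolling API
that these XeGPU patterns mirror ("Similar to vector ops" above); the tile
sizes and the helper function name here are purely illustrative.
```cpp
#include <optional>

#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Register unroll patterns with a per-op "native" shape callback: ops whose
// shape is larger than the returned shape get broken into tiles of that size.
static void addUnrollPatterns(RewritePatternSet &patterns) {
  vector::UnrollVectorOptions options;
  options.setNativeShapeFn(
      [](Operation *op) -> std::optional<SmallVector<int64_t>> {
        if (isa<vector::ContractionOp>(op))
          return SmallVector<int64_t>{8, 8, 8}; // illustrative 8x8x8 tiles
        return std::nullopt; // leave other ops unchanged
      });
  vector::populateVectorUnrollPatterns(patterns, options);
}
```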
There was already a __clc_tan in the OpenCL layer. This commit moves the
function over whilst vectorizing it.
The function __clc_tan is no longer a public symbol, which it should never
have been.
1. An ADD64rm_ND instruction is emitted with a GOTPCREL relocation. This is
handled in the "Suppress APX for relocation" pass by transforming it to
ADD64rm with a register operand in a non-rex2 register class. The relocation
type R_X86_64_CODE_6_GOTPCRELX will be added later for APX enabled with
relocation.
2. The register class of operands in an instruction with a relocation is
updated to a non-rex2 one in the "Suppress APX for relocation" pass, but it
may later be updated/recomputed to a larger register class (like
GR64_NOREX2RegClass to GR64RegClass). Fixed by not updating the register
class if it is a non-rex2 register class and APX support for relocation is
disabled.
3. After the "Suppress APX for relocation" pass, an instruction with a
relocation may be folded with an add NDD instruction into an add NDD
instruction with a relocation. The latter would be emitted as an instruction
with an APX relocation type, which breaks backward compatibility. Fixed by
not folding instructions with GOTPCREL relocations into NDD instructions.
4. If the register in operand 0 of an instruction with a relocation is used
in a PHI instruction, it may be replaced with operand 0 of the PHI
instruction (possibly an EGPR) after PHI elimination and the Machine Copy
Propagation pass. Fixed by suppressing EGPR in operand 0 of the PHI
instruction to avoid emitting APX relocation types.
On some platforms (particularly macOS), a `\01` prefix gets added to the
name in an `asm` label. This gets stripped when we emit the
[`DW_AT_linkage_name`](2f877c2722/llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp (L531)).
But we weren't stripping this prefix when inserting the linkage name
into accelerator tables.
This manifested in an issue where LLDB tried to look up a name in the
index by linkage name, but wasn't able to find it because we indexed it
with the `\01` unstripped.
This patch strips the prefix before indexing.
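A minimal sketch of the stripping involved (the helper name is illustrative,
not the exact code in the patch):
```cpp
#include "llvm/ADT/StringRef.h"

// Illustrative only: drop the "\01" marker that an asm("...") label can
// carry, so the name used when indexing accelerator tables matches the
// stripped form emitted for DW_AT_linkage_name.
static llvm::StringRef stripAsmLabelPrefix(llvm::StringRef LinkageName) {
  LinkageName.consume_front("\01");
  return LinkageName;
}
```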
…_reduce_matmul.
This patch exposes broadcast and transpose semantics on
'batch_reduce_matmul'. This is the last one, in continuation of the other
two matmul op variants.
The broadcast and transpose semantics can be applied by specifying the
explicit attribute 'indexing_maps' as shown below. This is a list
attribute, so it must include maps for all arguments if specified.
Example Transpose:
```
linalg.batch_reduce_matmul indexing_maps = [
affine_map<(d0, d1, d2, d3) -> (d0, d3, d1)>, // transpose
affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
affine_map<(d0, d1, d2, d3) -> (d1, d2)>
]
ins(%arg0, %arg1 : memref<2x5x3xf32>,memref<2x5x7xf32>)
outs(%arg2: memref<3x7xf32>)
```
Example Broadcast:
```
linalg.batch_reduce_matmul indexing_maps = [
affine_map<(d0, d1, d2, d3) -> (d3)>, // broadcast
affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
affine_map<(d0, d1, d2, d3) -> (d1, d2)>
]
ins(%arg0, %arg1 : memref<5xf32>, memref<2x5x7xf32>)
outs(%arg2: memref<3x7xf32>)
```
Example Broadcast and Transpose:
```
linalg.batch_reduce_matmul indexing_maps = [
affine_map<(d0, d1, d2, d3) -> (d1, d3)>, // broadcast
affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, // transpose
affine_map<(d0, d1, d2, d3) -> (d1, d2)>
]
ins(%arg0, %arg1 : memref<3x5xf32>, memref<2x7x5xf32>)
outs(%arg2: memref<3x7xf32>)
```
RFCs and related PRs:
* https://discourse.llvm.org/t/rfc-linalg-opdsl-constant-list-attribute-definition/80149
* https://discourse.llvm.org/t/rfc-op-explosion-in-linalg/82863
* https://discourse.llvm.org/t/rfc-mlir-linalg-operation-tree/83586
* https://github.com/llvm/llvm-project/pull/115319
* https://github.com/llvm/llvm-project/pull/122275
Closes #57270.
This PR replaces the `Stmt *` field in `SymbolConjured` with a
`CFGBlock::ConstCFGElementRef`. The motivation is that, when conjuring a
symbol, there might not always be a statement available, causing
information to be lost for conjured symbols, whereas a CFGElementRef
can always be provided at the call site.
Following this idea, this PR updates the call sites of the functions that
create conjured symbols, passing the appropriate `CFGElementRef`s.
There is a caveat for loop widening, where the correct location is the
CFG terminator (which is not an element and does not have a ref). In
this case, the first element of the block is passed as the location.
Previous PR #128251, Reverted at #137304.
Move early-exit handling up front to original VPlan construction, before
introducing early exits.
This builds on https://github.com/llvm/llvm-project/pull/137709, which
adds exiting edges to the original VPlan, instead of adding exit blocks
later.
This retains the exit conditions early and means we can handle early
exits before forming regions, without relying on VPRecipeBuilder.
Once we retain all exits initially, handling early exits before region
construction ensures the regions are valid; otherwise we would leave
edges exiting the region from blocks other than the latch.
Removing the reliance on VPRecipeBuilder removes the dependence on
mapping IR BBs to VPBBs and unblocks predication as a VPlan transform:
https://github.com/llvm/llvm-project/pull/128420.
Depends on https://github.com/llvm/llvm-project/pull/137709 (included in
PR).
PR: https://github.com/llvm/llvm-project/pull/138393
This commit introduces a new unit test that covers a block address
cloning bug. Specifically, it covers the bug tracked in
http://github.com/llvm/llvm-project/issues/47769, which has since been
resolved.
This commit moves the remaining FP64 sin and cos helper functions to the
CLC library. As a consequence, it formally moves all sin, cos and sincos
builtins to the CLC library. Previously, the FP16 and FP32 ones were
nominally there but still lived in the OpenCL layer while waiting for the
FP64 ones.
The FP64 builtins are now vectorized, as the FP16 and FP32 ones were
earlier.
One helper table had to be changed. It was previously a table of bytes
loaded by each work-item as uint4. Since this doesn't vectorize well,
the table was split to load two ulongNs per work-item. While this might
not be as efficient on some devices, one mitigating factor is that we
were previously loading 48 bytes per work-item in total, but only using
40 of them. With this commit we only load the bytes we need.
This patch adds patterns to select SVE2 bit-sel instructions such as BSL
from (or (and a, c), (and b, (vnot c))) and other similar patterns. For
example:
```cpp
svuint64_t bsl(svuint64_t a, svuint64_t b, svuint64_t c) {
return (a & c) | (b & ~c);
}
```
Currently:
```gas
bsl:
and z0.d, z2.d, z0.d
bic z1.d, z1.d, z2.d
orr z0.d, z0.d, z1.d
ret
```
Becomes:
```gas
bsl:
bsl z0.d, z0.d, z1.d, z2.d
ret
```
Similarly to #135016, refactor getPTrue to return splat (1) for
all-active patterns. The main motivation for this is to improve
code gen for fixed-length vector loads/stores that are converted to SVE
masked memory ops when the vectors are wider than Neon. Emitting the
mask as a splat helps DAGCombiner simplify all-active masked
loads/stores into unmasked ones, for which it already has suitable
combines and ISel has suitable patterns.
After the introduction of `OpAsmAttrInterface`, it is favorable to
migrate code using `OpAsmDialectInterface` for ASM alias generation,
which lives in `Dialect.cpp`, to use `OpAsmAttrInterface`, which lives
in `Attrs.td`. In this way, attribute behavior is placed near its
tablegen definition and people won't need to go through other files to
know what other (unexpected) hooks come into play.
See #124721 for the interface itself and #128191 and #130479 for prior
migrations.
Note that `MLProgramOpAsmInterface` has no content now. However, if we
delete it, a failure related to dialect resource handling will occur
```
within split at llvm-project/mlir/test/IR/invalid-file-metadata.mlir:60 offset :7:7: error: unexpected error: unexpected 'resource' section for dialect 'ml_program'
```
To support resources, such an interface must still be registered.
After the introduction of `OpAsmAttrInterface`, it is favorable to
migrate code using `OpAsmDialectInterface` for ASM alias generation,
which lives in `Dialect.cpp`, to use `OpAsmAttrInterface`, which lives
in `Attrs.td`. In this way, attribute behavior is placed near its
tablegen definition and people won't need to go through other files to
know what other (unexpected) hooks come into play.
See #124721 for the interface itself and #128191 for prior migration for
Builtin Attributes.
See #131504 for the `genMnemonicAlias` tablegen field.
Currently, when we version a loop, all loads and stores have noalias
metadata added to them. If there are pointers that could not be
analysed, and for which we thus could not generate runtime aliasing
checks, then we should not mark loads and stores using those pointers as
noalias.
This is done by getting rid of setNoAliasToLoop and instead using
annotateLoopWithNoAlias, as that already correctly handles partial alias
information. This does result in slightly different aliasing metadata
being generated, but it looks like it's more precise.
Currently this doesn't change the transforms that LoopVersioningLICM
does, as LoopAccessAnalysis discards all results if it couldn't analyse
every pointer, which means no loop versioning happens. However, an
upcoming patch will change that, and we need this fix first; otherwise we
would incorrectly mark some pointers as noalias even when they aren't.
Recently some users reported that they observed large increases in
runtime (up to +600% on some translation units) when they upgraded to a
more recent (slightly patched, internal) clang version. Bisection
revealed that the bulk of this increase was probably caused by my
earlier commit bb27d5e5c6 ("Don't assume
third iteration in loops").
When I evaluated that earlier commit on several open source projects, it
turned out that on average it is runtime-neutral (or slightly helpful: it
reduced the total analysis time by 1.5%), but it can cause runtime spikes
on some code: in particular, it more than doubled the time to analyze
`tmux` (one of the smaller test projects).
Further profiling and investigation proved that these spikes were caused
by an _increase of analysis scope_ because there was a heuristic that
placed functions on a "don't inline this" blacklist if they reached the
`-analyzer-max-loop` limit (anywhere, on any one execution path) --
which became significantly rarer when my commit ensured the analyzer no
longer "just assumes" four iterations. (With more inlining significantly
more entry points use up their allocated budgets, which leads to the
increased runtime.)
I feel that this heuristic for the "don't inline" blacklist is
unjustified and arbitrary, because reaching the "retry without inlining"
limit on one path does not imply that inlining the function won't be
valuable on other paths -- so I hope that we can eventually replace it
with more "natural" limits of the analysis scope.
However, the runtime increases are annoying for the users whose projects
are affected, so I created this quick workaround commit that approximates
the "don't inline" blacklist effects of ambiguous loops (where the
analyzer doesn't understand the loop condition) without fully reverting
the "Don't assume third iteration" commit (to avoid reintroducing the
false positives that were eliminated by it).
Investigating this issue was a team effort: I'm grateful to Endre Fülöp
(gamesh411) who did the bisection and shared his time measurement setup,
and Gábor Tóthvári (tigbr) who helped me in profiling.
[mlir][vector] Standardize base Naming Across Vector Ops (NFC)
This change standardizes the naming convention for the argument
representing the value to read from or write to in Vector ops that
interface with Tensors or MemRefs. Specifically, it ensures that all
such ops use the name `base` (i.e., the base address or location to
which offsets are applied).
Updated operations:
* `vector.transfer_read`,
* `vector.transfer_write`.
For reference, these ops already use `base`:
* `vector.load`, `vector.store`, `vector.scatter`, `vector.gather`,
`vector.expandload`, `vector.compressstore`, `vector.maskedstore`,
`vector.maskedload`.
This is a non-functional change (NFC) and does not alter the semantics of these
operations. However, it does require users of the XFer ops to switch from
`op.getSource()` to `op.getBase()`.
To ease the transition, this PR temporarily adds a `getSource()` interface
method for compatibility. This is intended for downstream use only and should
not be relied on upstream. The method will be removed prior to the LLVM 21
release.
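A short sketch of what the rename means for downstream code (illustrative
only, with a hypothetical helper name):
```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"

using namespace mlir;

// Illustrative only: the operand previously retrieved as the op's "source"
// is now retrieved as its "base".
static Value getXferBase(vector::TransferReadOp readOp) {
  // Before this change: readOp.getSource()
  // After this change:  readOp.getBase()
  // (getSource() remains temporarily as a compatibility shim, per above.)
  return readOp.getBase();
}
```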
Implements #131602
There are checks in the clang codebase that determine the type of source
file associated with a given location - specifically, whether it is an
ordinary file or comes from sources like command-line options or
built-in definitions. These checks often rely on calls to
`getPresumedLoc`, which is relatively expensive. In certain cases these
checks are combined, leading to repeated evaluations of the costly
function and negatively affecting compile time.
This change tries to optimize such checks. It should also fix the
compile-time regression introduced in
https://github.com/llvm/llvm-project/pull/137306/.
---------
Co-authored-by: cor3ntin <corentinjabot@gmail.com>
Enable verifyInstructionPredicates so that instructions which are invalid
with the current set of features produce an error. All the tests pass,
and a bootstrap and a run of the llvm-test-suite completed successfully.
`-rewrite-objc` passes `-x objective-c++-cpp-output` as input type to
the preprocessor job. This is not correct since we would be
preprocessing a preprocessed file. The correct input type is
`objective-c++`.
Previously, when using -no-integrated-cpp, the driver would not collapse
actions when the input was not LLVM IR, or it would collapse them too
aggressively with -save-temps.
The original code was checking the action type (which is also IR for
preprocessed->bc actions) instead of the action inputs.
Use the RegSubRegPair struct defined in TargetInstrInfo instead of the
custom definitions in HexagonGenPredicates and
HexagonConstantPropagation.
This patch addresses the FIXMEs that were there in these passes.
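For reference, the adopted struct looks roughly like this (see
TargetInstrInfo.h for the authoritative definition):
```cpp
#include "llvm/CodeGen/Register.h"

// Rough shape of llvm::TargetInstrInfo::RegSubRegPair: a register together
// with an optional subregister index, replacing the passes' hand-rolled
// register/subregister pairs.
struct RegSubRegPair {
  llvm::Register Reg;
  unsigned SubReg;
  RegSubRegPair(llvm::Register R = llvm::Register(), unsigned S = 0)
      : Reg(R), SubReg(S) {}
};
```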
Re-landing this patch with small tweaks to address CI bot failures, as it
was run on many different configurations. I think the test may now run on
aarch64 Linux systems.
When a frameless function faults or is interrupted asynchronously, the
UnwindPlan MAY have no register location rule for the return address
register (lr on arm64); the value is simply live in the lr register when
it was interrupted, and the frame below this on the stack -- e.g.
sigtramp on a Unix system -- has the full register context, including
that register.
RegisterContextUnwind::SavedLocationForRegister, when asked to find the
caller's pc value, will first see if there is a pc register location. If
there isn't, on a Return Address Register architecture like
arm/mips/riscv, we rewrite the register request from "pc" to "RA
register", and search for a location.
On frame 0 (the live frame) and an interrupted frame, the UnwindPlan may
have no register location rule for the RA Reg; that is valid. A
frameless function that never calls another may simply keep the return
address in the live register the whole way. Our instruction emulation
unwind plans explicitly add a rule (see Pavel's May 2024 change
https://github.com/llvm/llvm-project/pull/91321 ), but an UnwindPlan
sourced from debug_frame may not.
I've got a case where this exactly happens - clang debug_frame for arm64
where there is no register location for the lr in a frameless function.
There is a fault in the middle of this frameless function and we only
get the lr value from the fault handler below this frame if lr has a
register location of `IsSame`, in line with Pavel's 2024 change.
Similar to how we see a request of the RA Reg from frame 0 after failing
to find an unwind location for the pc register, the same style of
special casing is needed when this is a function that was interrupted.
Without this change, we can find the pc of the frame that was executing
when it was interrupted, but we need $lr to find its caller, and we
don't descend down to the trap handler to get that value, truncating the
stack.
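For context, here is a sketch of the kind of rule involved, using what I
believe is the UnwindPlan::Row API (the exact signature may differ; the
helper name is hypothetical):
```cpp
#include "lldb/Symbol/UnwindPlan.h"

// Sketch: an UnwindPlan row can record that the return-address register
// still holds the caller's value ("IsSame"); with such a rule present, the
// unwinder can fall through to the trap handler's full register context
// instead of truncating the stack.
static bool markReturnAddressAsSame(lldb_private::UnwindPlan::Row &row,
                                    uint32_t lr_regnum) {
  return row.SetRegisterLocationToSame(lr_regnum, /*must_replace=*/false);
}
```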
rdar://145614545
This is in preparation for the OSA2011 instruction definitions, which
include a CBCond instruction family.
Reviewers: rorth, s-barannikov, brad0
Reviewed By: s-barannikov
Pull Request: https://github.com/llvm/llvm-project/pull/138402
This reverts commit a0260a95ec, reapplying 7c5f5f3ef8, with a fix that
makes *both* pipe handles inheritable.
The original commit description was:
This is a follow-up to https://github.com/llvm/llvm-project/pull/126935,
which enables passing handles to a child
process on Windows systems. Unlike on unix-like systems, the handles
need to be created with the "inheritable" flag because there's no way to
change the flag value after the handle has been created. This is why I
don't respect the child_process_inherit flag but rather always set the
flag to true. (My next step is to delete the flag entirely.)
This does mean that a pipe may be created as inheritable even if it's not
necessary, but I think this is offset by the fact that Windows (unlike
unixes, which pass all ~O_CLOEXEC descriptors through execve and *all*
descriptors through fork) has a way to specify the precise set of
handles to pass to a specific child process.
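For illustration, the Windows-side setup looks roughly like this (a generic
Win32 sketch, not the actual LLDB code; the helper name is hypothetical):
```cpp
#include <windows.h>

// Generic Win32 sketch (not the LLDB implementation): inheritability is
// requested at creation time via SECURITY_ATTRIBUTES, and, matching the fix
// described above, both ends of the pipe come out inheritable.
static bool createInheritablePipe(HANDLE &readEnd, HANDLE &writeEnd) {
  SECURITY_ATTRIBUTES sa = {};
  sa.nLength = sizeof(sa);
  sa.bInheritHandle = TRUE; // both returned handles will be inheritable
  return CreatePipe(&readEnd, &writeEnd, &sa, /*nSize=*/0) != 0;
}
```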
If this turns out to be insufficient, then instead of a constructor flag
I'd rather go with creating a separate API to create an inheritable copy
of a handle (as typically you only want to inherit one end of the pipe).