This patch moves abs_timeout and monotonicity out of the linux directory
into common. Both of these functions depend on clock_gettime, which is
the actual OS-dependent component. As other features in
`__support/threads` may want to use them, it is better to share them in
common.
Check the result of `convertType` before calling `TypeAttr::get`. This
prevents a crash on unsupported types (e.g. `tensor`) by ensuring the
pattern fails gracefully.
Added regression test: map-info-type-conversion-fail.mlir
Fixes: #108159
## Summary
Allow implicit compatibility between `_Float16` vector types and
`half` vector types in OpenCL mode. This enables AMDGPU builtins to work
correctly across OpenCL, HIP, and C++ without requiring separate builtin
definitions.
## Problem Statement
When using AMDGPU image builtins that return half-precision vectors in
OpenCL, users encounter type incompatibility errors:
**Builtin Definition:**
`TARGET_BUILTIN(__builtin_amdgcn_image_load_1d_v4f16_i32, "V4xiiQtii",
"nc", "image-insts")`
**Test Case:**
```c
typedef half half4 __attribute__((ext_vector_type(4)));
half4 test_builtin_image_load_1d_2(half4 v4f16, int i32, __amdgpu_texture_t tex) {
return __builtin_amdgcn_image_load_1d_v4f16_i32(100, i32, tex, 120, i32);
}
```
**Error:**
```
error: returning '__attribute__((__vector_size__(4 * sizeof(_Float16)))) _Float16'
(vector of 4 '_Float16' values) from a function with incompatible result type
'half4' (vector of 4 'half' values)
```
## Solution
In OpenCL, allow implicit compatibility between `_Float16` vector types
and `half` vector types. This is needed for AMDGPU builtins that may
return `_Float16` vectors to work correctly with OpenCL `half` vector types.
Fixed issue
[[PowerPC] llc crashed at -O1/O2/O3: Assertion `isImm() && "Wrong
MachineOperand mutator"'
failed.](https://github.com/llvm/llvm-project/issues/167672)
The root cause of the crash is that the immediate operand is at a
different operand index in PPC::XXSPLTW than in
PPC::XXSPLTB/PPC::XXSPLTH.
The patch also fixes a potential bug: the new element index for
PPC::XXSPLTB/PPC::XXSPLTH/PPC::XXSPLTW used the same logic, but it
should differ. We need to convert the element index into the proper unit
(byte for VSPLTB, halfword for VSPLTH, word for VSPLTW) because
PPC::XXSLDWI interprets its ShiftImm in 32-bit word units.
The gpu printf test was not using the runtime required by lit.local.cfg.
All other tests in the directory correctly use the Level Zero runtime,
but the gpu printf test was using the SYCL runtime.
These quantities should never unsigned-wrap. This matches the behavior
if only VFxUF is used (and not VF): when computing both VF and VFxUF,
nuw should hold for each step separately.
This patch adds a simple iterator range that allows conditionally
iterating a collection in reverse. It works with any collection
supported by `llvm::reverse(Collection)`.
```cpp
void foo(bool Reverse, std::vector<int> &C) {
  for (int I : reverse_conditionally(C, Reverse)) {
    // ...
  }
}
```
The change in #170263 does not do justice to common knowledge in the backend.
Fix the comment to reflect the relation between FLAT encoding, flat pointer
access, and LDSDMA operations.
There's no reason to use the ocml or nv prefixed functions and
maintain this list of alias macros. I left these macros in for
NVPTX in the scalbn and logb case, since those have a special
case hack in the AMDGPU codegen and probably do not work on ptx.
Port AMDGPUArgumentUsageInfo analysis to the NPM to fix suboptimal code
generation when NPM is enabled by default.
Previously, DAG.getPass() returned nullptr when using NPM, causing the
argument usage info to be unavailable during ISel. This resulted in
fallback to FixedABIFunctionInfo which assumes all implicit arguments
are needed, generating unnecessary register setup code for entry
functions.
Fixes LLVM::CodeGen/AMDGPU/cc-entry.ll
Changes:
- Split AMDGPUArgumentUsageInfo into a data class and NPM analysis
wrapper
- Update SIISelLowering to use DAG.getMFAM() for NPM path
- Add RequireAnalysisPass in addPreISel() to ensure analysis
availability
This follows the same pattern used for PhysicalRegisterUsageInfo.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.
The loop looks something like:
```c
for (int i = 0; i < n; i++) {
if (x0[i] == a)
if (x1[i] == b)
if (x2[i] == c)
// do stuff...
}
```
Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.
This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.
We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the probability as `HeaderFrequency / BlockFrequency`.
Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
While there are a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:
- On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there's 0.9%/0% less geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.
Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
The 'bind' clause emits an attribute on the RoutineOp that states which
function it should call on the device side. When provided in
double-quotes, the function on the device side should be the exact name
given. This patch emits the IR to do that.
As a part of that, we add a helper function to the OpenACC dialect to do
so, as well as a version that adds the ID version (though we don't
exercise that yet).
The 'bind' with an ID should do the MANGLED name, but it isn't quite
clear what that name SHOULD be yet. Since the signature of a function is
included in its mangling, and we're not providing said signature, we
have to come up with something. This is left as an exercise for a future
patch.
Remove the v8i64 dependency from the original shift-by-1 tests - this was added for #132601 but is unlikely to be necessary.
Add tests for general shifts as well as shift-by-constant and shift-of-constant examples.
This test is failing on some buildbots now that the internal shell has
been turned on and was failing previously on some ppc bots when turning
it on a while back (before it got reverted).
At least one X86 bot is barely hitting the limit
(https://lab.llvm.org/buildbot/#/builders/174/builds/28487 224MB-235MB).
This likely needs to be bumped due to changes in the process tree (now
that we invoke things through python rather than a bash shell) with the
enablement of the internal shell.
So far, the syntax was `target frame-provider register <cmd-options>
[<run-args>]`. Note the optional `run-args` at the end. They are
completely ignored by the actual command, but the command line parser
still accepts them.
This commit removes them.
This was probably a copy-paste error from `CommandObjectProcessLaunch`,
which was likely used as a blueprint for `target frame-provider
register`.
BOLT currently ignores functions with synchronous PAuth DWARF info.
If more than 10% of functions get ignored for inconsistencies, we
should emit a warning to only use asynchronous unwind tables.
See related issue: #165215
Assuming the predicate is hoisted, this should have a slightly better
throughput: https://godbolt.org/z/jb7aP7Efc
Note: SVE must be used to convert back to bf16, as the bfmlalb/t
instructions operate on even/odd lanes, whereas the NEON bfcvtn/bfcvtn2
instructions process the top/bottom halves of vectors.
`SPIRVEmitIntrinsics::simplifyZeroLengthArrayGepInst` asserted that it
always expected a `GetElementPtrInst` from `IRBuilder::CreateGEP` (which
returns a `Value`). `IRBuilder` can fold and return a `ConstantExpr`
instead, thus violating the assertion. The patch fixes this by using
`GetElementPtrInst::Create` to always return a `GetElementPtrInst`.
This LLVM defect was identified via the AMD Fuzzing project.
Building with GCC produces:
```
<...>/TuneExtensionOps.cpp:180:26: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
180 | if (*selectedRegionIdx < 0 || *selectedRegionIdx >= getNumRegions())
| ~~~~~~~~~~~~~~~~~~~^~~
<...>/TuneExtensionOps.cpp: In member function ‘llvm::LogicalResult mlir::transform::tune::AlternativesOp::verify()’:
<...>/TuneExtensionOps.cpp:236:19: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
236 | if (regionIdx < 0 || regionIdx >= getNumRegions())
| ~~~~~~~~~~^~~
```
As we are sign extending these variables, use int64_t instead of size_t
for their type.
By completely omitting invalidation in the case of InstanceCall, we do
not clear the moved-from state of the fields of the `this` object after
an opaque call to a member function of the object itself.
Previously, we would miss inserting a wait if the ds_read had AA info
but it didn't match any LDS DMA op, for example if we didn't track the
LDS DMA op it aliases with because it exceeded the tracking limit.
I have seen a failure whereby the fuzzer failed to reach the expected
input and thus failed the test.
This patch bumps the max executions to 10,000,000 in order to give the
fuzzer a better chance of reaching the expected input. Most runs
complete successfully, so I do not see this adding test time in the
general case; I believe it's a fair tradeoff for the unlucky seed to run
for longer if it reduces the noise from false positives. Note, this
updates a different `RUN:` line than
https://github.com/llvm/llvm-project/pull/165402.
rdar://162122184