When `minBits` is set and the scalarizer pass runs, the last remaining
fragment of a boolean vector may not be aligned to min bits. Those
remaining bits must be processed one by one, and a direct shuffle is not
allowed when packing them.
Problem:
In the 'concatenate' step, when processing a boolean vector, the last
remaining bits (fragment) may not be aligned to minBits but still have
to be packed. Those bits should be processed one by one.
A direct call to vector shuffle assumes that the remaining boolean bits
can be packed to the target pack size. For example, when processing a
boolean vector with `size = 7` and `min bits = 4`, the first fragment of
`4` bits can be packed correctly, but the `3` remaining bits can't be
used in a vector shuffle call.
Solution:
If the remaining bits can't be aligned to the required target (min bits)
pack size, process them one by one.
(This will mostly only affect boolean vectors, as their bit width is
not aligned to a power of 2.)
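The splitting rule described above can be sketched in isolation. This is only an illustration under stated assumptions: `Fragment` and `splitBoolVector` are hypothetical names, not the scalarizer's actual code, and the sketch only decides which element ranges may use a shuffle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one piece of a boolean vector.
struct Fragment {
  std::size_t Begin;
  std::size_t End;  // one past the last element
  bool UseShuffle;  // true only for full MinBits-wide fragments
};

// Hypothetical helper: full fragments of MinBits elements may be packed
// with a shuffle; a tail narrower than MinBits yields one single-element
// fragment per remaining bit.
std::vector<Fragment> splitBoolVector(std::size_t NumBits,
                                      std::size_t MinBits) {
  assert(MinBits > 0 && "MinBits must be positive");
  std::vector<Fragment> Result;
  std::size_t I = 0;
  for (; I + MinBits <= NumBits; I += MinBits)
    Result.push_back({I, I + MinBits, true});
  for (; I < NumBits; ++I)
    Result.push_back({I, I + 1, false});
  return Result;
}
```

For `size = 7` and `min bits = 4` this gives one shuffle-packed fragment covering elements 0 to 3, followed by three single-element fragments.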
---------
Co-authored-by: Zhou, Shaochi(AMD) <shaozhou@amd.com>
Currently, invoking `clang++` with `-fdiagnostics-format=sarif` causes a
crash, with stack traces indicating that
`SARIFDiagnostic::emitIncludeLocation` is unimplemented.
This PR adds minimal support for converting `In file included from ...`
and `In module ...` into `SARIF.result.relatedLocations`. With this
change, `clang++ -fdiagnostics-format=sarif` no longer crashes and now
provides a minimal amount of useful information.
Thank you.
Enable the clang_ignored_gcc_optimization_f_group in flang. These
options are accepted by clang, but ignored after emitting a warning
message. flang's behavior now mirrors both clang and gfortran.
Fixes #158436
Instructions used outside the block must be considered first when
choosing the main instructions in the copyable nodes, to avoid
use-before-def.
Fixes #171055
Code-gen produced incorrect code when the trip count of an associated
DO loop was zero. The generated code evaluated the trip count of the
loop and subtracted 1 from it. When this was passed to
__kmpc_for_static_init_4u, the value was interpreted as unsigned, which
made the upper bound of the worksharing loop 2^32-1 and caused a
division by zero in the calculation of the loop bounds for the threads.
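The wrap-around can be reproduced in a few lines of standalone C++. This is a sketch only: `upperBoundAsUnsigned` and `loopHasIterations` are hypothetical names, and the guard is just one possible way to avoid the problem, not necessarily what the patch does.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the faulty computation: the upper bound is "trip count - 1",
// which is then interpreted as an unsigned 32-bit value.
std::uint32_t upperBoundAsUnsigned(std::int32_t TripCount) {
  assert(TripCount >= 0 && "trip counts are non-negative");
  return static_cast<std::uint32_t>(TripCount - 1);
}

// Hypothetical guard (an assumption, not taken from the patch): only
// set up the worksharing loop when there is at least one iteration.
bool loopHasIterations(std::int32_t TripCount) { return TripCount > 0; }
```

With a trip count of zero, `upperBoundAsUnsigned(0)` yields 4294967295, i.e. 2^32-1, which is the bogus upper bound described above.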
Emitting the symbol in `emitGlobalAlias` seemed most efficient,
otherwise I think you'd have to traverse all aliases. I have verified
that the additional symbol is picked up by `arm-none-eabi-ld` and
correctly generates an entry in `veneers.o`.
Fixes #162084
Some use cases or shapes of 2D block ops with sub-byte types can be
emulated with 2D block operations on non-sub-byte types. Add the
sub-byte type i4 as a valid XeGPU type, and add lowering of certain 2D
block operations by emulating them with larger element types.
Inside the LLVM codebase, const vector& should just be ArrayRef, as this
more general API works with vectors, SmallVectors, and
SmallVectorImpl, as well as with single elements.
This commit replaces two uses introduced in
https://github.com/llvm/llvm-project/pull/168797 .
We can see the following while running clang-repl in C mode:
```
anutosh491@vv-nuc:/build/anutosh491/llvm-project/build/bin$ ./clang-repl --Xcc=-x --Xcc=c --Xcc=-std=c23
clang-repl> printf("hi\n");
In file included from <<< inputs >>>:1:
input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and
later do not support implicit function declarations [-Wimplicit-function-declaration]
1 | printf("hi\n");
| ^
input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf'
error: Parsing failed.
clang-repl> #include <stdio.h>
hi
```
In debug mode, while dumping the generated Module, I see this:
```
clang-repl> printf("hi\n");
In file included from <<< inputs >>>:1:
input_line_1:1:1: error: call to undeclared library function 'printf' with type 'int (const char *, ...)'; ISO C99 and
later do not support implicit function declarations [-Wimplicit-function-declaration]
1 | printf("hi\n");
| ^
input_line_1:1:1: note: include the header <stdio.h> or explicitly provide a declaration for 'printf'
error: Parsing failed.
clang-repl> #include <stdio.h>
=== compile-ptu 1 ===
[TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)]
[LLVM IR]
; ModuleID = 'incr_module_1'
source_filename = "incr_module_1"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@.str = private unnamed_addr constant [4 x i8] c"hi\0A\00", align 1
@llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 65535, ptr @_GLOBAL__sub_I_incr_module_1, ptr null }]
define internal void @__stmts__0() #0 {
entry:
%call = call i32 (ptr, ...) @printf(ptr noundef @.str)
ret void
}
declare i32 @printf(ptr noundef, ...) #1
; Function Attrs: noinline nounwind uwtable
define internal void @_GLOBAL__sub_I_incr_module_1() #2 section ".text.startup" {
entry:
call void @__stmts__0()
ret void
}
attributes #0 = { "min-legal-vector-width"="0" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #2 = { noinline nounwind uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"clang version 22.0.0git (https://github.com/anutosh491/llvm-project.git81ad8fbc2b)"}
=== end compile-ptu ===
execute-ptu 1: [TU=0x55556cfbf830, M=0x55556cfc13a0 (incr_module_1)]
hi
```
Basically I see that CodeGen emits IR for a cell before we know whether
DiagnosticsEngine has an error. For C code like `printf("hi\n");`
without <stdio.h>, Sema emits a diagnostic but still produces a
"codegen-able" `TopLevelStmt`, so the `printf` call is IR-generated into
the current module.
Previously, when `Diags.hasErrorOccurred()` was true, we only cleaned up
the PTU AST and left the CodeGen module untouched. The next successful
cell then called `GenModule()`, which returned that same module (now
also containing the next cell’s IR), so side effects from the failed
cell (e.g. the `printf` call) were executed along with it.
This reverts commit fccb65ef8f.
It breaks pre-merge CI:
```
2025-12-08T16:35:11.7239054Z /home/gha/actions-runner/_work/llvm-project/llvm-project/mlir/lib/Pass/PassRegistry.cpp:439:37: error: ISO C++ requires the name after '::~' to be found in the same scope as the name before '::~' [-Werror,-Wdtor-name]
2025-12-08T16:35:11.7240458Z 439 | llvm::cl::OptionValue<OpPassManager>::~OptionValue() = default;
2025-12-08T16:35:11.7241014Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
2025-12-08T16:35:11.7241494Z | ::OptionValue
2025-12-08T16:35:11.7241903Z 1 error generated.
```
This patch moves abs_timeout and monotonicity out of the linux dir into
common. Both of these depend on clock_gettime, which is the actual
OS-dependent component. As other features in `__support/threads` may
want to use them, it's better to share them in common.
Check the result of `convertType` before calling `TypeAttr::get`. This
prevents a crash on unsupported types (e.g. `tensor`) by ensuring the
pattern fails gracefully.
Added regression test: map-info-type-conversion-fail.mlir
Fixes: #108159
## Summary
Allow implicit compatibility between `_Float16` vector types and
`half` vector types in OpenCL mode. This enables AMDGPU builtins to work
correctly across OpenCL, HIP, and C++ without requiring separate builtin
definitions.
## Problem Statement
When using AMDGPU image builtins that return half-precision vectors in
OpenCL, users encounter type incompatibility errors:
**Builtin Definition:**
`TARGET_BUILTIN(__builtin_amdgcn_image_load_1d_v4f16_i32, "V4xiiQtii",
"nc", "image-insts")`
**Test Case:**
```
typedef half half4 __attribute__((ext_vector_type(4)));
half4 test_builtin_image_load_1d_2(half4 v4f16, int i32, __amdgpu_texture_t tex) {
return __builtin_amdgcn_image_load_1d_v4f16_i32(100, i32, tex, 120, i32);
}
```
**Error:**
```
error: returning '__attribute__((__vector_size__(4 * sizeof(_Float16)))) _Float16'
(vector of 4 '_Float16' values) from a function with incompatible result type
'half4' (vector of 4 'half' values)
```
## Solution
In OpenCL, allow implicit compatibility between `_Float16` vector types
and `half` vector types. This is needed for AMDGPU builtins that may
return _Float16 vectors to work correctly with OpenCL half vector types.
Fixes the issue
[[PowerPC] llc crashed at -O1/O2/O3: Assertion `isImm() && "Wrong
MachineOperand mutator"'
failed.](https://github.com/llvm/llvm-project/issues/167672)
The root cause of the crash is that the IMM operand is at a different
operand number in PPC::XXSPLTW than in PPC::XXSPLTB/PPC::XXSPLTH.
The patch also fixes a potential bug: the new element index for
PPC::XXSPLTB/PPC::XXSPLTH/XXSPLTW was computed with the same logic, but
it should differ. We need to convert the element index into the proper
unit (byte for VSPLTB, halfword for VSPLTH, word for VSPLTW) because
PPC::XXSLDWI interprets its ShiftImm in 32-bit word units.
The gpu printf test was not using the runtime required by
lit.local.cfg. All other tests in the directory correctly use the level
zero runtime, but the gpu printf test was using the sycl runtime.
These quantities should never unsigned-wrap. This matches the behavior
if only VFxUF is used (and not VF): when computing both VF and VFxUF,
nuw should hold for each step separately.
This patch adds a simple iterator range that allows conditionally
iterating a collection in reverse. It works with any collection
supported by `llvm::reverse(Collection)`.
```
void foo(bool Reverse, std::vector<int>& C) {
for (int I : reverse_conditionally(C, Reverse)) {
// ...
}
}
```
The change in #170263 does not do justice to common knowledge in the backend.
Fix the comment to reflect the relation between FLAT encoding, flat pointer
access, and LDSDMA operations.
There's no reason to use the ocml or nv prefixed functions and
maintain this list of alias macros. I left these macros in for
NVPTX in the scalbn and logb case, since those have a special
case hack in the AMDGPU codegen and probably do not work on ptx.
Port AMDGPUArgumentUsageInfo analysis to the NPM to fix suboptimal code
generation when NPM is enabled by default.
Previously, DAG.getPass() returned nullptr when using NPM, causing the
argument usage info to be unavailable during ISel. This resulted in
fallback to FixedABIFunctionInfo which assumes all implicit arguments
are needed, generating unnecessary register setup code for entry
functions.
Fixes LLVM::CodeGen/AMDGPU/cc-entry.ll
Changes:
- Split AMDGPUArgumentUsageInfo into a data class and NPM analysis
wrapper
- Update SIISelLowering to use DAG.getMFAM() for NPM path
- Add RequireAnalysisPass in addPreISel() to ensure analysis
availability
This follows the same pattern used for PhysicalRegisterUsageInfo.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.
The loop looks something like:
```c
for (int i = 0; i < n; i++) {
if (x0[i] == a)
if (x1[i] == b)
if (x2[i] == c)
// do stuff...
}
```
Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.
This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.
We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the probability as `HeaderFrequency / BlockFrequency`.
Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
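The arithmetic behind the new divisor can be sketched as follows. The function name `predBlockCostDivisor` and the integer frequencies are illustrative assumptions, not the actual `getPredBlockCostDivisor` signature or real BlockFrequencyInfo values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the cost divisor: how many times the loop
// header runs per execution of the predicated block, i.e.
// HeaderFrequency / BlockFrequency (integer division in this sketch).
std::uint64_t predBlockCostDivisor(std::uint64_t HeaderFreq,
                                   std::uint64_t BlockFreq) {
  assert(BlockFreq != 0 && "block must have a non-zero frequency");
  return HeaderFreq / BlockFreq;
}
```

With three nested conditions each taken half of the time, a header frequency of 1000 corresponds to a block frequency of 125 (12.5%), so the divisor becomes 8 rather than the 2 implied by the old 50% assumption.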
Whilst there are a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:
- On armv9-a -flto -O3 there are 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there are 0.9%/0% fewer geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.
Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
The 'bind' clause emits an attribute on the RoutineOp that states which
function it should call on the device side. When provided in
double-quotes, the function on the device side should be the exact name
given. This patch emits the IR to do that.
As a part of that, we add a helper function to the OpenACC dialect to do
so, as well as a version that adds the ID version (though we don't
exercise that yet).
The 'bind' with an ID should do the MANGLED name, but it isn't quite
clear what that name SHOULD be yet. Since the signature of a function is
included in its mangling, and we're not providing said signature, we
have to come up with something. This is left as an exercise for a future
patch.
Remove v8i64 dependency from original shift-by-1 tests - this was added for #132601 but is unlikely to be necessary
Add tests for general shifts as well as shift-by-constant and shift-of-constant examples
This test is failing on some buildbots now that the internal shell has
been turned on and was failing previously on some ppc bots when turning
it on a while back (before it got reverted).
At least one X86 bot is barely hitting the limit
(https://lab.llvm.org/buildbot/#/builders/174/builds/28487 224MB-235MB).
This likely needs to be bumped due to changes in the process tree (now
that we invoke things through python rather than a bash shell) with the
enablement of the internal shell.
So far, the syntax was `target frame-provider register <cmd-options>
[<run-args>]`. Note the optional `run-args` at the end. They are
completely ignored by the actual command, but the command line parser
still accepts them.
This commit removes them.
This was probably a copy-paste error from `CommandObjectProcessLaunch`,
which was probably used as a blueprint for `target frame-provider
register`.
BOLT currently ignores functions with synchronous PAuth DWARF info.
If more than 10% of functions get ignored for inconsistencies, we
should emit a warning to only use asynchronous unwind tables.
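The threshold check itself is simple. A minimal sketch, assuming a hypothetical helper name and plain counters rather than BOLT's actual data structures:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when strictly more than 10% of the
// functions were ignored. The comparison is done in integers to avoid
// floating-point rounding.
bool shouldWarnAboutIgnoredFunctions(std::size_t Ignored,
                                     std::size_t Total) {
  assert(Ignored <= Total && "cannot ignore more functions than exist");
  return Total != 0 && Ignored * 10 > Total;
}
```

In this sketch, 11 ignored functions out of 100 trigger the warning, while exactly 10 out of 100 do not.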
See related issue: #165215