The 'link' clause is like the rest of the global clauses (copyin,
create, device_resident), except it only has an entry op (thus no
dtor).
This patch also removes a bunch of now-stale TODOs from the tests.
Previously, libcall lowering decisions were made directly
in the TargetLowering constructor. Pull these into the subtarget
to facilitate turning LibcallLoweringInfo into a separate analysis
in the future.
This backend support will allow the LoadStoreVectorizer, in certain
cases, to fill in gaps when creating load/store vectors and generate
LLVM masked load/stores
(https://llvm.org/docs/LangRef.html#llvm-masked-store-intrinsics). To
accomplish this, changes are separated into two parts. This first part
has the backend lowering and TTI changes, and a follow up PR will have
the LSV generate these intrinsics:
https://github.com/llvm/llvm-project/pull/159388.
In this backend change, Masked Loads get lowered to PTX with `#pragma
"used_bytes_mask" [mask];`
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
And Masked Stores get lowered to PTX using the new sink symbol syntax
(https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st).
# TTI Changes
TTI changes are needed because NVPTX only supports masked loads/stores
with _constant_ masks. `ScalarizeMaskedMemIntrin.cpp` is adjusted to
check that the mask is constant and pass that result into the TTI check.
Behavior shouldn't change for non-NVPTX targets, which do not care
whether the mask is variable or constant when determining legality, but
all TTI files that implement this API need to be updated.
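Roughly, the new flow looks like this (a hedged sketch, not the PR's
literal code; the extra parameter's name and the helper are hypothetical):
```cpp
// In ScalarizeMaskedMemIntrin (sketch): compute mask constancy once and
// forward it to the TTI legality hook. llvm.masked.store operands are
// (value, ptr, align, mask), so the mask is operand 3.
static bool keepMaskedStore(CallInst *CI, const TargetTransformInfo &TTI,
                            Type *DataTy, Align Alignment, unsigned AS) {
  bool IsConstMask = isa<Constant>(CI->getArgOperand(3));
  return TTI.isLegalMaskedStore(DataTy, Alignment, AS, IsConstMask);
}

// NVPTX side (sketch): only constant masks are legal, since the
// sink-symbol and used_bytes_mask forms need the mask at compile time.
bool NVPTXTTIImpl::isLegalMaskedStore(Type *DataTy, Align Alignment,
                                      unsigned AddrSpace,
                                      bool IsConstMask) const {
  return IsConstMask; // plus the usual vector-shape/alignment checks
}
```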
# Masked store lowering implementation details
If the masked stores make it to the NVPTX backend without being
scalarized, they are handled by the following:
* `NVPTXISelLowering.cpp` - Sets up a custom operation action and
handles it in `lowerMSTORE`. The handling is similar to normal store
vectors, except we read the mask and place a sentinel register `$noreg`
in each position where the mask reads as false.
For example,
```
t10: v8i1 = BUILD_VECTOR Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>, Constant:i1<-1>, Constant:i1<0>, Constant:i1<0>, Constant:i1<-1>
t11: ch = masked_store<(store unknown-size into %ir.lsr.iv28, align 32, addrspace 1)> t5:1, t5, t7, undef:i64, t10
->
STV_i32_v8 killed %13:int32regs, $noreg, $noreg, killed %16:int32regs, killed %17:int32regs, $noreg, $noreg, killed %20:int32regs, 0, 0, 1, 8, 0, 32, %4:int64regs, 0, debug-location !18 :: (store unknown-size into %ir.lsr.iv28, align 32, addrspace 1);
```
* `NVPTXInstrInfo.td` - Changes the definition of store vectors to allow
for a mix of sink symbols and registers.
* `NVPTXInstPrinter.h/.cpp` - Handles the `$noreg` case by printing "_".
# Masked load lowering implementation details
Masked loads are routed to normal PTX loads, with one difference: a
`#pragma "used_bytes_mask"` is emitted before the load instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#pragma-strings-used-bytes-mask).
To accomplish this, a new operand is added to every NVPTXISD Load type
representing this mask.
* `NVPTXISelLowering.h/.cpp` - Masked loads are converted into normal
NVPTXISD loads with a mask operand in two ways. 1) In type legalization
through replaceLoadVector, which is the normal path, and 2) through
LowerMLOAD, to handle the legal vector types
(v2f16/v2bf16/v2i16/v4i8/v2f32) that will not be type legalized. Both
share the same convertMLOADToLoadWithUsedBytesMask helper. Both default
this operand to UINT32_MAX, representing all bytes on. For the latter,
we need a new `NVPTXISD::MLoadV1` type to represent that edge case
because we cannot put the used bytes mask operand on a generic
LoadSDNode.
* `NVPTXISelDAGToDAG.cpp` - Extracts the used bytes mask from loads and
adds it to the created machine instructions.
* `NVPTXInstPrinter.h/.cpp` - Print the pragma when the used bytes mask
isn't all ones.
* `NVPTXForwardParams.cpp`, `NVPTXReplaceImageHandles.cpp` - Update
manual indexing of load operands to account for new operand.
* `NVPTXInstrInfo.td`, `NVPTXIntrinsics.td` - Add the used bytes mask to
the MI definitions.
* `NVPTXTagInvariantLoads.cpp` - Ensure that masked loads also get
tagged as invariant.
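As a mental model of the operand's encoding (a standalone sketch
inferred from the UINT32_MAX "all bytes on" default, not the backend's
actual code), the mask carries one bit per byte of the vector access:
```cpp
#include <cstdint>
#include <vector>

// Expand a per-lane i1 mask into a per-byte used-bytes mask. For v8i32
// with lanes 1,0,0,1,1,0,0,1 (the BUILD_VECTOR above) and 4-byte
// elements this yields 0xF00FF00F; all lanes on yields UINT32_MAX.
uint32_t usedBytesMask(const std::vector<bool> &LaneMask, unsigned EltBytes) {
  uint32_t Mask = 0;
  for (size_t L = 0; L < LaneMask.size(); ++L)
    if (LaneMask[L])
      for (unsigned B = 0; B != EltBytes; ++B)
        Mask |= 1u << (L * EltBytes + B);
  return Mask;
}
```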
Some generic changes that are needed:
* `LegalizeVectorTypes.cpp` - Ensure flags are preserved when splitting
masked loads.
* `SelectionDAGBuilder.cpp` - Preserve `MD_invariant_load` on masked
load SDNode creation.
For HIP, the SPIR-V backend can optionally be activated with the -use-spirv-backend flag. This option uses the SPIR-V backend instead of the SPIR-V translator. These changes also ensure that -use-spirv-backend does not require external dependencies such as spirv-as and spirv-link.
Add an end-to-end (non-LTO) test verifying that the optimization
pipeline is set up correctly for Profile Guided Heap Optimization (PGHO)
transforms. Ensure that both PGHO and AllocToken can stack, and that the
AllocToken pass does not interfere with PGHO or vice versa.
Before PR #152775, `llvm::getLoopEstimatedTripCount` never returned 0.
If `llvm::setLoopEstimatedTripCount` were called with 0, it would zero
branch weights, causing `llvm::getLoopEstimatedTripCount` to return
`std::nullopt`.
PR #152775 changed that behavior: if `llvm::setLoopEstimatedTripCount`
is called with 0, it sets `llvm.loop.estimated_trip_count` to 0, causing
`llvm::getLoopEstimatedTripCount` to return 0. However, it kept
documentation saying `llvm::getLoopEstimatedTripCount` returns a
positive count.
Some passes continue to assume `llvm::getLoopEstimatedTripCount` never
returns 0 and crash if it does, as reported in issue #164254. To restore
the behavior they expect, this patch changes
`llvm::getLoopEstimatedTripCount` to return `std::nullopt` when
`llvm.loop.estimated_trip_count` is 0.
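For callers, the restored contract looks like this (minimal sketch):
```cpp
// After this patch, a present value is always positive; an
// llvm.loop.estimated_trip_count of 0 comes back as std::nullopt.
if (std::optional<unsigned> TC = llvm::getLoopEstimatedTripCount(L)) {
  assert(*TC > 0 && "getLoopEstimatedTripCount never returns 0");
  // ... safe to divide by or otherwise assume a non-zero *TC ...
}
```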
When rewriting operator new calls to their hot/cold variants for PGHO,
`!alloc_token` metadata was being dropped. This metadata is required by
the AllocToken pass to correctly instrument the optimized allocation.
Fix it by preserving all metadata.
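The fix is essentially a one-liner at the rewrite site (sketch;
surrounding rewrite code elided, variable names hypothetical):
```cpp
// Carry over *all* metadata from the original call, which includes
// !alloc_token, instead of copying selected kinds such as !prof.
NewCall->copyMetadata(*OrigCall);
```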
As mentioned in
a27bb38ee6/mlir/include/mlir/Dialect/Quant/IR/QuantTypes.h (L109)
`QuantType::getStorageType` doesn't capture the sign information. This
led to the following IR failing verification:
```
func.func @clamp(%arg0:tensor<?x112x112x32x!quant.uniform<u8:f32, 0.023529412224888802:-128>>) -> (tensor<?x112x112x32x!quant.uniform<u8:f32, 0.023529412224888802:-128>>) {
%0 = tosa.clamp %arg0 {max_val = 255 : ui8, min_val = 0 : ui8} : (tensor<?x112x112x32x!quant.uniform<u8:f32, 0.023529412224888802:-128>>) -> tensor<?x112x112x32x!quant.uniform<u8:f32, 0.023529412224888802:-128>>
return %0 : tensor<?x112x112x32x!quant.uniform<u8:f32, 0.023529412224888802:-128>>
}
```
with the error `'tosa.clamp' op min/max attributes types are incompatible
with input/output element types`, since `getStorageType` was returning a
signed integer type while the clamp attributes were unsigned.
This PR updates the usages of `getStorageType` in the TOSA codebase to
correctly use the sign information of the quantized type.
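The fix pattern is roughly the following (a sketch; the actual TOSA call
sites vary):
```cpp
// getStorageType() alone yields only the storage width (e.g. i8);
// pair it with isSigned() to pick the right attribute signedness.
if (auto qTy = mlir::dyn_cast<mlir::quant::QuantizedType>(elemTy)) {
  mlir::Type storage = qTy.getStorageType(); // i8 for the u8:f32 above
  bool isSigned = qTy.isSigned();            // false for u8
  // Build the clamp min/max attributes as ui8 when !isSigned.
}
```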
SPIR-V doesn't support variadic functions, though we make an exception
for `printf`.
If we don't error, we generate invalid SPIR-V because the backend has no
idea how to codegen vararg functions, as they are not described in the spec.
We get asm like this:
```
%27 = OpFunction %6 None %7
%28 = OpFunctionParameter %4
; -- End function
```
The above asm is totally invalid, there's no `OpFunctionEnd` and it
causes crashes in downstream tools like `spirv-as` and `spirv-link`.
We already have many `printf` tests locking down that this doesn't break
`printf`; it was already handled elsewhere by the time the error check
runs.
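The check itself is conceptually tiny (a hedged sketch; the diagnostic
ID and exact placement are hypothetical here):
```cpp
// Reject variadic functions when targeting SPIR-V, keeping the printf
// carve-out described above. err_variadic_spirv is a hypothetical ID.
if (FD->isVariadic() && Context.getTargetInfo().getTriple().isSPIRV() &&
    FD->getName() != "printf")
  Diag(FD->getLocation(), diag::err_variadic_spirv);
```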
Note the SPIR-V Translator does the same thing, see
[here](https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/2703).
---------
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
These are C tests, not C++, so no function parameters means unspecified
number of parameters, not `void`.
These compile fine on the currently tested offload targets because an
error is only
[thrown](https://github.com/llvm/llvm-project/blob/main/clang/lib/Sema/SemaDecl.cpp#L10695)
if the calling convention doesn't support variadic arguments, and the
calling conventions of those targets happen to support them.
When compiling this test for other targets that do not support variadic
arguments, we get an error, which does not seem intentional.
Just add `void` to the parameter list.
---------
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
This prevents the scheduler from thinking copy instructions are free. In
#167008, we saw cases where the scheduler moved ABI copies past other
instructions, creating high register pressure that caused the register
allocator to run out of registers. They can't be spilled because the
physical register's lifetime was increased, not the virtual register's.
Ideally, we would detect what register class the COPY is for, but for now
I've just treated it as a scalar integer copy.
VPVector(End)PointerRecipes are single-scalar if all their operands are.
This should be effectively NFC currently, but it should re-enable cost
checking for some more VPWidenMemoryRecipes after
https://github.com/llvm/llvm-project/pull/157387 as discovered by
John Brawn.
Refines the existing conversion to allow `fir.do_loop` annotated with
`unordered` to be lowered to `scf.parallel`, while other loops retain
their original lowering.
These were in TargetLibraryInfo, but missing from RuntimeLibcalls.
This only adds the cases that already have the non-chk variants.
It copies the enabled-by-default logic from TargetLibraryInfo,
which is probably overly permissive. Only isPS opts out.
This patch is a follow-up of #162306 for the reduction clause.
Inside the compute region that carries the reduction clause, a new
hlfir.declare is generated for each symbol appearing in the reduction clause.
The input of this hlfir.declare is the acc.reduction result. The related
semantics::Symbol is remapped to the hlfir.declare result so that any
reference to the symbol inside the compute region will use this SSA
value as the starting point instead of the SSA value for the host
address.
`[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue.
- https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
The following functions/classes have been annotated in this patch:
- [x] `bind_back`, `bind_front`, `bind`
- [x] `function`, `mem_fn`
- [x] `reference_wrapper`
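As a motivating example, discarding any of these results is a silent
no-op that the annotation now flags (sketch):
```cpp
#include <functional>

int add(int a, int b) { return a + b; }

void demo() {
  std::bind_front(add, 1);          // now warns: result discarded, no effect
  auto f = std::bind_front(add, 1); // intended use
  (void)f(2);                       // == 3
}
```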
The command line reality is this:
$ clang -c prog.c -fveclib=accelerate
error: invalid value 'accelerate' in '-fveclib=accelerate'
$ clang -c prog.c -fveclib=Accelerate
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=libmvec
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=LIBMVEC
error: invalid value 'LIBMVEC' in '-fveclib=LIBMVEC'
$ clang -c prog.c -fveclib=massv
error: invalid value 'massv' in '-fveclib=massv'
$ clang -c prog.c -fveclib=MASSV
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=sleef
error: invalid value 'sleef' in '-fveclib=sleef'
$ clang -c prog.c -fveclib=sleefgnuabi
error: invalid value 'sleefgnuabi' in '-fveclib=sleefgnuabi'
$ clang -c prog.c -fveclib=SLEEF
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=darwin_libsystem_m
error: invalid value 'darwin' in '-fveclib=darwin_libsystem_m'
$ clang -c prog.c -fveclib=Darwin_libsystem_m
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=armpl
error: invalid value 'armpl' in '-fveclib=armpl'
$ clang -c prog.c -fveclib=ARMPL
error: invalid value 'ARMPL' in '-fveclib=ARMPL'
$ clang -c prog.c -fveclib=ArmPL
prog.c:1:2: warning: This is only a test [-W#warnings]
1 | #warning This is only a test
| ^
1 warning generated.
$ clang -c prog.c -fveclib=amdlibm
error: invalid value 'amdlibm' in '-fveclib=amdlibm'
$ clang -c prog.c -fveclib=AMDLIBM
clang: error: unsupported option 'AMDLIBM' for target 'aarch64'
This PR adds hardware-measured latencies for all instructions defined in
Section 13 of the RVV specification: "Vector Floating-Point
Instructions" to the SpacemiT-X60 scheduling model.
This patch does the lowering for a 'declare' construct that is not at
function-local scope. It also does the lowering for 'create', which has
an entry-op of create and exit-op of delete.
Global/NS/Struct scope 'declare's emit a single 'acc_ctor' and
'acc_dtor' (except in the case of 'link') per variable referenced. The
ctor is the entry op followed by a declare_enter. The dtor is a
get_device_ptr, followed by a declare_exit, followed by a delete(exit
op). This DOES include any necessary bounds.
This patch implements all of the above. We use a separate 'visitor' for
the clauses here since it is particularly different from the other uses,
AND there are only 4 valid clauses. Additionally, we had to split the
modifier conversion into its own 'helpers' file, which will hopefully
get some additional use in the future.
Values were parsed into an unsigned APInt with just enough bit width to
hold the number, then interpreted as signed values. This resulted in
hex, octal, and binary literals being interpreted as negative when the
most significant bit is 1.
For example, `-0b11` would have a bit width of 2, be interpreted as -1,
and then be negated to become 1.
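A minimal reproduction of that arithmetic with `llvm::APInt` (assuming
the minimal-width parsing described above):
```cpp
#include "llvm/ADT/APInt.h"
#include <cassert>

int main() {
  // "0b11" parsed into a minimal-width (2-bit) APInt.
  llvm::APInt V(/*numBits=*/2, /*val=*/3);
  // Read back as signed, the set MSB makes it -1 ...
  assert(V.getSExtValue() == -1);
  // ... so the leading '-' produces 1 instead of the intended -3.
  assert((-V).getSExtValue() == 1);
  // Zero-extending first preserves the magnitude.
  assert(V.zext(64).getSExtValue() == 3);
}
```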
This patch updates the mbarrier.arrive.* family of Ops to include
all features added up to Blackwell.
* Update the `mbarrier.arrive` Op to include shared_cluster
memory space, cta/cluster scope and an option to lower using
relaxed semantics.
* An `arrive_drop` variant is added for both the `arrive` and
`arrive.nocomplete` operations.
* Updates for expect_tx and complete_tx operations.
* Verifier checks are added wherever appropriate.
* lit tests are added to verify the lowering to the intrinsics.
TODO:
* Updates for the remaining mbarrier family will be done in
subsequent PRs. (mainly, arrive.expect-tx, test_wait and try_waits)
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Previously, only literal upper-bounded loops were recognized. This patch
relaxes this matching to accept any compile-time deducible constant
expression.
It would be better to rely on the SVals (values from the symbolic
domain), as those could potentially have more accurate answers, but this
one is much simpler.
Note that at the time we calculate this value, we have not evaluated the
sub-exprs of the condition; consequently, we can't just query the
Environment for the folded SVal.
Because of this, the next best tool in our toolbox is compile-time
evaluating the Expr.
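Concretely, the fallback is along these lines (sketch; `BoundExpr` and
`ACtx` are stand-ins for the checker's actual variables):
```cpp
// Fold the bound with Clang's constant evaluator, since the Environment
// has no folded SVal for the condition's sub-exprs at this point.
clang::Expr::EvalResult Res;
if (BoundExpr->EvaluateAsInt(Res, ACtx)) {
  llvm::APSInt Bound = Res.Val.getInt();
  // ... use Bound as the compile-time deduced upper bound ...
}
```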
rdar://165363923