The flag was added in 8ef48d07ef to suppress a build warning and is no
longer needed.
It adds the "no-builtins" attribute, which prevents libclc functions
from being inlined into callers that don't have the attribute.
The flag is meant to prevent folding standard library calls into
optimized implementations. For libclc device targets, however, such
target-driven folding is desirable.
llvm-diff shows no change to amdgcn--amdhsa.bc and nvptx--nvidiacl.bc.
Co-authored-by: Mészáros Gergely <gergely.meszaros@intel.com>
Change my email address in the process. I will not be able to keep up
maintainership duties on this project in the future.
Adding the wording on the inactive maintainers section myself like this
feels self-aggrandizing but was copied from other LLVM projects.
* Replace call-site check with external declaration scan (grep declare)
to avoid false positives for not-inlined __clc_* functions.
* _clc_get_el* helpers are defined as inline in clc_shuffle2.cl, so they
have available_externally linkage. When they fail to inline they are
deleted by EliminateAvailableExternallyPass and become unresolved in
cedar-r600--.bc. Mark them static to resolve the issue.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary:
The added bit counting builtins for vectors used `cttz` and `ctlz`,
which is consistent with the LLVM naming convention. However, these are
clang builtins and implement exactly the `__builtin_ctzg` and
`__builtin_clzg` behavior. It is confusing to people familiar with
other builtins that these are the only bit counting intrinsics named
differently. This includes the additional operation for the undefined
zero case, which was added as a `clzg` extension.
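As a rough illustration of the semantics described above (not the clang
or libclc implementation), the following Python sketch models
count-trailing-zeros and count-leading-zeros with an optional fallback
for a zero input. The function names, the `width` parameter, and raising
an exception for a zero input without a fallback are assumptions made
for this sketch; in the real builtins that case is undefined.

```python
def ctzg(x, width=32, fallback=None):
    """Count trailing zero bits of x, viewed as an unsigned width-bit value."""
    x &= (1 << width) - 1
    if x == 0:
        if fallback is None:
            # Assumption: model the undefined case as an error.
            raise ValueError("zero input without a fallback is undefined")
        return fallback
    # x & -x isolates the lowest set bit.
    return (x & -x).bit_length() - 1


def clzg(x, width=32, fallback=None):
    """Count leading zero bits of x, viewed as an unsigned width-bit value."""
    x &= (1 << width) - 1
    if x == 0:
        if fallback is None:
            # Assumption: model the undefined case as an error.
            raise ValueError("zero input without a fallback is undefined")
        return fallback
    return width - x.bit_length()
```

For example, `ctzg(0, fallback=32)` returns the fallback instead of
hitting the undefined zero case.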
always_inline doesn't guarantee performance improvement.
Target-specific optimizations decide whether inlining is profitable.
Changes to amdgcn--amdhsa.bc:
* _Z9__clc_logDv16_f and _Z15__clc_remainderDv16_fS_ are not inlined.
* sincos vector function code size has doubled due to apparent
duplication.
Also replace typo _CLC_DECL with _CLC_DEF for function definition.
This fixes the `No such file or directory` error when the "Unix
Makefiles" generator is used, see
https://github.com/intel/llvm/issues/20058.
The Ninja generator implicitly creates the output directory when
generating libclc libraries, but the "Unix Makefiles" generator does not.
This PR reduces amdgcn--amdhsa.bc size by 1.8% and nvptx64--nvidiacl.bc
size by 4%.
Loop trip count is constant and backend can decide whether to unroll.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Before this PR, weak linkage was applied to a few CLC generic functions
to allow a target-specific implementation to override the generic one.
However, adding weak linkage has the side effect of preventing
inter-procedural optimization, such as PostOrderFunctionAttrsPass,
because a weak function doesn't have an exact definition (as determined
by hasExactDefinition in the pass).
This PR resolves the issue by adding the --override flag for every
non-generic bitcode file in the llvm-link run. This approach eliminates
the need for weak linkage while still allowing a target-specific
implementation to override the generic one.
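The override behaviour can be pictured with a toy model in Python. This
is only a sketch of the idea, assuming modules are plain dictionaries
from symbol name to definition; real llvm-link works on LLVM IR modules
and linkage types, and the symbol names below are purely illustrative.

```python
def link(modules, overrides=()):
    """Toy linker: 'modules' and 'overrides' are dicts of symbol -> definition."""
    linked = {}
    for module in modules:
        for name, body in module.items():
            if name in linked:
                # Without an override, two strong definitions clash.
                raise ValueError(f"symbol multiply defined: {name}")
            linked[name] = body
    for module in overrides:
        # An override module replaces previously linked definitions.
        linked.update(module)
    return linked


generic = {"__clc_sqrt": "generic", "__clc_fabs": "generic"}
target = {"__clc_sqrt": "target-specific"}
result = link([generic], overrides=[target])
```

Here `result` keeps the generic `__clc_fabs` and takes the
target-specific `__clc_sqrt`, with no weak linkage involved.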
llvm-diff shows improved attribute deduction for some functions in
amdgcn--amdhsa.bc, e.g.
%23 = tail call half @llvm.sqrt.f16(half %22)
=>
%23 = tail call noundef half @llvm.sqrt.f16(half %22)
Our downstream libclc adds a few more targets that customize build_flags
and opt_flags. In each customization block, MACRO_ARCH is then defined
to be ${ARCH}.
Hoisting the MACRO_ARCH definition out of the if-else-end block avoids
code duplication. This also avoids a potential error when the MACRO_ARCH
definition is forgotten, e.g. in https://github.com/intel/llvm/pull/19971.
AMDGPU needs a MemorySemantic argument, which identifies the memory or
address space to which the memory ordering is applied.
The MemorySemantic is also necessary for implementing the SPIR-V
MemoryBarrier instruction. Additionally, the implementation of
__clc_mem_fence on Intel GPUs requires the MemorySemantic argument.
Using __builtin_amdgcn_fence for AMDGPU is a follow-up to
https://github.com/llvm/llvm-project/pull/151446#discussion_r2254006508
llvm-diff shows no change to nvptx64--nvidiacl.bc.
The libclc sequential build issue addressed in commit 0c21d6b4c8 is
specific to the CMake MSVC generator. Therefore, this PR avoids creating
a large number of targets when a non-MSVC generator is used, such as the
Ninja generator, which is used in pre-merge CI on Windows in the
llvm-project repo. We plan to migrate from the MSVC generator to the
Ninja generator in our downstream CI to fix the flaky CMake bug `Cannot
restore timestamp`, which might be related to the large number of
targets.
Before this PR, PostOrderFunctionAttrsPass in the opt run could deduce
memory(none) for these functions.
This PR explicitly adds the attribute to align with Clang's OpenCL
headers and to ensure the attribute is present throughout the
compilation flow. Generated bitcode files amdgcn--amdhsa.bc and
nvptx64--nvidiacl.bc become slightly smaller.
Fix a regression of df74736732.
The CMake MSVC generator is a multi-configuration generator. The build
type is not known at configure time, and CMAKE_CFG_INTDIR evaluates to
$(Configuration) at configure time. The libclc install fails because
$(Configuration) in the bitcode file path is left unresolved in
libclc/cmake_install.cmake at install time.
We need a solution that resolves libclc bitcode file path at install
time. This PR fixes the issue using CMAKE_INSTALL_CONFIG_NAME which can
be evaluated at install time. This is the same solution as in
https://reviews.llvm.org/D76827
The target's output bitcode `libclc_builtins_lib` has been located in a
sub-directory of the clang resource directory since df74736732. Setting
the TARGET_FILE property allows targets in non-libclc projects to obtain
the path to `libclc_builtins_lib`.
The __clc_mem_fence and __clc_work_group_barrier functions have two
parameters, memory_scope and memory_order. The design allows the clc
functions to implement SPIR-V ControlBarrier and MemoryBarrier
functions in the future.
The default memory ordering in clc is set to __ATOMIC_SEQ_CST, which is
also the default and strongest ordering in OpenCL and C++.
The OpenCL cl_mem_fence_flags parameter is converted to a combination of
__MEMORY_SCOPE_DEVICE and __MEMORY_SCOPE_WRKGRP, which is passed to clc.
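The flag-to-scope conversion can be sketched in Python as follows. This
is a minimal model, not the libclc code: the numeric constants are
placeholder bit values, and the pairing of the global fence flag with
device scope and the local fence flag with work-group scope is an
assumption made for illustration.

```python
# Placeholder bit values, chosen for illustration only.
CLK_LOCAL_MEM_FENCE = 0x1
CLK_GLOBAL_MEM_FENCE = 0x2
MEMORY_SCOPE_DEVICE = 0x1
MEMORY_SCOPE_WRKGRP = 0x2


def clc_memory_scope(flags):
    """Translate OpenCL-style fence flags into a combined scope value."""
    scope = 0
    if flags & CLK_GLOBAL_MEM_FENCE:
        # Assumption: a global-memory fence maps to device scope.
        scope |= MEMORY_SCOPE_DEVICE
    if flags & CLK_LOCAL_MEM_FENCE:
        # Assumption: a local-memory fence maps to work-group scope.
        scope |= MEMORY_SCOPE_WRKGRP
    return scope
```

Passing both flags yields both scopes combined, which is the value that
would then be handed to the clc function.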
llvm-diff shows no change to nvptx64--nvidiacl.bc.
llvm-diff shows a small change to amdgcn--amdhsa.bc and the number of
LLVM IR instructions is reduced by 1: https://alive2.llvm.org/ce/z/_Uhqvt
This commit adds driver support for linking libclc OpenCL libraries. It
takes the form of a new optional flag: --libclc-lib=namespec. Nothing is
linked unless this flag is specified.
Not all libclc targets have corresponding clang targets. For this reason
it is desirable for users to be able to specify a libclc library name.
We support this by accepting either a library name (without the .bc
suffix) or a filename. Both are searched for in the clang resource
directory. Filenames are also checked as given, so that absolute paths
can be provided. The syntax for specifying filenames (as opposed to
library names) uses a leading colon (:), inspired by the -l option.
To accommodate this option, libclc libraries are now placed into clang's
resource directory in an in-tree configuration. The libraries are all
placed in <resource-dir>/lib/libclc and are not grouped under
host-specific directories as some other runtime libraries are; it is not
expected that OpenCL libraries will differ depending on the host
toolchain.
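The namespec lookup described in this commit can be sketched roughly in
Python. This is not the driver code: the function name is hypothetical,
and the order in which candidates are tried is an assumption. Only the
leading-colon syntax, the .bc suffix, and the <resource-dir>/lib/libclc
location come from the description above.

```python
from pathlib import PurePosixPath


def libclc_candidates(namespec, resource_dir):
    """Return the paths to try, in order, for a --libclc-lib=namespec value."""
    libdir = PurePosixPath(resource_dir) / "lib" / "libclc"
    if namespec.startswith(":"):
        # ":name" is a filename: look in the resource directory, then
        # try the filename as given (this covers absolute paths).
        filename = namespec[1:]
        candidates = [libdir / filename, PurePosixPath(filename)]
    else:
        # A bare library name gets the .bc suffix appended.
        candidates = [libdir / (namespec + ".bc")]
    ordered = []
    for path in candidates:
        if str(path) not in ordered:
            ordered.append(str(path))
    return ordered
```

With a hypothetical resource directory `/res`, the namespec
`amdgcn--amdhsa` yields `/res/lib/libclc/amdgcn--amdhsa.bc`, while
`:/abs/foo.bc` yields only the absolute path.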
Currently only the AMDGPU toolchain supports this option as a proof of
concept. Other targets such as NVPTX or SPIR/SPIR-V could support it
too. We could optionally let target toolchains search for libclc
libraries themselves, possibly when passed an empty --libclc-lib.
This removes the dependency on an external tool to build the SPIR-V
files. It may be of interest to projects such as Mesa.
Note that the option is off by default as using the SPIR-V backend, at
least on my machine, uses a *lot* of memory and the process is often
killed in a parallelized build. It does complete, however.
Fixes #135327.
With this commit, the CLC fmin/fmax builtins use clang's
__builtin_elementwise_(min|max)imumnum which helps us generate LLVM
minimumnum/maximumnum intrinsics directly. These intrinsics uniformly
select the non-NaN input over the (quiet or signalling) NaN input, which
corresponds to what the OpenCL CTS tests.
These intrinsics maintain the vector types, as opposed to scalarizing,
which was previously happening. This commit therefore helps to optimize
codegen for those targets.
Note that there is ongoing discussion regarding how these builtins
should handle signalling NaNs in the OpenCL specification and whether
they should be able to return a quiet NaN as per the IEEE behaviour. If
the specification and/or CTS is ever updated to allow or mandate
returning a qNAN, these builtins could/should be updated to use
__builtin_elementwise_(min|max)num instead which would lower to LLVM
minnum/maxnum intrinsics.
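The NaN handling described above can be modelled on scalars in Python.
This is a sketch of the minimumnum/maximumnum selection rule only, not
of the clang builtins or the libclc code; the function names are chosen
for the sketch, Python floats do not distinguish quiet from signalling
NaNs, and the treatment of -0.0 as smaller than +0.0 is included as an
assumption about the intrinsic semantics.

```python
import math


def minimumnum(a, b):
    """Return the smaller value, preferring a number over a NaN."""
    if math.isnan(a):
        return b  # still NaN only when both inputs are NaN
    if math.isnan(b):
        return a
    if a == b:
        # Equal values include +0.0 and -0.0: prefer the negative zero.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b


def maximumnum(a, b):
    """Return the larger value, preferring a number over a NaN."""
    if math.isnan(a):
        return b  # still NaN only when both inputs are NaN
    if math.isnan(b):
        return a
    if a == b:
        # Equal values include +0.0 and -0.0: prefer the positive zero.
        return a if math.copysign(1.0, a) > 0 else b
    return a if a > b else b
```

A minnum/maxnum-style variant that may return a quiet NaN would differ
only in the first two branches.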
The SPIR-V targets maintain the old implementations, as the LLVM ->
SPIR-V translator can't currently handle the LLVM intrinsics. The
implementation has been simplified to consistently use clang builtins,
as opposed to before where the half version was explicitly defined.
[1] https://github.com/KhronosGroup/OpenCL-CTS/pull/2285
With libclc being a 'runtime', the top-level build assumes that there is
a corresponding 'libclc' target. We previously weren't providing this,
leading to a build failure if the user tried to build it.
This commit remedies this by adding support for building the 'libclc'
target. It does so by adding dependencies from the OpenCL builtins to
this target. It uses a configurable in-between target -
libclc-opencl-builtins - to ease the possibility of adding non-OpenCL
builtin libraries in the future.
Also delete unary_def_via_fp32.inc. There are small changes in
amdgcn--amdhsa.bc because the vector conversion is now scalarized, e.g.
%2 = fpext <4 x half> %0 to <4 x float>
%3 = extractelement <4 x float> %2, i64 0
%4 = tail call float @llvm.fabs.f32(float %3)
->
%2 = extractelement <4 x half> %0, i64 0
%3 = tail call half @llvm.fabs.f16(half %2)
%4 = fpext half %3 to float
Fix the symlink creation logic to use relative paths instead of
absolute, in order to ensure that the installed symlinks actually refer
to the installed .bc files rather than the ones from the build
directory. This was broken in #146833. The change is a bit roundabout
but it attempts to preserve the spirit of #146833, that is the ability
to use multiple output directories (provided they all reside in
`${LIBCLC_OUTPUT_LIBRARY_DIR}` and preserve the same structure in the
installed tree).
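The idea behind the relative links can be shown with a short Python
sketch. It is an illustration only, not the CMake logic of this change;
the function name and all paths are hypothetical.

```python
import posixpath


def relative_link_target(link_path, target_path):
    """Compute the symlink target relative to the link's own directory."""
    return posixpath.relpath(target_path, start=posixpath.dirname(link_path))


# Hypothetical build-tree layout.
build = relative_link_target(
    "/build/share/clc/a.bc", "/build/lib/clang/22/lib/libclc/a.bc"
)
# The same layout after installation under a different prefix.
installed = relative_link_target(
    "/usr/share/clc/a.bc", "/usr/lib/clang/22/lib/libclc/a.bc"
)
```

Both calls give the same relative target, so a link created this way
keeps pointing at the file next to it after installation, whereas an
absolute target would keep pointing into the build directory.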
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Fix `libclc/utils/CMakeLists.txt` to expose `prepare_builtins_*`
variables in parent scope. This was a regression introduced in #148815,
where the code was moved into a subdirectory and the variables were no
longer accessible to calls in the top-level CMakeLists, resulting in
attempts to build targets with an empty command:
```
[1566/1676] cd /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build && -o /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/obj.libclc.dir/clspv--/builtins.opt.clspv--.bc
FAILED: clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc
cd /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build && -o /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/obj.libclc.dir/clspv--/builtins.opt.clspv--.bc
/bin/sh: line 1: -o: command not found
```
Add corresponding clc functions, which are implemented with clang
__scoped_atomic builtins. OpenCL functions are implemented as a wrapper
over clc functions.
Also change legacy atomic_inc and atomic_dec to re-use the newly added
clc_atomic_inc/dec implementations. llvm-diff shows no change to
atomic_inc and atomic_dec in bitcode.
Notes:
* Generic OpenCL built-in functions use __ATOMIC_SEQ_CST and
__MEMORY_SCOPE_DEVICE for memory order and memory scope parameters.
* OpenCL atomic_*_explicit, atomic_flag* built-ins are not implemented
yet.
* OpenCL built-ins of atomic_intptr_t, atomic_uintptr_t, atomic_size_t
and atomic_ptrdiff_t types are not implemented yet.
* llvm-diff shows no change to nvptx64--nvidiacl.bc and
amdgcn--amdhsa.bc since __opencl_c_atomic_order_seq_cst and
__opencl_c_atomic_scope_device are not defined in these two targets.
The implementation is based on reference implementation in
OpenCL-CTS/test_integer_ops. The generic implementations pass
OpenCL-CTS/test_integer_ops tests on Intel GPU.
The file is listing build artifacts to ignore, but LLVM has long had the
policy that in-tree builds are not supported, so the ignore rules
shouldn't serve their original purpose anymore.
The rules however are annoying because although they probably intended
only to ignore top-level build artifacts, they lack the leading `/` so
they match any file with the ignored name anywhere under `libclc/`.
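The difference a leading `/` makes can be modelled with a simplified
matcher in Python. This is a sketch under stated assumptions, not git's
matching algorithm: it only handles patterns without any other slash,
paths are taken relative to `libclc/`, and the pattern name `build` is a
hypothetical example rather than one of the removed rules.

```python
from fnmatch import fnmatchcase


def is_ignored(path, pattern):
    """Simplified ignore check for patterns with no slash besides a leading one."""
    parts = path.split("/")
    if pattern.startswith("/"):
        # Anchored: only the top-level entry can match.
        return fnmatchcase(parts[0], pattern[1:])
    # Unanchored: an entry with that name matches at any depth.
    return any(fnmatchcase(part, pattern) for part in parts)
```

Under this model `is_ignored("clc/lib/build/x.cl", "build")` is true,
while the anchored pattern `/build` leaves that nested path alone.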
Changes in this PR:
* Declare most of workitem functions in clc and opencl folders.
* Call clc workitem function in corresponding OpenCL workitem function.
* Move ptx-nvidiacl workitem built-in implementations into clc.
* Move a few amdgcn workitem built-in implementations into clc.
* Include only needed headers in OpenCL workitem functions.
* Implement get_local_linear_id, get_max_sub_group_size,
get_num_sub_groups,
get_sub_group_id, get_sub_group_local_id, get_sub_group_size for
ptx-nvidiacl.
llvm-diff shows this PR adds a few new symbols to nvptx64--nvidiacl.bc.
llvm-diff shows no change to amdgcn--amdhsa.bc, nvptx--.bc and
nvptx64--.bc.
This commit finishes the work started in #146840 and #147276. It makes
each OpenCL header self-contained and each implementation file include
only the headers it needs. It removes the need for a catch-all include
file of all OpenCL builtin declarations.
This commit continues the work from #146840 and extends it to the maths,
geometric, common, and relational directories.
All headers have include guards and, where appropriate, include the
minimal code required for their specific definitions. Implementation
files no longer include the large catch-all header of all OpenCL builtin
declarations.