Summary:
The Offloading library wraps around the underlying plugins. The problem
is that we currently initialize all plugins we find, even if they are
not needed for the program. This is very expensive for trivial uses, as
fully heterogenous usage is quite rare. In practice this means that you
will always pay a 200 ms penalty for having CUDA installed.
This patch changes the behavior to provide accessors into the plugins
and devices that allows them to be initialized lazily. We use a
once_flag, this should properly take a fast-path check while still
blocking on concurrent use.
Making full use of this will require a way to filter platforms more
specifically. I'm thinking of what this would look like as an API.
I'm thinking that we either have an extra iterate function that takes a
callback on the platform, or we just provide a helper to find all the
devices that can run a given image. Maybe both?
Fixes: https://github.com/llvm/llvm-project/issues/159636
This PR adds deferral of descriptor maps until they are necessary for
assumed dummy argument types. The intent is to avoid a problem where a
user can inadvertently map a temporary local descriptor to device
without their knowledge and proceed to never unmap it. This temporary
local descriptor remains lodged in OpenMP device memory and the next
time another variable or descriptor residing in the same stack address
is mapped we incur a runtime OpenMP map error as we try to remap the
same address.
This fix was discussed with the OpenMP committee and applies to OpenMP
5.2 and below, future versions of OpenMP can avoid this issue via the
attach semantics added to the specification.
Summary:
Need to verify this actually has a device. We really need to rework this
to point to a real impolementation, or streamline it to handle this
automatically.
Summary:
This check is unnecessarily restrictive and currently incorrectly fires
for any size less than eight bytes. Just remove it, we do sanity checks
elsewhere and at some point need to trust the ABI.
This implements two pieces to restore the interop functionality (that I
broke) when the 6.0 interfaces were added:
* A set of wrappers that support the old interfaces on top of the new
ones
* The same level of interop support for the CUDA amd AMD plugins
This PR adds support for nested derived types and their mappers to the
MapInfoFinalization pass.
- Generalize MapInfoFinalization to add child maps for arbitrarily
nested allocatables when a derived object is mapped via declare mapper.
- Traverse HLFIR designates rooted at the target block arg and build
full coordinate_of chains; append members with correct membersIndex.
This fixes#156461.
Summary:
Several of these tests have been failing for literal years. Ideally we
make efforts to fix this, but keeping these broken has had serious
consequences on our testing infrastructure where failures are the norm
so almost all test failures are disregarded. I made a tracking issue for
the ones that have been disabled.
https://github.com/llvm/llvm-project/issues/161265
`PRIVATE | ATTACH` maps can be used to represent firstprivate pointers
that should be initialized by doing doing the pointee's device address,
if its lookup succeeds, or retain the original host pointee's address
otherwise.
With this, for a test like the following:
```f90
integer, pointer :: p(:)
!$omp target map(p(1))
... print*, p(1)
!$omp end target
```
The codegen can look like:
```llvm
; maps for p:
; &p(1), &p(1), sizeof(p(1)), TO|FROM //(1)
; &ref_ptr(p), &p(1), sizeof(ref_ptr(p)), ATTACH //(2)
; &ref_ptr(p), &p(1), sizeof(ref_ptr(p)), PRIVATE|ATTACH|PARAM //(3)
call... @__omp_outlined...(ptr %ref_ptr_of_p)
```
* `(1)` maps the pointee `p(1)`.
* `(2)` attaches it to the (previously) mapped `ref_ptr(p)`, if present.
It can be controlled via OpenMP 6.1's `attach(auto/always/never)`
map-type modifiers.
* `(3)` privatizes and initializes the local `ref_ptr(p)`, which gets
passed
in as the kernel argument `%ref_ptr_of_p`. Can be skipped if p is not
referenced directly within the region.
While similar mapping can be used for C/C++, it's more important/useful
for Fortran as we can avoid creating another argument for passing the
descriptor, and use that to initialize the private copy in the body of
the kernel.
Summary:
Forgot to port this option's old handling from offload. It's not way
easier since they're built in the same CMake project. Also delete the
leftover directory that's not used anymore, don't know how that was
still there.
Currently, devices store a raw pointer to back to their owning Platform.
Platforms are stored directly inside of a vector. Modifying this vector
risks invalidating all the platform pointers stored in devices.
This patch allocates platforms individually, and changes devices to
store a reference to its platform instead of a pointer. This is safe,
because platforms are guaranteed to outlive the devices they contain.
Enable the generation of no-loop kernels for Fortran OpenMP code. target
teams distribute parallel do pragmas can be promoted to no-loop kernels
if the user adds the -fopenmp-assume-teams-oversubscription and
-fopenmp-assume-threads-oversubscription flags.
If the OpenMP kernel contains reduction or num_teams clauses, it is not
promoted to no-loop mode.
The global OpenMP device RTL oversubscription flags no longer force
no-loop code generation for Fortran.
This PR adds support for nested derived types and their mappers to the
MapInfoFinalization pass.
- Generalize MapInfoFinalization to add child maps for arbitrarily
nested allocatables when a derived object is mapped via declare mapper.
- Traverse HLFIR designates rooted at the target block arg and build
full coordinate_of chains; append members with correct membersIndex.
This fixes#156461.
Summary:
The host plugin is basically OpenMP specific and doesn't work very well.
Previously we were skipping over it in the list instead of just not
adding it at all.
If olMemAlloc happens to allocate memory that was already allocated
elsewhere (possibly by another device on another platform), it is now
thrown away and a new allocation generated.
A new `AllocBases` vector is now available, which is an ordered list
of allocation start addresses.
Currently there are two serialization modes for bitstream Remarks:
standalone and separate. The separate mode splits remark metadata (e.g.
the string table) from actual remark data. The metadata is written into
the object file by the AsmPrinter, while the remark data is stored in a
separate remarks file. This means we can't use bitstream remarks with
tools like opt that don't generate an object file. Also, it is confusing
to post-process bitstream remarks files, because only the standalone
files can be read by llvm-remarkutil. We always need to use dsymutil
to convert the separate files to standalone files, which only works for
MachO. It is not possible for clang/opt to directly emit bitstream
remark files in standalone mode, because the string table can only be
serialized after all remarks were emitted.
Therefore, this change completely removes the separate serialization
mode. Instead, the remark string table is now always written to the end
of the remarks file. This requires us to tell the serializer when to
finalize remark serialization. This automatically happens when the
serializer goes out of scope. However, often the remark file goes out of
scope before the serializer is destroyed. To diagnose this, I have added
an assert to alert users that they need to explicitly call
finalizeLLVMOptimizationRemarks.
This change paves the way for further improvements to the remark
infrastructure, including more tooling (e.g. #159784), size optimizations
for bitstream remarks, and more.
Pull Request: https://github.com/llvm/llvm-project/pull/156715
Summary:
This was originally added in as a hack to work around CUDA's limitation
on allocation. The `libc` implementation now isn't even used for CUDA so
this code is never hit. Even if this case, this code never truly worked.
A true solution would be to use CUDA's virtual memory API instead to
allocate 2MiB slabs independenctly from the normal memory management
done in the stream.
Summary:
I made the GPU flags accept more of the default LLVM warnings, which
triggered some new cases. Clean those up and fix some other ones while
I'm at it.
With declare mapper, the parent base entry was emitted as `TARGET_PARAM`
only. The mapper received a map-type without `to/from`, causing
components to degrade to `alloc`-only (no copies), breaking allocatable
payload mapping. This PR preserves the map-type bits from the parent.
This fixes#156466.
Summary:
This exposes the 'isDeviceCompatible' routine for checking if a binary
*can* be loaded. This is useful if people don't want to consume errors
everywhere when figuring out which image to put to what device.
I don't know if this is a good name, I was thining like `olIsCompatible`
or whatever. Let me know what you think.
Long term I'd like to be able to do something similar to what OpenMP
does where we can conditionally only initialize devices if we need them.
That's going to be support needed if we want this to be more
generic.
Summary:
Turns out the new CUDA ABI now applies retroactively to all the other
SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping
all the SM flags consistent but using an offset.
Fixes: https://github.com/llvm/llvm-project/issues/159088
Currently get this error
```
offload/plugins-nextgen/common/src/PluginInterface.cpp:859:63: error: member reference type 'StringRef' is not a pointer; did you mean to use '.'?
```
We pass the full image binary now so we can't really print anything
useful here.
Seems introduced in https://github.com/llvm/llvm-project/pull/158748.
---------
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary:
Currently we have this `__tgt_device_image` indirection which just takes
a reference to some pointers. This was all find and good when the only
usage of this was from a section of GPU code that came from an ELF
constant section. However, we have expanded beyond that and now need to
worry about managing lifetimes. We have code that references the image
even after it was loaded internally. This patch changes the
implementation to instaed copy the memory buffer and manage it locally.
This PR reworks the JIT and other image handling to directly manage its
own memory. We now don't need to duplicate this behavior externally at
the Offload API level. Also we actually free these if the user unloads
them.
Upside, less likely to crash and burn. Downside, more latency when
loading an image.
This change adds support for saving full contents of attached Fortran
descriptors, and not just their pointee address, in the shadow-pointer
table.
With this, we now support:
* comparing full contents of descriptors to check whether a previous
shadow-pointer entry is stale;
* restoring the full contents of descriptors
And with that, we can now use ATTACH map-types (added in #149036) for
mapping Fortran pointer/allocatable arrays, and array-sections on them.
e.g.:
```f90
integer, allocatable :: x(:)
!$omp target enter data map(to: x(:))
```
as:
```
void* addr_of_pointee = allocated(x) ? &x(1) : nullptr;
int64_t sizeof_pointee = allocated(x) ? sizeof(x(:)) : 0
addr_of_pointee, addr_of_pointee, sizeof_pointee, TO
addr_of_descriptor, addr_of_pointee, size_of_descriptor, ATTACH
```
Per the logic in top-level CMakeLists, `libomptarget` is placed into
`LLVM_LIBRARY_OUTPUT_INTDIR` when this variable is set. Adjust the test
logic to include this directory in `-L` and `-Wl,-rpath` arguments as
well, in order to fix finding tests when building via the `runtimes`
top-level directory.
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Currently, there's a number of issues with mapping characters with LEN's
specified (strings effectively). They're represented as a char type in
FIR with a len parameter, and then later on they're expanded into an
array of characters when we're translating to the LLVM dialect. However,
we don't generate a bounds for these at lowering. The fix in this PR for
this is to generate a bounds from the LEN parameter and attatch it to
the map on lowering from FIR to the LLVM dialect when we encounter this
type.
If there are no devices available for testing on liboffload, the test
will no longer throw an error when it fails to instantiate.
The tests will be silently skipped, but with a warning printed to
stderr.
Summary:
This operation is done every time we load a binary, this behavior should
be moved into OpenMP since it concerns an OpenMP specific data struct.
This is a little messy, because ideally we should only be using public
APIs, but more can be extracted later.
Permit redefining `OPENMP_STANDALONE_BUILD` to make it possible to build
offload correctly via runtimes build (i.e. build where the top-level
project is `runtimes`). This follows the same logic in `openmp`
component.
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Summary:
Currently we build the OpenMP device runtime as part of the `offload/`
project. This is problematic because it has several restrictions when
compared to the normal offloading runtime. It can only be built with an
up-to-date clang and we need to set the target appropriately. Currently
we hack around this by creating the compiler invocation manually, but
this patch moves it into a separate runtimes build.
This follows the same build we use for libc, libc++, compiler-rt, and
flang-rt. This also moves it from `offload/` into `openmp/` because it
is still the `openmp/` runtime and I feel it is more appropriate. We do
want a generic `offload/` library at some point, but it would be trivial
to then add that as a separate library now that we have the
infrastructure that makes adding these new libraries trivial.
This most importantly will require that users update their build
configs, mostly adding the following lines at a minimum. I was debating
whether or not I should 'auto-upgrade' this, but I just went with a
warning.
```
-DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda' \
-DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=openmp \
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=openmp \
```
This also changed where the `.bc` version of the library lives, but it's
still created.
This PR adds several new tests for mapping of chained structures, i.e.
those resembling:
#pragma omp target map(tofrom: a->b->c)
These are currently XFAILed, although the first two tests actually work
with unified memory -- I'm not sure if it's possible to easily improve
the condition on the XFAILs in question to make them more accurate.
These cases are all fixed by the WIP PR
https://github.com/llvm/llvm-project/pull/153683.