It is possible for a module to have so many kernels that the 4 GB
limit of GPU VA is depleted when each kernel allocates a 64 KB page
for its own ISA. In such a case, propagate
ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY to the API caller to indicate the
actual problem.
Currently this scenario is not detected; execution advances a bit
further and the subsequent crashes do not let the user easily
understand what happened.
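A minimal sketch of the failure mode, assuming a hypothetical
allocator and simplified result codes (the real runtime returns the
Level Zero ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY code):

```cpp
#include <cstdio>
#include <cstdint>

// Illustrative names, not the runtime's actual types.
enum Result { RESULT_SUCCESS, RESULT_OUT_OF_DEVICE_MEMORY };

constexpr uint64_t gpuVaLimit = 4ull << 30;   // 4 GB VA window for ISA
constexpr uint64_t isaPageSize = 64ull << 10; // 64 KB page per kernel
uint64_t usedVa = 0;

// Each kernel reserves one 64 KB page; fail once the window is full
// instead of advancing and crashing later.
Result allocateKernelIsa() {
    if (usedVa + isaPageSize > gpuVaLimit)
        return RESULT_OUT_OF_DEVICE_MEMORY; // propagate to API caller
    usedVa += isaPageSize;
    return RESULT_SUCCESS;
}

int main() {
    uint64_t kernels = 0;
    while (allocateKernelIsa() == RESULT_SUCCESS)
        ++kernels;
    // 4 GB / 64 KB = 65536 kernels exhaust the window.
    std::printf("ISA VA exhausted after %llu kernels\n",
                static_cast<unsigned long long>(kernels));
}
```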
Related-To: NEO-7788
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
After resolving NEO-7684 it turns out that `zeKernelGetProperties`
still returns a wrong value for `maxNumSubgroups`, since it did not
take the `LargeGRF & SIMD` limitation into account.
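A minimal sketch of the corrected derivation, assuming the PVC figures
from the NEO-7684 sizing context below; all names are illustrative:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical helper: the kernel-specific work-group size limit
// (4 HW threads per EU with LargeGRF, 8 otherwise, 8 EUs/subslice).
uint32_t maxWorkGroupSize(bool largeGrf, uint32_t simd) {
    const uint32_t threadsPerEu = largeGrf ? 4u : 8u;
    const uint32_t eusPerSubslice = 8u;
    return threadsPerEu * simd * eusPerSubslice;
}

// maxNumSubgroups must be derived from the kernel-specific limit,
// not from the device-wide maximum.
uint32_t maxNumSubgroups(bool largeGrf, uint32_t simd) {
    return maxWorkGroupSize(largeGrf, simd) / simd;
}

int main() {
    // LargeGRF + SIMD16: 512 / 16 = 32 subgroups, not 1024 / 16 = 64.
    std::printf("%u\n", maxNumSubgroups(true, 16));
}
```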
Related-To: NEO-7829
Signed-off-by: Krzysztof Gibala <krzysztof.gibala@intel.com>
Do not make indirect allocations resident if the kernel does not use
indirect access.
Applies to both Level Zero and OpenCL.
Currently disabled by default; enable with the debug flag
DetectIndirectAccessInKernel
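A minimal sketch of the residency decision, with hypothetical types;
the real change analyzes the kernel code to detect indirect access:

```cpp
#include <vector>

struct Allocation {}; // stand-in for a GPU allocation

struct Kernel {
    bool hasIndirectAccess = false; // set by kernel-code analysis
};

void makeResident(const std::vector<Allocation *> &) { /* bind to GPU */ }

void prepareResidency(const Kernel &kernel,
                      const std::vector<Allocation *> &indirectAllocs) {
    // Skip the (potentially large) indirect-allocation list entirely
    // when the kernel cannot reference memory indirectly.
    if (kernel.hasIndirectAccess)
        makeResident(indirectAllocs);
}
```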
Related-To: NEO-7712
Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>
- apply relevant flags only on platforms supporting these flags
- update command list preemption level when supported
- use the actual kernel preemption level to program interface
descriptor data (a sketch follows this list)
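A minimal sketch of the command list update, assuming a simplified
preemption-level ordering; names are illustrative, not the runtime's
actual types:

```cpp
#include <algorithm>

// Illustrative ordering: lower values are more restrictive.
enum class PreemptionMode { Disabled = 0, MidBatch, ThreadGroup, MidThread };

struct CommandList {
    PreemptionMode preemptionMode = PreemptionMode::MidThread;

    void updateForKernel(PreemptionMode kernelLevel, bool platformSupport) {
        if (!platformSupport)
            return; // apply only where the platform supports the flags
        // The command list level can only become more restrictive;
        // the kernel's own level is what gets programmed into the
        // interface descriptor data.
        preemptionMode = std::min(preemptionMode, kernelLevel);
    }
};
```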
Related-To: NEO-7771
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
A mutex was added to kernel_imp so that printPrintfOutput is an atomic
operation on the kernel.
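A minimal sketch of the guarded printing, assuming a simplified
KernelImp; the real method drains the kernel's printf buffer:

```cpp
#include <mutex>

struct KernelImp {
    std::mutex printfLock;

    void printPrintfOutput() {
        // Serialize concurrent callers so output does not interleave.
        std::lock_guard<std::mutex> guard(printfLock);
        // ... decode and print the printf buffer ...
    }
};
```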
Related-To: LOCI-3681
Signed-off-by: Zhang, Winston <winston.zhang@intel.com>
Do not make indirect allocations resident if the kernel does not use
indirect access.
Enable for both Level Zero and OpenCL.
Related-To: NEO-7712
Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>
Sizing context (PVC):
When using LargeGRF (a.k.a. GRF256) there are only 4 HW threads per EU
(instead of the default 8). Together with SIMD16 this means there can
be at most 64 work-items per EU. With 8 EUs per subslice this gives
512 work-items on a single subslice. For correct intra-WG
synchronization all of a work-group's WIs must execute on the same
subslice (to access the same SLM, where the synchronization primitives
are stored). Thus, with SIMD16 and LargeGRF the work-group size must
not exceed 512 (PVC example).
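A minimal sketch of that arithmetic, with illustrative names: clamp
the device-wide maximum by what the kernel's GRF mode and SIMD width
allow.

```cpp
#include <algorithm>
#include <cstdint>

uint32_t computeMaxWorkGroupSize(uint32_t deviceMaxWorkGroupSize,
                                 bool largeGrf, uint32_t simdSize) {
    const uint32_t threadsPerEu = largeGrf ? 4u : 8u; // PVC figures
    const uint32_t eusPerSubslice = 8u;
    const uint32_t kernelLimit = threadsPerEu * simdSize * eusPerSubslice;
    // LargeGRF + SIMD16 on PVC: min(1024, 4 * 16 * 8) = 512.
    return std::min(deviceMaxWorkGroupSize, kernelLimit);
}
```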
So far `maxWorkGroupSize` has been taken solely from a DeviceInfo
structure, both in `ModuleTranslationUnit::processUnpackedBinary()`
and in `ModuleImp::initialize()`. This does not take kernel parameters
(LargeGRF) into account and allows submitting a kernel that uses
LargeGRF with SIMD16 with the work-group size set to 1024, which leads
to a hang.
Fix the `.maxWorkGroupSize` computation so that it takes the kernel
parameters into consideration.
Add new tests (for discrete platforms >= XeHP), adapt existing ones,
and fix cosmetics along the way.
A similar check for OCL:
https://github.com/intel/compute-runtime/blob/master/opencl/source/command_queue/enqueue_kernel.h#L130
Related-To: NEO-7684
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
Optimize zeKernelSetGroupSize by returning success early if the group
size values have not changed since the last call.
Moved the ImplicitArgs construction above the setGroupSize call in
kernel initialization, to prevent pImplicitArgs from being nullptr in
calls that use the cached group sizes and return early.
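A minimal sketch of the early return, with simplified types; the real
code also patches cross-thread data and the (pre-created) implicit
args:

```cpp
#include <cstdint>

struct KernelImp {
    uint32_t groupSize[3] = {0, 0, 0};

    int setGroupSize(uint32_t x, uint32_t y, uint32_t z) {
        // Early out: nothing to re-patch if the sizes are unchanged.
        if (x == groupSize[0] && y == groupSize[1] && z == groupSize[2])
            return 0; // ZE_RESULT_SUCCESS
        groupSize[0] = x;
        groupSize[1] = y;
        groupSize[2] = z;
        // ... recompute per-thread data, local IDs, implicit args ...
        return 0;
    }
};
```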
Related-To: NEO-7394
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
- printf output from a kernel is printed on the synchronize() call if
a hang is detected; previously the printf buffer was not printed
immediately but only when the Kernel was destroyed
- this change adds copying of the printf buffer with the internal
engine (whenever available) right after hang detection on the
CommandQueue::synchronize() call
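A minimal sketch of the flow, with hypothetical names for the wait and
copy steps:

```cpp
#include <cstdio>

struct CommandQueue {
    bool hangDetected = true; // pretend the wait timed out

    bool waitForCompletion() { return !hangDetected; }
    void copyPrintfBuffer() { /* blit via internal engine if available */ }
    void printPrintfOutput() { std::puts("kernel printf output"); }

    void synchronize() {
        if (!waitForCompletion()) {
            // Hang detected: flush what the kernel managed to print
            // now, instead of waiting for the Kernel to be destroyed.
            copyPrintfBuffer();
            printPrintfOutput();
        }
    }
};

int main() { CommandQueue{}.synchronize(); }
```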
Related-To: NEO-6427
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
Optimize zeKernelSetGroupSize by returning success early if the group
size values have not changed since the last call.
Related-To: NEO-7394
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
Two code paths contained invalid logic when traversing the opaque list
of pNext structures. This has been fixed.
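For illustration, a correct traversal of such a chain; the struct here
is a minimal stand-in for the ze_api.h types, which all begin with an
stype field and a pNext pointer:

```cpp
#include <cstdint>

struct BaseDesc {
    uint32_t stype;     // structure type tag
    const void *pNext;  // next extension struct, or nullptr
};

// Walk the opaque chain looking for a given stype; a typical bug in
// such loops is advancing with the wrong pointer (or not at all).
const BaseDesc *findInChain(const void *pNext, uint32_t wantedStype) {
    while (pNext != nullptr) {
        auto *desc = static_cast<const BaseDesc *>(pNext);
        if (desc->stype == wantedStype)
            return desc;
        pNext = desc->pNext; // advance past the current element
    }
    return nullptr;
}
```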
Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>
Add the missing allocation of kernel private memory for the scenario
where private memory is not allocated within `KernelImp::initialize()`
but deferred until `appendLaunchKernelWithParams()` instead.
A single kernel can never allocate more private/scratch memory than
`globalMemorySize`; attempting to do so results in
`ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` being returned. However,
several separate kernels together can exceed `globalMemorySize`, and
in that case the private region of each such kernel is allocated at a
later stage, in `appendLaunchKernelWithParams()`.
This mechanism was present on pre-XeHP platforms and is now added to
XeHP-and-later.
See:
* ModuleImp::checkIfPrivateMemoryPerDispatchIsNeeded()
* Module::shouldAllocatePrivateMemoryPerDispatch()
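A minimal sketch of the decision, assuming simplified inputs; it
mirrors the idea behind the functions referenced above:

```cpp
#include <cstdint>
#include <vector>

// If all kernels' private memory together would exceed device memory,
// defer each kernel's private allocation to
// appendLaunchKernelWithParams() instead of allocating it up front.
bool shouldAllocatePrivateMemoryPerDispatch(
        const std::vector<uint64_t> &perKernelPrivateSize,
        uint64_t globalMemorySize) {
    uint64_t total = 0;
    for (auto size : perKernelPrivateSize)
        total += size;
    return total > globalMemorySize;
}
```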
Related-To: NEO-7398
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
This fixes several bugs in the previous (reverted) implementation. We
use the correct RTStack pointer offset and a larger RTStack size.
Related-To: LOCI-2966
Signed-off-by: Jim Snow <jim.m.snow@intel.com>