Sizing context (PVC):
When using LargeGRF (a.k.a. GRF256) there are only 4 HW threads per EU
(instead of the default 8). Together with SIMD16 that means there can
be at most 64 work-items per EU. With 8 EUs per subslice this gives 512
work-items on a single subslice. For correct intra-WG synchronization
all WIs of a work-group must be executed on the same subslice (to access the same
SLM, where the synchronization primitives are stored). Thus, with SIMD16
and LargeGRF the work-group size must not exceed 512 (PVC example).
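The arithmetic above can be written out as a small sketch. The helper name
`maxWorkGroupSizeForSubslice` is hypothetical and is not the function used in
the runtime; it only reproduces the numbers quoted for PVC.

```cpp
#include <cstdint>

// Hypothetical helper, not the NEO implementation: upper bound on the number
// of work-items that fit on a single subslice.
constexpr uint32_t maxWorkGroupSizeForSubslice(uint32_t threadsPerEu,
                                               uint32_t simdSize,
                                               uint32_t euPerSubslice) {
    return threadsPerEu * simdSize * euPerSubslice;
}

// PVC figures quoted above: LargeGRF halves the HW threads per EU from 8 to 4.
static_assert(maxWorkGroupSizeForSubslice(4, 16, 8) == 512, "LargeGRF + SIMD16");
static_assert(maxWorkGroupSizeForSubslice(8, 16, 8) == 1024, "default GRF + SIMD16");
```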
So far `maxWorkGroupSize` is taken solely from a DeviceInfo structure
both in `ModuleTranslationUnit::processUnpackedBinary()` and
`ModuleImp::initialize()`. This approach does not take kernel parameters
(LargeGRF) into account. It allows submitting a kernel that uses LargeGRF
with SIMD16 with the work-group size set to 1024, which leads to a hang.
Fix the `.maxWorkGroupSize` computation so that it takes the kernel
parameters into consideration.
Add new tests (for discrete platforms >= XeHP), adapt existing ones, and
fix cosmetic issues along the way.
Similar check for OCL:
https://github.com/intel/compute-runtime/blob/master/opencl/source/command_queue/enqueue_kernel.h#L130
Related-To: NEO-7684
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
Optimize zeKernelSetGroupSize by early returning success if group size
values have not changed since last function call.
Moved ImplicitArgs construction above setGroupSize call
in kernel initialization to prevent pImplicitArgs being nullptr
in calls in which we use cached group sizes and early return.
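A minimal sketch of the two points above, assuming hypothetical stand-in
types (`KernelSketch`, `ImplicitArgsSketch`) rather than the real
`KernelImp` and `ImplicitArgs`: the group size is cached so a repeated call
returns early, and the implicit args are constructed before the first
`setGroupSize()` call so the pointer is never null on the early-return path.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Hypothetical stand-in; the real ImplicitArgs structure is much richer.
struct ImplicitArgsSketch {
    std::array<uint32_t, 3> localSize{};
};

class KernelSketch {
  public:
    void initialize() {
        // Construct implicit args first so they exist even when a later
        // setGroupSize() call returns early from the cache check.
        implicitArgs = std::make_unique<ImplicitArgsSketch>();
        setGroupSize(1, 1, 1);
    }

    // Returns true on success.
    bool setGroupSize(uint32_t x, uint32_t y, uint32_t z) {
        if (x == 0 || y == 0 || z == 0) {
            return false;
        }
        const std::array<uint32_t, 3> requested{x, y, z};
        if (groupSizeSet && requested == groupSize) {
            return true; // unchanged since the last call: skip the recomputation
        }
        groupSize = requested;
        groupSizeSet = true;
        ++recomputeCount; // stands in for the expensive per-call work
        if (implicitArgs) {
            implicitArgs->localSize = requested;
        }
        return true;
    }

    unsigned recomputations() const { return recomputeCount; }

  private:
    std::unique_ptr<ImplicitArgsSketch> implicitArgs;
    std::array<uint32_t, 3> groupSize{};
    bool groupSizeSet = false;
    unsigned recomputeCount = 0;
};
```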
Related-To: NEO-7394
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
- printf output from a kernel is printed on the synchronize() call; if a
hang was detected, the printf buffer was not printed immediately but
only when the Kernel was destroyed
- this change copies the printf buffer with the internal engine
(whenever available) right after hang detection in the
CommandQueue::synchronize() call
Related-To: NEO-6427
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
Two code paths contained invalid logic for traversing the opaque
list of pNext pointers. This has been fixed.
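The message does not say what exactly was wrong, so the following is only a
sketch of what a correct traversal of such a chain usually looks like.
`BaseDescSketch` is a hypothetical mirror of the Level Zero base descriptor
layout (a structure type tag followed by `pNext`), and `findExtension` is an
illustrative name, not a runtime function.

```cpp
#include <cstdint>

// Hypothetical mirror of a Level Zero base descriptor: type tag, then pNext.
struct BaseDescSketch {
    uint32_t stype;
    const void *pNext;
};

// Walks the whole chain and returns the first extension with the requested
// type, or nullptr when none is present.
inline const BaseDescSketch *findExtension(const void *pNext, uint32_t wantedType) {
    auto *current = static_cast<const BaseDescSketch *>(pNext);
    while (current != nullptr) {
        if (current->stype == wantedType) {
            return current;
        }
        // Advance via the current node's pNext, never via the head of the list.
        current = static_cast<const BaseDescSketch *>(current->pNext);
    }
    return nullptr;
}
```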
Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>
Add missing allocation of kernel private memory for the scenario when
the private memory is not allocated within `KernelImp::initialize()` but
deferred until `appendLaunchKernelWithParams()` instead.
A single kernel can never allocate more private/scratch memory than
`globalMemorySize`; such an attempt ends with
`ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` being returned. However, several
separate kernels together can exceed `globalMemorySize`, and then the
private region of each such kernel is allocated at a later stage, in
`appendLaunchKernelWithParams()`.
Such a mechanism was present on pre-xehp platforms and it is now added to
xehp-and-later.
See:
* ModuleImp::checkIfPrivateMemoryPerDispatchIsNeeded()
* Module::shouldAllocatePrivateMemoryPerDispatch()
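A minimal sketch of the decision described above, under the assumption that
the check compares the summed private memory of a module's kernels against
`globalMemorySize`. The function name and signature are hypothetical and do
not reproduce `checkIfPrivateMemoryPerDispatchIsNeeded()`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns true when the kernels' private memory cannot
// all be allocated up front and must instead be allocated per dispatch.
inline bool needsPrivateMemoryPerDispatch(const std::vector<uint64_t> &privateSizePerKernel,
                                          uint64_t globalMemorySize) {
    uint64_t total = 0;
    for (uint64_t size : privateSizePerKernel) {
        total += size; // overflow handling omitted for brevity
    }
    return total > globalMemorySize;
}
```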
Related-To: NEO-7398
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
This fixes several bugs in the previous (reverted) implementation.
We now use the correct RTStack pointer offset and a larger RTStack size.
Related-To: LOCI-2966
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Add support for inline samplers in zebin.
Generate required SAMPLER_STATEs in DSH.
Resolves: NEO-7388
Signed-off-by: Krystian Chmielewski <krystian.chmielewski@intel.com>
Instead of just returning the proper error code when the available
Shared Local Memory size is exceeded, also print an error message
to make debugging easier.
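The idea can be sketched as follows. `ResultSketch`, `checkSlmSize` and the
message wording are all hypothetical; the actual error code and log text used
by the runtime are not stated here.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical result type standing in for the runtime's error codes.
enum class ResultSketch { success, slmSizeExceeded };

// Returns an error and also logs the offending sizes, so the failure can be
// diagnosed without a debugger.
inline ResultSketch checkSlmSize(uint32_t requestedBytes, uint32_t availableBytes) {
    if (requestedBytes > availableBytes) {
        std::fprintf(stderr,
                     "Requested SLM size (%u bytes) exceeds available size (%u bytes)\n",
                     static_cast<unsigned>(requestedBytes),
                     static_cast<unsigned>(availableBytes));
        return ResultSketch::slmSizeExceeded;
    }
    return ResultSketch::success;
}
```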
Related-To: NEO-7280
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
Previously we used an array-of-pointers approach, but using an
array-of-structures is in some ways simpler.
We also split out the RTStack as a separate allocation.
Related-To: LOCI-2966
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
With compiler LSC WAs this gives better performance.
If the debugger is active, the policy will not be changed, i.e.
it will be WBP.
Related-To: NEO-7003
Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>
This change:
- prevents writing memory out of the range of the destination buffer
- prevents calling strlen() on a non-null-terminated C string
- corrects the logic that validates the passed range, so that processing
proceeds when the real length fits the destination buffer
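The three points above can be illustrated with a bounded copy. `copyBounded`
is a hypothetical helper, not the function touched by this change; it is a
sketch of the general technique only.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `src` plus a terminator into `dst` and returns
// true only when the real length fits the destination buffer.
inline bool copyBounded(char *dst, std::size_t dstSize,
                        const char *src, std::size_t srcMaxSize) {
    if (dst == nullptr || src == nullptr || dstSize == 0) {
        return false;
    }
    // memchr never reads past srcMaxSize, unlike strlen() on an unterminated buffer.
    const void *terminator = std::memchr(src, '\0', srcMaxSize);
    const std::size_t length =
        terminator ? static_cast<std::size_t>(static_cast<const char *>(terminator) - src)
                   : srcMaxSize;
    if (length >= dstSize) {
        return false; // no room for the text plus the terminating '\0'
    }
    std::memcpy(dst, src, length);
    dst[length] = '\0';
    return true;
}
```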
Related-To: NEO-7264
Signed-off-by: Wrobel, Patryk <patryk.wrobel@intel.com>