So far, there is a separate page allocated for each kernel's ISA within
`KernelImmutableData::initialize()`. Apparently the ISA blocks are often
much smaller than a 64k page, which leads to poor memory utilization and
was even observed to cause the device OOM error if a single module has
several keys.
Improve the situation by reusing the parent allocation (owned by the
module instance) for modules, which kernel ISAs can fit together within
a single 64k page. This improves the memory utilization on a single
module level.
Related-To: NEO-7788
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
So far, there is a separate page allocated for each kernel's ISA within
`KernelImmutableData::initialize()`. Apparently the ISA blocks are often
much smaller than a 64k page, which leads to poor memory utilization and
was even observed to cause the device OOM error if a single module has
several keys.
Improve the situation by reusing the parent allocation (owned by the
module instance) for modules, which kernel ISAs can fit together within
a single 64k page. This improves the memory utilization on a single
module level.
Related-To: NEO-7788
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
- program surface states for redescribed images correctly. Image copy
to/from memory are using redescribed surface states,
- refactor state base address programming - program address and size
together, set max size at the beginning due to lack of Enable flag
- set GpuBase in WddmAllocation when external heap is used
- return max ssh required size from kernelInfo or based on stateful args
Related-To: NEO-7063
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
- also use helper when checking that is simd1 to have same flow
Related-To: NEO-8087
Signed-off-by: Cencelewska, Katarzyna <katarzyna.cencelewska@intel.com>
- program global bindless ssba when external allocator used (
UseExternalAllocatorForSshAndDsh)
Related-To: NEO-7063
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
Due to this flag was not properly handled on Windows, command buffer
allocations were never reused in immediate command lists in case of
host secondary buffers. This lead to huge host memory consumption
and performance degradation
Related-To: NEO-8072
Signed-off-by: Igor Venevtsev <igor.venevtsev@intel.com>
- use calculateNumThreadsPerThreadGroup instead of getThreadsPerWG to
have same flow and proper values of threads per work groups
Related-To: NEO-8087
Signed-off-by: Cencelewska, Katarzyna <katarzyna.cencelewska@intel.com>
- use calculateNumThreadsPerThreadGroup instead of getThreadsPerWG to
have same flow and proper values of threads per work groups
Related-To: NEO-8087
Signed-off-by: Cencelewska, Katarzyna <katarzyna.cencelewska@intel.com>
denorm support is controlled by IGC, we should just set zero by default
Related-To: NEO-8059
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
- allow bindless kernels to execute
- bindless addressing kernels are using private heaps mode
- do not differentiate bindful and bindless surface state base addresses
Related-To: NEO-7063
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
denorm support is controlled by IGC, we should just set zero by default
Related-To: NEO-8059
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
Related-To: NEO-6075
Ngen binaries contain stateful information, however they are
not used in isa on Pvc. Therefore, we can just ignore them.
- getGlobalBindlessHeapConfiguration() should be used to choose global
alloctor for SSH
- remove not needed and incorrect unit tests
- remove not needed branches
- bindless mode controls bindless compilation only
Related-To: NEO-7063
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
- when debugging is enabled, assert() in gpu kernel will trigger SW
exception
Related-To: NEO-5753
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
Related-To: LOCI-4332
- Signal non-timestamp Walkers with in-order CL value
- Event host synchronization based on CL signal value
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
adjust thread group dispatch size on pvc if chosen size does not evenly
divide dimension
this is to avoid leftover thread groups
Related-To: NEO-7927
Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>
This change is part of performance feature to start command list batch buffers
as primary.
Implicit Scaling sometimes require to jump over control section and these jumps
must maintain the same level of batch buffer as the whole command list.
Related-To: NEO-7807
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
This change adds space reservation in command list for returning batch buffer
start hw command.
Primary batch buffer can be run from direct submission or from KMD call and
must be aligned to required size.
Ending patch for batch buffer start must be in the last command buffer of the
command list.
Related-To: NEO-7807
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
This feature is part of performance improvement to dispatch and start
command buffers as primary batch buffers.
When exhausted command buffer is closed, then reserve exact space for chained
batch buffer start and bind it to the next command buffer.
When closing command buffer, then save ending pointer and
reserve aligned space.
Related-To: NEO-7807
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>