Commit Graph

69 Commits

Author SHA1 Message Date
Maciej Bielski
147bd894ec refactor: use PRINT_STRING macro for most diagnostics
Related-To: NEO-14742
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
2025-11-28 13:28:29 +01:00
Lukasz Jobczyk
6515e422e9 refactor: move eviction container to residency controller
Related-To: NEO-13315

Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2025-10-13 08:41:34 +02:00
Lukasz Jobczyk
a390327ef9 refactor: remove unused code
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2025-09-10 15:38:48 +02:00
Jack Myers
f2b5126598 feature: enable tbx fault manager by default
Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-06-23 09:59:32 +02:00
Jack Myers
5f78147e16 fix: hotfix for svmcpu tbx uploads
Test program in the linked, related issue
is crashing in tbx mode. Tbx server indicated
upload of invalid memory was made before exit.

Running with debug messages showed that the
problematic upload was an svmcpu buffer when
running neo with separate cpu and gpu
buffers for shared memory management.

Using this info, the problem was narrowed down
to a missing unprotect call in page fault manager
related code, resulting in a protected(invalid)
memory region getting uploaded to tbx.

It is unclear yet why this unprotect call was not made,
since other svmcpu buffers were uploaded without issue.

This hotfix forces the unprotect call in the fault handler,
which allows the test program to run to completion. However,
there is now a failing test case.

Considering the critical nature of the associated
NEO issue and that this patch should unblock
the work depending on the fix, this hotfix should
get merged regardless of the failing test case.

In the meantime, I will continue triaging the
failing test and will implement a proper fix
once the root cause is isolated.

Related-To: NEO-13404
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-03-14 04:47:21 +01:00
Jack Myers
c26d24e555 fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-18 05:05:38 +01:00
Compute-Runtime-Validation
116f7270be Revert "fix: tbx page fault manager hang issue"
This reverts commit 7d4e70a25b.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2025-02-12 10:38:05 +01:00
Jack Myers
7d4e70a25b fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-12 02:19:37 +01:00
Jack Myers
d62122a656 fix: exceptions to TBX faultable types
This commit addresses a bug in the previous implementation where almost all once
writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX
faultable checks. The fix ensures that only the subset of once writable
types that are also lockable are considered TBX faultable, using the lockable
check to avoid manual exceptions and re-inventing the wheel.

Changes:

- Updated `isAllocTbxFaultable` method to check if the allocation type is
lockable in addition to being once writable.
- Refactored unit tests to include separate checks for lockable and non-lockable
allocation types.

Performance optimization:

- Removed unnecessary memory data erasure in `handlePageFault` to avoid constant
erase/insert operations, leveraging the O(1) search time of unordered maps.

Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-17 00:52:49 +01:00
Jack Myers
7f9fadc314 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

Recursive locking of mutex caused deadlock.

To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.

By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.

As an added safe guard preventing critical regressions
and avoiding another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-09 07:48:53 +01:00
Compute-Runtime-Validation
124e755b9d Revert "fix: regression caused by tbx fault mngr"
This reverts commit 9a14fe2478.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-19 17:35:03 +01:00
Jack Myers
9a14fe2478 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

This fixes attempts by the tbx fault manager
to protect/unprotect host buffer memory, even
if the host ptr was not driver-allocated.

In the case of the smoke test that triggered
the critical regression, clCreateBuffer was
called with the CL_MEM_USE_HOST_PTR flag.
The subsequent `mprotect` calls on the
provided host ptr then failed.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation
6c5d9a6ed7 Revert "feature: extend TBX page fault manager from CPU implementation"
This reverts commit 51c0e80299.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-12 12:30:22 +01:00
Jack Myers
51c0e80299 feature: extend TBX page fault manager from CPU implementation
In TBX mode, the host could not write to host buffers after access from device
code due to the lack of a migration mechanism post-initial TBX upload.
Migration is unnecessary with real hardware, but required for TBX.

This patch introduces a new page fault manager type that extends the original
CPU fault manager, enabling automatic migration of host buffers in TBX mode.

Refactoring was necessary to avoid diamond inheritance, achieved by using a
template parameter as the base class for OS-specific fault managers.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-11 09:09:50 +01:00
Young Jin Yoon
b1f73355ac feature: capture multiple cpu pagefault handler
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.

Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.

Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
2024-11-20 02:40:30 +01:00
Szymon Morek
b2fd1972a4 fix: add cpu alloc to eviction list only once
Related-To: NEO-12572

Also, before migration to GPU domain, remove it from this list

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-10-01 11:47:32 +02:00
Compute-Runtime-Validation
f03550487f Revert "fix: add cpu alloc to eviction list only once"
This reverts commit dfc863c7c1.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-09-28 11:50:36 +02:00
Compute-Runtime-Validation
4f96b6132f Revert "feature: capture multiple cpu pagefault handler"
This reverts commit 4b3a6e9cfe.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-09-28 07:53:30 +02:00
Young Jin Yoon
4b3a6e9cfe feature: capture multiple cpu pagefault handler
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.

Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.

Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
2024-09-27 00:34:45 +02:00
Szymon Morek
dfc863c7c1 fix: add cpu alloc to eviction list only once
Related-To: NEO-12572

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-09-26 19:49:59 +02:00
Compute-Runtime-Validation
5569ebbe1f Revert "feature: capture multiple cpu pagefault handler"
This reverts commit 44f2912195.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-09-03 05:18:26 +02:00
Young Jin Yoon
44f2912195 feature: capture multiple cpu pagefault handler
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.

Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.

Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
2024-09-02 16:46:35 +02:00
Szymon Morek
b8f181d50e performance: remove trim candidate list
Related-To: NEO-11755

Removing trim candidate list reduces overhead
caused by residency handling. Allocations required
for eviction are placed in eviction container managed
by CSR.

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-08-23 12:21:50 +02:00
Kozlowski, Marek
bd8fc07bb7 fix: Replace printf with current logging practice
* add missing stdout flush

Signed-off-by: Kozlowski, Marek <marek.kozlowski@intel.com>
2024-07-15 14:22:04 +02:00
Lukasz Jobczyk
8217b76cef refactor: Add key to not register pagefault handler on migration
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2024-05-28 08:45:34 +02:00
Mateusz Jablonski
cb2b572e94 feature: add support for null aub mode
In this mode AUB csr will be created, however, no aub file will be created

Related-To: NEO-11097
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2024-04-09 16:59:42 +02:00
Mateusz Jablonski
dd1b9d6abc refactor: correct naming of enum class constants 8/n
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-12-19 08:18:18 +01:00
Mateusz Jablonski
c9664e6bad refactor: rename global debug manager to debugManager
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-11-30 13:00:59 +01:00
Mateusz Jablonski
80e59ff344 fix: don't call virtual method in ctor
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-10-30 12:26:57 +01:00
Jablonski, Mateusz
ac5f64f5c6 fix: fix compilation error in clang on Windows (2/n)
Signed-off-by: Jablonski, Mateusz <mateusz.jablonski@intel.com>
2023-10-24 15:59:06 +02:00
Young Jin Yoon
91deddb69b feature: register handler when we migrate to GPU
Created registerFaultHandler() and checkFaultHandlerFromPageFaultManager()
and removed registering sigaction() from the contructor of the
PageFaultManagerLinux class.

Added if statment to check the current pagefault handler is from the
pagefault manager. If not, register the pagefault handler of the current
pagefault manager on linux.

Refactored windows exception vector adding logic to
registerFaultHandler() and call upon the constructor of the
PageFaultManagerWindows, and make
checkFaultHandlerFromPageFaultManager() always return true for windows.

Related-To: NEO-8190
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
2023-08-14 11:14:03 +02:00
Milczarek, Slawomir
027c51d396 feature: Add CPU side USM allocation to trim candidate list on page fault
Enable eviction of CPU side USM allocation for UMD migrations on Windows.
Reverts incorrect auto-revert commit 218de586a4f28b1de3e983b9006e7a99d3a4d10e.

Related-To: NEO-8015

Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2023-07-25 15:21:12 +02:00
Compute-Runtime-Validation
918b41d26d Revert "feature: Add CPU side USM allocation to trim candidate list on page f...
This reverts commit 60a4448a07.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2023-07-24 08:44:22 +02:00
Milczarek, Slawomir
60a4448a07 feature: Add CPU side USM allocation to trim candidate list on page fage fault
Enable eviction of CPU side USM allocation for UMD migrations on Windows.

Related-To: NEO-8015
Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2023-07-23 10:24:28 +02:00
Compute-Runtime-Validation
4a562e352b Revert "feature: Add CPU side USM allocation to trim candidate list on page f...
This reverts commit cce2cc920d.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2023-07-21 16:40:59 +02:00
Milczarek, Slawomir
cce2cc920d feature: Add CPU side USM allocation to trim candidate list on page fault
Enable eviction of CPU side USM allocation for UMD migrations on Windows.

Related-To: NEO-8015

Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2023-07-21 14:18:38 +02:00
Milczarek, Slawomir
a6a0b95344 fix: Cpu page fault manager with control of host ptr eviction
Related-To: NEO-8015

Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2023-07-12 14:13:17 +02:00
Michal Mrozek
593b3cf4fd Revert "[performance] do not perform migrations if not needed."
Signed-off-by: Michal Mrozek <michal.mrozek@intel.com>
2023-03-14 19:08:19 +01:00
Michal Mrozek
15f08a92c0 [performance] do not perform migrations if not needed.
Skip migrations if nothing is migrated to the CPU side.

Related-To: NEO-5170
Signed-off-by: Michal Mrozek <michal.mrozek@intel.com>
2023-02-17 18:38:52 +01:00
Warchulski, Jaroslaw
5eef40fedd Cleanup includes 22
Cleaned up files:
opencl/source/built_ins/builtins_dispatch_builder.h
opencl/source/context/context.h
opencl/source/gtpin/gtpin_notify.h
opencl/source/kernel/kernel.h
opencl/source/kernel/multi_device_kernel.h
opencl/source/mem_obj/buffer.h
opencl/source/mem_obj/mem_obj.h
shared/source/built_ins/registry/built_ins_registry.h
shared/source/page_fault_manager/cpu_page_fault_manager.h

Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-01-05 16:59:01 +01:00
Mateusz Jablonski
43b790957d style: format code using clang-format 15.0.6
Related-To: NEO-7500
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-01-05 10:33:47 +01:00
Jaime Arteaga
1e9e877394 Style: Add 0x prefix to PrintUmdSharedMigration logs
This to align with format used on another tools, like onetrace.

Signed-off-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
2023-01-03 03:37:10 +01:00
Warchulski, Jaroslaw
f275eea6ec Cleanup includes 14
Cleaned up files:
shared/source/device/device.h

Related-To: NEO-5548

Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2022-12-23 10:46:34 +01:00
Lukasz Jobczyk
8927399cce Set proper gpu domain transfer handler for CAL
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2022-11-17 11:53:02 +01:00
Jaime Arteaga
db58e50564 Improve PrintUmdSharedMigration
Add size and timing data.

Signed-off-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
2022-07-18 19:47:13 +02:00
Daniel Chabrowski
7463e1970b Cleanup headers
Make TUs and headers self-contained, remove unused headers

Signed-off-by: Daniel Chabrowski <daniel.chabrowski@intel.com>
2022-05-18 11:42:06 +02:00
Milczarek, Slawomir
7cd4ca5ce7 Fixed AUB capture in HW mode for umd-migrated shared allocations
Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2022-03-30 12:04:58 +02:00
Maciej Bielski
a71b88fefb PageFaultHandler: speedup UM allocations lookup
Add a per-instance SVMAllocsManager::nonGpuDomainAllocs container for
all allocations to be removed in
moveAllocationsWithinUMAllocsManagerToGpuDomain. This approach replaces
the current iterative search and performs the task faster.

Add 7 new unit-tests to verify the functionality related to
nonGpuDomainAllocs container, both in expected and unexpected/synthetic
scenarios.

For UTs replace a dummy unifiedMemoryManager pointer with a pointer to
an instace of SVMAllocsManager, otherwise a SegFault error is thrown at
the end of tests.

Perform overall cleanup in related tests implementation, includes but
not limited to removal of:

- givenInitialPlacementGpu\
WhenMovingToGpuDomainThenFirstAccessDoesNotInvokeTransfer

As it is fully covered by:

givenAllocationMovedToGpuDomain\
WhenVerifyingPagefaultThenAllocationIsMovedToCpuDomain

- givenInitialPlacementGpu\
WhenVerifyingPagefaultThenFirstAccessDoesNotInvokeTransfer

As it is fully covered by:

givenTbxAndnitialPlacementGpu\
WhenVerifyingPagefaultThenMemoryIsUnprotectedOnly

Finally, reduce code duplication where possible.

Related-To: NEO-6658
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
2022-03-16 11:18:24 +01:00
Milczarek, Slawomir
2be98a1e62 Create kmd migrated allocation with initial placement
Implements ZE_HOST_MEM_ALLOC_FLAG_BIAS_INITIAL_PLACEMENT
for zeMemAllocShared with KMD migrated allocation.

Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
2022-02-01 15:42:10 +01:00
Michal Mrozek
a12f9cb377 [2/n] Optimize indirect calls.
Migrate shared allocation when command list sets indirect flags.

Signed-off-by: Michal Mrozek <michal.mrozek@intel.com>
2022-01-25 13:10:53 +01:00