Test program in the linked, related issue
is crashing in tbx mode. Tbx server indicated
upload of invalid memory was made before exit.
Running with debug messages showed that the
problematic upload was an svmcpu buffer when
running neo with separate cpu and gpu
buffers for shared memory management.
Using this info, the problem was narrowed down
to a missing unprotect call in page fault manager
related code, resulting in a protected(invalid)
memory region getting uploaded to tbx.
It is unclear yet why this unprotect call was not made,
since other svmcpu buffers were uploaded without issue.
This hotfix forces the unprotect call in the fault handler,
which allows the test program to run to completion. However,
there is now a failing test case.
Considering the critical nature of the associated
NEO issue and that this patch should unblock
the work depending on the fix, this hotfix should
get merged regardless of the failing test case.
In the meantime, I will continue triaging the
failing test and will implement a proper fix
once the root cause is isolated.
Related-To: NEO-13404
Signed-off-by: Jack Myers <jack.myers@intel.com>
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.
Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.
Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
This commit addresses a bug in the previous implementation where almost all once
writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX
faultable checks. The fix ensures that only the subset of once writable
types that are also lockable are considered TBX faultable, using the lockable
check to avoid manual exceptions and re-inventing the wheel.
Changes:
- Updated `isAllocTbxFaultable` method to check if the allocation type is
lockable in addition to being once writable.
- Refactored unit tests to include separate checks for lockable and non-lockable
allocation types.
Performance optimization:
- Removed unnecessary memory data erasure in `handlePageFault` to avoid constant
erase/insert operations, leveraging the O(1) search time of unordered maps.
Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.
Recursive locking of mutex caused deadlock.
To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.
By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.
As an added safe guard preventing critical regressions
and avoiding another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.
Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.
This fixes attempts by the tbx fault manager
to protect/unprotect host buffer memory, even
if the host ptr was not driver-allocated.
In the case of the smoke test that triggered
the critical regression, clCreateBuffer was
called with the CL_MEM_USE_HOST_PTR flag.
The subsequent `mprotect` calls on the
provided host ptr then failed.
Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
In TBX mode, the host could not write to host buffers after access from device
code due to the lack of a migration mechanism post-initial TBX upload.
Migration is unnecessary with real hardware, but required for TBX.
This patch introduces a new page fault manager type that extends the original
CPU fault manager, enabling automatic migration of host buffers in TBX mode.
Refactoring was necessary to avoid diamond inheritance, achieved by using a
template parameter as the base class for OS-specific fault managers.
Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.
Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.
Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.
Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.
Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
Recorded multiple page fault handlers by using vector in
cpu_page_fault_manager_linux.
Added a static handlerIndex in order to track the depth of
handler logic to call appropriate previous handlers.
Related-To: NEO-11563
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
Related-To: NEO-11755
Removing trim candidate list reduces overhead
caused by residency handling. Allocations required
for eviction are placed in eviction container managed
by CSR.
Signed-off-by: Szymon Morek <szymon.morek@intel.com>
In this mode AUB csr will be created, however, no aub file will be created
Related-To: NEO-11097
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
Created registerFaultHandler() and checkFaultHandlerFromPageFaultManager()
and removed registering sigaction() from the contructor of the
PageFaultManagerLinux class.
Added if statment to check the current pagefault handler is from the
pagefault manager. If not, register the pagefault handler of the current
pagefault manager on linux.
Refactored windows exception vector adding logic to
registerFaultHandler() and call upon the constructor of the
PageFaultManagerWindows, and make
checkFaultHandlerFromPageFaultManager() always return true for windows.
Related-To: NEO-8190
Signed-off-by: Young Jin Yoon <young.jin.yoon@intel.com>
Enable eviction of CPU side USM allocation for UMD migrations on Windows.
Reverts incorrect auto-revert commit 218de586a4f28b1de3e983b9006e7a99d3a4d10e.
Related-To: NEO-8015
Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
Enable eviction of CPU side USM allocation for UMD migrations on Windows.
Related-To: NEO-8015
Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
Enable eviction of CPU side USM allocation for UMD migrations on Windows.
Related-To: NEO-8015
Signed-off-by: Milczarek, Slawomir <slawomir.milczarek@intel.com>
Add a per-instance SVMAllocsManager::nonGpuDomainAllocs container for
all allocations to be removed in
moveAllocationsWithinUMAllocsManagerToGpuDomain. This approach replaces
the current iterative search and performs the task faster.
Add 7 new unit-tests to verify the functionality related to
nonGpuDomainAllocs container, both in expected and unexpected/synthetic
scenarios.
For UTs replace a dummy unifiedMemoryManager pointer with a pointer to
an instace of SVMAllocsManager, otherwise a SegFault error is thrown at
the end of tests.
Perform overall cleanup in related tests implementation, includes but
not limited to removal of:
- givenInitialPlacementGpu\
WhenMovingToGpuDomainThenFirstAccessDoesNotInvokeTransfer
As it is fully covered by:
givenAllocationMovedToGpuDomain\
WhenVerifyingPagefaultThenAllocationIsMovedToCpuDomain
- givenInitialPlacementGpu\
WhenVerifyingPagefaultThenFirstAccessDoesNotInvokeTransfer
As it is fully covered by:
givenTbxAndnitialPlacementGpu\
WhenVerifyingPagefaultThenMemoryIsUnprotectedOnly
Finally, reduce code duplication where possible.
Related-To: NEO-6658
Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
- Added support for disabling CPU migration of USM memory given
ZE_MEMORY_ADVICE_SET_READ_MOSTLY && ZE_MEMORY_ADVICE_SET_PREFERRED_LOCATION
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>