A change related to the tbx fault manager
incorrectly removed a switch case from
`AubHelper::isOneTimeAubWritableAllocationType`.
This patch restores the case and refactors some APIs,
simplifying the logic so that similar mistakes
are harder to make.
Addresses a showstopper for pre-si PyTorch workflows.
Resolves: NEO-14399
Signed-off-by: Jack Myers <jack.myers@intel.com>
The test program in the linked, related issue
crashes in tbx mode. The tbx server reported that
invalid memory was uploaded before exit.
Running with debug messages showed that the
problematic upload was an svmcpu buffer when
running neo with separate cpu and gpu
buffers for shared memory management.
Using this info, the problem was narrowed down
to a missing unprotect call in page fault manager
related code, resulting in a protected (invalid)
memory region getting uploaded to tbx.
It is not yet clear why this unprotect call was not made,
since other svmcpu buffers were uploaded without issue.
This hotfix forces the unprotect call in the fault handler,
which allows the test program to run to completion. However,
there is now a failing test case.
Considering the critical nature of the associated
NEO issue and that this patch should unblock
the work depending on the fix, this hotfix should
get merged regardless of the failing test case.
In the meantime, I will continue triaging the
failing test and will implement a proper fix
once the root cause is isolated.
Related-To: NEO-13404
Signed-off-by: Jack Myers <jack.myers@intel.com>
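The failure mode described above can be reproduced in isolation. The following is a minimal sketch, assuming a Linux/POSIX host; `unprotectAndUpload` and `demoUnprotectBeforeUpload` are hypothetical names that are not part of NEO, and the "upload" is just a copy into a vector. It only illustrates why a region left at `PROT_NONE` must be unprotected before its contents are read for upload.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: makes the region accessible, then copies it out the
// way an upload would. Returns false if the protection cannot be changed.
bool unprotectAndUpload(void *ptr, std::size_t size, std::vector<unsigned char> &uploaded) {
    // Without this call, the memcpy below would fault on a PROT_NONE region.
    if (mprotect(ptr, size, PROT_READ | PROT_WRITE) != 0) {
        return false;
    }
    uploaded.resize(size);
    std::memcpy(uploaded.data(), ptr, size);
    return true;
}

// Allocates one page, fills it, protects it, and then uploads it through the
// helper above. Returns true if the uploaded bytes match what was written.
bool demoUnprotectBeforeUpload() {
    const std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void *page = mmap(nullptr, pageSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return false;
    }
    std::memset(page, 0x5A, pageSize);

    // Simulate the state the fault manager leaves the buffer in.
    if (mprotect(page, pageSize, PROT_NONE) != 0) {
        munmap(page, pageSize);
        return false;
    }

    std::vector<unsigned char> uploaded;
    bool ok = unprotectAndUpload(page, pageSize, uploaded);
    ok = ok && uploaded.size() == pageSize &&
         uploaded.front() == 0x5A && uploaded.back() == 0x5A;

    munmap(page, pageSize);
    return ok;
}
```

Forcing the unprotect at the point of use, as the hotfix does, is the conservative choice: changing a page that is already accessible to read/write is harmless, while skipping the call on a protected page is fatal.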
- command list append state is managed from internal queue and can be skipped
- initial state configuration should be processed by both kernel and non-kernel
- only kernel operations can process required state, as non-kernel operations cannot change it
Related-To: NEO-10356
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
Related-To: NEO-13870
Currently, all monitor fences trigger an
interrupt because of the Notify Enable field.
With this change, the field is programmed
right before a KMD wait.
Signed-off-by: Szymon Morek <szymon.morek@intel.com>
- add new enum type for command list flush from immediate
- add new argument for flushing immediate command list - regular command list
- add capability to provide additional stream for epilogue commands
- add pointer to provide external csr mutex to lock both execution and flush
Related-To: NEO-10356
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
Do not check if ULLS light is active during every Csr::makeResident
call. Store that information once during ULLS init.
Related-To: NEO-13922
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
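The caching idea above can be sketched as follows. This is an illustration only: `CsrSketch`, `queryUllsLightActive`, and the counters are hypothetical and do not mirror the real command stream receiver; the query is counted only so that the effect of caching is observable.

```cpp
#include <cstddef>

class CsrSketch {
  public:
    // Hypothetical stand-in for the per-call check that the change removes.
    bool queryUllsLightActive() const {
        ++queryCount;
        return lightMode;
    }

    // The state is queried once here and stored for later use.
    void initDirectSubmission(bool enableLightMode) {
        lightMode = enableLightMode;
        ullsLightActive = queryUllsLightActive();
    }

    // Hot path: reads the cached flag instead of querying again.
    void makeResident() {
        if (ullsLightActive) {
            ++lightResidencyUpdates;
        }
    }

    mutable std::size_t queryCount = 0;
    std::size_t lightResidencyUpdates = 0;

  private:
    bool lightMode = false;
    bool ullsLightActive = false;
};
```

The sketch assumes the ULLS light state does not change after init; if it could, the cached flag would need to be refreshed at that point.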
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.
Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
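For readers unfamiliar with recursive locking, here is a minimal sketch of the concept. `RecursiveSpinLockSketch` is a hypothetical class written for illustration; it is not the `RecursiveSpinLock` used in NEO and makes no claim about how that class is implemented.

```cpp
#include <atomic>
#include <thread>

class RecursiveSpinLockSketch {
  public:
    void lock() {
        const std::thread::id self = std::this_thread::get_id();
        // Re-entry by the owning thread only bumps the depth counter.
        if (owner.load(std::memory_order_acquire) == self) {
            ++depth;
            return;
        }
        // Otherwise spin until the lock has no owner, then claim it.
        std::thread::id expected{};
        while (!owner.compare_exchange_weak(expected, self,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            expected = std::thread::id{};
            std::this_thread::yield();
        }
        depth = 1;
    }

    void unlock() {
        // Only the outermost unlock releases ownership.
        if (--depth == 0) {
            owner.store(std::thread::id{}, std::memory_order_release);
        }
    }

    // Only meaningful when called by the owning thread.
    unsigned recursionDepth() const { return depth; }

  private:
    std::atomic<std::thread::id> owner{std::thread::id{}};
    unsigned depth = 0;
};
```

With a plain spin lock, the second `lock()` from the same thread would spin forever; here it succeeds, and the lock is released once `unlock()` has been called the same number of times.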
This commit addresses a bug in the previous implementation where almost all once
writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX
faultable checks. The fix ensures that only the subset of once writable
types that are also lockable are considered TBX faultable, using the lockable
check to avoid manual exceptions and re-inventing the wheel.
Changes:
- Updated `isAllocTbxFaultable` method to check if the allocation type is
lockable in addition to being once writable.
- Refactored unit tests to include separate checks for lockable and non-lockable
allocation types.
Performance optimization:
- Removed unnecessary memory data erasure in `handlePageFault` to avoid constant
erase/insert operations, leveraging the O(1) search time of unordered maps.
Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
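The combined check described above can be sketched as follows. The enum is trimmed down, and which types count as once writable or lockable here is an assumption made for illustration; `isOnceWritable`, `isLockable`, and `isTbxFaultable` are hypothetical stand-ins, not the real NEO helpers.

```cpp
#include <cstdint>

// Hypothetical, trimmed-down set of allocation types.
enum class AllocationType : std::uint8_t {
    buffer,
    constantSurface,
    gpuTimestampDeviceBuffer,
    commandBuffer,
};

// Assumed stand-in for the once-writable check (membership is illustrative).
constexpr bool isOnceWritable(AllocationType type) {
    switch (type) {
    case AllocationType::buffer:
    case AllocationType::constantSurface:
    case AllocationType::gpuTimestampDeviceBuffer:
        return true;
    default:
        return false;
    }
}

// Assumed stand-in for the lockable check (membership is illustrative).
constexpr bool isLockable(AllocationType type) {
    switch (type) {
    case AllocationType::buffer:
    case AllocationType::constantSurface:
    case AllocationType::commandBuffer:
        return true;
    default:
        return false;
    }
}

// Faultable means once writable AND lockable, so no per-type exception
// list has to be maintained by hand.
constexpr bool isTbxFaultable(AllocationType type) {
    return isOnceWritable(type) && isLockable(type);
}
```

In this sketch the timestamp buffer drops out because it is not lockable, rather than because it is named in a manual exclusion.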
Patch #34223 introduced the TbxPageFaultManager for handling
uploads/downloads of host buffers to the Tbx server, ensuring
host memory is kept consistent between the host and device,
even after multiple alternating writes from the host and gpu.
This patch enables fault handling for all `isAubOnceWritable`
types.
Minor exception for gpuTimestampBuffers as enabling this type
seems to break things in real-world use cases outside of ULTs.
Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.
Recursive locking of mutex caused deadlock.
To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.
By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.
As an added safeguard preventing critical regressions
and avoiding another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.
Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
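The separation described above can be sketched as follows. All class, member, and struct names here are hypothetical and heavily simplified; the point is only that the tbx path and the cpu path each take their own mutex, and that the tbx mutex is released before falling back to the base handler, so no mutex is ever locked twice by the same thread.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

struct CpuFaultData {
    std::size_t size;
};

struct TbxFaultData {
    std::size_t size;
    bool hasBeenDownloaded;
};

class CpuFaultManagerSketch {
  public:
    virtual ~CpuFaultManagerSketch() = default;

    void insertCpuAllocation(void *ptr, std::size_t size) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        cpuData[ptr] = CpuFaultData{size};
    }

    // Returns true if the faulting address belongs to a tracked allocation.
    virtual bool handleFault(void *ptr) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        return cpuData.count(ptr) != 0;
    }

  private:
    std::mutex cpuMutex;
    std::unordered_map<void *, CpuFaultData> cpuData;
};

class TbxFaultManagerSketch : public CpuFaultManagerSketch {
  public:
    void insertTbxAllocation(void *ptr, std::size_t size) {
        std::lock_guard<std::mutex> lock(tbxMutex);
        tbxData[ptr] = TbxFaultData{size, false};
    }

    bool handleFault(void *ptr) override {
        {
            std::lock_guard<std::mutex> lock(tbxMutex);
            auto it = tbxData.find(ptr);
            if (it != tbxData.end()) {
                it->second.hasBeenDownloaded = true;
                return true;
            }
        } // tbxMutex is released here, before the base handler locks cpuMutex.
        return CpuFaultManagerSketch::handleFault(ptr);
    }

  private:
    std::mutex tbxMutex;
    std::unordered_map<void *, TbxFaultData> tbxData;
};
```

Keeping the tbx data and its mutex private to the derived class also means the base cpu path never touches tbx state, which matches the intent of not affecting the original code paths.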