compute-runtime

mirror of https://github.com/intel/compute-runtime.git synced 2026-01-04 15:53:45 +08:00

Author	SHA1	Message	Date
Jack Myers	5f78147e16	fix: hotfix for svmcpu tbx uploads Test program in the linked, related issue is crashing in tbx mode. Tbx server indicated upload of invalid memory was made before exit. Running with debug messages showed that the problematic upload was an svmcpu buffer when running neo with separate cpu and gpu buffers for shared memory management. Using this info, the problem was narrowed down to a missing unprotect call in page fault manager related code, resulting in a protected (invalid) memory region getting uploaded to tbx. It is not yet clear why this unprotect call was not made, since other svmcpu buffers were uploaded without issue. This hotfix forces the unprotect call in the fault handler, which allows the test program to run to completion. However, there is now a failing test case. Considering the critical nature of the associated NEO issue and that this patch should unblock the work depending on the fix, this hotfix should get merged regardless of the failing test case. In the meantime, I will continue triaging the failing test and will implement a proper fix once the root cause is isolated. Related-To: NEO-13404 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-03-14 04:47:21 +01:00
Jack Myers	c26d24e555	fix: tbx page fault manager hang issue - Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being faultable. - Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and `TbxPageFaultManager` to allow recursive locking. - Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer` in `TbxCommandStreamTests`. Related-To: NEO-13748 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-02-18 05:05:38 +01:00
Compute-Runtime-Validation	116f7270be	Revert "fix: tbx page fault manager hang issue" This reverts commit `7d4e70a25b`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2025-02-12 10:38:05 +01:00
Jack Myers	7d4e70a25b	fix: tbx page fault manager hang issue - Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being faultable. - Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and `TbxPageFaultManager` to allow recursive locking. - Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer` in `TbxCommandStreamTests`. Related-To: NEO-13748 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-02-12 02:19:37 +01:00
Jack Myers	d62122a656	fix: exceptions to TBX faultable types This commit addresses a bug in the previous implementation where almost all once writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX faultable checks. The fix ensures that only the subset of once writable types that are also lockable are considered TBX faultable, using the lockable check to avoid manual exceptions and re-inventing the wheel. Changes: - Updated `isAllocTbxFaultable` method to check if the allocation type is lockable in addition to being once writable. - Refactored unit tests to include separate checks for lockable and non-lockable allocation types. Performance optimization: - Removed unnecessary memory data erasure in `handlePageFault` to avoid constant erase/insert operations, leveraging the O(1) search time of unordered maps. Related-To: NEO-12319 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-17 00:52:49 +01:00
Jack Myers	7f9fadc314	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. Recursive locking of mutex caused deadlock. To fix, separate tbx fault data from base cpu fault data, allowing separate mutexes for each, eliminating recursive locks on the same mutex. By separating, we also help ensure that tbx-related changes don't affect the original cpu fault manager code paths. As an added safeguard preventing critical regressions and avoiding another auto-revert, the tbx fault manager is hidden behind a new debug flag which is disabled by default. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-09 07:48:53 +01:00
Compute-Runtime-Validation	124e755b9d	Revert "fix: regression caused by tbx fault mngr" This reverts commit `9a14fe2478`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-19 17:35:03 +01:00
Jack Myers	9a14fe2478	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. This fixes attempts by the tbx fault manager to protect/unprotect host buffer memory, even if the host ptr was not driver-allocated. In the case of the smoke test that triggered the critical regression, clCreateBuffer was called with the CL_MEM_USE_HOST_PTR flag. The subsequent `mprotect` calls on the provided host ptr then failed. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation	6c5d9a6ed7	Revert "feature: extend TBX page fault manager from CPU implementation" This reverts commit `51c0e80299`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-12 12:30:22 +01:00
Jack Myers	51c0e80299	feature: extend TBX page fault manager from CPU implementation In TBX mode, the host could not write to host buffers after access from device code due to the lack of a migration mechanism post-initial TBX upload. Migration is unnecessary with real hardware, but required for TBX. This patch introduces a new page fault manager type that extends the original CPU fault manager, enabling automatic migration of host buffers in TBX mode. Refactoring was necessary to avoid diamond inheritance, achieved by using a template parameter as the base class for OS-specific fault managers. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-11 09:09:50 +01:00

10 Commits
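
Commit `51c0e80299` mentions avoiding diamond inheritance by using a template parameter as the base class for the OS-specific fault managers. The sketch below illustrates that general pattern only. It is not the code from the repository: every class and method name here (`CpuFaultManagerSketch`, `TbxFaultManagerSketch`, `LinuxFaultManagerSketch`, `insertTbxAllocation`, and so on) is hypothetical, and the `protect`/`unprotect` methods merely record state where a real implementation would presumably call something like `mprotect`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the generic CPU fault manager.
class CpuFaultManagerSketch {
  public:
    virtual ~CpuFaultManagerSketch() = default;

    void insertAllocation(void *ptr, std::size_t size) { allocations[ptr] = size; }

    // Returns a label naming the path taken, so the behaviour is observable.
    virtual std::string handleFault(void *ptr) {
        return allocations.count(ptr) ? "cpu-migrate" : "unhandled";
    }

  protected:
    std::unordered_map<void *, std::size_t> allocations;
};

// Hypothetical TBX variant: extends the CPU manager and keeps its own,
// separate bookkeeping for TBX-faultable allocations.
class TbxFaultManagerSketch : public CpuFaultManagerSketch {
  public:
    void insertTbxAllocation(void *ptr) { tbxAllocations.insert(ptr); }

    std::string handleFault(void *ptr) override {
        if (tbxAllocations.count(ptr)) {
            return "tbx-reupload";
        }
        // Anything that is not a TBX allocation takes the original CPU path.
        return CpuFaultManagerSketch::handleFault(ptr);
    }

  private:
    std::unordered_set<void *> tbxAllocations;
};

// OS-specific layer. Because the base class is a template parameter, one
// definition serves both managers, and no class ends up inheriting the CPU
// manager along two different paths.
template <typename BaseFaultManager>
class LinuxFaultManagerSketch : public BaseFaultManager {
  public:
    // Assumption: a real implementation would change page permissions here.
    void protect(void *ptr) { protectedPages.insert(ptr); }
    void unprotect(void *ptr) { protectedPages.erase(ptr); }
    bool isProtected(void *ptr) const { return protectedPages.count(ptr) != 0; }

  private:
    std::unordered_set<void *> protectedPages;
};

// Small self-check that exercises both instantiations.
inline void faultManagerSketchDemo() {
    int cpuBuffer = 0;
    int tbxBuffer = 0;

    LinuxFaultManagerSketch<TbxFaultManagerSketch> tbxManager;
    tbxManager.insertAllocation(&cpuBuffer, sizeof(cpuBuffer));
    tbxManager.insertTbxAllocation(&tbxBuffer);
    assert(tbxManager.handleFault(&cpuBuffer) == "cpu-migrate");
    assert(tbxManager.handleFault(&tbxBuffer) == "tbx-reupload");

    LinuxFaultManagerSketch<CpuFaultManagerSketch> cpuManager;
    cpuManager.protect(&cpuBuffer);
    assert(cpuManager.isProtected(&cpuBuffer));
}
```

With this layout, `LinuxFaultManagerSketch<CpuFaultManagerSketch>` and `LinuxFaultManagerSketch<TbxFaultManagerSketch>` are two linear inheritance chains. The alternative of a non-template OS class deriving from the CPU manager, combined with a TBX class that also derives from it, is what would produce the diamond.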