42 Commits

Author SHA1 Message Date
Jack Myers 5f78147e16 fix: hotfix for svmcpu tbx uploads
Test program in the linked, related issue
is crashing in tbx mode. Tbx server indicated
upload of invalid memory was made before exit.

Running with debug messages showed that the
problematic upload was an svmcpu buffer when
running neo with separate cpu and gpu
buffers for shared memory management.

Using this info, the problem was narrowed down
to a missing unprotect call in page fault manager
related code, resulting in a protected (invalid)
memory region getting uploaded to tbx.

It is not yet clear why this unprotect call was not made,
since other svmcpu buffers were uploaded without issue.

This hotfix forces the unprotect call in the fault handler,
which allows the test program to run to completion. However,
there is now a failing test case.

Considering the critical nature of the associated
NEO issue and that this patch should unblock
the work depending on the fix, this hotfix should
get merged regardless of the failing test case.

In the meantime, I will continue triaging the
failing test and will implement a proper fix
once the root cause is isolated.

Related-To: NEO-13404
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-03-14 04:47:21 +01:00
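Note on 5f78147e16: the failure mode described above (a still-protected region being read for upload) can be illustrated with a minimal POSIX sketch. All names here (`TrackedRegion`, `protectRegion`, `unprotectRegion`, `copyForUpload`) are hypothetical and are not NEO code. The hotfix forces the unprotect call in the fault handler; the sketch places the same check in an upload helper only to stay self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical region tracked by a fault manager; not a NEO type.
struct TrackedRegion {
    void *ptr;
    std::size_t size;
    bool isProtected;
};

// Make the region inaccessible so that any host access would fault.
bool protectRegion(TrackedRegion &region) {
    if (mprotect(region.ptr, region.size, PROT_NONE) != 0) {
        return false;
    }
    region.isProtected = true;
    return true;
}

// Restore read/write access to the region.
bool unprotectRegion(TrackedRegion &region) {
    if (mprotect(region.ptr, region.size, PROT_READ | PROT_WRITE) != 0) {
        return false;
    }
    region.isProtected = false;
    return true;
}

// Copy the region into a staging buffer, forcing an unprotect first so
// that a protected (unreadable) region is never read.
bool copyForUpload(TrackedRegion &region, std::vector<unsigned char> &out) {
    if (region.isProtected && !unprotectRegion(region)) {
        return false;
    }
    out.resize(region.size);
    std::memcpy(out.data(), region.ptr, region.size);
    return true;
}
```

Without the forced unprotect, the `memcpy` on a `PROT_NONE` region would raise SIGSEGV, which is the host-side analogue of the invalid upload reported by the tbx server.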
Jack Myers 7f9fadc314 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

Recursive locking of mutex caused deadlock.

To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.

By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.

As an added safeguard against critical regressions
and another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-09 07:48:53 +01:00
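Note on 7f9fadc314: the mutex separation described above can be sketched as follows. This is a minimal illustration under assumed names (`FaultManagerSketch`, `CpuFaultData`, `TbxFaultData`), not the NEO implementation. The TBX path holds its own mutex while calling into the CPU path, which would be a recursive lock (and a deadlock with `std::mutex`) if both paths shared one mutex.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins for the two kinds of fault data; not NEO types.
struct CpuFaultData { bool hostHasAccess = false; };
struct TbxFaultData { bool needsUpload = false; };

class FaultManagerSketch {
  public:
    void insertCpu(void *ptr) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        cpuData[ptr] = CpuFaultData{};
    }

    void insertTbx(void *ptr) {
        std::lock_guard<std::mutex> lock(tbxMutex);
        tbxData[ptr] = TbxFaultData{};
    }

    // Entry point: try the TBX data first, then fall back to the CPU path.
    bool handleFault(void *ptr) {
        std::lock_guard<std::mutex> lock(tbxMutex);
        auto it = tbxData.find(ptr);
        if (it != tbxData.end()) {
            it->second.needsUpload = true;
            return true;
        }
        // Still holding tbxMutex here. This call is safe only because the
        // CPU path locks a different mutex.
        return handleCpuFault(ptr);
    }

  private:
    bool handleCpuFault(void *ptr) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        auto it = cpuData.find(ptr);
        if (it == cpuData.end()) {
            return false;
        }
        it->second.hostHasAccess = true;
        return true;
    }

    std::mutex cpuMutex; // guards cpuData only
    std::mutex tbxMutex; // guards tbxData only
    std::unordered_map<void *, CpuFaultData> cpuData;
    std::unordered_map<void *, TbxFaultData> tbxData;
};
```

The lock order in this sketch is always tbxMutex then cpuMutex, so the two mutexes cannot deadlock against each other.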
Compute-Runtime-Validation 124e755b9d Revert "fix: regression caused by tbx fault mngr"
This reverts commit 9a14fe2478.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-19 17:35:03 +01:00
Jack Myers 9a14fe2478 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

This fixes attempts by the tbx fault manager
to protect/unprotect host buffer memory, even
if the host ptr was not driver-allocated.

In the case of the smoke test that triggered
the critical regression, clCreateBuffer was
called with the CL_MEM_USE_HOST_PTR flag.
The subsequent `mprotect` calls on the
provided host ptr then failed.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-18 23:16:36 +01:00
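Note on 9a14fe2478: the guard described above (never change page protection on memory the driver did not allocate) might look roughly like this. `HostBufferSketch` and `mayProtect` are hypothetical names used for illustration; how NEO actually records allocation ownership is not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a host buffer; not a NEO type.
struct HostBufferSketch {
    void *ptr;
    std::size_t size;
    bool driverAllocated; // false for user memory, e.g. CL_MEM_USE_HOST_PTR
};

// Returns true only when the fault manager may call mprotect on the buffer.
// User-provided pointers are skipped: they need not be page-aligned, and
// mprotect rejects unaligned addresses.
bool mayProtect(const HostBufferSketch &buffer) {
    return buffer.ptr != nullptr && buffer.size > 0 && buffer.driverAllocated;
}
```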
Compute-Runtime-Validation 6c5d9a6ed7 Revert "feature: extend TBX page fault manager from CPU implementation"
This reverts commit 51c0e80299.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-12 12:30:22 +01:00
Jack Myers 51c0e80299 feature: extend TBX page fault manager from CPU implementation
In TBX mode, the host could not write to host buffers after access from device
code due to the lack of a migration mechanism post-initial TBX upload.
Migration is unnecessary with real hardware, but required for TBX.

This patch introduces a new page fault manager type that extends the original
CPU fault manager, enabling automatic migration of host buffers in TBX mode.

Refactoring was necessary to avoid diamond inheritance, achieved by using a
template parameter as the base class for OS-specific fault managers.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-11 09:09:50 +01:00
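Note on 51c0e80299: the refactoring described above, using a template parameter as the base class of the OS-specific fault manager, can be sketched as below. The class names are hypothetical and the bodies are placeholders; only the inheritance shape is the point.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; not the NEO class hierarchy.
class CpuFaultManagerSketch {
  public:
    virtual ~CpuFaultManagerSketch() = default;
    virtual std::string handle() { return "cpu"; }
};

// The TBX manager extends the CPU manager with extra behaviour.
class TbxFaultManagerSketch : public CpuFaultManagerSketch {
  public:
    std::string handle() override {
        return "tbx+" + CpuFaultManagerSketch::handle();
    }
};

// The OS-specific layer takes its base class as a template parameter, so
// the same code can sit on top of either manager. Each instantiation has a
// single inheritance chain, so no diamond arises.
template <typename BaseFaultManager>
class LinuxFaultManagerSketch : public BaseFaultManager {
  public:
    std::string describe() { return "linux:" + this->handle(); }
};
```

Without the template, an OS-specific TBX manager would have to inherit from both the OS-specific CPU manager and the TBX manager, each of which derives from the CPU manager.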
Bartosz Dunajski dab4166837 fix: add missing aub polls on sync points
Related-To: HSD-14023925176

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-11-21 09:17:54 +01:00
Bartosz Dunajski dd8460beba refactor: reduce TBX download timeout for unit tests
Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-09-09 19:05:03 +02:00
Bartosz Dunajski db611962f7 fix: improve task count handling in tbx download path
Related-To: HSD-18039789178

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-28 15:32:15 +02:00
Szymon Morek b8f181d50e performance: remove trim candidate list
Related-To: NEO-11755

Removing trim candidate list reduces overhead
caused by residency handling. Allocations required
for eviction are placed in eviction container managed
by CSR.

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-08-23 12:21:50 +02:00
Bartosz Dunajski 696b02bfd3 fix: improve TBX downloading after L0 Event sync
Related-To: HSD-18038498579

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-23 10:42:17 +02:00
Bartosz Dunajski 24cfd203ab fix: don't download tbx allocations on heapless first device submission
Related-To: HSD-18039476929

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-06 14:03:42 +02:00
Mateusz Jablonski cb2b572e94 feature: add support for null aub mode
In this mode, the AUB CSR is created but no AUB file is written.

Related-To: NEO-11097
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2024-04-09 16:59:42 +02:00
Mateusz Hoppe 83ac95d293 fix: L0 - remove synchronization with events on appends in tbx mode
Related-To: NEO-9400

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-11-27 10:39:55 +01:00
Jablonski, Mateusz ac5f64f5c6 fix: fix compilation error in clang on Windows (2/n)
Signed-off-by: Jablonski, Mateusz <mateusz.jablonski@intel.com>
2023-10-24 15:59:06 +02:00
Dunajski, Bartosz 25195ebc96 fix: capability to write memory chunk in aub/tbx mode
Related-To: GSD-6604

Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-10-19 19:13:11 +02:00
Dunajski, Bartosz cd9ad1f04c fix: decanonize GPU VA during TBX memory read.
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-07-26 19:44:19 +02:00
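Note on cd9ad1f04c: decanonizing a GPU virtual address means clearing the sign-extended upper bits so that only the bits within the address width remain. A minimal sketch, assuming a free function named `decanonizeSketch` and a caller-supplied address width (both hypothetical; the NEO helper is a GmmHelper member):

```cpp
#include <cassert>
#include <cstdint>

// Clear every bit above addressWidth. A canonical address has those bits
// filled with copies of the top address bit; decanonizing drops them.
constexpr std::uint64_t decanonizeSketch(std::uint64_t address, unsigned addressWidth) {
    return addressWidth >= 64
               ? address
               : (address & ((std::uint64_t{1} << addressWidth) - 1));
}
```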
Warchulski, Jaroslaw 6fb68dd84b Separation of MemoryAllocation from os_agnostic_memory_manager.h
Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-01-04 15:09:36 +01:00
Zbigniew Zdanowicz c0c9ce548a Validate level zero events in TBX mode
Related-To: NEO-7545

Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
2022-11-28 16:45:02 +01:00
Maciej Plewka 4b42b066f8 Use dedicated using type for TaskCount
Related-To: NEO-7155

Signed-off-by: Maciej Plewka <maciej.plewka@intel.com>
2022-11-28 16:44:44 +01:00
Mateusz Jablonski a17df8fa86 Return SubmissionStatus from processResidency method
This allows a non-binary status to be returned to the API layer.

Related-To: NEO-7412
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2022-11-15 13:17:43 +01:00
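Note on a17df8fa86: a non-binary status lets the API layer distinguish, for example, an out-of-memory condition from a generic failure. The sketch below is illustrative only; the enum values, the function signature and the API result names are assumptions, not the NEO definitions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status values; the real SubmissionStatus may differ.
enum class SubmissionStatusSketch { success, failed, outOfMemory };

// Hypothetical API-level result codes.
enum class ApiResultSketch { success, unknownError, outOfResources };

// Decide whether the required memory can be made resident, evicting
// other allocations if needed.
SubmissionStatusSketch processResidencySketch(std::size_t requiredBytes,
                                              std::size_t freeBytes,
                                              std::size_t evictableBytes,
                                              bool deviceOk) {
    if (!deviceOk) {
        return SubmissionStatusSketch::failed;
    }
    if (requiredBytes <= freeBytes + evictableBytes) {
        return SubmissionStatusSketch::success;
    }
    // Nothing left to evict: report it instead of aborting.
    return SubmissionStatusSketch::outOfMemory;
}

// Translate the internal status into an API-level result.
ApiResultSketch toApiResult(SubmissionStatusSketch status) {
    switch (status) {
    case SubmissionStatusSketch::success:
        return ApiResultSketch::success;
    case SubmissionStatusSketch::outOfMemory:
        return ApiResultSketch::outOfResources;
    default:
        return ApiResultSketch::unknownError;
    }
}
```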
Dunajski, Bartosz 06a647a5e9 Set SkipResourceCleanup in TBX mode
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2022-10-27 12:23:08 +02:00
Compute-Runtime-Validation 638aba45a0 Revert "Set SkipResourceCleanup in TBX mode"
This reverts commit cb83c1d935.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2022-10-26 07:09:29 +02:00
Dunajski, Bartosz cb83c1d935 Set SkipResourceCleanup in TBX mode
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2022-10-25 14:31:35 +02:00
Fabian Zwolinski 645600d141 Return error when there is no memory to evict
We want to return error code to the application instead of aborting when
we are not able to make more memory resident.

Related-To: NEO-7289
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
2022-09-22 14:26:55 +02:00
Krzysztof Gibala 2fcda0a528 Refactor: Change decanonize method accessing point
Access the decanonize method as a member of a GmmHelper object.

Signed-off-by: Krzysztof Gibala <krzysztof.gibala@intel.com>
2022-05-11 12:57:02 +02:00
Jobczyk, Lukasz a285712cc4 Add missing download allocation calls
Signed-off-by: Jobczyk, Lukasz <lukasz.jobczyk@intel.com>
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2022-03-31 09:49:22 +02:00
Lukasz Jobczyk a230f267e1 Poll task count indefinitely on high throttle command queue
Resolves: NEO-6781

Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2022-03-25 10:06:16 +01:00
Patryk Wrobel 7f729b7f89 Detect GPU hang in clWaitForEvents
This change:
- moves NEO::WaitStatus to a separate file
- enables detection of GPU hang in clWaitForEvents
- adjusts most of blocking calls in CommandStreamReceiver to return WaitStatus
- adds ULTs to cover the new code

Related-To: NEO-6681
Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>
2022-02-23 13:33:09 +01:00
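Note on 7f729b7f89: returning a wait status from blocking calls lets callers tell a timeout apart from a GPU hang. A minimal sketch, assuming hypothetical names (`WaitStatusSketch`, `waitForTaskCountSketch`) and callbacks in place of the real tag read and hang query:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical three-state result; the real NEO::WaitStatus may differ.
enum class WaitStatusSketch { notReady, ready, gpuHang };

// Poll the completion tag up to maxPolls times. Stop early when the task
// count is reached or when the hang check reports a hang.
WaitStatusSketch waitForTaskCountSketch(std::uint32_t targetTaskCount,
                                        int maxPolls,
                                        const std::function<std::uint32_t()> &readTag,
                                        const std::function<bool()> &isGpuHang) {
    for (int poll = 0; poll < maxPolls; ++poll) {
        if (readTag() >= targetTaskCount) {
            return WaitStatusSketch::ready;
        }
        if (isGpuHang()) {
            return WaitStatusSketch::gpuHang;
        }
    }
    return WaitStatusSketch::notReady;
}
```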
Patryk Wrobel 498cf5e871 Implement GPU hang detection
This change uses DRM_IOCTL_I915_GET_RESET_STATS to detect
GPU hangs. When such a situation is encountered,
zeCommandQueueSynchronize returns ZE_RESULT_ERROR_DEVICE_LOST.

Related-To: NEO-5313
Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>
2022-01-31 13:48:17 +01:00
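Note on 498cf5e871: the decision logic behind reset-stats based hang detection can be sketched without the kernel headers. The struct below mirrors only the batch counters reported by the i915 reset-stats query; the struct itself, the rule "any lost batch means a hang", and the result enum are assumptions made for illustration, and the actual ioctl call is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the per-context counters returned by the
// reset-stats query; not the kernel's struct.
struct ResetStatsSketch {
    std::uint32_t contextId;
    std::uint32_t batchActive;  // batches lost while executing on the GPU
    std::uint32_t batchPending; // batches lost while queued
};

// Hypothetical stand-in for the API result codes.
enum class SyncResultSketch { success, deviceLost };

// Assumed rule: any lost batch for this context is treated as a hang.
bool isGpuHangSketch(const ResetStatsSketch &stats) {
    return stats.batchActive > 0 || stats.batchPending > 0;
}

// Map the hang check onto the result a synchronize call would return.
SyncResultSketch synchronizeSketch(const ResetStatsSketch &stats) {
    return isGpuHangSketch(stats) ? SyncResultSketch::deviceLost
                                  : SyncResultSketch::success;
}
```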
Raiyan Latif 394c0e90e1 Return error when failing on submission
Signed-off-by: Raiyan Latif <raiyan.latif@intel.com>
2022-01-12 16:42:30 +01:00
Lukasz Jobczyk b59b0b6b36 Download timestamps before checking completion
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2021-12-28 08:14:27 +01:00
Lukasz Jobczyk 7f1c87f049 Fix flush tag update in TBX mode
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2021-11-24 12:30:29 +01:00
Filip Hazubski f1b696824c Ensure engine is initialized when writing memory in AUB and TBX modes
Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>
2021-10-14 19:45:43 +02:00
Filip Hazubski 27b5952ba3 Correct initializeEngine usage in aub tests
Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>
2021-10-14 19:36:00 +02:00
Zbigniew Zdanowicz 3b35ba504f Adapt command stream receiver to multiple active partitions
Related-To: NEO-6244

Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
2021-09-23 14:32:20 +02:00
Mateusz Jablonski f8867e0b97 Move generic command stream receiver files to shared
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2021-09-22 23:55:31 +02:00
Mateusz Jablonski d348526941 Simplify checkAndActivateAubSubCapture method
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2021-09-22 20:23:56 +02:00
Zbigniew Zdanowicz cd4f3c221a Synchronize switching command buffers for all partitions
Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>
2021-09-08 12:12:23 +02:00
Filip Hazubski 29c64c3dd0 Disable implicit scaling for cooperative kernels
When implicit scaling is disabled, use useSingleSubdeviceValue = true.

Resolves: NEO-5757

Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>
2021-08-18 14:56:37 +02:00
Bartosz Dunajski c7a936d1f4 Add memory banks to Simulated CSR
Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2021-07-05 12:19:58 +02:00
Dominik Dabek 39f0387ecc Move tbx stream, tbx csr to shared
Related-To: NEO-5161

Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>
2021-05-31 14:35:32 +02:00