69 Commits

Author SHA1 Message Date
Jack Myers c26d24e555 fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-18 05:05:38 +01:00
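Note on c26d24e555: the switch to a recursive lock matters when the thread that already holds the lock re-enters a locked path. The sketch below is a hypothetical recursive spin lock written only to illustrate that behavior. `RecursiveSpinLockSketch` and `nestedLockDepth` are assumed names, not the NEO `RecursiveSpinLock` implementation.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical recursive spin lock, for illustration only (not the NEO class).
class RecursiveSpinLockSketch {
  public:
    void lock() {
        const auto self = std::this_thread::get_id();
        if (owner.load(std::memory_order_acquire) == self) {
            ++depth; // re-entry by the owning thread: no spinning
            return;
        }
        while (flag.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield(); // another thread holds the lock
        }
        owner.store(self, std::memory_order_release);
        depth = 1;
    }

    void unlock() {
        if (--depth == 0) {
            owner.store(std::thread::id{}, std::memory_order_release);
            flag.clear(std::memory_order_release);
        }
    }

    // Only meaningful when called by the owning thread or while unlocked.
    int recursionDepth() const { return depth; }

  private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    std::atomic<std::thread::id> owner{std::thread::id{}};
    int depth = 0;
};

// Nested acquisition on one thread, as a fault handler re-entering a locked
// path might do. A non-recursive spin lock would spin forever on `inner`.
int nestedLockDepth(RecursiveSpinLockSketch &lock) {
    std::lock_guard<RecursiveSpinLockSketch> outer(lock);
    std::lock_guard<RecursiveSpinLockSketch> inner(lock);
    return lock.recursionDepth();
}
```

The lock is fully released only when the outermost guard is destroyed, so other threads stay excluded for the whole nested region.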
Compute-Runtime-Validation 116f7270be Revert "fix: tbx page fault manager hang issue"
This reverts commit 7d4e70a25b.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2025-02-12 10:38:05 +01:00
Jack Myers 7d4e70a25b fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-12 02:19:37 +01:00
Jack Myers d62122a656 fix: exceptions to TBX faultable types
This commit addresses a bug in the previous implementation where almost all once
writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX
faultable checks. The fix ensures that only the subset of once writable
types that are also lockable are considered TBX faultable, using the lockable
check to avoid manual exceptions and re-inventing the wheel.

Changes:

- Updated `isAllocTbxFaultable` method to check if the allocation type is
lockable in addition to being once writable.
- Refactored unit tests to include separate checks for lockable and non-lockable
allocation types.

Performance optimization:

- Removed unnecessary memory data erasure in `handlePageFault` to avoid constant
erase/insert operations, leveraging the O(1) search time of unordered maps.

Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-17 00:52:49 +01:00
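Note on d62122a656: the rule described there (faultable only if once writable and also lockable) can be sketched as a single predicate. Everything below is hypothetical: the reduced enum, the two helper predicates, and in particular which types each helper accepts are assumptions chosen for illustration, not the actual NEO tables.

```cpp
#include <cstdint>

// Hypothetical, reduced set of allocation types; the real enum is much larger.
enum class AllocTypeSketch : std::uint8_t {
    buffer,
    bufferHostMemory,
    gpuTimestampDeviceBuffer,
    commandBuffer
};

// Assumed stand-in for the once-writable check (membership is illustrative).
constexpr bool isOnceWritableSketch(AllocTypeSketch type) {
    return type == AllocTypeSketch::buffer ||
           type == AllocTypeSketch::bufferHostMemory ||
           type == AllocTypeSketch::gpuTimestampDeviceBuffer;
}

// Assumed stand-in for the lockable check (membership is illustrative).
constexpr bool isLockableSketch(AllocTypeSketch type) {
    return type == AllocTypeSketch::buffer ||
           type == AllocTypeSketch::bufferHostMemory ||
           type == AllocTypeSketch::commandBuffer;
}

// Faultable only when both checks pass, so no per-type exception list is needed.
constexpr bool isAllocTbxFaultableSketch(AllocTypeSketch type) {
    return isOnceWritableSketch(type) && isLockableSketch(type);
}
```

Under these assumed tables, a type that is once writable but not lockable drops out without a hand-written exception.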
Jack Myers 0b2ac4d331 feature: Tbx faults for all once writable types
Patch #34223 introduced the TbxPageFaultManager for handling
uploads/downloads of host buffers to the Tbx server, ensuring
host memory is kept consistent between the host and device,
even after multiple alternating writes from the host and gpu.

This patch enables fault handling for all `isAubOnceWritable`
types.

Minor exception for gpuTimestampBuffers as enabling this type
seems to break things in real-world use cases outside of ULTs.

Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-16 01:43:19 +01:00
Jack Myers 7f9fadc314 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

Recursive locking of mutex caused deadlock.

To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.

By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.

As an added safeguard against critical regressions
and another auto-revert, the tbx fault manager
is hidden behind a new debug flag, which is disabled by default.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-09 07:48:53 +01:00
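Note on 7f9fadc314: a minimal sketch of the separation described there, assuming a derived manager that keeps its own data behind its own mutex. All class and member names are hypothetical. Because the derived handler and the base handler lock different mutexes, calling the base while holding the derived lock does not re-lock the same mutex.

```cpp
#include <mutex>
#include <unordered_set>

// Hypothetical base manager: CPU fault data guarded by its own mutex.
class CpuFaultManagerSketch {
  public:
    virtual ~CpuFaultManagerSketch() = default;

    void insertAllocation(void *ptr) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        cpuFaultData.insert(ptr);
    }

    virtual bool handleFault(void *ptr) {
        std::lock_guard<std::mutex> lock(cpuMutex);
        return cpuFaultData.count(ptr) != 0;
    }

  private:
    std::mutex cpuMutex;
    std::unordered_set<void *> cpuFaultData;
};

// Hypothetical TBX manager: separate data, separate mutex.
class TbxFaultManagerSketch : public CpuFaultManagerSketch {
  public:
    void insertTbxAllocation(void *ptr) {
        std::lock_guard<std::mutex> lock(tbxMutex);
        tbxFaultData.insert(ptr);
    }

    bool handleFault(void *ptr) override {
        std::lock_guard<std::mutex> lock(tbxMutex); // guards TBX data only
        if (tbxFaultData.count(ptr) != 0) {
            return true;
        }
        // The base locks cpuMutex, a different mutex, so this call cannot
        // deadlock on tbxMutex. Lock order is always tbxMutex, then cpuMutex.
        return CpuFaultManagerSketch::handleFault(ptr);
    }

  private:
    std::mutex tbxMutex;
    std::unordered_set<void *> tbxFaultData;
};
```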
Compute-Runtime-Validation 124e755b9d Revert "fix: regression caused by tbx fault mngr"
This reverts commit 9a14fe2478.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-19 17:35:03 +01:00
Jack Myers 9a14fe2478 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

This fixes attempts by the tbx fault manager
to protect/unprotect host buffer memory, even
if the host ptr was not driver-allocated.

In the case of the smoke test that triggered
the critical regression, clCreateBuffer was
called with the CL_MEM_USE_HOST_PTR flag.
The subsequent `mprotect` calls on the
provided host ptr then failed.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation 6c5d9a6ed7 Revert "feature: extend TBX page fault manager from CPU implementation"
This reverts commit 51c0e80299.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-12 12:30:22 +01:00
Jack Myers 51c0e80299 feature: extend TBX page fault manager from CPU implementation
In TBX mode, the host could not write to host buffers after access from device
code due to the lack of a migration mechanism post-initial TBX upload.
Migration is unnecessary with real hardware, but required for TBX.

This patch introduces a new page fault manager type that extends the original
CPU fault manager, enabling automatic migration of host buffers in TBX mode.

Refactoring was necessary to avoid diamond inheritance, achieved by using a
template parameter as the base class for OS-specific fault managers.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-11 09:09:50 +01:00
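Note on 51c0e80299: the "template parameter as the base class" refactor can be sketched as follows. The class names and the `describe` method are hypothetical and exist only to show the shape. The OS-specific layer is written once and instantiated over either base, so no class ever inherits the CPU manager along two paths.

```cpp
#include <string>

// Hypothetical generic CPU fault manager.
class CpuPageFaultManagerSketch {
  public:
    virtual ~CpuPageFaultManagerSketch() = default;
    virtual std::string describe() const { return "cpu"; }
};

// Hypothetical TBX manager extending the CPU one.
class TbxPageFaultManagerSketch : public CpuPageFaultManagerSketch {
  public:
    std::string describe() const override { return "tbx"; }
};

// OS-specific layer parameterized on its base class. Each instantiation has a
// single linear inheritance chain, which avoids the diamond.
template <typename BaseFaultManager>
class LinuxFaultManagerSketch : public BaseFaultManager {
  public:
    std::string describe() const override {
        return "linux+" + BaseFaultManager::describe();
    }
};

using LinuxCpuFaultManagerSketch = LinuxFaultManagerSketch<CpuPageFaultManagerSketch>;
using LinuxTbxFaultManagerSketch = LinuxFaultManagerSketch<TbxPageFaultManagerSketch>;
```

Without the template, an OS-specific CPU manager and a TBX manager would both derive from the CPU base, and an OS-specific TBX manager combining them would need virtual inheritance.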
Bartosz Dunajski dab4166837 fix: add missing aub polls on sync points
Related-To: HSD-14023925176

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-11-21 09:17:54 +01:00
Bartosz Dunajski dd8460beba refactor: reduce TBX download timeout for unit tests
Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-09-09 19:05:03 +02:00
Bartosz Dunajski db611962f7 fix: improve task count handling in tbx download path
Related-To: HSD-18039789178

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-28 15:32:15 +02:00
Szymon Morek b8f181d50e performance: remove trim candidate list
Related-To: NEO-11755

Removing the trim candidate list reduces the overhead
caused by residency handling. Allocations required
for eviction are placed in an eviction container managed
by the CSR.

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-08-23 12:21:50 +02:00
Bartosz Dunajski 696b02bfd3 fix: improve TBX downloading after L0 Event sync
Related-To: HSD-18038498579

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-23 10:42:17 +02:00
Bartosz Dunajski 24cfd203ab fix: don't download tbx allocations on heapless first device submission
Related-To: HSD-18039476929

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-06 14:03:42 +02:00
Mateusz Hoppe b3d72ddd3d fix: write memory for resident allocations in simulation mode
- refactor and call processFlushResidency() on memoryOperationsHandler
- call free() to remove allocation from resident allocations when
graphics allocation is released

Related-To: NEO-11719

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2024-06-14 18:49:01 +02:00
Mateusz Jablonski cb2b572e94 feature: add support for null aub mode
In this mode, the AUB CSR is created, but no AUB file is written.

Related-To: NEO-11097
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2024-04-09 16:59:42 +02:00
Filip Hazubski d25026b263 refactor: Add getTotalMemBankSize function to ReleaseHelper
Minor refactor of ULTs to not use hard coded banks size.

Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>
2024-03-06 09:53:56 +01:00
Michal Mrozek 64232ec370 fix: choose proper csr for low priority immediate command lists
Resolves: NEO-10168

Signed-off-by: Michal Mrozek <michal.mrozek@intel.com>
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2024-02-28 12:45:02 +01:00
Mateusz Jablonski de93bc6928 refactor: correct naming of enum class constants 10/n
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-12-19 11:30:39 +01:00
Mateusz Jablonski 739d181026 refactor: correct naming of enum class constants 6/n
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-12-13 14:48:52 +01:00
Mateusz Jablonski c9664e6bad refactor: rename global debug manager to debugManager
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-11-30 13:00:59 +01:00
Mateusz Hoppe 83ac95d293 fix: L0 - remove synchronization with events on appends in tbx mode
Related-To: NEO-9400

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-11-27 10:39:55 +01:00
Compute-Runtime-Validation fca2159430 Revert "fix: if device hierarchy is flat then getSubDevicesCount return 1u"
This reverts commit cb0bb57f49.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2023-10-26 15:40:29 +02:00
Baj, Tomasz cb0bb57f49 fix: if device hierarchy is flat then getSubDevicesCount return 1u
Related-To: NEO-9167

Signed-off-by: Baj, Tomasz <tomasz.baj@intel.com>
2023-10-25 15:51:52 +02:00
Mateusz Hoppe 52b0f32688 fix: offset cpu address when writing chunk in simulated csr
- the CPU address of the data must be offset along with gpuAddress
when writing memory.

Related-To: GSD-6604

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-10-23 17:01:25 +02:00
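Note on 52b0f32688: a minimal sketch of a chunk write in which the same offset is applied to both the GPU address and the CPU source pointer. `SimulatedMemorySketch` and `writeChunkSketch` are hypothetical names, and the flat byte vector stands in for whatever the simulated CSR actually writes to.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for simulator-side memory backing one allocation.
struct SimulatedMemorySketch {
    std::uint64_t baseGpuAddress;
    std::vector<std::uint8_t> bytes;
};

// Writes `size` bytes starting `offset` bytes into the allocation.
// The caller must keep offset + size within the allocation.
void writeChunkSketch(SimulatedMemorySketch &memory, std::uint64_t gpuAddress,
                      const void *cpuAddress, std::size_t offset, std::size_t size) {
    // Offset the destination GPU address ...
    const std::uint64_t chunkGpuAddress = gpuAddress + offset;
    // ... and the CPU source pointer by the same amount. Skipping this step
    // would copy the start of the host data into the middle of the allocation.
    const auto *chunkCpuAddress = static_cast<const std::uint8_t *>(cpuAddress) + offset;

    std::memcpy(memory.bytes.data() + (chunkGpuAddress - memory.baseGpuAddress),
                chunkCpuAddress, size);
}
```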
Dunajski, Bartosz 25195ebc96 fix: capability to write memory chunk in aub/tbx mode
Related-To: GSD-6604

Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-10-19 19:13:11 +02:00
Mateusz Hoppe f5cb7df7cd fix: do not download event allocation in TBX mode
- only download when allocation was used - indicated by taskCount
Resolves: NEO-8312

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-08-29 16:27:33 +02:00
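Note on f5cb7df7cd: the "download only if used" rule can be sketched as a check of the allocation's task count against a "never used" sentinel. The sentinel value and all names below are assumptions made for illustration and are not taken from the commit.

```cpp
#include <cstdint>
#include <limits>

using TaskCountSketch = std::uint64_t;

// Assumed sentinel meaning "this allocation was never submitted".
constexpr TaskCountSketch notUsedSketch = std::numeric_limits<TaskCountSketch>::max();

// Hypothetical event allocation carrying the last task count that used it.
struct EventAllocationSketch {
    TaskCountSketch taskCount = notUsedSketch;
};

// Download from the simulator only when some submission actually used the allocation.
inline bool shouldDownloadSketch(const EventAllocationSketch &allocation) {
    return allocation.taskCount != notUsedSketch;
}
```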
Dunajski, Bartosz cd9ad1f04c fix: decanonize GPU VA during TBX memory read.
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-07-26 19:44:19 +02:00
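Note on cd9ad1f04c: decanonizing a GPU virtual address means clearing the sign-extended upper bits so that only the low address bits remain. The sketch below assumes a 48-bit address width. The constant and the function name are hypothetical, and real hardware reports its own width.

```cpp
#include <cstdint>

// Assumed virtual address width for this sketch.
constexpr std::uint32_t addressWidthSketch = 48;

// Clears every bit above the address width, turning a canonical
// (sign-extended) address into the plain form a memory read would use.
constexpr std::uint64_t decanonizeSketch(std::uint64_t address) {
    return address & ((std::uint64_t{1} << addressWidthSketch) - 1);
}
```

For example, under the 48-bit assumption the canonical address `0xFFFF800000001000` becomes `0x0000800000001000`, while an address with no upper bits set is unchanged.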
Mateusz Jablonski 30c5d8a681 fix: pass gmm helper to getDumpSurfaceInfo function
gmm may not exist for buffer allocation

Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-07-03 11:59:52 +02:00
Dunajski, Bartosz 5fe9d70066 feature: new multitile post sync layout for immediate write [1/n]
No functional changes in this commit. This is prework.

Related-To: NEO-7966

Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-06-07 13:11:10 +02:00
Fabian Zwolinski e351a90f81 refactor: Rename member variables to camelCase 2/n
Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>
2023-04-27 20:39:22 +02:00
Kamil Kopryk fa8579602f refactor: rename product helper files n/n
Related-To: NEO-7703
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2023-03-10 13:24:38 +01:00
Warchulski, Jaroslaw 0556d543a3 Cleanup includes 56
Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-02-16 14:42:44 +01:00
Warchulski, Jaroslaw 8c17313c8b Cleanup includes 53
Cleaned up files:
opencl/source/mem_obj/image.inl
shared/offline_compiler/source/decoder/zebin_manipulator.h
shared/source/aub_mem_dump/aub_alloc_dump.h
shared/source/compiler_interface/intermediate_representations.h
shared/source/helpers/blit_commands_helper_base.inl
shared/source/utilities/debug_file_reader.h
shared/source/utilities/software_tags.h
shared/source/xe_hpc_core/hw_cmds_pvc.h

Related-To: NEO-5548

Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-02-15 13:21:14 +01:00
Kamil Kopryk 2484c7ceb2 refactor: rename hw_helper files to gfx_core_helper files
Related-To: NEO-6853
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2023-02-01 19:37:51 +01:00
Warchulski, Jaroslaw 77501d86ba Cleanup includes 35
Cleaned up files:
shared/source/command_stream/command_stream_receiver.h

Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-01-17 18:51:40 +01:00
Warchulski, Jaroslaw a2fe929f0c Cleanup includes 18
Cleaned up files:
shared/source/command_stream/command_stream_receiver_hw.h
shared/source/compiler_interface/compiler_interface.h
shared/source/direct_submission/direct_submission_hw.h
shared/source/helpers/dirty_state_helpers.h

Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2023-01-02 13:28:29 +01:00
Kamil Kopryk 3c5b3d4bac Refactor: don't use global ProductHelper getter in shared files 2/n
Related-To: NEO-6853
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2022-12-29 09:50:06 +01:00
Kamil Kopryk 232b886056 Rename HwInfoConfig to ProductHelper
Related-To: NEO-6853
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2022-12-14 14:39:52 +01:00
Mateusz Jablonski 8f308f24e5 Reduce usage of global gfx core helper getter [1/n]
Related-To: NEO-6853
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2022-12-09 17:27:37 +01:00
Kamil Kopryk 03b687881f Rename HwHelper -> GfxCoreHelper
Related-To: NEO-6853
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2022-12-09 10:29:06 +01:00
Maciej Plewka 4b42b066f8 Use dedicated using type for TaskCount
Related-To: NEO-7155

Signed-off-by: Maciej Plewka <maciej.plewka@intel.com>
2022-11-28 16:44:44 +01:00
Mateusz Jablonski a17df8fa86 Return SubmissionStatus from processResidency method
This allows a non-binary status to be returned to the API layer.

Related-To: NEO-7412
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2022-11-15 13:17:43 +01:00
Warchulski, Jaroslaw 6cbb3cfb05 Cleanup includes 3
Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2022-11-07 14:52:31 +01:00
Warchulski, Jaroslaw fb25f96081 Cleanup includes 2
Related-To: NEO-5548
Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>
2022-11-07 10:36:50 +01:00
Dunajski, Bartosz 06a647a5e9 Set SkipResourceCleanup in TBX mode
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2022-10-27 12:23:08 +02:00
Compute-Runtime-Validation 638aba45a0 Revert "Set SkipResourceCleanup in TBX mode"
This reverts commit cb83c1d935.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2022-10-26 07:09:29 +02:00
Dunajski, Bartosz cb83c1d935 Set SkipResourceCleanup in TBX mode
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2022-10-25 14:31:35 +02:00