Commit Graph

88 Commits

Author SHA1 Message Date
Mateusz Hoppe
b4e4fcf786 feature: add experimental extension to verify memory in aub mode
Related-To: NEO-14153, NEO-17038

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2025-12-16 13:57:32 +01:00
Kamil Kopryk
aef760b6b0 feature: add tbx support for host functions
Tbx requires a memory write after the driver changes
a mapped allocation.
Host functions use bytes mapped from tagAllocation.

Host function data update has 2 steps:
* update the mapped data in the driver
* write memory so Tbx can see the data

The tag allocation can be downloaded (downloadAllocation),
e.g. while waiting, and at the same time the host function worker thread
can update the data.
In such a scenario the updated mapped data could be reverted
by a concurrent downloadAllocation call.

I've added a lock to prevent concurrent downloadAllocation calls
from overlapping the two-step Tbx host function data update.

Related-To: NEO-14577
Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>
2025-12-16 09:55:51 +01:00
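The locking scheme described in aef760b6b0 can be sketched roughly as follows. This is a simplified model and not the NEO implementation: `TagAllocationModel`, its members, and the byte-vector stand-in for the Tbx server are all hypothetical names introduced for illustration. It only shows why the two-step update and `downloadAllocation` need to share one lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical model, not NEO code: `mapped` is the driver-side copy of the
// tag allocation, `simulator` is the copy the Tbx server holds.
class TagAllocationModel {
  public:
    explicit TagAllocationModel(std::size_t size) : mapped(size, 0), simulator(size, 0) {}

    // Two-step host function update, performed under one lock.
    void updateHostFunctionData(std::size_t offset, std::uint8_t value) {
        std::lock_guard<std::mutex> guard(updateLock);
        assert(offset < mapped.size());
        mapped[offset] = value;             // step 1: update the mapped data
        simulator[offset] = mapped[offset]; // step 2: write memory so Tbx sees it
    }

    // Copies simulator contents back into the mapped data. Taking the same
    // lock means it can never run between step 1 and step 2.
    void downloadAllocation() {
        std::lock_guard<std::mutex> guard(updateLock);
        mapped = simulator;
    }

    std::uint8_t readMapped(std::size_t offset) {
        std::lock_guard<std::mutex> guard(updateLock);
        return mapped.at(offset);
    }

  private:
    std::mutex updateLock;
    std::vector<std::uint8_t> mapped;
    std::vector<std::uint8_t> simulator;
};
```

Without the shared lock, a download that lands after step 1 but before step 2 would overwrite the fresh mapped value with the stale simulator copy, which is the revert the commit message describes.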
Mateusz Hoppe
00b4219adb refactor: defer hwContext creation for aubs and tbx
- create HardwareContext when osContext is setup and initialized

Related-To: NEO-16666

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2025-12-10 19:20:28 +01:00
Jaroslaw Warchulski
60376bd98a refactor: cleanup includes
Signed-off-by: Jaroslaw Warchulski <jaroslaw.warchulski@intel.com>
2025-12-10 09:33:04 +01:00
Fabian Zwoliński
6102280f71 fix: add missing writeMemory for pooled global surface
Related-To: HSD-18043489182, HSD-18043476772
Signed-off-by: Fabian Zwoliński <fabian.zwolinski@intel.com>
2025-10-17 14:26:54 +02:00
Mateusz Jablonski
35f6dc12b8 refactor: remove not needed code
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2025-10-15 16:19:04 +02:00
Lukasz Jobczyk
ce1c5d747b fix: fix data race issue
Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2025-10-13 14:11:28 +02:00
Lukasz Jobczyk
6515e422e9 refactor: move eviction container to residency controller
Related-To: NEO-13315

Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>
2025-10-13 08:41:34 +02:00
Maciej Bielski
dcfe6c4a26 fix: add lock within processEviction() of TBX
Related-To: NEO-15630

Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>
2025-09-18 14:17:13 +02:00
Mateusz Jablonski
f13c18be8c refactor: remove not needed debug break
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2025-09-17 17:00:08 +02:00
Jakub Nowacki
a5025edc20 fix: make initializeEngine() thread-safe
Related-To: NEO-15630

Signed-off-by: Jakub Nowacki <jakub.nowacki@intel.com>
2025-09-17 11:01:33 +02:00
Mateusz Jablonski
e2f533e2a1 refactor: remove not needed code
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2025-09-15 14:19:40 +02:00
Jack Myers
f2b5126598 feature: enable tbx fault manager by default
Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-06-23 09:59:32 +02:00
Aleksandra Nizio
f0780df9be fix: Remove unused AubMemDump logic
Related-To: NEO-14718
Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>
2025-06-20 19:54:48 +02:00
Aleksandra Nizio
1dfc9227c4 fix: Removing address_mapper.h
Related-To: NEO-14718
Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>
2025-05-27 20:37:56 +02:00
Aleksandra Nizio
e43ec2bbfd fix: Removing stream
Related-To: NEO-14718
Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>
2025-05-27 18:53:02 +02:00
Aleksandra Nizio
31fe1978d4 fix: Removing streamProvider and addressMapper
Related-To: NEO-14718
Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>
2025-05-26 16:01:18 +02:00
Jack Myers
0e25970853 fix: re-add switch case for once writable query
A change related to the tbx fault manager
incorrectly removed a switch case from
`AubHelper::isOneTimeAubWritableAllocationType`.

This fixes that and refactors some APIs to prevent
similar mistakes from happening again by cleaning
up logic.

Addresses show stopper for pre-si pytorch workflows.

Resolves: NEO-14399
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-03-19 09:54:54 +01:00
Jack Myers
5f78147e16 fix: hotfix for svmcpu tbx uploads
Test program in the linked, related issue
is crashing in tbx mode. Tbx server indicated
upload of invalid memory was made before exit.

Running with debug messages showed that the
problematic upload was an svmcpu buffer when
running neo with separate cpu and gpu
buffers for shared memory management.

Using this info, the problem was narrowed down
to a missing unprotect call in page fault manager
related code, resulting in a protected (invalid)
memory region getting uploaded to tbx.

It is not yet clear why this unprotect call was not made,
since other svmcpu buffers were uploaded without issue.

This hotfix forces the unprotect call in the fault handler,
which allows the test program to run to completion. However,
there is now a failing test case.

Considering the critical nature of the associated
NEO issue and that this patch should unblock
the work depending on the fix, this hotfix should
get merged regardless of the failing test case.

In the meantime, I will continue triaging the
failing test and will implement a proper fix
once the root cause is isolated.

Related-To: NEO-13404
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-03-14 04:47:21 +01:00
Jack Myers
c26d24e555 fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-18 05:05:38 +01:00
Compute-Runtime-Validation
116f7270be Revert "fix: tbx page fault manager hang issue"
This reverts commit 7d4e70a25b.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2025-02-12 10:38:05 +01:00
Jack Myers
7d4e70a25b fix: tbx page fault manager hang issue
- Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being
faultable.
- Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and
`TbxPageFaultManager` to allow recursive locking.
- Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer`
in `TbxCommandStreamTests`.

Related-To: NEO-13748
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-02-12 02:19:37 +01:00
Jack Myers
d62122a656 fix: exceptions to TBX faultable types
This commit addresses a bug in the previous implementation where almost all once
writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX
faultable checks. The fix ensures that only the subset of once writable
types that are also lockable are considered TBX faultable, using the lockable
check to avoid manual exceptions and re-inventing the wheel.

Changes:

- Updated `isAllocTbxFaultable` method to check if the allocation type is
lockable in addition to being once writable.
- Refactored unit tests to include separate checks for lockable and non-lockable
allocation types.

Performance optimization:

- Removed unnecessary memory data erasure in `handlePageFault` to avoid constant
erase/insert operations, leveraging the O(1) search time of unordered maps.

Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-17 00:52:49 +01:00
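A rough sketch of the two ideas in d62122a656, under stated assumptions: `AllocationTraits`, `FaultEntry`, and `FaultTable` are hypothetical types invented here, and the real `isAllocTbxFaultable` works on NEO allocation types rather than on a pair of booleans. The sketch shows the combined "once writable and lockable" check, and a fault handler that updates a map entry in place instead of erasing and re-inserting it.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical traits; the real runtime derives these from the allocation type.
struct AllocationTraits {
    bool onceWritable;
    bool lockable;
};

// An allocation is treated as faultable only when both properties hold.
inline bool isTbxFaultable(const AllocationTraits &traits) {
    return traits.onceWritable && traits.lockable;
}

struct FaultEntry {
    std::size_t size;
    bool hostOwnsLatestCopy;
};

class FaultTable {
  public:
    void track(const void *ptr, std::size_t size) {
        entries[ptr] = FaultEntry{size, false};
    }

    // Returns true when the fault hit a tracked allocation. The entry is
    // updated in place and stays in the map for later faults.
    bool handlePageFault(const void *ptr) {
        auto it = entries.find(ptr);
        if (it == entries.end()) {
            return false;
        }
        it->second.hostOwnsLatestCopy = true;
        return true;
    }

    std::size_t trackedCount() const { return entries.size(); }

  private:
    std::unordered_map<const void *, FaultEntry> entries;
};
```

Keeping the entry alive trades a little memory for not paying an erase and a later insert on every host/device ownership flip.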
Jack Myers
0b2ac4d331 feature: Tbx faults for all once writable types
Patch #34223 introduced the TbxPageFaultManager for handling
uploads/downloads of host buffers to the Tbx server, ensuring
host memory is kept consistent between the host and device,
even after multiple alternating writes from the host and gpu.

This patch enables fault handling for all `isAubOnceWritable`
types.

Minor exception for gpuTimestampBuffers as enabling this type
seems to break things in real-world use cases outside of ULTs.

Related-To: NEO-12319
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-16 01:43:19 +01:00
Jack Myers
7f9fadc314 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

Recursive locking of mutex caused deadlock.

To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.

By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.

As an added safeguard against critical regressions
and another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2025-01-09 07:48:53 +01:00
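The deadlock fix in 7f9fadc314 can be illustrated with a small model. The class and member names below are hypothetical and the containers are simplified; the point is only that the tbx path holds its own mutex while calling into the base cpu path, which takes a different mutex, so no mutex is ever locked twice by the same thread.

```cpp
#include <mutex>
#include <unordered_set>

// Hypothetical base: cpu fault data guarded by its own mutex.
class CpuFaultDataModel {
  public:
    void insertCpu(const void *ptr) {
        std::lock_guard<std::mutex> guard(cpuLock);
        cpuData.insert(ptr);
    }

    bool containsCpu(const void *ptr) {
        std::lock_guard<std::mutex> guard(cpuLock);
        return cpuData.count(ptr) != 0;
    }

  private:
    std::mutex cpuLock;
    std::unordered_set<const void *> cpuData;
};

// Hypothetical tbx extension: separate data, separate mutex.
class TbxFaultDataModel : public CpuFaultDataModel {
  public:
    void insertTbx(const void *ptr) {
        std::lock_guard<std::mutex> guard(tbxLock);
        tbxData.insert(ptr);
    }

    // Holds tbxLock, then falls back to the base path, which takes cpuLock.
    // With one shared non-recursive mutex this nesting would deadlock.
    bool handleFault(const void *ptr) {
        std::lock_guard<std::mutex> guard(tbxLock);
        if (tbxData.count(ptr) != 0) {
            return true;
        }
        return containsCpu(ptr);
    }

  private:
    std::mutex tbxLock;
    std::unordered_set<const void *> tbxData;
};
```

The lock order is always tbx first, then cpu, so the two mutexes cannot be acquired in conflicting orders by this sketch.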
Compute-Runtime-Validation
124e755b9d Revert "fix: regression caused by tbx fault mngr"
This reverts commit 9a14fe2478.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-19 17:35:03 +01:00
Jack Myers
9a14fe2478 fix: regression caused by tbx fault mngr
Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

This fixes attempts by the tbx fault manager
to protect/unprotect host buffer memory, even
if the host ptr was not driver-allocated.

In the case of the smoke test that triggered
the critical regression, clCreateBuffer was
called with the CL_MEM_USE_HOST_PTR flag.
The subsequent `mprotect` calls on the
provided host ptr then failed.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation
6c5d9a6ed7 Revert "feature: extend TBX page fault manager from CPU implementation"
This reverts commit 51c0e80299.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2024-12-12 12:30:22 +01:00
Jack Myers
51c0e80299 feature: extend TBX page fault manager from CPU implementation
In TBX mode, the host could not write to host buffers after access from device
code due to the lack of a migration mechanism post-initial TBX upload.
Migration is unnecessary with real hardware, but required for TBX.

This patch introduces a new page fault manager type that extends the original
CPU fault manager, enabling automatic migration of host buffers in TBX mode.

Refactoring was necessary to avoid diamond inheritance, achieved by using a
template parameter as the base class for OS-specific fault managers.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
2024-12-11 09:09:50 +01:00
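The refactoring mentioned in 51c0e80299 (a template parameter as the base class of the OS-specific fault manager) might look roughly like this. All class names and the returned strings are placeholders invented for the sketch; only the inheritance shape is taken from the commit message.

```cpp
#include <string>

// Hypothetical generic cpu fault manager.
class CpuPageFaultManagerModel {
  public:
    virtual ~CpuPageFaultManagerModel() = default;
    virtual std::string handleFault() { return "cpu-migrate"; }
};

// Hypothetical tbx fault manager extending the cpu one.
class TbxPageFaultManagerModel : public CpuPageFaultManagerModel {
  public:
    std::string handleFault() override { return "tbx-migrate"; }
};

// OS-specific layer parameterized on its base class. Each instantiation is
// one linear inheritance chain, so no diamond is formed.
template <class BaseFaultManager>
class OsPageFaultManagerModel : public BaseFaultManager {
  public:
    std::string protectMemory() { return "os-protect"; }
};

using OsCpuPageFaultManager = OsPageFaultManagerModel<CpuPageFaultManagerModel>;
using OsTbxPageFaultManager = OsPageFaultManagerModel<TbxPageFaultManagerModel>;
```

The alternative, an OS-specific class and a tbx class both deriving from the cpu manager and then being combined, would need virtual inheritance to avoid two copies of the base.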
Bartosz Dunajski
dab4166837 fix: add missing aub polls on sync points
Related-To: HSD-14023925176

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-11-21 09:17:54 +01:00
Bartosz Dunajski
dd8460beba refactor: reduce TBX download timeout for unit tests
Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-09-09 19:05:03 +02:00
Bartosz Dunajski
db611962f7 fix: improve task count handling in tbx download path
Related-To: HSD-18039789178

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-28 15:32:15 +02:00
Szymon Morek
b8f181d50e performance: remove trim candidate list
Related-To: NEO-11755

Removing trim candidate list reduces overhead
caused by residency handling. Allocations required
for eviction are placed in eviction container managed
by CSR.

Signed-off-by: Szymon Morek <szymon.morek@intel.com>
2024-08-23 12:21:50 +02:00
Bartosz Dunajski
696b02bfd3 fix: improve TBX downloading after L0 Event sync
Related-To: HSD-18038498579

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-23 10:42:17 +02:00
Bartosz Dunajski
24cfd203ab fix: don't download tbx allocations on heapless first device submission
Related-To: HSD-18039476929

Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>
2024-08-06 14:03:42 +02:00
Mateusz Hoppe
b3d72ddd3d fix: write memory for resident allocations in simulation mode
- refactor and call processFlushResidency() on memoryOperationsHandler
- call free() to remove allocation from resident allocations when
graphics allocation is released

Related-To: NEO-11719

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2024-06-14 18:49:01 +02:00
Mateusz Jablonski
cb2b572e94 feature: add support for null aub mode
In this mode the AUB csr is created, but no aub file is written

Related-To: NEO-11097
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2024-04-09 16:59:42 +02:00
Filip Hazubski
d25026b263 refactor: Add getTotalMemBankSize function to ReleaseHelper
Minor refactor of ULTs to not use a hard-coded bank size.

Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>
2024-03-06 09:53:56 +01:00
Michal Mrozek
64232ec370 fix: choose proper csr for low priority immediate command lists
Resolves: NEO-10168

Signed-off-by: Michal Mrozek <michal.mrozek@intel.com>
Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2024-02-28 12:45:02 +01:00
Mateusz Jablonski
de93bc6928 refactor: correct naming of enum class constants 10/n
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-12-19 11:30:39 +01:00
Mateusz Jablonski
739d181026 refactor: correct naming of enum class constants 6/n
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-12-13 14:48:52 +01:00
Mateusz Jablonski
c9664e6bad refactor: rename global debug manager to debugManager
Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-11-30 13:00:59 +01:00
Mateusz Hoppe
83ac95d293 fix: L0 - remove synchronization with events on appends in tbx mode
Related-To: NEO-9400

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-11-27 10:39:55 +01:00
Compute-Runtime-Validation
fca2159430 Revert "fix: if device hierarchy is flat then getSubDevicesCount return 1u"
This reverts commit cb0bb57f49.

Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>
2023-10-26 15:40:29 +02:00
Baj, Tomasz
cb0bb57f49 fix: if device hierarchy is flat then getSubDevicesCount return 1u
Related-To: NEO-9167

Signed-off-by: Baj, Tomasz <tomasz.baj@intel.com>
2023-10-25 15:51:52 +02:00
Mateusz Hoppe
52b0f32688 fix: offset cpu address when writing chunk in simulated csr
- when writing memory, the cpu address of the data must be offset
as well, not only gpuAddress.

Related-To: GSD-6604

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-10-23 17:01:25 +02:00
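A minimal sketch of the chunk-write fix in 52b0f32688, assuming a flat byte array as the simulator memory. `SimulatedMemory` and `writeMemoryChunk` are hypothetical names and signatures, not the ones used by the simulated csr; the sketch only shows the chunk offset being applied to both the GPU address and the CPU source pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat model of simulator memory, indexed by GPU address.
struct SimulatedMemory {
    std::vector<std::uint8_t> bytes;

    void write(std::uint64_t gpuAddress, const void *cpuAddress, std::size_t size) {
        assert(gpuAddress + size <= bytes.size());
        std::memcpy(bytes.data() + gpuAddress, cpuAddress, size);
    }
};

// Writes the range [chunkOffset, chunkOffset + chunkSize) of an allocation.
// The same offset is added to the GPU base and to the CPU base; offsetting
// only the GPU side would copy the start of the buffer to the wrong place.
inline void writeMemoryChunk(SimulatedMemory &memory, std::uint64_t gpuBase,
                             const std::uint8_t *cpuBase, std::size_t allocationSize,
                             std::size_t chunkOffset, std::size_t chunkSize) {
    assert(chunkOffset + chunkSize <= allocationSize);
    memory.write(gpuBase + chunkOffset, cpuBase + chunkOffset, chunkSize);
}
```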
Dunajski, Bartosz
25195ebc96 fix: capability to write memory chunk in aub/tbx mode
Related-To: GSD-6604

Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-10-19 19:13:11 +02:00
Mateusz Hoppe
f5cb7df7cd fix: do not download event allocation in TBX mode
- only download when allocation was used - indicated by taskCount
Resolves: NEO-8312

Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>
2023-08-29 16:27:33 +02:00
Dunajski, Bartosz
cd9ad1f04c fix: decanonize GPU VA during TBX memory read.
Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>
2023-07-26 19:44:19 +02:00
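For readers unfamiliar with the term in cd9ad1f04c, the following sketch shows what canonizing and decanonizing a virtual address generally means. The helper names, their signatures, and the 48-bit width used in the usage note are assumptions for illustration; the runtime's own helpers and address width may differ.

```cpp
#include <cassert>
#include <cstdint>

// Canonical form: every bit above the address width repeats the top
// address bit (sign extension).
inline std::uint64_t canonize(std::uint64_t address, unsigned addressWidth) {
    assert(addressWidth > 0 && addressWidth < 64);
    const std::uint64_t mask = (std::uint64_t{1} << addressWidth) - 1;
    const std::uint64_t topBit = std::uint64_t{1} << (addressWidth - 1);
    return (address & topBit) != 0 ? (address | ~mask) : (address & mask);
}

// Decanonized form: only the low `addressWidth` bits are kept.
inline std::uint64_t decanonize(std::uint64_t address, unsigned addressWidth) {
    assert(addressWidth > 0 && addressWidth < 64);
    const std::uint64_t mask = (std::uint64_t{1} << addressWidth) - 1;
    return address & mask;
}
```

With a 48-bit width, `decanonize(0xFFFF800000000000, 48)` yields `0x0000800000000000`, which is the form a simulator indexing memory by raw address would presumably expect.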
Mateusz Jablonski
30c5d8a681 fix: pass gmm helper to getDumpSurfaceInfo function
gmm may not exist for buffer allocation

Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>
2023-07-03 11:59:52 +02:00