compute-runtime

mirror of https://github.com/intel/compute-runtime.git synced 2026-01-10 07:08:04 +08:00

Author	SHA1	Message	Date
Mateusz Hoppe	b4e4fcf786	feature: add experimental extension to verify memory in aub mode Related-To: NEO-14153, NEO-17038 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2025-12-16 13:57:32 +01:00
Kamil Kopryk	aef760b6b0	feature: add tbx support for host functions Tbx requires a memory write after changing a mapped allocation from the driver side. Host functions use bytes mapped from tagAllocation. Host function data update has 2 steps: * update the mapped data in the driver * write memory so Tbx can see the data Tag allocation can be pulled (downloadAllocation) e.g. while waiting, and at the same time the host function worker thread can update the data. In such a scenario the updated mapped data could be reverted by a concurrent downloadAllocation call. I've added a lock to prevent concurrent downloadAllocation calls overlapping the two-step tbx host function data update. Related-To: NEO-14577 Signed-off-by: Kamil Kopryk <kamil.kopryk@intel.com>	2025-12-16 09:55:51 +01:00
Mateusz Hoppe	00b4219adb	refactor: defer hwContext creation for aubs and tbx - create HardwareContext when osContext is setup and initialized Related-To: NEO-16666 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2025-12-10 19:20:28 +01:00
Jaroslaw Warchulski	60376bd98a	refactor: cleanup includes Signed-off-by: Jaroslaw Warchulski <jaroslaw.warchulski@intel.com>	2025-12-10 09:33:04 +01:00
Fabian Zwoliński	6102280f71	fix: add missing writeMemory for pooled global surface Related-To: HSD-18043489182, HSD-18043476772 Signed-off-by: Fabian Zwoliński <fabian.zwolinski@intel.com>	2025-10-17 14:26:54 +02:00
Mateusz Jablonski	35f6dc12b8	refactor: remove not needed code Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2025-10-15 16:19:04 +02:00
Lukasz Jobczyk	ce1c5d747b	fix: fix data race issue Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2025-10-13 14:11:28 +02:00
Lukasz Jobczyk	6515e422e9	refactor: move eviction container to residency controller Related-To: NEO-13315 Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2025-10-13 08:41:34 +02:00
Maciej Bielski	dcfe6c4a26	fix: add lock within processEviction() of TBX Related-To: NEO-15630 Signed-off-by: Maciej Bielski <maciej.bielski@intel.com>	2025-09-18 14:17:13 +02:00
Mateusz Jablonski	f13c18be8c	refactor: remove not needed debug break Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2025-09-17 17:00:08 +02:00
Jakub Nowacki	a5025edc20	fix: make initializeEngine() thread-safe Related-To: NEO-15630 Signed-off-by: Jakub Nowacki <jakub.nowacki@intel.com>	2025-09-17 11:01:33 +02:00
Mateusz Jablonski	e2f533e2a1	refactor: remove not needed code Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2025-09-15 14:19:40 +02:00
Jack Myers	f2b5126598	feature: enable tbx fault manager by default Related-To: NEO-13748 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-06-23 09:59:32 +02:00
Aleksandra Nizio	f0780df9be	fix: Remove unused AubMemDump logic Related-To: NEO-14718 Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>	2025-06-20 19:54:48 +02:00
Aleksandra Nizio	1dfc9227c4	fix: Removing address_mapper.h Related-To: NEO-14718 Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>	2025-05-27 20:37:56 +02:00
Aleksandra Nizio	e43ec2bbfd	fix: Removing stream Related-To: NEO-14718 Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>	2025-05-27 18:53:02 +02:00
Aleksandra Nizio	31fe1978d4	fix: Removing streamProvider and addressMapper Related-To: NEO-14718 Signed-off-by: Aleksandra Nizio <aleksandra.nizio@intel.com>	2025-05-26 16:01:18 +02:00
Jack Myers	0e25970853	fix: re-add switch case for once writable query A change related to the tbx fault manager incorrectly removed a switch case from `AubHelper::isOneTimeAubWritableAllocationType`. This fixes that and refactors some APIs to prevent similar mistakes from happening again by cleaning up logic. Addresses show stopper for pre-si pytorch workflows. Resolves: NEO-14399 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-03-19 09:54:54 +01:00
Jack Myers	5f78147e16	fix: hotfix for svmcpu tbx uploads Test program in the linked, related issue is crashing in tbx mode. Tbx server indicated upload of invalid memory was made before exit. Running with debug messages showed that the problematic upload was an svmcpu buffer when running neo with separate cpu and gpu buffers for shared memory management. Using this info, the problem was narrowed down to a missing unprotect call in page fault manager related code, resulting in a protected(invalid) memory region getting uploaded to tbx. It is unclear yet why this unprotect call was not made, since other svmcpu buffers were uploaded without issue. This hotfix forces the unprotect call in the fault handler, which allows the test program to run to completion. However, there is now a failing test case. Considering the critical nature of the associated NEO issue and that this patch should unblock the work depending on the fix, this hotfix should get merged regardless of the failing test case. In the meantime, I will continue triaging the failing test and will implement a proper fix once the root cause is isolated. Related-To: NEO-13404 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-03-14 04:47:21 +01:00
Jack Myers	c26d24e555	fix: tbx page fault manager hang issue - Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being faultable. - Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and `TbxPageFaultManager` to allow recursive locking. - Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer` in `TbxCommandStreamTests`. Related-To: NEO-13748 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-02-18 05:05:38 +01:00
Compute-Runtime-Validation	116f7270be	Revert "fix: tbx page fault manager hang issue" This reverts commit `7d4e70a25b`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2025-02-12 10:38:05 +01:00
Jack Myers	7d4e70a25b	fix: tbx page fault manager hang issue - Updated `isAllocTbxFaultable` to exclude `gpuTimestampDeviceBuffer` from being faultable. - Replaced `SpinLock` with `RecursiveSpinLock` in `CpuPageFaultManager` and `TbxPageFaultManager` to allow recursive locking. - Added unit tests to verify the correct handling of `gpuTimestampDeviceBuffer` in `TbxCommandStreamTests`. Related-To: NEO-13748 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-02-12 02:19:37 +01:00
Jack Myers	d62122a656	fix: exceptions to TBX faultable types This commit addresses a bug in the previous implementation where almost all once writable types, except `gpuTimestampBuffers`, were incorrectly enabled for TBX faultable checks. The fix ensures that only the subset of once writable types that are also lockable are considered TBX faultable, using the lockable check to avoid manual exceptions and re-inventing the wheel. Changes: - Updated `isAllocTbxFaultable` method to check if the allocation type is lockable in addition to being once writable. - Refactored unit tests to include separate checks for lockable and non-lockable allocation types. Performance optimization: - Removed unnecessary memory data erasure in `handlePageFault` to avoid constant erase/insert operations, leveraging the O(1) search time of unordered maps. Related-To: NEO-12319 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-17 00:52:49 +01:00
Jack Myers	0b2ac4d331	feature: Tbx faults for all once writable types Patch #34223 introduced the TbxPageFaultManager for handling uploads/downloads of host buffers to the Tbx server, ensuring host memory is kept consistent between the host and device, even after multiple alternating writes from the host and gpu. This patch enables fault handling for all `isAubOnceWritable` types. Minor exception for gpuTimestampBuffers as enabling this type seems to break things in real-world use cases outside of ULTs. Related-To: NEO-12319 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-16 01:43:19 +01:00
Jack Myers	7f9fadc314	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. Recursive locking of mutex caused deadlock. To fix, separate tbx fault data from base cpu fault data, allowing separate mutexes for each, eliminating recursive locks on the same mutex. By separating, we also help ensure that tbx-related changes don't affect the original cpu fault manager code paths. As an added safe guard preventing critical regressions and avoiding another auto-revert, the tbx fault manager is hidden behind a new debug flag which is disabled by default. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-09 07:48:53 +01:00
Compute-Runtime-Validation	124e755b9d	Revert "fix: regression caused by tbx fault mngr" This reverts commit `9a14fe2478`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-19 17:35:03 +01:00
Jack Myers	9a14fe2478	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. This fixes attempts by the tbx fault manager to protect/unprotect host buffer memory, even if the host ptr was not driver-allocated. In the case of the smoke test that triggered the critical regression, clCreateBuffer was called with the CL_MEM_USE_HOST_PTR flag. The subsequent `mprotect` calls on the provided host ptr then failed. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation	6c5d9a6ed7	Revert "feature: extend TBX page fault manager from CPU implementation" This reverts commit `51c0e80299`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-12 12:30:22 +01:00
Jack Myers	51c0e80299	feature: extend TBX page fault manager from CPU implementation In TBX mode, the host could not write to host buffers after access from device code due to the lack of a migration mechanism post-initial TBX upload. Migration is unnecessary with real hardware, but required for TBX. This patch introduces a new page fault manager type that extends the original CPU fault manager, enabling automatic migration of host buffers in TBX mode. Refactoring was necessary to avoid diamond inheritance, achieved by using a template parameter as the base class for OS-specific fault managers. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-11 09:09:50 +01:00
Bartosz Dunajski	dab4166837	fix: add missing aub polls on sync points Related-To: HSD-14023925176 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-11-21 09:17:54 +01:00
Bartosz Dunajski	dd8460beba	refactor: reduce TBX download timeout for unit tests Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-09-09 19:05:03 +02:00
Bartosz Dunajski	db611962f7	fix: improve task count handling in tbx download path Related-To: HSD-18039789178 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-28 15:32:15 +02:00
Szymon Morek	b8f181d50e	performance: remove trim candidate list Related-To: NEO-11755 Removing trim candidate list reduces overhead caused by residency handling. Allocations required for eviction are placed in eviction container managed by CSR. Signed-off-by: Szymon Morek <szymon.morek@intel.com>	2024-08-23 12:21:50 +02:00
Bartosz Dunajski	696b02bfd3	fix: improve TBX downloading after L0 Event sync Related-To: HSD-18038498579 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-23 10:42:17 +02:00
Bartosz Dunajski	24cfd203ab	fix: dont download tbx allocations on heapless first device submission Related-To: HSD-18039476929 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-06 14:03:42 +02:00
Mateusz Hoppe	b3d72ddd3d	fix: write memory for resident allocations in simulation mode - refactor and call processFlushResidency() on memoryOperationsHandler - call free() to remove allocation from resident allocations when graphics allocation is released Related-To: NEO-11719 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2024-06-14 18:49:01 +02:00
Mateusz Jablonski	cb2b572e94	feature: add support for null aub mode In this mode AUB csr will be created, however, no aub file will be created Related-To: NEO-11097 Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2024-04-09 16:59:42 +02:00
Filip Hazubski	d25026b263	refactor: Add getTotalMemBankSize function to ReleaseHelper Minor refactor of ULTs to not use hard coded banks size. Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>	2024-03-06 09:53:56 +01:00
Michal Mrozek	64232ec370	fix: choose proper csr for low priority immediate command lists Resolves: NEO-10168 Signed-off-by: Michal Mrozek <michal.mrozek@intel.com> Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2024-02-28 12:45:02 +01:00
Mateusz Jablonski	de93bc6928	refactor: correct naming of enum class constants 10/n Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2023-12-19 11:30:39 +01:00
Mateusz Jablonski	739d181026	refactor: correct naming of enum class constants 6/n Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2023-12-13 14:48:52 +01:00
Mateusz Jablonski	c9664e6bad	refactor: rename global debug manager to debugManager Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2023-11-30 13:00:59 +01:00
Mateusz Hoppe	83ac95d293	fix: L0 - remove synchronization with events on appends in tbx mode Related-To: NEO-9400 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2023-11-27 10:39:55 +01:00
Compute-Runtime-Validation	fca2159430	Revert "fix: if device hierarchy is flat then getSubDevicesCount return 1u" This reverts commit `cb0bb57f49`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2023-10-26 15:40:29 +02:00
Baj, Tomasz	cb0bb57f49	fix: if device hierarchy is flat then getSubDevicesCount return 1u Related-To: NEO-9167 Signed-off-by: Baj, Tomasz <tomasz.baj@intel.com>	2023-10-25 15:51:52 +02:00
Mateusz Hoppe	52b0f32688	fix: offset cpu address when writing chunk in simulated csr - not only gpuAddress is offset but also cpu address with data needs to be offset while writing memory. Related-To: GSD-6604 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2023-10-23 17:01:25 +02:00
Dunajski, Bartosz	25195ebc96	fix: capability to write memory chunk in aub/tbx mode Related-To: GSD-6604 Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2023-10-19 19:13:11 +02:00
Mateusz Hoppe	f5cb7df7cd	fix: do not download event allocation in TBX mode - only download when allocation was used - indicated by taskCount Resolves: NEO-8312 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2023-08-29 16:27:33 +02:00
Dunajski, Bartosz	cd9ad1f04c	fix: decanonize GPU VA during TBX memory read. Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2023-07-26 19:44:19 +02:00
Mateusz Jablonski	30c5d8a681	fix: pass gmm helper to getDumpSurfaceInfo function gmm may not exist for buffer allocation Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2023-07-03 11:59:52 +02:00

88 Commits