compute-runtime

Commit Graph

Author	SHA1	Message	Date
Jack Myers	5f78147e16	fix: hotfix for svmcpu tbx uploads Test program in the linked, related issue is crashing in tbx mode. Tbx server indicated upload of invalid memory was made before exit. Running with debug messages showed that the problematic upload was an svmcpu buffer when running neo with separate cpu and gpu buffers for shared memory management. Using this info, the problem was narrowed down to a missing unprotect call in page fault manager related code, resulting in a protected(invalid) memory region getting uploaded to tbx. It is unclear yet why this unprotect call was not made, since other svmcpu buffers were uploaded without issue. This hotfix forces the unprotect call in the fault handler, which allows the test program to run to completion. However, there is now a failing test case. Considering the critical nature of the associated NEO issue and that this patch should unblock the work depending on the fix, this hotfix should get merged regardless of the failing test case. In the meantime, I will continue triaging the failing test and will implement a proper fix once the root cause is isolated. Related-To: NEO-13404 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-03-14 04:47:21 +01:00
Jack Myers	7f9fadc314	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. Recursive locking of mutex caused deadlock. To fix, separate tbx fault data from base cpu fault data, allowing separate mutexes for each, eliminating recursive locks on the same mutex. By separating, we also help ensure that tbx-related changes don't affect the original cpu fault manager code paths. As an added safe guard preventing critical regressions and avoiding another auto-revert, the tbx fault manager is hidden behind a new debug flag which is disabled by default. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2025-01-09 07:48:53 +01:00
Compute-Runtime-Validation	124e755b9d	Revert "fix: regression caused by tbx fault mngr" This reverts commit `9a14fe2478`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-19 17:35:03 +01:00
Jack Myers	9a14fe2478	fix: regression caused by tbx fault mngr Addresses regressions from the reverted merge of the tbx fault manager for host memory. This fixes attempts by the tbx fault manager to protect/unprotect host buffer memory, even if the host ptr was not driver-allocated. In the case of the smoke test that triggered the critical regression, clCreateBuffer was called with the CL_MEM_USE_HOST_PTR flag. The subsequent `mprotect` calls on the provided host ptr then failed. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-18 23:16:36 +01:00
Compute-Runtime-Validation	6c5d9a6ed7	Revert "feature: extend TBX page fault manager from CPU implementation" This reverts commit `51c0e80299`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2024-12-12 12:30:22 +01:00
Jack Myers	51c0e80299	feature: extend TBX page fault manager from CPU implementation In TBX mode, the host could not write to host buffers after access from device code due to the lack of a migration mechanism post-initial TBX upload. Migration is unnecessary with real hardware, but required for TBX. This patch introduces a new page fault manager type that extends the original CPU fault manager, enabling automatic migration of host buffers in TBX mode. Refactoring was necessary to avoid diamond inheritance, achieved by using a template parameter as the base class for OS-specific fault managers. Related-To: NEO-12268 Signed-off-by: Jack Myers <jack.myers@intel.com>	2024-12-11 09:09:50 +01:00
Bartosz Dunajski	dab4166837	fix: add missing aub polls on sync points Related-To: HSD-14023925176 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-11-21 09:17:54 +01:00
Bartosz Dunajski	dd8460beba	refactor: reduce TBX download timeout for unit tests Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-09-09 19:05:03 +02:00
Bartosz Dunajski	db611962f7	fix: improve task count handling in tbx download path Related-To: HSD-18039789178 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-28 15:32:15 +02:00
Szymon Morek	b8f181d50e	performance: remove trim candidate list Related-To: NEO-11755 Removing trim candidate list reduces overhead caused by residency handling. Allocations required for eviction are placed in eviction container managed by CSR. Signed-off-by: Szymon Morek <szymon.morek@intel.com>	2024-08-23 12:21:50 +02:00
Bartosz Dunajski	696b02bfd3	fix: improve TBX downloading after L0 Event sync Related-To: HSD-18038498579 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-23 10:42:17 +02:00
Bartosz Dunajski	24cfd203ab	fix: dont download tbx allocations on heapless first device submission Related-To: HSD-18039476929 Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2024-08-06 14:03:42 +02:00
Mateusz Jablonski	cb2b572e94	feature: add support for null aub mode In this mode AUB csr will be created, however, no aub file will be created Related-To: NEO-11097 Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2024-04-09 16:59:42 +02:00
Mateusz Hoppe	83ac95d293	fix: L0 - remove synchronization with events on appends in tbx mode Related-To: NEO-9400 Signed-off-by: Mateusz Hoppe <mateusz.hoppe@intel.com>	2023-11-27 10:39:55 +01:00
Jablonski, Mateusz	ac5f64f5c6	fix: fix compilation error in clang on Windows (2/n) Signed-off-by: Jablonski, Mateusz <mateusz.jablonski@intel.com>	2023-10-24 15:59:06 +02:00
Dunajski, Bartosz	25195ebc96	fix: capability to write memory chunk in aub/tbx mode Related-To: GSD-6604 Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2023-10-19 19:13:11 +02:00
Dunajski, Bartosz	cd9ad1f04c	fix: decanonize GPU VA during TBX memory read. Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2023-07-26 19:44:19 +02:00
Warchulski, Jaroslaw	6fb68dd84b	Separation of MemoryAllocation from os_agnostic_memory_manager.h Related-To: NEO-5548 Signed-off-by: Warchulski, Jaroslaw <jaroslaw.warchulski@intel.com>	2023-01-04 15:09:36 +01:00
Zbigniew Zdanowicz	c0c9ce548a	Validate level zero events in TBX mode Related-To: NEO-7545 Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>	2022-11-28 16:45:02 +01:00
Maciej Plewka	4b42b066f8	Use dedicated using type for TaskCount Related-To: NEO-7155 Signed-off-by: Maciej Plewka <maciej.plewka@intel.com>	2022-11-28 16:44:44 +01:00
Mateusz Jablonski	a17df8fa86	Return SubmissionStatus from processResidency method it allows to return non-binary status to API layer Related-To: NEO-7412 Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2022-11-15 13:17:43 +01:00
Dunajski, Bartosz	06a647a5e9	Set SkipResourceCleanup in TBX mode Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2022-10-27 12:23:08 +02:00
Compute-Runtime-Validation	638aba45a0	Revert "Set SkipResourceCleanup in TBX mode" This reverts commit `cb83c1d935`. Signed-off-by: Compute-Runtime-Validation <compute-runtime-validation@intel.com>	2022-10-26 07:09:29 +02:00
Dunajski, Bartosz	cb83c1d935	Set SkipResourceCleanup in TBX mode Signed-off-by: Dunajski, Bartosz <bartosz.dunajski@intel.com>	2022-10-25 14:31:35 +02:00
Fabian Zwolinski	645600d141	Return error when there is no memory to evict We want to return error code to the application instead of aborting when we are not able to make more memory resident. Related-To: NEO-7289 Signed-off-by: Fabian Zwolinski <fabian.zwolinski@intel.com>	2022-09-22 14:26:55 +02:00
Krzysztof Gibala	2fcda0a528	Refactor: Change decanonize method accessing point Accessing decanonize method as a member of GmmHelper class object Signed-off-by: Krzysztof Gibala <krzysztof.gibala@intel.com>	2022-05-11 12:57:02 +02:00
Jobczyk, Lukasz	a285712cc4	Add missing download allocation calls Signed-off-by: Jobczyk, Lukasz <lukasz.jobczyk@intel.com> Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2022-03-31 09:49:22 +02:00
Lukasz Jobczyk	a230f267e1	Poll task count indefinitely on high throttle command queue Resolves: NEO-6781 Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2022-03-25 10:06:16 +01:00
Patryk Wrobel	7f729b7f89	Detect GPU hang in clWaitForEvents This change: - moves NEO::WaitStatus to a separate file - enables detection of GPU hang in clWaitForEvents - adjusts most of blocking calls in CommandStreamReceiver to return WaitStatus - adds ULTs to cover the new code Related-To: NEO-6681 Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>	2022-02-23 13:33:09 +01:00
Patryk Wrobel	498cf5e871	Implement GPU hang detection This change uses DRM_IOCTL_I915_GET_RESET_STATS to detect GPU hangs. When such situation is encountered, then zeCommandQueueSynchronize returns ZE_RESULT_ERROR_DEVICE_LOST. Related-To: NEO-5313 Signed-off-by: Patryk Wrobel <patryk.wrobel@intel.com>	2022-01-31 13:48:17 +01:00
Raiyan Latif	394c0e90e1	Return error when failing on submission Signed-off-by: Raiyan Latif <raiyan.latif@intel.com>	2022-01-12 16:42:30 +01:00
Lukasz Jobczyk	b59b0b6b36	Download timestamps before checking completion Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2021-12-28 08:14:27 +01:00
Lukasz Jobczyk	7f1c87f049	Fix flush tag update in TBX mode Signed-off-by: Lukasz Jobczyk <lukasz.jobczyk@intel.com>	2021-11-24 12:30:29 +01:00
Filip Hazubski	f1b696824c	Ensure engine is initialized when writing memory in AUB and TBX modes Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>	2021-10-14 19:45:43 +02:00
Filip Hazubski	27b5952ba3	Correct initializeEngine usage in aub tests Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>	2021-10-14 19:36:00 +02:00
Zbigniew Zdanowicz	3b35ba504f	Adapt command stream receiver to multiple active partitions Related-To: NEO-6244 Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>	2021-09-23 14:32:20 +02:00
Mateusz Jablonski	f8867e0b97	Move generic command stream receiver files to shared Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2021-09-22 23:55:31 +02:00
Mateusz Jablonski	d348526941	Simplify checkAndActivateAubSubCapture method Signed-off-by: Mateusz Jablonski <mateusz.jablonski@intel.com>	2021-09-22 20:23:56 +02:00
Zbigniew Zdanowicz	cd4f3c221a	Synchronize switching command buffers for all partitions Signed-off-by: Zbigniew Zdanowicz <zbigniew.zdanowicz@intel.com>	2021-09-08 12:12:23 +02:00
Filip Hazubski	29c64c3dd0	Disable implicit scaling for cooperative kernels When implicit scaling is disabled use useSingleSubdeviceValue = true. Resolves: NEO-5757 Signed-off-by: Filip Hazubski <filip.hazubski@intel.com>	2021-08-18 14:56:37 +02:00
Bartosz Dunajski	c7a936d1f4	Add memory banks to Simulated CSR Signed-off-by: Bartosz Dunajski <bartosz.dunajski@intel.com>	2021-07-05 12:19:58 +02:00
Dominik Dabek	39f0387ecc	Move tbx stream, tbx csr to shared Related-To: NEO-5161 Signed-off-by: Dominik Dabek <dominik.dabek@intel.com>	2021-05-31 14:35:32 +02:00

42 Commits