fix: regression caused by tbx fault mngr

Addresses regressions from the reverted merge
of the tbx fault manager for host memory.

Recursive locking of mutex caused deadlock.

To fix, separate tbx fault data from base
cpu fault data, allowing separate mutexes
for each, eliminating recursive locks on
the same mutex.

By separating, we also help ensure that tbx-related
changes don't affect the original cpu fault manager code
paths.

As an added safe guard preventing critical regressions
and avoiding another auto-revert, the tbx fault manager
is hidden behind a new debug flag which is disabled by default.

Related-To: NEO-12268
Signed-off-by: Jack Myers <jack.myers@intel.com>
This commit is contained in:
Jack Myers
2024-12-23 21:42:43 +00:00
committed by Compute-Runtime-Automation
parent b8157a2547
commit 7f9fadc314
36 changed files with 874 additions and 169 deletions

View File

@@ -1,5 +1,5 @@
/*
* Copyright (C) 2019-2024 Intel Corporation
* Copyright (C) 2019-2025 Intel Corporation
*
* SPDX-License-Identifier: MIT
*
@@ -16,14 +16,14 @@
#include "opencl/source/command_queue/csr_selection_args.h"
namespace NEO {
void PageFaultManager::transferToCpu(void *ptr, size_t size, void *cmdQ) {
void CpuPageFaultManager::transferToCpu(void *ptr, size_t size, void *cmdQ) {
auto commandQueue = static_cast<CommandQueue *>(cmdQ);
commandQueue->getDevice().stopDirectSubmissionForCopyEngine();
auto retVal = commandQueue->enqueueSVMMap(true, CL_MAP_WRITE, ptr, size, 0, nullptr, nullptr, false);
UNRECOVERABLE_IF(retVal);
}
void PageFaultManager::transferToGpu(void *ptr, void *cmdQ) {
void CpuPageFaultManager::transferToGpu(void *ptr, void *cmdQ) {
auto commandQueue = static_cast<CommandQueue *>(cmdQ);
commandQueue->getDevice().stopDirectSubmissionForCopyEngine();
@@ -37,7 +37,7 @@ void PageFaultManager::transferToGpu(void *ptr, void *cmdQ) {
UNRECOVERABLE_IF(allocData == nullptr);
this->evictMemoryAfterImplCopy(allocData->cpuAllocation, &commandQueue->getDevice());
}
void PageFaultManager::allowCPUMemoryEviction(bool evict, void *ptr, PageFaultData &pageFaultData) {
void CpuPageFaultManager::allowCPUMemoryEviction(bool evict, void *ptr, PageFaultData &pageFaultData) {
auto commandQueue = static_cast<CommandQueue *>(pageFaultData.cmdQ);
auto allocData = memoryData[ptr].unifiedMemoryManager->getSVMAlloc(ptr);