Currently `handleCacheControlINTELForPrefetch` requires the type size to perform the call conversion correctly,
but with opaque pointers that size is not available and cannot simply be recovered.
The planned "OpUntypedPrefetch" extension will support opaque pointers in such cases.
Until then, this change skips the whole prefetch conversion whenever an opaque pointer type is involved,
while we wait for an update on the status of "OpUntypedPrefetch".
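A minimal sketch of the skip, assuming an LLVM version that still distinguishes typed and opaque pointers (`PointerType::isOpaque()`); the helper name and the choice of argument are illustrative, not the actual translator code:
```cpp
// Sketch only: decide whether to skip the prefetch call conversion because the
// pointee type (and hence its size) cannot be recovered from the address operand.
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"

static bool shouldSkipPrefetchConversion(const llvm::CallInst &CI) {
  auto *PtrTy =
      llvm::dyn_cast<llvm::PointerType>(CI.getArgOperand(0)->getType());
  return PtrTy && PtrTy->isOpaque(); // opaque pointer: no element type/size
}
```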
Don't swap src0 and src1 of a pseudo_mad instruction in HWConformity if the swap
would create an invalid datatype combination. For example:
pseudo_mad (32) result1(0,0)<1>:d x1(0,0)<2;0>:uw r0.1<0;0>:d z(0,0)<1;0>:d
Here we normally swap src0 (actually src2) and src1 when src1 is a scalar but src0 is
not, because src0 (actually src2) has no regioning support:
pseudo_mad (32) result1(0,0)<1>:d r0.1<0;0>:d x1(0,0)<2;0>:uw z(0,0)<1;0>:d
After the swap the datatype combination is invalid: it changes from (W * D + D) to
(D * W + D), and when src2 (actually src0) is D, HW only supports (W * D + D).
We would then give up on mad and emit mul+add instead. Without the swap, however,
a mad can still be generated, because src0 (actually src2) is aligned to dst.
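A minimal, self-contained sketch of the legality check (the enum and helpers are illustrative, not the actual vISA/HWConformity types):
```cpp
// Sketch: reject the src0/src1 swap of a pseudo_mad when it would turn the
// HW-supported (W * D + D) datatype combination into the unsupported
// (D * W + D) one.
enum class OpndTy { UB, B, UW, W, UD, D };

static bool isD(OpndTy t) { return t == OpndTy::D || t == OpndTy::UD; }
static bool isW(OpndTy t) { return t == OpndTy::W || t == OpndTy::UW; }

// src0/src1/src2 are the pseudo_mad sources in program order; src0 is the one
// that maps to the real mad's src2, and src2 is the addend.
static bool swapBreaksDatatypes(OpndTy src0, OpndTy src1, OpndTy src2) {
  // before the swap: (src0 * src1 + src2) == (W * D + D)  -> legal
  // after the swap:  (src1 * src0 + src2) == (D * W + D)  -> illegal
  return isW(src0) && isD(src1) && isD(src2);
}
```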
We do not yet have performance parity with the old implementation. One of the reasons is suboptimal loading from the rtstack.
This change should coalesce loads for trivial rayquery usages.
The PredefinedConstantResolving pass caused a type mismatch assertion
in tests while moving to opaque pointers. It happened when there was a
type difference between a global variable and its load instruction
user. With typed pointers the pass was skipped in this scenario, because
the user of the global was a bitcast and only the bitcast's user was a
load. What the pass does is a RAUW on the load to replace it with the
global constant. This fix changes the pass's behaviour by enabling
constant folding even when there is a type difference between the load
instruction and the global constant.
Example of the crashing IR:
```llvm
@global = constant [3 x i64] [i64 16, i64 32, i64 64]
define void @func(i64 %0) {
%2 = load i64, ptr @global ; <-- crash
ret void
}
```
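A minimal sketch of the folding idea, under the assumption that LLVM's `ConstantFoldLoadFromConstPtr` is used; the helper name is made up and the actual pass code may differ:
```cpp
// Sketch: fold a load from a constant global even when the loaded type differs
// from the global's value type, then replace the load with the folded constant.
#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"

static bool tryFoldLoadFromGlobal(llvm::LoadInst &LI,
                                  const llvm::DataLayout &DL) {
  auto *GV = llvm::dyn_cast<llvm::GlobalVariable>(
      LI.getPointerOperand()->stripPointerCasts());
  if (!GV || !GV->isConstant() || !GV->hasDefinitiveInitializer())
    return false;
  // Reads the initializer bytes, so a type difference (e.g. loading i64 from
  // a [3 x i64] global) is handled instead of asserting.
  if (llvm::Constant *C =
          llvm::ConstantFoldLoadFromConstPtr(GV, LI.getType(), DL)) {
    LI.replaceAllUsesWith(C);
    LI.eraseFromParent();
    return true;
  }
  return false;
}
```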
In RematChainsAnalysis.hpp, pass arguments by const reference to
avoid copy construction, and initialize pointer members to nullptr
to avoid use of uninitialized memory.
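An illustrative sketch of the two changes (class and member names are made up, not the actual RematChainsAnalysis declarations):
```cpp
#include <vector>

struct RematNode; // placeholder for the analysed instruction chains

class RematChainsAnalysis {
public:
  // Pass the container by const reference so no copy is constructed per call.
  void processChain(const std::vector<RematNode *> &Chain);

private:
  // In-class nullptr initialization prevents reads of uninitialized pointers.
  RematNode *CurrentNode = nullptr;
};
```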
When importing built-in types, the type named "struct.intel_ray_query_opaque_t" was properly imported in typed-pointers mode as:
```
%struct.intel_ray_query_opaque_t = type opaque
```
but in opaque-pointers mode it was not present in the generated BiF .bc file and thus was not imported.
This caused the ResolveOCLRaytracingBuiltins pass to fail, because it relied on that type being present when creating the alloca:
```
auto *allocaType = IGCLLVM::getTypeByName(callInst.getModule(), "struct.intel_ray_query_opaque_t");
auto *alloca = m_builder->CreateAlloca(allocaType);
```
This patch adds a workaround by creating the type when it is not present.
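A sketch of the workaround, extending the snippet above (assuming `llvm::StructType::create` is used to materialize the missing identified type; the surrounding code is illustrative):
```cpp
auto *M = callInst.getModule();
auto *allocaType =
    IGCLLVM::getTypeByName(M, "struct.intel_ray_query_opaque_t");
if (!allocaType) {
  // The type was not pulled in from the BiF module; create an identified
  // struct with no body, which is exactly the "type opaque" declaration.
  allocaType = llvm::StructType::create(M->getContext(),
                                        "struct.intel_ray_query_opaque_t");
}
auto *alloca = m_builder->CreateAlloca(allocaType);
```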
CodeScheduling improvements to ensure better register pressure handling:
- Support handling of rematerialized instructions that are used by a select
(not a memop)
- Add various heuristics to handle situations with small (split) loads
- Add a heuristic to populate the same vector
The old handleStoreInst/loadEltsFromVecAlloca assume a 1:1 lane mapping
and equal sizes between the user value and the promoted vector element
type. This is insufficient for mixed widths (e.g. <4 x i8> and <... x i32>),
for cross-lane accesses created by the new byte-offset GEP lowering, or
for pointers under opaque pointers (bitcasts between pointers and
non-pointers are illegal).
With the changes:
1) Stores (handleStoreInst and storeEltsToVecAlloca) normalize the
source (scalar or vector) to a single integer of NeedBits = N * DstBits
using ptrtoint/bitcast, split the big integer into
K = ceil(NeedBits / SrcBits) chunks, bitcast/inttoptr each chunk back to
the promoted lane type and insert into K consecutive lanes starting at
the scalarized index.
2) Loads (handleLoadInst and loadEltsFromVecAlloca) read K promoted
lanes starting at the scalarized index, convert each lane to iSrcBits,
pack into i(K*SrcBits), truncate to i(NeedBits), then expand to the
requested scalar or <N x DstScalarTy>. Use inttoptr for pointer results.
There is also still the simple (old) path: if SrcBits == DstBits, just
emit extractelement with casts (if needed).
All paths do a single load of the promoted vector, then
extractelement/insertelement, and in the case of stores only a single
store back.
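A rough IRBuilder-based sketch of the store-side packing, assuming the promoted vector has integer lanes of SrcBits and that StartLane is the scalarized i32 start index; the helper signature and the omission of the pointer (ptrtoint/inttoptr) and non-integer lane cases are simplifications, not the actual LowerGEPForPrivMem code:
```cpp
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

void storePacked(IRBuilder<> &B, Value *Src, Value *PromotedVec,
                 Value *VecAllocaPtr, Value *StartLane,
                 unsigned N, unsigned DstBits, unsigned SrcBits) {
  const unsigned NeedBits = N * DstBits;                 // bits the user writes
  const unsigned K = (NeedBits + SrcBits - 1) / SrcBits; // promoted lanes touched
  Type *LaneIntTy = B.getIntNTy(SrcBits);

  // 1) normalize the source (scalar or vector) to one integer of NeedBits,
  //    then widen so the shifts below cover all K * SrcBits bits
  Value *Big = B.CreateBitCast(Src, B.getIntNTy(NeedBits));
  Value *Wide = B.CreateZExtOrBitCast(Big, B.getIntNTy(K * SrcBits));

  // 2) split into K chunks and insert them into consecutive promoted lanes
  Value *Vec = PromotedVec;
  for (unsigned i = 0; i < K; ++i) {
    Value *Chunk = B.CreateTrunc(B.CreateLShr(Wide, i * SrcBits), LaneIntTy);
    Value *Lane = B.CreateAdd(StartLane, B.getInt32(i));
    Vec = B.CreateInsertElement(Vec, Chunk, Lane);
  }

  // 3) the store path still performs only a single vector store back
  B.CreateStore(Vec, VecAllocaPtr);
}
```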
With these changes, the LLVM IR emitted from LowerGEPForPrivMem
will look different. Instead of using plain bitcasts, there are now
ptrtoint/inttoptr instructions and there is additional packing/splitting
logic. For the simple (old) load path, the new implementation should
essentially emit the same pattern (potentially skipping bitcasts).
The additional integer/bitcast instruction sequences should be easily
foldable. Memory traffic is unchanged (still one vector load/store).
Overall register pressure should be similar, and the pass still
eliminates GEPs and avoids private/scratch accesses.
zeinfo now contains information on whether a kernel/function has printf
calls and function pointer calls. This allows NEO to create the
printf_buffer only when it is actually used.