MismatchDetect wasn't detecting a type size mismatch in this case:
```llvm
%0 = alloca [2 x double]
%1 = getelementptr inbounds [2 x double], ptr %0, i64 0, i64 0
%2 = load <2 x i32>, ptr %1
```
The check compared the number of bits loaded by the load instruction type, <2 x i32> (64 bits), against the size of the alloca scalar type, double (64 bits), so the mismatch went undetected because 64 == 64. I've changed the approach to use the LLVM getScalarSizeInBits() type method and compare scalar sizes (32 vs. 64), similarly to what was done in the typed pointers path (see SOALayoutChecker::visitBitCastInst). Also refactored the control flow.
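A minimal sketch of the new comparison, using a hypothetical standalone helper rather than the actual SOALayoutChecker code:
```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Illustrative only: compare scalar element sizes instead of total bit counts.
static bool scalarSizesMismatch(const LoadInst &LI, const AllocaInst &AI) {
  Type *LoadTy = LI.getType();             // e.g. <2 x i32>
  Type *AllocTy = AI.getAllocatedType();   // e.g. [2 x double]
  if (auto *ArrTy = dyn_cast<ArrayType>(AllocTy))
    AllocTy = ArrTy->getElementType();     // double
  // Old check: 64 bits loaded vs. 64-bit double -> no mismatch reported.
  // New check: 32-bit scalar vs. 64-bit scalar -> mismatch detected.
  return LoadTy->getScalarSizeInBits() != AllocTy->getScalarSizeInBits();
}
```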
With opaque pointers we cannot deduce the pointee type of a `Prefetch` call from its `ptr` argument, but we need that information to create the appropriate builtins. To recover it, we can use `llvm::demangle()` on the call's mangled name and extract the type from the demangled signature.
This change is compatible with both typed and opaque pointers.
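A minimal sketch of the idea; the mangled name and the string matching below are purely illustrative:
```cpp
#include "llvm/Demangle/Demangle.h"
#include <string>

// Illustrative only: recover the pointee type of a prefetch builtin from its
// mangled name, since an opaque `ptr` argument no longer carries that type.
std::string getPrefetchPointeeTypeName(const std::string &MangledName) {
  // e.g. "_Z8prefetchPU3AS1Kfj" demangles to a signature whose parameter
  // list spells out the pointee type (here: const float in addrspace(1)).
  std::string Demangled = llvm::demangle(MangledName);
  if (Demangled.find("float") != std::string::npos)
    return "float";
  if (Demangled.find("double") != std::string::npos)
    return "double";
  return {}; // unknown -- caller falls back to a default builtin
}
```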
Replace EATOMIC_IADD with INC/DEC when the immediate is 1 or -1
When a shader does typed atomics on the typed path, or untyped atomics on UGM or SLM, and the atomic operation is just an increment or decrement with an immediate of 1 or -1, we can replace EATOMIC_IADD with EATOMIC_INC(2) or EATOMIC_DEC(3).
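A minimal sketch of the substitution; only the INC = 2 / DEC = 3 values come from the description above, everything else is illustrative:
```cpp
// Illustrative only: not the actual vISA/IGC code. The EATOMIC_IADD value
// here is just a placeholder.
enum AtomicOp { EATOMIC_IADD = 0, EATOMIC_INC = 2, EATOMIC_DEC = 3 };

// Pick INC/DEC instead of IADD when the source is an immediate +1 or -1.
AtomicOp selectAtomicOp(AtomicOp Op, bool SrcIsImm, long long ImmValue) {
  if (Op != EATOMIC_IADD || !SrcIsImm)
    return Op;
  if (ImmValue == 1)
    return EATOMIC_INC;   // atomic_iadd dst, 1  -> atomic_inc dst
  if (ImmValue == -1)
    return EATOMIC_DEC;   // atomic_iadd dst, -1 -> atomic_dec dst
  return Op;
}
```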
Refactor optimization handling in the new inline raytracing to defer modifying the function until we are done with all liveness objects.
This way, we don't invalidate the liveness objects, avoiding costly recalculations.
Cross block load vectorization works on an assumption: within a single block, we can preload the rstack data for multiple rayinfo calls without drastically increasing overall register pressure.
This lets us cull a lot of sends (applications will usually cluster rayinfo calls within a single block).
The first implementation was flawed, though. It didn't take into account two things:
1. Some instructions will write to the stack (like TraceRayInline). This will make the shadow copy stale.
2. RayInfo instructions will create their own blocks when lowered. This will affect basic block -> stack pointer mapping, creating more shadow copies and unnecessary loads.
Issue 1 is fixed by splitting the block after instructions that write to the stack.
Issue 2 is fixed by collecting the ray info instructions first, assigning stack pointers to them, and only then lowering them.
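A minimal sketch of that two-phase approach for issue 2, with all names invented for illustration:
```cpp
#include <utility>
#include <vector>

struct RayInfoCall;   // placeholder for a rayinfo intrinsic call
struct StackPointer;  // placeholder for the per-block rstack pointer value

// Phase 1: assign stack pointers while the original block structure is still
// intact. Phase 2: lower the calls; the block splits created during lowering
// can no longer disturb the block -> stack pointer mapping.
template <typename GetStackPtr, typename LowerCall>
void lowerAllRayInfo(const std::vector<RayInfoCall *> &Calls,
                     GetStackPtr getStackPtrFor, LowerCall lowerOne) {
  std::vector<std::pair<RayInfoCall *, StackPointer *>> Work;
  for (RayInfoCall *C : Calls)
    Work.emplace_back(C, getStackPtrFor(C));
  for (auto &[Call, StackPtr] : Work)
    lowerOne(Call, StackPtr);
}
```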
spillHeader may be used to store the offset for a spill/fill instruction, so it
must be an infinite-spill-cost variable. If spillHeader gets assigned to a
register that fragments the free space, previously spilled variables may fail
to get an allocation in the fail-safe RA iteration.
With this change, we assign spillHeader to the first GRF candidate that fits,
avoiding fragmentation of the free GRF space.
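A minimal sketch of the allocation choice, with an invented free-register representation:
```cpp
#include <vector>

// Illustrative only: assign spillHeader to the first free GRF rather than an
// arbitrary one, so the remaining free GRFs stay contiguous for fail-safe RA.
int pickSpillHeaderGRF(const std::vector<bool> &IsGRFFree) {
  for (int i = 0, e = (int)IsGRFFree.size(); i < e; ++i)
    if (IsGRFFree[i])
      return i;   // first candidate -> no hole carved out of the free range
  return -1;      // no free GRF available
}
```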
Add std::move where the use is at the end of the variable's scope.
Small refactor + typo fix in the tryPrintLabel lambda.
Replace the `construct -> push_back` pattern with emplace_back (see the example below).
Change args to const ref where it makes sense.
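For the push_back change, an illustrative before/after (the types are chosen arbitrarily):
```cpp
#include <string>
#include <utility>
#include <vector>

void example(std::vector<std::pair<int, std::string>> &v) {
  // Before: construct a temporary pair, then move it into the vector.
  v.push_back(std::pair<int, std::string>(1, "label"));
  // After: construct the element in place inside the vector.
  v.emplace_back(1, "label");
}
```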
The old scalarization advanced the GEP scalarized index by the number of
smaller vector elements when a GEP indexed through a reinterpreted
vector whose lane size differed from the promoted lane. This
over-advanced the index (e.g. using 8 for <8 x i32> over double lanes
instead of 4), producing incorrect accesses.
The fix:
- Track the promoted lane byte size (m_promotedLaneBytes) in
TransposeHelper and set it in TransposeHelperPromote's constructor.
- In TransposeHelper::getArrSizeAndEltType, when a vector is a
reinterpret of the promoted storage, compute the increment as
vector_byte_size / m_promotedLaneBytes instead of
vector_byte_size / small_element_size.
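A minimal sketch of the corrected increment computation; the parameter names mirror the description above, but the helper itself is illustrative:
```cpp
#include <cstdint>

// For a GEP stepping through a vector that reinterprets the promoted storage,
// advance by the number of promoted lanes the vector covers, not by its own
// element count.
uint64_t getIndexIncrement(uint64_t VectorByteSize, uint64_t SmallEltBytes,
                           uint64_t PromotedLaneBytes, bool IsReinterpret) {
  if (IsReinterpret)
    return VectorByteSize / PromotedLaneBytes; // <8 x i32> over double: 32/8 = 4
  return VectorByteSize / SmallEltBytes;       // old behavior: 32/4 = 8
}
```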
When calling `NewF->setName(OriginalName);`, setName under the hood performs
`destroyValueName();`, which invalidates OriginalName.
This resulted in IR like:
`define spir_kernel void @"\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD"(ptr addrspace(1)...)`
or
`%14 = call <2 x i32> @"\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD\DD"(ptr addrspace(1) %input...)`
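A minimal sketch of the failure mode and the fix, assuming OriginalName is a StringRef that aliases the function's existing name:
```cpp
#include "llvm/IR/Function.h"
#include <string>

using namespace llvm;

void renameFunction(Function *NewF, StringRef OriginalName) {
  // Buggy: if OriginalName points into NewF's current name, setName() first
  // destroys that storage (destroyValueName) and then reads the dangling
  // StringRef, producing the garbage "\DD..." names shown above.
  // NewF->setName(OriginalName);

  // Fix: take an owned copy before renaming.
  std::string NameCopy = OriginalName.str();
  NewF->setName(NameCopy);
}
```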
When the payload of a spill intrinsic is address-taken, we cannot simply
replace the virtual register with a temporary coalesced range, because
address-taken variables can have indirect defs.
Treat such cases as not spill-coalesceable.
This is a functional fix.