The old handleStoreInst/loadEltsFromVecAlloca implementations assume a
1:1 lane mapping and equal sizes between the user value and the
promoted vector element type. This is insufficient for mixed widths
(e.g. <4 x i8> vs. <... x i32>), for cross-lane accesses created by the
new byte-offset GEP lowering, and for pointer values under opaque
pointers (bitcasts between pointers and non-pointers are illegal).
With the changes:
1) Stores (handleStoreInst and storeEltsToVecAlloca) normalize the
source (scalar or vector) to a single integer of NeedBits = N * DstBits
using ptrtoint/bitcast, split that integer into K = ceil(NeedBits /
SrcBits) chunks, bitcast/inttoptr each chunk back to the promoted lane
type, and insert the chunks into K consecutive lanes starting at the
scalarized index (both paths are sketched after this list).
2) Loads (handleLoadInst and loadEltsFromVecAlloca) read K promoted
lanes starting at the scalarized index, convert each lane to iSrcBits,
pack the lanes into i(K*SrcBits), truncate to i(NeedBits), and expand
the result to the requested scalar or <N x DstScalarTy>, using inttoptr
for pointer results.
The simple (old) path is still present: if SrcBits == DstBits, we just
emit extractelement with casts where needed.
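For example, loading a float from a promoted <8 x i32> alloca at lane
%idx could simply become (names hypothetical):
%vec = load <8 x i32>, ptr %promoted
%lane = extractelement <8 x i32> %vec, i32 %idx
%res = bitcast i32 %lane to float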
All paths perform a single load of the promoted vector followed by
extractelement/insertelement, and, in the case of stores, only a single
store back.
With these changes, the LLVM IR emitted from LowerGEPForPrivMem looks
different. Instead of plain bitcasts, there are now ptrtoint/inttoptr
instructions and additional packing/splitting logic. For the simple
(old) load path, the new implementation should emit essentially the
same pattern (potentially skipping bitcasts). The additional
integer/bitcast instruction sequences should be easily foldable. Memory
traffic is unchanged (still one vector load/store). Overall register
pressure should be similar, and the pass still eliminates GEPs and
avoids private/scratch accesses.
zeinfo now records whether a kernel/function has printf calls or
function pointer calls. This allows neo to create the printf_buffer
only when it is actually used.
In opaque pointer mode, GEPs that index into globals often have a
different shape. The SimplifyConstant pass assumed two-index GEPs
(0, index) and directly used the second operand as an element index.
However, flat aggregates can also be addressed with single-index GEPs.
See the two examples below from SYCL_CTS-math_builtin_float_double_1_ocl
run in typed and opaque pointer mode.
Two-index GEP example:
%130 = getelementptr inbounds [2 x i32], [2 x i32] addrspace(2)* @__stgamma_ep_nofp64__ones, i64 0, i64 %129
%131 = bitcast i32 addrspace(2)* %130 to float addrspace(2)*
%132 = load float, float addrspace(2)* %131, align 4, !tbaa !5163, !noalias !5409
Single-index GEP example:
%103 = getelementptr inbounds float, ptr addrspace(2) @__stgamma_ep_nofp64__ones, i64 %102
%104 = load float, ptr addrspace(2) %103, align 4, !tbaa !5163, !noalias !5409
This patch changes the pass to always use the last GEP index as the
element selector. This works because the pass only transforms top-level
arrays of scalars/vectors. In these cases, the element being loaded is
always designated by the final GEP index, whether earlier indices
select into the aggregate (typed pointer mode) or a single index is
used (opaque pointer mode).
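For a two-element array such as the one above, the load can then be
folded into a select keyed on that final index. A minimal sketch,
assuming the pass lowers a two-element array to a select, with made-up
element values (the real contents of @__stgamma_ep_nofp64__ones are
not shown):
%is.first = icmp eq i64 %102, 0
%val = select i1 %is.first, float 1.000000e+00, float -1.000000e+00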
Do not rely on bitcasts when deciding whether an index adjustment is
necessary: in opaque pointer mode, types can change between
instructions without bitcasts.
Compute workloads add the following implicit arguments:
* payloadHeader - 8 x i32 packing global_id_offset (3 x i32),
local_size (3 x i32), and 2 x i32 reserved.
* enqueued_local_size - 3 x i32
Most of the time only enqueued_local_size is used, leaving local_size
unnecessary. As a result, payloadHeader carries 20 unused bytes.
This commit enables the short payload header on the PVC platform.
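For reference, the two layouts as LLVM struct types (type names are
hypothetical, and the short form is assumed to keep only
global_id_offset, dropping the unused local_size and reserved fields):
%payloadHeader.full = type { [3 x i32], [3 x i32], [2 x i32] }
; ^ global_id_offset, local_size, reserved (32 bytes total)
%payloadHeader.short = type { [3 x i32] } ; global_id_offset (12 bytes)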
When verifying whether an operand access exceeds the declared variable
size, we need special handling for the madw instruction, as it writes
both the low and high results to GRFs.
This change addresses the handling of predicated stores of sub-DW
values with non-uniform stored values. The predicate alone is not
enough to calculate the correct offset, so we use `EMASK & Predicate`
to determine it.
When LLVM IR uses opaque pointers or inserts a bitcast to i8*, a
subsequent GEP is expressed in bytes. The legacy handleGEPInst always
scalarized indices starting from pGEP->getSourceElementType(). After
the i8* cast, that type is i8, so the algorithm mistakenly treated the
byte index as a count of elements, producing a misscaled (too large)
scalarized index.
Example:
%a = alloca [16 x [16 x float]], align 4
%b = bitcast [16 x [16 x float]]* %a to i8*
%c = getelementptr inbounds i8, i8* %b, i64 64
Here, 64 is a byte offset into the original aggregate. The old
implementation, seeing i8, scaled it as if it were 64 elements, not
64 bytes. Yet the meaningful base of the GEP is the alloca's aggregate
type [16 x [16 x float]], and the element calculations should be based
on that type.
This change:
1. Introduces getFirstNonScalarSourceElementType(GEP), which walks back
from the GEP base through pointer casts to find the root aggregate
element type.
2. Adds handling in handleGEPInst so that an i8 GEP byte offset is
converted to an element index of the underlying base type.
This way the algorithm avoids basing element-index scalarization on an
incidental i8* and keeps the index calculation aligned with the
underlying allocation layout.
For reference, in typed pointer mode (or without the bitcast), the GEP
would look like this:
%a = alloca [16 x [16 x float]], align 4
%c = getelementptr inbounds [16 x [16 x float]], [16 x [16 x float]]* %a, i64 0, i64 1
Here, %c points to the second inner array, i.e. it has type
[16 x float]*.
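Conceptually, the new handling rescales the byte offset by the root
element size before scalarizing: for the example above, 64 bytes /
4 bytes per float = flattened element index 16, which is exactly the
start of the second [16 x float] row. For a dynamic byte offset %n the
rescaling could look like this (names hypothetical; assumes the offset
is a multiple of the element size):
%elt.idx = udiv i64 %n, 4 ; byte offset -> float element index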
In cases where there are no local-to-generic casts and private memory
is allocated in the global address space, we can replace
GenericCastToPtrExplicit with a simple address space cast.
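A minimal sketch of the replacement, assuming IGC's usual numbering
where addrspace(4) is generic and addrspace(1) is global (the
GenericCastToPtrExplicit call shape is elided):
; before: %p = call ... GenericCastToPtrExplicit ... (ptr addrspace(4) %g)
%p = addrspacecast ptr addrspace(4) %g to ptr addrspace(1)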