Originally implemented in https://github.com/swiftlang/swift/pull/29014.
I've made a couple of changes:
1. Use the target's address size, not lldb's
2. Replace the loop with a format string
This change fixes an issue with the use of `-alternatename` in the MSVC
CRT on ARM64EC, where both mangled and demangled symbol names are
specified. Without this patch, the demangled name could be resolved to
an anti-dependency alias of the target. Since chaining anti-dependency
aliases is not allowed, this results in an undefined symbol.
The root cause isn't specific to ARM64EC; it can affect other targets as
well, even when anti-dependency aliases aren't involved. The
accompanying test case demonstrates a scenario where the symbol could be
resolved from an archive. However, because the archive member is pulled
in after the first pass of alternate name resolution, and archive
members don't override weak aliases, eager resolution would incorrectly
skip it.
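For reference, here is a hedged sketch of what such an alternate-name
directive looks like in source form (the symbol names are made up, not the
actual CRT symbols):

```cpp
// The CRT embeds linker directives like this one: an otherwise unresolved
// reference to `bar` is redirected to `bar_impl` during symbol resolution.
#pragma comment(linker, "/alternatename:bar=bar_impl")
```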
It was disabled because there may be artificial padding. After [refining the pack op semantics](773e158c64),
we can assume that there is no artificial padding. Thus, the check can
be removed, and we can unconditionally enable the consumer fusion if it
is a perfect tiling case.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Summary:
This is a bit of an awkward transition point for the new and old
drivers. Previously, AMDGPU used this to generate offloading bundles, but
the new driver much prefers to output the file itself. This patch
changes the behavior to always respect `--gpu-bundle-output` instead of
having it be the default behavior. This means that we effectively get to
override the default new driver behavior with this flag now. This should
hopefully fix some errors in the downstream comgr tests.
Summary:
Previously, querying for the offload architecture tool would invoke the
user's PATH, which is bad when potentially using the driver from a
direct path. This patch changes this to *only* consider the
`offload-arch` that's supposed to live next to the driver executable.
Now we will no longer pick up a potentially conflicting version of this
tool, and it should always be found (since it's a clang tool that's
installed alongside the driver).
This relands PRs #143108 and #144538.
The original PR was reverted due to a mistake that made all the mlir
tests run only if the SPIRV target was enabled. This is now resolved since
enabling spirv-tools no longer requires the SPIRV target.
spirv-tools are not required by default to run the SPIRV mlir tests, but
they can be optionally enabled in some SPIRV mlir tests to verify that the
produced SPIRV assembly passes validation.
The other reverted PR #144685 is no longer needed and is not part of this
relanding.
Original commit message:
> At the MLIR level, unsigned integers and signless integers are different
types; indeed, when looking up the two types in the type definition cache
they do not match.
> Hence, when translating a module which contains both unsigned and
signless integers, the generated SPIR-V will contain the same type
declaration twice (something like OpTypeInt 32 0), which is not permitted
in SPIR-V, and such generated modules fail validation.
> This patch solves the problem by mapping unsigned integer types to
signless integer types before looking up in the type definition cache.
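A minimal sketch of the mapping idea, assuming a standalone helper rather
than the actual serializer code:

```cpp
#include "mlir/IR/BuiltinTypes.h"

// Map unsigned integer types onto their signless equivalents before
// consulting the type definition cache, so both spellings resolve to a
// single OpTypeInt declaration.
static mlir::Type getTypeForCacheLookup(mlir::Type type) {
  if (auto intType = llvm::dyn_cast<mlir::IntegerType>(type))
    if (intType.isUnsigned())
      return mlir::IntegerType::get(intType.getContext(), intType.getWidth());
  return type;
}
```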
---------
Signed-off-by: Davide Grohmann <davide.grohmann@arm.com>
This reverts commit 0844812b2e with a shape fix in 1db4c6b275.
The revision restricts the `linalg.pack` op to not have artificial
padding semantics. E.g., the example below is valid without the change and
becomes invalid with it.
```mlir
func.func @foo(%src: tensor<9xf32>) -> tensor<100x8xf32> {
%cst = arith.constant 0.000000e+00 : f32
%dest = tensor.empty() : tensor<100x8xf32>
%pack = linalg.pack %src
padding_value(%cst : f32)
inner_dims_pos = [0]
inner_tiles = [8] into %dest
: tensor<9xf32> -> tensor<100x8xf32>
return %pack : tensor<100x8xf32>
}
```
IMO, it is a misuse to create pack ops with artificial padding sizes,
because the intention of the pack op is to relayout the source based on
target intrinsics, etc. The output shape is expected to be
`tensor<2x8xf32>`. If people need extra padding, they can create a new pad
op followed by the pack op.
This also makes consumer tiling much easier, because the consumer fusion
does not support artificial padding sizes. It is very hard to make it work
without ad-hoc patterns, because the tile sizes are defined on the source,
which implies that you don't have a core_id/thread_id to write padding
values to the whole tile.
People may wonder how/why the pad tiling implementation works. The answer
is that it creates an `if-else` branch to handle the case. In my
experience, this is a struggle during transformations, because most of the
time people only need one side of the branch, given that the tile sizes
are usually greater than the padding sizes. However, the implementation is
conservatively correct in terms of semantics. Given that the `pack` op was
introduced to serve relayout needs better, having the restriction makes
sense to me.
Removed tests:
- `no_bubble_up_pack_extending_dimension_through_expand_cannot_reassociate`
  from `data-layout-propagation.mlir`: it is a duplicate of
  `bubble_up_pack_non_expanded_dims_through_expand` after fixing the shape.
- `fuse_pack_consumer_with_untiled_extra_padding` from
`tile-and-fuse-consumer.mlir`: it was created for artificial padding in
the consumer fusion implementation.
The other changes in the lit tests just fix the shapes.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This allows setting an optional integer level for a given debug type. The
string format is `type[:level]`, and the integer is interpreted as follows:
- if not provided: all debugging for this debug type is enabled.
- if >0: all messages with a level less than or equal to the given level
  are enabled.
- if 0: this debug type is disabled, but unlike a positive level it does
  not disable the other debug types; it acts as a negative filter.
The LDBG() macro is updated to accept an optional log level to
illustrate the feature. Here is the expected behavior:
LDBG() << "A"; // Identical to LDBG(1) << "A";
LDBG(2) << "B";
With `--debug-only=some_type`: we'll see A and B in the output.
With `--debug-only=some_type:1`: we'll see A but not B in the output.
With `--debug-only=some_type:2`: we'll see A and B in the output. (same
with any level above 2)
With `--debug-only=some_type:0`: we'll see neither A nor B in the
output, but we'll see any other logging for other debug types.
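A minimal sketch of the level comparison described above (illustrative, not
the actual implementation; the negative-filter interaction between different
debug types is not modeled here):

```cpp
#include <map>
#include <string>

// `requested` maps each debug type from --debug-only to its level, with -1
// standing for "no :level suffix given".
static bool isDebugEnabled(const std::map<std::string, int> &requested,
                           const std::string &type, int msgLevel) {
  auto it = requested.find(type);
  if (it == requested.end())
    return false;                  // this debug type was not requested
  if (it->second < 0)
    return true;                   // bare `type`: everything for this type
  return msgLevel <= it->second;   // so `type:0` prints nothing for this type
}
```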
Summary:
This reference counter tracks how many threads are using a given slab.
Currently it's a 64-bit integer; this patch reduces it to a 32-bit
integer. The benefit of this is that we save a few registers now that we
no longer need to use two for these operations. This increases the risk
of overflow, but given that the largest value we accept for a single
slab is ~131,000, it is a long way off the maximum of four billion or
so. Obviously we can oversubscribe the reference count by having threads
attempt to claim the lock and then try to free it, but I assert that it
is exceedingly unlikely that we will somehow have over four billion GPU
threads stalled in the same place.
A later optimization could be done to split the reference counter and
pointers into a struct of arrays, which would save 128 KiB of static
memory (as we currently use 512 KiB for the slab array).
Changed the warning message:
- **From**: 'Attempt to free released memory'
**To**: 'Attempt to release already released memory'
- **From**: 'Attempt to free non-owned memory'
**To**: 'Attempt to release non-owned memory'
- **From**: 'Use of memory after it is freed'
**To**: 'Use of memory after it is released'
All connected tests and their expectations have been changed
accordingly.
Inspired by [this
PR](https://github.com/llvm/llvm-project/pull/147542#discussion_r2195197922)
Clang's current implementation only works on array types, but GCC (which
is where we got this attribute) supports it on pointers as well as
arrays.
Fixes #150951
Previously we saved registers in the shadow space of the callee before
calling __delayLoadHelper2. Now we save arguments in the shadow space of
the caller and allocate shadow space for the callee.
Fixes #51941
---------
Co-authored-by: Benjamin Santerre <benjamin.santerre@gmail.com>
This controls whether tag checking is performed for loads and
stores, or stores only.
It requires a specific architecture feature, which we detect via a HWCAP3
bit and a cpuinfo feature.
Live process tests look for this and adjust expectations accordingly; core
file tests use an updated core file with this feature enabled.
The size of the core file has increased and there's nothing I can do about
that. It could be due to the presence of new architecture features or
kernel changes since I last generated them.
I can generate a smaller file that has the tag segment,
but that segment does not actually contain tag data. So
that's no use.
I'm not really sure what the point of these was, but they originated
in the base support commit for gfx942 mfma support. These don't impact
the selection at all, so they don't belong in this test. They were causing
allocation failures depending on whether the AGPR or VGPR form was used.
The `readability-qualified-auto` check currently looks at the unsugared
type, skipping any typedefs, to determine whether the variable is a
pointer type. This may not be the desired behaviour, in particular when
the type depends on compilation flags.
For example
```
#if CONDITION
using Handler = int *;
#else
using Handler = uint64_t;
#endif
```
A more common example is some implementations of `std::array` use
pointers as iterators.
This introduces the IgnoreAliasing option so that
`readability-qualified-auto` does not look beyond typedefs.
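A hedged illustration of the intended effect (the function and its uses are
hypothetical, and the exact diagnostics are assumptions):

```cpp
using Handler = int *; // may instead be `using Handler = uint64_t;`
                       // under a different configuration

Handler getHandler();

void use() {
  // Without IgnoreAliasing, the check desugars `Handler`, sees a pointer,
  // and asks for `auto *h`. With IgnoreAliasing enabled, it stops at the
  // alias and leaves this declaration alone.
  auto h = getHandler();
  (void)h;
}
```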
---------
Co-authored-by: juanbesa <juanbesa@devvm33299.lla0.facebook.com>
Co-authored-by: Kazu Hirata <kazu@google.com>
Co-authored-by: EugeneZelenko <eugene.zelenko@gmail.com>
Co-authored-by: Baranov Victor <bar.victor.2002@gmail.com>
Reapply #140091.
branch-folder hoists common instructions from TBB and FBB into their
pred. Without this patch it achieves this by splicing the instructions from TBB
and deleting the common ones in FBB. That moves the debug locations and debug
instructions from TBB into the pred without modification, which is not
ideal. Debug locations are handled in #140063.
This patch handles debug instructions in the simplest way possible, which is
to just kill (undef) them. We kill and hoist the ones in FBB as well as TBB,
because otherwise the fact that there's an assignment on the code path is
deleted (which might lead to a prior location extending further than it
should).
There's possibly something we could do to preserve some variable locations in
some cases, but this is the easiest not-incorrect thing to do.
Note I had to replace the constant DBG_VALUEs with ones that use registers in the test; it
turns out setDebugValueUndef doesn't undef constant DBG_VALUEs... which feels
wrong to me, but isn't something I want to touch right now.
---
Fix end-iterator-dereference and add test.
This will be used to detect the presence of Arm's new Memory Tagging
store-only checking feature. This commit just adds the plumbing to get
that value into the detection function.
FreeBSD has not allocated a number for HWCAP3 and already has AT_ARGV
defined as 29. So instead of attempting to read from FreeBSD processes,
I've explicitly passed 0. We don't want to be reading some other entry
accidentally.
If/when FreeBSD adds HWCAP3 we can handle it like we do for
AUXV_FREEBSD_AT_HWCAP.
No extra tests here, those will be coming with the next change for MTE
support.
No change of meaning, just formatting and an extra example to make it
easier to comprehend:
* Split separate, important points into their own paragraphs.
* Remove a contraction.
* Finally, show how to use "static" on a function (see the sketch below).
  Before, we just showed why namespaces were bad, but not what you should
  do instead.
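A minimal sketch of the kind of example being added (the helper function is
hypothetical):

```cpp
// Preferred for file-local helpers: mark the function `static` instead of
// wrapping it in an anonymous namespace.
static int numDigits(int value) {
  int count = 1;
  while (value /= 10)
    ++count;
  return count;
}
```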
If a chunk is empty and there are no other non-empty chunks in the same
section, `removeEmptySections()` will remove the entire section. In this
case, use a section index of 0, as the MSVC linker does, instead of
asserting.
When vectorizing with predication, some loops that were previously
vectorized without zvfhmin/zvfbfmin will no longer be vectorized, because
the masked load/store or gather/scatter cost is reported as illegal.
This is due to a discrepancy where for these costs we check
isLegalElementTypeForRVV but for regular memory accesses we don't.
But for bf16 and f16 vectors we don't actually need the extension
support for loads and stores, so this adds a new function which takes
this into account.
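A hedged sketch of the distinction, using illustrative names rather than the
actual RISC-V cost-model hook:

```cpp
enum class EltKind { F16, BF16, F32, I64 };

// For plain vector loads and stores, f16/bf16 elements only need the vector
// unit to move bits, so no zvfhmin/zvfbfmin requirement applies here, unlike
// the existing element-type legality check used for arithmetic.
static bool isLegalLoadStoreElement(EltKind kind, unsigned elen) {
  switch (kind) {
  case EltKind::F16:
  case EltKind::BF16:
  case EltKind::F32:
    return true;       // only bits are moved; no FP extension is required
  case EltKind::I64:
    return elen >= 64; // e.g. i64 elements remain illegal on zve32x
  }
  return false;
}
```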
For regular memory accesses we should probably also e.g. return an
invalid cost for i64 elements on zve32x, but it doesn't look like we
have tests for this yet.
We also should probably not be vectorizing these bf16/f16 loops to begin
with if we don't have zvfhmin/zvfbfmin and zfhmin/zfbfmin. I think this
is due to the scalar costs being too cheap. I've added tests for this in
a100f63672 to fix in another patch.
Summary:
This wait restricts how long we wait on a slab. The only reason this
isn't an infinite loop is to prevent complete deadlocks. However, this
limit was *just* on the cusp of waiting long enough for the allocation
to be done. Just increase this to a sufficiently large value, because
this limit only exists to keep the interface wait-free in the absolute
worst case scheduling scenario. This *MASSIVELY* improved performance
for mixed allocations, as we no longer shuffle around creating more slabs
than necessary.
Summary:
We previously used `match_all` as the shortcut to figure out which
threads were destined for which slots. This lowers to a for-loop which,
even if it often only executes once, still causes some slowdown,
especially when divergent. Instead we use a single ballot call and
calculate the mapping from it.
Here the ballot tells us which lanes are the first in a block, either
the starting index or the barrier for a new 32-bit int. We then use some
bit magic to figure out for each lane ID its closest leader. For the
length we simply use the length calculated by the leader of the
remaining bits to be written. This removes the match any and the
shuffle, which improves the minimum number of cycles this takes by about
5%.
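A hedged sketch of the "closest leader" bit trick (standalone C++, not the
actual libc GPU code):

```cpp
#include <cstdint>

// `ballot` has a bit set for every lane that starts a block; each lane picks
// the highest set bit at or below its own lane id as its leader.
static uint32_t closestLeader(uint32_t ballot, uint32_t laneId) {
  // Keep our own bit and the bits of the lanes below us.
  uint32_t candidates = ballot & (~0u >> (31u - laneId));
  if (candidates == 0)
    return laneId; // defensive: the starting lane should always be a leader
  // The leader is the most significant remaining bit.
  return 31u - static_cast<uint32_t>(__builtin_clz(candidates));
}
```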
Summary:
This slightly increases performance in a few places. First, we
optimistically assume the cached slab has ample space which lets us
avoid the atomic load on the highly contended counter in the case that
it is likely to succeed. Second, we no longer call `match_any` twice as
we can calculate the uniform slabs at the moment we return them.
Third, we always choose a random index on a 32-bit boundary. This
means that in the fast case we fulfil the allocation with a single
`fetch_or`, and in the other case we quickly move to the free bit.
This nets around a 7.75% improvement for the fast path case.
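A hedged sketch of the fast path (standalone, not the actual allocator; the
retry for the slow case is left to the caller):

```cpp
#include <atomic>
#include <cstdint>

// Try to claim a randomly chosen bit in one 32-bit word of the slab bitfield
// with a single fetch_or. On the fast path that one atomic is enough; on the
// slow path the returned snapshot already tells us which nearby bits are free.
static int tryClaimBit(std::atomic<uint32_t> &word, uint32_t chosenBit) {
  uint32_t before = word.fetch_or(1u << chosenBit, std::memory_order_acq_rel);
  if (!(before & (1u << chosenBit)))
    return static_cast<int>(chosenBit); // fast case: claimed in one operation
  uint32_t freeBits = ~before;
  if (!freeBits)
    return -1;                          // the word is full; try another word
  return __builtin_ctz(freeBits);       // candidate free bit for the retry
}
```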
We can do it all in finalizeStore if we ensure it always sees the stores.
For that, I needed to fix a hidden bug where finalizeStore wouldn't see all
stores, because sometimes the iterator got out of sync and no longer
pointed to the store.
This also removes the waits before volatile LDS stores, which never needed
them; that was a bug until now.
Summary:
The slots in this allocation scheme are statically allocated. All sizes
share the same array of slots, but are given different starting
locations to space them apart. The previous implementation used a
trivial linear slice. This is inefficient because it provides the more
likely allocations (1-1024 bytes) with just as much space as a highly
unlikely one (1 MiB).
This patch uses a cubic easing function to gradually shrink the gaps.
For example, we used to get around 700 free slots for a 16-byte
allocation; now we get around 2100 before it starts encroaching on the
32-byte allocation space. This could be improved further, but I think
this is sufficient.
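A minimal sketch of the easing idea, with illustrative constants that are
not the allocator's real values:

```cpp
#include <cstdint>

static constexpr uint32_t TotalSlots = 8192; // shared by all size buckets
static constexpr uint32_t NumBuckets = 17;   // e.g. 16 B up to 1 MiB

// The starting offset of each bucket follows a cubic ease-out curve, so the
// small, common sizes get a much larger share of the slot array than the
// rare large ones.
static uint32_t bucketStartSlot(uint32_t bucket) {
  double t = static_cast<double>(bucket) / NumBuckets;
  double eased = 1.0 - (1.0 - t) * (1.0 - t) * (1.0 - t);
  return static_cast<uint32_t>(eased * TotalSlots);
}
```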