Adds a generic lowering that supports all cases of bufferization.dealloc
and one specialized, more efficient lowering for the simple case. Using
a helper function with for loops in the general case keeps the size of
the lowered code at O(|num_dealloc_memrefs| + |num_retain_memrefs|).
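For illustration, the op being lowered has roughly the following form (the operands, types, and names below are made up; see the op documentation for the exact semantics):

%own:2 = bufferization.dealloc (%m0, %m1 : memref<2xf32>, memref<4xi32>)
                               if (%cond0, %cond1)
                               retain (%r0, %r1 : memref<?xf32>, memref<?xi32>)
// one updated ownership flag is returned per retained memref

In the general case, the lowering hands the deallocation candidates, their conditions, and the retained memrefs to a single helper function that performs the necessary checks in loops instead of emitting all pairwise checks inline.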
Depends on D155467
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D155468
This patch mostly renames files so that their names better reflect the functions they declare.
Reviewed By: michaelrj
Differential Revision: https://reviews.llvm.org/D155607
Preparation to update the bazel builder to use the LLVM 16 release,
where the layering check was enabled: https://reviews.llvm.org/D132779
The current setup missed some backsliding in the layering check, as it
was only on for projects with the check enforced.
Disabled it completely for libc and fixed it for DWARFLinkerParallel.
It would be great to re-enable it for libc later.
This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.
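For illustration (the buffers, indices, and shapes below are made up, not taken from the patch or its tests), the rewrite goes roughly from

%pad = arith.constant 0.0 : f32
%v = vector.transfer_read %global[%i, %j], %pad
  : memref<1024x1024xf32>, vector<4xf32>
vector.transfer_write %v, %shared[%k, %l]
  : vector<4xf32>, memref<64x64xf32, #gpu.address_space<workgroup>>

to an asynchronous device copy that no longer stages the data through registers:

%tok = nvgpu.device_async_copy %global[%i, %j], %shared[%k, %l], 4
  : memref<1024x1024xf32> to memref<64x64xf32, #gpu.address_space<workgroup>>
%grp = nvgpu.device_async_create_group %tok
nvgpu.device_async_wait %grp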
Differential Revision: https://reviews.llvm.org/D155569
This is a follow-up on D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D155515
This fixes builds for 7e78ecfe10 (both cmake and bazel) and trims unnecessary dependencies.
This is achieved by moving the functionality to test/lib/GPU, which is a more natural landing pad.
Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an `scf.for` loop with either
`vector.transfer` or `nvgpu.device_async_copy` operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.
This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.
This is loosely inspired by the version in IREE, but makes fewer unsafe
assumptions and communicates decisions in a more principled way.
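As a rough sketch, the transform expects a loop of the following shape (all names, shapes, and constants are illustrative, and the pipelining preconditions mentioned above are assumed to already hold):

scf.for %i = %c0 to %c512 step %c4 {
  %tok = nvgpu.device_async_copy %global[%i], %shared[%i], 4
    : memref<512xf32> to memref<512xf32, #gpu.address_space<workgroup>>
  %grp = nvgpu.device_async_create_group %tok
  nvgpu.device_async_wait %grp
  // ... compute on the elements that were just copied ...
}

After pipelining, the copy for a later iteration is issued before the wait and the compute of the current iteration, so data movement overlaps with computation.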
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155223
* Move passes to `Transforms` directory.
* Add `Utils.h` (will be utilized in a subsequent change).
Differential Revision: https://reviews.llvm.org/D155427
This is a follow-up on D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D155181
This is a follow-up on D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D155174
It also unifies the computation of StridedLayoutAttr. If the stride is
a statically known value, we can just use it.
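For example (illustrative types only, not taken from the patch), using the static value produces a layout like the first type below rather than falling back to dynamic strides as in the second:

memref<4x8xf32, strided<[8, 1], offset: ?>>   // strides statically known
memref<4x8xf32, strided<[?, ?], offset: ?>>   // strides dynamic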
Differential Revision: https://reviews.llvm.org/D155017
This is a follow-up on D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D155099
This is a follow-up on D154800 and D154770 to make the code structure more principled and avoid too many nested #ifdef/#endif.
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D155076
Switch the parsing of command-line options from llvm::cl to OptTable.
The motivation for this change is to continue adding LLVM-based tools
to the llvm driver multicall. For more information about the proposal
and motivation, please see https://discourse.llvm.org/t/rfc-llvm-busybox-proposal/58494
Reviewed By: abrachet
Differential Revision: https://reviews.llvm.org/D154642
This is the counterpart to the forward dense dataflow analysis and
integrates into the dataflow framework. The implementation follows the
structure of existing dataflow analyses.
Reviewed By: Mogball, phisiart
Differential Revision: https://reviews.llvm.org/D154713
There will be subsequent patches to move things around and make the file layout more principled.
Differential Revision: https://reviews.llvm.org/D154770
GPU code generation, and specifically shared memory copy insertion, may
introduce spurious barriers guarding read-after-read dependencies or
read-after-write dependencies on non-aliasing data, which degrades
performance due to unnecessary synchronization. Add a pattern and
transform op that remove such barriers by analyzing the memory effects
that each barrier actually guards and that are not also guarded by
other barriers. The code is adapted
from the Polygeist incubator project.
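For illustration (a hypothetical sketch rather than a test case), a barrier that only separates two reads of the same workgroup buffer guards a read-after-read dependency and can be dropped:

%0 = memref.load %buf[%i] : memref<128xf32, #gpu.address_space<workgroup>>
gpu.barrier  // only separates the two loads (read-after-read)
%1 = memref.load %buf[%j] : memref<128xf32, #gpu.address_space<workgroup>>

A barrier that separates a store to %buf from a later load of it is kept, since it guards a real read-after-write dependency.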
Co-authored-by: William Moses <gh@wsmoses.com>
Co-authored-by: Ivan Radanov Ivanov <ivanov.i.aa@m.titech.ac.jp>
Reviewed By: nicolasvasilache, wsmoses
Differential Revision: https://reviews.llvm.org/D154720
For machines with a lot of cores, hardware prefetchers can saturate the memory bus when utilization is high.
In this case it is desirable to turn off the hardware prefetcher completely.
This has a big impact on the performance of memory functions such as `memcpy` that rely on the fact that the next cache line will be readily available.
This patch adds the 'LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING' compile-time option that generates a version of memcpy with software prefetching. While it does not fully restore the original performance, it mitigates the impact to an acceptable level.
Reviewed By: rtenneti
Differential Revision: https://reviews.llvm.org/D154494
The accuracy for the MPFR numbers in the strtofloat fuzz test was set
too high, causing rounding issues when rounding to a smaller final
result.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D154150
`getConstantIntValue` extracts constant values from all constant-like ops, not just `arith::ConstantIndexOp`.
Differential Revision: https://reviews.llvm.org/D154356
This transform op promotes loops with one iteration. I.e., the loop op is replaced by just the loop body.
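For example (an illustrative sketch), the loop below runs exactly once:

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%r = scf.for %iv = %c0 to %c1 step %c1 iter_args(%acc = %init) -> (f32) {
  %sum = arith.addf %acc, %x : f32
  scf.yield %sum : f32
}

and is promoted to

%r = arith.addf %init, %x : f32

with the induction variable, if it were used, taking the value of the lower bound.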
Differential Revision: https://reviews.llvm.org/D154361
This revision adds a pre-bufferization transform that can reduce the number of allocations. It is similar to `bufferization.eliminate_empty_tensors`, but specific to LinalgOp.
The transform looks for `tensor.empty` ops where the SSA use-def chain ends in an "ins" operand of a `LinalgOp`. If the same `LinalgOp` has an unused "outs" operand (and some other conditions are met), this "outs" operand can be used instead of the `tensor.empty` and the "ins" operand can be turned into an "outs" operand.
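For illustration (a hypothetical sketch; the actual preconditions checked by the transform are more detailed): below, %empty only reaches the LinalgOp through the "ins" operand %filled, and the "outs" block argument %unused is never read in the body, so the data can be routed through %dest instead and the tensor.empty becomes unnecessary.

%empty = tensor.empty() : tensor<16xf32>
%filled = linalg.fill ins(%zero : f32) outs(%empty : tensor<16xf32>) -> tensor<16xf32>
%res = linalg.generic
    {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
     iterator_types = ["parallel"]}
    ins(%filled : tensor<16xf32>) outs(%dest : tensor<16xf32>) {
^bb0(%in: f32, %unused: f32):
  %y = arith.addf %in, %in : f32
  linalg.yield %y : f32
} -> tensor<16xf32>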
Differential Revision: https://reviews.llvm.org/D153952
This is based on ideas from @nafi to:
- use a branchless version of 'cmp' for 'uint32_t',
- completely resolve the lexicographic comparison through vector
operations when wide types are available. We also get rid of byte
reloads and serializing '__builtin_ctzll'.
I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in subsequent patches.
The code has been rewritten to reduce nested function calls, making the
job of the inliner easier and preventing harmful code duplication.
Reviewed By: nafi3000
Differential Revision: https://reviews.llvm.org/D148717
We are in the progress of migrating to a much improved surface syntax for the Sparse Tensor Encoding Attribute (STEA).
You can see a preview of this in the StableHLO RFC at
https://github.com/openxla/stablehlo/blob/main/rfcs/20230210-sparsity.md
//**This design is courtesy Wren Romano.**//
This initial revision
(1) Introduces the first version of a new parser written by Wren Romano
(2) Introduces a simple "migration plan" using NEW_SYNTAX on the STEA, which will allow us to test the new parser with new examples, as well as migrate existing examples over without the need to rewrite them all
This first "drop" merely provides the entry points to parse the new syntax. The parser is still under active development. For example, we need to address the "lookahead" issue when parsing the lvl spec (viz. do we see l0 = d0 or a direct d0). Another larger task is to actually implement "affine" parsing (since the MLIR affine parser is not accessible in other parts of the tree).
EXAMPLE:
Currently, CSR looks like
#CSR = #sparse_tensor.encoding<{
  lvlTypes = ["dense","compressed"],
  dimToLvl = affine_map<(i,j) -> (i,j)>
}>
but you can "force" the new parser with
#CSR = #sparse_tensor.encoding<{
  NEW_SYNTAX =
    (d0, d1) -> (l0 = d0 : dense, l1 = d1 : compressed)
}>
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D153997