We would always allocate maximum amount for vector containing
DWARFUnitInfo. In real usecases what ends up hapenning is we allocate a
giant vector when processing one CU, or for thin-lto case multiple CUs.
This lead to a lot of memory overhead, and 2x BOLT processing slowdown
for at least one service built with monolithic DWARF.
For binaries built with LTO with clang all of CUs that have cross
references will share an abbrev table and will be processed in one
batch. Rest of CUs are processesd in --cu-processing-batch-size size.
Which defaults to 1.
For theoretical cases where cross-cu references are present, but they do
not share abbrev will increase the size of CloneUnitCtxMap as each CU is
being processsed.
There is an assumption that we dont need to to mix sanitizer with
symbolizer from different LLVM revison. If so we can detect it by
`__sanitizer_symbolize_code` and assume that the rest is present.
Summary:
We currently keep a cache of created ELF files from the relevant images.
This shouldn't be necessary as the entire ELF interface is generally
trivially constructable and extremely cheap. The cost of constructing
one of these objects is simply a size check and writing a pointer to the
underlying data. Given that, keeping a cache of these images should not
be necessary overall.
Atomic instructions (load / store/ atomicrwm / cmpxchg) are not
really undefined behavior if they lack natural alignment. They will
(with AtomicExpand pass enabled) be converted into libcalls.
Update the language reference to reflect this.
This patch reduces the size of heap memory required by the test
`malloc_parallel.c` and `malloc.c`. The original size is too large such
that `malloc` returns `nullptr` on many threads, causing illegal
memory access.
Using the LoopInfo from OMPInfoCache after the Attributor ran resulted
in a crash due to it being in an invalid state.
---------
Co-authored-by: Ivan Radanov Ivanov <ivanov2@llnl.gov>
This patch add three GTest unit tests that test plugin read and write
operations. Tests can be compiled with `ninja -C runtimes/runtimes-bins
LibomptUnitTests`.
Stubify broadly takes either tbd files or binary dylibs and turns them
into tbd files. In future patches, stubify will also allow additional
information to be embedded into the final TBD output too.
Add Util APIs to TextAPI for common operations used by readtapi for now.
InstCombine canonicalizes `add` to `or` when possible, but this makes
some optimizations applicable to `add` to be missed because they don't
realize that the `or` is equivalent to `add`.
In this patch we generalize `foldICmpBinOp` to handle such cases.
This patch adds representation for `device_type` clause information on
compute construct (parallel, kernels, serial).
The `device_type` clause on compute construct impacts clauses that
appear after it. The values impacted by `device_type` are now tied with
an attribute array that represent the device_type associated with them.
`DeviceType::None` is used to represent the value produced by a clause
before any `device_type`. The operands and the attribute information are
parser/printed together.
This is an example with `vector_length` clause. The first value (64) is
not impacted by `device_type` so it will be represented with
DeviceType::None. None is not printed. The second value (128) is tied
with the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(256)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
When multiple values can be produced for a single clause like
`num_gangs` and `wait`, an extra attribute describe the number of values
belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow up
patch.
This fixes a longstanding bug in the `Context._CAPICreate` method
whereby it was not taking ownership of the PyMlirContext wrapper when
casting to a Python object. The result was minimally that all such
contexts transferred in that way would leak. In addition, counter to the
documentation for the `_CAPICreate` helper (see
`mlir-c/Bindings/Python/Interop.h`) and the `forContext` /
`forOperation` methods, we were silently upgrading any unknown
context/operation pointer to steal-ownership semantics. This is
dangerous and was causing some subtle bugs downstream where this
facility is getting the most use.
This patch corrects the semantics and will only do an ownership transfer
for `_CAPICreate`, and it will further require that it is an ownership
transfer (if already transferred, it was just silently succeeding).
Removing the mis-aligned behavior made it clear where the downstream was
doing the wrong thing.
It also adds some `_testing_` functions to create unowned context and
operation capsules so that this can be fully tested upstream, reworking
the tests to verify the behavior.
In some torture testing downstream, I was not able to trigger any memory
corruption with the newly enforced semantics. When getting it wrong, a
regular exception is raised.
See #73359
Types using `assemblyFormat` to define parsing don't need an additional
handwritten parser. So we should remove the handwritten parsers where
one
provided by an `assemblyFormat` already exists to avoid confusion and
de-syncing.
This patch enables more numeric (mod, sum, matmul, etc.) APIs,
and some others.
I added new macros to disable warnings about using C++ STD methods
like operators of std::complex, which do not have __device__ attribute.
This may probably result in unresolved references, if the header files
implementation relies on libstdc++. I will need to follow up on this.
The current (experimental) spec for WebAssembly shared libraries does
not include a full symbol table like the object format. This change
extracts symbol information from the normal wasm exports.
This is the first step in having the linker report undefined symbols
when linking with shared libraries. The current behaviour is to ignore
all undefined symbols when linking with `-pie` or `-shared`.
See https://github.com/emscripten-core/emscripten/issues/18198
In 9a38a72f1d `ProductId` was assigned from the stringified value of
`CLANG_VENDOR`, if that macro was defined. However, `CLANG_VENDOR` is
supposed to be a string, as it is defined (optionally) as such in the
top-level clang `CMakeLists.txt`.
Furthermore, `CLANG_VENDOR` is only passed as a build-time define when
compiling `Version.cpp`, so add a `getClangVendor()` function to
`Version.h`, and use it in `CodegGenModule.cpp`, instead of relying on
the macro.
Fixes: 9a38a72f1d
In 9a38a72f1d `ProductId` was assigned from the stringified value of
`CLANG_VENDOR`, if that macro was defined. However, `CLANG_VENDOR` is
supposed to be a string, as it is defined (optionally) as such in the
top-level clang `CMakeLists.txt`.
Move the addition of `-DCLANG_VENDOR` to the compiler flags from
`clang/lib/Basic/CMakeLists.txt` to the top-level `CMakeLists.txt`, so
it is consistent across the whole clang codebase. Then remove the
stringification from `CodeGenModule.cpp`, to make it work correctly.
Fixes: 9a38a72f1d
When project components are built as separate shared libraries, a lot
of errors appear about undefined symbols, e.g.
```
/usr/bin/ld: CMakeFiles/obj.MLIRGPUPipelines.dir/GPUToNVVMPipeline.cpp.o
: in function `(anonymous namespace)::buildCommonPassPipeline(mlir::OpPa
ssManager&, (anonymous namespace)::GPUToNVVMPipelineOptions const&)':
GPUToNVVMPipeline.cpp:(.text._ZN12_GLOBAL__N_123buildCommonPassPipelineE
RN4mlir13OpPassManagerERKNS_24GPUToNVVMPipelineOptionsE+0xa5): undefined
reference to `mlir::createConvertLinalgToLoopsPass()'
```
Add the necessary dependencies to Dialect/GPU/Pipelines/CMakeLists.txt
The maxnum/minnum semantics can be found at
https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic.
The revision also updates function names in lit tests to match op name.
Take arith.maxnumf as example:
```
func.func @maxnumf(%lhs: f32, %rhs: f32) -> f32 {
%result = arith.maxnumf %lhs, %rhs : f32
return %result : f32
}
```
will be expanded to
```
func.func @maxnumf(%lhs: f32, %rhs: f32) -> f32 {
%0 = arith.cmpf ugt, %lhs, %rhs : f32
%1 = arith.select %0, %lhs, %rhs : f32
%2 = arith.cmpf uno, %lhs, %lhs : f32
%3 = arith.select %2, %rhs, %1 : f32
return %3 : f32
}
```
Case 1: Both LHS and RHS are not NaN; LHS > RHS
In this case, `%1` is LHS. `%3` and `%1` have the same value, so `%3` is
LHS.
Case 2: LHS is NaN and RHS is not NaN
In this case, `%2` is true, so `%3` is always RHS.
Case 3: LHS is not NaN and RHS is NaN
In this case, `%0` is true and `%1` is LHS. `%2` is false, so `%3` and
`%1` have the same value, which is LHS.
Case 4: Both LHS and RHS are NaN:
`%1` and RHS are all NaN, so the result is still NaN.
Introduce RecordVisitor. This is used for different clients that want to
extract information out of RecordSlice types.
The first and immediate use case is for serializing symbol information
into TBD files.
As part of startup refactoring, this patch adds a function to merge
multiple objects into a single relocatable object:
cc -r obj1.o obj2.o -o obj.o
A relocatable object is an object file that is not fully linked into an
executable or a shared library. It is an intermediate file format that
can be passed into the linker.
A crt object can have arch-specific code and arch-agnostic code. To
reduce code cohesion, the implementation is splitted into multiple
units. As a result, we need to merge them into a single relocatable
object.
emitPopInst checks a single function exit MBB. If other paths also exit
the function and any of there terminators uses LR implicitly, it is not
save to clear the Restored bit.
Check all terminators for the function before clearing Restored.
This fixes a mis-compile in outlined-fn-may-clobber-lr-in-caller.ll
where the machine-outliner previously introduced BLs that clobbered LR
which in turn is used by the tail call return.
Alternative to #73553
The `acc` dialect operations now implement MemoryEffects interfaces in
the following ways:
- Data entry operations which may read host memory via `varPtr` are now
marked as so. The majority of them do NOT actually read the host memory.
For example, `acc.present` works on the basis of presence of pointer and
not necessarily what the data points to - so they are not marked as
reading the host memory. They still use `varPtr` though but this
dependency is reflected through ssa.
- Data clause operations which may mutate the data pointed to by
`accPtr` are marked as doing so.
- Data clause operations which update required structured or dynamic
runtime counters are marked as reading and writing the newly defined
`RuntimeCounters` resource. Some operations, like `acc.getdeviceptr` do
not actually use the runtime counters - but are marked as reading them
since the address obtained depends on the mapping operations which do
update the runtime counters. Namely, `acc.getdeviceptr` cannot be moved
across other mapping operations.
- Constructs are marked as writing to the `ConstructResource`. This may
be too strict but is needed for the following reasons: 1) Structured
constructs may not use `accPtr` and instead use `varPtr` - when this is
the case, data actions may be removed even when used. 2) Unstructured
constructs are currently used to aggregate multiple data actions. We do
not want such constructs removed or moved for now.
- Terminators are marked as `Pure` as in other dialects.
The current approach has the following limitations which may require
further improvements:
- Subsequent `acc.copyin` operations on same data do not actually read
host memory pointed to by `varPtr` but are still marked as so.
- Two `acc.delete` operations on same data may not mutate `accPtr` until
the runtime counters are zero (but are still marked as mutating).
- The `varPtrPtr` argument, when present, points to the address of
location of `varPtr`. When mapping to target device, an `accPtrPtr`
needs computed and this memory is mutated. This effect is not captured
since the current operations do not produce `accPtrPtr`.
- Runtime counter effects are imprecise since two operations with
differing `varPtr` increment/decrement different counters. Additionally,
operations with `varPtrPtr` mutate attachment counters.
- The `ConstructResource` is too strict and likely can be relaxed with
better modeling.
Functions using different constant expressions were incorrectly
merged, because a lot of state was missing from the comparison,
including the opcode, the comparison predicate, the GEP element
type, as well as the inbounds, inrange and nowrap poison flags.
If decided to resize the instruction, need to drop wrapping flags from
the resulting vector instructions to avoid incorrect
optimizations/assumptions later.
Fixes PR75995.