Summary:
Add a dummy option in BOLT to allow us to write any string in
the bolt info section. This is accomplished since we record the complete
argv vector. This string used to tag this binary with any ID that can
later be associated with a specific BOLT invocation.
(cherry picked from FBD21441902)
Summary:
We use a special routine to emit line info for functions that we do not
overwrite. The resulting DWARF was not quite efficient as we were
advancing addresses using a larger than needed opcodes. Since there were
only a few functions that we didn't emit/overwrite, it was not a big
issue.
However, in lite mode the majority of functions are not overwritten and
as a result, the inefficiency in debug line encoding got exposed and
binaries were getting larger than expected .debug_line sections.
Fix it by using more conventional line table opcodes for address
advancing.
(cherry picked from FBD21423074)
Summary: We only support linking ELF runtime library right now. If the library is an archiver, we check that each individual library inside the archiver is an ELF library.
(cherry picked from FBD21388672)
Summary:
When optimizing a binary without relocations, we can skip processing
functions without profile (cold functions). By skipping processing of
cold functions, we reduce the processing time and memory. We call
such mode a lite mode, and it is enabled by default.
Some processing is still done for functions without profile even in lite
mode. scanExternalRefs() function is used to detect secondary entry
points to functions that are not marked in the symbol table.
Note that the no-relocation requirement is a temporary limitation
of the lite mode.
(cherry picked from FBD21366567)
Summary:
Whenever a function is not meant for processing, e.g. when the user
requests to optimize only a subset of functions, mark the function as
ignored. Use this attribute instead of opts::shouldProcess().
(cherry picked from FBD21374806)
Summary:
The commit that fixed ICF determinism in non-relocation mode disabled
profile merging for functions. Dyno stats output needs to be updated to
reflect the lack of merge.
(cherry picked from FBD21366046)
Summary:
In non-relocation mode, the code for marking a function non-simple was
decoupled from the code that added new entry points. Fix that.
(cherry picked from FBD21264247)
Summary:
Some functions could be called at an address inside their function body.
Typically, these functions are written in assembly as C/C++ does not
have a multi-entry function concept. The addresses inside a function
body that could be referenced from outside are called secondary entry
points.
In BOLT we support processing functions with secondary/multiple entry
points. We used to mark basic blocks representing those entry points
with a special flag. There was only one problem - each basic block has
exactly one MCSymbol associated with it, and for the most efficient
processing we prefer that symbol to be local/temporary. However, in
certain scenarios, e.g. when running in non-relocation mode, we need
the entry symbol to be global/non-temporary.
We could create global symbols for secondary points ahead of time when
the entry point is marked in the symbol table. But not all such entries
are properly marked. This means that potentially we could discover an
entry point only after disassembling the code that references it, and
it could happen after a local label was already created at the same
location together with all its references. Replacing the local symbol
and updating the references turned out to be an error-prone process.
This diff takes a different approach. All basic blocks are created with
permanently local symbols. Whenever there's a need to add a secondary
entry point, we create an extra global symbol or use an existing one
at that location. Containing BinaryFunction maps a local symbol of a
basic block to the global symbol representing a secondary entry point.
This way we can tell if the basic block is a secondary entry point,
and we emit both symbols for all secondary entry points. Since secondary
entry points are quite rare, the overhead of this approach is minimal.
Note that the same location could be referenced via local symbol from
inside a function and via global entry point symbol from outside.
This is true for both primary and secondary entry points.
(cherry picked from FBD21150193)
Summary:
Add an option to fail processing of the input binary if the profile
is not accurate:
-stale-threshold=<uint>
- maximum percentage of stale functions to tolerate (default: 100)
Default (100) means never to fail.
A function profile is considered stale if any branch in its profile
has invalid source or destination.
Use `-stale-threshold=0` to fail if any staleness is detected in the
profile.
(cherry picked from FBD21189036)
Summary:
In relocation mode, there is no use for old .eh_frame section. Moreover,
it can interfere with new EH frames via .eh_frame_hdr when the original
.text is reused.
(cherry picked from FBD21120070)
Summary:
In non-relocation mode, make sure we emit extra symbols for a folded
function even if the function was not overwritten due to its large
size.
(cherry picked from FBD21080467)
Summary:
In a rare case, we may fold a function and fail to emit it in
non-relocation mode due to a function size increase. At the same time,
the function that the original function was folded into could have been
successfully emitted, e.g. because it was split in the presence of a
profile information.
Later, because the function was not emitted, we have to use its original
.eh_frame entry in the preserved .eh_frame section. However, that entry
is no longer referencing the original function, but the function that
the original was folded into. This happens since the original symbol gets
emitted at the other function location. As a result, .eh_frame entry for
the folded function is missing.
To prevent incorrect update of the original .eh_frame, create
relocations against absolute values. This guarantees preservation of the
section contents while updating pc-relative references.
(cherry picked from FBD21061130)
Summary:
RuntimeDyldImpl::resolveExternalSymbols() some time ago used to call
getSymbolAddress() while in the second loop. That call could have
modified the contents of ExternalSymbolRelocations that the loop was
iterating over. Thus the code was written in a way that erased the
processed entry on every loop iteration and reset the map
iterator. With large number of entries in ExternalSymbolRelocations the
loop code becomes a performance bottleneck.
Since getSymbolAddress() is no longer used, the
ExternalSymbolRelocations could be iterated in a straightforward way and
the map cleared before the function exit.
(cherry picked from FBD21057058)
Summary:
Indirect calls that use RSP to compute the target address would
break in instrumentation mode because we were adding instructions that
changed the stack pointer. Fix this.
(cherry picked from FBD20883791)
Summary:
Further speedup ICF by applying stricter rules for congruent functions.
While checking symbolic operands in congruent functions, consider
operands congruent only if they are equal or reference functions
with identical hashes, i.e. potentially foldable functions.
Note that jump table operands are handled as a special case.
(cherry picked from FBD20912054)
Summary:
Too many hash collisions may cause ICF to run slowly.
We used to hash BinaryFunction only looking at instruction opcodes,
ignoring instruction operands. With many almost identical functions,
such approach may lead to long ICF processing time. By including
operands into the hash, we reduce the number of collisions and
improve the runtime often by a factor of 2 or more.
(cherry picked from FBD20888957)
Summary:
ICF may fold functions in arbitrary order when running multi-threaded.
This is fine in relocation mode as we end up with just one function
holding all function symbols.
However, in non-relocation mode we keep all function bodies, and if we
keep merging profiles in non-deterministic order, we end up with
functions with non deterministic profiles. The fix for non-relocation
mode is to not merge profiles as the factual new profile could be
different from the merged one since both function instances are
potentially callable.
Additionally, emit extra symbols for ICF functions in non-relocation
mode to make it possible to track the folding.
(cherry picked from FBD20889866)
Summary:
Some functions may have exactly the same code and exception handlers.
However, their action tables could be different leading to mismatching
semantics. We should verify their equivalence while running ICF.
(cherry picked from FBD20889035)
Summary:
The version of LLVM that we are based on lacks the support for base
address in DWARF location lists. Add the missing pieces.
(cherry picked from FBD20640784)
Summary:
Make ELF symbol table rewriting code more structured. While at it,
remove symbols from non-allocatable sections.
(cherry picked from FBD20243386)
Summary:
Consolidate code and data emission code in ELF-independent
BinaryEmitter. The high-level interface includes only two
functions emitBinaryContext() and emitFunctionBody() used
by RewriteInstance and BinaryContext respectively.
(cherry picked from FBD20332901)
Summary:
This is a prerequisite for larger emitter refactoring.
Since .dynamic is read unconditionally, add an error message if the
section is missing, or the size of the section is zero.
(cherry picked from FBD20331735)
Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).
(cherry picked from FBD20085905)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.
(cherry picked from FBD19987277)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later.
(cherry picked from FBD19648161)
Summary:
There are two peephole subpasses, remove-double-jumps and
remove-useless-conditional-branches, that operates by reading branches
directly, which makes them tricky to run before fix-branches. In the
case of remove-double-jumps, it will even lead to suboptimal code if
the patched branch was going to be removed by fix-branches when the
target is the fall-through. If the final target is a tail call, it will
lead to a broken CFG in the worst case. Fix this by moving these passes
after SCTC, which already produces CFGs with conditional tail calls.
(cherry picked from FBD18795592)