Summary:
When optimizing a binary without relocations, we can skip processing
functions without profile (cold functions). By skipping processing of
cold functions, we reduce the processing time and memory. We call
such mode a lite mode, and it is enabled by default.
Some processing is still done for functions without profile even in lite
mode. scanExternalRefs() function is used to detect secondary entry
points to functions that are not marked in the symbol table.
Note that the no-relocation requirement is a temporary limitation
of the lite mode.
(cherry picked from FBD21366567)
Summary:
Whenever a function is not meant for processing, e.g. when the user
requests to optimize only a subset of functions, mark the function as
ignored. Use this attribute instead of opts::shouldProcess().
(cherry picked from FBD21374806)
Summary:
In non-relocation mode, the code for marking a function non-simple was
decoupled from the code that added new entry points. Fix that.
(cherry picked from FBD21264247)
Summary:
Some functions could be called at an address inside their function body.
Typically, these functions are written in assembly as C/C++ does not
have a multi-entry function concept. The addresses inside a function
body that could be referenced from outside are called secondary entry
points.
In BOLT we support processing functions with secondary/multiple entry
points. We used to mark basic blocks representing those entry points
with a special flag. There was only one problem - each basic block has
exactly one MCSymbol associated with it, and for the most efficient
processing we prefer that symbol to be local/temporary. However, in
certain scenarios, e.g. when running in non-relocation mode, we need
the entry symbol to be global/non-temporary.
We could create global symbols for secondary points ahead of time when
the entry point is marked in the symbol table. But not all such entries
are properly marked. This means that potentially we could discover an
entry point only after disassembling the code that references it, and
it could happen after a local label was already created at the same
location together with all its references. Replacing the local symbol
and updating the references turned out to be an error-prone process.
This diff takes a different approach. All basic blocks are created with
permanently local symbols. Whenever there's a need to add a secondary
entry point, we create an extra global symbol or use an existing one
at that location. Containing BinaryFunction maps a local symbol of a
basic block to the global symbol representing a secondary entry point.
This way we can tell if the basic block is a secondary entry point,
and we emit both symbols for all secondary entry points. Since secondary
entry points are quite rare, the overhead of this approach is minimal.
Note that the same location could be referenced via local symbol from
inside a function and via global entry point symbol from outside.
This is true for both primary and secondary entry points.
(cherry picked from FBD21150193)
Summary:
In a rare case, we may fold a function and fail to emit it in
non-relocation mode due to a function size increase. At the same time,
the function that the original function was folded into could have been
successfully emitted, e.g. because it was split in the presence of a
profile information.
Later, because the function was not emitted, we have to use its original
.eh_frame entry in the preserved .eh_frame section. However, that entry
is no longer referencing the original function, but the function that
the original was folded into. This happens since the original symbol gets
emitted at the other function location. As a result, .eh_frame entry for
the folded function is missing.
To prevent incorrect update of the original .eh_frame, create
relocations against absolute values. This guarantees preservation of the
section contents while updating pc-relative references.
(cherry picked from FBD21061130)
Summary:
Too many hash collisions may cause ICF to run slowly.
We used to hash BinaryFunction only looking at instruction opcodes,
ignoring instruction operands. With many almost identical functions,
such approach may lead to long ICF processing time. By including
operands into the hash, we reduce the number of collisions and
improve the runtime often by a factor of 2 or more.
(cherry picked from FBD20888957)
Summary:
The version of LLVM that we are based on lacks the support for base
address in DWARF location lists. Add the missing pieces.
(cherry picked from FBD20640784)
Summary:
Consolidate code and data emission code in ELF-independent
BinaryEmitter. The high-level interface includes only two
functions emitBinaryContext() and emitFunctionBody() used
by RewriteInstance and BinaryContext respectively.
(cherry picked from FBD20332901)
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.
(cherry picked from FBD19824368)
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.
(cherry picked from FBD19614981)
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.
There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.
When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.
When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).
(cherry picked from FBD19426709)
Summary:
Call postProcessEntryPoints only after all functions have been
disassembled and all interprocedural references have been processed,
when all possible entry points have been accounted for. This makes our
detection of bad entries more robust as it does not depend on the order
of the functions any more.
(cherry picked from FBD19404767)
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.
Fix the breakage and make the discovery of secondary entry points more
accurate.
Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.
(cherry picked from FBD19295983)
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.
(cherry picked from FBD19168319)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after program finishes. Hooks for startup are added to setup
the runtime structures that needs initilization, such as indirect call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
(cherry picked from FBD17644396)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support existing functionality of adding entry
points after the CFG is built as the function is only called in empty or
disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
Free more lists in BinaryFunction::releaseCFG().
Release BinaryFunction::Relocations after disassembly.
Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.
Also remove PCRelativeRelocationOffsets that weren't used.
(cherry picked from FBD18413256)
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.
We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.
(cherry picked from FBD18408954)
Summary:
We've used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.
(cherry picked from FBD18318660)
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.
(cherry picked from FBD18283965)
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.
(cherry picked from FBD18034019)
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.
(cherry picked from FBD17505759)
Summary:
Change our edge profiling technique when using instrumentation
to do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.
(cherry picked from FBD17062773)
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.
This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.
This should affect strict relocation mode only.
(cherry picked from FBD16923335)
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.
I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.
(cherry picked from FBD16595415)
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.
When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.
However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.
Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow to overwrite it.
(cherry picked from FBD16477787)
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.
(cherry picked from FBD16092651)
Summary:
This diff parallelize the construction of call graph during disassembly.
The diff includes a change to parallel-utilities where another interface
is added, that support running tasks on binaryFunctions that involves
adding instruction annotations. This pattern is common in different places,
e.g. frame optimizations. And such, pattern justify creating an interface,
that abstract out all the messy details.
(cherry picked from FBD16232809)
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.
(cherry picked from FBD16161149)
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.
Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.
(cherry picked from FBD16101312)
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.
In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.
In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.
The strict mode is recommended when BOLT is used with a well-formed
code generated by a compiler.
To use the strict mode, add "-strict" on the command line.
Another effect of this diff, is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.
Also this diff fixes issues with Clang compiled with -fpic.
(cherry picked from FBD15872849)
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.
This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.
Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.
(cherry picked from FBD15998096)
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to it filled with zeroes).
(cherry picked from FBD15981683)
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().
(cherry picked from FBD15814762)
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.
As a result, we no longer have embedded jump tables (i.e. a jump table
that is parter of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
Fixesfacebookincubator/BOLT#56.
(cherry picked from FBD15800427)
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.
(cherry picked from FBD15643535)
Summary:
Similarly to how the compiler relies on DWARF to map samples, so
it is possible to collect profile data in binaries optimized by PGO
techniques and retrofit data to be used in a representation of the program
that was not optimized by PGO, this diff implements an option in BOLT to
encode a table in the output binary that allows us to map data collected
in optimized binaries back to the address space used in the input binary
(where the profile is useful, since we do not support running BOLT on a
binary already optimized by BOLT). The goal is to offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.
This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.
(cherry picked from FBD15531860)
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
(cherry picked from FBD15551930)
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.
(cherry picked from FBD15465034)
Summary:
Move handling of interprocedural references to BinaryContext.
Post-process indirect branches immediately after the CFG is built.
This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.
(cherry picked from FBD15456003)
Summary:
SDT markers that appears as nops in the assembly, are preserved and not eliminated.
Functions with SDT markers are also flagged. Inlining and folding are disabled for
functions that have SDT markers.
(cherry picked from FBD15379799)
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.
(cherry picked from FBD15196017)
Summary:
While checking for a size of a jump table, we've used containing
section as a boundary. This worked for most cases as typically jump
tables are not marked with symbol table entries. However, the compiler
may generate objects for indirect goto's.
(cherry picked from FBD15158905)
Summary:
This adds very basic and limited support for split functions.
In non-relocation mode, split functions are ignored, while their debug
info is properly updated. No support in the relocation mode yet.
Split functions consist of a main body and one or more fragments.
For fragments, the main part is called their parent. Any fragment
could only be entered via its parent or another fragment.
The short-term goal is to correctly update debug information for split
functions, while the long-term goal is to have a complete support
including full optimization. Note that if we don't detect split
bodies, we would have to add multiple entry points via tail calls,
which we would rather avoid.
Parent functions and fragments are represented by a `BinaryFunction`
and are marked accordingly. For now they are marked as non-simple, and
thus only supported in non-relocation mode. Once we start building a
CFG, it should be a common graph (i.e. the one that includes all
fragments) in the parent function.
The function discovery is unchanged, except for the detection of
`\.cold\.` pattern in the function name, which automatically marks the
function as a fragment of another function.
Because of the local function name ambiguity, we cannot rely on the
function name to establish child fragment and parent relationship.
Instead we rely on disassembly processing.
`BinaryContext::getBinaryFunctionContainingAddress()` now returns a
parent function if an address from its fragment is passed.
There's no jump table support at the moment. Jump tables can have
source and destinations in both fragment and parent.
Parent functions that enter their fragments via C++ exception handling
mechanism are not yet supported.
(cherry picked from FBD14970569)