Summary:
The new interface for handling Call Frame Information:
* CFI state at any point in a function (in CFG state) is defined by
CFI state at basic block entry and CFI instructions inside the
block. The state is independent of basic blocks layout order
(this is implied by CFG state but wasn't always true in the past).
* Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
get CFI state at any given instruction in the program.
* No need to call fixCFIState() after any given pass. fixCFIState()
is called only once during function finalization, and any function
transformations after that point are prohibited.
* When introducing new basic blocks, make sure CFI state at entry
is set correctly and matches CFI instructions in the basic block
(if any).
* When splitting basic blocks, use getCFIStateAtInstr() to get
a state at the split point, and set the new basic block's CFI
state to this value.
Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.
Rename "-print-after-fixup" option to "-print-finalized".
This diffs fixes CFI for cases when we split conditional tail calls,
and for indirect call promotion optimization.
(cherry picked from FBD4629307)
Summary:
Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix
is not used in Intel CPUs afaik. The pass is on by default.
(cherry picked from FBD4610329)
Summary:
In a prev diff I added an option to update jump tables in-place (on by default)
and accidentally broke the default handling of jump tables in relocation
mode. The update should be happening semi-automatically, but because
we ignore relocations for jump tables it wasn't happening (derp).
Since we mostly use '-jump-tables=move' this hasn't been noticed for
some time.
This diff gets rid of IgnoredRelocations and removes relocations
from a relocation set when they are no longer needed. If relocations
are created later for jump tables they are no longer ignored.
(cherry picked from FBD4595159)
Summary:
gcc5 can generate new types of relocations that give linker a freedom
to substitute instructions. These relocations are PC-relative, and
since we manually process such relocations they don't present
much of a problem.
Additionally, detect non-pc-relative access from code into a middle of
a function. Occasionally I've seen such code, but don't know exactly
how to trigger its generation. Just issue a warning for now.
(cherry picked from FBD4566473)
Summary:
Some functions coming from assembly may not have been marked
with size. We assume the size to include all bytes up to
the next function/object in the file. As a result,
function body will include any padding inserted by the linker.
If linker inserts 0-value bytes this could be misinterpreted
as invalid instruction and BOLT will bail out on such functions
in non-relocation mode, and give up on a binary in relocation
mode.
This diff detects zero-padding, ignores it, and continues processing
as normal.
(cherry picked from FBD4528893)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions during CFG creation for all
indirect calls. Right now indirect tail calls are not handled
since the functions are marked not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When an candiate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
requency of the top two call targets exceeds the threshold.
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; }
struct B : public A { virtual void foo() { ++B_count; }; };
struct C : public A { virtual void foo() { ++C_count; }; };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
Add an option to overwrite jump tables without moving and make it a
default:
-jump-tables - jump tables support (default=basic)
=none - do not optimize functions with jump tables
=basic - optimize functions with jump tables
=move - move jump tables to a separate section
=split - split jump tables section into hot and cold based on
function execution frequency
=aggressive - aggressively split jump tables section based on usage of
the tables
(cherry picked from FBD4448499)
Summary:
In-non relocation mode, when we run ICF the second time,
we fold the same functions again since they were not
removed from the function set. This diff marks them as
folded and ignores them during ICF optimization. Note
that we still want to optimize such functions since they
are potentially called from the code not covered by BOLT
in non-relocation mode.
Folded functions are also excluded from dyno stats with
this diff
Also print the number of times folded functions were called.
When 2 functions - f1() and f2() are folded, that number
would be min(call_frequency(f1), call_frequency(f2)).
(cherry picked from FBD4399993)
Summary:
Re-worked the way ICF operates. The pass now checks for more than just
call instructions, but also for all references including function
pointers. Jump tables are handled too.
(cherry picked from FBD4372491)
Summary:
An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options:
-simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps
-sctc-mode - mode for simplify conditional tail calls
=always - always perform sctc
=preserve - only perform sctc when branch direction is preserved
=heuristic - use branch prediction data to control sctc
This optimization considers both of the following cases:
foo: ...
jcc L1 original
...
L1: jmp bar # TAILJMP
->
foo: ...
jcc bar iff jcc L1 is expected
...
L1 is unreachable
OR
foo: ...
jcc L2
L1: jmp dest # TAILJMP
L2: ...
->
foo: jncc dest # TAILJMP
L2: ...
L1 is unreachable
For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block.
(cherry picked from FBD3719617)
Summary:
This is part of a series of clean-up patches to make bolt
cleanly compile with clang 4.0. This patch fixes an error where clang
will fail to compile because it does not support passing a
const_iterator to std::vector<T>::emplace(Iter, ...).
(cherry picked from FBD4242546)
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either old or new section for unwinding purposes.
This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge two sections
into one and make sure only the newly created section has
.eh_frame name.
(cherry picked from FBD4203943)
Summary:
AVX-512 disassembler support in LLVM is not quite ready yet.
Before we feel more comfortable about it we disable processing
of all functions that use any EVEX-encoded instructions.
(cherry picked from FBD4028706)
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).
Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.
Without relocations we have to mark multiple-entry functions
as non-simple.
(cherry picked from FBD3950243)
Summary:
Added support for jump tables in code compiled with "-fpic".
Code pattern generated for position-independent jump tables
is quite different, as is the format of the tables.
More details in comments.
Coverage increased slightly for a test, mostly due to the code
coming from external lib that was compiled with "-fpic".
(cherry picked from FBD3940771)
Summary:
Allow UCE when blocks have EH info. Since UCE may remove blocks
that are referenced from debugging info data structures, we don't
actually delete them. We just mark them with an "invalid" index
and store them in a different vector to be cleaned up later once
the BinaryFunction is destroyed. The debugging code just skips
any BBs that have an invalid index.
Eliminating blocks may also expose useless jmp instructions, i.e.
a jmp around a dead block could just be a fallthrough. I've added
a new routine to cleanup these jmps. Although, @maks is working on
changing fixBranches() so that it can be used instead.
(cherry picked from FBD3793259)
Summary:
Add level for "-jump-tables=<n>" option:
1 - all jump tables are output in the same section (default).
2 - basic splitting, if the table is used it is output to hot section
otherwise to cold one.
3 - aggressively split compound jump tables and collect profile for
all entries.
Option "-print-jump-tables" outputs all jump tables for debugging
and/or analyzing purposes. Use with "-jump-tables=3" to get profile
values for every entry in a jump table.
(cherry picked from FBD3912119)
Summary:
Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction.
Also made some other offset related methods private.
(cherry picked from FBD3861968)
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes. Only the top
100 functions are printed. The -print-sorted-by-order option can be
either ascending or descending (descending is the default).
(cherry picked from FBD3898818)
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
turned out that those symbols were marked as zero-sized functions in
symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols were preceding other duplicates, we were not creating
BinaryFunction for them and they were not added as duplicates.
The 2 most prominent functions that were missing for a test were free() and
malloc(). There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.
As a result dyno stats for this test needed an adjustment.
Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.
(cherry picked from FBD3874622)
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.
The option hasn't been tested with optimizations other than block
re-ordering.
Only non-PIC jump tables are supported at the moment.
(cherry picked from FBD3867849)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.
(cherry picked from FBD3851035)
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.
Additonally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.
(cherry picked from FBD3850825)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).
We do require a conditional branch at the end of the basic block if
the block has 2 successors as CFG currently lacks the conditional
code support (it will probably stay that way). We only use this
branch instruction for its conditional code, the destination is
determined by CFG - first successor representing true/taken branch,
while the second successor - false/fall-through branch.
When we reverse the branch condition, the CFG is updated accordingly.
The previous version used to insert jumps after some terminating
instructions sometimes resulting in a larger code than needed. As a
result with the new version 1 extra function becomes overwritten for
HHVM binary.
With this diff we also convert conditional branches with one successor
(result of code from __builtin_unreachable()) into unconditional
jumps.
(cherry picked from FBD3802062)
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.
(cherry picked from FBD3831978)
Summary:
A previous diff accidentally disabled tail call conversion.
Additionally some test cases relied on output of "-v=2". Fix those.
(cherry picked from FBD3823760)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level only warnings and info messages.
The rational behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages I don't have any strong feelings about the others.
(cherry picked from FBD3814259)
Summary:
While creating remember_state/restore_state CFI sequences, we
were always placing remember_state instruction into the first
basic block. However, when we have hot-cold splitting, the cold
part has and independent FDE entry in .eh_frame, and thus the
restore_state instruction was missing its counter part.
The fix is to adjust the basic block that is used for placing
remember_state instruction whenever we see the hot-cold split
boundary.
(cherry picked from FBD3767102)
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.
(cherry picked from FBD3754395)
Summary:
We only need ClusterEdges in reordering algorithm optimized for
branches and the computation is quite resource-hungry, thus it
makes sense to only do it when needed.
Some refactoring too.
(cherry picked from FBD3721107)
Summary:
Add the following info the graphviz CFG dump:
- Edges are labeled with the jmp instruction that leads to that edge.
- Edges include the count and misprediction count.
- Nodes have (offset, BB index, BB layout index)
- Nodes optionally have tooltips which contain the code of the basic block.
(enabled with -dot-tooltip-code)
- Added dashed edges to landing pads.
(cherry picked from FBD3646568)
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambigous. Instead we have BinaryFunction::getPrintName()
used for printing and whenever unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.
There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().
For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.
(cherry picked from FBD3704790)
Summary:
This adds functionality for a more aggressive inlining pass, that can
inline tail calls and functions with more than one basic block.
(cherry picked from FBD3677856)
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.
Annotation is used for associating random data with MCInsts. Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions. Annotations are looked up by string keys.
Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.
The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.
GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.
(cherry picked from FBD3597877)
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.
(cherry picked from FBD3560504)
Summary:
LLVM was missing assembler print string for indirect tail
calls which are synthetic instructions created by us.
(cherry picked from FBD3640197)
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.
The basic public functions are:
- createBasicBlock - create a new block that is not inserted into the CFG.
- insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
- updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
- fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.
There are a number of private helper functions used to implement the above.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3611313)
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to layout this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.
(cherry picked from FBD3606591)
Summary:
Added an ICF pass to BOLT, that can recognize identical functions
and replace references to these functions with references to just one
representative.
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary:
Instructions that load data from the a read-only data section and their
target address can be computed statically (e.g. RIP-relative addressing)
are modified to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.
(cherry picked from FBD3397112)