Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes. Only the top
100 functions are printed. The -print-sorted-by-order option can be
either ascending or descending (descending is the default).
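A minimal sketch of the selection these options describe, assuming functions are compared lexicographically by the listed keys. FunctionStats, sortKey, topFunctions and the key strings are hypothetical stand-ins rather than BOLT's own types, and how BOLT actually combines several keys is not stated in this message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a function and its dyno stats.
struct FunctionStats {
  std::string Name;
  std::map<std::string, uint64_t> Stats; // key -> value; missing keys count as 0
};

// Values of the requested keys, in the order the keys were given.
inline std::vector<uint64_t> sortKey(const FunctionStats &F,
                                     const std::vector<std::string> &Keys) {
  std::vector<uint64_t> Values;
  for (const std::string &Key : Keys) {
    auto It = F.Stats.find(Key);
    Values.push_back(It == F.Stats.end() ? 0 : It->second);
  }
  return Values;
}

// Sort by the requested keys (descending unless Ascending is set) and keep
// at most Limit functions.
inline std::vector<FunctionStats>
topFunctions(std::vector<FunctionStats> Functions,
             const std::vector<std::string> &Keys, bool Ascending = false,
             std::size_t Limit = 100) {
  assert(!Keys.empty() && "at least one sort key is expected");
  std::stable_sort(Functions.begin(), Functions.end(),
                   [&](const FunctionStats &A, const FunctionStats &B) {
                     const auto KA = sortKey(A, Keys);
                     const auto KB = sortKey(B, Keys);
                     return Ascending ? KA < KB : KB < KA;
                   });
  if (Functions.size() > Limit)
    Functions.erase(Functions.begin() + static_cast<std::ptrdiff_t>(Limit),
                    Functions.end());
  return Functions;
}
```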
(cherry picked from FBD3898818)
Summary:
While working on PLT dyno stats I noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
it turned out that those symbols were marked as zero-sized functions in
the symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols preceded the other duplicates, we were not creating
a BinaryFunction for them and they were not added as duplicates.
The 2 most prominent functions that were missing for a test were free() and
malloc(). There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.
As a result dyno stats for this test needed an adjustment.
Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.
(cherry picked from FBD3874622)
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.
The option hasn't been tested with optimizations other than block
re-ordering.
Only non-PIC jump tables are supported at the moment.
(cherry picked from FBD3867849)
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock. I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.
(cherry picked from FBD3860899)
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.
(cherry picked from FBD3851035)
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.
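The scan described above can be sketched as follows. This is an illustration under stated assumptions, not the actual BOLT code: TableCheck and classifyEntries are hypothetical names, and the candidate table is assumed to be already decoded into absolute addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class TableCheck { JumpTable, NotJumpTable, Inconclusive };

// Walk candidate entries until one of them settles the question.
inline TableCheck classifyEntries(const std::vector<uint64_t> &Entries,
                                  uint64_t FuncStart, uint64_t FuncSize) {
  assert(FuncSize > 0 && "function must have a known size");
  const uint64_t FuncEnd = FuncStart + FuncSize; // address past the last insn
  for (uint64_t Target : Entries) {
    if (Target >= FuncStart && Target < FuncEnd)
      return TableCheck::JumpTable;    // address inside the containing function
    if (Target != FuncEnd)
      return TableCheck::NotJumpTable; // some unrelated outside address
    // Target == FuncEnd: possibly __builtin_unreachable(), keep looking.
  }
  return TableCheck::Inconclusive;
}
```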
Additionally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.
(cherry picked from FBD3850825)
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().
Since it's the last pass, it has to do its own UCE.
(cherry picked from FBD3838051)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).
We do require a conditional branch at the end of the basic block if
the block has 2 successors, as the CFG currently lacks condition
code support (it will probably stay that way). We only use this
branch instruction for its condition code; the destination is
determined by the CFG: the first successor represents the true/taken
branch, while the second successor represents the false/fall-through branch.
When we reverse the branch condition, the CFG is updated accordingly.
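A small sketch of the successor convention, assuming a simplified block type. Block, CondCode, invert and reverseBranchCondition are hypothetical names used only for illustration and do not mirror BinaryBasicBlock.

```cpp
#include <cassert>
#include <utility>
#include <vector>

enum class CondCode { Equal, NotEqual };

// Hypothetical simplified block: the condition code lives in the branch
// instruction, the destinations live in the CFG successor list.
struct Block {
  int Id;
  std::vector<int> Successors; // [0] = true/taken, [1] = false/fall-through
  CondCode CC;
};

inline CondCode invert(CondCode CC) {
  return CC == CondCode::Equal ? CondCode::NotEqual : CondCode::Equal;
}

// Reversing the condition must swap the successors so that the first one
// still denotes the taken destination.
inline void reverseBranchCondition(Block &BB) {
  assert(BB.Successors.size() == 2 && "expected a two-way conditional block");
  BB.CC = invert(BB.CC);
  std::swap(BB.Successors[0], BB.Successors[1]);
}
```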
The previous version used to insert jumps after some terminating
instructions sometimes resulting in a larger code than needed. As a
result with the new version 1 extra function becomes overwritten for
HHVM binary.
With this diff we also convert conditional branches with one successor
(result of code from __builtin_unreachable()) into unconditional
jumps.
(cherry picked from FBD3802062)
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.
(cherry picked from FBD3831978)
Summary:
A previous diff accidentally disabled tail call conversion.
Additionally some test cases relied on output of "-v=2". Fix those.
(cherry picked from FBD3823760)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level; only warnings and info messages are.
The rationale behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages I don't have any strong feelings about the others.
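The verbosity rule and stream choice above could look roughly like the sketch below. It uses std::cout and std::cerr as stand-ins for outs() and errs() so that it builds without LLVM; Severity, shouldPrint, streamFor and report are hypothetical helpers, not functions from this diff.

```cpp
#include <cassert>
#include <iostream>
#include <string>

enum class Severity { Error, Warning, Info };

// Errors are always printed; warnings and info messages are printed only
// when their level does not exceed the requested verbosity.
inline bool shouldPrint(Severity S, unsigned MessageLevel, unsigned Verbosity) {
  if (S == Severity::Error)
    return true;
  return MessageLevel <= Verbosity;
}

// std::cout stands in for outs(), std::cerr for errs().
inline std::ostream &streamFor(Severity S) {
  return S == Severity::Info ? std::cout : std::cerr;
}

inline void report(Severity S, unsigned MessageLevel, unsigned Verbosity,
                   const std::string &Text) {
  if (shouldPrint(S, MessageLevel, Verbosity))
    streamFor(S) << Text << '\n';
}
```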
(cherry picked from FBD3814259)
Summary:
While creating remember_state/restore_state CFI sequences, we
were always placing remember_state instruction into the first
basic block. However, when we have hot-cold splitting, the cold
part has an independent FDE entry in .eh_frame, and thus the
restore_state instruction was missing its counterpart.
The fix is to adjust the basic block that is used for placing
remember_state instruction whenever we see the hot-cold split
boundary.
(cherry picked from FBD3767102)
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.
(cherry picked from FBD3754395)
Summary:
We only need ClusterEdges in reordering algorithm optimized for
branches and the computation is quite resource-hungry, thus it
makes sense to only do it when needed.
Some refactoring too.
(cherry picked from FBD3721107)
Summary:
Add the following info to the Graphviz CFG dump:
- Edges are labeled with the jmp instruction that leads to that edge.
- Edges include the count and misprediction count.
- Nodes have (offset, BB index, BB layout index)
- Nodes optionally have tooltips which contain the code of the basic block.
(enabled with -dot-tooltip-code)
- Added dashed edges to landing pads.
(cherry picked from FBD3646568)
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambiguous. Instead we have BinaryFunction::getPrintName()
used for printing, and whenever a unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.
There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().
For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.
(cherry picked from FBD3704790)
Summary:
This adds functionality for a more aggressive inlining pass, that can
inline tail calls and functions with more than one basic block.
(cherry picked from FBD3677856)
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.
Annotation is used for associating arbitrary data with MCInsts. Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions. Annotations are looked up by string keys.
Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.
The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.
GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.
(cherry picked from FBD3597877)
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second had a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.
(cherry picked from FBD3560504)
Summary:
LLVM was missing assembler print string for indirect tail
calls which are synthetic instructions created by us.
(cherry picked from FBD3640197)
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.
The basic public functions are:
- createBasicBlock - create a new block that is not inserted into the CFG.
- insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
- updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
- fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.
There are a number of private helper functions used to implement the above.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3611313)
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to lay out this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.
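The example above works out as follows for a block with exactly two successors. fallThroughWeights is a hypothetical helper; the message does not say how the weight is defined for blocks with more successors, so the sketch covers only the two-successor case.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Weight of each outgoing edge of a two-successor block: the branches saved
// by making that edge the fall-through, minus the branches that the other
// edge then has to take.
inline std::pair<int64_t, int64_t> fallThroughWeights(uint64_t CountFirst,
                                                      uint64_t CountSecond) {
  const int64_t First = static_cast<int64_t>(CountFirst);
  const int64_t Second = static_cast<int64_t>(CountSecond);
  return {First - Second, Second - First};
}
```

With counts 100 for A -> B and 500 for A -> C this yields -400 and +400.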
(cherry picked from FBD3606591)
Summary:
Added an ICF pass to BOLT, that can recognize identical functions
and replace references to these functions with references to just one
representative.
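A rough sketch of the folding idea, assuming each function body is already available as a canonical string. Function and foldIdenticalFunctions are hypothetical names; the real pass compares BinaryFunctions, and its comparison rules are not described in this message.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (name, body): the body is a hypothetical canonical encoding of the code.
using Function = std::pair<std::string, std::string>;

// Map every function name to the representative of its group of identical
// bodies. The first function seen with a given body becomes the representative.
inline std::map<std::string, std::string>
foldIdenticalFunctions(const std::vector<Function> &Functions) {
  std::map<std::string, std::string> BodyToRepresentative;
  std::map<std::string, std::string> Replacement;
  for (const Function &F : Functions) {
    // emplace leaves an existing entry untouched, so the first name wins.
    auto Inserted = BodyToRepresentative.emplace(F.second, F.first);
    Replacement[F.first] = Inserted.first->second;
  }
  return Replacement;
}
```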
(cherry picked from FBD3460297)
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext. I've also added some more debug print functions.
This was split off the ICP diff to simplify it a bit.
(cherry picked from FBD3610690)
Summary:
Instructions that load data from a read-only data section and whose
target address can be computed statically (e.g. RIP-relative addressing)
are converted to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.
(cherry picked from FBD3397112)
Summary:
Loop detection for the CFG data structure. Added a GraphTraits
specialization for BOLT's CFG that allows us to use LLVM's loop
detection interface.
(cherry picked from FBD3604837)
Summary:
Generate short versions of branch instructions by default and rely on
relaxation to produce longer versions when needed.
Also produce short versions of arithmetic instructions if immediate
fits into one byte. This was only triggered once on HHVM binary.
(cherry picked from FBD3591466)
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.
Also, I removed a check that was done twice.
(cherry picked from FBD3519265)
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.
I've implemented the bare minimum for the dumper. It's just nodes+BB labels with
edges. We can add more detailed information as needed/desired.
(cherry picked from FBD3509326)
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.
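One way to resolve the mismatch is to skip the no-ops when looking up the fall-through destination, sketched below. Instruction and fallThroughIndex are hypothetical names, and the message does not spell out the exact fix, so treat this as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded instruction: only the fields this sketch needs.
struct Instruction {
  uint64_t Offset;
  bool IsNoop;
};

// Index of the first instruction after the jump that is not a no-op, i.e.
// where the fall-through basic block starts. Returns Insns.size() if the
// jump is followed only by no-ops.
inline std::size_t fallThroughIndex(const std::vector<Instruction> &Insns,
                                    std::size_t JumpIndex) {
  assert(JumpIndex < Insns.size() && "jump index out of range");
  std::size_t I = JumpIndex + 1;
  while (I < Insns.size() && Insns[I].IsNoop)
    ++I;
  return I;
}
```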
(cherry picked from FBD3496084)
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.
(cherry picked from FBD3473656)
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.
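A minimal sketch of matching by any known name, assuming the profile is indexed by function name. findProfileName is a hypothetical helper and does not reflect the actual .fdata reader interface.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Return the first of the function's names that appears in the profile,
// or an empty string if none of them does.
inline std::string findProfileName(const std::vector<std::string> &Names,
                                   const std::set<std::string> &ProfileNames) {
  for (const std::string &Name : Names)
    if (ProfileNames.count(Name) != 0)
      return Name;
  return std::string();
}
```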
(cherry picked from FBD3466215)
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.
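The check could be sketched as below. ProfiledBranch, Edge and profileMatchesCFG are hypothetical names, and the IsRecursiveCall flag is an assumption: how recursive calls are actually recognized is not described here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical profile record within one function.
struct ProfiledBranch {
  uint64_t From;
  uint64_t To;
  bool IsRecursiveCall; // assumed to be determined elsewhere
};

using Edge = std::pair<uint64_t, uint64_t>; // (from offset, to offset)

// Reject the profile if any branch, other than a recursive call, has no
// corresponding edge in the function CFG.
inline bool profileMatchesCFG(const std::vector<ProfiledBranch> &Profile,
                              const std::set<Edge> &CFGEdges) {
  for (const ProfiledBranch &B : Profile) {
    if (B.IsRecursiveCall)
      continue;
    if (CFGEdges.count(Edge(B.From, B.To)) == 0)
      return false;
  }
  return true;
}
```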
Fix printing instruction offsets in disassembled state.
Allow function to have non-zero execution count even if we don't
have branch information.
(cherry picked from FBD3451596)
Summary: This will help optimization passes that need to modify the CFG after it is constructed. Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.
(cherry picked from FBD3403372)
Summary:
Assembly functions could have no corresponding DW_AT_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.
Eliminated unnecessary data structures and optimized some passes.
For .debug_loc unused location entries are no longer processed
resulting in smaller output files.
Overall it's a small improvement in processing time and memory usage.
(cherry picked from FBD3362540)
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.
(cherry picked from FBD3350727)
Summary:
Splitting option now has different meanings/values. Since landing pads
are almost always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).
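One possible reading of these values, as a sketch only. SplitLevel and shouldSplitBlock are hypothetical names, and both the zero-count requirement for blocks other than landing pads and the InLargeFunction flag are assumptions, since the message does not define them.

```cpp
#include <cassert>
#include <cstdint>

enum SplitLevel : unsigned {
  SplitNone = 0,
  SplitLandingPads = 1,    // cold landing pads only
  SplitLargeFunctions = 2, // plus cold blocks of large functions
  SplitAll = 3             // plus cold blocks of every function
};

// Decide whether a block is a candidate for the cold part of its function.
inline bool shouldSplitBlock(unsigned Level, bool IsLandingPad,
                             uint64_t ExecCount, bool InLargeFunction) {
  if (ExecCount != 0)
    return false; // only zero-count blocks are considered
  if (Level >= SplitLandingPads && IsLandingPad)
    return true;
  if (Level == SplitLargeFunctions)
    return InLargeFunction;
  return Level >= SplitAll;
}
```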
Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.
Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.
(cherry picked from FBD3209369)
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.
It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.
This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.
(cherry picked from FBD3201322)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
(cherry picked from FBD3113181)
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).
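The single-pass approach can be sketched as follows. BranchRecord and countExecutions are hypothetical names, and treating a branch to offset 0 as an entry into the function is an assumption rather than something stated in this message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical parsed profile record.
struct BranchRecord {
  std::string FromFunc;
  uint64_t FromOffset;
  std::string ToFunc;
  uint64_t ToOffset;
  uint64_t Count;
};

// Accumulate execution counts while walking the records once, instead of
// rescanning all branches for every function.
inline std::unordered_map<std::string, uint64_t>
countExecutions(const std::vector<BranchRecord> &Records) {
  std::unordered_map<std::string, uint64_t> ExecCount;
  for (const BranchRecord &R : Records)
    if (R.ToOffset == 0) // assumed: a branch to offset 0 enters the function
      ExecCount[R.ToFunc] += R.Count;
  return ExecCount;
}
```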
Ignore non-symbol to non-symbol branches while parsing fdata.
These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.
Test case had to be modified since it contained irrelevant
branches from PLT to libc.
(cherry picked from FBD3106263)