94 Commits

Author SHA1 Message Date
Bill Nell
510f227cbd BOLT: Add feature to sort functions by dyno stats.
Summary:
Add -print-sorted-by and -print-sorted-by-order command line options.
The first option takes a list of dyno stats keys used to sort functions
that are printed at the end of all optimization passes.  Only the top
100 functions are printed.  The -print-sorted-by-order option can be
either ascending or descending (descending is the default).
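
A minimal sketch of the sorting behavior described above, not BOLT's actual implementation. The function records and the stat key names below are hypothetical; only the top-100 cutoff and the descending default come from the summary.

```python
def top_functions(functions, keys, descending=True, limit=100):
    """Sort function records by a list of dyno stat keys, keep the top `limit`."""
    def sort_key(fn):
        # Compare functions lexicographically by the requested stats.
        return tuple(fn["stats"].get(key, 0) for key in keys)
    return sorted(functions, key=sort_key, reverse=descending)[:limit]

# Hypothetical records; the stat names are placeholders.
functions = [
    {"name": "foo", "stats": {"taken-branches": 10, "executed-instructions": 500}},
    {"name": "bar", "stats": {"taken-branches": 70, "executed-instructions": 200}},
    {"name": "baz", "stats": {"taken-branches": 70, "executed-instructions": 900}},
]
ranked = top_functions(functions, ["taken-branches", "executed-instructions"])
print([fn["name"] for fn in ranked])  # ['baz', 'bar', 'foo']
```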

(cherry picked from FBD3898818)
2016-09-20 20:55:49 -07:00
Maksim Panchenko
62bff426c3 Do not collect dyno stats on functions with stale profile.
Summary:
Dyno stats collected on functions with invalid profile may appear
completely bogus. Skip them.

(cherry picked from FBD3879371)
2016-09-16 13:13:16 -07:00
Maksim Panchenko
2c9bf9afd6 Add PLT dyno stats.
Summary: Get PLT call stats.

(cherry picked from FBD3874799)
2016-09-15 15:47:10 -07:00
Maksim Panchenko
c4e36c1dd6 Fix issue with zero-size duplicate function symbols.
Summary:
While working on PLT dyno stats I've noticed that we were missing
BinaryFunctions for some symbols that were not PLT. Upon closer inspection
turned out that those symbols were marked as zero-sized functions in
symbol table, but they had duplicates with non-zero size. Since the
zero-size symbols were preceding other duplicates, we were not creating
BinaryFunction for them and they were not added as duplicates.

The 2 most prominent functions that were missing for a test were free() and
malloc().  There's not much to optimize in these functions, but they were
contributing quite significantly to dyno stats.

As a result dyno stats for this test needed an adjustment.

Also several assembly functions (e.g. _init()) had zero size, and now we
set the size to the max size and start processing those. It's good for
coverage but will not affect the performance.

(cherry picked from FBD3874622)
2016-09-15 15:47:10 -07:00
Maksim Panchenko
8dbf0e2b3d Add dyno stats for jump tables.
Summary: Add dyno stats for jump tables.

(cherry picked from FBD3871035)
2016-09-15 10:24:22 -07:00
Maksim Panchenko
2f3a859772 Add experimental jump table support.
Summary:
Option "-jump-tables=1" enables experimental support for jump tables.

The option hasn't been tested with optimizations other than block
re-ordering.

Only non-PIC jump tables are supported at the moment.

(cherry picked from FBD3867849)
2016-09-14 16:45:40 -07:00
Bill Nell
7483cd0fa6 BOLT: Clean up interface between BinaryFunction and BinaryBasicBlock.
Summary:
This is just a bit of refactoring to make sure that BinaryFunction goes
through methods to get at the state in BinaryBasicBlock.  I did this so
that changing the way Index/LayoutIndex/Valid works will be easier.

(cherry picked from FBD3860899)
2016-09-13 17:12:00 -07:00
Maksim Panchenko
b0f4031db3 Add cluster randomization layout algorithm.
Summary:
Add "-reorder-blocks=cluster-shuffle" for performance experiments.
Use "-bolt-seed=<N>" to set a randomization seed.

(cherry picked from FBD3851035)
2016-09-11 14:33:58 -07:00
Maksim Panchenko
52bfc3f92f Fix switch table detection. Disassemble all instructions in non-simple functions.
Summary:
Switch table can contain __builtin_unreachable(). As a result,
a compiler may place an entry into a jump table that contains
an address immediately past the last instruction in the function.
Sometimes it may coincide with a start of the next function in
the binary. Thus when we check for switch tables in such cases
we have to check more than a single entry until we see either
an address inside containing function or some address outside
different from the address past the last instruction.
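
The check described above could be sketched roughly as follows. This is an illustration only: the function name, the argument layout, and the result when every entry equals the end address are assumptions, not taken from the BOLT sources.

```python
def looks_like_switch_table(entries, func_start, func_end):
    """Scan candidate table entries.

    func_end is the address immediately past the last instruction
    of the containing function.
    """
    for addr in entries:
        if func_start <= addr < func_end:
            return True   # entry points inside the containing function
        if addr != func_end:
            return False  # entry points at some unrelated outside address
        # addr == func_end: may come from __builtin_unreachable(); keep looking
    return False  # assumption: inconclusive tables are rejected
```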

Additionally, don't stop disassembly after discovering that the
function was not simple. We need to detect all outside
references whenever possible.

(cherry picked from FBD3850825)
2016-09-12 10:12:31 -07:00
Maksim Panchenko
617c6a13b7 Use BB.getNumNonPseudos() in more places.
Summary:
Use BB.getNumNonPseudos() in more places.

Fix analyze_potential script to pass the new parameter.

(cherry picked from FBD3844416)
2016-09-09 14:42:35 -07:00
Maksim Panchenko
c4c518ee9d Rewrite SCTC pass to do UCE and make it the last optimization pass.
Summary:
For now we make SCTC a special pass that runs at the end of all
optimizations and transformations right after fixupBranches().

Since it's the last pass, it has to do its own UCE.

(cherry picked from FBD3838051)
2016-09-08 14:52:26 -07:00
Maksim Panchenko
6bef336cc2 Add dyno stats to BOLT.
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:

BOLT-INFO: program-wide dynostats after optimizations:
  executed forward branches : 109706407 (+8.1%)
  taken forward branches : 13769074 (-55.5%)
  executed backward branches : 24517582 (-25.0%)
  taken backward branches : 15330256 (-27.2%)
  executed unconditional branches : 6009826 (-35.5%)
  function calls : 17192114 (+0.0%)
  executed instructions : 837733057 (-0.4%)
  total branches : 140233815 (-2.3%)
  taken branches : 35109156 (-42.8%)
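
The percentages in parentheses read as relative changes against the pre-optimization value. A small sketch of that arithmetic, assuming this interpretation; the helper name and the sample numbers are hypothetical and are not the values behind the output above:

```python
def format_delta(before, after):
    """Format the relative change from `before` to `after` as e.g. '(+8.1%)'."""
    if before == 0:
        return "(n/a)"  # assumption: no percentage when the baseline is zero
    return f"({(after - before) / before * 100:+.1f}%)"

print(format_delta(1000, 1081))  # (+8.1%)
print(format_delta(200, 100))    # (-50.0%)
print(format_delta(100, 100))    # (+0.0%)
```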

Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.

(cherry picked from FBD3826995)
2016-08-29 21:11:22 -07:00
Maksim Panchenko
17e691915b Make BinaryFunction::fixBranches() more flexible and support CFG updates.
Summary:
The CFG represents "the ultimate source of truth". Transformations
on functions and blocks have to update the CFG and fixBranches() would
make sure the correct branch instructions are inserted at the end of
basic blocks (or removed when necessary).

We do require a conditional branch at the end of the basic block if
the block has 2 successors, as the CFG currently lacks condition
code support (it will probably stay that way). We only use this
branch instruction for its condition code; the destination is
determined by the CFG: the first successor represents the true/taken
branch, and the second successor the false/fall-through branch.

When we reverse the branch condition, the CFG is updated accordingly.
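
A toy model of the successor convention, as a sketch only. The Block class, the condition-code table, and reverse_condition() are hypothetical names for illustration and do not mirror BinaryBasicBlock.

```python
from dataclasses import dataclass, field

# Hypothetical subset of condition codes and their inverses.
INVERT = {"e": "ne", "ne": "e", "l": "ge", "ge": "l"}

@dataclass
class Block:
    name: str
    successors: list = field(default_factory=list)  # [taken, fall-through]
    cond: str = ""  # condition code of the trailing conditional branch

def reverse_condition(block):
    """Invert the branch condition and keep the CFG in sync."""
    assert len(block.successors) == 2, "needs a two-way conditional block"
    block.cond = INVERT[block.cond]
    # The old fall-through becomes the taken successor and vice versa.
    block.successors.reverse()

a = Block("A", ["B", "C"], "e")
reverse_condition(a)
print(a.cond, a.successors)  # ne ['C', 'B']
```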

The previous version used to insert jumps after some terminating
instructions sometimes resulting in a larger code than needed. As a
result with the new version 1 extra function becomes overwritten for
HHVM binary.

With this diff we also convert conditional branches with one successor
(result of code from __builtin_unreachable()) into unconditional
jumps.

(cherry picked from FBD3802062)
2016-08-29 21:11:22 -07:00
Bill Nell
48b55300e0 BOLT: Make most command line options ZeroOrMore.
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.

(cherry picked from FBD3831978)
2016-09-07 14:41:56 -07:00
Maksim Panchenko
1cf200107e Fix tail call conversion and test cases.
Summary:
A previous diff accidentally disabled tail call conversion.

Additionally some test cases relied on output of "-v=2". Fix those.

(cherry picked from FBD3823760)
2016-09-06 13:19:26 -07:00
Bill Nell
c27a6a5c63 Add verbosity level and clean up stream usage.
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages.  Error messages should
never be suppressed by the verbosity level, only warnings and info messages.

The rationale behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().

With the exception of a few of the level 2 messages I don't have any strong feelings about the others.

(cherry picked from FBD3814259)
2016-09-02 14:15:29 -07:00
Maksim Panchenko
43acb6a28a Emit remember_state CFI in the same code region as restore_state.
Summary:
While creating remember_state/restore_state CFI sequences, we
were always placing remember_state instruction into the first
basic block. However, when we have hot-cold splitting, the cold
part has an independent FDE entry in .eh_frame, and thus the
restore_state instruction was missing its counterpart.

The fix is to adjust the basic block that is used for placing
remember_state instruction whenever we see the hot-cold split
boundary.

(cherry picked from FBD3767102)
2016-08-24 14:25:33 -07:00
Maksim Panchenko
97f598fd17 Handling for indirect tail calls.
Summary:
Analyze indirect branches and convert them into indirect
tail calls when possible. We analyze the memory contents
when the address could be calculated statically and also
detect epilogue code.

(cherry picked from FBD3754395)
2016-08-22 14:24:09 -07:00
Maksim Panchenko
a10fb73ab3 Compute ClusterEdges only when necessary.
Summary:
We only need ClusterEdges in reordering algorithm optimized for
branches and the computation is quite resource-hungry, thus it
makes sense to only do it when needed.

Some refactoring too.

(cherry picked from FBD3721107)
2016-08-15 15:37:00 -07:00
Bill Nell
406aa62083 Add additional info to BOLT graphviz CFG dumps.
Summary:
Add the following info to the graphviz CFG dump:
- Edges are labeled with the jmp instruction that leads to that edge.
- Edges include the count and misprediction count.
- Nodes have (offset, BB index, BB layout index)
- Nodes optionally have tooltips which contain the code of the basic block.
  (enabled with -dot-tooltip-code)
- Added dashed edges to landing pads.

(cherry picked from FBD3646568)
2016-07-29 19:18:37 -07:00
Maksim Panchenko
36df6057b0 Refactoring. Mainly NFC.
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambiguous. Instead we have BinaryFunction::getPrintName()
used for printing and whenever unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.

There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().

For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.

(cherry picked from FBD3704790)
2016-08-07 12:35:23 -07:00
Theodoros Kasampalis
32739247eb More aggressive inlining pass
Summary:
This adds functionality for a more aggressive inlining pass, that can
inline tail calls and functions with more than one basic block.

(cherry picked from FBD3677856)
2016-07-29 14:17:06 -07:00
Bill Nell
82d76ae18b Add MCInst annotation mechanism to MCInstrAnalysis class.
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.

Annotation is used for associating random data with MCInsts.  Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions.  Annotations are looked up by string keys.

Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.

The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.

GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.

(cherry picked from FBD3597877)
2016-07-28 10:34:50 -07:00
Theodoros Kasampalis
713e361f36 Fix for correct disassembling of conditional tail calls.
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.

(cherry picked from FBD3560504)
2016-07-13 18:57:40 -07:00
Maksim Panchenko
486ab273c7 Add printing support for indirect tail calls.
Summary:
LLVM was missing assembler print string for indirect tail
calls which are synthetic instructions created by us.

(cherry picked from FBD3640197)
2016-07-28 18:49:48 -07:00
Bill Nell
50e011f4e5 CFG editing functions
Summary:
This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created.

The basic public functions are:
  - createBasicBlock - create a new block that is not inserted into the CFG.
  - insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG.
  - updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout).
  - fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block.

There are a number of private helper functions used to implement the above.

This was split off the ICP diff to simplify it a bit.

(cherry picked from FBD3611313)
2016-07-23 12:50:34 -07:00
Theodoros Kasampalis
ab599fe71a Basic block clustering algorithm for minimizing branches.
Summary:
This algorithm is similar to our main clustering algorithm but uses
a different heuristic for selecting edges to become fall-throughs.
The weight of an edge is calculated as the win in branches if we choose
to layout this edge as a fall-through. For example, the edges A -> B with
execution count 100 and A -> C with execution count 500 (where B and C
are the only successors of A) have weights -400 and +400 respectively.
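
The example works out as follows in a short sketch. For two successors the weight is the edge's own count minus the count of the other edge; extending this to "own count minus all other outgoing counts" for more than two successors is an assumption made here, not something the summary states.

```python
def edge_weights(succ_counts):
    """Weight of laying out each outgoing edge as the fall-through.

    succ_counts maps successor name -> execution count of the edge.
    """
    total = sum(succ_counts.values())
    # own count minus the counts of all other outgoing edges
    return {dst: count - (total - count) for dst, count in succ_counts.items()}

print(edge_weights({"B": 100, "C": 500}))  # {'B': -400, 'C': 400}
```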

(cherry picked from FBD3606591)
2016-07-15 16:11:30 -07:00
Theodoros Kasampalis
a9bb3320ad Identical Code Folding (ICF) pass
Summary:
Added an ICF pass to BOLT, that can recognize identical functions
and replace references to these functions with references to just one
representative.
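
As a rough sketch of the idea, functions with identical bodies can be grouped and mapped to one representative. The data layout is hypothetical, and a real pass has to compare far more than an instruction tuple (operands, referenced symbols, and so on).

```python
def fold_identical(functions):
    """Map each function name to the representative of its identical group.

    functions maps name -> tuple of instruction strings (hypothetical form).
    """
    representative_by_body = {}
    mapping = {}
    for name in sorted(functions):  # sorted for a deterministic representative
        body = functions[name]
        mapping[name] = representative_by_body.setdefault(body, name)
    return mapping

funcs = {
    "get_x": ("mov 8(%rdi), %rax", "ret"),
    "get_y": ("mov 8(%rdi), %rax", "ret"),
    "get_z": ("mov 16(%rdi), %rax", "ret"),
}
print(fold_identical(funcs))
# {'get_x': 'get_x', 'get_y': 'get_x', 'get_z': 'get_z'}
```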

(cherry picked from FBD3460297)
2016-06-09 11:36:55 -07:00
Bill Nell
82401630a2 Factor out instruction printing and size computation.
Summary:
I've factored out the instruction printing and size computation routines to
methods on BinaryContext.  I've also added some more debug print functions.

This was split off the ICP diff to simplify it a bit.

(cherry picked from FBD3610690)
2016-07-23 08:01:53 -07:00
Theodoros Kasampalis
156a55209c Simplification of loads from read-only data sections.
Summary:
Instructions that load data from a read-only data section, and whose
target address can be computed statically (e.g. RIP-relative addressing),
are modified to corresponding instructions that use immediate operands.
We apply the transformation only when the resulting instruction will have
smaller or equal size.

(cherry picked from FBD3397112)
2016-06-03 00:58:11 -07:00
Theodoros Kasampalis
17b846586c Loop detection for BOLT's CFG.
Summary:
Loop detection for the CFG data structure. Added a GraphTraits
specialization for BOLT's CFG that allows us to use LLVM's loop
detection interface.

(cherry picked from FBD3604837)
2016-05-26 10:58:01 -07:00
Maksim Panchenko
bf46263eed Shorten instructions if possible.
Summary:
Generate short versions of branch instructions by default and rely on
relaxation to produce longer versions when needed.

Also produce short versions of arithmetic instructions if immediate
fits into one byte. This was only triggered once on HHVM binary.

(cherry picked from FBD3591466)
2016-07-19 11:19:18 -07:00
Theodoros Kasampalis
c20506c570 Fix in inferFallthroughCounts
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.

Also, I removed a check that was done twice.

(cherry picked from FBD3519265)
2016-07-03 21:30:35 -07:00
Bill Nell
260f6fbdb6 Add option to dump CFGs in (simple) graphviz format during all passes.
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.

I've implemented the bare minimum for the dumper.  It's just nodes+BB labels with
edges. We can add more detailed information as needed/desired.

(cherry picked from FBD3509326)
2016-07-01 08:40:56 -07:00
Theodoros Kasampalis
287fa51324 Fix for ignoring fall-through profile data when jump is followed by no-op
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.
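
One way to picture the fix is to move the recorded fall-through destination past any no-ops before matching it against the CFG. This is a sketch under that assumption; the helper and its offset-to-size map are hypothetical.

```python
def skip_nops(offset, nop_sizes):
    """Advance `offset` past consecutive no-ops.

    nop_sizes maps the offset of each no-op instruction to its size in bytes.
    """
    while offset in nop_sizes:
        offset += nop_sizes[offset]
    return offset

# A 1-byte no-op at 0x10 followed by a 3-byte no-op at 0x11:
print(hex(skip_nops(0x10, {0x10: 1, 0x11: 3})))  # 0x14
```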

(cherry picked from FBD3496084)
2016-06-27 14:51:38 -07:00
Theodoros Kasampalis
d09b00ebff Refactoring of the reordering algorithms
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.

(cherry picked from FBD3473656)
2016-06-16 18:47:57 -07:00
Maksim Panchenko
f1192a7118 Support for multiple function names.
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.

(cherry picked from FBD3466215)
2016-06-10 17:13:05 -07:00
Maksim Panchenko
70f82d9371 Reject profile data for functions that do not match.
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.

Fix printing instruction offsets in disassembled state.

Allow function to have non-zero execution count even if we don't
have branch information.

(cherry picked from FBD3451596)
2016-06-15 18:36:16 -07:00
Bill Nell
980a06265a Revert "Indirect call optimization."
This reverts commit 33966090e18545b64013614e7929ff1bdcdf10d5.

(cherry picked from FBD28110782)
2016-06-08 17:38:13 -07:00
Bill Nell
8bcfd9a392 Indirect call optimization.
(cherry picked from FBD28110629)
2016-06-07 16:27:52 -07:00
Bill Nell
45e2219ae4 Allocate BinaryBasicBlocks with new rather than storing them in the BasicBlocks vector.
Summary: This will help optimization passes that need to modify the CFG after it is constructed.  Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.

(cherry picked from FBD3403372)
2016-06-07 16:27:52 -07:00
Maksim Panchenko
4460da0d81 Improvements for debug info.
Summary:
Assembly functions could have no corresponding DW_TAG_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.

Eliminated unnecessary data structures and optimized some passes.

For .debug_loc unused location entries are no longer processed
resulting in smaller output files.

Overall it's a small processing time improvement and memory improvement.

(cherry picked from FBD3362540)
2016-05-27 20:19:19 -07:00
Theodoros Kasampalis
65ac8bbdf2 Better edge counts for fall through blocks in presence of C++ exceptions.
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.

(cherry picked from FBD3350727)
2016-05-26 15:10:09 -07:00
Theodoros Kasampalis
485f9220b7 Taking LP counts into account for FT count inference
(cherry picked from FBD28110493)
2016-05-24 09:26:25 -07:00
Theodoros Kasampalis
fb5f18b2dc Correctly updating landing pad exec counts.
(cherry picked from FBD28110316)
2016-05-23 16:16:25 -07:00
Maksim Panchenko
43bc4a09ad Changed splitting options and fixed sorting.
Summary:
Splitting option now has different meanings/values. Since landing pads
are mostly always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).

Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.

Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.

(cherry picked from FBD3209369)
2016-04-20 15:31:11 -07:00
Maksim Panchenko
4f44d60947 Special handling for GNU_args_size call frame instruction.
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.

It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.

This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.

(cherry picked from FBD3201322)
2016-04-19 22:00:29 -07:00
Gabriel Poesia
ad344c4387 Group debugging info representation and serialization code.
Summary:
Moved the classes related to representing and serializing DWARF entities into a single
header, DebugData.h.

(cherry picked from FBD3153279)
2016-04-07 15:06:43 -07:00
Gabriel Poesia
ffa9641e16 Update DWARF lexical blocks address ranges.
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.

The following changes were made:

- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
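
To illustrate why ranges need to be expressed relative to basic blocks, here is a minimal sketch that splits an address range into per-block offset ranges. The tuple layout and function name are hypothetical and are not the interface of BasicBlockTable.h or BasicBlockOffsetRanges.h.

```python
def block_relative_ranges(blocks, lo, hi):
    """Split the half-open address range [lo, hi) across basic blocks.

    blocks is a list of (start, end) half-open input address ranges.
    Returns (block_index, begin_offset, end_offset) triples, with offsets
    relative to the start of each intersecting block.
    """
    result = []
    for index, (start, end) in enumerate(blocks):
        begin, stop = max(lo, start), min(hi, end)
        if begin < stop:  # the range overlaps this block
            result.append((index, begin - start, stop - start))
    return result

blocks = [(0, 4), (4, 10), (10, 16)]
# A lexical block that starts and ends in the middle of basic blocks:
print(block_relative_ranges(blocks, 6, 12))  # [(1, 2, 6), (2, 0, 2)]
```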

(cherry picked from FBD3113181)
2016-03-28 17:45:22 -07:00
Maksim Panchenko
595d0885d9 Populate function execution count while parsing fdata.
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).
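
The single-pass idea can be sketched as accumulating counts while each branch record is read, instead of rescanning all branches per function. The record layout is hypothetical, and treating a function's execution count as the sum of branches landing on its entry (offset 0) is an assumption for this sketch.

```python
from collections import defaultdict

def execution_counts(branches):
    """One pass over branch records.

    Each record is (from_func, to_func, to_offset, count), a hypothetical
    stand-in for a parsed fdata line.
    """
    counts = defaultdict(int)
    for _from_func, to_func, to_offset, count in branches:
        if to_offset == 0:  # the branch lands on the function entry
            counts[to_func] += count
    return dict(counts)

records = [
    ("main", "foo", 0, 30),
    ("bar", "foo", 0, 12),
    ("foo", "foo", 8, 99),  # intra-function branch, not an entry
]
print(execution_counts(records))  # {'foo': 42}
```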

Ignore non-symbol to non-symbol branches while parsing fdata.

These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.

Test case had to be modified since it contained irrelevant
branches from PLT to libc.

(cherry picked from FBD3106263)
2016-03-28 11:06:28 -07:00