Commit Graph

62 Commits

Author SHA1 Message Date
Theodoros Kasampalis
c20506c570 Fix in inferFallthroughCounts
Summary:
This fixes the initialization of basic block execution counts, where
we should skip edges to the first basic block but we were not
skipping the corresponding profile info.

Also, I removed a check that was done twice.

(cherry picked from FBD3519265)
2016-07-03 21:30:35 -07:00
Bill Nell
260f6fbdb6 Add option to dump CFGs in (simple) graphviz format during all passes.
Summary:
I noticed the BinaryFunction::viewGraph() method that hadn't been implemented
and decided I could use a simple DOT dumper for CFGs while working on the indirect
call optimization.

I've implemented the bare minimum for the dumper.  It's just nodes+BB labels with
edges. We can add more detailed information as needed/desired.
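
For illustration only, a dumper of this kind might look like the minimal sketch below. It is not the code from this commit: the `Block` struct and the `dumpDot` function are hypothetical names, and the sketch emits nothing beyond labeled nodes and edges.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: a label plus successor indices.
struct Block {
  std::string Label;
  std::vector<std::size_t> Successors;
};

// Render a CFG as a Graphviz digraph: one node per block, one edge per
// successor. Node ids are block indices; node labels carry the BB names.
std::string dumpDot(const std::string &FunctionName,
                    const std::vector<Block> &Blocks) {
  std::ostringstream OS;
  OS << "digraph \"" << FunctionName << "\" {\n";
  for (std::size_t I = 0; I < Blocks.size(); ++I)
    OS << "  n" << I << " [label=\"" << Blocks[I].Label << "\"];\n";
  for (std::size_t I = 0; I < Blocks.size(); ++I) {
    for (std::size_t Succ : Blocks[I].Successors) {
      assert(Succ < Blocks.size() && "successor index out of range");
      OS << "  n" << I << " -> n" << Succ << ";\n";
    }
  }
  OS << "}\n";
  return OS.str();
}
```

The returned string can be written to a `.dot` file and rendered with Graphviz. Labels are not escaped here, so the sketch assumes they contain no double quotes.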

(cherry picked from FBD3509326)
2016-07-01 08:40:56 -07:00
Theodoros Kasampalis
287fa51324 Fix for ignoring fall-through profile data when jump is followed by no-op
Summary:
When a conditional jump is followed by one or more no-ops, the
destination of fall-through branch was recorded as the first no-op in
FuncBranchInfo. However the fall-through basic block after the jump
starts after the no-ops, so the profile data could not match the CFG
and was ignored.
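
The idea behind the fix can be sketched as follows, assuming a hypothetical `Insn` record and a `fallThroughOffset` helper that are not part of the actual change: the recorded fall-through destination should be the first instruction after the jump that is not a no-op.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded instruction: its offset within the function and
// whether it is a no-op.
struct Insn {
  uint64_t Offset;
  bool IsNop;
};

// Given the index of a conditional jump, return the offset where the
// fall-through basic block starts, i.e. the first following instruction
// that is not a no-op. Returns EndOffset if only no-ops (or nothing)
// follow the jump.
uint64_t fallThroughOffset(const std::vector<Insn> &Insns,
                           std::size_t JumpIdx, uint64_t EndOffset) {
  assert(JumpIdx < Insns.size() && "jump index out of range");
  for (std::size_t I = JumpIdx + 1; I < Insns.size(); ++I)
    if (!Insns[I].IsNop)
      return Insns[I].Offset;
  return EndOffset;
}
```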

(cherry picked from FBD3496084)
2016-06-27 14:51:38 -07:00
Theodoros Kasampalis
d09b00ebff Refactoring of the reordering algorithms
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.

(cherry picked from FBD3473656)
2016-06-16 18:47:57 -07:00
Maksim Panchenko
f1192a7118 Support for multiple function names.
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.

(cherry picked from FBD3466215)
2016-06-10 17:13:05 -07:00
Maksim Panchenko
70f82d9371 Reject profile data for functions that do not match.
Summary:
Verify profile data for a function and reject if there are branches
that don't correspond to any branches in the function CFG. Note that
we have to ignore branches resulting from recursive calls.

Fix printing instruction offsets in disassembled state.

Allow function to have non-zero execution count even if we don't
have branch information.

(cherry picked from FBD3451596)
2016-06-15 18:36:16 -07:00
Bill Nell
980a06265a Revert "Indirect call optimization."
This reverts commit 33966090e18545b64013614e7929ff1bdcdf10d5.

(cherry picked from FBD28110782)
2016-06-08 17:38:13 -07:00
Bill Nell
8bcfd9a392 Indirect call optimization.
(cherry picked from FBD28110629)
2016-06-07 16:27:52 -07:00
Bill Nell
45e2219ae4 Allocate BinaryBasicBlocks with new rather than storing them in the BasicBlocks vector.
Summary: This will help optimization passes that need to modify the CFG after it is constructed.  Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created.

(cherry picked from FBD3403372)
2016-06-07 16:27:52 -07:00
Maksim Panchenko
4460da0d81 Improvements for debug info.
Summary:
Assembly functions could have no corresponding DW_TAG_subprogram
entries, yet they are represented in module ranges (and .debug_aranges)
and will have line number information. Make sure we update those.

Eliminated unnecessary data structures and optimized some passes.

For .debug_loc unused location entries are no longer processed
resulting in smaller output files.

Overall it's a small improvement in processing time and memory usage.

(cherry picked from FBD3362540)
2016-05-27 20:19:19 -07:00
Theodoros Kasampalis
65ac8bbdf2 Better edge counts for fall through blocks in presence of C++ exceptions.
Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data.

(cherry picked from FBD3350727)
2016-05-26 15:10:09 -07:00
Theodoros Kasampalis
485f9220b7 Taking LP counts into account for FT count inference
(cherry picked from FBD28110493)
2016-05-24 09:26:25 -07:00
Theodoros Kasampalis
fb5f18b2dc Correctly updating landing pad exec counts.
(cherry picked from FBD28110316)
2016-05-23 16:16:25 -07:00
Maksim Panchenko
43bc4a09ad Changed splitting options and fixed sorting.
Summary:
Splitting option now has different meanings/values. Since landing pads
are mostly always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).

Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.

Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.

(cherry picked from FBD3209369)
2016-04-20 15:31:11 -07:00
Maksim Panchenko
4f44d60947 Special handling for GNU_args_size call frame instruction.
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.

It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.

This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.

(cherry picked from FBD3201322)
2016-04-19 22:00:29 -07:00
Gabriel Poesia
ad344c4387 Group debugging info representation and serialization code.
Summary:
Moved the classes related to representing and serializing DWARF entities into a single
header, DebugData.h.

(cherry picked from FBD3153279)
2016-04-07 15:06:43 -07:00
Gabriel Poesia
ffa9641e16 Update DWARF lexical blocks address ranges.
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.

The following changes were made:

- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assigned an offset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)

(cherry picked from FBD3113181)
2016-03-28 17:45:22 -07:00
Maksim Panchenko
595d0885d9 Populate function execution count while parsing fdata.
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).
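
A minimal sketch of the single-pass approach, under stated assumptions: the `BranchRecord` layout below is a simplified, hypothetical stand-in for an fdata entry, and counting only cross-function branches that land at offset 0 is an assumption made for illustration, not necessarily the rule this commit uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified branch record; the real fdata layout is not
// described in this log.
struct BranchRecord {
  std::string FromFunc;
  std::string ToFunc;
  uint64_t ToOffset;
  uint64_t Count;
};

// Accumulate per-function execution counts in one pass over the records,
// O(#records), instead of rescanning all records for every function.
// Assumption: a branch from another function landing at offset 0 counts
// as one entry into the target function.
std::unordered_map<std::string, uint64_t>
collectExecCounts(const std::vector<BranchRecord> &Records) {
  std::unordered_map<std::string, uint64_t> ExecCount;
  for (const BranchRecord &R : Records)
    if (R.ToOffset == 0 && R.FromFunc != R.ToFunc)
      ExecCount[R.ToFunc] += R.Count;
  return ExecCount;
}
```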

Ignore non-symbol to non-symbol branches while parsing fdata.

These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.

Test case had to be modified since it contained irrelevant
branches from PLT to libc.

(cherry picked from FBD3106263)
2016-03-28 11:06:28 -07:00
Gabriel Poesia
dc7cc1fb18 Fix default line number information for instructions.
Summary:
The line number information generated from a null pointer
was actually valid, which caused new instructions without the line number
information set to have a valid and wrong line number reference. This diff
fixes this by making the null pointer be assigned to an invalid line number
row.

(cherry picked from FBD3048453)
2016-03-14 11:40:52 -07:00
Gabriel Poesia
77a6b72842 BOLT: Read and tie .debug_line info to IR.
Summary:
Reads information in the DWARF .debug_line section using LLVM and
ties every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.

(cherry picked from FBD2989813)
2016-02-25 16:57:07 -08:00
Maksim Panchenko
7f7d4af7e0 Add an option to use PT_GNU_STACK for new segment.
Summary:
Added an option to reuse existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.

Also, all new sections are now properly marked in ELF.

(cherry picked from FBD2943339)
2016-02-12 19:01:53 -08:00
Maksim Panchenko
d1526083fc Rename binary optimizer to BOLT.
Summary:
BOLT - Binary Optimization and Layout Tool replaces FLO.
I'm keeping .fdata extension for "feedback data".

(cherry picked from FBD2908028)
2016-02-05 14:42:04 -08:00
Maksim Panchenko
628d06b1e5 Preserve layout of basic blocks with 0 profile counts.
Summary:
Preserve original layout for basic blocks that have 0 execution
count. Since we don't optimize for size, it's better to rely on
the original input order.
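
One way to picture this, as a sketch rather than the actual implementation: take the order proposed by a reordering algorithm, keep only the blocks with a non-zero count in that order, then append the zero-count blocks in their original input order. The `finalizeLayout` name and the index-based block representation are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Blocks are identified by their index in the original layout.
// Proposed is assumed to be a permutation of 0..N-1 produced by some
// reordering algorithm; ExecCount[I] is the execution count of block I.
std::vector<std::size_t>
finalizeLayout(const std::vector<std::size_t> &Proposed,
               const std::vector<uint64_t> &ExecCount) {
  assert(Proposed.size() == ExecCount.size() && "size mismatch");
  std::vector<std::size_t> Layout;
  Layout.reserve(Proposed.size());
  // Blocks with profile data follow the proposed order.
  for (std::size_t B : Proposed)
    if (ExecCount[B] > 0)
      Layout.push_back(B);
  // Zero-count blocks keep their original relative order.
  for (std::size_t B = 0; B < ExecCount.size(); ++B)
    if (ExecCount[B] == 0)
      Layout.push_back(B);
  return Layout;
}
```

The sketch ignores the constraint that the entry block has to stay first, which a real layout pass would also need to enforce.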

(cherry picked from FBD2875335)
2016-01-21 14:18:30 -08:00
Maksim Panchenko
218c5f0916 Fix a bug with outlining first basic block.
Summary:
We should never outline the first basic block.
Also add an option to accept a file with the list of
functions to optimize.

(cherry picked from FBD2868184)
2016-01-26 16:03:58 -08:00
Maksim Panchenko
89578e2314 Allow to partially split functions with exceptions.
Summary:
We could split functions with exceptions even without creating
a new exception handling table. This limits us to only move
basic blocks that never throw, and are not a start of a
landing pad.

(cherry picked from FBD2862937)
2016-01-22 16:45:39 -08:00
Maksim Panchenko
bbb745efa9 Don't create empty basic blocks. Fix CFI bug.
Summary:
Some basic blocks were created empty because they only contained
alignment nop's. Ignore such nop's before basic block gets created.

Fixed intermittent aborts related to CFI update.

(cherry picked from FBD2844465)
2016-01-19 00:20:06 -08:00
Maksim Panchenko
4a44d187c6 Handle more CFI cases and some.
Summary:
  * Update CFI state for larger range of functions to increase coverage.
  * Issue more warnings indicating reasons for skipping functions.
  * Print top called functions in the binary.

(cherry picked from FBD2839734)
2016-01-16 14:58:22 -08:00
Maksim Panchenko
d9536e6092 Added an option to reverse original basic blocks order.
Summary:
Modified processing of "-reorder-blocks=" option and added an option
to reverse original basic blocks order for testing purposes.

(cherry picked from FBD2829862)
2016-01-13 17:19:40 -08:00
Maksim Panchenko
c9b7e3e09e Write updated LSDA's.
Summary: Write new exception ranges tables (LSDA's) into the output file.

(cherry picked from FBD2828312)
2015-12-18 17:00:46 -08:00
Maksim Panchenko
e2fcb371a8 Ignore functions referencing symbol at 0x0.
Summary:
Binary code could be weird. It could include calls to address 0 and
reference data at 0 (e.g. with lea on x86). LLVM JIT fatals
while resolving relocations against symbols at address 0x0. For now
we will stop emitting such code, i.e. we'll skip functions.

(cherry picked from FBD28109837)
2015-12-16 17:56:49 -08:00
Maksim Panchenko
f7d7a85a24 Turn EH ranges support back on.
Summary:
Changed the way EH info is stored/extracted from call instruction. Make
sure indirect calls work.

(cherry picked from FBD28109629)
2015-12-15 17:06:27 -08:00
Rafael Auler
fb6e8c5d0b Don't touch functions whose internal BBs are targets of interprocedural branches
Summary:
In a test binary, we found 8 cases where code in a function A would jump to the
middle of another function B. In this case, we cannot reorder function B because
this would change instruction offsets and break the program. This is pretty rare
but can happen in code written in assembly.
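
The check can be sketched like this, with hypothetical `FuncRange` and `Branch` types standing in for whatever the tool actually uses: a function is off limits if some branch that originates outside it lands inside its body at any address other than the entry point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical address range of a function: [Start, Start + Size).
struct FuncRange {
  uint64_t Start;
  uint64_t Size;
};

// Hypothetical branch: source and destination addresses.
struct Branch {
  uint64_t From;
  uint64_t To;
};

// True if Addr lies anywhere inside F, entry point included.
bool contains(const FuncRange &F, uint64_t Addr) {
  return Addr >= F.Start && Addr - F.Start < F.Size;
}

// True if any branch coming from outside F targets the interior of F,
// i.e. an address inside F other than its entry point.
bool hasInteriorEntry(const FuncRange &F, const std::vector<Branch> &Branches) {
  for (const Branch &B : Branches)
    if (!contains(F, B.From) && contains(F, B.To) && B.To != F.Start)
      return true;
  return false;
}
```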

(cherry picked from FBD2719850)
2015-12-03 13:29:52 -08:00
Rafael Auler
ccbbb8f8b9 Teach llvm-flo how to split functions into hot and cold regions
Summary:
After basic block reordering, it may be possible that the reordered
function is now larger than the original because of the following reasons:

- jump offsets may change, forcing some jump instructions to use 4-byte
immediate operand instead of the 1-byte, shorter version.
- fall-throughs change, forcing us to emit an extra jump instruction to jump
to the original fall-through at the end of a basic block.

Since we currently do not change function addresses, we need to rewrite the
function back in the binary in the original location. If it doesn't fit, we were
dropping the function.
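
The first point can be made concrete with a small sketch of x86-64 jump encodings. The helper names are hypothetical; the sizes are the standard ones (`EB rel8` and `7x rel8` take 2 bytes, `E9 rel32` takes 5, `0F 8x rel32` takes 6). The displacement is taken as given, measured from the end of the instruction.

```cpp
#include <cassert>
#include <cstdint>

// True if the displacement fits in a signed 8-bit immediate.
bool fitsInRel8(int64_t Disp) {
  return Disp >= INT8_MIN && Disp <= INT8_MAX;
}

// Encoded size in bytes of an unconditional JMP:
// short form EB rel8 (2 bytes) or near form E9 rel32 (5 bytes).
unsigned jmpSize(int64_t Disp) { return fitsInRel8(Disp) ? 2 : 5; }

// Encoded size in bytes of a conditional jump:
// short form 7x rel8 (2 bytes) or near form 0F 8x rel32 (6 bytes).
unsigned jccSize(int64_t Disp) { return fitsInRel8(Disp) ? 2 : 6; }
```

A block that moves more than 127 bytes away from a conditional jump therefore costs 4 extra bytes for that jump alone, which is how a reordered function can outgrow its original slot.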

This patch adds a flag -split-functions that tells llvm-flo to split hot
functions into hot and cold separate regions. The hot region is written back
in the original function location, while the cold region is written in a
separate, far-away region reserved to flo via a linker script.

This patch also adds the logic to create an extra FDE to supply unwinding
information to the cold part of the function. Owing to this, we now need to
rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment
to point to this new .eh_frame_hdr.

(cherry picked from FBD2677996)
2015-11-19 17:59:41 -08:00
Rafael Auler
38dac03e6b Make llvm-flo print dynamic coverage of rewritten functions
Summary:
This is an attempt at determining the hotness of functions we are
rewriting and help detect if we are discarding hot functions. This patch
introduces logic to estimate the number of instructions executed in each
function by using the profile data for branches. It sums the products of
BB frequency and size. Since we can only do this for functions we have
successfully disassembled, created the CFG and annotated with profiling
data, all complex functions that were not disassembled are left out from
this analysis.
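
As a rough sketch of the estimate, assuming block size is measured in instructions (the message does not say whether it is instructions or bytes) and using hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-block profile: how often the block ran and how many
// instructions it contains.
struct BlockProfile {
  uint64_t ExecCount;
  uint64_t NumInsns;
};

// Estimate the number of dynamically executed instructions in a function
// as the sum over its blocks of (execution count * block size).
uint64_t estimateExecutedInsns(const std::vector<BlockProfile> &Blocks) {
  uint64_t Total = 0;
  for (const BlockProfile &B : Blocks)
    Total += B.ExecCount * B.NumInsns;
  return Total;
}
```

Dividing the total for rewritten functions by the total for all analyzed functions would give the dynamic coverage figure.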

(cherry picked from FBD2654985)
2015-11-13 15:27:59 -08:00
Rafael Auler
75798a891b Do not bail on functions with indirect calls
Summary:
Previously, we were marking functions with indirect calls as too
complex to be disassembled, but this was unnecessarily conservative. This patch
removes this restriction.

(cherry picked from FBD2669627)
2015-11-02 09:46:50 -08:00
Rafael Auler
6c851dc2e3 Attempts to fix CFI state after reordering
Summary:
This patch introduces logic to check how the CFI instructions define a
table to help during stack unwinding at exception run time and attempts to fix
any problem in this table that may have been introduced by reordering the basic
blocks. If it fails to fix this problem, the function is marked as not simple
and not eligible for rewriting.

(cherry picked from FBD2633696)
2015-11-08 12:23:54 -08:00
Maksim Panchenko
bc9d6e3b6c Regenerate exception handling information after optimizations.
Summary:
Regenerate exception handling information after optimizations.
Use '-print-eh-ranges' to see CFG with updated ranges.

(cherry picked from FBD2660982)
2015-11-13 14:18:45 -08:00
Maksim Panchenko
be2a19523c Add exception handling information to CFG.
Summary:
Read .gcc_except_table and add information to CFG. Calls have extra operands
indicating there's a possible handler for exceptions and an action. Landing
pad information is recorded in BinaryFunction.

Also convert JMP instructions that are calls into tail calls pseudo
instructions so that they don't miss call instruction analysis.

(cherry picked from FBD2652775)
2015-11-12 18:56:58 -08:00
Rafael Auler
a30d04c3e2 Annotate BinaryFunctions with MCCFIInstructions encoding CFI
Summary:
In order to represent CFI information in our BinaryFunction class, this
patch adds a map of Offsets to CFI instructions. In this way, we make it easy to
check exactly where DWARF CFI information is annotated in the disassembled
function.

(cherry picked from FBD2619216)
2015-11-04 16:48:47 -08:00
Rafael Auler
0e8998713c Extract non-taken branch frequencies from LBR
Summary:
Previously, we inferred all non-taken branch frequencies with the
information we had for taken branches. This patch teaches perf2flo and llvm-flo
how to read and incorporate non-taken branch frequencies directly from the
traces available in LBR data and by disassembling the binary. It still leaves
the inference engine untouched in case we need it to fill out other
fall-throughs.

(cherry picked from FBD2589212)
2015-10-26 15:00:56 -07:00
Rafael Auler
13a520ab30 Implement two cluster layout heuristics
Summary:
Pettis' paper on block layout (PLDI'90) suggests we should order
clusters (or chains, using the paper terminology) using a specific criterion.
This patch implements two distinct ideas for cluster layout that can be
activated using different command-line flags. The first one reflects Pettis'
ideas on minimizing branch mispredictions and the second one is targeted at
reducing I-cache misses, described in the Ispike paper (CGO'04).

(cherry picked from FBD2588693)
2015-10-23 09:38:26 -07:00
Rafael Auler
2539539bde Fixes priority queue ordering in llvm-flo block reordering
Summary:
Fixes a bug which caused the block reordering heuristic to put in the
same cluster hot basic blocks and cold basic blocks, increasing I-cache misses.

(cherry picked from FBD2588203)
2015-10-27 03:04:58 -07:00
Maksim Panchenko
d4d773458c More control over function printing.
Summary:
Can use '-print-*' option to print function at specific stage.
Use '-print-all' to print at every stage.

(cherry picked from FBD2578196)
2015-10-23 15:52:59 -07:00
Maksim Panchenko
7f44331773 Issue warning when relaxed tail call is seen on input.
Summary:
Issue warning when we see a 2-byte tail call. Currently we
will increase the size of these instructions.

(cherry picked from FBD2575520)
2015-10-20 10:51:17 -07:00
Rafael Auler
546c4e6e84 Fix bug in BinaryFunction::fixBranches() in llvm-flo
Summary:
When the ignore-nops patch landed, it exposed a bug in fixBranches()
where it ignored empty BBs. However, we cannot ignore an empty BB when it is
reordered and its fall-through changes. We must update it with a jump to the
original fall-through. This patch fixes this.

(cherry picked from FBD2568244)
2015-10-21 16:25:16 -07:00
Rafael Auler
dc848b5376 Fix entry BB execution count in llvm-flo
Summary:
When we have tailcalls, the execution count for the entry point is
wrongly computed. Fix this.

(cherry picked from FBD2563112)
2015-10-20 16:48:54 -07:00
Rafael Auler
ab63ca9afb Implement unreachable BB elimination in llvm-flo
Summary:
It is important to remove dead blocks to free up space in functions
and allow us to reorder blocks or align branch targets with more
freedom. This patch implements a simple algorithm to delete all basic
blocks that are not reachable from the entry point. Note that C++
exceptions may create "unreachable" blocks, so this option must be
used with care.
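
A simple algorithm of this kind can be sketched as a worklist traversal from the entry block. This is an illustration, not the patch itself: `markReachable` is a hypothetical name, and the optional `ExtraRoots` parameter is an assumption showing one way blocks such as landing pads could be kept alive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Succs[I] lists the successor indices of block I; block 0 is the entry.
// ExtraRoots are additional blocks to treat as reachable (for example,
// landing pads that have no ordinary CFG edge leading to them).
std::vector<bool>
markReachable(const std::vector<std::vector<std::size_t>> &Succs,
              const std::vector<std::size_t> &ExtraRoots = {}) {
  std::vector<bool> Reachable(Succs.size(), false);
  std::vector<std::size_t> Worklist;
  auto Push = [&](std::size_t B) {
    assert(B < Succs.size() && "block index out of range");
    if (!Reachable[B]) {
      Reachable[B] = true;
      Worklist.push_back(B);
    }
  };
  if (!Succs.empty())
    Push(0);
  for (std::size_t Root : ExtraRoots)
    Push(Root);
  // Depth-first traversal: every block popped has its successors marked.
  while (!Worklist.empty()) {
    std::size_t B = Worklist.back();
    Worklist.pop_back();
    for (std::size_t S : Succs[B])
      Push(S);
  }
  return Reachable;
}
```

Blocks whose entry in the result is false are the candidates for deletion.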

(cherry picked from FBD2562637)
2015-10-20 12:47:37 -07:00
Rafael Auler
9f41a0d263 Do not schedule BBs before the entry point
Summary:
SPEC CPU2006 perlbench triggered a bug in our heuristic block
reordering algorithm where a hot edge that targets the entry point (as in a
recursive tail call) would make us try to allocate the call site before the
function entry point. Since we don't update function addresses yet, moving the
entry point will corrupt the program. This patch fixes this.

(cherry picked from FBD2562528)
2015-10-20 12:30:22 -07:00
Rafael Auler
b0115a4536 Teach llvm-flo how to handle two back-to-back JMPs
Summary:
If we have two consecutive JMP instructions and no branches to the
second one, the second one is dead code, but llvm-flo does not handle these
cases properly and puts two JMPs in the same BB. This patch fixes this, putting
the extraneous JMP in a separate block, making it easy for us to detect it is
dead code and remove it later in a separate step.

(cherry picked from FBD2562465)
2015-10-20 10:17:38 -07:00
Maksim Panchenko
85b99eb7b7 Eliminate nop instruction in input and derive alignment.
Summary:
Nop instructions are primarily used for alignment purposes on the input.
We remove all nops when we build CFG and derive alignment of basic blocks
based on existing alignment and a presence of nops before it. This
will not always work as some basic blocks will be naturally aligned
without necessity for nops. However, it's better than random alignment.
We would also add heuristics for BB alignment based on execution profile.

(cherry picked from FBD2561740)
2015-10-20 10:51:17 -07:00