Commit Graph

175 Commits

Author SHA1 Message Date
Bill Nell
ddefc770b0 [BOLT] Refactoring of section handling code
Summary:
This is a big refactoring of the section handling code.  I've removed the SectionInfoMap and NoteSectionInfo and stored all the associated info about sections in BinaryContext and BinarySection classes.  BinarySections should now hold all the info we care about for each section.  They can be initialized from SectionRefs but don't necessarily require one to be created.  There are only one or two spots that needed access to the original SectionRef to work properly.

The trickiest part was making sure RewriteInstance.cpp iterated over the proper sets of sections for each of its different types of processing.  The different sets are broken down roughly into allocatable, non-allocatable, and "registered" (I couldn't think up a better name).  "Registered" means that the section has been updated to include output information, i.e. contents, file offset/address, new size, etc.  It may help to have special iterators on BinaryContext to iterate over the different classes to make things easier.  I can do that if you guys think it is worthwhile.

I found pointee_iterator in the LLVM ADT code and used it for iterating over BBs in BinaryFunction rather than the custom iterator class.
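
A minimal sketch of the iteration change, assuming basic blocks are stored
as a vector of pointers (the surrounding class shape is illustrative):

  #include "llvm/ADT/iterator.h"
  #include <vector>

  class BinaryBasicBlock;

  class BinaryFunction {
    std::vector<BinaryBasicBlock *> BasicBlocks;
  public:
    // pointee_iterator dereferences twice, so callers see references
    // instead of pointers.
    using iterator =
        llvm::pointee_iterator<std::vector<BinaryBasicBlock *>::iterator>;
    iterator begin() { return iterator(BasicBlocks.begin()); }
    iterator end() { return iterator(BasicBlocks.end()); }
  };

  // for (BinaryBasicBlock &BB : Function) { ... }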

(cherry picked from FBD6879086)
2018-02-01 16:33:43 -08:00
Maksim Panchenko
6744f0dbeb [BOLT] Fix jump table placement for non-simple functions
Summary:
When we move a jump table to either hot or cold new section
(-jump-tables=move), we rely on the number of taken branches from the table
to decide if it's hot or cold. However, if the function is non-simple, we
always get a 0 count and always move the table to the cold section.
Instead, we should make a conservative decision based on the execution
count of the function.
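
A minimal sketch of the intended decision (names are hypothetical):

  // With -jump-tables=move: in simple functions, use the taken-branch
  // counts from the table; in non-simple functions, fall back to the
  // function's execution count as a conservative proxy.
  uint64_t Count = Function.isSimple()
                       ? JT.getTakenBranchCount()      // hypothetical
                       : Function.getExecutionCount();
  const bool MoveToHotSection = Count >= opts::HotThreshold; // hypothetical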

(cherry picked from FBD7058127)
2018-02-22 11:20:46 -08:00
Maksim Panchenko
5599c01911 [BOLT] Fixes for new profile
Summary:
Do a better job of recording fall-through branches in new profile mode
(-prof-compat-mode=0). For this we need to record offsets for all
instructions that are last in the containing basic block.

Change the way we convert conditional tail calls. Now we never reverse
the condition. This is required for better profile matching.
The original approach of preserving the direction was controversial
to start with.

Add "-infer-fall-throughs" option (on by default) to allow disabling
inference of fall-through edge counts.

(cherry picked from FBD6994293)
2018-02-13 11:21:59 -08:00
Maksim Panchenko
1298d99a41 [BOLT] Limited "support" for AVX-512
Summary:
In relocation mode, trap on entry to any function that has AVX-512
instructions. This is controlled by the "-trap-avx512" option, which is on
by default. If the option is disabled and an AVX-512 instruction is seen
in relocation mode, then we abort while rewriting the binary.
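
A rough sketch of the intended behavior (the scan and the patching helper
are hypothetical):

  // In relocation mode a function with AVX-512 code cannot be rewritten
  // faithfully, so either trap on entry or abort the rewrite.
  if (BC.HasRelocations && containsAVX512(Function)) { // hypothetical scan
    if (opts::TrapAVX512)
      Function.setTrapOnEntry(); // hypothetical: patch entry with ud2
    else
      llvm::report_fatal_error("AVX-512 instruction in relocation mode");
  }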

(cherry picked from FBD6893165)
2018-02-02 16:07:11 -08:00
Rafael Auler
8a5a30156e [BOLT rebase] Rebase fixes on top of LLVM Feb2018
Summary:
This commit includes all code necessary to make BOLT work again
after the rebase. This includes a redesign of the EHFrame work,
a cherry-pick of the 3dnow disassembly work, compilation error fixes,
and a port of the debug_info work. The macroop fusion feature is not
ported yet.

The rebased version has minor changes to the "executed instructions"
dynostats counter because REP prefixes are considered a part of the
instruction they apply to. Also, some X86 instructions had the "mayLoad"
tablegen property removed, which BOLT uses to identify and account
for loads, thus reducing the total number of loads reported by
dynostats. This was observed in X86::MOVDQUmr. TRAP instructions are
not terminators anymore, changing our CFG. This commit adds compensation
to preserve this old behavior and minimize test changes. debug_info
sections are now slightly larger. The discriminator field in the line
table is slightly different due to a change upstream. New profiles
generated with the other bolt are incompatible with this version
because of different hash values calculated for functions, so they will
be considered 100% stale. This commit changes the corresponding test
to XFAIL so it can be updated. The hash function changes because it
relies on raw opcode values, which change according to the opcodes
described in the X86 tablegen files. When processing HHVM, the rebased
version of bolt was observed to use about 800MB more memory and to be
about 5% slower.

(cherry picked from FBD7078072)
2018-02-06 15:00:23 -08:00
Maksim Panchenko
600cf0ecf6 [BOLT] Fix memory regression
Summary:
This fixes the increased memory consumption introduced in an earlier
diff while I was working on new profiling infra.

The increase came from a delayed release of memory allocated to
intermediate structures used to build the CFG. In this diff we release
them ASAP, and don't keep them for all functions at the same time.

(cherry picked from FBD6890067)
2018-02-02 14:46:21 -08:00
Maksim Panchenko
f85264ae18 [BOLT] Reduce the usage of "Offset" annotation
Summary:
Limiting "Offset" annotation only to instructions that actually
need it, improves the memory consumption on HHVM binary by 1GB.

(cherry picked from FBD6878943)
2018-02-01 14:36:29 -08:00
Bill Nell
2640b4071f [BOLT] Refactoring - add BinarySection class
Summary: Add BinarySection class that is a wrapper around SectionRef.  This is refactoring work for static data reordering.
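
A minimal sketch of the wrapper's shape (the real class carries much more
state, e.g. output contents and offsets):

  #include "llvm/Object/ObjectFile.h"
  using llvm::object::SectionRef;

  class BinarySection {
    SectionRef Section;
  public:
    explicit BinarySection(SectionRef Section) : Section(Section) {}
    // Forward simple queries to the wrapped SectionRef.
    uint64_t getAddress() const { return Section.getAddress(); }
    uint64_t getSize() const { return Section.getSize(); }
    uint64_t getAlignment() const { return Section.getAlignment(); }
  };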

(cherry picked from FBD6792785)
2018-01-23 15:10:24 -08:00
Rafael Auler
907ca25841 [BOLT-AArch64] Support large test binary
Summary:
Rewrite how data/code markers are interpreted, so the code
can have constant islands essentially anywhere. This is necessary to
accommodate custom AArch64 assembly code coming from mozjpeg. Allow
any function to refer to the constant island owned by any other
function. When this happens, we pull the constant island from the
referred function and emit it as our own, so it will live near
the code that refers to it, allowing us to freely reorder functions
and code pieces. Make bolt more strict about not changing anything
in non-simple ARM functions, as we need to preserve offsets in
functions whose jump tables we don't interpret (currently
any ARM function with jump tables is non-simple and is left
untouched).

(cherry picked from FBD6402324)
2017-11-22 16:17:36 -08:00
Maksim Panchenko
b6cb112feb [BOLT] New profile format
Summary:
A new profile that is more resilient to minor binary modifications.

BranchData is eliminated. For calls, the data is converted into instruction
annotations if the profile matches a function. If a profile cannot be matched,
AllCallSites data should contain call site profiles.

The new profile format is YAML, which is quite verbose. It still takes
less space than the older format because we avoid function name repetition.

The plan is to get rid of the old profile format eventually.

merge-fdata does not work with the new format yet.

(cherry picked from FBD6753747)
2017-12-13 23:12:01 -08:00
Rafael Auler
f8f52d01d0 [BOLT-AArch64] Support SPEC17 programs and organize AArch64 tests
Summary:
Add a few new relocation types to support a wider variety of
binaries, add support for constant island duplication (so we can split
functions in large binaries) and make the LongJmp pass really precise with
respect to layout, so we don't miss stub insertions at the correct
places for really large binaries. In LongJmp, introduce "freeze"
annotations so fixBranches won't mess up the jumps we carefully
determined needed a stub.

(cherry picked from FBD6294390)
2017-11-09 16:59:18 -08:00
Maksim Panchenko
d15b93bade [BOLT] Major overhaul of profiling in BOLT
Summary:
Profile reading was tightly coupled with building the CFG. Since I plan
to move to a new profile format that will be associated with the CFG,
it is critical to decouple the two phases.

We now read the profile right after the CFG is constructed, but
before it is "canonicalized", i.e. CTCs will still be there.

After reading the profile, we do a post-processing pass that fixes
the CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.

Another good reason for decoupling is that we can use profile with
CFG to more accurately record fall-through branches during
aggregation.

At the moment we use "Offset" annotations to facilitate locating
the instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format
the offsets will no longer be needed. We might keep them for
the aggregator, but if we have to trust LBR data that might
not be strictly necessary.

I've tried to make changes while keeping backwards compatibility. This makes
it easier to verify correctness of the changes, but it also means
that we lose accuracy of the profile.

Some refactoring is included.

Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.

(cherry picked from FBD6506156)
2017-11-28 09:57:21 -08:00
Maksim Panchenko
b6f7c68a6c [BOLT] Automatically detect and use relocations
Summary:
If relocations are available in the binary, use them by default.
If "-relocs" is specified, then require relocations for further
processing. Use "-relocs=0" to forcefully ignore relocations.

Instead of `opts::Relocs` use `BinaryContext::HasRelocations` to check
for the presence of relocations.
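
A sketch of the selection logic described above (the exact flag plumbing is
an assumption):

  // "-relocs" forces relocation mode, "-relocs=0" forcefully ignores
  // relocations, and with neither flag we use them when present.
  const bool FlagGiven = opts::Relocs.getNumOccurrences() > 0;
  if (FlagGiven && opts::Relocs && !FoundRelocations)
    llvm::report_fatal_error("relocations requested but not present");
  BC->HasRelocations = FoundRelocations && (!FlagGiven || opts::Relocs);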

(cherry picked from FBD6530023)
2017-12-09 21:40:39 -08:00
Maksim Panchenko
2b9bafed83 [BOLT] Consistent DFS ordering for landing pads
Summary:
The list of landing pads in BinaryBasicBlock was sorted by their address
in memory. As a result, the DFS order was not always deterministic.
The change is to store landing pads in the order they appear in invoke
instructions while keeping them unique.
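
A self-contained sketch of the new ordering (helper and container choices
are illustrative):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/SmallVector.h"

  class BinaryBasicBlock;

  // Record each landing pad the first time an invoke refers to it, so the
  // resulting order follows the invoke instructions, not memory addresses.
  template <typename InstrRange, typename GetLPFn>
  llvm::SmallVector<BinaryBasicBlock *, 4>
  collectLandingPads(const InstrRange &Instrs, GetLPFn GetLP) {
    llvm::SmallVector<BinaryBasicBlock *, 4> LandingPads;
    llvm::SmallPtrSet<BinaryBasicBlock *, 4> Seen;
    for (const auto &Instr : Instrs)
      if (BinaryBasicBlock *LP = GetLP(Instr))
        if (Seen.insert(LP).second)
          LandingPads.push_back(LP);
    return LandingPads;
  }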

Also, add Throwers verification to validateCFG().

(cherry picked from FBD6529032)
2017-12-08 20:27:49 -08:00
Maksim Panchenko
10274633ee [BOLT] Options to facilitate debugging
Summary:
Some helpful options:

  -print-dyno-stats-only
    while printing functions, output dyno-stats and skip instructions

  -report-stale
    print a list of functions with a stale profile

(cherry picked from FBD6505141)
2017-12-06 15:45:57 -08:00
Rafael Auler
21eb2139ee Introduce pass to reduce jump tables footprint
Summary:
Add a pass to identify indirect jumps to jump tables and reduce
their entry size from 8 to 4 bytes. For PIC jump tables, it will
convert the PIC code to non-PIC (since BOLT only processes static code,
it makes no sense to use expensive PIC-style jumps in static code). Add
corresponding improvements to the register scavenging pass and add MCInst
matcher machinery.
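
A sketch of the effect on the dispatch sequence (registers are illustrative;
4-byte absolute entries assume targets fit in 32 bits, which holds for the
static code BOLT processes):

  # before: 8-byte absolute entries
  jmpq  *JUMPTABLE(,%index,8)

  # after: 4-byte entries; the 32-bit load zero-extends into %r11
  movl  JUMPTABLE(,%index,4), %r11d
  jmpq  *%r11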

(cherry picked from FBD6421582)
2017-11-02 00:30:11 -07:00
Bill Nell
591e0ef3ba [BOLT] Add timers for non-optimization related phases.
Summary: Add timers for non-optimization related phases.  There are two new options, -time-build for disassembling functions and building CFGs, and -time-rewrite for phases in executeRewritePass().

(cherry picked from FBD6422006)
2017-11-27 18:00:24 -08:00
Rafael Auler
dc23def477 [PERF2BOLT] Fix aggregator wrt traces with REP RET
Summary:
Previously, the perf2bolt aggregator was rejecting traces
finishing with REP RET (a return instruction with a REP prefix) as a
result of the migration from objdump output to the LLVM disassembler,
which decodes REP as a separate instruction. Add code to detect
REP RET and treat it as a single return instruction.

(cherry picked from FBD6417496)
2017-11-27 12:58:21 -08:00
Bill Nell
b2f132c7c2 [RFC] [BOLT] Use iterators for MC branch/call analysis code.
Summary:
Here's an implementation of an abstract instruction iterator for the branch/call
analysis code in MCInstrAnalysis.  I'm posting it up to see what you guys think.
It's a bit sloppy with constness and probably needs more tidying up.

(cherry picked from FBD6244012)
2017-11-04 19:22:05 -07:00
Bill Nell
c4d7460ed6 [BOLT] Improve ICP for virtual method calls and jump tables using value profiling.
Summary:
Use value profiling data to remove the method pointer loads from vtables when doing ICP at virtual function and jump table callsites.

The basic process is the following:
1. Work backwards from the callsite to find the most recent def of the call register.
2. Work back from the call register def to find the instruction where the vtable is loaded.
3. Find out if there is any value profiling data associated with the vtable load.  If so, record all these addresses as potential vtables + method offsets.
4. Since the addresses extracted by #3 will be vtable + method offset, we need to figure out the method offset in order to determine the actual vtable base address.  At this point I virtually execute all the instructions that occur between #3 and #2 that touch the method pointer register.  The result of this execution should be the method offset.
5. Fetch the actual method address from the appropriate data section containing the vtable using the computed method offset.  Make sure that this address maps to an actual function symbol.
6. Try to associate a vtable pointer with each target address in SymTargets.  If every target has a vtable, then this is almost certainly a virtual method callsite.
7. Use the vtable address when generating the promoted call code.  It's basically the same as regular ICP code except that the compare is against the vtable and not the method pointer.  Additionally, the instructions to load up the method are dumped into the cold call block.  A sketch of the promoted call follows this list.
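
A sketch of the promoted call for step 7 (registers, labels, and the single
hot target are illustrative):

  mov   (%rdi), %r10          # load the object's vtable pointer
  cmp   $HOT_VTABLE, %r10     # compare the vtable, not the method pointer
  jne   .Lcold
  call  HOT_METHOD            # direct call; no method load on the hot path
  jmp   .Ldone
  .Lcold:
  mov   METHOD_OFFSET(%r10), %r11  # method load moved to the cold block
  call  *%r11
  .Ldone: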

For jump tables, the basic idea is the same.  I use the memory profiling data to find the hottest slots in the jumptable and then use that information to compute the indices of the hottest entries. We can then compare the index register to the hot index values and avoid the load from the jump table.

Note: I'm assuming the whole call is in a single BB.  According to @rafaelauler, this isn't always the case on ARM, and it isn't always the case on X86 either.  If there are non-trivial arguments that are passed by value, there could be branches in between the setup and the call.  I'm going to leave fixing this until later since it makes things a bit more complicated.

I've also fixed a bug where ICP was introducing a conditional tail call.  I made sure that SCTC fixes these up afterwards.  I have no idea why I made it introduce a CTC in the first place.

(cherry picked from FBD6120768)
2017-10-20 12:11:34 -07:00
Maksim Panchenko
0836fa7d08 [BOLT] Fix handling of RememberState CFI
Summary:
When a RememberState CFI happens to be the last CFI in a basic block, we
used to set the state of the next basic block to the state prior to
executing the RememberState instruction. This contradicts comments in
the annotateCFIState() function and also differs from the behaviour of
getCFIStateAtInstr(). As a result we were getting code like the
following:

  .LBB0121166 (21 instructions, align : 1)
    CFI State : 0
    ....
      0000001a:   !CFI    $1      ; OpOffset Reg6 -16
      0000001a:   !CFI    $2      ; OpRememberState
    ....
    Successors: .Ltmp4167600, .Ltmp4167601
    CFI State: 3

  .Ltmp4167601 (13 instructions, align : 1)
    CFI State : 2
    ....

Notice that the state at the entry of the 2nd basic block is less than
the state at the exit of the previous basic block.

In practice we have never seen basic blocks where RememberState was the
last CFI instruction in the basic block, and hence we've never run into
this issue before.

The fix is to synchronize the handling of a trailing RememberState
instruction between annotateCFIState() and getCFIStateAtInstr().
In the example above, the CFI state at the entry to the second BB will
be 3 after this diff.

(cherry picked from FBD6314916)
2017-11-13 11:05:47 -08:00
Rafael Auler
a3b719e0f9 [BOLT] Fix ASAN bugs
Summary:
Fix a leak in DEBUGRewriter.cpp and an out-of-bounds address
issue in the edit distance calculation.

(cherry picked from FBD6290026)
2017-11-08 14:29:20 -08:00
Rafael Auler
dd6ecdd782 [BOLT-AArch64] Support reordering spec06 gcc relocs
Summary:
Enhance the basic infrastructure for relocation mode for
AArch64 to make a reasonably large program work after reordering (gcc).

Detect jump table patterns and skip optimizing functions with jump
tables in AArch64, as those will require extra future effort to fully
decode. To make these work in relocation mode, we skip changing
the function body and introduce a mode to preserve even the original
nops. By not changing any local offsets in the function, the input
original jump tables should just work.

Functions with no jump tables are optimized with BB reordering. No other
optimizations have been tested.

(cherry picked from FBD6130117)
2017-10-16 11:12:22 -07:00
Rafael Auler
624b2d984a [BOLT-AArch64] Support relocation mode for bzip2
Summary:
As we deal with incomplete addresses in address-computing
sequences of code in AArch64, we found it is easier to handle them in
relocation mode in the presence of relocations.

Incomplete addresses may mislead BOLT into thinking there are
instructions referring to a basic block when, in fact, this may be the
base address of a data reference. If the relocation is present, we can
easily spot such cases.

This diff contains extensions in relocation mode to understand and
deal with AArch64 relocations. It also adds code to process data inside
functions as marked by AArch64 ABI (symbol table entries named "$d").
In our code, this is called constant islands handling. Last, it extends
bughunter with a "cross" mode, in which the host generates the binaries
and the user tests them (uploading them to the target), which is useful
when debugging on AArch64.

(cherry picked from FBD6024570)
2017-09-20 10:43:01 -07:00
Rafael Auler
76d7740cc9 [BOLT-AArch64] Support reordering bzip2 no relocs
Summary:
Add functionality to support reordering bzip2 compiled to
AArch64, with function splitting but without relocations:

 * Expand the AArch64 backend to support inverting branches and
analyzing branches so BOLT's reordering machinery is able to shuffle
blocks and fix branches correctly;
 * Add a new pass named LongJmp to add stubs whenever code needs to
jump to the cold area, when using function splitting, because of the
limited target encoding capability in AArch64 (as a RISC architecture).
A sketch of such a stub follows.
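
A sketch of such a stub, a standard long-branch veneer (labels and the
scratch register are illustrative):

  # hot fragment: the conditional branch cannot reach the cold fragment
  # directly, so it targets a nearby stub instead
  b.eq  .Lstub
  ...
  .Lstub:
  adrp  x16, cold_target            # form the target's page address
  add   x16, x16, :lo12:cold_target
  br    x16                         # register branch: unlimited range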

(cherry picked from FBD5748184)
2017-08-31 11:45:37 -07:00
Rafael Auler
fe6e9b4ab5 [BOLT-AArch64] Support rewriting bzip2
Summary:
Add basic AArch64 read/write capability to be able to
disassemble bzip2 for AArch64 compiled with gcc 5.4.0 and write
it back after going through the basic BOLT pipeline with no block
reordering (NOPs/unreachable blocks get removed).

This is not for relocation mode.

(cherry picked from FBD5701994)
2017-08-24 14:37:35 -07:00
Maksim Panchenko
e838b354ce [BOLT][Refactoring] Move basic block reordering to BinaryPasses
Summary:
Refactor basic block reordering code out of the BinaryFunction.

BinaryFunction::isSplit() is now checking if the first and the last
blocks in the layout belong to the same fragment. As a result,
it no longer returns true for functions that have their cold part
optimized away.
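
A sketch of the new check (member names are illustrative):

  bool BinaryFunction::isSplit() const {
    // Split iff the first and last blocks of the layout live in different
    // fragments; a function whose cold part was optimized away is thus no
    // longer reported as split.
    return !BasicBlocksLayout.empty() &&
           BasicBlocksLayout.front()->isCold() !=
               BasicBlocksLayout.back()->isCold();
  }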

Change the type of the returned "size" from unsigned to size_t.

Fix lines over 80 characters long.

(cherry picked from FBD6250825)
2017-11-06 11:52:31 -08:00
Bill Nell
46866f5fa0 [BOLT] Refactor branch analysis code.
Summary:
Move the indirect branch analysis code from BinaryFunction to MCInstrAnalysis/X86MCTargetDesc.cpp.

In the process of doing this, I've added an MCRegInfo to MCInstrAnalysis which allowed me to remove a bunch of extra method parameters.  I've also had to refactor how BinaryFunction held on to instructions/offsets so that it would be easy to pass a sequence of instructions to the analysis code (rather than a map keyed by offset).

Note: I think there are a bunch of MCInstrAnalysis methods that have a BitVector output parameter that could be changed to a return value since the size of the vector is based on the number of registers, i.e. from MCRegisterInfo.  I haven't done this in order to keep the diff a more manageable size.

(cherry picked from FBD6213556)
2017-11-01 10:26:07 -07:00
Maksim Panchenko
1288c81c9b [BOLT][Refactoring] Change landing pads handling
Summary: Change the way we store and handle landing pads and throwers.

(cherry picked from FBD6169992)
2017-10-26 18:36:30 -07:00
Maksim Panchenko
61e5fbf8c3 [BOLT][Refactoring] Get rid of TailCallTerminatedBlocks, etc.
Summary:
More changes to allow separation of CFG construction and
profile assignment. Misc cleanups.

(cherry picked from FBD6158653)
2017-10-23 23:32:40 -07:00
Maksim Panchenko
1e1833c8a2 [BOLT][Refactoring] Make CTC first class operand, etc.
Summary:
This diff is a preparation for decoupling function disassembly,
profile association, and CFG construction phases.

We used to have multiple ways to mark conditional tail calls with
annotations or TailCallOffsets map. Since CTC information is affecting
the correctness, it is justifiable to have it as a operand class for
instruction with a destination (0 is a valid one).

"Offset" annotation now replaces "EdgeCountData" and
"IndirectBranchData" annotations to extract profile data for any
given instruction.

Inlining for small functions was broken in the presence of
profiled (annotated) instructions and hence I had to remove
"-inline-small-functions" from the test case.

Also fix an issue with the UNDEF section for the created
__hot_start/__hot_end symbols. Now the symbols use the ABS section.

(cherry picked from FBD6087284)
2017-10-12 14:57:11 -07:00
Rafael Auler
42f957bb75 [BOLT] Integrate perf2bolt into llvm-bolt
Summary:
Move the data aggregator logic from our python script to
our C++ LLVM/BOLT libs. This dramatically reduces processing
time for profiling data (from 45 minutes to 5 minutes for HHVM) because
we directly use BOLT as a disassembler in order to validate traces found
in the LBR and to add the fall-through counts. Previously, the python
approach relied on parsing the output of objdump to check traces.

(cherry picked from FBD5761313)
2017-09-01 18:13:51 -07:00
Rafael Auler
9df155ce11 [BOLT] Introduce non-LBR mode
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.

This patch also introduces different options to convert basic block
execution count to edge count, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.

(cherry picked from FBD5664735)
2017-08-02 10:59:33 -07:00
Maksim Panchenko
49d1f5698d [BOLT] PLT optimization
Summary:
Add an option to optimize PLT calls:

  -plt  - optimize PLT calls (requires linking with -znow)
    =none - do not optimize PLT calls
    =hot  - optimize executed (hot) PLT calls
    =all  - optimize all PLT calls

When optimized, the calls are converted to go through the GOT
indirectly. GOT entries are guaranteed to contain a valid
function pointer if lazy binding is disabled - hence the
requirement for the linker's -znow option.
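
A sketch of the transformation (the symbol is illustrative):

  # before: call via the PLT stub
  callq  memcpy@PLT

  # after: call through the GOT entry directly; safe because -znow
  # guarantees the entry holds a resolved address at load time
  callq  *memcpy@GOTPCREL(%rip)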

Note: we could add an entry to .dynamic and drop the requirement
for -znow if we moved .dynamic to a new segment.

(cherry picked from FBD5579789)
2017-08-04 11:21:05 -07:00
Maksim Panchenko
0c07445110 [BOLT] Fix printing of dyno-stats
Summary:
We used to print dyno-stats after instruction lowering,
which was skewing our metrics: for one thing, tail calls were
no longer recognized as calls. The fix is to control
the point at which the dyno-stats printing pass is run and run
it immediately before instruction lowering. In the future we
may decide to run the pass before some other intervening pass.

(cherry picked from FBD5605639)
2017-08-10 13:18:44 -07:00
Rafael Auler
21c48f7d78 Fix profiling for functions with multiple entry points
Summary:
Fix an issue in memcpy where one of its entry points was getting
no profiling data and was wrongly considered cold, being put in the cold
region.

(cherry picked from FBD5569156)
2017-08-02 18:14:01 -07:00
Maksim Panchenko
e4290d083f [BOLT] Disable last basic block assertion.
Summary:
While converting code from __builtin_unreachable() we were asserting
that a basic block with a conditional jump and a single CFG successor
was the last one before converting the jump to an unconditional one.

However, if that code was executed after a conditional tail call
conversion in the same function, the original last basic block
will no longer be the last one in the post-conversion layout.

I'm disabling the assertion since it doesn't seem worth it to add
extra checks for the basic block that used to be the last one.

(cherry picked from FBD5570298)
2017-08-04 19:39:45 -07:00
Maksim Panchenko
ae409f0b27 [BOLT] Better match LTO functions profile.
Summary:
* Improve profile matching for LTO binaries that don't match 100%.
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.

There's an improvement for 100%-matching profiles and for stale LTO
profiles. However, we are still not fully closing the gap with
stale profiles when LTO is enabled.

(NOTE: I haven't updated all test cases yet)

(cherry picked from FBD5529293)
2017-07-17 11:22:22 -07:00
Bohan Ren
eb63a0b295 [BOLT] Expand BOLT report for basic block ordering
Summary:
Add a new option to bolt, "-print-function-statistics=<uint64>",
which prints information about block ordering for the requested number of functions.

(cherry picked from FBD5105323)
2017-05-22 11:04:01 -07:00
Rafael Auler
583790ee22 Fix dynostats for conditional tail calls
Summary:
Don't treat conditional tail calls as branches for dynostats. Count
taken conditional tail calls as calls. Change SCTC to report dynamic
numbers after it is done.

(cherry picked from FBD5203708)
2017-06-07 14:20:39 -07:00
Rafael Auler
d850ca3622 [BOLT] Add shrink wrapping pass
Summary:
Add an implementation for shrink wrapping, a frame optimization
that moves callee-saved register spills from hot prologues to cold
successors.
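
A sketch of the effect (registers and labels are illustrative):

  # before: %rbx is saved in the prologue even though only the cold path
  # clobbers it
  foo:
    push  %rbx
    test  %edi, %edi
    jne   .Lcold
    pop   %rbx
    ret
  .Lcold:
    ...                # code that uses %rbx
    pop   %rbx
    ret

  # after shrink wrapping: the save/restore pair moves to the cold path
  foo:
    test  %edi, %edi
    jne   .Lcold
    ret
  .Lcold:
    push  %rbx
    ...                # code that uses %rbx
    pop   %rbx
    ret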

(cherry picked from FBD4983706)
2017-05-01 16:52:54 -07:00
Maksim Panchenko
6c32079d57 [BOLT] Update addresses for DW_TAG_GNU_call_site and DW_TAG_label.
Summary:
Some DWARF tags (such as GNU_call_site and label) reference instruction
addresses in the input binary. When we update debug info we need to
update these tags too with new addresses.

Also fix base address used for calculation of output addresses in
relocation mode.

(cherry picked from FBD5155814)
2017-05-31 09:36:49 -07:00
Maksim Panchenko
2e744e6867 [BOLT] Emit sorted DWARF ranges and location lists.
Summary:
When producing address ranges and location lists for debug info,
add a post-processing step that sorts them and merges adjacent
entries.
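
A self-contained sketch of the post-processing step:

  #include <algorithm>
  #include <cstdint>
  #include <utility>
  #include <vector>

  using AddressRange = std::pair<uint64_t, uint64_t>; // [start, end)

  // Sort ranges by start address, then merge overlapping or adjacent
  // entries into a single range.
  std::vector<AddressRange> sortAndMerge(std::vector<AddressRange> Ranges) {
    std::sort(Ranges.begin(), Ranges.end());
    std::vector<AddressRange> Merged;
    for (const AddressRange &R : Ranges) {
      if (!Merged.empty() && R.first <= Merged.back().second)
        Merged.back().second = std::max(Merged.back().second, R.second);
      else
        Merged.push_back(R);
    }
    return Merged;
  }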

Fix a memory allocation/free issue for .debug_ranges section.

(cherry picked from FBD5130583)
2017-05-24 15:20:27 -07:00
Maksim Panchenko
2428567f7d [BOLT] Fix no-assertions build.
(cherry picked from FBD5130285)
2017-05-25 10:29:38 -07:00
Rafael Auler
2ee4bbd3c1 [BOLT] Optimize jump tables with hot entries
Summary:
This diff is similar to Bill's diff for optimizing jump tables
(and is built on top of it), but it differs in the strategy used to
optimize the jump table. The previous approach loads the target address
from the jump table and compares it to check if it is a hot target. This
reduces branch mispredictions by promoting the indirect jmp
to a (more predictable) direct jmp.

  load  %r10, JMPTABLE
  cmp   %r10, HOTTARGET
  je    HOTTARGET
  ijmp  [JMPTABLE + %index * scale]

The idea in this diff is instead to improve dcache behavior by avoiding the
load from the jump table, leaving branch mispredictions as a secondary
target. To do this we compare the index used in the indirect jmp and, if
it matches a known hot entry, perform a direct jump to the target.

  cmp  %index, HOTINDEX
  je   CORRESPONDING_TARGET
  ijmp [JMPTABLE + %index * scale]

The downside of this approach is that we may have multiple indices
associated with a single target, but we only have profiling to show
which targets are hot and we have no clue about which indices are hot.

  INDEX    TARGET
  0        4004f8
  8        4004f8
  10       4003d0
  18       4004f8

  Profiling data:
  TARGET   COUNT
  4004f8   10020
  4003d0   17

In this example, we know 4004f8 is hot, but to make a direct call to it
we need to check for indices 0, 8 and 18 -- 3 comparisons instead of 1.

Therefore, once we know a target is hot, we must generate code to
compare against all possible indices associated with this target because
we don't know which index is the hot one (IF there's a hotter index).

  cmp %index, 0
  je  4004f8
  cmp %index, 8
  je  4004f8
  cmp %index, 18
  je  4004f8
  (... up to N comparisons as in --indirect-call-promotion-topn=N )
  ijmp [JMPTABLE + %index * scale]

(cherry picked from FBD5005620)
2017-05-01 14:04:40 -07:00
Bill Nell
3a3bcd767e Don't add useless uncond branch to fallthroughs when running SCTC.
Summary:
SCTC was sometimes adding unconditional branches to fallthrough blocks.
This diff checks to see if the unconditional branch is really necessary, e.g.
it's not to a fallthrough block.

(cherry picked from FBD5098493)
2017-05-19 14:45:46 -07:00
Maksim Panchenko
96adec51eb [BOLT] Rework debug info processing.
Summary:
Multiple improvements to debug info handling:
  * Add support for relocation mode.
  * Speed-up processing.
  * Reduce memory consumption.
  * Bug fixes.

The high-level idea behind the new debug handling is that we don't save
intermediate state for ranges and location lists. Instead we depend
on function and basic block address transformations to update the info
as a final post-processing step.

For HHVM in non-relocation mode the peak memory went down from 55GB to 35GB. Processing time went from over 6 minutes to under 5 minutes.

(cherry picked from FBD5113431)
2017-05-16 09:27:34 -07:00
Bill Nell
4806b13835 [BOLT] Add jump table support to ICP
Summary:
Add jump table support to ICP.  The optimization is basically the same
as ICP for tail calls.  The big difference is that the profiling data
comes from the jump table and the targets are local symbols rather than
global.

I've removed an instruction from ICP for tail calls.  The code used to
have a conditional jump to a block with a direct jump to the target, i.e.

  B1: cmp foo,(%rax)
      jne B3
  B2: jmp foo
  B3: ...

this code is now:

  B1: cmp foo,(%rax)
      je  foo
  B2: ...

The other changes in this diff:
- Move ICP + new jump table support to separate file in Passes.
- Improve the CFG validation to handle jump tables.
- Fix the double jump peephole so that the successor of the modified
  block is updated properly.  Also make sure that any existing branches
  in the block are modified to properly reflect the new CFG.
- Add an invocation of the double jump peephole to SCTC.  This allows
  us to remove a call to peepholes/UCE occurring after fixBranches() in
  the pass manager.
- Miscellaneous cleanups to BOLT output.

(cherry picked from FBD4727757)
2017-03-08 19:58:33 -08:00
Maksim Panchenko
3f42fdf7da [BOLT] Update function address and size in relocation mode.
Summary:
Set function addresses after code emission but before we update
debug info and symbol table entries.

(cherry picked from FBD5029609)
2017-05-08 22:51:36 -07:00
Maksim Panchenko
13c89e6ef1 [BOLT] Fix branch data for __builtin_unreachable().
Summary:
When we have a conditional branch past the end of a function (a result
of a call to __builtin_unreachable()), we replace the branch with a nop,
but keep the branch information for validation purposes. If that branch
has a recorded profile, we mistakenly create an additional successor
to the containing basic block (a 3rd successor).

Instead of adding the branch to the FTBranches list, we should be adding
it to IgnoredBranches.

(cherry picked from FBD4912840)
2017-04-18 23:32:11 -07:00