Commit Graph

39 Commits

Author SHA1 Message Date
Maksim Panchenko
e9c6c73bb8 [BOLT][non-reloc] Change function splitting in non-relocation mode
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.

For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.

After the first attempt, we would know which functions did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we had emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.

One of the reasons we had the multi-pass rewrite scheme in place was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.

Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.

As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.

New semantics for `-split-functions=<n>`:

  -split-functions - split functions into hot and cold regions
    =0 -   do not split any function
    =1 -   in non-relocation mode only split functions too large to fit
           into original code space
    =2 -   same as 1 (backwards compatibility)
    =3 -   split all functions
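
The option semantics above can be sketched as a small decision function for
non-relocation mode. This is an illustrative sketch only: the function name
and its parameters are hypothetical, and "too large to fit" is assumed to
mean that the estimated size exceeds the original code space.

```python
def should_split_non_reloc(level: int, estimated_size: int, original_size: int) -> bool:
    """Decide whether to split a function in non-relocation mode.

    level          -- value passed as -split-functions=<n>
    estimated_size -- estimated size of the optimized, non-split function
    original_size  -- space available at the original function location
    """
    if level == 0:
        return False                           # never split
    if level in (1, 2):                        # 2 is kept for backwards compatibility
        return estimated_size > original_size  # split only if it does not fit
    if level == 3:
        return True                            # split every function
    raise ValueError(f"unsupported -split-functions level: {level}")
```

For example, `should_split_non_reloc(2, 200, 100)` is true because the
function no longer fits, while `should_split_non_reloc(1, 80, 100)` is false.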

(cherry picked from FBD17362607)
2019-09-11 15:42:22 -07:00
Rafael Auler
cc4b2fb614 [BOLT] Efficient edge profiling in instrumented mode
Summary:
Change our edge profiling technique when using instrumentation
so that we do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
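
A minimal sketch of the inference, not the runtime library code. It assumes
a virtual exit-to-entry edge so that inflow equals outflow at every node,
and it picks the spanning tree with a plain DFS; all names are hypothetical.

```python
from collections import defaultdict


def build_spanning_tree(edges, root):
    """Return (parent_edge, visit_order) for a DFS tree over the undirected CFG."""
    adjacency = defaultdict(list)
    for index, (src, dst) in enumerate(edges):
        adjacency[src].append((dst, index))
        adjacency[dst].append((src, index))
    parent_edge = {}   # node -> index of the tree edge to its parent
    visit_order = []   # parents always appear before their children
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        visit_order.append(node)
        for neighbor, index in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                parent_edge[neighbor] = index
                stack.append(neighbor)
    return parent_edge, visit_order


def infer_edge_counts(edges, root, measured):
    """Fill in counts for tree edges from counts of instrumented (non-tree) edges."""
    parent_edge, visit_order = build_spanning_tree(edges, root)
    counts = dict(measured)
    for node in reversed(visit_order):  # bottom-up: children before parents
        if node == root:
            continue
        tree_edge = parent_edge[node]
        inflow = sum(counts[i] for i, (_, dst) in enumerate(edges)
                     if dst == node and i != tree_edge)
        outflow = sum(counts[i] for i, (src, _) in enumerate(edges)
                      if src == node and i != tree_edge)
        src, dst = edges[tree_edge]
        # Flow conservation: total inflow equals total outflow at every node,
        # so the parent edge is the only unknown in the equation.
        counts[tree_edge] = outflow - inflow if dst == node else inflow - outflow
    return counts


# Diamond CFG: A -> {B, C} -> D, plus a virtual D -> A edge closing the flow.
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "A")]
tree, _ = build_spanning_tree(EDGES, "A")
instrumented = [i for i in range(len(EDGES)) if i not in tree.values()]
counts = infer_edge_counts(EDGES, "A", {2: 30, 3: 70})
```

In this example only two of the five edges carry counters; the remaining
three counts are recovered from the flow equations.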

This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.

(cherry picked from FBD17062773)
2019-08-07 16:09:50 -07:00
Maksim Panchenko
f588d7a6ea [BOLT] Tighter control of jump table detection
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.

This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.

This should affect strict relocation mode only.

(cherry picked from FBD16923335)
2019-08-19 14:06:36 -07:00
Maksim Panchenko
8d5854ef09 [BOLT] Add option to verify instruction encoder/decoder
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.

I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.

(cherry picked from FBD16595415)
2019-07-31 16:03:49 -07:00
Maksim Panchenko
a9b9aa1e02 [BOLT] Add code padding verification
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.

When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.

However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.

Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow it to be overwritten.

(cherry picked from FBD16477787)
2019-07-23 20:48:41 -07:00
laith sakka
744a2417dd Run findSubprograms in preprocessDebugInfo in parallel
Summary:
While reading debug info, the function findSubprograms
runs on each compilation unit. This diff parallelizes that loop,
reducing its runtime by 70%.

(cherry picked from FBD16362867)
2019-07-17 20:54:53 -07:00
laith sakka
9977b03fea Run reorder blocks in parallel
Summary:
This diff changes the reorderBasicBlocks pass to run in parallel.
It does so by adding locks to the fix branches function
and creating temporary MCCodeEmitters when estimating basic block code size.

(cherry picked from FBD16161149)
2019-07-08 12:32:58 -07:00
Rafael Auler
1169f1fdd8 [BOLT] Support duplicating jump tables
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate jump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.

Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.

(cherry picked from FBD16101312)
2019-07-02 16:56:41 -07:00
Rafael Auler
8880969ced [BOLT] Restrict creation of jump tables
Summary:
The heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.

(cherry picked from FBD16192205)
2019-07-10 15:41:34 -07:00
Maksim Panchenko
e89ad0db4b [BOLT] Introduce strict relocation mode
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.

In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.

In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.

The strict mode is recommended when BOLT is used with well-formed
code generated by a compiler.

To use the strict mode, add "-strict" on the command line.

Another effect of this diff is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.

Also this diff fixes issues with Clang compiled with -fpic.

(cherry picked from FBD15872849)
2019-06-28 09:21:27 -07:00
Maksim Panchenko
06e7a1e059 [BOLT] Ignore false function references
Summary:
A relocation can have an addend that makes it look as if the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).

Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.

The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function reference.
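
The situation can be illustrated with a small sketch. The section layout,
addresses, and helper names below are made up for the example; only the
idea (symbol address plus a negative addend landing in a different section)
comes from the description above.

```python
# Hypothetical layout: .text precedes .rodata, ranges are half-open.
SECTIONS = {
    ".text": (0x1000, 0x2000),
    ".rodata": (0x2000, 0x3000),
}


def section_of(address):
    """Return the name of the section containing address, or None."""
    for name, (start, end) in SECTIONS.items():
        if start <= address < end:
            return name
    return None


def crosses_section(symbol_section, addend):
    """True if section symbol + addend lands outside the symbol's own section."""
    start, _ = SECTIONS[symbol_section]
    return section_of(start + addend) != symbol_section


# A section relocation against .rodata with a negative addend looks like
# a reference into .text, which could be mistaken for a function reference.
value = SECTIONS[".rodata"][0] - 0x10
```

Here `section_of(value)` reports `.text` even though the relocation is
against `.rodata`, which is the case where the section relocation should be
kept rather than guessed at.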

(cherry picked from FBD16030791)
2019-06-27 03:20:17 -07:00
laith sakka
1ec091e6f5 Parallelize ICF Pass
Summary:
ICF consumes 10-15% of BOLT runtime; for HHVM that is around 45 seconds.
This diff performs some parallelization for the pass to make it faster.
A 60% reduction in the ICF runtime is measured on the parallel version for HHVM.

(cherry picked from FBD15589515)
2019-05-31 16:45:31 -07:00
Maksim Panchenko
9894de0094 [BOLT] Check instruction boundaries while populating jump tables
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().

(cherry picked from FBD15814762)
2019-06-13 15:31:30 -07:00
Maksim Panchenko
9e2ad3f593 [BOLT] Delay populating jump tables
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.

As a result, we no longer have embedded jump tables (i.e. a jump table
that is part of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
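
A rough sketch of the two-phase idea, under the assumption that a table
ends either at the start of the next identified table or at the first entry
that is not a valid target. The data model (a dict of memory words and a
set of valid targets) and all names are hypothetical.

```python
def populate_jump_tables(table_starts, memory, valid_targets, entry_size=8):
    """Populate tables only after all table start addresses are known.

    table_starts  -- addresses identified during the initial disassembly pass
    memory        -- maps an address to the value stored there
    valid_targets -- values accepted as jump table entries
    """
    starts = sorted(table_starts)
    tables = {}
    for position, start in enumerate(starts):
        # The next table's start is a hard boundary, so tables never overlap.
        limit = starts[position + 1] if position + 1 < len(starts) else None
        entries = []
        address = start
        while (limit is None or address < limit) and memory.get(address) in valid_targets:
            entries.append(memory[address])
            address += entry_size
        tables[start] = entries
    return tables


memory = {0x100: 0x10, 0x108: 0x20, 0x110: 0x30, 0x118: 0x40}
tables = populate_jump_tables([0x110, 0x100], memory, {0x10, 0x20, 0x30, 0x40})
```

Without the second start address, the table at `0x100` would swallow all
four entries; with it, each table gets two.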

Fixes facebookincubator/BOLT#56.

(cherry picked from FBD15800427)
2019-06-12 18:21:02 -07:00
Maksim Panchenko
fac6a89c23 [BOLT] Better handling of address references
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.

(cherry picked from FBD15643535)
2019-06-04 15:30:22 -07:00
Maksim Panchenko
be344c8de7 [BOLT] Refactor handling of interproc refs
Summary:
Move handling of interprocedural references to BinaryContext.

Post-process indirect branches immediately after the CFG is built.

This is almost NFC. Since indirect branches are now post-processed
before the profile data is processed it interferes with the way the
profile data in YAML format is handled.

(cherry picked from FBD15456003)
2019-05-22 11:26:58 -07:00
Maksim Panchenko
fee61231ef [BOLT] Move JumpTable management to BinaryContext
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.

(cherry picked from FBD15196017)
2019-05-02 17:42:06 -07:00
Rafael Auler
4e4d39c21c [BOLT] Update symbols for secondary entry points
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.

(cherry picked from FBD15010517)
2019-04-18 16:32:22 -07:00
Brian Gesiak
eba1a67730 Fix casting issues on macOS
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long long`. This is not the same type as is used in many calls
to templated functions that expect the same type. As a result, on macOS,
calls to `std::max` fail because a template function that takes
`uint64_t, unsigned long long` cannot be found.

To work around the issue:

* Specify explicit `std::max` and `std::min` functions where necessary,
  to work around the compiler trying (and failing) to find a suitable
  instantiation.
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.

(cherry picked from FBD15030283)
2019-04-22 11:27:50 -04:00
Maksim Panchenko
99ef4c90c1 [BOLT] Basic support for split functions
Summary:
This adds very basic and limited support for split functions.
In non-relocation mode, split functions are ignored, while their debug
info is properly updated. No support in the relocation mode yet.

Split functions consist of a main body and one or more fragments.
For fragments, the main part is called their parent. Any fragment
could only be entered via its parent or another fragment.

The short-term goal is to correctly update debug information for split
functions, while the long-term goal is to have complete support
including full optimization. Note that if we don't detect split
bodies, we would have to add multiple entry points via tail calls,
which we would rather avoid.

Parent functions and fragments are represented by a `BinaryFunction`
and are marked accordingly. For now they are marked as non-simple, and
thus only supported in non-relocation mode. Once we start building a
CFG, it should be a common graph (i.e. the one that includes all
fragments) in the parent function.

The function discovery is unchanged, except for the detection of
`\.cold\.` pattern in the function name, which automatically marks the
function as a fragment of another function.
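
As a small illustration of the name check, the pattern can be applied with
a regular expression search. The helper name and the sample symbol names
are hypothetical; only the `\.cold\.` pattern is taken from the text above.

```python
import re

# The pattern quoted in the commit message: a literal ".cold." in the name.
COLD_FRAGMENT_PATTERN = re.compile(r"\.cold\.")


def looks_like_fragment(function_name: str) -> bool:
    """True if the symbol name contains the '.cold.' marker."""
    return COLD_FRAGMENT_PATTERN.search(function_name) is not None
```

With this check, `looks_like_fragment("foo.cold.1")` is true and
`looks_like_fragment("foo")` is false. The name only marks a function as a
fragment; it does not identify the parent.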

Because of the local function name ambiguity, we cannot rely on the
function name to establish child fragment and parent relationship.
Instead we rely on disassembly processing.

`BinaryContext::getBinaryFunctionContainingAddress()` now returns a
parent function if an address from its fragment is passed.

There's no jump table support at the moment. Jump tables can have
source and destinations in both fragment and parent.

Parent functions that enter their fragments via C++ exception handling
mechanism are not yet supported.

(cherry picked from FBD14970569)
2019-04-16 10:24:34 -07:00
Maksim Panchenko
a8e05d067d [BOLT] Add interface to extract values from static addresses
(cherry picked from FBD14858028)
2019-04-09 12:29:40 -07:00
Maksim Panchenko
7d89b113d8 [BOLT][NFC] Indentation fix
(cherry picked from FBD14856700)
2019-04-09 11:31:45 -07:00
Maksim Panchenko
7fd487066f [BOLT] Move BinaryFunctions into a BinaryContext and more
Summary:
A long-overdue refactoring that makes interfaces cleaner and less awkward.
Mainly makes the future work way easier.

(cherry picked from FBD14766284)
2019-04-03 15:52:01 -07:00
Maksim Panchenko
163adbec9f [BOLT] Refactor allocatable sections rewrite part
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.

Previously, in non-relocation mode we used to emit a temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.

There was unnecessary padding emitted at the end of the text section. After
fixing this, the output binary becomes smaller.

I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.

This diff temporarily disables "-hot-functions-at-end" option.

(cherry picked from FBD14475693)
2019-03-14 18:51:05 -07:00
Maksim Panchenko
ff6e21290f [BOLT] New inliner implementation
Summary:
Addresses correctness issues related to inlining.
Inlining heuristics are not part of this diff.

(cherry picked from FBD13796888)
2019-01-31 11:23:02 -08:00
Maksim Panchenko
b0f7fddd35 [BOLT] Add method for better function size estimation
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.

(cherry picked from FBD13092139)
2018-11-15 16:02:16 -08:00
Facebook Github Bot
b166ccbea8 [BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis
Summary:
This pull request fixes two compiler warnings:

- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>

(cherry picked from FBD10202092)
2018-10-04 10:46:16 -07:00
Maksim Panchenko
ce508b58c6 [BOLT] Support relocations without symbols
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.

(cherry picked from FBD10054576)
2018-09-21 12:00:20 -07:00
Rafael Auler
bd0b99c45d [BOLT] Change stub-insertion pass for AArch64
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.

This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation where, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.

This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.

(cherry picked from FBD10023158)
2018-09-17 13:36:59 -07:00
Maksim Panchenko
1387a9d761 [BOLT] Keep .text section in file when using old text
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.

Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.

Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).
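
The effect of the page size on the fit check can be sketched as follows.
This is an assumption-laden illustration: the helper names are hypothetical
and the check simply assumes the new code starts at the first page-aligned
address inside the old `.text`.

```python
REGULAR_PAGE = 4 * 1024          # 4KiB
HUGE_PAGE = 2 * 1024 * 1024      # 2MiB


def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def fits_in_old_text(text_start, text_size, code_size, page_size):
    """Check whether code_size bytes fit once the start is page-aligned."""
    aligned_start = align_up(text_start, page_size)
    return aligned_start + code_size <= text_start + text_size
```

For a `.text` starting at `0x401234` with `0x100000` bytes available,
`0x80000` bytes of code fit with 4KiB alignment but not with 2MiB alignment,
because aligning the start to 2MiB already moves it past the end of the
section.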

(cherry picked from FBD10024670)
2018-09-24 20:58:31 -07:00
Maksim Panchenko
53b72d0f2e [BOLT] Ignore symbols from non-allocatable sections
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.

(cherry picked from FBD9666511)
2018-09-05 14:36:52 -07:00
Maksim Panchenko
2511b09985 [BOLT][DWARF] Fix line info for empty CU DIEs
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.

As a solution/workaround, we register "<unknown>" file name for such
debug line tables.

(cherry picked from FBD9526705)
2018-08-27 20:12:59 -07:00
Maksim Panchenko
fe9f8219fa [BOLT] Fix TBSS-related issue
Summary:
The TLS segment provides a template for initializing thread-local storage
for every new thread. It consists of initialized and uninitialized
parts. The uninitialized part of TLS, .tbss, is completely meaningless
from a binary analysis perspective. It doesn't take any space in the
file, or in memory. Note that this is different from a regular .bss
section that takes space in memory.

We should not place .tbss into a list of allocatable sections, otherwise
it may cause conflicts with objects contained in the next section.

(cherry picked from FBD9074056)
2018-07-30 16:30:18 -07:00
Laith Saed Sakka
27f3032447 Add initial function injection support
Summary:
This diff adds the API needed to inject functions using BOLT.
In relocation mode, injected functions are emitted between the cold and the hot functions.
In non-reloc mode, injected functions are emitted in the next text section.

(cherry picked from FBD8715965)
2018-07-08 12:14:08 -07:00
Rafael Auler
35c09dc4dd [BOLT] Add a user friendly error reporting message
Summary:
In case we fail to disassemble or to build the CFG for a
function, print instructions on bug reporting.

(cherry picked from FBD8549737)
2018-06-20 12:03:24 -07:00
Maksim Panchenko
232046f9b2 [Bolt] Reduce verbosity while reporting hash collisions
Summary:
Don't report all data objects with hash collisions by default. Only
report the summary, and use -v=1 for providing the full list.

(cherry picked from FBD8372241)
2018-06-11 17:17:25 -07:00
Bill Nell
706abb6c95 [BOLT] Hash anonymous symbol names
Summary:
This diff replaces the addresses in all the {SYMBOLat,HOLEat,DATAat} symbols with hash values based on the data contained in the symbol.  It should make the profiling data for anonymous symbols robust to address changes.

The only small problem with this approach is that the hashed names for padding symbols of the same size collide frequently.  This shouldn't be a big deal since it would be weird if those symbols were hot.

On a test run with hhvm there were 26 collisions (out of ~338k symbols).  Most of the collisions were from small (2,4,8 byte) objects.
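
A minimal sketch of content-based naming. The hash function (SHA-1,
truncated) and the exact name format are assumptions made for illustration;
the diff itself only states that the address in the name is replaced by a
hash of the symbol's data.

```python
import hashlib


def anonymous_symbol_name(prefix: str, data: bytes) -> str:
    """Build a name such as 'DATAat0x<hash>' from the symbol contents.

    The address is deliberately not an input, so the name survives
    address changes between builds.
    """
    digest = hashlib.sha1(data).hexdigest()[:16]
    return f"{prefix}at0x{digest}"


name_a = anonymous_symbol_name("DATA", b"\x01\x02\x03\x04")
name_b = anonymous_symbol_name("DATA", b"\x01\x02\x03\x05")
# Two zero-filled padding areas of the same size get the same name.
hole_1 = anonymous_symbol_name("HOLE", bytes(8))
hole_2 = anonymous_symbol_name("HOLE", bytes(8))
```

Here `name_a` and `name_b` differ because the contents differ, while
`hole_1` and `hole_2` collide, which mirrors the padding collisions
described above.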

(cherry picked from FBD7134261)
2018-06-06 03:17:32 -07:00
Bill Nell
729da2da22 [BOLT] Static data reordering pass.
Summary:
Enable BOLT to reorder data sections in a binary based on memory
profiling data.

This diff adds a new pass to BOLT that can reorder data sections for
better locality based on memory profiling data.  For now, the algorithm
to order data is primitive and just relies on the frequency of loads to
order the contents of a section.  We could probably do a lot better by
looking at what functions use the hot data and grouping together hot
data that is used by a single function (or cluster of functions).
Block ordering might give some hints on how to order the data better as
well.

The new pass has two basic modes: inplace and split (when inplace is
false).  The default is split since inplace hasn't really been tested
much.  When splitting is on, the cold data is copied to a "cold" version
of the section while the hot data is kept in the original section, e.g.
for .rodata, .rodata will contain the hot data and .bolt.org.rodata will
contain the cold bits.  In inplace mode, the section contents are
reordered inplace.  In either mode, all relocations to data within that
section are updated to reflect new data locations.
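
The frequency-based ordering and the two modes can be sketched like this.
The sketch assumes that "hot" means any symbol with at least one profiled
load; the function and parameter names are hypothetical, and relocation
updates are left out.

```python
def reorder_section(symbols, load_counts, inplace=False):
    """Order data symbols by load frequency.

    symbols     -- symbol names in their original section order
    load_counts -- maps a symbol name to its profiled number of loads
    inplace     -- reorder within the section instead of splitting it
    """
    hot = [s for s in symbols if load_counts.get(s, 0) > 0]
    cold = [s for s in symbols if load_counts.get(s, 0) == 0]
    # sorted() is stable, so equally hot symbols keep their original order.
    hot = sorted(hot, key=lambda s: -load_counts[s])
    if inplace:
        return {"section": hot + cold}
    # Split mode: hot data stays in the original section, cold data moves out.
    return {"hot": hot, "cold": cold}


symbols = ["a", "b", "c", "d"]
profile = {"b": 5, "d": 9}
split_layout = reorder_section(symbols, profile)
inplace_layout = reorder_section(symbols, profile, inplace=True)
```

In split mode the example yields hot symbols `d, b` and cold symbols
`a, c`; in inplace mode the same order is laid out within one section.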

Things to improve:
- The current algorithm is really dumb and doesn't seem to lead to any
  wins.  It certainly could use some improvement.
- Private symbols can have data that leaks over to an adjacent symbol,
  e.g. a string that has a common suffix can start in one symbol and
  leak over (with the common suffix) into the next.  For now, we punt on
  adjacent private symbols.
- Handle ambiguous relocations better.  Section relocations that point
  to the boundary of two symbols will prevent the adjacent symbols from
  being moved because we can't tell which symbol the relocation is for.
- Handle jump tables.  Right now jump table support must be basic if
  data reordering is enabled.
- Handle TLS.  A good amount of data access in some binaries happens
  in TLS.  It would be worthwhile to be able to reorder any TLS
  sections too.
- Handle sections with writeable data.  This hasn't been tested so
  probably won't work.  We could try to prevent false sharing in
  writeable sections as well.
- A pie in the sky goal would be to use DWARF info to reorder types.

(cherry picked from FBD6792876)
2018-04-20 20:03:31 -07:00
Maksim Panchenko
9c6f965616 [BOLT] Getting open-source ready
Summary:
BOLT sources are being moved under tools/llvm-bolt/src
and tools/llvm-bolt will contain more files such as LICENSE.txt,
README.txt, etc.

Remove trailing white spaces from our sources.

Create llvm.patch by running

  > git diff f137ed238db11440f03083b1c88b7ffc0f4af65e include lib > \
    tools/llvm-bolt/llvm.patch

README.txt has instructions on checking out sources and applying the
patch.

(cherry picked from FBD7878380)
2018-05-04 10:10:41 -07:00