Commit Graph

81 Commits

Author SHA1 Message Date
Amir Ayupov
c36b71686c Improve cold fragment name matching
Summary:
Fix cold fragment name matching regex by replacing existing
regexes `.*\.cold\..*` and  `.*\.cold`
and combining them into `.*\.cold(\.\d)?`,
applied to restored name (with BOLT-added suffixes stripped)

This allows matching names like "execute_stack_op.cold/1", which
previously weren't recognized.

(cherry picked from FBD24804880)
2020-11-09 12:38:51 -08:00
Amir Ayupov
f86a78a4e7 Lost in rebase: call registerFragment with a reference to TargetBF
Summary: Fixes broken build due to a lost dereferencing

(cherry picked from FBD24799948)
2020-11-06 12:22:22 -08:00
Amir Ayupov
2b09d672ce Conservatively handle jump tables in split functions
Summary:
- Allow jump table entries to point to locations inside the function and its fragments.
Reasoning behind this is that jump table identification has the logic of stopping at entry which belongs to a function different from the one originally referencing jump table. This assumption is invalid for jump tables with entries pointing to both parent function and cold fragments, leading to "unclaimed PC-relative relocations" assertion.

- Add fragment identification heuristic based on function name regex and contiguous jump table entries.
Currently, parent-to-fragment relationship is set up based on interprocedural references – direct references from the parent function. These references don't include references through jump table.
Additionally, some fragments are only reachable through jump table. In that case, in order to fully consume jump table, add parent-to-fragment relationship during `analyzeJumpTable` using the following heuristics:
  1. Fragment is identified as such based on name (contains `.cold.` part), but
  2. Parent function is not set – no direct interprocedural references to that fragment, and
  3. Fragment has the name of the form <parent>.cold(.\d+)

* For split functions with jump table entries spanning parent and fragments, mark parent and all fragments as ignored.

(cherry picked from FBD24456904)
2020-11-06 11:19:03 -08:00
Amir Ayupov
dc48354f71 processInterproceduralReferences: record references to cold fragments as entry points
Summary:
For interprocedural references to fragments, record them as
fragment entry points. Not registering these entry points leads to
UCE removing the blocks and "Undefined temporary symbol"
assertion.

(cherry picked from FBD24511281)
2020-11-06 10:57:47 -08:00
Amir Ayupov
5452287710 Extract BinaryContext::registerFragment
Summary: registerFragment to be reused in adding fragments reachable only through jump tables.

(cherry picked from FBD24656651)
2020-11-06 10:27:33 -08:00
Maksim Panchenko
4f4239ceba [BOLT] Fix C++ exceptions for shared objects
Summary:
Fix several issues to make C++ exceptions work in shared objects:
  * Set MCObjectFileInfo PIC type based on the input binary type.
  * Support indirect (DW_EH_PE_indirect) encoding while writing
    exception Type Table.
  * Use different LPStart value and landing pad encoding for .so's.
  * Disable splitting of exception-handling code for .so's because of
    the new encoding.

(cherry picked from FBD24698765)
2020-11-04 11:44:02 -08:00
Rafael Auler
37921b489a [BOLT] Please sanitizers
Summary:
In BinaryContext, we had StringRef holding a reference to
an r-value std::string. This triggers clang's address sanitizer
warnings. In MCPlusBuilder we had a left shift overflowing a type,
which is undefined behavior. Similarly, in CallGraph, we had a hash
function shifting a negative value, which is also UB. The last two
triggers the UB sanitizer.

(cherry picked from FBD24661045)
2020-10-30 15:11:52 -07:00
Maksim Panchenko
53bd88c7fe [BOLT] Refactor reading of debug line info
Summary:
Match BinaryFunction to a DWARFUnit based on the unit's address ranges
skipping the parsing of DIEs.

(cherry picked from FBD24269325)
2020-10-12 21:04:42 -07:00
Maksim Panchenko
0465d952cc [BOLT] Refactor PatchEntries pass
Summary:
Use injected functions with fixed addresses to patch original function
entries.

(cherry picked from FBD24133890)
2020-10-09 16:06:27 -07:00
Maksim Panchenko
a82cff0f52 [BOLT] Eliminate "shallow" function lookup
Summary:
Whenever we search for a function based on its address in the input
binary, we now always return a corresponding fragment for split
functions. If the user needs an access to the main fragment, they can
call getTopmostFragment().

(cherry picked from FBD23670311)
2020-09-14 15:48:32 -07:00
Maksim Panchenko
62469b5036 [BOLT] Do no map sections with zero address
Summary:
Sections that do not originate from the input binary will have an
input address set to zero and thus do not have to be mapped.

Mapping such sections caused a build time regression in non-relocation
mode.

(cherry picked from FBD23670334)
2020-09-14 14:31:50 -07:00
Amir Ayupov
8f7cb54ae5 Added execution count threshold option
Summary:
Added execution count threshold option (execution-count-threshold) controlling the optimizations that are sensitive to the accuracy of the profiling data:
- BB reordering
- function splitting
- frame opts
- shrink wrapping
- indirect call promotion

(cherry picked from FBD22682171)
2020-07-27 18:07:18 -07:00
Maksim Panchenko
3e795c8a5f [BOLT] Ignore addresses from non-allocatable sections
Summary:
We've accidentally registered TBSS section address with a BinaryContext
resulting in addresses being attributed to it when
getSectionForAddress() was called.

(cherry picked from FBD22369323)
2020-07-06 14:39:44 -07:00
Maksim Panchenko
4aaa8892dd [BOLT] Ignore duplicate relocations
Summary:
lld linker may emit static relocations against addresses that also have
dynamic relocations associated with them. When this happens, BOLT fails
to validate the extracted value at the address.

Read dynamic relocations in the binary and ignore static relocations at
addresses that have a duplicate dynamic relocation.

(cherry picked from FBD22192345)
2020-06-23 12:22:58 -07:00
Maksim Panchenko
db4642d0a6 [BOLT] Support -hot-text in lite mode
Summary: Update special symbol references in functions that are not emitted.

(cherry picked from FBD22120995)
2020-06-18 11:10:41 -07:00
Maksim Panchenko
0ce0bce9e7 [BOLT] Support for lite mode with relocations
Summary:
Add '-lite' support for relocations for improved processing time,
memory consumption, and more resilient processing of binaries with
embedded assembly code.

In lite relocation mode, BOLT will skip full processing of functions
without a profile. It will run scanExternalRefs() on such functions
to discover external references and to create internal relocations
to update references to optimized functions.

Note that we could have relied on the compiler/linker to provide
relocations for function references. However, there's no assurance
that all such references are reported. E.g., the compiler can resolve
inter-procedural references internally, leaving no relocations
for the linker.

The scan process takes about <10 seconds per 100MB of code on modern
hardware. It's a reasonable overhead to live with considering the
flexibility it provides.

If BOLT fails to scan or disassemble a function, .e.g., due to a data
object embedded in code, or an unsupported instruction, it enables a
patching mode to guarantee that the failed function will call
optimized/moved versions of functions. The patching happens at original
function entry points.

'-skip=<func1,func2,...>' option now can be used to skip processing of
arbitrary functions in the relocation mode.

With '-use-old-text' or '-strict' we require all functions to be
processed. As such, it is incompatible with '-lite' option,
and '-skip' option will only disable optimizations of listed
functions, not their disassembly and emission.

(cherry picked from FBD22040717)
2020-06-15 00:15:47 -07:00
Alexander Shaposhnikov
0823882d47 Link functions on MachO
Summary: Add first bits for linking functions on MachO.

(cherry picked from FBD21991721)
2020-06-12 20:16:27 -07:00
Maksim Panchenko
8729171182 [BOLT] Refactor profile-handling code
Summary:
This diff handles several issues related to profile reading and
handling:
  * Unifies interface used by 3 profile readers in ProfileReaderBase.
  * Adds automatic detection of the profile file contents.
  * Removes reader-specific fields from BinaryFunction and BinaryData.
    All the information is stored in instruction annotations.
  * Removes implicit memory dependencies in annotations on profile
    reader instance.
  * Adds lite mode support to YAML reader.
  * Moves profile reading code out of BinaryFunction.

(cherry picked from FBD21601411)
2020-05-07 23:00:29 -07:00
Maksim Panchenko
b62a1774af [BOLT] Cover PIC jump table reference in non-strict mode
Summary:
In non-strict relocation mode it was possible to miss a jump table
reference leading to incorrect code.

(cherry picked from FBD21251467)
2020-04-26 17:51:07 -07:00
Maksim Panchenko
ac36e17a73 [BOLT][BFC] Refactor code for adding secondary function entries
Summary:
In non-relocation mode, the code for marking a function non-simple was
decoupled from the code that added new entry points.  Fix that.

(cherry picked from FBD21264247)
2020-04-27 13:40:53 -07:00
Maksim Panchenko
5296b6d12a [BOLT] Change symbol handling for secondary function entries
Summary:
Some functions could be called at an address inside their function body.
Typically, these functions are written in assembly as C/C++ does not
have a multi-entry function concept. The addresses inside a function
body that could be referenced from outside are called secondary entry
points.

In BOLT we support processing functions with secondary/multiple entry
points. We used to mark basic blocks representing those entry points
with a special flag. There was only one problem - each basic block has
exactly one MCSymbol associated with it, and for the most efficient
processing we prefer that symbol to be local/temporary. However, in
certain scenarios, e.g. when running in non-relocation mode, we need
the entry symbol to be global/non-temporary.

We could create global symbols for secondary points ahead of time when
the entry point is marked in the symbol table. But not all such entries
are properly marked. This means that potentially we could discover an
entry point only after disassembling the code that references it, and
it could happen after a local label was already created at the same
location together with all its references. Replacing the local symbol
and updating the references turned out to be an error-prone process.

This diff takes a different approach. All basic blocks are created with
permanently local symbols. Whenever there's a need to add a secondary
entry point, we create an extra global symbol or use an existing one
at that location. Containing BinaryFunction maps a local symbol of a
basic block to the global symbol representing a secondary entry point.
This way we can tell if the basic block is a secondary entry point,
and we emit both symbols for all secondary entry points. Since secondary
entry points are quite rare, the overhead of this approach is minimal.

Note that the same location could be referenced via local symbol from
inside a function and via global entry point symbol from outside.
This is true for both primary and secondary entry points.

(cherry picked from FBD21150193)
2020-04-19 22:29:54 -07:00
Maksim Panchenko
abda7dc6a7 [BOLT] Fix ICF non-determinism in non-relocation mode
Summary:
ICF may fold functions in arbitrary order when running multi-threaded.
This is fine in relocation mode as we end up with just one function
holding all function symbols.

However, in non-relocation mode we keep all function bodies, and if we
keep merging profiles in non-deterministic order, we end up with
functions with non deterministic profiles. The fix for non-relocation
mode is to not merge profiles as the factual new profile could be
different from the merged one since both function instances are
potentially callable.

Additionally, emit extra symbols for ICF functions in non-relocation
mode to make it possible to track the folding.

(cherry picked from FBD20889866)
2020-04-04 20:12:38 -07:00
Maksim Panchenko
1f3e351a9c [BOLT] Refactor code and data emission code
Summary:
Consolidate code and data emission code in ELF-independent
BinaryEmitter. The high-level interface includes only two
functions emitBinaryContext() and emitFunctionBody() used
by RewriteInstance and BinaryContext respectively.

(cherry picked from FBD20332901)
2020-03-06 15:06:37 -08:00
Maksim Panchenko
cb9c991dcb [BOLT] Remove allow-section-relocations option
Summary: The option is not used. Remove all related code.

(cherry picked from FBD20237859)
2020-03-03 15:51:24 -08:00
Maksim Panchenko
c7e012e145 [BOLT][NFC] Get rid of BestFit parameter
Summary: The parameter is no longer used.

(cherry picked from FBD20236516)
2020-03-03 14:28:42 -08:00
Alexander Shaposhnikov
b0cbb60165 [BOLT] Fix begin decrementing
Summary: Fix begin decrementing.

(cherry picked from FBD20232474)
2020-03-03 13:36:32 -08:00
Rafael Auler
a9d85413ac [BOLT] Emit long nops by default
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.

(cherry picked from FBD19987277)
2020-02-19 16:13:58 -08:00
Maksim Panchenko
9711286858 [BOLT] Get rid of BinarySection::IsLocal
Summary: The flag is no longer used/needed.

(cherry picked from FBD19951571)
2020-02-18 09:20:17 -08:00
Alexander Shaposhnikov
c3c4b15a2e [BOLT] Remove BinaryContext::getFunctionData
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.

(cherry picked from FBD19824368)
2020-02-10 15:35:11 -08:00
Maksim Panchenko
d57513e4ab [BOLT] Fix symbol table issue with ICF
Summary: Not all symbol table entries were updated after ICF.

(cherry picked from FBD19319685)
2020-01-08 13:32:59 -08:00
Maksim Panchenko
ac697b7d3a [BOLT] Replace list of Names with Symbols for BinaryFunction
Summary:
BinaryFunction used to have a list of Names associated with its main
entry point. However, the function is primarily identified by its
corresponding symbol or symbols, and these symbols are available as we
are creating them for a corresponding BinaryData object.

There's also no reason to emit symbols for alternative function names
(aliases), so change the code to only emit needed symbols.

When we emit a cold fragment for a function, only emit one cold symbol
for the fragment instead of one per every main entry symbol/name.

When we match a symbol to an entry point in the function, with this
change we can first go through the list of main entry symbols (now that
they are available).

(cherry picked from FBD19426709)
2020-01-13 11:56:59 -08:00
Alexander Shaposhnikov
7a59783d7a [BOLT] Move createBinaryContext to BinaryContext
Summary:
1. Move createBinaryContext to BinaryContext.
1. Add support for nonlinux triples in createBinaryContext.
2. Remove unnecessary std::move in DWARFRewriter.cpp.

(cherry picked from FBD19421314)
2020-01-15 15:23:45 -08:00
Maksim Panchenko
0283271f29 [BOLT] Do no report error on mismatched instruction encoding
Summary:
When the validation of instruction encoding fails but we are able to
continue processing the binary, do no report an error. Report encoding
format only under `-v=1`.

(cherry picked from FBD19376531)
2020-01-13 11:24:10 -08:00
Maksim Panchenko
45b27d7b44 [BOLT] Get rid of Names in BinaryData
Summary:
For BinaryData, we used to maintain a vector of StringRef names and also
a vector of pointers to MCSymbol's associated with the data. There was
an unnecessary duplication of information and an associated overhead of
keeping it in sync. Fix it by removing Names and using Symbols wherever
Names were used.

Also merge two variants of registerNameAtAddress() and remove
unreachable/dead code in the process.

(cherry picked from FBD19359123)
2020-01-10 16:17:47 -08:00
Maksim Panchenko
088e3c032a [BOLT] Improve handling of secondary function entry points
Summary:
"Fix symbol table entries for secondary entries" diff broke the inliner.

Fix the breakage and make the discovery of secondary entry points more
accurate.

Add ability to BinaryContext::getFunctionForSymbol() to return an entry
point discriminator and use it instead of calling getEntryForSymbol()
and isSecondaryEntry(). This is the preferred way since
getFunctionForSymbol() is thread-safe.

(cherry picked from FBD19295983)
2020-01-06 14:57:15 -08:00
Rafael Auler
de284bc510 [BOLT] Fix symbol table entries for secondary entries
Summary:
Commit "Support full instrumentation" changed the map
SymbolToFunction in BinaryContext to map secondary entries of functions
too. This introduced unexpected behavior in our symbol table rewriting
logic, which caused it to mistakenly write them with the address of the
original function. Fix the behavior of getBinaryFunctionAtAddress to
correct this. Also fix other users of SymbolToFunction to ensure they
are not accidentally using secondary entries when they shouldn't.

(cherry picked from FBD19168319)
2019-12-18 12:14:42 -08:00
Maksim Panchenko
3cc4fc267b [BOLT] Proper support for -trap-avx512 option
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.

(cherry picked from FBD18666427)
2019-11-22 14:53:20 -08:00
Maksim Panchenko
a09659fd54 [BOLT] Refactor markAmbiguousRelocations()
Summary:
Refactor markAmbiguousRelocations() code and move it to BinaryContext.

Also remove a redundant check.

(cherry picked from FBD18623815)
2019-11-18 14:08:17 -08:00
Maksim Panchenko
658f270417 [BOLT] Refactor data PC relocations in BinaryContext
Summary:
We only use locations of PC relocations and ignore the rest of the data.
There's no need to store type and value.

(cherry picked from FBD18623280)
2019-11-19 18:52:08 -08:00
Maksim Panchenko
6796b7216b [BOLT] Fix jump table analysis for non-simple functions
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g. there would be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.

(cherry picked from FBD18422997)
2019-11-10 21:09:01 -08:00
Maksim Panchenko
d5ddb320ef [BOLT] Free memory for CFG after emission
Summary:
Once we emit function code, we no longer need CFG for next phases
that use basic blocks for address-translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.

We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.

(cherry picked from FBD18408954)
2019-10-31 16:54:48 -07:00
Maksim Panchenko
103b0a77cc [BOLT] Fix non-determinism while reading debug info
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.

The fix is to sort the list after it's been created and before assigning
the line table unit.

(cherry picked from FBD17946889)
2019-10-14 17:57:36 -07:00
Maksim Panchenko
e9c6c73bb8 [BOLT][non-reloc] Change function splitting in non-relocation mode
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.

For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.

After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we have emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.

One of the reasons we had the multi-pass rewrite scheme in place, was
that we did not have an ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.

Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.

As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.

New semantics for `-split-functions=<n>`:

  -split-functions - split functions into hot and cold regions
    =0 -   do not split any function
    =1 -   in non-relocation mode only split functions too large to fit
           into original code space
    =2 -   same as 1 (backwards compatibility)
    =3 -   split all functions

(cherry picked from FBD17362607)
2019-09-11 15:42:22 -07:00
Rafael Auler
cc4b2fb614 [BOLT] Efficient edge profiling in instrumented mode
Summary:
Change our edge profiling technique when using instrumentation
to do not instrument every edge. Instead, build the spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.

This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.

(cherry picked from FBD17062773)
2019-08-07 16:09:50 -07:00
Maksim Panchenko
f588d7a6ea [BOLT] Tighter control of jump table detection
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.

This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.

This should affect strict relocation mode only.

(cherry picked from FBD16923335)
2019-08-19 14:06:36 -07:00
Maksim Panchenko
8d5854ef09 [BOLT] Add option to verify instruction encoder/decoder
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.

I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.

(cherry picked from FBD16595415)
2019-07-31 16:03:49 -07:00
Maksim Panchenko
a9b9aa1e02 [BOLT] Add code padding verification
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.

When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.

However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.

Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow to overwrite it.

(cherry picked from FBD16477787)
2019-07-23 20:48:41 -07:00
laith sakka
744a2417dd Run findSubprograms in preprocessDebugInfo in parallel
Summary:
While reading debug info the function findSubprograms
runs on each compilation unit. This diff parallelize that loop
reducing its runtime duration by 70%.

(cherry picked from FBD16362867)
2019-07-17 20:54:53 -07:00
laith sakka
9977b03fea Run reorder blocks in parallel
Summary:
This diff change reorderBasicBlocks pass to run in parallel,
it does so by adding locks to the fix branches function,
and creating temporary MCCodeEmitters when estimating basic block code size.

(cherry picked from FBD16161149)
2019-07-08 12:32:58 -07:00
Rafael Auler
1169f1fdd8 [BOLT] Support duplicating jump tables
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate dump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.

Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.

(cherry picked from FBD16101312)
2019-07-02 16:56:41 -07:00