Commit Graph

524 Commits

Author SHA1 Message Date
Maksim Panchenko
e50e89be9e [BOLT] Handle R_X86_64_converted_reloc_bit
Summary:
In binutils 2.30 a bfd linker accidentally started modifying some
relocations on output under `-q/--emit-relocs` by turning on
R_X86_64_converted_reloc_bit. As a result, BOLT ignored such
relocations and failed to correctly update the binary.

This diff filters out R_X86_64_converted_reloc_bit from the relocation
type.

(cherry picked from FBD14907832)
2019-04-11 17:11:08 -07:00
Maksim Panchenko
315ae74de3 [BOLT] Include <numeric> for std::iota
Summary: Some compilers require <numeric> header.

(cherry picked from FBD14868132)
2019-04-09 21:22:41 -07:00
Maksim Panchenko
88375d311e [BOLT] Sort basic block successors for printing
Summary:
For easier analysis of the hottest targets of jump tables it helps to
have basic block successors sorted based on the taken frequency.

(cherry picked from FBD14856640)
2019-04-09 11:27:23 -07:00
Maksim Panchenko
a8e05d067d [BOLT] Add interface to extract values from static addresses
(cherry picked from FBD14858028)
2019-04-09 12:29:40 -07:00
Maksim Panchenko
7d89b113d8 [BOLT][NFC] Indentation fix
(cherry picked from FBD14856700)
2019-04-09 11:31:45 -07:00
Rafael Auler
90996eb54b [PERF2BOLT] Print a better message if perf.data lacks LBR
Summary:
If processing the perf.data in LBR mode but the data was
collected without -j, currently we confusingly report all samples
to mismatch the input binary, even though the samples match but
lack LBR info. Change perf2bolt to detect this scenario and print
a helpful message instructing the user to collect data with LBR.

(cherry picked from FBD14817732)
2019-04-05 17:27:25 -07:00
Maksim Panchenko
624a0e810d [DWARF][BOLT] Convert DW_AT_(low|high)_pc to DW_AT_ranges only if necessary
Summary:
While updating DWARF, we used to convert address ranges for functions
into DW_AT_ranges format, even if the ranges were not split and still
had a simple [low, high) form. We had to do this because functions with
contiguous ranges could be sharing an abbrev with non-contiguous range
function, and we had to convert the abbrev.

It turns out, that the excessive usage of DW_AT_ranges may lead to
internal core dumps in gdb in the presence of .gdb_index.
I still don't know the root cause of it, but reducing the number
DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the
issue.

We can keep a simple range for DIEs that are guaranteed not to
share an abbrev with any non-contiguous function. Hence we have to
postpone the update of function ranges until we've seen all DIEs.
Note that DIEs from different compilation units could share the same
abbrev, and hence we have to process DIEs from all compilation units.

(cherry picked from FBD14814043)
2019-04-01 20:26:41 -07:00
Maksim Panchenko
c8a927696c [BOLT] Detect internal references into a middle of instruction
Summary:
Some instructions in assembly-written functions could reference 8-byte
constants from another instructions using 4-byte offsets, presumably to
save a couple of bytes.

Detect such cases, and skip processing such functions until we teach
BOLT how to handle references into a middle of instruction.

(cherry picked from FBD14768212)
2019-04-03 22:31:12 -07:00
Maksim Panchenko
7fd487066f [BOLT] Move BinaryFunctions into a BinaryContext and more
Summary:
A long due refactoring that makes interfaces cleaner and less awkward.
Mainly makes the future work way easier.

(cherry picked from FBD14766284)
2019-04-03 15:52:01 -07:00
Maksim Panchenko
8894853f42 [BOLT][DWARF] Dedup .debug_abbrev section patches
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.

(cherry picked from FBD14691292)
2019-03-29 14:22:54 -07:00
Maksim Panchenko
297d1a4e1a [BOLT] Do not write jump table section headers
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.

(cherry picked from FBD14653282)
2019-03-27 13:58:31 -07:00
Maksim Panchenko
d1b76f2ac2 [BOLT] Allocate enough space past __hot_end for huge pages
Summary:
While using "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and we can get data allocated on this page
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.

(cherry picked from FBD14575100)
2019-03-21 21:13:45 -07:00
Maksim Panchenko
69faf61372 [BOLT] Fix section lookup while deleting symbols
Summary:
While removing redundant local symbols, we used new section index to
lookup the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.

(cherry picked from FBD14552047)
2019-03-20 16:13:09 -07:00
Maksim Panchenko
b8d3dc40ea [BOLT] Use local binding for cold fragment symbols
Summary:
We used to use existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.

The fix is to always emit such symbols with a local binding. This also
means that we have to sort static symbol table before emission to make
sure local symbols precede all others.

(cherry picked from FBD14529265)
2019-03-19 13:46:21 -07:00
Maksim Panchenko
6bcb3389dd [BOLT] Place hot text mover functions into a separate section
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if "-hot-text" options is
specified.

Cold functions are isolated from hot functions even when no function
re-ordering is specified.

(cherry picked from FBD14512628)
2019-03-15 13:43:36 -07:00
Maksim Panchenko
17cd2034f3 [BOLT] Fix debug line info emission
Summary:
GDB does not like if the first entry in the line info table after
end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such address. Note
that everything works fine starting with the first address marked
with is_stmt.

This could happen if the first instruction in the cold section wasn't
marked with is_stmt.

The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.

(cherry picked from FBD14516629)
2019-03-18 19:22:26 -07:00
Maksim Panchenko
61ea19edf8 [BOLT][NFC] Fix compilation warnings
Summary: Get rid of warnings while building with Clang.

(cherry picked from FBD14495620)
2019-03-15 15:06:41 -07:00
Maksim Panchenko
0a55001a0e [BOLT] Fix -hot-functions-at-end option
Summary: Make "-hot-functions-at-end" option work again.

(cherry picked from FBD14476242)
2019-03-14 20:32:04 -07:00
Maksim Panchenko
163adbec9f [BOLT] Refactor allocatable sections rewrite part
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.

Previously, in non-relocation mode we used to emit temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.

There was unnecessary padding emitted at the end of text section. After
fixing this, the output binary becomes smaller.

I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.

This diff temporarily disables "-hot-functions-at-end" option.

(cherry picked from FBD14475693)
2019-03-14 18:51:05 -07:00
Maksim Panchenko
a9e64947c5 [NFC][BOLT] Move ExecutableFileMemoryManager into its own file
(cherry picked from FBD14474800)
2019-03-14 18:49:40 -07:00
Rafael Auler
c593563d1f Do not assert on addresses read from processIndirectBranch
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.

(cherry picked from FBD14432821)
2019-03-12 16:36:35 -07:00
Maksim Panchenko
0c704eb75a [BOLT-HEATMAP] Initial heat map implementation
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.

  llvm-bolt heatmap -p perf.data <executable>

    -block-size=<uint> - size of a heat map block in bytes (default 64)
    -line-size=<uint>  - number of entries per line (default 256)
    -max-address=<uint> - maximum address considered valid for heatmap
                          (default 4GB)
    -o=<string>        - heatmap output file (default stdout)

(cherry picked from FBD13969992)
2019-02-05 15:28:19 -08:00
Maksim Panchenko
ff6e21290f [BOLT] New inliner implementation
Summary:
Addresses correctness issues related to inlining.
Inlining heuristics are not part of this diff.

(cherry picked from FBD13796888)
2019-01-31 11:23:02 -08:00
Maksim Panchenko
365bd1f1c8 [BOLT] For non-simple functions always update jump tables in-place
Summary:
For non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.

(cherry picked from FBD13849776)
2019-01-28 13:46:18 -08:00
Rafael Auler
af81c7ff80 [perf2bolt] Add support for generating autofdo input
Summary:
Autofdo tools support.

(cherry picked from FBD13779026)
2019-01-22 17:21:45 -08:00
Maksim Panchenko
c6ce2abb7d [perf2bolt] Optimize memory usage in perf2bolt
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.

Breakdown processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.

(cherry picked from FBD13706697)
2019-01-15 23:43:40 -08:00
Maksim Panchenko
2fe0c38d6b [perf2bolt] Better tracking of process forking
Summary:
Improve tracking of forked processes.

If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've being handling
such cases well so far.

However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).

To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.

To track if the event occurred before or after 'perf record', we parse
event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.

(cherry picked from FBD13250904)
2018-11-21 20:04:00 -08:00
Maksim Panchenko
067a385000 [BOLT] Add thresholds for function splitting
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:

  -split-threshold=<uint>
    split function only if its main size is reduced by more than given
    amount of bytes. Default value: 0, i.e. split iff the size is reduced.
    Note that on some architectures the size can increase after splitting.
  -split-align-threshold=<uint>
    when deciding to split a function, apply this alignment while doing
    the size comparison (see -split-threshold). Default value: 2.

(cherry picked from FBD13136352)
2018-11-15 16:03:34 -08:00
Maksim Panchenko
b0f7fddd35 [BOLT] Add method for better function size estimation
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.

(cherry picked from FBD13092139)
2018-11-15 16:02:16 -08:00
Maksim Panchenko
e1b8fade7f [BOLT] Add branch priority policy for blocks with 2 successors
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending if it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.

(cherry picked from FBD13078139)
2018-11-14 14:43:59 -08:00
Maksim Panchenko
40d9fcfdca [BOLT] Workaround for Clang de-virtualization bug
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:

  .LFT663334 (6 instructions, align : 1)
    Predecessors: .LFT663333
      00000538:   movb    $0x1, %al
      0000053a:   movl    %eax, -0x2c(%rbp)
      0000053d:   movl    $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
      00000542:   testb   $0x1, %cl
      00000545:   movq    -0x40(%rbp), %r14
      00000549:   je      .Ltmp1071462
    Successors: .Ltmp1071462, .LFT663335

  .LFT663335 (2 instructions, align : 1)
    Predecessors: .LFT663334
      0000054b:   movq    (%r12), %rax
      0000054f:   movq    .Ltmp0(%rax), %rcx
    Successors: .Ltmp1071462

  .Ltmp1071462 (7 instructions, align : 1)
    Predecessors: .LFT663334, .LFT663335
      00000556:   movq    %r12, %rdi
      00000559:   callq   *%rcx
      .......

The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could either be a regular
function, or a virtual function. To differentiate between the two, AMD64
ABI (originated from Itanium ABI) uses the last bit of the pointer. The
call instruction sequence varies depending if the function is virtual or
not, and the pointer's last bit is checked. If it's "1" then the value
of the pointer (minus 1) is used as an offset in the object vtable to
get the address of the function, otherwise the pointer is used directly
as a function address.

In this specific case, a de-virtualization is taking place, but it's not
complete. Compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.

The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte
of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
in by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".

The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-prone, but non-pc-relative references from code into a middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.

(cherry picked from FBD13030310)
2018-11-12 12:38:50 -08:00
Maksim Panchenko
30fd960951 [BOLT] Update local symbol count in symbol table
Summary:
Fix sh_info entry for symbol table section to reflect updated number of
local symbols.

(cherry picked from FBD10503216)
2018-10-22 18:48:12 -07:00
Maksim Panchenko
a76b13d48e [perf2bolt] Pre-aggregate LBR samples
Summary: Pre-aggregating LBR data cuts pef2bolt processing times in half.

(cherry picked from FBD10420286)
2018-10-02 17:16:26 -07:00
Rafael Auler
74a71c6812 Fix bug in analyzeRelocation for GOT entries
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fix facebookincubator/BOLT#40

(cherry picked from FBD10355019)
2018-10-11 18:12:09 -07:00
Facebook Github Bot
b166ccbea8 [BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis
Summary:
This pull request fixes two compiler warnings:

- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>

(cherry picked from FBD10202092)
2018-10-04 10:46:16 -07:00
Igor Sugak
c3c80822a3 [BOLT] Capitalize i
Summary: as titled

(cherry picked from FBD10136655)
2018-10-01 16:22:46 -07:00
Igor Sugak
cc2276d3f1 [BOLT] fix build with gcc-4.8.5
Summary: These are two minor changes to make it copatible with gcc-4.8.5

(cherry picked from FBD9884971)
2018-09-17 12:17:33 -07:00
Maksim Panchenko
ce508b58c6 [BOLT] Support relocations without symbols
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.

(cherry picked from FBD10054576)
2018-09-21 12:00:20 -07:00
Rafael Auler
bd0b99c45d [BOLT] Change stub-insertion pass for AArch64
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.

This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation that, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.

This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. And finally it enables eliminate veneers pass by default.

(cherry picked from FBD10023158)
2018-09-17 13:36:59 -07:00
Maksim Panchenko
1387a9d761 [BOLT] Keep .text section in file when using old text
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.

Additionally, check if the code fits into `.text` using the page
alignment, otherwise we were skipping the alignment relying on the user
detecting the warning message. This could have resulted in unexpected
performance drops.

Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).

(cherry picked from FBD10024670)
2018-09-24 20:58:31 -07:00
Maksim Panchenko
53b72d0f2e [BOLT] Ignore symbols from non-allocatable sections
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.

(cherry picked from FBD9666511)
2018-09-05 14:36:52 -07:00
Maksim Panchenko
8026760ac0 [BOLT] Fix another issue with profile after ICP
Summary:
For jump tables ICP was using profile from the jump table itself which
doesn't work correct if the jump table is re-used at different code
locations.

(cherry picked from FBD9618774)
2018-08-30 13:21:50 -07:00
spupyrev
41ed5431a0 [BOLT] turning on the compact aligner by default
Summary: Making UseCompactAligner true by default

(cherry picked from FBD9325158)
2018-08-14 14:49:10 -07:00
Maksim Panchenko
cd19f718b4 [BOLT] Merge jump table profile data
Summary:
While running ICF pass we have skipped merging profile data for jump
tables. We were only updating profile in the CFG. Fix that.

(cherry picked from FBD9595523)
2018-08-30 13:21:29 -07:00
Maksim Panchenko
69e6004a42 [perf2bolt] Fix processing of binaries with names over 15 chars long
Summary:
Do not truncate the binary name for comparison purposes as the binary
name we are getting from "perf script" is no longer truncated.

(cherry picked from FBD9596409)
2018-08-30 14:51:10 -07:00
Rafael Auler
d0a80b0870 [BOLT] Change ForceRelocation behavior
Summary:
Only record address as addend if the target of the relocation
is the pseudo-symbol Zero.

(cherry picked from FBD9551543)
2018-08-28 18:15:13 -07:00
Maksim Panchenko
708a550084 [BOLT] Fix profile after ICP
Summary:
After optimizing a target of a jump table, ICP was not updating edge
counts corresponding to that target. As a result the edge could be left
hot and negatively influence the code layout.

(cherry picked from FBD9524396)
2018-08-23 22:47:46 -07:00
Maksim Panchenko
2511b09985 [BOLT][DWARF] Fix line info for empty CU DIEs
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.

As a solution/workaround, we register "<unknown>" file name for such
debug line tables.

(cherry picked from FBD9526705)
2018-08-27 20:12:59 -07:00
Rafael Auler
a7e0704be6 [BOLT] Reduce AArch64 target feature flags
Summary:
Eliminate some flags that are not recognized and
are currently printing warnings when BOLT runs on AArch64.

(cherry picked from FBD9499971)
2018-08-24 10:42:00 -07:00
Rafael Auler
af1177d99f [BOLT] Add mattr options to AArch64 target
Summary:
Make the AArch64 subtarget enable all features, so the disassembler
won't choke on extension instructions.

(cherry picked from FBD9477066)
2018-08-22 18:47:39 -07:00