Summary:
In binutils 2.30 the bfd linker accidentally started modifying some
relocations on output under `-q/--emit-relocs` by turning on
R_X86_64_converted_reloc_bit. As a result, BOLT ignored such
relocations and failed to correctly update the binary.
This diff filters out R_X86_64_converted_reloc_bit from the relocation
type.
(cherry picked from FBD14907832)
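The filtering in the relocation fix above can be sketched as follows. This is an illustration in Python, not the BOLT code (which is C++). The bit value `1 << 7` is what the binutils sources appear to use for R_X86_64_converted_reloc_bit and should be treated as an assumption; the helper name is hypothetical.

```python
# Assumption: binutils defines R_X86_64_converted_reloc_bit as (1 << 7).
R_X86_64_CONVERTED_RELOC_BIT = 1 << 7

# Standard x86-64 ELF relocation type values.
R_X86_64_PC32 = 2
R_X86_64_PLT32 = 4


def strip_converted_bit(reloc_type: int) -> int:
    """Return the relocation type with the linker's marker bit cleared."""
    return reloc_type & ~R_X86_64_CONVERTED_RELOC_BIT


# A PC32 relocation that the linker marked as "converted".
marked = R_X86_64_PC32 | R_X86_64_CONVERTED_RELOC_BIT
print(strip_converted_bit(marked) == R_X86_64_PC32)
```

Types without the marker bit pass through unchanged, so the filter can be applied to every relocation read from the input.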
Summary:
For easier analysis of the hottest targets of jump tables it helps to
have basic block successors sorted based on the taken frequency.
(cherry picked from FBD14856640)
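A rough sketch of the successor ordering described above, in Python for illustration. The `(label, taken_count)` tuple layout and the function name are hypothetical and do not reflect BOLT's internal data structures.

```python
def sort_successors_by_frequency(successors):
    """Order (label, taken_count) pairs from hottest to coldest.

    sorted() is stable, so successors with equal counts keep their
    original relative order.
    """
    return sorted(successors, key=lambda succ: succ[1], reverse=True)


jump_table_targets = [(".Ltmp1", 12), (".Ltmp2", 9000), (".Ltmp3", 340)]
for label, count in sort_successors_by_frequency(jump_table_targets):
    print(label, count)
```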
Summary:
If perf.data is processed in LBR mode but the data was
collected without -j, we currently report, confusingly, that all samples
mismatch the input binary, even though the samples match but
lack LBR info. Change perf2bolt to detect this scenario and print
a helpful message instructing the user to collect data with LBR.
(cherry picked from FBD14817732)
Summary:
While updating DWARF, we used to convert address ranges for functions
into DW_AT_ranges format, even if the ranges were not split and still
had a simple [low, high) form. We had to do this because functions with
contiguous ranges could be sharing an abbrev with a function with
non-contiguous ranges, and we had to convert the abbrev.
It turns out that excessive use of DW_AT_ranges may lead to
internal core dumps in gdb in the presence of .gdb_index.
I still don't know the root cause of it, but reducing the number of
DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the
issue.
We can keep a simple range for DIEs that are guaranteed not to
share an abbrev with any non-contiguous function. Hence we have to
postpone the update of function ranges until we've seen all DIEs.
Note that DIEs from different compilation units could share the same
abbrev, and hence we have to process DIEs from all compilation units.
(cherry picked from FBD14814043)
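The postponed decision in the DWARF change above amounts to a two-pass scan over the DIEs of all compilation units. A minimal Python sketch, assuming each subprogram DIE is summarized as a hypothetical `(die_id, abbrev_key, is_contiguous)` tuple, where `abbrev_key` identifies the abbrev across compilation units:

```python
def dies_requiring_ranges(dies):
    """Return the ids of DIEs that must be converted to DW_AT_ranges.

    dies: iterable of (die_id, abbrev_key, is_contiguous) collected from
    ALL compilation units, since CUs may share an abbrev.
    """
    dies = list(dies)
    # Pass 1: abbrevs used by at least one non-contiguous function.
    converted_abbrevs = {abbrev for _, abbrev, contiguous in dies
                         if not contiguous}
    # Pass 2: every DIE that shares a converted abbrev must follow it;
    # all other DIEs keep their simple [low, high) form.
    return {die for die, abbrev, _ in dies if abbrev in converted_abbrevs}


example = [
    ("f", ("abbrev-table@0", 5), True),
    ("g", ("abbrev-table@0", 5), False),  # split function
    ("h", ("abbrev-table@0", 7), True),
]
print(sorted(dies_requiring_ranges(example)))
```

In this example `f` is converted only because it shares an abbrev with the split function `g`, while `h` keeps its simple range.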
Summary:
Some instructions in assembly-written functions could reference 8-byte
constants embedded in other instructions using 4-byte offsets, presumably
to save a couple of bytes.
Detect such cases, and skip processing such functions until we teach
BOLT how to handle references into the middle of an instruction.
(cherry picked from FBD14768212)
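One way to picture the detection described above: given the start offsets of the disassembled instructions, a reference is suspicious if it lands inside the function but not on an instruction boundary. This is a Python sketch with hypothetical names, not the check BOLT performs internally.

```python
import bisect


def points_into_instruction(instr_offsets, function_size, target):
    """Return True if `target` lies strictly inside some instruction.

    instr_offsets: sorted start offsets of instructions in the function.
    target: offset of the referenced byte relative to the function start.
    """
    if not 0 <= target < function_size:
        return False  # reference is outside of this function
    # Find the last instruction starting at or before the target.
    idx = bisect.bisect_right(instr_offsets, target) - 1
    return idx >= 0 and instr_offsets[idx] != target


# A 2-byte instruction, a 10-byte movabs with an 8-byte immediate, a ret.
offsets = [0, 2, 12]
print(points_into_instruction(offsets, 13, 4))   # immediate of the movabs
print(points_into_instruction(offsets, 13, 12))  # start of the ret
```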
Summary:
A long-overdue refactoring that makes interfaces cleaner and less awkward.
Mainly, it makes future work much easier.
(cherry picked from FBD14766284)
Summary:
When we patch .debug_abbrev we issue many duplicate patches. Instead of
storing these patches as a vector, use a hash map. This saves some
processing time and memory.
(cherry picked from FBD14691292)
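The effect of the container change above can be illustrated with a small Python sketch. The patch representation (offset mapped to a new value) and the names are assumptions made for the example.

```python
# Vector-style storage keeps every duplicate patch.
patch_list = []
# Map-style storage keeps one patch per offset.
patch_map = {}


def add_patch(offset, new_value):
    """Record a patch in both containers for comparison."""
    patch_list.append((offset, new_value))
    patch_map[offset] = new_value


# The same abbrev entry gets patched once per DIE that uses it.
for _ in range(1000):
    add_patch(0x40, 0x55)
add_patch(0x48, 0x17)

print(len(patch_list), len(patch_map))
```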
Summary:
In non-relocation mode we were accidentally emitting section headers for
every single jump table. This happened with default
`-jump-tables=basic`.
(cherry picked from FBD14653282)
Summary:
When using the "-hot-text" option, we might not get enough cold text to
fill up the last huge page, and data can end up allocated on this page,
producing undesirable effects. To prevent this from happening, always
make sure to allocate enough space past __hot_end.
(cherry picked from FBD14575100)
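The amount of space needed past __hot_end can be estimated as in the following Python sketch. It assumes 2MiB huge pages and that cold text is placed immediately after __hot_end; both are simplifications for the example, and the function names are hypothetical.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed huge page size (2MiB)


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def padding_after_cold_text(hot_end, cold_size, page_size=HUGE_PAGE_SIZE):
    """Bytes to reserve so that no data shares the last hot huge page."""
    page_end = align_up(hot_end, page_size)
    cold_end = hot_end + cold_size
    return max(0, page_end - cold_end)


print(hex(padding_after_cold_text(0x601000, 0x2000)))
```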
Summary:
While removing redundant local symbols, we used the new section index to
look up the corresponding section in the old section table. As a result,
we used to either not remove the correct symbols, or remove the wrong
ones.
(cherry picked from FBD14552047)
Summary:
We used to use the existing symbol binding while duplicating and renaming
cold fragment symbols. As a result, some of those were emitted with
global binding. This confuses gdb, and it starts treating those symbols
as additional entry points.
The fix is to always emit such symbols with a local binding. This also
means that we have to sort the static symbol table before emission to make
sure local symbols precede all others.
(cherry picked from FBD14529265)
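The ordering requirement mentioned above comes from the ELF format: all STB_LOCAL symbols must come first, and the symbol table section's sh_info holds the index of the first non-local symbol. A minimal Python sketch, with symbols modeled as hypothetical `(name, binding)` pairs:

```python
STB_LOCAL = 0
STB_GLOBAL = 1


def sort_locals_first(symbols):
    """Stable-sort symbols so that local ones precede all others.

    Returns the sorted list and the index of the first non-local symbol,
    which is the value the symbol table's sh_info field should hold.
    """
    ordered = sorted(symbols, key=lambda sym: sym[1] != STB_LOCAL)
    first_non_local = sum(1 for _, binding in ordered
                          if binding == STB_LOCAL)
    return ordered, first_non_local


table = [
    ("", STB_LOCAL),              # null symbol at index 0
    ("main", STB_GLOBAL),
    ("foo.cold.1", STB_LOCAL),    # renamed cold fragment symbol
    ("bar", STB_GLOBAL),
]
ordered, sh_info = sort_locals_first(table)
print([name for name, _ in ordered], sh_info)
```

A stable sort keeps the null symbol at index 0 and preserves the relative order within each group.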
Summary:
Create a separate pass for assigning functions to sections. Detect
functions originating from special sections (by default .stub and
.mover) and place them into ".text.mover" if the "-hot-text" option is
specified.
Cold functions are isolated from hot functions even when no function
re-ordering is specified.
(cherry picked from FBD14512628)
Summary:
GDB does not like it when the first entry in the line info table after
an end_sequence entry is not marked with is_stmt. If this happens, it will
not print the correct line number information for such an address. Note
that everything works fine starting with the first address marked
with is_stmt.
This could happen if the first instruction in the cold section wasn't
marked with is_stmt.
The fix is to always emit debug line info for the first instruction
in any function fragment with is_stmt flag.
(cherry picked from FBD14516629)
Summary:
This refactoring makes it easier to create new code sections and control
code placement. As an example, cold code is being placed into
".text.cold" which is emitted independently from ".text", and the final
address assignment becomes more flexible.
Previously, in non-relocation mode we used to emit a temporary section
name into .shstrtab. This resulted in unnecessary bloat of this section.
There was also unnecessary padding emitted at the end of the text section.
After fixing this, the output binary becomes smaller.
I had to change the way exception handling tables are re-written
as the current infra does not support cross-section label difference.
This means we have to emit absolute landing pad addresses, which might
not work for PIE binaries. I'm going to address this once I investigate
the current exception handling issues in PIEs.
This diff temporarily disables "-hot-functions-at-end" option.
(cherry picked from FBD14475693)
Summary: As part of our heuristics to decode an indirect branch, if we
suspect the branch is an indirect tail call, we add its probable target
to the BC::InterproceduralReferences vector to detect functions with
more than one entry point. However, if this probable target is not in an
allocatable section, we were asserting. Remove this assertion and
change the code to conditionally store to InterproceduralReferences
instead. The probable target could be garbage at this point because
of analyzeIndirectBranch failing to identify the load instruction that
has the memory address of the target, so we should tolerate this.
(cherry picked from FBD14432821)
Summary:
Add heatmap subcommand to produce heatmaps based on perf.data with LBR.
The output is produced in colored ASCII format.
  llvm-bolt heatmap -p perf.data <executable>
    -block-size=<uint>  - size of a heat map block in bytes (default 64)
    -line-size=<uint>   - number of entries per line (default 256)
    -max-address=<uint> - maximum address considered valid for heatmap
                          (default 4GB)
    -o=<string>         - heatmap output file (default stdout)
(cherry picked from FBD13969992)
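The core of a heatmap like the one added above is a per-block sample count. The following Python sketch shows only that bucketing step, using the documented defaults for block size and maximum address; the function name and the flat list of sample addresses are assumptions, and no rendering is attempted.

```python
from collections import Counter


def build_heatmap(addresses, block_size=64, max_address=4 * 1024 ** 3):
    """Count samples per block, ignoring addresses above max_address."""
    counts = Counter()
    for address in addresses:
        if address > max_address:
            continue  # treated as invalid for the heatmap
        counts[address // block_size] += 1
    return dict(counts)


samples = [0, 10, 64, 130, 5 * 1024 ** 3]
print(build_heatmap(samples))
```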
Summary:
For a non-simple function we can miss a reference to a jump table or
to an indirect goto table. If we move the jump table, the missed
reference will not get updated, and the corresponding indirect jump
will end up in the old (wrong) location. Updating the original jump
table in-place should take care of the issue.
(cherry picked from FBD13849776)
Summary:
While converting perf profile, we only need CFG for functions that were
profiled and can skip building CFG for the rest. This saves us some
processing time and memory.
Break down processing of perf.data into two steps. The first
step parses the data, saves it in intermediate format, and marks
functions with the profile. The second step attributes the profile to
functions with CFG. When we disassemble and build CFG for functions in
aggregate-only mode, we skip functions without the profile.
(cherry picked from FBD13706697)
Summary:
Improve tracking of forked processes.
If a process corresponding to the input binary has forked/started
before 'perf record' was initiated, then the full name of the binary
will be recorded in a corresponding MMAP2 event. We've been handling
such cases well so far.
However, if the process was forked after 'perf record' has started, and
execve(2) wasn't called afterwards, then there will be no MMAP2 event
recorded corresponding to the mapping of the main binary (unrelated
MMAP2 events could still be recorded).
To track such cases, we need to parse 'perf script --show-task-events'
command output, and to scan for PERF_RECORD_FORK events, and then add
forked process PIDs to the list associated with the input binary. If
the fork event was followed by an exec event (PERF_RECORD_COMM exec)
of a different binary, then the forked PID should be ignored. If the
exec event was associated with our input binary, then the correct MMAP2
event was recorded and parsed.
To track if the event occurred before or after 'perf record', we parse
event's time. This helps us to differentiate some events. E.g. the exec
event is only registered correctly if it happened after perf recording
has started (otherwise the "exec" part is missing), and thus we only
record forks with non-zero time stamps.
(cherry picked from FBD13250904)
Summary:
Use newly added function size estimation to measure the effectiveness
and guide function splitting. Two new tuning options are added:
  -split-threshold=<uint>
    Split a function only if its main size is reduced by more than the given
    number of bytes. Default value: 0, i.e. split iff the size is reduced.
    Note that on some architectures the size can increase after splitting.
  -split-align-threshold=<uint>
    When deciding to split a function, apply this alignment while doing
    the size comparison (see -split-threshold). Default value: 2.
(cherry picked from FBD13136352)
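Read literally, the two options above combine into a single size comparison. The Python sketch below is one interpretation of that description, using the documented default values; it is not taken from the BOLT sources, and the function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def should_split(original_size, hot_size,
                 split_threshold=0, split_align_threshold=2):
    """Keep the split only if the aligned main size shrinks by more than
    split_threshold bytes."""
    aligned_original = align_up(original_size, split_align_threshold)
    aligned_hot = align_up(hot_size, split_align_threshold)
    return aligned_original > aligned_hot + split_threshold


print(should_split(100, 99))   # 1 byte saved, lost to alignment
print(should_split(100, 98))   # 2 bytes saved
print(should_split(100, 60, split_threshold=64))
```

With the defaults, a one-byte reduction does not count as a reduction because both sizes round up to the same aligned value.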
Summary:
Add BinaryContext::calculateEmittedSize() that ephemerally emits code
to allow precise estimation of the function size. Relaxation and
macro-op alignment adjustments are taken into account.
(cherry picked from FBD13092139)
Summary:
On x86 the difference between long and short jump instructions could be
either 4 or 3 bytes, depending on whether it's a conditional jump or not.
For a basic block with 2 jump instructions, if we know that one of
the successors is in a different code region, then we can make it
a target of an unconditional jump instruction. This will save 1 byte
in case the conditional jump happens to be a short one.
(cherry picked from FBD13078139)
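The arithmetic behind the one-byte saving above, as a small Python sketch. The instruction sizes are the usual x86 encodings (rel8 and rel32 forms of JMP and Jcc); the function name is made up for the example.

```python
# Encoded sizes in bytes.
SHORT_JMP, LONG_JMP = 2, 5   # EB rel8 / E9 rel32
SHORT_JCC, LONG_JCC = 2, 6   # 7x rel8 / 0F 8x rel32


def terminator_size(cond_target_is_far, uncond_target_is_far):
    """Total size of a 'jcc A; jmp B' pair ending a basic block."""
    jcc = LONG_JCC if cond_target_is_far else SHORT_JCC
    jmp = LONG_JMP if uncond_target_is_far else SHORT_JMP
    return jcc + jmp


# The far successor as the conditional target vs. as the unconditional one.
far_on_jcc = terminator_size(True, False)
far_on_jmp = terminator_size(False, True)
print(far_on_jcc, far_on_jmp, far_on_jcc - far_on_jmp)
```

The saving only materializes when the other successor is close enough for the conditional jump to use its short form.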
Summary:
When Clang is boot-strapped with (Thin)LTO, it may produce a code
fragment similar to below:
  .LFT663334 (6 instructions, align : 1)
    Predecessors: .LFT663333
      00000538: movb $0x1, %al
      0000053a: movl %eax, -0x2c(%rbp)
      0000053d: movl $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx
      00000542: testb $0x1, %cl
      00000545: movq -0x40(%rbp), %r14
      00000549: je .Ltmp1071462
    Successors: .Ltmp1071462, .LFT663335
  .LFT663335 (2 instructions, align : 1)
    Predecessors: .LFT663334
      0000054b: movq (%r12), %rax
      0000054f: movq .Ltmp0(%rax), %rcx
    Successors: .Ltmp1071462
  .Ltmp1071462 (7 instructions, align : 1)
    Predecessors: .LFT663334, .LFT663335
      00000556: movq %r12, %rdi
      00000559: callq *%rcx
      .......
The code above is making a call by dereferencing a pointer to a member
function. A pointer to a member function could refer either to a regular
function or to a virtual function. To differentiate between the two, the
AMD64 ABI (originating from the Itanium ABI) uses the last bit of the
pointer. The call instruction sequence varies depending on whether the
function is virtual or not, and the pointer's last bit is checked. If it's
"1" then the value of the pointer (minus 1) is used as an offset in the
object vtable to get the address of the function, otherwise the pointer is
used directly as a function address.
In this specific case, de-virtualization is taking place, but it's not
complete. The compiler knows that the member function pointer is actually a
non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka
"clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that
checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore
keeps the code (unreachable/dead) to make a virtual call while using
(_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable.
This is obviously wrong, but since the code is unreachable, it will
never affect the runtime correctness.
The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into the last byte
of the function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT
creates a label ".Ltmp0" pointing to this last byte that is referenced
by the instruction sequence above. It just happens that the last byte
is also in the middle of the last instruction, and as a result, BOLT
never emits the label, hence resulting in the error message "Undefined
temporary symbol".
The workaround is to detect non-pc-relative relocations from code
pointing to some (fptr - 1). Note that this is not completely
error-proof, but non-pc-relative references from code into the middle of
a function are quite rare, and chances that in a normal situation they
will point to a byte preceding some function address are virtually zero.
(cherry picked from FBD13030310)
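The workaround described above boils down to a simple predicate on the relocation. A Python sketch with hypothetical names, assuming the set of known function start addresses is available:

```python
def is_member_pointer_artifact(target, function_starts, is_pc_relative):
    """Return True if a non-pc-relative reference from code points to the
    byte immediately preceding the start of some function, i.e. (fptr - 1).
    """
    return not is_pc_relative and (target + 1) in function_starts


starts = {0x401000, 0x401230}
print(is_member_pointer_artifact(0x40122F, starts, is_pc_relative=False))
print(is_member_pointer_artifact(0x401100, starts, is_pc_relative=False))
```

As the commit message notes, this is a heuristic: a legitimate reference to the last byte of a function would also match.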
Summary:
Special case GOT relocs to ignore addend subtracting
logic in analyzeRelocation, since the addend does not refer to the
target of the instruction being analyzed. Also make the code honor
the comments in the special case about zeroed out ExtractValue but
non-zero addend.
Fixes facebookincubator/BOLT#40
(cherry picked from FBD10355019)
Summary:
This pull request fixes two compiler warnings:
- missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning)
- misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning)
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39
GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de>
(cherry picked from FBD10202092)
Summary:
lld may generate relocations without associated symbols. Instead of
rejecting binaries with such relocations, we can re-create the symbol
the relocation is against based on the extracted value.
(cherry picked from FBD10054576)
Summary:
Previously, we were expanding eligible branches with stubs. After
expansion, we were computing which stubs were unnecessary and removing them,
assuming ranges were shortening as code is removed. The problem with this
approach is that for branches that refer to code that is not managed by
BOLT, the distance to that location can increase and we can end up with an
out-of-range branch.
This rewrites the pass to be simpler, only increasing size and expanding code
with stubs as needed after each iteration, stopping when code stops increasing.
Besides this rewrite, the stub-insertion pass now supports stubs grouping
similar to what the linker does, allowing different functions to share the
same veneer that jumps to a common callee. It also fixes a bug in the previous
implementation where, in very large functions that use TBZ/TBNZ (+-32KB range),
it would mistakenly try to reuse a local stub BB that is out of range.
This includes a change to allow hot functions to be put at the end of the
.text section, closer to the heap, requiring no veneers to jump to JITted
code. Finally, it enables the eliminate veneers pass by default.
(cherry picked from FBD10023158)
Summary:
If we reuse text section under `-use-old-text` option, then there's no
need to rename it. Tools, such as perf, seem to not like binaries
without `.text`.
Additionally, check if the code fits into `.text` using the page
alignment. Previously we skipped the alignment and relied on the user
noticing the warning message, which could have resulted in unexpected
performance drops.
Also add `-no-huge-pages` option to use regular page size for code
alignment purposes (i.e. 4KiB instead of 2MiB).
(cherry picked from FBD10024670)
Summary:
While creating BinaryData objects we used to process all symbol table
entries. However, some symbols could belong to non-allocatable sections,
and thus we have to ignore them for the purpose of analyzing in-memory
data.
(cherry picked from FBD9666511)
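The filtering described above can be sketched as follows in Python. SHF_ALLOC (0x2) is the standard ELF section flag for sections that occupy memory at run time; the symbol and section representations are hypothetical simplifications.

```python
SHF_ALLOC = 0x2  # ELF section flag: section occupies memory at run time


def in_memory_symbols(symbols, section_flags):
    """Keep only symbols whose section is allocatable.

    symbols: list of (name, section_index) pairs.
    section_flags: dict mapping section index to its sh_flags value.
    """
    return [name for name, index in symbols
            if section_flags.get(index, 0) & SHF_ALLOC]


flags = {1: 0x6, 2: 0x3, 3: 0x0}   # .text, .data, .comment (illustrative)
syms = [("main", 1), ("counter", 2), ("note", 3)]
print(in_memory_symbols(syms, flags))
```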
Summary:
For jump tables ICP was using the profile from the jump table itself, which
doesn't work correctly if the jump table is re-used at different code
locations.
(cherry picked from FBD9618774)
Summary:
While running the ICF pass we skipped merging profile data for jump
tables. We were only updating the profile in the CFG. Fix that.
(cherry picked from FBD9595523)
Summary:
Do not truncate the binary name for comparison purposes as the binary
name we are getting from "perf script" is no longer truncated.
(cherry picked from FBD9596409)
Summary:
After optimizing a target of a jump table, ICP was not updating edge
counts corresponding to that target. As a result the edge could be left
hot and negatively influence the code layout.
(cherry picked from FBD9524396)
Summary:
In some rare cases a compiler may generate DWARF that contains an empty
CU DIE that references a debug line fragment. That fragment will contain
no file name information, and we fail to register it. Then, as a result,
DW_AT_stmt_list is not updated for the CU. This may cause some
DWARF-processing tools to segfault.
As a solution/workaround, we register "<unknown>" file name for such
debug line tables.
(cherry picked from FBD9526705)