Summary:
Add support for reading value profiling info from perf data. This diff adds support in DataReader/DataAggregator for value profiling data. Each event is recorded as two Locations (a PC and an address/value) and a count.
For now, I'm assuming that the value profiling data is in the same file as the usual BOLT profiling data. Collecting both at the same time seems to work.
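For reference, the in-memory shape is roughly the following (a sketch; the
struct and field names approximate the DataReader types but may not match
them exactly):

#include "llvm/ADT/StringRef.h"
#include <cstdint>

// One endpoint of a record: either a symbol+offset or a raw value.
struct Location {
  bool IsSymbol;        // true if Name holds a symbol name
  llvm::StringRef Name; // symbol name (empty for raw addresses/values)
  uint64_t Offset;      // offset from the symbol, or the raw value itself
};

// One value-profiling sample: at PC 'Offset', the observed address/value
// was 'Addr', seen 'Count' times.
struct MemInfo {
  Location Offset; // the PC of the profiled instruction
  Location Addr;   // the observed address or value
  uint64_t Count;  // how many times this (PC, value) pair was recorded
};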
(cherry picked from FBD6076877)
Summary:
Move the data aggregator logic from our Python script to
our C++ LLVM/BOLT libs. This dramatically reduces processing
time for profiling data (from 45 minutes to 5 minutes for HHVM) because
we use BOLT directly as a disassembler to validate traces found
in the LBR and to add the fall-through counts. Previously, the Python
approach relied on parsing the output of objdump to check traces.
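Conceptually, the aggregator walks each LBR trace and uses the disassembled
function to check that the path between consecutive branch records is a
straight fall-through. A rough sketch (method names are hypothetical, not
the actual BOLT API):

// Validate one LBR trace segment [From, To] inside a function by walking
// fall-through successors, crediting Count to every fall-through edge on
// the way. Hitting a block with no fall-through before reaching To means
// the trace is bogus and must be rejected.
bool recordTrace(BinaryFunction &BF, uint64_t From, uint64_t To,
                 uint64_t Count) {
  BinaryBasicBlock *BB = BF.getBasicBlockContainingOffset(From);
  BinaryBasicBlock *EndBB = BF.getBasicBlockContainingOffset(To);
  if (!BB || !EndBB)
    return false;                  // offsets do not map to decoded code
  while (BB != EndBB) {
    BinaryBasicBlock *Next = BB->getFallthrough();   // hypothetical accessor
    if (!Next)
      return false;                // no fall-through here: invalid trace
    BB->addFallthroughCount(Next, Count);            // hypothetical mutator
    BB = Next;
  }
  return true;
}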
(cherry picked from FBD5761313)
Summary:
Add support for reading profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.
This patch also introduces different options for converting basic block
execution counts to edge counts, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.
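The simplest such conversion, as a sketch (this illustrates the general
idea only and is not necessarily the exact heuristic any of the options
implements): split each block's execution count among its outgoing edges
and let the layout algorithms consume the resulting edge weights.

// Illustrative sketch: derive edge counts by dividing each block's
// execution count evenly among its successors. Real strategies can weigh
// edges differently (e.g. by successor block counts).
void inferEdgeCounts(BinaryFunction &BF) {
  for (BinaryBasicBlock &BB : BF) {
    const uint64_t NumSucc = BB.succ_size();
    if (NumSucc == 0 || BB.getExecutionCount() == 0)
      continue;
    const uint64_t Share = BB.getExecutionCount() / NumSucc;
    for (BinaryBasicBlock *Succ : BB.successors())
      setEdgeCount(BB, *Succ, Share); // hypothetical helper
  }
}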
(cherry picked from FBD5664735)
Summary:
Fix an issue in memcpy where one of its entry points received
no profiling data and was wrongly considered cold, ending up in the
cold region.
(cherry picked from FBD5569156)
Summary:
* Improve profile matching for LTO binaries that don't match 100% (see the sketch below).
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.
There's an improvement both for 100%-matched profiles and for stale LTO
profiles. However, we are still not fully closing the gap with
stale profiles when LTO is enabled.
(NOTE: I haven't updated all test cases yet)
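A sketch of the kind of prefix matching the first bullet refers to (the
suffix strings below are illustrative; the real matcher may recognize a
different set of LTO-generated suffixes):

#include "llvm/ADT/StringRef.h"

// LTO often appends suffixes such as ".lto_priv.<N>" or ".constprop.<N>"
// to internal names, and the numeric part is unstable across builds.
// Matching on the common prefix recovers the association between the
// profile name and the BinaryFunction name.
llvm::StringRef getCommonLTOPrefix(llvm::StringRef Name) {
  for (llvm::StringRef Suffix : {".lto_priv.", ".constprop."}) {
    size_t Pos = Name.find(Suffix);
    if (Pos != llvm::StringRef::npos)
      return Name.substr(0, Pos); // drop the unstable suffix
  }
  return Name;
}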
(cherry picked from FBD5529293)
Summary:
Perform indirect call promotion optimization in BOLT.
The code scans the instructions for indirect calls during CFG
creation. Right now, indirect tail calls are not handled, since
such functions are marked as not simple. The offsets of the
indirect calls are stored for later use by the ICP pass.
The indirect call promotion pass visits each indirect call and
examines the BranchData for each. If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted. Otherwise, it is ignored. By default,
only one target is considered at each callsite.
When a candidate callsite is processed, we modify the callsite
to test for the most common call targets before falling back to
the original generic call mechanism.
The CFG and layout are modified by ICP.
A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>
The threshold is the minimum frequency of a call target needed
before ICP is triggered.
The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
frequency of the top two call targets exceeds the threshold.
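For instance, with the default threshold of 90%, a callsite whose hottest
target takes 70% of calls and second-hottest 25% is not promoted with
topn=1 (70 < 90) but is promoted with topn=2 (95 >= 90). The trigger
check is roughly the following (a sketch; names are illustrative):

#include "llvm/ADT/ArrayRef.h"
#include <cstdint>

// Sum the counts of the top-N call targets (already sorted in descending
// order) and compare their share of all calls from this callsite against
// the threshold percentage.
bool shouldPromote(llvm::ArrayRef<uint64_t> SortedTargetCounts,
                   uint64_t TotalCount, unsigned TopN, double ThresholdPct) {
  uint64_t TopCount = 0;
  for (unsigned I = 0; I < TopN && I < SortedTargetCounts.size(); ++I)
    TopCount += SortedTargetCounts[I];
  return TotalCount > 0 && 100.0 * TopCount / TotalCount >= ThresholdPct;
}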
Example of ICP:
C++ code:
int B_count = 0;
int C_count = 0;
struct A { virtual void foo() = 0; };
struct B : public A { virtual void foo() { ++B_count; }; };
struct C : public A { virtual void foo() { ++C_count; }; };
A* a = ...
a->foo();
...
original:
400863: 49 8b 07 mov (%r15),%rax
400866: 4c 89 ff mov %r15,%rdi
400869: ff 10 callq *(%rax)
40086b: 41 83 e6 01 and $0x1,%r14d
40086f: 4d 89 e6 mov %r12,%r14
400872: 4c 0f 44 f5 cmove %rbp,%r14
400876: 4c 89 f7 mov %r14,%rdi
...
after ICP:
40085e: 49 8b 07 mov (%r15),%rax
400861: 4c 89 ff mov %r15,%rdi
400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10
40086b: 00 00 00
40086e: 4c 3b 10 cmp (%rax),%r10
400871: 75 29 jne 40089c <main+0x9c>
400873: 41 ff d2 callq *%r10
400876: 41 83 e6 01 and $0x1,%r14d
40087a: 4d 89 e6 mov %r12,%r14
40087d: 4c 0f 44 f5 cmove %rbp,%r14
400881: 4c 89 f7 mov %r14,%rdi
...
40089c: ff 10 callq *(%rax)
40089e: eb d6 jmp 400876 <main+0x76>
(cherry picked from FBD3612218)
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second had a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.
(cherry picked from FBD3560504)
Summary:
If profile data was collected on a stripped binary but the input
to BOLT is unstripped, we would use a different mangling scheme for
local functions and ignore their profiles. To solve this issue, the
diff adds an alternative name for every local function such that one
of the names matches the name in the profile.
If the input binary was stripped, we reject it unless the "-allow-stripped"
option was passed. Matching is more complicated in this case,
since we have less information than at the time of profile collection.
It's also not that simple to tell if the profile was gathered on a
stripped binary (in which case we would have no issue matching data).
(cherry picked from FBD3548012)
Summary:
Added perf2bolt functionality for extracting branch records
with histories of previous branches. The length of the histories
is user-defined, with a default of 0 (the previous functionality). Also,
DataReader can parse perf2bolt output with histories.
Note: creating profile data with long histories can increase its
size significantly (2x for a history of length 1, 3x for length 2, etc.).
(cherry picked from FBD3473983)
Summary:
With the ICF optimization in the linker, we were getting mismatches
between function names in .fdata and BinaryFunction names. This diff adds
support for multiple names per BinaryFunction and
matches the profile against all possible names.
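The lookup amounts to trying every known name in turn, roughly (getNames()
and lookup() are stand-ins for the real API):

// Try every known name of the function when looking up its profile.
const FuncBranchData *findProfile(const BinaryFunction &BF,
                                  const DataReader &Reader) {
  for (llvm::StringRef Name : BF.getNames())
    if (const FuncBranchData *Data = Reader.lookup(Name))
      return Data;
  return nullptr;
}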
(cherry picked from FBD3466215)
Summary:
Populate function execution counts while parsing fdata. Previously,
we used a quadratic algorithm to populate the execution counts
(we had to iterate over *all* branches for every single function).
Ignore non-symbol to non-symbol branches while parsing fdata.
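The idea, as a rough sketch (field names are hypothetical): credit the
destination function's execution count while each branch record is being
parsed, which is O(#branches) total instead of O(#functions x #branches).

#include "llvm/ADT/StringMap.h"
#include <cstdint>

// Called once per parsed fdata branch entry.
void updateExecCountOnParse(const BranchInfo &BI,
                            llvm::StringMap<uint64_t> &ExecCounts) {
  // Skip branches where neither endpoint resolves to a symbol.
  if (!BI.From.IsSymbol && !BI.To.IsSymbol)
    return;
  // A branch landing at offset 0 of a symbol enters the function:
  // attribute its count to that function.
  if (BI.To.IsSymbol && BI.To.Offset == 0)
    ExecCounts[BI.To.Name] += BI.Branches;
}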
These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.
The test case had to be modified since it contained irrelevant
branches from the PLT to libc.
(cherry picked from FBD3106263)
Summary:
In a recent commit, we changed local symbols to be specially tagged
with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader
so that it doesn't choke when it sees a 2 in the symbol id field.
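A minimal sketch of the tolerant check (illustrative, not the exact reader
code):

// The flag that opens each location field is now 0 = raw address,
// 1 = symbol, 2 = local symbol. Both 1 and 2 mean "the next token is a
// symbol name".
bool isSymbolFlag(unsigned Flag) { return Flag == 1 || Flag == 2; }
bool isValidFlag(unsigned Flag) { return Flag <= 2; }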
(cherry picked from FBD2552776)
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFGs are built, allowing us to
test whether the builder is sane with LLVM LIT tests.
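Conceptually, the data ends up keyed by function name, so buildCFG() can do
a single lookup. A sketch with illustrative types (BranchInfo stands for
one parsed branch record):

#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include <vector>

struct BranchInfo { /* From, To, Mispreds, Branches (defined elsewhere) */ };

// Branch records grouped per function, so BinaryFunction::buildCFG() can
// fetch everything for one function at once.
struct FuncBranchData {
  std::vector<BranchInfo> Data;
};

llvm::StringMap<FuncBranchData> FuncsMap; // function name -> branch data

const FuncBranchData *getFuncData(llvm::StringRef FuncName) {
  auto It = FuncsMap.find(FuncName);
  return It == FuncsMap.end() ? nullptr : &It->second;
}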
(cherry picked from FBD2534675)
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.
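For reference, a single branch record in such a file looks roughly like the
line below (symbol names and numbers are made up, and the exact field
layout may differ). The fields are: from-flag, from-name, from-offset,
to-flag, to-name, to-offset, mispredictions, count.

1 main 12 1 foo 0 0 1450

That is: a branch from offset 0x12 in main to the entry of foo, taken 1450
times with 0 mispredictions.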
(cherry picked from FBD2515754)