Commit Graph

18 Commits

Author SHA1 Message Date
Bill Nell
9e42885d04 [BOLT] Add value profiling to BOLT
Summary:
Add support for reading value profiling info from perf data.  This diff adds support in DataReader/DataAggregator for value profiling data.  Each event is recorded as two Locations (a PC and an address/value) and a count.

For now, I'm assuming that the value profiling data is in the same file as the usual BOLT profiling data.  Collecting both at the same time seems to work.

(cherry picked from FBD6076877)
2017-10-16 13:09:43 -07:00
Rafael Auler
42f957bb75 [BOLT] Integrate perf2bolt into llvm-bolt
Summary:
Move the data aggregator logic from our python script to
our C++ LLVM/BOLT libs. This has a dramatic reduction in processing
time for profiling data (from 45 minutes for HHVM to 5 minutes) because
we directly use BOLT as a disassembler in order to validate traces found
in the LBR and to add the fallthrough counts. Previously, the python
approach relied on parsing the output objdump to check traces.

(cherry picked from FBD5761313)
2017-09-01 18:13:51 -07:00
Rafael Auler
9df155ce11 [BOLT] Introduce non-LBR mode
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.

This patch also introduces different options to convert basic block
execution count to edge count, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.

(cherry picked from FBD5664735)
2017-08-02 10:59:33 -07:00
Rafael Auler
21c48f7d78 Fix profiling for functions with multiple entry points
Summary:
Fix issue in memcpy where one of its entry points was getting
no profiling data and was wrongly considered cold, being put in the cold
region.

(cherry picked from FBD5569156)
2017-08-02 18:14:01 -07:00
Maksim Panchenko
ae409f0b27 [BOLT] Better match LTO functions profile.
Summary:
* Improve profile matching for LTO binaries that don't match 100%.
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.

There's an improvement for 100% match profile and for stale LTO
profile. However, we are still not fully closing the gap with
stale profile when LTO is enabled.

(NOTE: I haven't updated all test cases yet)

(cherry picked from FBD5529293)
2017-07-17 11:22:22 -07:00
Bill Nell
d74997c3cc Indirect call promotion optimization.
Summary:
Perform indirect call promotion optimization in BOLT.

The code scans the instructions during CFG creation for all
indirect calls.  Right now indirect tail calls are not handled
since the functions are marked not simple.  The offsets of the
indirect calls are stored for later use by the ICP pass.

The indirect call promotion pass visits each indirect call and
examines the BranchData for each.  If the most frequent targets
from that callsite exceed the specified threshold (default 90%),
the call is promoted.  Otherwise, it is ignored.  By default,
only one target is considered at each callsite.

When an candiate callsite is processed, we modify the callsite
to test for the most common call targets before calling through
the original generic call mechanism.

The CFG and layout are modified by ICP.

A few new command line options have been added:
-indirect-call-promotion
-indirect-call-promotion-threshold=<percentage>
-indirect-call-promotion-topn=<int>

The threshold is the minimum frequency of a call target needed
before ICP is triggered.

The topn option controls the number of targets to consider for
each callsite, e.g. ICP is triggered if topn=2 and the total
requency of the top two call targets exceeds the threshold.

Example of ICP:

C++ code:

  int B_count = 0;
  int C_count = 0;

  struct A { virtual void foo() = 0; }
  struct B : public A { virtual void foo() { ++B_count; }; };
  struct C : public A { virtual void foo() { ++C_count; }; };

  A* a = ...
  a->foo();
  ...

original:
  400863:	49 8b 07             	mov    (%r15),%rax
  400866:	4c 89 ff             	mov    %r15,%rdi
  400869:	ff 10                	callq  *(%rax)
  40086b:	41 83 e6 01          	and    $0x1,%r14d
  40086f:	4d 89 e6             	mov    %r12,%r14
  400872:	4c 0f 44 f5          	cmove  %rbp,%r14
  400876:	4c 89 f7             	mov    %r14,%rdi
  ...

after ICP:
  40085e:	49 8b 07             	mov    (%r15),%rax
  400861:	4c 89 ff             	mov    %r15,%rdi
  400864:	49 ba e0 0b 40 00 00 	movabs $0x400be0,%r10
  40086b:	00 00 00
  40086e:	4c 3b 10             	cmp    (%rax),%r10
  400871:	75 29                	jne    40089c <main+0x9c>
  400873:	41 ff d2             	callq  *%r10
  400876:	41 83 e6 01          	and    $0x1,%r14d
  40087a:	4d 89 e6             	mov    %r12,%r14
  40087d:	4c 0f 44 f5          	cmove  %rbp,%r14
  400881:	4c 89 f7             	mov    %r14,%rdi
  ...

  40089c:	ff 10                	callq  *(%rax)
  40089e:	eb d6                	jmp    400876 <main+0x76>

(cherry picked from FBD3612218)
2016-09-07 18:59:23 -07:00
Theodoros Kasampalis
713e361f36 Fix for correct disassembling of conditional tail calls.
Summary:
BOLT attempts to convert jumps that serve as tail calls to dedicated tail call
instructions, but this is impossible when the jump is conditional because there is
no corresponding tail call instruction. This was causing the creation of a duplicate
fall-through edge for basic blocks terminated with a conditional jump serving as
a tail call when there is profile data available for the non-taken branch. In this
case, the first fall-through edge had a count taken from the profile data, while
the second has a count computed (incorrectly) by
BinaryFunction::inferFallThroughCounts.

(cherry picked from FBD3560504)
2016-07-13 18:57:40 -07:00
Maksim Panchenko
84b5b9e462 Create alternative name for local symbols.
Summary:
If a profile data was collected on a stripped binary but an input
to BOLT is unstripped, we would use a different mangling scheme for
local functions and ignore their profiles. To solve the issue this
diff adds alternative name for all local functions such that one
of the names would match the name in the profile.

If the input binary was stripped, we reject it, unless "-allow-stripped"
option was passed. It's more complicated to do a matching in this case
since we have less information than at the time of profile collection.
It's also not that simple to tell if the profile was gathered on a
stripped binary (in which case we would have no issue matching data).

(cherry picked from FBD3548012)
2016-07-11 18:51:13 -07:00
Theodoros Kasampalis
6eb4e5b687 perf2bolt can extract branch records with histories
Summary:
Added perf2bolt functionality for extracting branch records
with histories of previous branches. The length of the histories
is user defined, and the default is 0 (previous functionality). Also,
DataReader can parse perf2bolt output with histories.
Note: creating profile data with long histories can increase their
size significantly (2x for history of length 1, 3x for length 2 etc).

(cherry picked from FBD3473983)
2016-06-21 18:44:42 -07:00
Maksim Panchenko
f1192a7118 Support for multiple function names.
Summary:
With ICF optimization in the linker we were getting mismatches of
function names in .fdata and BinaryFunction name. This diff adds
support for multiple function names for BinaryFunction and
does a match against all possible names for the profile.

(cherry picked from FBD3466215)
2016-06-10 17:13:05 -07:00
Bill Nell
980a06265a Revert "Indirect call optimization."
This reverts commit 33966090e18545b64013614e7929ff1bdcdf10d5.

(cherry picked from FBD28110782)
2016-06-08 17:38:13 -07:00
Bill Nell
8bcfd9a392 Indirect call optimization.
(cherry picked from FBD28110629)
2016-06-07 16:27:52 -07:00
Theodoros Kasampalis
fb5f18b2dc Correctly updating landing pad exec counts.
(cherry picked from FBD28110316)
2016-05-23 16:16:25 -07:00
Maksim Panchenko
595d0885d9 Populate function execution count while parsing fdata.
Summary:
Populate function execution count while parsing fdata. Before
we used a quadratic algorithm to populate the execution count
(had to iterate over *all* branches for every single function).

Ignore non-symbol to non-symbol branches while parsing fdata.

These changes combined drop HHVM processing time from
4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver.

Test case had to be modified since it contained irrelevant
branches from PLT to libc.

(cherry picked from FBD3106263)
2016-03-28 11:06:28 -07:00
Maksim Panchenko
d1526083fc Rename binary optimizer to BOLT.
Summary:
BOLT - Binary Optimization and Layout Tool replaces FLO.
I'm keeping .fdata extension for "feedback data".

(cherry picked from FBD2908028)
2016-02-05 14:42:04 -08:00
Rafael Auler
9a8d357d0b Fix DataReader to work with new local sym perf2flo format
Summary:
In a recent commit, we changed local symbols to be specially tagged
with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader
to don't choke when seeing a 2 in the symbol id field.

(cherry picked from FBD2552776)
2015-10-16 17:00:36 -07:00
Rafael Auler
4c1da22ae9 Add branch count information to binary CFG
Summary:
Changes DataReader to organize branch perf data per function name and
sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this,
we expand BinaryContext with a const reference to DataReader. This patch also
adds the "-dump-functions" flag to force llvm-flo to dump the current state of
BinaryFunctions once they are disassembled and their CFG built, allowing us to
test whether the builder is sane with LLVM LIT tests.

(cherry picked from FBD2534675)
2015-10-12 12:30:47 -07:00
Rafael Auler
e1a539b0ec Add initial implementation of DataReader
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.

(cherry picked from FBD2515754)
2015-10-05 18:31:25 -07:00