Commit Graph

325 Commits

Author SHA1 Message Date
spupyrev
244a476a2e using offsets for CG
Summary: Arc->AvgOffset can be used for function/block ordering to distinguish between calls from the beginning of a function and calls from the end of the function. This makes a difference for large functions.

(cherry picked from FBD6094221)
2017-10-18 15:18:52 -07:00
Maksim Panchenko
61e5fbf8c3 [BOLT][Refactoring] Get rid of TailCallTerminatedBlocks, etc.
Summary:
More changes to allow separation of CFG construction and
profile assignment. Misc cleanups.

(cherry picked from FBD6158653)
2017-10-23 23:32:40 -07:00
Bill Nell
c58996fd55 [BOLT] Add ability to specify custom printers for annotations.
Summary:
This will give us the ability to print annotations in a more meaningful way, especially annotations that could be interpreted in multiple ways.  I've added one register name printer for liveness analysis.  We can update the other dataflow annotations as needed.

I also noticed that BitVector annotations were leaking since they contain heap allocated memory.  I made removeAnnotation call the annotation destructor explicitly to mitigate this but it won't fix the problem when annotations are just dropped en masse.

(cherry picked from FBD6105999)
2017-10-19 12:36:48 -07:00
Maksim Panchenko
2ab7472329 [BOLT] Account for FDE functions when calculating max function size
Summary:
When calculating the maximum function size, we used to rely only on
symbol table information and ignored function info coming from FDEs.
An invalid maximum function size can lead to code being emitted over
the code of a neighbouring function.

Fix this by considering FDE functions when determining the maximum
function size.

(cherry picked from FBD6025613)
2017-10-10 14:54:09 -07:00
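A minimal sketch of the idea in 2ab7472329, not BOLT's actual code: the room available to a function is bounded by the nearest following function start, and that start may be known only from an FDE. The function name, its parameters, and the addresses below are hypothetical.

```python
from bisect import bisect_right

def max_function_size(start, symtab_starts, fde_starts, section_end):
    """Return the number of bytes available to the function at `start`.

    The upper bound is the nearest function start above `start`, taken
    from BOTH the symbol table and the FDE records, or the end of the
    section if no function follows.
    """
    starts = sorted(set(symtab_starts) | set(fde_starts))
    i = bisect_right(starts, start)
    next_start = starts[i] if i < len(starts) else section_end
    return next_start - start

# Hypothetical layout: the function at 0x1040 is described only by an FDE.
symtab = [0x1000, 0x1100]
fdes = [0x1000, 0x1040, 0x1100]

symtab_only = max_function_size(0x1000, symtab, [], 0x2000)
with_fdes = max_function_size(0x1000, symtab, fdes, 0x2000)
```

With symbol table data alone, the function at 0x1000 appears to own the bytes of its FDE-only neighbour; adding the FDE starts shrinks the bound so that emission cannot overwrite it.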
Maksim Panchenko
1e1833c8a2 [BOLT][Refactoring] Make CTC first class operand, etc.
Summary:
This diff is a preparation for decoupling function disassembly,
profile association, and CFG construction phases.

We used to have multiple ways to mark conditional tail calls: with
annotations or with the TailCallOffsets map. Since CTC information affects
correctness, it is justifiable to make it an operand class for an
instruction with a destination (0 is a valid one).

"Offset" annotation now replaces "EdgeCountData" and
"IndirectBranchData" annotations to extract profile data for any
given instruction.

Inlining of small functions was broken in the presence of
profiled (annotated) instructions, hence I had to remove
"-inline-small-functions" from the test case.

Also fix an issue with UNDEF section for created __hot_start/__hot_end
symbols. Now the symbols use ABS section.

(cherry picked from FBD6087284)
2017-10-12 14:57:11 -07:00
spupyrev
b77172ce2f updating cache metrics
Summary:
This is a replacement of a previous diff. The implemented metric
('graph distance') is not very useful at the moment but I plan to add
more relevant metrics in the subsequent diff. This diff fixes some
obvious problems and moves the call of CalcMetrics::printAll to the
right place.

(cherry picked from FBD6072312)
2017-10-16 16:53:50 -07:00
Maksim Panchenko
4c8f48be3d [BOLT] Fix function order output option
Summary:
Add support to output both function order and section order files
as the former is useful for offloading functions sorting and
the latter is useful for linker script generation:

  -generate-function-order=<file>
  -generate-link-sections=<file>

(cherry picked from FBD6078446)
2017-10-17 10:05:16 -07:00
Maksim Panchenko
bee9132a54 [BOLT] Change function order file format for linker script
Summary:
Change output of "-generate-function-order=<file>" to match expected
format used for a linker script:

  * Prefix function names with ".text".
  * Strip internal suffix from local function names. E.g. for function
    with names "foo/1" and "foo/foo.c/1" we will only output "foo".
  * Output (with indentation) duplicate names for folded functions.

(cherry picked from FBD6071020)
2017-10-16 15:22:05 -07:00
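The naming rules listed in bee9132a54 can be sketched roughly as follows. This is an illustration under assumptions, not the BOLT implementation: it assumes the internal suffix is everything from the first '/' onward, that the prefix is joined as ".text.<name>", and that folded duplicates are indented by two spaces. Both helper names are hypothetical.

```python
def linker_section_name(name):
    """Map an internal function name to a linker-script style entry.

    Assumes the internal suffix starts at the first '/' (as in "foo/1"
    and "foo/foo.c/1") and that the prefix is joined as ".text.<name>".
    """
    return ".text." + name.split("/", 1)[0]

def format_function_order(functions):
    """Build the output lines for an ordered list of functions.

    `functions` is a list of (aliases, folded) pairs: `aliases` are the
    names of one function, `folded` are names of functions folded into it.
    """
    lines = []
    for aliases, folded in functions:
        lines.append(linker_section_name(aliases[0]))
        # Folded duplicates are listed under the surviving function, indented.
        lines.extend("  " + linker_section_name(n) for n in folded)
    return lines

order = format_function_order([
    (["foo/1", "foo/foo.c/1"], []),
    (["bar"], ["baz/2"]),
])
```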
Maksim Panchenko
1605f07f5c [BOLT] Create symbol table entries under -hot-text if they did not exist
Summary:
If the "-hot-text" option is specified and the input binary does not
have __hot_start/__hot_end symbols, then add them to the symbol table.

(cherry picked from FBD6027737)
2017-10-10 18:06:45 -07:00
Maksim Panchenko
3d3fefff46 [BOLT] Use 32 as the default max bytes for function alignment
Summary:
Several benchmarks (hhvm, compilers) show that 32 provides a good
balance between I-Cache performance and iTLB misses.

(cherry picked from FBD6026476)
2017-10-10 16:36:01 -07:00
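One way to read the "max bytes" limit in 3d3fefff46 is as the max-skip argument of the assembler's .p2align directive: align only when the padding needed does not exceed the limit. The sketch below assumes that reading; the 64-byte alignment and the function name are illustrative assumptions, and only the default of 32 comes from the commit.

```python
def aligned_address(addr, alignment=64, max_bytes=32):
    """Return where a function would start under a capped alignment.

    The function is aligned to `alignment` only if that costs at most
    `max_bytes` of padding; otherwise it stays where it is.
    """
    padding = -addr % alignment  # bytes needed to reach the next boundary
    return addr + padding if padding <= max_bytes else addr

cheap = aligned_address(0x1030)      # 16 bytes of padding: aligned
expensive = aligned_address(0x1008)  # 56 bytes of padding: left unaligned
```

A smaller cap wastes fewer bytes on padding, at the cost of leaving more functions starting mid cache line, which is the balance the commit describes.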
Rafael Auler
7689cf2417 [BOLT] Fix bolt_info ELF note
Summary:
Small fix - align the end of the descriptor string as well,
since readelf will detect when it is not aligned and print an error
instead of printing BOLT version and command line.

(cherry picked from FBD6023643)
2017-10-10 13:30:05 -07:00
Rafael Auler
0cc2a62f6a [BOLT] Write bolt info according to ELF spec
Summary:
Follow ELF spec for NOTE sections when writing bolt info.
Since tools such as "readelf -n" will not recognize a custom code
identifying our new note section, we use GNU "gold linker version"
note, tricking readelf into printing bolt info.

(cherry picked from FBD6010153)
2017-10-06 17:54:26 -07:00
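The note layout that 0cc2a62f6a refers to can be sketched as below. An ELF note is three 32-bit words (name size, descriptor size, type) followed by the name and the descriptor, each padded to a 4-byte boundary. The sketch assumes a little-endian target; the descriptor string is a placeholder, not BOLT's real output, and the helper names are hypothetical.

```python
import struct

NT_GNU_GOLD_VERSION = 4  # GNU note type that readelf prints as a gold version

def pad4(data):
    """Pad bytes with NULs up to the next 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)

def build_note(desc, name=b"GNU\0", note_type=NT_GNU_GOLD_VERSION):
    """Serialize one note entry (little-endian assumed)."""
    # namesz and descsz record the unpadded lengths; the padding is implicit.
    header = struct.pack("<III", len(name), len(desc), note_type)
    return header + pad4(name) + pad4(desc)

example_desc = b"example descriptor\0"  # placeholder descriptor text
note = build_note(example_desc)
```

Padding the descriptor as well as the name keeps the end of the entry aligned, which is what lets tools walk from one note to the next.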
Rafael Auler
0ed144a188 [PERF2BOLT] Check build-ids of binaries when aggregating
Summary:
Check the build-id of the input binary against the build-id of
the binary used during profiling data collection with perf, as reported
in perf.data. If they differ, issue a warning, since the user should use
exactly the same binary. If we cannot determine the build-id of either
the input binary or the one registered in the input perf.data, cancel the
build-id check but print a log message.

(cherry picked from FBD6001917)
2017-10-06 14:42:46 -07:00
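The decision logic described in 0ed144a188 can be sketched as a small function. This is an illustration only: the function name, the status strings, and the message wording are hypothetical, and build-ids are assumed to arrive as hex strings, or None when they cannot be determined.

```python
def check_build_id(binary_id, perf_id):
    """Compare the build-id of the input binary with the one in perf.data.

    Returns a (status, message) pair. A missing id on either side cancels
    the check; a mismatch is a warning rather than a hard error.
    """
    if binary_id is None or perf_id is None:
        return ("skipped", "build-id unavailable, skipping build-id check")
    if binary_id.lower() != perf_id.lower():
        return ("warning", "build-id mismatch between binary and perf.data")
    return ("ok", "")

match = check_build_id("ABCD1234", "abcd1234")
mismatch = check_build_id("abcd1234", "ffff0000")
unknown = check_build_id(None, "abcd1234")
```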
spupyrev
f77a6acd71 fixing sizes
Summary: In some (weird) cases, a Function is marked 'split' but doesn't contain any 'cold' basic block. In that case, the size of the last basic block of the function is computed incorrectly. Hence, this fix.

(cherry picked from FBD6012963)
2017-10-09 14:15:38 -07:00
Rafael Auler
9df6dce234 [PERF2BOLT] Fix aggregator wrt new output format of perf
Summary:
Perf is now outputting one less space, which broke our previous
(flaky) assumptions about field separators when processing the output
file. Make it more resilient by accepting any number of spaces before
reading LBR entries.

(cherry picked from FBD6014941)
2017-10-09 15:52:13 -07:00
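The robustness fix in 9df6dce234 amounts to not counting separator characters. A minimal sketch, assuming LBR entries look like "from/to/flag/..." tokens as printed by perf script for branch stacks; the function name is hypothetical and this is not the aggregator's actual parser.

```python
def parse_lbr_entries(text):
    """Parse LBR tokens separated by any amount of whitespace.

    Each token is assumed to be 'from/to/flag/...' with hexadecimal
    addresses; tokens with fewer than three fields are skipped.
    """
    entries = []
    for token in text.split():  # split() collapses runs of spaces
        fields = token.split("/")
        if len(fields) < 3:
            continue
        entries.append((int(fields[0], 16), int(fields[1], 16), fields[2]))
    return entries

one_space = parse_lbr_entries(" 0x4005c8/0x4005dc/P/-/-/0 0x400500/0x4005c0/M/-/-/0")
many_spaces = parse_lbr_entries("    0x4005c8/0x4005dc/P/-/-/0   0x400500/0x4005c0/M/-/-/0")
```

Both inputs yield the same entries, so a change in the number of spaces that perf emits no longer affects the result.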
Rafael Auler
f02c8c29ee [PERF2BOLT] Improve user messages about profiling stats
Summary:
Improve messages and color-code bad traces percentage, warning
user about a potential input binary mismatch.

(cherry picked from FBD5915934)
2017-09-26 14:42:43 -07:00
Maksim Panchenko
f32784f4cb [BOLT] Ignore Clang LTO artifact file symbol
Summary:
The presence of the ld-temp.o symbol is somewhat nondeterministic.
I couldn't find out exactly when it's generated, it could be
related to LTO vs ThinLTO, but not always.

If the symbol is there, it can affect the names of most
functions in an LTO binary. The status of the symbol
may change between the binary the profile was collected on
and the binary BOLT is called on. As a result, we may mismatch
many function names.

It is safe to ignore this symbol.

(cherry picked from FBD5908955)
2017-09-25 18:05:37 -07:00
Bill Nell
aa05dc91c5 Fix SCTC bug when two pred/succ BB are in a loop.
Summary: It's possible that two basic blocks being considered for SCTC are in a loop in the CFG.  In this case a block that is both a predecessor and a successor may have been processed and marked invalid by a previous iteration of the SCTC loop. We should skip rewriting in this case.

(cherry picked from FBD5886721)
2017-09-21 15:45:39 -07:00
Rafael Auler
42f957bb75 [BOLT] Integrate perf2bolt into llvm-bolt
Summary:
Move the data aggregator logic from our python script to
our C++ LLVM/BOLT libs. This dramatically reduces processing
time for profiling data (from 45 minutes for HHVM to 5 minutes) because
we directly use BOLT as a disassembler in order to validate traces found
in the LBR and to add the fallthrough counts. Previously, the python
approach relied on parsing the output of objdump to check traces.

(cherry picked from FBD5761313)
2017-09-01 18:13:51 -07:00
Maksim Panchenko
156fc73157 [BOLT] Fix SCTC bug
Summary:
If a conditional branch has been converted to a conditional tail call,
it may be considered for SCTC optimization later since it will
appear as a tail call. We have to make sure that the tail call
we are considering is not a conditional branch.

(cherry picked from FBD5884777)
2017-09-19 16:59:05 -07:00
Maksim Panchenko
b006d2a860 [BOLT] Fix issue with exception handlers splitting
Summary:
A cold part of a function can start with a landing pad. As a
result, this landing pad will have offset 0 from the start
of the corresponding FDE, and it wouldn't get registered by
exception-handling runtime.

The solution is to use a different landing pad base address
(LPStart), such as (FDE_start - 1).

(cherry picked from FBD5876561)
2017-09-20 13:32:46 -07:00
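The problem fixed in b006d2a860 comes from how call-site records encode landing pads: the stored value is an offset from LPStart, and a stored value of 0 means "no landing pad". A small worked sketch, with hypothetical addresses and a hypothetical helper name:

```python
def landing_pad_field(lp_address, lp_start):
    """Value stored in a call-site record: the pad's offset from LPStart."""
    return lp_address - lp_start

cold_fde_start = 0x2000       # hypothetical start of the cold fragment
landing_pad = cold_fde_start  # the cold fragment begins with a landing pad

# With LPStart equal to the FDE start, the pad encodes as 0, which the
# runtime reads as "no landing pad here".
default_base = landing_pad_field(landing_pad, cold_fde_start)

# With LPStart moved to (FDE_start - 1), the same pad encodes as 1.
shifted_base = landing_pad_field(landing_pad, cold_fde_start - 1)
```

Shifting the base by one byte makes every landing pad in the fragment encode as a nonzero offset, so none of them collides with the "no landing pad" value.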
Rafael Auler
ef0ec9edf9 [BOLT] Fix frameopt=all for gcc
Summary:
Fix two bugs. First, stack pointer tracking, the dataflow
analysis, was converging to the "superposition" state (meaning that at
this point there are multiple and conflicting states) too early in case
the entry state in the BB was "empty" AND there was an SP computation in
the block. In these cases, we need to propagate an "empty" value as well
and wait for an iteration where the input is not empty (only entry BBs
start with a non-empty well-defined value). Previously, it was
propagating "superposition", meaning there is a conflict of states in
this block, which is not true, since the input is empty and, therefore,
there is no preceding state to justify a collision of states.

Second, if SPT failed and has no idea about the stack values in a block
(if it is in the superposition state at a given point in a BB), shrink
wrapping should not attempt to insert computation into blocks where
we do not understand what is happening. Fix it to bail in those
cases.

(cherry picked from FBD5858402)
2017-09-18 16:26:00 -07:00
Rafael Auler
9df155ce11 [BOLT] Introduce non-LBR mode
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.

This patch also introduces different options to convert basic block
execution count to edge count, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.

(cherry picked from FBD5664735)
2017-08-02 10:59:33 -07:00
Maksim Panchenko
29d4f4cfac [BOLT] Ignore TLS relocations types
Summary:
No special handling is required for TLS relocation types,
and if we see them in the binary we can safely ignore them.

(cherry picked from FBD5853889)
2017-09-13 11:21:47 -07:00
Maksim Panchenko
ec5b3b0a65 [BOLT] Fix bug in SCTC
Summary:
After SCTC optimization fixDoubleJumps() was relying on CFG information
on the number of successors of a basic block. It ignored the fact that
conditional tail call had a successor outside of the function and
deleted a containing basic block.

Discovered while testing old HHVM with disabled jump tables.

(cherry picked from FBD5752903)
2017-08-31 17:28:14 -07:00
Maksim Panchenko
bd8e4b9e87 [BOLT] Support PIC-style exception tables
Summary:
Exception tables for PIC may contain indirect type references
that are also encoded using relative addresses.

This diff adds support for such encodings. We read the PIC-style
type info table and write it using the new encoding.

(cherry picked from FBD5716060)
2017-08-27 17:04:06 -07:00
Maksim Panchenko
49d1f5698d [BOLT] PLT optimization
Summary:
Add an option to optimize PLT calls:

  -plt  - optimize PLT calls (requires linking with -znow)
    =none - do not optimize PLT calls
    =hot  - optimize executed (hot) PLT calls
    =all  - optimize all PLT calls

When optimized, the calls are converted to use GOT reference
indirectly. GOT entries are guaranteed to contain a valid
function pointer if lazy binding is disabled - hence the
requirement for linker's -znow option.

Note: we can add an entry to .dynamic and drop a requirement
for -znow if we were moving .dynamic to a new segment.

(cherry picked from FBD5579789)
2017-08-04 11:21:05 -07:00
Maksim Panchenko
0c07445110 [BOLT] Fix printing of dyno-stats
Summary:
We used to print dyno-stats after instruction lowering
which was skewing our metrics as tail calls were no longer
recognized as calls for one thing. The fix is to control
the point at which dyno-stats printing pass is run and run
it immediately before instruction lowering. In the future we
may decide to run the pass before some other intervening pass.

(cherry picked from FBD5605639)
2017-08-10 13:18:44 -07:00
Rafael Auler
21c48f7d78 Fix profiling for functions with multiple entry points
Summary:
Fix issue in memcpy where one of its entry points was getting
no profiling data and was wrongly considered cold, being put in the cold
region.

(cherry picked from FBD5569156)
2017-08-02 18:14:01 -07:00
Rafael Auler
b81ff8a8fc [BOLT] Fix SCTC issue with hot-cold split
Summary:
SCTC was deleting an unconditional branch to a block in the
cold area because it was the next block in the layout vector. Fix the
condition to only delete such branches when source and target are in
the same allocation area (either both hot or both cold).

(cherry picked from FBD5570300)
2017-08-04 20:14:24 -07:00
Maksim Panchenko
e4290d083f [BOLT] Disable last basic block assertion.
Summary:
While converting code from __builtin_unreachable() we were asserting
that a basic block with a conditional jump and a single CFG successor
was the last one before converting the jump to an unconditional one.

However, if that code was executed after a conditional tail call
conversion in the same function, the original last basic block
will no longer be the last one in the post-conversion layout.

I'm disabling the assertion since it doesn't seem worth it to add
extra checks for the basic block that used to be the last one.

(cherry picked from FBD5570298)
2017-08-04 19:39:45 -07:00
Maksim Panchenko
ae409f0b27 [BOLT] Better match LTO functions profile.
Summary:
* Improve profile matching for LTO binaries that don't match 100%.
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.

There's an improvement for 100% match profile and for stale LTO
profile. However, we are still not fully closing the gap with
stale profile when LTO is enabled.

(NOTE: I haven't updated all test cases yet)

(cherry picked from FBD5529293)
2017-07-17 11:22:22 -07:00
Maksim Panchenko
d27b31ee07 [BOLT] Fix reading LSDA address for PIC code
Summary:
Fix a bug while reading LSDA address in PIC format. The base address was
wrong for PC-relative value. There's more work involved in making PIC
code with C++ exceptions work.

(cherry picked from FBD5538755)
2017-08-01 11:19:01 -07:00
Yue Zhao
eb64d03b73 Reformat the register strings in the output so Stoke can parse without preprocessing.
Summary:
Minor change. Reformat the def-in, live-out register strings so that Stoke can parse
without doing preprocessing.

(cherry picked from FBD5537421)
2017-07-27 12:52:56 -07:00
Bohan Ren
87481cb494 [BOLT] Improve Jump-Distance Metric -- Consider Function Execution Count
Summary:
Function execution count is very important. When calculating the metric, we
should care more about functions which are known to be executed.

The correlation between this metric and CPU time is slightly improved,
to close to 96%, and the correlation between this metric and Cache Miss
remains the same at 96%.

Thanks to Sergey for the suggestion!

(cherry picked from FBD5494720)
2017-07-25 16:27:00 -07:00
Rafael Auler
787db1cf3e Recognize AArch64 as a valid input
Summary:
BOLT needs to be configured with the LLVM
AArch64 backend. If the backend is linked into the LLVM
library, start processing AArch64 binaries.

(cherry picked from FBD5489369)
2017-07-25 09:11:42 -07:00
Yue Zhao
70bad8d34d add: get function score to find hot functions refine the dumped csv format
Summary: minor modification of the bolt stoke pass

(cherry picked from FBD5471011)
2017-07-13 15:02:52 -07:00
Yue Zhao
6d845719ce get analysis information of functions
Summary:
complete the StokeInfo pass,
ignore previous arc diff

(cherry picked from FBD5306863)
2017-06-13 17:24:27 -07:00
Rafael Auler
4e29afeb18 [BOLT] Add cold symbols to the symbol table
Summary:
Create new .symtab and .strtab sections, so we can change their
sizes and not only patch them. Remove local symbols and add symbols to
identify the cold part of split functions.

(cherry picked from FBD5345460)
2017-06-27 16:25:59 -07:00
Bohan Ren
4d34471eeb [BOLT] Improved Jump-Distance Metric
Summary:
The existing Jump-Distance Metric (previously named Call-Distance) ignores some traversals.
This modified version adds those missing traversals back.

The correlation remains the same: around 97% correlation with CPU and
Cache Miss (which implies that even though some traversals are ignored,
it doesn't affect correlation that much.)

(cherry picked from FBD5369653)
2017-07-04 15:59:29 -07:00
Rafael Auler
4ecd3856e9 [BOLT] Fix shrink-wrapping bugs
Summary:
Make shrink-wrapping more stable. Changes:

* Correctly detect landing pads at the dominance frontier, bailing
  on such cases because we are not prepared to split LPs that are target
  of a critical edge.
* Disable FOP's store removal by default - this is experimental and
  shouldn't go to prod because removing a store that is actually
  necessary, but that we failed to detect as such, is disastrous. This
  pass currently doesn't have a great impact on the number of stores
  reduced, so it is not a problem. Most stores reduced are due to shrink
  wrapping anyway.
* Fix stack access identification - correctly estimate memory length of
  weird instructions, bail if we don't know.
* Make rules for shrink-wrapping more strict: cancel shrink wrapping on
  a number of cases when we are not 100% sure that we are dealing with a
  regular callee-saved register.
* Add basic block folding to SW. Sometimes when splitting critical edges
  we create a lot of redundant BBs with the same instructions, same
  successor but different predecessor. Fold all identical BBs created by
  splitting critical edges.
* Change defaults: now the threshold used to determine when to perform
  SW is more conservative, to be sure we are moving a spill to a colder
  area. This effort, along with BB folding, helps us to avoid hurting
  icache performance by indiscriminately increasing code size.

(cherry picked from FBD5315086)
2017-06-22 16:34:01 -07:00
Bohan Ren
ec304396c3 [BOLT] Call Distance Metric
Summary:
Designed a new metric, which shows 93.46% correlation with Cache Miss
and 86% correlation with CPU Time.

Definition:

One can get all the traversal paths for each function. For each traversal,
we define a distance. The distance represents how far apart two connected
basic blocks are. Therefore, for each traversal, I go through the
basic blocks one by one until the end of the traversal and sum up the
distances between neighboring basic blocks.
The distance between two connected basic blocks is the distance between the
centers of the two blocks in the binary file.

(cherry picked from FBD5242526)
2017-06-13 16:29:39 -07:00
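The definition in ec304396c3 can be written out directly. In this sketch a basic block is a hypothetical (address, size) pair and a traversal is a list of such blocks in execution order; how traversals are enumerated, and any weighting by execution count, is outside its scope.

```python
def block_center(block):
    """Center of a basic block given as an (address, size) pair."""
    address, size = block
    return address + size / 2

def traversal_distance(traversal):
    """Sum of center-to-center distances between consecutive blocks."""
    return sum(
        abs(block_center(b) - block_center(a))
        for a, b in zip(traversal, traversal[1:])
    )

# Hypothetical traversal: jump far forward, then back near the start.
path = [(0, 10), (100, 20), (10, 10)]
distance = traversal_distance(path)  # centers are 5, 110, 15
```

A layout that places consecutively executed blocks next to each other lowers this sum, which is why the metric is expected to track cache behaviour.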
Rafael Auler
3469396269 [BOLT] Set local symbols in relocation mode to zero
Summary:
Strobelight is getting confused by local symbols that we do not
update in relocation mode. These symbols were preserved by the linker in
relocation mode in order to support emitting relocations against local
labels, but they are unused.

Issue a quick fix to this by detecting such symbols and setting their
value to zero.

This patch also fixes an issue with the symbol table that was assigning
the wrong section index to symbols associated with the .text section.

(cherry picked from FBD5271277)
2017-06-16 20:04:43 -07:00
Bill Nell
59e90f0f43 [BOLT] Make function reordering more robust with stale data.
Summary:
Rewrote the guts of buildCallGraph.  There are two new options to control how the CG is created.  UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile.  IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions.

I've changed the way BinaryFunction::estimateHotSize() works.  If the function is marked as split, I count the size of all the non-cold blocks.  This gives a different but more accurate answer than the old method.

I've improved and updated the CG build stats with extra information.

(cherry picked from FBD5224183)
2017-06-09 13:17:36 -07:00
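The split-function rule for BinaryFunction::estimateHotSize() described in 59e90f0f43 can be sketched as below. Blocks are hypothetical (size, is_cold) pairs, and the fallback for non-split functions (counting every block) is an assumption, since the commit message only describes the split case.

```python
def estimate_hot_size(blocks, is_split):
    """Estimate the hot size of a function.

    `blocks` is a list of (size, is_cold) pairs. For a split function only
    the non-cold blocks are counted; otherwise all blocks are counted
    (an assumption made for this sketch).
    """
    if is_split:
        return sum(size for size, cold in blocks if not cold)
    return sum(size for size, _ in blocks)

blocks = [(16, False), (8, True), (24, False)]
split_size = estimate_hot_size(blocks, is_split=True)
whole_size = estimate_hot_size(blocks, is_split=False)
```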
Rafael Auler
8233c7d204 [BOLT] Bail frame analysis on PUSHes escaping vars
Summary:
Some PUSH instructions may contain memory addresses pushed to
the stack. If this memory address is from an object in the stack, cancel
further frame analysis for this function since it may be escaping a
variable.

This fixes a bug with deleting used stores (in frameopt) in hhvm trunk.

(cherry picked from FBD5270590)
2017-06-16 15:02:26 -07:00
Yue Zhao
37d0f81df5 BinaryFunction.h: Clarify comment for getSize(), add getNumNonPseudos()
Summary: Minor fix and add new function

(cherry picked from FBD5270376)
2017-06-16 17:06:13 -07:00
Bill Nell
dc4dd64800 [BOLT] More HFSort+ refactoring
Summary: Move most of hfsort+ into a class so the state can more easily be shared.

(cherry picked from FBD5216206)
2017-06-08 10:55:28 -07:00
Bohan Ren
f819f53d27 Normalize Clusters Twice
Summary:
This one will normalize clusters twice, leaving edges connecting two
basic blocks untouched

(cherry picked from FBD5207416)
2017-06-07 20:25:30 -07:00
Rafael Auler
eeea415dd2 [BOLT] Fix SCTC execution count assertion
Summary:
SCTC is currently asserting (my fault :-) when running in
combination with hot jump table entries optimization. This optimization
sets the frequency for edges connecting basic blocks it creates and jump
table targets based on the execution count of the original BB containing
the indirect jump.

This is OK as an estimation, but it breaks our assumption that the sum of
the frequency of preds edges equals our BB frequency. This happens
because the frequency of the BB is rarely equal to its outgoing edges
frequency.

SCTC, in turn, was updating the execution count for BBs with tail calls
by subtracting the frequency count of predecessor edges. Because hot
jump table entries optimization broke the BB exec count = sum(preds freq)
invariant, SCTC was asserting.

To trigger this, the input program must have a jump table where each
entry contains a tail call. This happens in the HHVM binary for func
_ZN4HPHP11collections5issetEPNS_10ObjectDataEPKNS_10TypedValueE.

(cherry picked from FBD5222504)
2017-06-09 15:52:50 -07:00
Bohan Ren
eb63a0b295 [BOLT] Expand BOLT report for basic block ordering
Summary:
Add a new positional option onto bolt: "-print-function-statistics=<uint64>"
which prints information about block ordering for requested number of functions.

(cherry picked from FBD5105323)
2017-05-22 11:04:01 -07:00