Summary: Arc->AvgOffset can be used for function/block ordering to distinguish between calls from the beginning of a function and calls from the end of the function. This makes a difference for large functions.
(cherry picked from FBD6094221)
Summary:
This will give us the ability to print annotations in a more meaningful way. Especially annotations that could be interpreted in multiple ways. I've added one register name printer for liveness analysis. We can update the other dataflow annotations as needed.
I also noticed that BitVector annotations were leaking since they contain heap allocated memory. I made removeAnnotation call the annotation destructor explicitly to mitigate this but it won't fix the problem when annotations are just dropped en masse.
(cherry picked from FBD6105999)
Summary:
When calculating the maximum function size, we used to rely only on
the symbol table information and ignored function info coming from FDEs.
An invalid maximum function size can lead to code being emitted over
the code of a neighbouring function.
Fix this by considering FDE functions when determining the maximum
function size.
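A minimal sketch of the idea, assuming function start addresses are
available from both the symbol table and the FDEs. All names below are
hypothetical and are not BOLT's API; the point is only that the bound
comes from the closest following start address from either source.

```python
import bisect

def max_function_size(start, symtab_starts, fde_starts, section_end):
    """Upper bound on the size of the function that begins at `start`.

    Hypothetical helper: the bound is the distance to the nearest
    following function start, taken from the union of symbol-table and
    FDE start addresses, or to the section end if nothing follows.
    """
    starts = sorted(set(symtab_starts) | set(fde_starts))
    idx = bisect.bisect_right(starts, start)
    next_start = starts[idx] if idx < len(starts) else section_end
    return next_start - start
```

With only the symbol table, a function at 0x1000 followed by a symbol at
0x1100 gets a bound of 0x100; an FDE starting at 0x1080 tightens it to
0x80.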
(cherry picked from FBD6025613)
Summary:
This diff is a preparation for decoupling function disassembly,
profile association, and CFG construction phases.
We used to have multiple ways to mark conditional tail calls: with
annotations or with the TailCallOffsets map. Since CTC information
affects correctness, it is justifiable to have it as an operand class
for instructions with a destination (0 is a valid one).
"Offset" annotation now replaces "EdgeCountData" and
"IndirectBranchData" annotations to extract profile data for any
given instruction.
Inlining for small functions was broken in the presence of
profiled (annotated) instructions and hence I had to remove
"-inline-small-functions" from the test case.
Also fix an issue with UNDEF section for created __hot_start/__hot_end
symbols. Now the symbols use ABS section.
(cherry picked from FBD6087284)
Summary:
This is a replacement of a previous diff. The implemented metric
('graph distance') is not very useful at the moment but I plan to add
more relevant metrics in the subsequent diff. This diff fixes some
obvious problems and moves the call of CalcMetrics::printAll to the
right place.
(cherry picked from FBD6072312)
Summary:
Add support to output both function order and section order files
as the former is useful for offloading functions sorting and
the latter is useful for linker script generation:
-generate-function-order=<file>
-generate-link-sections=<file>
(cherry picked from FBD6078446)
Summary:
Change output of "-generate-function-order=<file>" to match expected
format used for a linker script:
* Prefix function names with ".text".
* Strip internal suffix from local function names. E.g. for function
with names "foo/1" and "foo/foo.c/1" we will only output "foo".
* Output (with indentation) duplicate names for folded functions.
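
The rules above can be sketched as follows. This is an illustration
only: the helper names are hypothetical, and it assumes that the
internal suffix starts at the first '/' and that the prefix is joined
with a dot (".text.<name>"), neither of which is spelled out here.

```python
def section_name(function_name):
    """Map an internal function name to a linker-script entry.

    Assumption: everything from the first '/' onwards is the internal
    suffix of a local function, e.g. "foo/1" or "foo/foo.c/1".
    """
    base = function_name.split("/", 1)[0]
    return ".text." + base

def function_order_lines(functions):
    """`functions` is a list of name lists. Names after the first one
    belong to functions that were folded into the first."""
    lines = []
    for names in functions:
        entries = []
        for name in names:
            entry = section_name(name)
            if entry not in entries:  # "foo/1" and "foo/foo.c/1" collapse
                entries.append(entry)
        lines.append(entries[0])
        # Duplicate names of folded functions are emitted indented.
        lines.extend("  " + dup for dup in entries[1:])
    return lines
```

For example, function_order_lines([["foo/1", "foo/foo.c/1"]]) yields a
single ".text.foo" line.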
(cherry picked from FBD6071020)
Summary:
If the "-hot-text" option is specified and the input binary did not
have __hot_start/__hot_end symbols, then add them to the symbol table.
(cherry picked from FBD6027737)
Summary:
Several benchmarks (hhvm, compilers) show that 32 provides a good
balance between I-Cache performance and iTLB misses.
(cherry picked from FBD6026476)
Summary:
Small fix - align the end of the descriptor string as well,
since readelf will detect when it is not aligned and print an error
instead of printing BOLT version and command line.
(cherry picked from FBD6023643)
Summary:
Follow ELF spec for NOTE sections when writing bolt info.
Since tools such as "readelf -n" will not recognize a custom code
identifying our new note section, we use GNU "gold linker version"
note, tricking readelf into printing bolt info.
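For reference, a rough sketch of how a note entry is laid out according
to the ELF spec: three 32-bit words (namesz, descsz, type) followed by
the name and the descriptor, each padded to a 4-byte boundary. The
function names and the example descriptor are made up for illustration;
this is not the code BOLT uses. NT_GNU_GOLD_VERSION is 4.

```python
import struct

NT_GNU_GOLD_VERSION = 4  # printed by "readelf -n" as the gold version

def pad4(data):
    """Pad bytes with zeros up to the next 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)

def build_note(desc, name=b"GNU", note_type=NT_GNU_GOLD_VERSION):
    """Encode one little-endian ELF note entry.

    namesz counts the terminating NUL of the name; descsz is the
    unpadded descriptor length. Both fields are padded in the output.
    """
    name_z = name + b"\0"
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    return header + pad4(name_z) + pad4(desc)
```

Padding the descriptor as well as the name keeps the end of the entry
aligned, which is what readelf checks.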
(cherry picked from FBD6010153)
Summary:
Check the build-id of the input binary against the build-id of
the binary used during profiling data collection with perf, as reported
in perf.data. If they differ, issue a warning, since the user should use
exactly the same binary. If we cannot determine the build-id of either
the input binary or the one registered in the input perf.data, cancel the
build-id check but print a log message.
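The decision logic described above, as a small sketch. The function
name, the return values and the use of Python logging are assumptions
made for illustration; only the three outcomes come from the
description.

```python
import logging

log = logging.getLogger("build-id-check")

def check_build_id(binary_id, perf_id):
    """Compare two build-ids given as hex strings (or None if unknown).

    Returns "skipped", "mismatch" or "match".
    """
    if binary_id is None or perf_id is None:
        # Cannot determine one of the ids: cancel the check, but say so.
        log.info("build-id not available, skipping build-id check")
        return "skipped"
    if binary_id.lower() != perf_id.lower():
        log.warning("build-id of input binary differs from perf.data; "
                    "the profile should come from exactly the same binary")
        return "mismatch"
    return "match"
```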
(cherry picked from FBD6001917)
Summary: In some (weird) cases, a Function is marked 'split' but doesn't contain any 'cold' basic block. In that case, the size of the last basic block of the function is computed incorrectly. Hence, this fix.
(cherry picked from FBD6012963)
Summary:
Perf is now outputting one less space, which broke our previous
(flaky) assumptions about field separators when processing the output
file. Make it more resilient by accepting any number of spaces before
reading LBR entries.
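A sketch of the more resilient approach, assuming a simplified line
layout (a fixed number of leading fields followed by "from/to/..."
LBR entries). The exact perf output format and the helper name are
assumptions here; the point is to split on any run of whitespace rather
than on a fixed number of spaces.

```python
def parse_lbr_line(line, num_leading_fields=2):
    """Split a line into leading fields and (from, to) LBR pairs.

    str.split() without arguments splits on any run of whitespace, so
    the number of spaces between fields does not matter.
    """
    tokens = line.split()
    leading = tokens[:num_leading_fields]
    lbr = []
    for entry in tokens[num_leading_fields:]:
        src, dst = entry.split("/")[:2]
        lbr.append((int(src, 16), int(dst, 16)))
    return leading, lbr
```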
(cherry picked from FBD6014941)
Summary:
The presence of the ld-temp.o symbol is somewhat nondeterministic.
I couldn't find out exactly when it's generated; it could be
related to LTO vs ThinLTO, but not always.
If the symbol is there, it could affect the names of most
functions in an LTO binary. The status of the symbol
may change between the binary the profile was collected on
and the binary BOLT is called on. As a result, we may mismatch
many function names.
It is safe to ignore this symbol.
(cherry picked from FBD5908955)
Summary: It's possible that two basic blocks being considered for SCTC are in a loop in the CFG. In this case a block that is both a predecessor and a successor may have been processed and marked invalid by a previous iteration of the SCTC loop. We should skip rewriting in this case.
(cherry picked from FBD5886721)
Summary:
Move the data aggregator logic from our python script to
our C++ LLVM/BOLT libs. This dramatically reduces the processing
time for profiling data (from 45 minutes for HHVM to 5 minutes) because
we directly use BOLT as a disassembler in order to validate traces found
in the LBR and to add the fallthrough counts. Previously, the python
approach relied on parsing the output of objdump to check traces.
(cherry picked from FBD5761313)
Summary:
If a conditional branch has been converted to a conditional tail call,
it may be considered for SCTC optimization later since it will
appear as a tail call. We have to make sure that the tail call
we are considering is not a conditional branch.
(cherry picked from FBD5884777)
Summary:
A cold part of a function can start with a landing pad. As a
result, this landing pad will have offset 0 from the start
of the corresponding FDE, and it wouldn't get registered by
exception-handling runtime.
The solution is to use a different landing pad base address
(LPStart), such as (FDE_start - 1).
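To illustrate why the base address matters: landing pad offsets in the
call-site table are relative to LPStart, and an offset of 0 is read as
"no landing pad". A toy calculation (the helper name and the addresses
are made up):

```python
def landing_pad_offset(lp_address, fde_start, lp_start=None):
    """Offset of a landing pad as it would be stored in a call-site
    record, relative to LPStart. By default LPStart is the FDE start."""
    if lp_start is None:
        lp_start = fde_start
    return lp_address - lp_start

# A cold fragment that begins with a landing pad (example addresses).
fde_start = 0x2000
default_offset = landing_pad_offset(0x2000, fde_start)
shifted_offset = landing_pad_offset(0x2000, fde_start, lp_start=fde_start - 1)
```

With the default base the offset is 0 and the landing pad is lost; with
LPStart set to FDE_start - 1 the same landing pad gets offset 1.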
(cherry picked from FBD5876561)
Summary:
Fix two bugs. First, stack pointer tracking, the dataflow
analysis, was converging to the "superposition" state (meaning that at
this point there are multiple and conflicting states) too early in case
the entry state in the BB was "empty" AND there was an SP computation in
the block. In these cases, we need to propagate an "empty" value as well
and wait for an iteration where the input is not empty (only entry BBs
start with a non-empty well-defined value). Previously, it was
propagating "superposition", meaning there is a conflict of states in
this block, which is not true, since the input is empty and, therefore,
there is no preceding state to justify a collision of states.
Second, if SPT failed and has no idea about the stack values in a block
(if it is in the superposition state at a given point in a BB), shrink
wrapping should not attempt to insert computation into blocks where
we do not understand what is happening. Fix it to bail in those
cases.
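A toy model of the three kinds of state, to make the first fix
concrete. This is a simplified sketch with hypothetical names (SP
values are plain integers and a block is reduced to its net SP
adjustment); it is not the actual dataflow implementation.

```python
EMPTY = "empty"                  # no information has reached this point yet
SUPERPOSITION = "superposition"  # conflicting SP values reach this point

def meet(a, b):
    """Combine the states coming from two predecessors."""
    if a == EMPTY:
        return b
    if b == EMPTY:
        return a
    return a if a == b else SUPERPOSITION

def transfer(state_in, sp_delta):
    """Apply a block's net SP adjustment to its input state.

    An empty input stays empty even if the block computes on SP: there
    is no preceding state yet, so there is nothing to conflict with.
    """
    if state_in in (EMPTY, SUPERPOSITION):
        return state_in
    return state_in + sp_delta

def can_insert_spill_code(state):
    """Shrink wrapping only touches points with a known SP value."""
    return isinstance(state, int)
```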
(cherry picked from FBD5858402)
Summary:
Add support to read profiles collected without LBR. This
involves adapting our data aggregator perf2bolt and adding support
in llvm-bolt itself to read this data.
This patch also introduces different options to convert basic block
execution count to edge count, so BOLT can operate with its regular
algorithms to perform basic block layout. The most successful approach
is the default one.
(cherry picked from FBD5664735)
Summary:
No special handling is required for TLS relocation types,
and if we see them in the binary we can safely ignore those
types.
(cherry picked from FBD5853889)
Summary:
After SCTC optimization, fixDoubleJumps() was relying on CFG information
on the number of successors of a basic block. It ignored the fact that
a conditional tail call had a successor outside of the function and
deleted the containing basic block.
Discovered while testing old HHVM with disabled jump tables.
(cherry picked from FBD5752903)
Summary:
Exception tables for PIC may contain indirect type references
that are also encoded using relative addresses.
This diff adds support for such encodings. We read the PIC-style
type info table and write it using the new encoding.
(cherry picked from FBD5716060)
Summary:
Add an option to optimize PLT calls:
-plt - optimize PLT calls (requires linking with -znow)
=none - do not optimize PLT calls
=hot - optimize executed (hot) PLT calls
=all - optimize all PLT calls
When optimized, the calls are converted to use GOT reference
indirectly. GOT entries are guaranteed to contain a valid
function pointer if lazy binding is disabled - hence the
requirement for linker's -znow option.
Note: we could add an entry to .dynamic and drop the requirement
for -znow if we moved .dynamic to a new segment.
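The three modes amount to a simple per-call decision. A sketch, with a
hypothetical helper name and assuming that "executed" means a non-zero
execution count:

```python
def should_optimize_plt_call(mode, exec_count):
    """Decide whether a PLT call is converted, given the -plt mode."""
    if mode == "none":
        return False
    if mode == "hot":
        return exec_count > 0  # only calls that were executed
    if mode == "all":
        return True
    raise ValueError("unknown -plt mode: " + mode)
```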
(cherry picked from FBD5579789)
Summary:
We used to print dyno-stats after instruction lowering
which was skewing our metrics as tail calls were no longer
recognized as calls for one thing. The fix is to control
the point at which dyno-stats printing pass is run and run
it immediately before instruction lowering. In the future we
may decide to run the pass before some other intervening pass.
(cherry picked from FBD5605639)
Summary:
Fix issue in memcpy where one of its entry points was getting
no profiling data and was wrongly considered cold, being put in the cold
region.
(cherry picked from FBD5569156)
Summary:
SCTC was deleting an unconditional branch to a block in the
cold area because it was the next block in the layout vector. Fix the
condition to only delete such branches when source and target are in
the same allocation area (either both hot or both cold).
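The corrected condition, as a sketch over a toy layout (a list of block
names plus a hot/cold flag per block; the names are hypothetical):

```python
def can_delete_branch(layout, src_index, target, is_cold):
    """An unconditional branch to the next block in the layout is
    redundant only if source and target end up in the same allocation
    area, since hot and cold blocks are not emitted contiguously."""
    next_index = src_index + 1
    if next_index >= len(layout) or layout[next_index] != target:
        return False
    return is_cold[layout[src_index]] == is_cold[target]

layout = ["A", "B", "C"]
is_cold = {"A": False, "B": False, "C": True}
```

Here the branch from A to B can go, but the branch from B to C must
stay because C lives in the cold area.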
(cherry picked from FBD5570300)
Summary:
While converting code from __builtin_unreachable() we were asserting
that a basic block with a conditional jump and a single CFG successor
was the last one before converting the jump to an unconditional one.
However, if that code was executed after a conditional tail call
conversion in the same function, the original last basic block
will no longer be the last one in the post-conversion layout.
I'm disabling the assertion since it doesn't seem worth it to add
extra checks for the basic block that used to be the last one.
(cherry picked from FBD5570298)
Summary:
* Improve profile matching for LTO binaries that don't match 100%.
* Fix profile matching for '.LTHUNK*' functions.
* Add external outgoing branches (calls) for profile validation.
There's an improvement for 100% match profile and for stale LTO
profile. However, we are still not fully closing the gap with
stale profile when LTO is enabled.
(NOTE: I haven't updated all test cases yet)
(cherry picked from FBD5529293)
Summary:
Fix a bug while reading LSDA address in PIC format. The base address was
wrong for PC-relative value. There's more work involved in making PIC
code with C++ exceptions work.
(cherry picked from FBD5538755)
Summary:
Minor change. Reformat the def-in, live-out register strings so that Stoke can parse
them without preprocessing.
(cherry picked from FBD5537421)
Summary:
Function execution count is very important. When calculating the metric, we
should care more about functions which are known to be executed.
The correlation between this metric and CPU time is slightly improved
to be close to 96%, and the correlation between this metric and Cache Miss
remains the same at 96%.
Thanks to Sergey for the suggestion!
(cherry picked from FBD5494720)
Summary:
BOLT needs to be configured with the LLVM
AArch64 backend. If the backend is linked into the LLVM
library, start processing AArch64 binaries.
(cherry picked from FBD5489369)
Summary:
Create new .symtab and .strtab sections, so we can change their
sizes and not only patch them. Remove local symbols and add symbols to
identify the cold part of split functions.
(cherry picked from FBD5345460)
Summary:
The existing Jump-Distance Metric (previously named Call-Distance) ignores some traversals.
This modified version adds those missing traversals back.
The correlation remains the same: around 97% correlation with CPU and
Cache Miss (which implies that even though some traversals were ignored,
it didn't affect the correlation that much).
(cherry picked from FBD5369653)
Summary:
Make shrink-wrapping more stable. Changes:
* Correctly detect landing pads at the dominance frontier, bailing
on such cases because we are not prepared to split LPs that are target
of a critical edge.
* Disable FOP's store removal by default - this is experimental and
shouldn't go to prod because removing a store that is actually
necessary, but that we failed to detect as such, is disastrous. This
pass currently doesn't have a great impact on the number of stores
reduced, so it is not a problem. Most stores reduced are due to shrink
wrapping anyway.
* Fix stack access identification - correctly estimate memory length of
weird instructions, bail if we don't know.
* Make rules for shrink-wrapping more strict: cancel shrink wrapping on
a number of cases when we are not 100% sure that we are dealing with a
regular callee-saved register.
* Add basic block folding to SW. Sometimes when splitting critical edges
we create a lot of redundant BBs with the same instructions, same
successor but different predecessor. Fold all identical BBs created by
splitting critical edges.
* Change defaults: now the threshold used to determine when to perform
SW is more conservative, to be sure we are moving a spill to a colder
area. This effort, along with BB folding, helps us to avoid hurting
icache performance by indiscriminately increasing code size.
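
The basic block folding step from the list above can be pictured like
this. It is only a sketch on a toy representation (each block is a
list of instruction strings plus a single successor name), not the pass
itself:

```python
def fold_identical_blocks(blocks):
    """Map every block to a representative with the same instructions
    and the same successor.

    `blocks` maps a block name to (instructions, successor). The first
    block seen with a given (instructions, successor) key is kept;
    predecessors of later duplicates would be redirected to it.
    """
    representative = {}
    folded = {}
    for name, (instructions, successor) in blocks.items():
        key = (tuple(instructions), successor)
        folded[name] = representative.setdefault(key, name)
    return folded
```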
(cherry picked from FBD5315086)
Summary:
Designed a new metric, which shows 93.46% correlation with Cache Miss
and 86% correlation with CPU Time.
Definition:
One can get all the traversal paths for each function. For each traversal,
we define a distance. The distance represents how far apart two connected
basic blocks are. Therefore, for each traversal, I go through the
basic blocks one by one until the end of the traversal and sum up the
distances between neighboring basic blocks.
Distance between two connected basic blocks is the distance of the
centers of two blocks in the binary file.
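A small sketch of the definition, assuming each basic block is given as
a (start address, size) pair and a traversal is a list of such blocks.
The helper names are made up for illustration.

```python
def block_center(block):
    """Center of a block given as (start_address, size_in_bytes)."""
    start, size = block
    return start + size / 2

def traversal_distance(traversal):
    """Sum of center-to-center distances between neighboring blocks
    along one traversal."""
    total = 0
    for current, following in zip(traversal, traversal[1:]):
        total += abs(block_center(following) - block_center(current))
    return total
```

For the traversal [(0, 4), (10, 4), (4, 2)] the centers are 2, 12 and
5, so the distance is 10 + 7 = 17.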
(cherry picked from FBD5242526)
Summary:
Strobelight is getting confused by local symbols that we do not
update in relocation mode. These symbols were preserved by the linker in
relocation mode in order to support emitting relocations against local
labels, but they are unused.
Issue a quick fix to this by detecting such symbols and setting their
value to zero.
This patch also fixes an issue with the symbol table that was assigning
the wrong section index to symbols associated with the .text section.
(cherry picked from FBD5271277)
Summary:
Rewrote the guts of buildCallGraph. There are two new options to control how the CG is created. UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile. IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions.
I've changed the way BinaryFunction::estimateHotSize() works. If the function is marked as split, I count the size of all the non-cold blocks. This gives a different but more accurate answer than the old method.
I've improved and updated the CG build stats with extra information.
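Two of the changes above, sketched on toy data. The function names and
data shapes are hypothetical. The behaviour for functions that are not
split (counting every block) is an assumption, since only the split
case is described.

```python
def estimate_hot_size(blocks, is_split):
    """`blocks` is a list of (size_in_bytes, is_cold) pairs.

    For a split function only the non-cold blocks are counted.
    Assumption: a non-split function counts all of its blocks.
    """
    if is_split:
        return sum(size for size, is_cold in blocks if not is_cold)
    return sum(size for size, _ in blocks)

def build_call_arcs(calls, ignore_recursive_calls=True):
    """Accumulate (caller, callee) -> weight from (caller, callee, count)
    records, optionally leaving out direct recursive calls."""
    arcs = {}
    for caller, callee, count in calls:
        if ignore_recursive_calls and caller == callee:
            continue
        arcs[(caller, callee)] = arcs.get((caller, callee), 0) + count
    return arcs
```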
(cherry picked from FBD5224183)
Summary:
Some PUSH instructions may contain memory addresses pushed to
the stack. If this memory address is from an object in the stack, cancel
further frame analysis for this function since it may be escaping a
variable.
This fixes a bug with deleting used stores (in frameopt) in hhvm trunk.
(cherry picked from FBD5270590)
Summary:
SCTC is currently asserting (my fault :-) when running in
combination with hot jump table entries optimization. This optimization
sets the frequency for edges connecting basic blocks it creates and jump
table targets based on the execution count of the original BB containing
the indirect jump.
This is OK as an estimation, but it breaks our assumption that the sum of
the frequencies of pred edges equals our BB frequency. This happens
because the frequency of the BB is rarely equal to its outgoing edges
frequency.
SCTC, in turn, was updating the execution count for BBs with tail calls
by subtracting the frequency count of predecessor edges. Because hot
jump table entries optimization broke the BB exec count = sum(preds freq)
invariant, SCTC was asserting.
To trigger this, the input program must have a jump table where each
entry contains a tail call. This happens in the HHVM binary for func
_ZN4HPHP11collections5issetEPNS_10ObjectDataEPKNS_10TypedValueE.
(cherry picked from FBD5222504)
Summary:
Add a new positional option to BOLT: "-print-function-statistics=<uint64>"
which prints information about block ordering for the requested number of functions.
(cherry picked from FBD5105323)