Summary:
This is preparation work for static data reordering.
I've created a new class called BinaryData which represents a symbol
contained in a section. It records almost all the information relevant
for dealing with data, e.g. names, address, size, alignment, profiling
data, etc. BinaryContext still stores and manages BinaryData objects
similar to how it managed symbols and global addresses before. The
interfaces are not changed too drastically from before either. There is
a bit of overlap between BinaryData and BinaryFunction. I would have
liked to do some more refactoring to make a BinaryFunctionFragment that
subclassed from BinaryData and then have BinaryFunction be composed or
associated with BinaryFunctionFragments.
I've also attempted to use (symbol + offset) for when addresses are
pointing into the middle of symbols with known sizes. This changes the
simplify rodata loads optimization slightly since the expression on an
instruction can now also be a (symbol + offset) rather than just a symbol.
One of the overall goals for this refactoring is to make sure every
relocation is associated with a BinaryData object. This requires adding
"hole" BinaryData's wherever there are gaps in a section's address space.
Most of the holes seem to be data that has no associated symbol info. In
this case we can't do any better than lumping all the adjacent hole
symbols into one big symbol (there may be more than one actual data
object that contributes to a hole). At least the combined holes should
be moveable.
Jump tables have similar issues. They appear to mostly be sub-objects
for top level local symbols. The main problem is that we can't recognize
jump tables at the time we scan the symbol table, we have to wait til
disassembly. When a jump table is discovered we add it as a sub-object
to the existing local symbol. If there are one or more existing
BinaryData's that appear in the address range of a newly created jump
table, those are added as sub-objects as well.
(cherry picked from FBD6362544)
Summary:
Fix a bug introduced by rebasing with respect to aligned ULEBs.
This wasn't breaking anything but it is good to keep LDSA aligned.
(cherry picked from FBD7094742)
Summary:
This commit includes all code necessary to make BOLT working again
after the rebase. This includes a redesign of the EHFrame work,
cherry-pick of the 3dnow disassembly work, compilation error fixes,
and port of the debug_info work. The macroop fusion feature is not
ported yet.
The rebased version has minor changes to the "executed instructions"
dynostats counter because REP prefixes are considered a part of the
instruction it applies to. Also, some X86 instructions had the "mayLoad"
tablegen property removed, which BOLT uses to identify and account
for loads, thus reducing the total number of loads reported by
dynostats. This was observed in X86::MOVDQUmr. TRAP instructions are
not terminators anymore, changing our CFG. This commit adds compensation
to preserve this old behavior and minimize tests changes. debug_info
sections are now slightly larger. The discriminator field in the line
table is slightly different due to a change upstream. New profiles
generated with the other bolt are incompatible with this version
because of different hash values calculated for functions, so they will
be considered 100% stale. This commit changes the corresponding test
to XFAIL so it can be updated. The hash function changes because it
relies on raw opcode values, which change according to the opcodes
described in the X86 tablegen files. When processing HHVM, bolt was
observed to be using about 800MB more memory in the rebased version
and being about 5% slower.
(cherry picked from FBD7078072)
Summary:
Do not assign a LP to tail calls. They are not calls in the
view of an unwinder, they are just regular branches. We were hitting an
assertion in BinaryFunction::removeConditionalTailCalls() complaining
about landing pads in a CTC, however it was in fact a
builtin_unreachable being conservatively treated as a CTC.
(cherry picked from FBD6564957)
Summary:
Here's an implementation of an abstract instruction iterator for the branch/call
analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think.
It's a bit sloppy with constness and probably needs more tidying up.
(cherry picked from FBD6244012)
Summary:
Move the indirect branch analysis code from BinaryFunction to MCInstrAnalysis/X86MCTargetDesc.cpp.
In the process of doing this, I've added an MCRegInfo to MCInstrAnalysis which allowed me to remove a bunch of extra method parameters. I've also had to refactor how BinaryFunction held on to instructions/offsets so that it would be easy to pass a sequence of instructions to the analysis code (rather than a map keyed by offset).
Note: I think there are a bunch of MCInstrAnalysis methods that have a BitVector output parameter that could be changed to a return value since the size of the vector is based on the number of registers, i.e. from MCRegisterInfo. I haven't done this in order to keep the diff a more manageable size.
(cherry picked from FBD6213556)
Summary:
A cold part of a function can start with a landing pad. As a
result, this landing pad will have offset 0 from the start
of the corresponding FDE, and it wouldn't get registered by
exception-handling runtime.
The solution is to use a different landing pad base address
(LPStart), such as (FDE_start - 1).
(cherry picked from FBD5876561)
Summary:
Exceptions tables for PIC may contain indirect type references
that are also encoded using relative addresses.
This diff adds support for such encodings. We read PIC-style
type info table, and write it using new encoding.
(cherry picked from FBD5716060)
Summary:
Add an option to optimize PLT calls:
-plt - optimize PLT calls (requires linking with -znow)
=none - do not optimize PLT calls
=hot - optimize executed (hot) PLT calls
=all - optimize all PLT calls
When optimized, the calls are converted to use GOT reference
indirectly. GOT entries are guaranteed to contain a valid
function pointer if lazy binding is disabled - hence the
requirement for linker's -znow option.
Note: we can add an entry to .dynamic and drop a requirement
for -znow if we were moving .dynamic to a new segment.
(cherry picked from FBD5579789)
Summary:
Fix a bug while reading LSDA address in PIC format. The base address was
wrong for PC-relative value. There's more work involved in making PIC
code with C++ exceptions work.
(cherry picked from FBD5538755)
Summary:
Each BOLT-specific option now belongs to BoltCategory or BoltOptCategory.
Use alphabetical order for options in source code (does not affect
output).
The result is a cleaner output of "llvm-bolt -help" which does not
include any unrelated llvm options and is close to the following:
.....
BOLT generic options:
-data=<string> - <data file>
-dyno-stats - print execution info based on profile
-hot-text - hot text symbols support (relocation mode)
-o=<string> - <output file>
-relocs - relocation mode - use relocations to move functions in the binary
-update-debug-sections - update DWARF debug sections of the executable
-use-gnu-stack - use GNU_STACK program header for new segment (workaround for issues with strip/objcopy)
-use-old-text - re-use space in old .text if possible (relocation mode)
-v=<uint> - set verbosity level for diagnostic output
BOLT optimization options:
-align-blocks - try to align BBs inserting nops
-align-functions=<uint> - align functions at a given value (relocation mode)
-align-functions-max-bytes=<uint> - maximum number of bytes to use to align functions
-boost-macroops - try to boost macro-op fusions by avoiding the cache-line boundary
-eliminate-unreachable - eliminate unreachable code
-frame-opt - optimize stack frame accesses
......
(cherry picked from FBD4793684)
Summary:
The new interface for handling Call Frame Information:
* CFI state at any point in a function (in CFG state) is defined by
CFI state at basic block entry and CFI instructions inside the
block. The state is independent of basic blocks layout order
(this is implied by CFG state but wasn't always true in the past).
* Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst *Inst) to
get CFI state at any given instruction in the program.
* No need to call fixCFIState() after any given pass. fixCFIState()
is called only once during function finalization, and any function
transformations after that point are prohibited.
* When introducing new basic blocks, make sure CFI state at entry
is set correctly and matches CFI instructions in the basic block
(if any).
* When splitting basic blocks, use getCFIStateAtInstr() to get
a state at the split point, and set the new basic block's CFI
state to this value.
Introduce CFG_Finalized state to indicate that no further optimizations
are allowed on the function. This state is reached after we have synced
CFI instructions and updated EH info.
Rename "-print-after-fixup" option to "-print-finalized".
This diffs fixes CFI for cases when we split conditional tail calls,
and for indirect call promotion optimization.
(cherry picked from FBD4629307)
Summary:
The CFI instructions parser in libDebugInfo was relying on
undefined behavior to parse operands by assuming the order function
parameters are evaluated in a function call site is defined (it is
not). This patch fix this and makes our clang and gcc tests agree.
It also fixes wrong LIT tests in our codebase with respect to the
order of DW_CFA_def_cfa operands.
(cherry picked from FBD4255227)
Summary:
In order to improve gdb experience with BOLT we have to make
sure the output file has a single .eh_frame section. Otherwise
gdb will use either old or new section for unwinding purposes.
This diff relocates the original .eh_frame section next to
the new one generated by LLVM. Later we merge two sections
into one and make sure only the newly created section has
.eh_frame name.
(cherry picked from FBD4203943)
Summary:
We used to patch an existing .eh_frame_hdr and append contents
for split functions at the end. However, this approach does not
work in relocation mode since function addresses change and split
functions will not necessarily be at the end.
Instead of patching and appending we generate the new .eh_frame_hdr
based on contents of old and new .eh_frame sections.
(cherry picked from FBD4180756)
Summary:
In a prev diff I disabled inclusion of FDEs for cold fragments that
we fail to write. The side effect of it was that we failed to
write FDE for the next function with a cold fragment since it
had the same assigned address that we had put in FailedAddresses.
The correct fix is to assign zero address to failed cold fragments
and ignore them when we write .eh_frame_hdr.
(cherry picked from FBD4156740)
Summary:
Modify the MC layer (MCDwarf.h|cpp) to understand CFI
instructions dealing with DWARF expressions. Add code to emit DWARF
expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions
to streamer instead of bailing on them. Change -dump-eh-frame option in
llvm-bolt to dump the EH frame of the rewritten binary in addition to
the one in the original binary, allowing us to proper test this patch.
(cherry picked from FBD4194452)
Summary:
Modified function discovery process to tolerate more functions and
symbols coming from assembly. The processing order now matches
the memory order of the functions (input symbol table is unsorted).
Added basic support for functions with multiple entries. When
a function references its internal address other than with
a branch instruction, that address could potentially escape.
We mark such addresses as entry points and make sure they
are treated as roots by unreachable code elimination.
Without relocations we have to mark multiple-entry functions
as non-simple.
(cherry picked from FBD3950243)
Summary:
Add "-dyno-stats" option that prints instruction stats based on
the execution profile similar to below:
BOLT-INFO: program-wide dynostats after optimizations:
executed forward branches : 109706407 (+8.1%)
taken forward branches : 13769074 (-55.5%)
executed backward branches : 24517582 (-25.0%)
taken backward branches : 15330256 (-27.2%)
executed unconditional branches : 6009826 (-35.5%)
function calls : 17192114 (+0.0%)
executed instructions : 837733057 (-0.4%)
total branches : 140233815 (-2.3%)
taken branches : 35109156 (-42.8%)
Also fixed pseudo instruction discrepancies and added assertions
for BinaryBasicBlock::getNumPseudos() to make sure the number is
synchronized with real number of pseudo instructions.
(cherry picked from FBD3826995)
Summary:
This will make it easier to run experiments with the same baseline
BOLT binary but different command line options.
(cherry picked from FBD3831978)
Summary:
I've added a verbosity level to help keep the BOLT spewage to a minimum.
The default level is pretty terse now, level 1 is closer to the original,
I've saved level 2 for the noisiest of messages. Error messages should
never be suppressed by the verbosity level only warnings and info messages.
The rational behind stream usage is as follows:
outs() for info and debugging controlled by command line flags.
errs() for errors and warnings.
dbgs() for output within DEBUG().
With the exception of a few of the level 2 messages I don't have any strong feelings about the others.
(cherry picked from FBD3814259)
Summary:
Eliminated BinaryFunction::getName(). The function was confusing since
the name is ambigous. Instead we have BinaryFunction::getPrintName()
used for printing and whenever unique string identifier is needed
one can use getSymbol()->getName(). In the next diff I'll have
a map from MCSymbol to BinaryFunction in BinaryContext to facilitate
function lookup from instruction operand expressions.
There's one bug fixed where the function was called only under assert()
in ICF::foldFunction().
For output we update all symbols associated with the function. At the
moment it has no effect on the generated binary but in the future we
would like to have all symbols in the symbol table updated.
(cherry picked from FBD3704790)
Summary:
Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize.
Annotation is used for associating random data with MCInsts. Clients can
construct their own annotation types (subclassed from MCAnnotation) and
associate them with instructions. Annotations are looked up by string keys.
Annotations can be added, removed and queried using an instance of the
MCInstrAnalysis class.
The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception
handling information for call instructions.
GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute.
(cherry picked from FBD3597877)
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.
It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.
This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.
(cherry picked from FBD3201322)
Summary:
Skip DW_CFA_expression and DW_CFA_val_expression instructions
properly, according to DWARF spec.
If CFI range does not match function range skip that function.
(cherry picked from FBD3040502)
Summary:
If we see an unknown CFI instruction, skip processing the function
containing it instead of aborting execution.
(cherry picked from FBD2964557)
Summary:
Added an option to reuse existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.
Also, all new sections are now properly marked in ELF.
(cherry picked from FBD2943339)
Summary:
We could split functions with exceptions even without creating
a new exception handling table. This limits us to only move
basic blocks that never throw, and are not a start of a
landing pad.
(cherry picked from FBD2862937)
Summary:
Fixes some issues discovered after hhvm switched to gcc 4.9.
Add support for DW_CFA_GNU_args_size instruction.
Allow CFI instruction after the last instruction in a function.
Reverse conditions of assert for DW_CFA_set_loc.
(cherry picked from FBD28110096)
Summary:
Our CFI parser in the LLVM library was giving up on parsing all CFI
instructions when finding a single instruction with expression operands. Yet,
all gcc-4.9 binaries seem to have at least one CFI instruction with expression
operands (DW_CFA_def_cfa_expression). This patch fixes this and makes DebugInfo
continue to parse other instructions, even though it does not completely parse
DWARF expressions yet. However, this seems to be enough to allow llvm-flo to
process gcc-4.9 binaries because the FDEs with DWARF expressions are linked to
the PLT region, and not to functions that we process.
If we ever try to read a function whose CFI depends on DWARF expression, which
is unlikely, llvm-flo will assert.
(cherry picked from FBD2693088)
Summary:
After basic block reordering, it may be possible that the reordered
function is now larger than the original because of the following reasons:
- jump offsets may change, forcing some jump instructions to use 4-byte
immediate operand instead of the 1-byte, shorter version.
- fall-throughs change, forcing us to emit an extra jump instruction to jump
to the original fall-through at the end of a basic block.
Since we currently do not change function addresses, we need to rewrite the
function back in the binary in the original location. If it doesn't fit, we were
dropping the function.
This patch adds a flag -split-functions that tells llvm-flo to split hot
functions into hot and cold separate regions. The hot region is written back
in the original function location, while the cold region is written in a
separate, far-away region reserved to flo via a linker script.
This patch also adds the logic to create and extra FDE to supply unwinding
information to the cold part of the function. Owing to this, we now need to
rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment
to point to this new .eh_frame_hdr.
(cherry picked from FBD2677996)
Summary:
This patch adds logic to detect when the binary has extra space
reserved for us via the __flo_storage symbol. If this symbol is present,
it means we have extra space in the binary to write extraneous information.
When we write a new .eh_frame, we cannot discard the old .eh_frame because
it may still contain relevant information for functions we do not reorder.
Thus, we write the new .eh_frame into __flo_storage and patch the current
.eh_frame_hdr to point to the new .eh_frame only for the functions we touched,
generating a binary that works with a bi-.eh_frame model.
(cherry picked from FBD2639326)
Summary:
This patch introduces logic to check how the CFI instructions define a
table to help during stack unwinding at exception run time and attempts to fix
any problem in this table that may have been introduced by reordering the basic
blocks. If it fails to fix this problem, the function is marked as not simple
and not eligible for rewriting.
(cherry picked from FBD2633696)
Summary:
Regenerate exception handling information after optimizations.
Use '-print-eh-ranges' to see CFG with updated ranges.
(cherry picked from FBD2660982)
Summary:
There were two issues: we were trying to process non-simple functions,
i.e. function that we don't fully understand, and then we failed to stop
iterating if EH closing label was after the last instruction in a
function.
(cherry picked from FBD2664460)
Summary:
Read .gcc_except_table and add information to CFG. Calls have extra operands
indicating there's a possible handler for exceptions and an action. Landing
pad information is recorded in BinaryFunction.
Also convert JMP instructions that are calls into tail calls pseudo
instructions so that they don't miss call instruction analysis.
(cherry picked from FBD2652775)
Summary:
In order to represent CFI information in our BinaryFunction class, this
patch adds a map of Offsets to CFI instructions. In this way, we make it easy to
check exactly where DWARF CFI information is annotated in the disassembled
function.
(cherry picked from FBD2619216)
Summary:
We need to parse the whole contents of .gcc_except_table even if we are
not printing exceptions. Otherwise we are missing type index table and
miscalculate the size of the current table.
(cherry picked from FBD2632965)
Summary: In order to reorder binaries with C++ exceptions, we first need to
read DWARF CFI (call frame info) from binaries in a table in the .eh_frame
ELF section. This table contains unwinding information we need to be aware of
when reordering basic blocks, so as to avoid corrupting it. This patch also
cleans up some code from Exceptions.cpp due to a refactoring where we moved
some functions to the LLVM's libSupport.
(cherry picked from FBD2614464)
Summary:
Print actions for exception ranges from .gcc_except_table.
Types are printed as names if the name is available from symbol table.
(cherry picked from FBD2612631)