Summary:
Added an optimization pass of inlining calls to small functions (with only one
basic block). Inlining is done in a very simple way, inserting instructions to
simulate the changes to the stack pointer that call/ret would make before/after the
inlined function executes. Also, the heuristic prefers to inline calls that happen
in the hottest blocks (by looking at their execution count). Calls in cold blocks are
ignored.
(cherry picked from FBD3233516)
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).
This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.
(cherry picked from FBD3211138)
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).
This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.
(cherry picked from FBD3211138)
Summary:
Fix the error message by not printing it :)
Explanation: a previous diff accidentally removed this error message from within
the DEBUG macro, and it's expected that we'll have a bunch of them since a lot
of the DIEs we try to update are empty or meaningless. For instance (and mainly), there
is a huge number of lexical block DIEs with no attributes in .debug_info.
In the first phase of collecting debugging info, we store the offsets of all
these DIEs, only later to realize that we cannot update their address
ranges because they have none.
A better fix would be to check this earlier and not store offsets of DIEs
we cannot update to begin with.
(cherry picked from FBD3236923)
Summary:
In a test binary some functions are placed in a segment
preceding the segment containing .text section. As a result,
we were miscalculating maximum function size as the calculation
was based on addresses only.
This diff fixes the calculation by checking if symbol after function
belongs to the same section. If it does not, then we set the maximum
function size based on the size of the containing section and not
on the address distance to the next symbol.
(cherry picked from FBD3229205)
Summary:
Added option "-break-funcs=func1,func2,...." to coredump in any
given function by introducing ud2 sequence at the beginning of the
function. Useful for debugging and validating stack traces.
Also renamed options containing "_" to use "-" instead.
Also run hhvm test with "-update-debug-sections".
(cherry picked from FBD3210248)
Summary:
Splitting option now has different meanings/values. Since landing pads
are mostly always cold/frozen, we should split them before anything
else (we still check the execution count is 0). That's value '1'.
Everything else goes on top of that and has increased value (2 - large
functions, 3 - everything).
Sorting was non-deterministic and somewhat broken for functions
with EH ranges. Fixed that and added '-split-all-cold' option to
outline all 0-count blocks.
Fixed compilation of test cases. After my last commit the binaries
were linked to wrong source files (i.e. debug info). Had to rebuild
the binaries from updated sources.
(cherry picked from FBD3209369)
Summary:
GNU_args_size is a special kind of CFI that tells runtime to adjust
%rsp when control is passed to a landing pad. It is used for annotating
call instructions that pass (extra) parameters on the stack and there's
a corresponding landing pad.
It is also special in a way that its value is not handled by
DW_CFA_remember_state/DW_CFA_restore_state instruction sequence
that we utilize to restore the state after block re-ordering.
This diff adds association of call instructions with GNU_args_size value
when it's used. If the function does not use GNU_args_size, there is
no overhead. Otherwise, we regenerate GNU_args_size instruction during
code emission, i.e. after all optimizations and block-reordering.
(cherry picked from FBD3201322)
Summary:
Simple functions which we fail to rewrite after optimizations were
having wrong debugging information because the latter would reflect the optimized
version of the function.
There are only 48 functions (at this time) in this situation in the HHVM binary.
The simple fix is to add another full pass. Another more complicated path, which will
be more efficient, is to reset only the BinaryContext and emit again, but then we need
to recreate all symbols in the new MCContext and update the pointers. I started
taking this path but it started getting too complicated for only those 48 functions
(needed to create a new map of global symbols, recreate landing pads - which needed
to have the internal intermediate labels in the functions kept to be updated too, etc).
Because the overhead is quite large (another full emission pass - around 4m30s here)
and the impact is small I put this behind a new
command-line flag which is off by default: -fix-debuginfo-large-functions.
(cherry picked from FBD3166576)
Summary:
Update address ranges of inlined functions and try/catch blocks.
This was missing and lead gdb to show weird information in a core dump we inspected
because of the several nestings of inline in the call stack.
This is very similar to Lexical Blocks, so the change is to basically generalize that
code to do the same for DW_AT_try_block, DW_AT_catch_block and DW_AT_inlined_subroutine.
(cherry picked from FBD3169417)
Summary:
readelf was showing some errors because we weren't updating DIEs that were not shallow
in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with
address 0, which could have been removed by the Linker Script but still have debugging information
there). These DIEs need to be updated because their abbreviations are patched.
(cherry picked from FBD3159335)
Summary:
We were updating only one DIE per function, but because the Linker Script may map
multiple functions to the same address this would cause us to generate invalid debug info
(as some DIEs weren't updated but their abbreviations were changed).
(cherry picked from FBD3157263)
Summary:
Non-simple functions aren't emitted, and thus didn't have line number information
emitted. This diff emits it for those functions by extending LLVM's generation of the line number program to allow for absolute addresses (it is wholly symbolic), then iterating over the relevant line tables from the input and appending entries with absolute addresses to the line tables to be emited.
This still leaves the simple but not overwritten functions unhandled (there were 48 in HHVM in
my last run). However, I think that to fix them we'd need another pass, since by the time we
realize a simple function wont't fit, debug line info was already written to the output.
(cherry picked from FBD3148468)
Summary:
Summary: Update DWARF location lists in .debug_loc and pointers to
them in .debug_info so that gdb can print variables which change
location during their lifetime.
The following changes were made:
- Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
- Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
- Added representation for location lists (LocationList.h, BinaryContext.h)
- Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
- After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
- After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)
(cherry picked from FBD3130682)
Summary:
Add a parameter value to "-split-functions=" option to allow splitting
only when the function is too large to fit:
0 - never split
1 - split if too large to fit
2 - always split
We may use this option when the profile data is not very precise.
In that case excessive splitting may increase iTLB misses.
(cherry picked from FBD3137700)
Summary:
This fixes a problem in which bolt was generating a malformed .debug_info
section on the bzip2 binary. The bug was the following:
- A simple and a non-simple function shared an abbreviation
- The abbreviation was patched to contain DW_AT_ranges because of the simple function
- The non-simple function's data was not updated, but then it didn't match the
layout expected by the abbreviation anymore
And because we were already creating an address ranges list in .debug_ranges even
for non-simple functions, it doesn't make sense not to use it anyway.
(cherry picked from FBD3129219)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assined an aoffset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
(cherry picked from FBD3113181)
Summary:
Before this diff LLVM used to iterate over all sections to find the
one with an address we want to remap. Since we have extremely
large number of section this process is highly inefficient.
Instead we add a new interface to remap a section with a given ID
(which effectively is an index into an array of sections), and
pass the ID instead of the address.
This cuts down the processing time of hhvm binary by 10 seconds,
and brings the total processing time to a little under 2 minutes.
(cherry picked from FBD3110015)
Summary:
[WIP] Update DWARF info for function address ranges.
This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarf and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.
I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support
lexical bocks and function bodies that occupy non-contiguous address ranges). I have reasons to believe gdb at least at some
nges).
The set of introduced changes was basically this:
- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).
I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.
By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.
(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)
The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.
As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.
(cherry picked from FBD3073641)
Summary:
We used to output .debug_line information for every instruction, but because of the way
gdb (and probably lldb as of llvm::DWARFDebugLine::LineTable::findAddress) queries the
line table it's not necessary to output information for two instructions if they follow
each other and map to the same source line. By not repeating this information we generate
a bit less .debug_line data.
(cherry picked from FBD3056402)
Summary:
The line number information generated from a null pointer
was actually valid, which caused new instructions without the line number
information set to have a valid and wrong line number reference. This diff
fixes this by making the null pointer be assigned to an invalid line number
row.
(cherry picked from FBD3048453)
Summary:
Write the .debug_aranges section after optimizations to the output binary.
Each function generates at least one range and at most two (one extra for its cold part).
The writing is done manually because LLVM's implementation is tied to the output of
.debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp),
which we don't want to trigger right now.
(cherry picked from FBD3043108)
Summary:
At the moment we rely solely on the symbol table information to discover
function boundaries. However, similar information is contained in
.eh_frame. Verify that the information from these two sources is
consistent, and if it's not, then skip processing the functions with
conflicting information.
(cherry picked from FBD3043800)
Summary:
After we add new line number information we have to update stmt_list
offsets in .debug_info. For this I had to add a primitive relocations
support for non-allocatable sections we are copying from input file.
Also enabled functionality to process relocations in non-allocatable
sections that LLVM is generating, such as .debug_line. I thought
we already had it, but apparently it didn't work, at least not
for ELF binaries.
(cherry picked from FBD3037903)
Summary:
Writes .debug_line section by setting the state
in MCContext that LLVM needs to produce and output the
line tables. This basically consists of setting the
current location and compile unit offset. This makes LLVM
output .debug_line in the temporary file, but not yet in
the generated ELF file.
Also computes the line table offsets for each compile unit
and saves them into BinaryContext. Added an option to
print these offsets.
(cherry picked from FBD3004554)
Summary:
The is a set of changes that allow modification of non-allocatable
sections in ELF binary. Primarily for the purpose of updating debug
info.
Extend LLVM interface to allow processing relocations in non-allocatable
sections. This allows to produce .debug* sections with resolved
relocations against generated code.
Extend BOLT rewriting framework to allow appending contents to
non-allocatable sections in the binary.
Re-worked ELF binary rewriting to support the above and to allow future
extensions (e.g. new section names).
(cherry picked from FBD3023403)
Summary:
Reads information in the DWARF .debug_line section using LLVM and
tie every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.
(cherry picked from FBD2989813)
Summary:
Force the splitting of the function into hot/cold even when
the function fits into original slot.
This reduces BOLT optimization time by 50% without affecting
hhvm performance.
(cherry picked from FBD2973773)
Summary:
If we see an unknown CFI instruction, skip processing the function
containing it instead of aborting execution.
(cherry picked from FBD2964557)
Summary:
Added an option to reuse existing program header entry.
This option allows for bfd tools like strip and objcopy
to operate on the optimized binary without destroying it.
Also, all new sections are now properly marked in ELF.
(cherry picked from FBD2943339)
Summary:
We used to require pre-allocated space in the input binary so that
we can write extra sections in there (.eh_frame, .eh_frame_hdr,
.gcc_except_table, etc.). With this diff there's no further
need for pre-allocated storage as we create a new segment and
can use as much space as needed.
There are certain limitations on where the new segment could
be allocated, and as a result the size of the file may increase.
There's currently a limitation if the binary size is close to 4GB
we cannot allocate new segment prior to that and as a result
we require debug info to be stripped to reduce the file size.
The fix is in progress.
(cherry picked from FBD2916029)
Summary:
We use intermediate .o file for debugging purposes, but there's no
reason to generate it by default. Only do it if "-keep-tmp" is
specified.
(cherry picked from FBD2912098)
Summary:
Preserve original layout for basic blocks that have 0 execution
count. Since we don't optimize for size, it's better to rely on
the original input order.
(cherry picked from FBD2875335)
Summary:
We should never outline the first basic block.
Also add an option to accept a file with the list of
functions to optimize.
(cherry picked from FBD2868184)
Summary:
We could split functions with exceptions even without creating
a new exception handling table. This limits us to only move
basic blocks that never throw, and are not a start of a
landing pad.
(cherry picked from FBD2862937)
Summary:
Some basic blocks were created empty because they only contained
alignment nop's. Ignore such nop's before basic block gets created.
Fixed intermittent aborts related to CFI update.
(cherry picked from FBD2844465)
Summary:
* Update CFI state for larger range of functions to increase coverage.
* Issue more warnings indicating reasons for skipping functions.
* Print top called functions in the binary.
(cherry picked from FBD2839734)
Summary:
Modified processing of "-reorder-blocks=" option and added an option
to reverse original basic blocks order for testing purposes.
(cherry picked from FBD2829862)
Summary:
In a test binary, we found 8 cases where code in a function A would jump to the
middle of another function B. In this case, we cannot reorder function B because
this would change instruction offsets and break the program. This is pretty rare
but can happen in code written in assembly.
(cherry picked from FBD2719850)
Summary:
We found out that the insertion of extra nops to preserve alignment of
some loop bodies do not pay off the increased function size, since this extra
size may inhibit us from rewriting a reordered version of this function.
(cherry picked from FBD2718466)