2021-12-21 10:21:41 -08:00
//===- bolt/Passes/SplitFunctions.cpp - Pass for splitting function code --===//
[BOLT][non-reloc] Change function splitting in non-relocation mode
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.
For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.
After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we had emitted
debug info as if they were successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way 2nd and 3rd
attempts interacted, and we were not splitting the functions correctly
and as a result we were emitting less optimized code.
One of the reasons we had the multi-pass rewrite scheme in place was
that we did not have the ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.
Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.
As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.
New semantics for `-split-functions=<n>`:
  -split-functions - split functions into hot and cold regions
    =0 - do not split any function
    =1 - in non-relocation mode only split functions too large to fit
         into original code space
    =2 - same as 1 (backwards compatibility)
    =3 - split all functions
(cherry picked from FBD17362607)
2019-09-11 15:42:22 -07:00
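The `-split-functions=<n>` semantics described in the summary can be sketched as a small decision function for non-relocation mode. This is a minimal illustration rather than BOLT's implementation: `shouldSplitInNonRelocMode`, `EstimatedSize`, and `MaxSize` are hypothetical names, and the size estimate is assumed to come from the BinaryContext estimation facility the summary mentions.

```cpp
#include <cstdint>

// Hypothetical helper (not part of BOLT): decides whether a function should be
// split in non-relocation mode, following the -split-functions=<n> semantics
// listed in the commit summary.
//   Mode          - the value <n> passed to -split-functions
//   EstimatedSize - estimated size of the optimized, non-split function
//   MaxSize       - bytes available at the function's original location
bool shouldSplitInNonRelocMode(unsigned Mode, uint64_t EstimatedSize,
                               uint64_t MaxSize) {
  switch (Mode) {
  case 0:
    return false; // do not split any function
  case 1:
  case 2: // same as 1, kept for backwards compatibility
    return EstimatedSize > MaxSize; // split only if the function does not fit
  case 3:
    return true; // split all functions
  default:
    return false; // values outside 0..3 are not described; assume no split
  }
}
```

Because the estimate is taken before later post-CFG passes run, a real implementation would still need a fallback for the rare function that does not fit even after splitting.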
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file implements the SplitFunctions pass.
//
//===----------------------------------------------------------------------===//

#include "bolt/Passes/SplitFunctions.h"
#include "bolt/Core/BinaryBasicBlock.h"
#include "bolt/Core/BinaryFunction.h"
#include "bolt/Core/FunctionLayout.h"
#include "bolt/Core/ParallelUtilities.h"
#include "bolt/Utils/CommandLineOpts.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/Sequence.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/iterator_range.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/FormatVariadic.h"
#include <algorithm>
#include <iterator>
#include <memory>
#include <numeric>
#include <random>
#include <vector>

#define DEBUG_TYPE "bolt-opts"

using namespace llvm;
using namespace bolt;

namespace {
// Parser for -split-functions that still accepts the legacy numeric values
// "2" and "3". Both are treated as `true` and trigger a deprecation warning.
class DeprecatedSplitFunctionOptionParser : public cl::parser<bool> {
public:
  explicit DeprecatedSplitFunctionOptionParser(cl::Option &O)
      : cl::parser<bool>(O) {}

  bool parse(cl::Option &O, StringRef ArgName, StringRef Arg, bool &Value) {
    if (Arg == "2" || Arg == "3") {
      Value = true;
      errs() << formatv("BOLT-WARNING: specifying non-boolean value \"{0}\" "
                        "for option -{1} is deprecated\n",
                        Arg, ArgName);
      return false; // false means the value was parsed successfully
    }
    return cl::parser<bool>::parse(O, ArgName, Arg, Value);
  }
};
} // namespace
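
To make the parser's behavior concrete, here is a standalone sketch of the same accept-and-warn logic with no LLVM dependency. `parseSplitFunctionsArg` and its `Warning` out-parameter are hypothetical names used only for illustration. The sketch assumes the `llvm::cl` convention that `parse` returns `false` on success and `true` on error, and it approximates `cl::parser<bool>` with the boolean spellings that parser accepts.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for the deprecated-value parser; not LLVM or BOLT API.
// Returns false on success and true on error, mirroring the llvm::cl convention.
bool parseSplitFunctionsArg(std::string_view Arg, bool &Value,
                            std::string &Warning) {
  Warning.clear();
  // Legacy numeric levels are still accepted, mapped to true with a warning.
  if (Arg == "2" || Arg == "3") {
    Value = true;
    Warning = "BOLT-WARNING: specifying non-boolean value \"" +
              std::string(Arg) +
              "\" for option -split-functions is deprecated\n";
    return false;
  }
  // Approximation of the plain boolean parser.
  if (Arg.empty() || Arg == "true" || Arg == "TRUE" || Arg == "True" ||
      Arg == "1") {
    Value = true;
    return false;
  }
  if (Arg == "false" || Arg == "FALSE" || Arg == "False" || Arg == "0") {
    Value = false;
    return false;
  }
  return true; // unrecognized value: report an error
}
```

With this sketch, `"3"` yields `Value == true` plus a non-empty warning, `"0"` yields `Value == false` with no warning, and `"4"` is rejected.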

namespace opts {

extern cl::OptionCategory BoltOptCategory;

extern cl::opt<bool> SplitEH;
extern cl::opt<unsigned> ExecutionCountThreshold;
extern cl::opt<uint32_t> RandomSeed;

static cl::opt<bool> AggressiveSplitting(
    "split-all-cold", cl::desc("outline as many cold basic blocks as possible"),
    cl::cat(BoltOptCategory));

static cl::opt<unsigned> SplitAlignThreshold(
    "split-align-threshold",
    cl::desc("when deciding to split a function, apply this alignment "
             "while doing the size comparison (see -split-threshold). "
             "Default value: 2."),
    cl::init(2), cl::Hidden, cl::cat(BoltOptCategory));

static cl::opt<bool, false, DeprecatedSplitFunctionOptionParser>
    SplitFunctions("split-functions",
                   cl::desc("split functions into fragments"),
                   cl::cat(BoltOptCategory));

static cl::opt<unsigned> SplitThreshold(
    "split-threshold",
    cl::desc("split function only if its main size is reduced by more than "
             "the given number of bytes. Default value: 0, i.e. split iff the "
             "size is reduced. Note that on some architectures the size can "
             "increase after splitting."),
    cl::init(0), cl::Hidden, cl::cat(BoltOptCategory));
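
The descriptions of `-split-threshold` and `-split-align-threshold` suggest a size comparison along the following lines. This is a sketch inferred from the option text only, not the pass's actual code: `alignUp` and `splitIsProfitable` are hypothetical names, and the defaults (alignment 2, threshold 0) are taken from the option definitions above.

```cpp
#include <cstdint>

// Round Size up to the next multiple of Align (Align must be non-zero).
uint64_t alignUp(uint64_t Size, uint64_t Align) {
  return (Size + Align - 1) / Align * Align;
}

// Hypothetical check: keep a split only if the aligned size of the main (hot)
// fragment is smaller than the aligned original size by more than Threshold
// bytes.
bool splitIsProfitable(uint64_t OriginalSize, uint64_t HotSizeAfterSplit,
                       uint64_t Align = 2, uint64_t Threshold = 0) {
  return alignUp(OriginalSize, Align) >
         alignUp(HotSizeAfterSplit, Align) + Threshold;
}
```

Applying the alignment before comparing means a one-byte reduction that disappears after rounding, such as 100 bytes to 99 with alignment 2, does not count as a gain.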

static cl::opt<SplitFunctionsStrategy> SplitStrategy(
    "split-strategy", cl::init(SplitFunctionsStrategy::Profile2),
    cl::values(clEnumValN(SplitFunctionsStrategy::Profile2, "profile2",
                          "split each function into a hot and cold fragment "
                          "using profiling information")),
    cl::values(clEnumValN(
        SplitFunctionsStrategy::Random2, "random2",
        "split each function into a hot and cold fragment at a randomly chosen "
        "split point (ignoring any available profiling information)")),
    cl::values(clEnumValN(
        SplitFunctionsStrategy::RandomN, "randomN",
        "split each function into N fragments at randomly chosen split "
        "points (ignoring any available profiling information)")),
    cl::values(clEnumValN(
        SplitFunctionsStrategy::All, "all",
        "split all basic blocks of each function into fragments such that each "
        "fragment contains exactly a single basic block")),
    cl::desc("strategy used to partition blocks into fragments"),
    cl::cat(BoltOptCategory));
|
} // namespace opts

namespace {
// Returns true if every basic block of BF has a recorded execution count.
bool hasFullProfile(const BinaryFunction &BF) {
  return llvm::all_of(BF.blocks(), [](const BinaryBasicBlock &BB) {
    return BB.getExecutionCount() != BinaryBasicBlock::COUNT_NO_PROFILE;
  });
}

// Returns true if every basic block of BF has an execution count of zero.
bool allBlocksCold(const BinaryFunction &BF) {
  return llvm::all_of(BF.blocks(), [](const BinaryBasicBlock &BB) {
    return BB.getExecutionCount() == 0;
  });
}

struct SplitProfile2 final : public SplitStrategy {
  bool canSplit(const BinaryFunction &BF) override {
    return BF.hasValidProfile() && hasFullProfile(BF) && !allBlocksCold(BF);
  }

  bool keepEmpty() override { return false; }

  void fragment(const BlockIt Start, const BlockIt End) override {
    for (BinaryBasicBlock *const BB : llvm::make_range(Start, End)) {
      if (BB->getExecutionCount() == 0)
        BB->setFragmentNum(FragmentNum::cold());
    }
  }
};

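The profile-driven strategy above can be illustrated outside of BOLT with a small standalone sketch. It is not the BOLT implementation: blocks are reduced to a vector of execution counts, `NoProfile` is a hypothetical sentinel standing in for `BinaryBasicBlock::COUNT_NO_PROFILE`, fragment 0 plays the role of the hot fragment and 1 the cold one, and the `BF.hasValidProfile()` check is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sentinel meaning "no profile data for this block".
constexpr std::uint64_t NoProfile = std::numeric_limits<std::uint64_t>::max();

// A function qualifies if every block has a count and at least one is non-zero.
bool canSplitProfile2(const std::vector<std::uint64_t> &Counts) {
  const bool FullProfile =
      std::none_of(Counts.begin(), Counts.end(),
                   [](std::uint64_t C) { return C == NoProfile; });
  const bool AllCold = std::all_of(Counts.begin(), Counts.end(),
                                   [](std::uint64_t C) { return C == 0; });
  return FullProfile && !AllCold;
}

// Blocks with a zero count go to fragment 1 (cold); all others stay in 0 (hot).
std::vector<unsigned> fragmentProfile2(const std::vector<std::uint64_t> &Counts) {
  std::vector<unsigned> Fragments(Counts.size(), 0);
  for (std::size_t I = 0; I < Counts.size(); ++I)
    if (Counts[I] == 0)
      Fragments[I] = 1;
  return Fragments;
}
```

Note that cold blocks do not need to be contiguous in this sketch; only the per-block fragment number is assigned.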
struct SplitRandom2 final : public SplitStrategy {
  std::minstd_rand0 Gen;

  SplitRandom2() : Gen(opts::RandomSeed.getValue()) {}

  bool canSplit(const BinaryFunction &BF) override { return true; }

  bool keepEmpty() override { return false; }

  void fragment(const BlockIt Start, const BlockIt End) override {
    using DiffT = typename std::iterator_traits<BlockIt>::difference_type;
    const DiffT NumBlocks = End - Start;
    assert(NumBlocks > 0 && "Cannot fragment empty function");

    // We want to split at least one block
    const auto LastSplitPoint = std::max<DiffT>(NumBlocks - 1, 1);
    std::uniform_int_distribution<DiffT> Dist(1, LastSplitPoint);
    const DiffT SplitPoint = Dist(Gen);
    for (BinaryBasicBlock *BB : llvm::make_range(Start + SplitPoint, End))
      BB->setFragmentNum(FragmentNum::cold());

    LLVM_DEBUG(dbgs() << formatv("BOLT-DEBUG: randomly chose last {0} (out of "
                                 "{1} possible) blocks to split\n",
                                 NumBlocks - SplitPoint, End - Start));
  }
};

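As a rough illustration of how the two-way random split picks its split point, here is a standalone sketch using only the standard library. The function name `splitRandom2` and the plain vector of fragment numbers (0 = hot, 1 = cold) are assumptions made for this example and are not part of BOLT.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Returns one fragment number per block: 0 for hot, 1 for cold.
std::vector<unsigned> splitRandom2(long NumBlocks, std::minstd_rand0 &Gen) {
  assert(NumBlocks > 0 && "cannot fragment an empty function");
  std::vector<unsigned> Fragments(NumBlocks, 0);

  // Valid split points are 1..NumBlocks-1; with a single block the only
  // choice is 1, which leaves the cold range empty.
  const long LastSplitPoint = std::max<long>(NumBlocks - 1, 1);
  std::uniform_int_distribution<long> Dist(1, LastSplitPoint);
  const long SplitPoint = Dist(Gen);

  // Everything from the split point to the end becomes cold.
  for (long I = SplitPoint; I < NumBlocks; ++I)
    Fragments[I] = 1;
  return Fragments;
}
```

Under these assumptions the first block always stays hot, and a function with a single block ends up with no cold blocks at all.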
struct SplitRandomN final : public SplitStrategy {
  std::minstd_rand0 Gen;

  SplitRandomN() : Gen(opts::RandomSeed.getValue()) {}

  bool canSplit(const BinaryFunction &BF) override { return true; }

  bool keepEmpty() override { return false; }

  void fragment(const BlockIt Start, const BlockIt End) override {
    using DiffT = typename std::iterator_traits<BlockIt>::difference_type;
    const DiffT NumBlocks = End - Start;
    assert(NumBlocks > 0 && "Cannot fragment empty function");

    // With n blocks, there are n-1 places to split them.
    const DiffT MaximumSplits = NumBlocks - 1;
    // We want to generate at least two fragments if possible, but if there is
    // only one block, no splits are possible.
    const auto MinimumSplits = std::min<DiffT>(MaximumSplits, 1);
    std::uniform_int_distribution<DiffT> Dist(MinimumSplits, MaximumSplits);
    // Choose how many splits to perform
    const DiffT NumSplits = Dist(Gen);

    // Draw split points from a lottery
    SmallVector<unsigned, 0> Lottery(MaximumSplits);
    // Start lottery at 1, because there is no meaningful split point before
    // the first block.
    std::iota(Lottery.begin(), Lottery.end(), 1u);
    std::shuffle(Lottery.begin(), Lottery.end(), Gen);
    Lottery.resize(NumSplits);
    llvm::sort(Lottery);

    // Add one past the end entry to lottery
    Lottery.push_back(NumBlocks);

    unsigned LotteryIndex = 0;
    unsigned BBPos = 0;
    for (BinaryBasicBlock *const BB : make_range(Start, End)) {
      // Check whether to start new fragment
      if (BBPos >= Lottery[LotteryIndex])
        ++LotteryIndex;

      // Because LotteryIndex is 0 based and cold fragments are 1 based, we can
      // use the index to assign fragments.
      BB->setFragmentNum(FragmentNum(LotteryIndex));

      ++BBPos;
    }
  }
};

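The lottery used by the N-way random split can be tried in isolation. The sketch below mirrors the steps above with `std::vector` and `std::sort` in place of the LLVM containers and helpers; the function name `splitRandomN` and the returned vector of fragment indices are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns one fragment index per block position; index 0 is the first fragment.
std::vector<unsigned> splitRandomN(unsigned NumBlocks, std::minstd_rand0 &Gen) {
  assert(NumBlocks > 0 && "cannot fragment an empty function");

  // With n blocks there are n-1 places to split; take at least one if possible.
  const unsigned MaximumSplits = NumBlocks - 1;
  const unsigned MinimumSplits = std::min(MaximumSplits, 1u);
  std::uniform_int_distribution<unsigned> Dist(MinimumSplits, MaximumSplits);
  const unsigned NumSplits = Dist(Gen);

  // Candidate split points 1..n-1: shuffle, keep the first NumSplits, sort.
  std::vector<unsigned> Lottery(MaximumSplits);
  std::iota(Lottery.begin(), Lottery.end(), 1u);
  std::shuffle(Lottery.begin(), Lottery.end(), Gen);
  Lottery.resize(NumSplits);
  std::sort(Lottery.begin(), Lottery.end());
  // One-past-the-end sentinel so the lookup below never runs off the vector.
  Lottery.push_back(NumBlocks);

  std::vector<unsigned> Fragments(NumBlocks);
  unsigned LotteryIndex = 0;
  for (unsigned BBPos = 0; BBPos < NumBlocks; ++BBPos) {
    // A new fragment starts whenever the position reaches the next split point.
    if (BBPos >= Lottery[LotteryIndex])
      ++LotteryIndex;
    Fragments[BBPos] = LotteryIndex;
  }
  return Fragments;
}
```

Because the drawn split points are distinct, the fragment index grows by at most one from block to block, so fragments are numbered consecutively without gaps.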
struct SplitAll final : public SplitStrategy {
  bool canSplit(const BinaryFunction &BF) override { return true; }

  bool keepEmpty() override {
    // Keeping empty fragments allows us to test that empty fragments do not
    // generate symbols.
    return true;
  }

  void fragment(const BlockIt Start, const BlockIt End) override {
    unsigned Fragment = 0;
    for (BinaryBasicBlock *const BB : llvm::make_range(Start, End))
      BB->setFragmentNum(FragmentNum(Fragment++));
  }
};
} // namespace

namespace llvm {
namespace bolt {

bool SplitFunctions::shouldOptimize(const BinaryFunction &BF) const {
  // Apply execution count threshold
  if (BF.getKnownExecutionCount() < opts::ExecutionCountThreshold)
    return false;

  return BinaryFunctionPass::shouldOptimize(BF);
}

void SplitFunctions::runOnFunctions(BinaryContext &BC) {
  if (!opts::SplitFunctions)
    return;

  std::unique_ptr<SplitStrategy> Strategy;
  bool ForceSequential = false;

  switch (opts::SplitStrategy) {
  case SplitFunctionsStrategy::Profile2:
    Strategy = std::make_unique<SplitProfile2>();
    break;
  case SplitFunctionsStrategy::Random2:
    Strategy = std::make_unique<SplitRandom2>();
    // If we split functions randomly, we need to ensure that across runs with
    // the same input, we generate random numbers for each function in the same
    // order.
    ForceSequential = true;
    break;
  case SplitFunctionsStrategy::RandomN:
    Strategy = std::make_unique<SplitRandomN>();
    ForceSequential = true;
    break;
  case SplitFunctionsStrategy::All:
    Strategy = std::make_unique<SplitAll>();
    break;
  }

  ParallelUtilities::PredicateTy SkipFunc = [&](const BinaryFunction &BF) {
    return !shouldOptimize(BF);
  };

  ParallelUtilities::runOnEachFunction(
      BC, ParallelUtilities::SchedulingPolicy::SP_BB_LINEAR,
      [&](BinaryFunction &BF) { splitFunction(BF, *Strategy); }, SkipFunc,
      "SplitFunctions", ForceSequential);

  if (SplitBytesHot + SplitBytesCold > 0)
    outs() << "BOLT-INFO: splitting separates " << SplitBytesHot
           << " hot bytes from " << SplitBytesCold << " cold bytes "
           << format("(%.2lf%% of split functions is hot).\n",
                     100.0 * SplitBytesHot / (SplitBytesHot + SplitBytesCold));
}

void SplitFunctions::splitFunction(BinaryFunction &BF, SplitStrategy &S) {
  if (BF.empty())
    return;

  if (!S.canSplit(BF))
    return;

  FunctionLayout &Layout = BF.getLayout();
  BinaryFunction::BasicBlockOrderType PreSplitLayout(Layout.block_begin(),
                                                     Layout.block_end());

  BinaryContext &BC = BF.getBinaryContext();
  size_t OriginalHotSize;
  size_t HotSize;
  size_t ColdSize;
  if (BC.isX86()) {
    std::tie(OriginalHotSize, ColdSize) = BC.calculateEmittedSize(BF);
    LLVM_DEBUG(dbgs() << "Estimated size for function " << BF
                      << " pre-split is <0x"
                      << Twine::utohexstr(OriginalHotSize) << ", 0x"
                      << Twine::utohexstr(ColdSize) << ">\n");
  }

  BinaryFunction::BasicBlockOrderType NewLayout(Layout.block_begin(),
                                                Layout.block_end());
  // Never outline the first basic block.
  NewLayout.front()->setCanOutline(false);
  for (BinaryBasicBlock *const BB : NewLayout) {
    if (!BB->canOutline())
      continue;

    // Do not split extra entry points in aarch64. They can be referenced
    // by ADR instructions, and when this happens, these blocks cannot be
    // placed far away due to the limited range of the ADR instruction.
    if (BC.isAArch64() && BB->isEntryPoint()) {
      BB->setCanOutline(false);
      continue;
    }

    if (BF.hasEHRanges() && !opts::SplitEH) {
      // We cannot move landing pads (or rather entry points for landing pads).
      if (BB->isLandingPad()) {
        BB->setCanOutline(false);
        continue;
      }
      // We cannot move a block that can throw since exception-handling
      // runtime cannot deal with split functions. However, if we can guarantee
      // that the block never throws, it is safe to move the block to
      // decrease the size of the function.
      for (MCInst &Instr : *BB) {
        if (BC.MIB->isInvoke(Instr)) {
          BB->setCanOutline(false);
          break;
        }
      }
    }
  }

  BF.getLayout().updateLayoutIndices();
  S.fragment(NewLayout.begin(), NewLayout.end());

  // Make sure all non-outlineable blocks are in the main-fragment.
  for (BinaryBasicBlock *const BB : NewLayout) {
    if (!BB->canOutline())
      BB->setFragmentNum(FragmentNum::main());
  }

  if (opts::AggressiveSplitting) {
    // All blocks with 0 count that we can move go to the end of the function.
    // Even if they were natural to cluster formation and were seen in-between
    // hot basic blocks.
    llvm::stable_sort(NewLayout, [&](const BinaryBasicBlock *const A,
                                     const BinaryBasicBlock *const B) {
      return A->getFragmentNum() < B->getFragmentNum();
    });
  } else if (BF.hasEHRanges() && !opts::SplitEH) {
    // Typically functions with exception handling have landing pads at the end.
    // We cannot move beginning of landing pads, but we can move 0-count blocks
    // comprising landing pads to the end and thus facilitate splitting.
    auto FirstLP = NewLayout.begin();
    while ((*FirstLP)->isLandingPad())
      ++FirstLP;

    std::stable_sort(FirstLP, NewLayout.end(),
                     [&](BinaryBasicBlock *A, BinaryBasicBlock *B) {
                       return A->getFragmentNum() < B->getFragmentNum();
                     });
  }

  // Make sure that fragments are increasing.
  FragmentNum CurrentFragment = NewLayout.back()->getFragmentNum();
  for (BinaryBasicBlock *const BB : reverse(NewLayout)) {
    if (BB->getFragmentNum() > CurrentFragment)
      BB->setFragmentNum(CurrentFragment);
    CurrentFragment = BB->getFragmentNum();
  }

  if (!S.keepEmpty()) {
    FragmentNum CurrentFragment = FragmentNum::main();
    FragmentNum NewFragment = FragmentNum::main();
    for (BinaryBasicBlock *const BB : NewLayout) {
      if (BB->getFragmentNum() > CurrentFragment) {
        CurrentFragment = BB->getFragmentNum();
        NewFragment = FragmentNum(NewFragment.get() + 1);
      }
      BB->setFragmentNum(NewFragment);
    }
  }

  BF.getLayout().update(NewLayout);

  // For shared objects, invoke instructions and corresponding landing pads
  // have to be placed in the same fragment. When we split them, create
  // trampoline landing pads that will redirect the execution to real LPs.
  TrampolineSetType Trampolines;
  if (!BC.HasFixedLoadAddress && BF.hasEHRanges() && BF.isSplit())
    Trampolines = createEHTrampolines(BF);

  // Check the new size to see if it's worth splitting the function.
  if (BC.isX86() && BF.isSplit()) {
    std::tie(HotSize, ColdSize) = BC.calculateEmittedSize(BF);
    LLVM_DEBUG(dbgs() << "Estimated size for function " << BF
                      << " post-split is <0x" << Twine::utohexstr(HotSize)
                      << ", 0x" << Twine::utohexstr(ColdSize) << ">\n");
    if (alignTo(OriginalHotSize, opts::SplitAlignThreshold) <=
        alignTo(HotSize, opts::SplitAlignThreshold) + opts::SplitThreshold) {
      if (opts::Verbosity >= 2) {
        outs() << "BOLT-INFO: Reversing splitting of function "
               << formatv("{0}:\n {1:x}, {2:x} -> {3:x}\n", BF, HotSize,
                          ColdSize, OriginalHotSize);
      }

      // Reverse the action of createEHTrampolines(). The trampolines will be
      // placed immediately before the matching destination resulting in no
      // extra code.
      if (PreSplitLayout.size() != BF.size())
        PreSplitLayout = mergeEHTrampolines(BF, PreSplitLayout, Trampolines);

      for (BinaryBasicBlock &BB : BF)
        BB.setFragmentNum(FragmentNum::main());
      BF.getLayout().update(PreSplitLayout);
    } else {
      SplitBytesHot += HotSize;
      SplitBytesCold += ColdSize;
    }
  }
}
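
The two fragment-numbering passes above (clamping from the back so fragment numbers never decrease along the layout, then renumbering so no fragment number is skipped) can be illustrated in isolation. The following is a minimal standalone sketch, not BOLT code: `FragmentList`, `makeNonDecreasing`, and `compactFragments` are hypothetical names, plain `unsigned` values stand in for `FragmentNum`, and the main fragment is assumed to be number 0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in: one fragment number per basic block, in layout order.
using FragmentList = std::vector<unsigned>;

// Walk the layout backwards and clamp each fragment number to the one that
// follows it, so the sequence never decreases from front to back.
void makeNonDecreasing(FragmentList &Fragments) {
  if (Fragments.empty())
    return;
  unsigned Current = Fragments.back();
  for (std::size_t I = Fragments.size(); I-- > 0;) {
    if (Fragments[I] > Current)
      Fragments[I] = Current;
    Current = Fragments[I];
  }
}

// Renumber a non-decreasing sequence so that the fragment numbers in use are
// consecutive, starting at 0 (assumed here to be the main fragment).
void compactFragments(FragmentList &Fragments) {
  unsigned Current = 0;
  unsigned New = 0;
  for (unsigned &F : Fragments) {
    if (F > Current) {
      Current = F;
      ++New;
    }
    F = New;
  }
}
```

For example, the layout `{0, 2, 1, 3, 3}` becomes `{0, 1, 1, 3, 3}` after the first pass and `{0, 1, 1, 2, 2}` after the second. The second pass corresponds to the branch taken when empty fragments are not kept.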

SplitFunctions::TrampolineSetType
SplitFunctions::createEHTrampolines(BinaryFunction &BF) const {
  const auto &MIB = BF.getBinaryContext().MIB;

  // Map real landing pads to the corresponding trampolines.
  TrampolineSetType LPTrampolines;

  // Iterate over the copy of basic blocks since we are adding new blocks to the
  // function which will invalidate its iterators.
  std::vector<BinaryBasicBlock *> Blocks(BF.pbegin(), BF.pend());
  for (BinaryBasicBlock *BB : Blocks) {
    for (MCInst &Instr : *BB) {
      const std::optional<MCPlus::MCLandingPad> EHInfo = MIB->getEHInfo(Instr);
      if (!EHInfo || !EHInfo->first)
        continue;

      const MCSymbol *LPLabel = EHInfo->first;
      BinaryBasicBlock *LPBlock = BF.getBasicBlockForLabel(LPLabel);
      if (BB->getFragmentNum() == LPBlock->getFragmentNum())
        continue;

      const MCSymbol *TrampolineLabel = nullptr;
      const TrampolineKey Key(BB->getFragmentNum(), LPLabel);
      auto Iter = LPTrampolines.find(Key);
      if (Iter != LPTrampolines.end()) {
        TrampolineLabel = Iter->second;
      } else {
        // Create a trampoline basic block in the same fragment as the thrower.
        // Note: there's no need to insert the jump instruction, it will be
        // added by fixBranches().
        BinaryBasicBlock *TrampolineBB = BF.addBasicBlock();
        TrampolineBB->setFragmentNum(BB->getFragmentNum());
        TrampolineBB->setExecutionCount(LPBlock->getExecutionCount());
        TrampolineBB->addSuccessor(LPBlock, TrampolineBB->getExecutionCount());
        TrampolineBB->setCFIState(LPBlock->getCFIState());
        TrampolineLabel = TrampolineBB->getLabel();
        LPTrampolines.insert(std::make_pair(Key, TrampolineLabel));
      }

      // Substitute the landing pad with the trampoline.
      MIB->updateEHInfo(Instr,
                        MCPlus::MCLandingPad(TrampolineLabel, EHInfo->second));
    }
  }

  if (LPTrampolines.empty())
    return LPTrampolines;

  // All trampoline blocks were added to the end of the function. Place them at
  // the end of corresponding fragments.
  BinaryFunction::BasicBlockOrderType NewLayout(BF.getLayout().block_begin(),
                                                BF.getLayout().block_end());
  stable_sort(NewLayout, [&](BinaryBasicBlock *A, BinaryBasicBlock *B) {
    return A->getFragmentNum() < B->getFragmentNum();
  });
  BF.getLayout().update(NewLayout);

  // Conservatively introduce branch instructions.
  BF.fixBranches();

  // Update exception-handling CFG for the function.
  BF.recomputeLandingPads();

  return LPTrampolines;
}

SplitFunctions::BasicBlockOrderType SplitFunctions::mergeEHTrampolines(
    BinaryFunction &BF, SplitFunctions::BasicBlockOrderType &Layout,
    const SplitFunctions::TrampolineSetType &Trampolines) const {
  DenseMap<const MCSymbol *, SmallVector<const MCSymbol *, 0>>
      IncomingTrampolines;
  for (const auto &Entry : Trampolines) {
    IncomingTrampolines[Entry.getFirst().Target].emplace_back(
        Entry.getSecond());
  }

  BasicBlockOrderType MergedLayout;
  for (BinaryBasicBlock *BB : Layout) {
    auto Iter = IncomingTrampolines.find(BB->getLabel());
    if (Iter != IncomingTrampolines.end()) {
      for (const MCSymbol *const Trampoline : Iter->getSecond()) {
        BinaryBasicBlock *LPBlock = BF.getBasicBlockForLabel(Trampoline);
        assert(LPBlock && "Could not find matching landing pad block.");
        MergedLayout.push_back(LPBlock);
      }
    }
    MergedLayout.push_back(BB);
  }

  return MergedLayout;
}
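
The ordering produced by `mergeEHTrampolines` (every trampoline is emitted immediately before the landing pad it targets, and all other blocks keep their relative order) can be shown with a small standalone sketch. This is an illustration under assumptions, not BOLT code: `Label` and `mergeTrampolines` are hypothetical names, strings stand in for `MCSymbol` labels and basic blocks, and `std::map` replaces `DenseMap`.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a block label.
using Label = std::string;

// Trampolines is a list of (target landing pad, trampoline) label pairs.
// Returns Layout with each trampoline inserted right before its target.
std::vector<Label>
mergeTrampolines(const std::vector<Label> &Layout,
                 const std::vector<std::pair<Label, Label>> &Trampolines) {
  // Group trampolines by the landing pad they redirect to.
  std::map<Label, std::vector<Label>> Incoming;
  for (const auto &Entry : Trampolines)
    Incoming[Entry.first].push_back(Entry.second);

  std::vector<Label> Merged;
  for (const Label &Block : Layout) {
    auto It = Incoming.find(Block);
    if (It != Incoming.end())
      Merged.insert(Merged.end(), It->second.begin(), It->second.end());
    Merged.push_back(Block);
  }
  return Merged;
}
```

With the layout `{"entry", "body", "lp"}` and two trampolines `t0` and `t1` targeting `lp`, the merged layout is `{"entry", "body", "t0", "t1", "lp"}`. Because each trampoline sits directly before its destination, it can fall through, which is why the comment in the splitting code describes the merge as adding no extra code.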

} // namespace bolt
} // namespace llvm