[BOLT][NFC] Use brstack in guides and user outputs (#163950)

Update guides to use brstack, with a mention of BRBE for AArch64. Use
brstack in user-facing outputs.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
Paschalis Mpeis
2025-10-20 10:30:06 +01:00
committed by GitHub
parent b90a8d385e
commit 96688d4b3c
12 changed files with 42 additions and 38 deletions

@@ -108,9 +108,10 @@ $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
#### For Services
Once you get the service deployed and warmed-up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
data with brstack (branch information). Different architectures implement this
using different hardware units, for example LBR on X86, and BRBE on AArch64.
The exact perf command to use will depend on the service. E.g., to collect the
data for all processes running on the server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```
@@ -163,7 +164,7 @@ $ perf2bolt -p perf.data -o perf.fdata <executable>
This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.
If the profile was collected without LBRs, you will need to add `-nl` flag to
If the profile was collected without brstacks, you will need to add `-nl` flag to
the command line above.
### Step 3: Optimize with BOLT

@@ -1,7 +1,7 @@
# Code Heatmaps
BOLT has gained the ability to print code heatmaps based on
sampling-based profiles generated by `perf`, either with `LBR` data or not.
sampling-based profiles generated by `perf`, either with `brstack` data or not.
The output is produced in colored ASCII to be displayed in a color-capable
terminal. It looks something like this:
@@ -20,9 +20,9 @@ or if you want to monitor the existing process(es):
$ perf record -e cycles:u -j any,u [-p PID|-a] -- sleep <interval>
```
Running with LBR (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
from basic events by using the llvm-bolt-heatmap option `-nl` (no LBR) but
such heatmaps do not have the coverage provided by LBR and may only be useful
Running with brstack (`-j any,u` or `-b`) is recommended. Heatmaps can be generated
from basic events by using the llvm-bolt-heatmap option `-nl` (no brstack) but
such heatmaps do not have the coverage provided by brstack and may only be useful
for finding event hotspots at larger code block granularities.
Once the run is complete, and `perf.data` is generated, run llvm-bolt-heatmap:

@@ -97,7 +97,7 @@ BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
790053908 : all conditional branches (=)
...
```
The statistics in the output is based on the LBR profile collected with `perf`, and since we were using
The statistics in the output is based on the brstack profile (LBR) collected with `perf`, and since we were using
the `cycles` counter, its accuracy is affected. However, the relative improvement in `taken conditional
branches` is a good indication that BOLT was able to straighten out the code even after PGO.

@@ -5,7 +5,7 @@
Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve the CPU utilization not only in the user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains while being applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to the kernel compiled with the highest level optimizations, including PGO and LTO. The database spent ~40% of the time in the kernel and was quite sensitive to kernel performance.
BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history (brstack), such as Intel's last branch records (LBR) or AArch64's Branch Record Buffer Extension (BRBE). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.

@@ -46,16 +46,15 @@ namespace opts {
static cl::opt<bool>
BasicAggregation("nl",
cl::desc("aggregate basic samples (without LBR info)"),
cl::desc("aggregate basic samples (without brstack info)"),
cl::cat(AggregatorCategory));
cl::opt<bool> ArmSPE("spe", cl::desc("Enable Arm SPE mode."),
cl::cat(AggregatorCategory));
static cl::opt<std::string>
ITraceAggregation("itrace",
cl::desc("Generate LBR info with perf itrace argument"),
cl::cat(AggregatorCategory));
static cl::opt<std::string> ITraceAggregation(
"itrace", cl::desc("Generate brstack info with perf itrace argument"),
cl::cat(AggregatorCategory));
static cl::opt<bool>
FilterMemProfile("filter-mem-profile",
@@ -201,7 +200,7 @@ void DataAggregator::start() {
}
if (opts::BasicAggregation) {
launchPerfProcess("events without LBR", MainEventsPPI,
launchPerfProcess("events without brstack", MainEventsPPI,
"script -F pid,event,ip");
} else if (!opts::ITraceAggregation.empty()) {
// Disable parsing memory profile from trace data, unless requested by user.
@@ -1069,7 +1068,7 @@ ErrorOr<DataAggregator::LBREntry> DataAggregator::parseLBREntry() {
if (std::error_code EC = Rest.getError())
return EC;
if (Rest.get().size() < 5) {
reportError("expected rest of LBR entry");
reportError("expected rest of brstack entry");
Diag << "Found: " << Rest.get() << "\n";
return make_error_code(llvm::errc::io_error);
}
@@ -1433,7 +1432,7 @@ std::error_code DataAggregator::printLBRHeatMap() {
errs() << "HEATMAP-ERROR: no basic event samples detected in profile. "
"Cannot build heatmap.";
} else {
errs() << "HEATMAP-ERROR: no LBR traces detected in profile. "
errs() << "HEATMAP-ERROR: no brstack traces detected in profile. "
"Cannot build heatmap. Use -nl for building heatmap from "
"basic events.\n";
}
@@ -1572,7 +1571,7 @@ void DataAggregator::printBranchStacksDiagnostics(
std::error_code DataAggregator::parseBranchEvents() {
std::string BranchEventTypeStr =
opts::ArmSPE ? "SPE branch events in LBR-format" : "branch events";
opts::ArmSPE ? "SPE branch events in brstack-format" : "branch events";
outs() << "PERF2BOLT: parse " << BranchEventTypeStr << "...\n";
NamedRegionTimer T("parseBranch", "Parsing branch events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
@@ -1620,7 +1619,7 @@ std::error_code DataAggregator::parseBranchEvents() {
clear(TraceMap);
outs() << "PERF2BOLT: read " << NumSamples << " samples and " << NumEntries
<< " LBR entries\n";
<< " brstack entries\n";
if (NumTotalSamples) {
if (NumSamples && NumSamplesNoLBR == NumSamples) {
// Note: we don't know if perf2bolt is being used to parse memory samples
@@ -1628,8 +1627,10 @@ std::error_code DataAggregator::parseBranchEvents() {
if (!opts::ArmSPE)
errs()
<< "PERF2BOLT-WARNING: all recorded samples for this binary lack "
"LBR. Record profile with perf record -j any or run perf2bolt "
"in no-LBR mode with -nl (the performance improvement in -nl "
"brstack. Record profile with perf record -j any or run "
"perf2bolt "
"in non-brstack mode with -nl (the performance improvement in "
"-nl "
"mode may be limited)\n";
else
errs()
@@ -1664,7 +1665,7 @@ void DataAggregator::processBranchEvents() {
}
std::error_code DataAggregator::parseBasicEvents() {
outs() << "PERF2BOLT: parsing basic events (without LBR)...\n";
outs() << "PERF2BOLT: parsing basic events (without brstack)...\n";
NamedRegionTimer T("parseBasic", "Parsing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
while (hasData()) {
@@ -1688,7 +1689,7 @@ std::error_code DataAggregator::parseBasicEvents() {
}
void DataAggregator::processBasicEvents() {
outs() << "PERF2BOLT: processing basic events (without LBR)...\n";
outs() << "PERF2BOLT: processing basic events (without brstack)...\n";
NamedRegionTimer T("processBasic", "Processing basic events", TimerGroupName,
TimerGroupDesc, opts::TimeAggregator);
uint64_t OutOfRangeSamples = 0;
@@ -1777,7 +1778,8 @@ std::error_code DataAggregator::parsePreAggregatedLBRSamples() {
++AggregatedLBRs;
}
outs() << "PERF2BOLT: read " << AggregatedLBRs << " aggregated LBR entries\n";
outs() << "PERF2BOLT: read " << AggregatedLBRs
<< " aggregated brstack entries\n";
return std::error_code();
}
@@ -2426,7 +2428,7 @@ std::error_code DataAggregator::writeBATYAML(BinaryContext &BC,
void DataAggregator::dump() const { DataReader::dump(); }
void DataAggregator::dump(const PerfBranchSample &Sample) const {
Diag << "Sample LBR entries: " << Sample.LBR.size() << "\n";
Diag << "Sample brstack entries: " << Sample.LBR.size() << "\n";
for (const LBREntry &LBR : Sample.LBR)
Diag << LBR << '\n';
}

@@ -570,7 +570,7 @@ void DataReader::readBasicSampleData(BinaryFunction &BF) {
if (!SampleDataOrErr)
return;
// Basic samples mode territory (without LBR info)
// Basic samples mode territory (without brstack info)
// First step is to assign BB execution count based on samples from perf
BF.ProfileMatchRatio = 1.0f;
BF.removeTagsFromProfile();
@@ -578,8 +578,8 @@ void DataReader::readBasicSampleData(BinaryFunction &BF) {
bool NormalizeByCalls = usesEvent("branches");
static bool NagUser = true;
if (NagUser) {
outs()
<< "BOLT-INFO: operating with basic samples profiling data (no LBR).\n";
outs() << "BOLT-INFO: operating with basic samples profiling data (no "
"brstack).\n";
if (NormalizeByInsnCount)
outs() << "BOLT-INFO: normalizing samples by instruction count.\n";
else if (NormalizeByCalls)

@@ -46,7 +46,7 @@ WRITE-BAT-CHECK: BOLT-INFO: BAT section size (bytes): 404
READ-BAT-CHECK-NOT: BOLT-ERROR: unable to save profile in YAML format for input file processed by BOLT
READ-BAT-CHECK: BOLT-INFO: Parsed 5 BAT entries
READ-BAT-CHECK: PERF2BOLT: read 79 aggregated LBR entries
READ-BAT-CHECK: PERF2BOLT: read 79 aggregated brstack entries
READ-BAT-CHECK: HEATMAP: building heat map
READ-BAT-CHECK: BOLT-INFO: 5 out of 21 functions in the binary (23.8%) have non-empty execution profile
READ-BAT-FDATA-CHECK: BOLT-INFO: 5 out of 16 functions in the binary (31.2%) have non-empty execution profile

@@ -32,7 +32,7 @@ RUN: --block-size=1024 | FileCheck --check-prefix CHECK-HEATMAP-BAT-1K %s
CHECK-HEATMAP-BAT-1K: HEATMAP: dumping heatmap with bucket size 1024
CHECK-HEATMAP-BAT-1K-NOT: HEATMAP: dumping heatmap with bucket size
CHECK-HEATMAP: PERF2BOLT: read 81 aggregated LBR entries
CHECK-HEATMAP: PERF2BOLT: read 81 aggregated brstack entries
CHECK-HEATMAP: HEATMAP: invalid traces: 1
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP: HEATMAP: dumping heatmap with bucket size 128
@@ -71,7 +71,7 @@ CHECK-HM-1024-NEXT: 0
CHECK-BAT-HM-64: (349, 1126]
CHECK-BAT-HM-4K: (605, 2182]
CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated LBR entries
CHECK-HEATMAP-BAT: PERF2BOLT: read 79 aggregated brstack entries
CHECK-HEATMAP-BAT: HEATMAP: invalid traces: 2
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 64
CHECK-HEATMAP-BAT: HEATMAP: dumping heatmap with bucket size 4096

@@ -17,7 +17,7 @@
# CHECK-FDATA-NEXT: 1 _start [[#]] 1
# CHECK-BOLT: BOLT-INFO: pre-processing profile using branch profile reader
# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no LBR).
# CHECK-BOLT: BOLT-INFO: operating with basic samples profiling data (no brstack).
# CHECK-BOLT: BOLT-INFO: 1 out of 1 functions in the binary (100.0%) have non-empty execution profile
.globl _start

@@ -6,6 +6,6 @@ RUN: %clang %cflags %p/../../Inputs/asm_foo.s %p/../../Inputs/asm_main.c -o %t.e
RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe 2> /dev/null
RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-LBR
RUN: perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe | FileCheck %s --check-prefix=CHECK-SPE-BRSTACK
CHECK-SPE-LBR: PERF2BOLT: parse SPE branch events in LBR-format
CHECK-SPE-BRSTACK: PERF2BOLT: parse SPE branch events in brstack-format

@@ -69,7 +69,8 @@ int main(int argc, char **argv) {
" - Sampled profile collected from the binary:\n"
" - perf data or pre-aggregated profile data (instrumentation profile "
"not supported)\n"
" - perf data can have basic (IP) or branch-stack (LBR) samples\n\n"
" - perf data can have basic (IP) or branch-stack (brstack) "
"samples\n\n"
" Outputs:\n"
" - Heatmaps: colored ASCII (requires a color-capable terminal or a"

@@ -120,14 +120,14 @@ void mergeProfileHeaders(BinaryProfileHeader &MergedHeader,
if (!MergedHeader.Id.empty() && (MergedHeader.Id != Header.Id))
errs() << "WARNING: build-ids in merged profiles do not match\n";
// Cannot merge samples profile with LBR profile.
// Cannot merge samples profile with brstack profile.
if (!MergedHeader.Flags)
MergedHeader.Flags = Header.Flags;
constexpr auto Mask = llvm::bolt::BinaryFunction::PF_BRANCH |
llvm::bolt::BinaryFunction::PF_BASIC;
if ((MergedHeader.Flags & Mask) != (Header.Flags & Mask)) {
errs() << "ERROR: cannot merge LBR profile with non-LBR profile\n";
errs() << "ERROR: cannot merge brstack profile with non-brstack profile\n";
exit(1);
}
MergedHeader.Flags = MergedHeader.Flags | Header.Flags;
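
Reviewer note on the last hunk: the check in `mergeProfileHeaders` compares only the profile-kind bits of the two headers, so a brstack profile and a basic-samples profile are rejected while other flag bits are ignored. The sketch below restates that rule in isolation. It is a minimal illustration, not the code from this commit: the helper `canMerge` is hypothetical, and the numeric flag values are assumptions standing in for `llvm::bolt::BinaryFunction::PF_BRANCH` and `PF_BASIC`.

```cpp
#include <cstdint>

// Assumed stand-ins for the profile-kind bits; the real constants are
// llvm::bolt::BinaryFunction::PF_BRANCH / PF_BASIC and may have other values.
constexpr std::uint16_t PF_BRANCH = 1;   // profile built from brstack samples
constexpr std::uint16_t PF_BASIC = 2;    // profile built from basic (IP) samples
constexpr std::uint16_t PF_MEMEVENT = 4; // hypothetical unrelated bit, not in the mask

// Only the profile-kind bits take part in the compatibility check.
constexpr std::uint16_t Mask = PF_BRANCH | PF_BASIC;

// Hypothetical helper: returns true if a profile with flags `Incoming` may be
// merged into a header whose accumulated flags are `Merged`. An empty `Merged`
// (the first profile seen) adopts the incoming flags, as in the hunk above.
constexpr bool canMerge(std::uint16_t Merged, std::uint16_t Incoming) {
  return ((Merged ? Merged : Incoming) & Mask) == (Incoming & Mask);
}

// The first profile is always accepted.
static_assert(canMerge(0, PF_BRANCH), "first profile sets the kind");
// brstack and basic-samples profiles cannot be mixed.
static_assert(!canMerge(PF_BRANCH, PF_BASIC), "kinds must match");
// Bits outside the mask do not affect compatibility.
static_assert(canMerge(PF_BRANCH, PF_BRANCH | PF_MEMEVENT), "extra bits ignored");
```

Under these assumptions, the case where `canMerge` returns false corresponds to the path that prints `ERROR: cannot merge brstack profile with non-brstack profile` and exits.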