intel/llvm

mirror of https://github.com/intel/llvm.git synced 2026-01-15 04:17:17 +08:00

Author	SHA1	Message	Date
spupyrev	244a476a2e	using offsets for CG Summary: Arc->AvgOffset can be used for function/block ordering to distinguish between calls from the beginning of a function and calls from the end of the function. This makes a difference for large functions. (cherry picked from FBD6094221)	2017-10-18 15:18:52 -07:00
Maksim Panchenko	61e5fbf8c3	[BOLT][Refactoring] Get rid of TailCallTerminatedBlocks, etc. Summary: More changes to allow separation of CFG construction and profile assignment. Misc cleanups. (cherry picked from FBD6158653)	2017-10-23 23:32:40 -07:00
Bill Nell	c58996fd55	[BOLT] Add ability to specify custom printers for annotations. Summary: This will give us the ability to print annotations in a more meaningful way. Especially annotations that could be interpreted in multiple ways. I've added one register name printer for liveness analysis. We can update the other dataflow annotations as needed. I also noticed that BitVector annotations were leaking since they contain heap allocated memory. I made removeAnnotation call the annotation destructor explicitly to mitigate this but it won't fix the problem when annotations are just dropped en masse. (cherry picked from FBD6105999)	2017-10-19 12:36:48 -07:00
Maksim Panchenko	2ab7472329	[BOLT] Account for FDE functions when calculating max function size Summary: When we calculate maximum function size we only used to rely on the symbol table information, and ignore function info coming from FDEs. Invalid maximum function size can lead to code emission over the code of neighbouring function. Fix this by considering FDE functions when determining the maximum function size. (cherry picked from FBD6025613)	2017-10-10 14:54:09 -07:00
Maksim Panchenko	1e1833c8a2	[BOLT][Refactoring] Make CTC first class operand, etc. Summary: This diff is a preparation for decoupling function disassembly, profile association, and CFG construction phases. We used to have multiple ways to mark conditional tail calls with annotations or TailCallOffsets map. Since CTC information is affecting the correctness, it is justifiable to have it as an operand class for instruction with a destination (0 is a valid one). "Offset" annotation now replaces "EdgeCountData" and "IndirectBranchData" annotations to extract profile data for any given instruction. Inlining for small functions was broken in the presence of profiled (annotated) instructions and hence I had to remove "-inline-small-functions" from the test case. Also fix an issue with UNDEF section for created __hot_start/__hot_end symbols. Now the symbols use ABS section. (cherry picked from FBD6087284)	2017-10-12 14:57:11 -07:00
spupyrev	b77172ce2f	updating cache metrics Summary: This is a replacement of a previous diff. The implemented metric ('graph distance') is not very useful at the moment but I plan to add more relevant metrics in the subsequent diff. This diff fixes some obvious problems and moves the call of CalcMetrics::printAll to the right place. (cherry picked from FBD6072312)	2017-10-16 16:53:50 -07:00
Maksim Panchenko	4c8f48be3d	[BOLT] Fix function order output option Summary: Add support to output both function order and section order files as the former is useful for offloading functions sorting and the latter is useful for linker script generation: -generate-function-order=<file> -generate-link-sections=<file> (cherry picked from FBD6078446)	2017-10-17 10:05:16 -07:00
Maksim Panchenko	bee9132a54	[BOLT] Change function order file format for linker script Summary: Change output of "-generate-function-order=<file>" to match expected format used for a linker script: * Prefix function names with ".text". * Strip internal suffix from local function names. E.g. for function with names "foo/1" and "foo/foo.c/1" we will only output "foo". * Output (with indentation) duplicate names for folded functions. (cherry picked from FBD6071020)	2017-10-16 15:22:05 -07:00
Maksim Panchenko	1605f07f5c	[BOLT] Create symbol table entries under -hot-text if they did not exist Summary: If the "-hot-text" option is specified and the input binary did not have __hot_start/__hot_end symbols, then add them to the symbol table. (cherry picked from FBD6027737)	2017-10-10 18:06:45 -07:00
Maksim Panchenko	3d3fefff46	[BOLT] Use 32 as the default max bytes for function alignment Summary: Several benchmarks (hhvm, compilers) show that 32 provides a good balance between I-Cache performance and iTLB misses. (cherry picked from FBD6026476)	2017-10-10 16:36:01 -07:00
Rafael Auler	7689cf2417	[BOLT] Fix bolt_info ELF note Summary: Small fix - align the end of the descriptor string as well, since readelf will detect when it is not aligned and print an error instead of printing BOLT version and command line. (cherry picked from FBD6023643)	2017-10-10 13:30:05 -07:00
Rafael Auler	0cc2a62f6a	[BOLT] Write bolt info according to ELF spec Summary: Follow ELF spec for NOTE sections when writing bolt info. Since tools such as "readelf -n" will not recognize a custom code identifying our new note section, we use GNU "gold linker version" note, tricking readelf into printing bolt info. (cherry picked from FBD6010153)	2017-10-06 17:54:26 -07:00
Rafael Auler	0ed144a188	[PERF2BOLT] Check build-ids of binaries when aggregating Summary: Check the build-id of the input binary against the build-id of the binary used during profiling data collection with perf, as reported in perf.data. If they differ, issue a warning, since the user should use exactly the same binary. If we cannot determine the build-id of either the input binary or the one registered in the input perf.data, cancel the build-id check but print a log message. (cherry picked from FBD6001917)	2017-10-06 14:42:46 -07:00
spupyrev	f77a6acd71	fixing sizes Summary: In some (weird) cases, a Function is marked 'split' but doesn't contain any 'cold' basic block. In that case, the size of the last basic block of the function is computed incorrectly. Hence, this fix. (cherry picked from FBD6012963)	2017-10-09 14:15:38 -07:00
Rafael Auler	9df6dce234	[PERF2BOLT] Fix aggregator wrt new output format of perf Summary: Perf is now outputting one less space, which broke our previous (flaky) assumptions about field separators when processing the output file. Make it more resilient by accepting any number of spaces before reading LBR entries. (cherry picked from FBD6014941)	2017-10-09 15:52:13 -07:00
Rafael Auler	f02c8c29ee	[PERF2BOLT] Improve user messages about profiling stats Summary: Improve messages and color-code bad traces percentage, warning user about a potential input binary mismatch. (cherry picked from FBD5915934)	2017-09-26 14:42:43 -07:00
Maksim Panchenko	f32784f4cb	[BOLT] Ignore Clang LTO artifact file symbol Summary: The presence of ld-temp.o symbol is somewhat indeterministic. I couldn't find out exactly when it's generated, it could be related to LTO vs ThinLTO, but not always. If the symbol is there, it could affect names of most of functions in LTO binary. The status of the symbol may change between the binary the profile was collected on, and the binary BOLT is called on. As a result, we may mismatch many function names. It is safe to ignore this symbol. (cherry picked from FBD5908955)	2017-09-25 18:05:37 -07:00
Bill Nell	aa05dc91c5	Fix SCTC bug when two pred/succ BB are in a loop. Summary: It's possible that two basic blocks being considered for SCTC are in a loop in the CFG. In this case a block that is both a predecessor and a successor may have been processed and marked invalid by a previous iteration of the SCTC loop. We should skip rewriting in this case. (cherry picked from FBD5886721)	2017-09-21 15:45:39 -07:00
Rafael Auler	42f957bb75	[BOLT] Integrate perf2bolt into llvm-bolt Summary: Move the data aggregator logic from our python script to our C++ LLVM/BOLT libs. This has a dramatic reduction in processing time for profiling data (from 45 minutes for HHVM to 5 minutes) because we directly use BOLT as a disassembler in order to validate traces found in the LBR and to add the fallthrough counts. Previously, the python approach relied on parsing the output objdump to check traces. (cherry picked from FBD5761313)	2017-09-01 18:13:51 -07:00
Maksim Panchenko	156fc73157	[BOLT] Fix SCTC bug Summary: If conditional branch has been converted to conditional tail call, it may be considered for SCTC optimization later since it will appear as a tail call. We have to make sure that the tail call we are considering is not a conditional branch. (cherry picked from FBD5884777)	2017-09-19 16:59:05 -07:00
Maksim Panchenko	b006d2a860	[BOLT] Fix issue with exception handlers splitting Summary: A cold part of a function can start with a landing pad. As a result, this landing pad will have offset 0 from the start of the corresponding FDE, and it wouldn't get registered by exception-handling runtime. The solution is to use a different landing pad base address (LPStart), such as (FDE_start - 1). (cherry picked from FBD5876561)	2017-09-20 13:32:46 -07:00
Rafael Auler	ef0ec9edf9	[BOLT] Fix frameopt=all for gcc Summary: Fix two bugs. First, stack pointer tracking, the dataflow analysis, was converging to the "superposition" state (meaning that at this point there are multiple and conflicting states) too early in case the entry state in the BB was "empty" AND there was an SP computation in the block. In these cases, we need to propagate an "empty" value as well and wait for an iteration where the input is not empty (only entry BBs start with a non-empty well-defined value). Previously, it was propagating "superposition", meaning there is a conflict of states in this block, which is not true, since the input is empty and, therefore, there is no preceding state to justify a collision of states. Second, if SPT failed and has no idea about the stack values in a block (if it is in the superposition state at a given point in a BB), shrink wrapping should not attempt to insert computation into those blocks that we do not understand what is happening. Fix it to bail on those cases. (cherry picked from FBD5858402)	2017-09-18 16:26:00 -07:00
Rafael Auler	9df155ce11	[BOLT] Introduce non-LBR mode Summary: Add support to read profiles collected without LBR. This involves adapting our data aggregator perf2bolt and adding support in llvm-bolt itself to read this data. This patch also introduces different options to convert basic block execution count to edge count, so BOLT can operate with its regular algorithms to perform basic block layout. The most successful approach is the default one. (cherry picked from FBD5664735)	2017-08-02 10:59:33 -07:00
Maksim Panchenko	29d4f4cfac	[BOLT] Ignore TLS relocations types Summary: No special handling is required for TLS relocations types, and if we see them in the binary we can safely ignore those types. (cherry picked from FBD5853889)	2017-09-13 11:21:47 -07:00
Maksim Panchenko	ec5b3b0a65	[BOLT] Fix bug in SCTC Summary: After SCTC optimization fixDoubleJumps() was relying on CFG information on the number of successors of a basic block. It ignored the fact that conditional tail call had a successor outside of the function and deleted a containing basic block. Discovered while testing old HHVM with disabled jump tables. (cherry picked from FBD5752903)	2017-08-31 17:28:14 -07:00
Maksim Panchenko	bd8e4b9e87	[BOLT] Support PIC-style exception tables Summary: Exceptions tables for PIC may contain indirect type references that are also encoded using relative addresses. This diff adds support for such encodings. We read PIC-style type info table, and write it using new encoding. (cherry picked from FBD5716060)	2017-08-27 17:04:06 -07:00
Maksim Panchenko	49d1f5698d	[BOLT] PLT optimization Summary: Add an option to optimize PLT calls: -plt - optimize PLT calls (requires linking with -znow) =none - do not optimize PLT calls =hot - optimize executed (hot) PLT calls =all - optimize all PLT calls When optimized, the calls are converted to use GOT reference indirectly. GOT entries are guaranteed to contain a valid function pointer if lazy binding is disabled - hence the requirement for linker's -znow option. Note: we can add an entry to .dynamic and drop a requirement for -znow if we were moving .dynamic to a new segment. (cherry picked from FBD5579789)	2017-08-04 11:21:05 -07:00
Maksim Panchenko	0c07445110	[BOLT] Fix printing of dyno-stats Summary: We used to print dyno-stats after instruction lowering which was skewing our metrics as tail calls were no longer recognized as calls for one thing. The fix is to control the point at which dyno-stats printing pass is run and run it immediately before instruction lowering. In the future we may decide to run the pass before some other intervening pass. (cherry picked from FBD5605639)	2017-08-10 13:18:44 -07:00
Rafael Auler	21c48f7d78	Fix profiling for functions with multiple entry points Summary: Fix issue in memcpy where one of its entry points was getting no profiling data and was wrongly considered cold, being put in the cold region. (cherry picked from FBD5569156)	2017-08-02 18:14:01 -07:00
Rafael Auler	b81ff8a8fc	[BOLT] Fix SCTC issue with hot-cold split Summary: SCTC was deleting an unconditional branch to a block in the cold area because it was the next block in the layout vector. Fix the condition to only delete such branches when source and target are in the same allocation area (either both hot or both cold). (cherry picked from FBD5570300)	2017-08-04 20:14:24 -07:00
Maksim Panchenko	e4290d083f	[BOLT] Disable last basic block assertion. Summary: While converting code from __builtin_unreachable() we were asserting that a basic block with a conditional jump and a single CFG successor was the last one before converting the jump to an unconditional one. However, if that code was executed after a conditional tail call conversion in the same function, the original last basic block will no longer be the last one in the post-conversion layout. I'm disabling the assertion since it doesn't seem worth it to add extra checks for the basic block that used to be the last one. (cherry picked from FBD5570298)	2017-08-04 19:39:45 -07:00
Maksim Panchenko	ae409f0b27	[BOLT] Better match LTO functions profile. Summary: * Improve profile matching for LTO binaries that don't match 100%. * Fix profile matching for '.LTHUNK' functions. Add external outgoing branches (calls) for profile validation. There's an improvement for 100% match profile and for stale LTO profile. However, we are still not fully closing the gap with stale profile when LTO is enabled. (NOTE: I haven't updated all test cases yet) (cherry picked from FBD5529293)	2017-07-17 11:22:22 -07:00
Maksim Panchenko	d27b31ee07	[BOLT] Fix reading LSDA address for PIC code Summary: Fix a bug while reading LSDA address in PIC format. The base address was wrong for PC-relative value. There's more work involved in making PIC code with C++ exceptions work. (cherry picked from FBD5538755)	2017-08-01 11:19:01 -07:00
Yue Zhao	eb64d03b73	Reformat the register strings in the output so Stoke can parse without preprocessing. Summary: Minor change. Reformat the def-in, live-out register strings so that Stoke can parse without doing preprocessing. (cherry picked from FBD5537421)	2017-07-27 12:52:56 -07:00
Bohan Ren	87481cb494	[BOLT] Improve Jump-Distance Metric -- Consider Function Execution Count Summary: Function execution count is very important. When calculating metric, we should care more about functions which are known to be executed. The correlations between this metric and both CPU time is slightly improved to be close to 96% and the correlation between this metric and Cache Miss remains the same 96%. Thanks the suggestion from Sergey! (cherry picked from FBD5494720)	2017-07-25 16:27:00 -07:00
Rafael Auler	787db1cf3e	Recognize AArch64 as a valid input Summary: BOLT needs to be configured with the LLVM AArch64 backend. If the backend is linked into the LLVM library, start processing AArch64 binaries. (cherry picked from FBD5489369)	2017-07-25 09:11:42 -07:00
Yue Zhao	70bad8d34d	add: get function score to find hot functions refine the dumped csv format Summary: minor modification of the bolt stoke pass (cherry picked from FBD5471011)	2017-07-13 15:02:52 -07:00
Yue Zhao	6d845719ce	get analysis information of functions Summary: complete the StokeInfo pass, ignore previous arc diff (cherry picked from FBD5306863)	2017-06-13 17:24:27 -07:00
Rafael Auler	4e29afeb18	[BOLT] Add cold symbols to the symbol table Summary: Create new .symtab and .strtab sections, so we can change their sizes and not only patch them. Remove local symbols and add symbols to identify the cold part of split functions. (cherry picked from FBD5345460)	2017-06-27 16:25:59 -07:00
Bohan Ren	4d34471eeb	[BOLT] Improved Jump-Distance Metric Summary: Current existing Jump-Distance Metric (Previously named Call-Distance) will ignore some traversals. This modified version adds those missing traversals back. The correlation remains the same: around 97% correlation with CPU and Cache Miss (which implies that even though some traversals are ignored, it doesn't affect correlation that much.) (cherry picked from FBD5369653)	2017-07-04 15:59:29 -07:00
Rafael Auler	4ecd3856e9	[BOLT] Fix shrink-wrapping bugs Summary: Make shrink-wrapping more stable. Changes: * Correctly detect landing pads at the dominance frontier, bailing on such cases because we are not prepared to split LPs that are target of a critical edge. * Disable FOP's store removal by default - this is experimental and shouldn't go to prod because removing a store that we failed to detect is actually necessary is disastrous. This pass currently doesn't have a great impact on the number of stores reduced, so it is not a problem. Most stores reduced are due to shrink wrapping anyway. * Fix stack access identification - correctly estimate memory length of weird instructions, bail if we don't know. * Make rules for shrink-wrapping more strict: cancel shrink wrapping on a number of cases when we are not 100% sure that we are dealing with a regular callee-saved register. * Add basic block folding to SW. Sometimes when splitting critical edges we create a lot of redundant BBs with the same instructions, same successor but different predecessor. Fold all identical BBs created by splitting critical edges. * Change defaults: now the threshold used to determine when to perform SW is more conservative, to be sure we are moving a spill to a colder area. This effort, along with BB folding, helps us to avoid hurting icache performance by indiscriminately increasing code size. (cherry picked from FBD5315086)	2017-06-22 16:34:01 -07:00
Bohan Ren	ec304396c3	[BOLT] Call Distance Metric Summary: Designed a new metric, which shows 93.46% correlation with Cache Miss and 86% correlation with CPU Time. Definition: One can get all the traversal paths for each function. And for each traversal, we will define a distance. The distance represents how far two connected basic blocks are. Therefore, for each traversal, I will go through the basic blocks one by one, until the end of the traversal and sum up the distance for the neighboring basic blocks. Distance between two connected basic blocks is the distance of the centers of two blocks in the binary file. (cherry picked from FBD5242526)	2017-06-13 16:29:39 -07:00
Rafael Auler	3469396269	[BOLT] Set local symbols in relocation mode to zero Summary: Strobelight is getting confused by local symbols that we do not update in relocation mode. These symbols were preserved by the linker in relocation mode in order to support emitting relocations against local labels, but they are unused. Issue a quick fix to this by detecting such symbols and setting their value to zero. This patch also fixes an issue with the symbol table that was assigning the wrong section index to symbols associated with the .text section. (cherry picked from FBD5271277)	2017-06-16 20:04:43 -07:00
Bill Nell	59e90f0f43	[BOLT] Make function reordering more robust with stale data. Summary: Rewrote the guts of buildCallGraph. There are two new options to control how the CG is created. UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile. IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions. I've changed the way BinaryFunction::estimateHotSize() works. If the function is marked as split, I count the size of all the non-cold blocks. This gives a different but more accurate answer than the old method. I've improved and updated the CG build stats with extra information. (cherry picked from FBD5224183)	2017-06-09 13:17:36 -07:00
Rafael Auler	8233c7d204	[BOLT] Bail frame analysis on PUSHes escaping vars Summary: Some PUSH instructions may contain memory addresses pushed to the stack. If this memory address is from an object in the stack, cancel further frame analysis for this function since it may be escaping a variable. This fixes a bug with deleting used stores (in frameopt) in hhvm trunk. (cherry picked from FBD5270590)	2017-06-16 15:02:26 -07:00
Yue Zhao	37d0f81df5	BinaryFunction.h: Clarify comment for getSize(), add getNumNonPseudos() Summary: Minor fix and add new function (cherry picked from FBD5270376)	2017-06-16 17:06:13 -07:00
Bill Nell	dc4dd64800	[BOLT] More HFSort+ refactoring Summary: Move most of hfsort+ into a class so the state can more easily be shared. (cherry picked from FBD5216206)	2017-06-08 10:55:28 -07:00
Bohan Ren	f819f53d27	Normalize Clusters Twice Summary: This one will normalize cluster twice, leaving edges connecting two basic block untouched (cherry picked from FBD5207416)	2017-06-07 20:25:30 -07:00
Rafael Auler	eeea415dd2	[BOLT] Fix SCTC execution count assertion Summary: SCTC is currently asserting (my fault :-) when running in combination with hot jump table entries optimization. This optimization sets the frequency for edges connecting basic blocks it creates and jump table targets based on the execution count of the original BB containing the indirect jump. This is OK as an estimation, but it breaks our assumption that the sum of the frequency of preds edges equals to our BB frequency. This happens because the frequency of the BB is rarely equal to its outgoing edges frequency. SCTC, in turn, was updating the execution count for BBs with tail calls by subtracting the frequency count of predecessor edges. Because hot jump table entries optimization broke the BB exec count = sum(preds freq) invariant, SCTC was asserting. To trigger this, the input program must have a jump table where each entry contains a tail call. This happens in the HHVM binary for func _ZN4HPHP11collections5issetEPNS_10ObjectDataEPKNS_10TypedValueE. (cherry picked from FBD5222504)	2017-06-09 15:52:50 -07:00
Bohan Ren	eb63a0b295	[BOLT] Expand BOLT report for basic block ordering Summary: Add a new positional option onto bolt: "-print-function-statistics=<uint64>" which prints information about block ordering for requested number of functions. (cherry picked from FBD5105323)	2017-05-22 11:04:01 -07:00
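
The function order format described in commit bee9132a54 above (prefix names with ".text", strip the internal suffix from local function names, and indent duplicate names of folded functions) can be illustrated with a rough sketch. This is not BOLT's code: the helper names `section_entry` and `format_function_order`, the dot separator after ".text", the two-space indentation, and the input shape (one list of names per emitted function) are all assumptions made for illustration.

```python
def section_entry(symbol):
    """Map a symbol name to a linker-script style entry.

    Assumption: everything from the first '/' onward is the internal suffix
    of a local function name, so "foo/1" and "foo/foo.c/1" both become "foo".
    Assumption: the ".text" prefix is joined to the name with a dot.
    """
    base = symbol.split("/", 1)[0]
    return ".text." + base


def format_function_order(functions):
    """Format an ordered list of functions, one entry per line.

    `functions` is a list of name lists (hypothetical input shape): the first
    name is the primary one, the rest are other names of the same code, for
    example functions folded into it. Names that collapse to the same entry
    after suffix stripping are printed once; remaining names are indented.
    """
    lines = []
    for names in functions:
        entries = []
        for name in names:
            entry = section_entry(name)
            if entry not in entries:
                entries.append(entry)
        if not entries:
            continue
        lines.append(entries[0])
        lines.extend("  " + entry for entry in entries[1:])
    return "\n".join(lines)


if __name__ == "__main__":
    order = [["foo/1", "foo/foo.c/1"], ["bar", "baz"]]
    print(format_function_order(order))
```

Here "foo/1" and "foo/foo.c/1" collapse into a single ".text.foo" line, while "baz", assumed to be folded into "bar", appears indented under ".text.bar".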

325 Commits