intel/llvm - llvm - Gitea: Git with a cup of tea

intel/llvm

mirror of https://github.com/intel/llvm.git synced 2026-01-14 03:50:17 +08:00

Author	SHA1	Message	Date
Rafael Auler	32b332ad2d	[BOLT] Fix ShrinkWrapping bugs and enable testing Summary: Fix a few ShrinkWrapping bugs: - Using push-pop mode in a function that required aligned stack - Correctly update the edges in jump tables after splitting critical edges - Fix stack pointer restores based on RBP + offset, when we change the stack layout in push-pop mode. (cherry picked from FBD6755232)	2017-12-14 17:26:19 -08:00
Rafael Auler	6d0401ccfb	[BOLT/LSDA] Fix alignment Summary: Fix a bug introduced by rebasing with respect to aligned ULEBs. This wasn't breaking anything but it is good to keep LDSA aligned. (cherry picked from FBD7094742)	2018-02-26 20:09:14 -08:00
Bill Nell	ddefc770b0	[BOLT] Refactoring of section handling code Summary: This is a big refactoring of the section handling code. I've removed the SectionInfoMap and NoteSectionInfo and stored all the associated info about sections in BinaryContext and BinarySection classes. BinarySections should now hold all the info we care about for each section. They can be initialized from SectionRefs but don't necessarily require one to be created. There are only one or two spots that needed access to the original SectionRef to work properly. The trickiest part was making sure RewriteInstance.cpp iterated over the proper sets of sections for each of it's different types of processing. The different sets are broken down roughly as allocatable and non-alloctable and "registered" (I couldn't think up a better name). "Registered" means that the section has been updated to include output information, i.e. contents, file offset/address, new size, etc. It may help to have special iterators on BinaryContext to iterate over the different classes to make things easier. I can do that if you guys think it is worthwhile. I found pointee_iterator in the llvm ADT code. Use that for iterating over BBs in BinaryFunction rather than the custom iterator class. (cherry picked from FBD6879086)	2018-02-01 16:33:43 -08:00
Maksim Panchenko	6744f0dbeb	[BOLT] Fix jump table placement for non-simple functions Summary: When we move a jump table to either hot or cold new section (-jump-tables=move), we rely on a number of taken branches from the table to decide if it's hot or cold. However, if the function is non-simple, we always get 0 count, and always move the table to the cold section. Instead, we should make a conservative decision based on the execution count of the function. (cherry picked from FBD7058127)	2018-02-22 11:20:46 -08:00
Andy Newell	e15623058e	Cache+ speed, reduce mallocs Summary: Speed of cache+ by skipping mallocs on vectors. Although this change speeds up the algorithm by 2x, this is still not enough for some binaries where some functions have ~2500 hot basic blocks. Hence, introduce a threshold for expensive optimizations in CachePlusReorderAlgorithm. If the number of hot basic blocks exceeds the threshold (2048 by default), we use a cheaper version, which is quite fast. (cherry picked from FBD6928075)	2018-02-09 09:58:19 -08:00
Maksim Panchenko	5599c01911	[BOLT] Fixes for new profile Summary: Do a better job of recording fall-through branches in new profile mode (-prof-compat-mode=0). For this we need to record offsets for all instructions that are last in the containing basic block. Change the way we convert conditional tail calls. Now we never reverse the condition. This is required for better profile matching. The original approach of preserving the direction was controversial to start with. Add "-infer-fall-throughs" option (on by default) to allow disabling inference of fall-through edge counts. (cherry picked from FBD6994293)	2018-02-13 11:21:59 -08:00
Maksim Panchenko	a24c5543ea	[BOLT] Improved function profile matching Summary: Prioritize functions with 100% name match when doing LTO "fuzzy" name matching. Avoid re-assigning profile to a function. (cherry picked from FBD6992179)	2018-02-14 12:30:27 -08:00
Maksim Panchenko	1298d99a41	[BOLT] Limited "support" for AVX-512 Summary: In relocation mode trap on entry to any function that has AVX-512 instructions. This is controlled by "-trap-avx512" option which is on by default. If the option is disabled and AVX-512 instruction is seen in relocation mode, then we abort while re-writing the binary. (cherry picked from FBD6893165)	2018-02-02 16:07:11 -08:00
Rafael Auler	8a5a30156e	[BOLT rebase] Rebase fixes on top of LLVM Feb2018 Summary: This commit includes all code necessary to make BOLT working again after the rebase. This includes a redesign of the EHFrame work, cherry-pick of the 3dnow disassembly work, compilation error fixes, and port of the debug_info work. The macroop fusion feature is not ported yet. The rebased version has minor changes to the "executed instructions" dynostats counter because REP prefixes are considered a part of the instruction it applies to. Also, some X86 instructions had the "mayLoad" tablegen property removed, which BOLT uses to identify and account for loads, thus reducing the total number of loads reported by dynostats. This was observed in X86::MOVDQUmr. TRAP instructions are not terminators anymore, changing our CFG. This commit adds compensation to preserve this old behavior and minimize tests changes. debug_info sections are now slightly larger. The discriminator field in the line table is slightly different due to a change upstream. New profiles generated with the other bolt are incompatible with this version because of different hash values calculated for functions, so they will be considered 100% stale. This commit changes the corresponding test to XFAIL so it can be updated. The hash function changes because it relies on raw opcode values, which change according to the opcodes described in the X86 tablegen files. When processing HHVM, bolt was observed to be using about 800MB more memory in the rebased version and being about 5% slower. (cherry picked from FBD7078072)	2018-02-06 15:00:23 -08:00
Maksim Panchenko	600cf0ecf6	[BOLT] Fix memory regression Summary: This fixes the increased memory consumption introduced in an earlier diff while I was working on new profiling infra. The increase came from a delayed release of memory allocated to intermediate structures used to build CFG. In this diff we release them ASAP, and don't keep them for all functions at the same time. (cherry picked from FBD6890067)	2018-02-02 14:46:21 -08:00
Maksim Panchenko	f85264ae18	[BOLT] Reduce the usage of "Offset" annotation Summary: Limiting "Offset" annotation only to instructions that actually need it, improves the memory consumption on HHVM binary by 1GB. (cherry picked from FBD6878943)	2018-02-01 14:36:29 -08:00
Bill Nell	501601259b	[BOLT] Fix branch info stats after SCTC Summary: SCTC was incorrectly swapping BranchInfo when reversing the branch condition. This was wrong because when we remove the successor BB later, it removes the BranchInfo for that BB. In this case the successor would be the BB with the stats we had just swapped. Instead leave BranchInfo as it is and read the branch count from the false or true branch depending on whether we reverse or replace the branch, respectively. The call to removeSuccessor later will remove the unused BranchInfo we no longer care about. (cherry picked from FBD6876799)	2018-02-01 14:24:26 -08:00
Bill Nell	1207e1d229	[BOLT] Fix lookup of non-allocatable sections in RewriteInstance Summary: Register all sections with BinaryContext. Store all sections in a set ordered by (address, size, name). Add two separate maps to lookup sections by address or by name. Non-allocatable sections are not stored in the address->section map since they all "start" at 0. (cherry picked from FBD6862973)	2018-01-31 12:12:59 -08:00
Qinfan Wu	2b8194fa50	Handle types CU list in updateGdbIndexSection Summary: Handle types CU list in `updateGdbIndexSection`. It looks like the types part of `.gdb_index` isn't empty when `-fdebug-types-section` is used. So instead of aborting, we copy the part to new `.gdb_index` section. (cherry picked from FBD6770460)	2018-01-31 11:52:39 -08:00
Maksim Panchenko	d114ef1fa5	[BOLT] Fix profile for multi-entry functions Summary: When we read profile for functions, we initialize counts for entry blocks first, and then populate counts for all blocks based on incoming edges. During the second phase we ignore the entry blocks because we expect them to be already initialized. For the primary entry at offset 0 it's the correct thing to do, since we treat all incoming branches as calls or tail calls. However, for secondary entries we only consider external edges to be from calls and don't increase entry count if an edge originates from inside the function. Thus we need to update the secondary entry basic block counts with internal edges too. (cherry picked from FBD6836817)	2018-01-23 15:18:41 -08:00
Bill Nell	304c8ba80a	[BOLT] Handle multiple sections with the same name Summary: Multiple sections can have the same name, so we need to make the NameToSectionMap into a multimap. (cherry picked from FBD6847622)	2018-01-30 13:18:40 -08:00
Rafael Auler	48370744d9	[BOLT] Do not assert on bad data Summary: A test is asserting on impossible addresses coming from perf.data, instead of just reporting it as bad data. Fix this behavior. (cherry picked from FBD6835590)	2018-01-29 10:37:30 -08:00
spupyrev	626e977c4a	[BOLT] faster cache+ implementation Summary: Speeding up cache+ algorithm. The idea is to find and merge "fallthrough" successors before main optimization. For a pair of blocks, A and B, block B is the fallthrough successor of A, if (i) all jumps (based on profile) from A goes to B and (ii) all jumps to B are from A. Such blocks should be adjacent in an optimal ordering, and should not be considered for splitting. (This gives the speed up). The gap between cache and cache+ reduced from ~2m to ~1m. (cherry picked from FBD6799900)	2018-01-24 12:29:38 -08:00
Bill Nell	89feb847ea	[BOLT] Refactor relocation analysis code. Summary: Refactor the relocation anaylsis code. It should be a little better at validating that the relocation value matches up with the symbol address + addend stored in the relocation (except on aarch64). It is also a little better at finding the symbol address used to do the lookup in BinaryContext, rather than just using symbol address + addend. (cherry picked from FBD6814702)	2018-01-24 05:42:11 -08:00
Bill Nell	2640b4071f	[BOLT] Refactoring - add BinarySection class Summary: Add BinarySection class that is a wrapper around SectionRef. This is refactoring work for static data reordering. (cherry picked from FBD6792785)	2018-01-23 15:10:24 -08:00
Rafael Auler	907ca25841	[BOLT-AArch64] Support large test binary Summary: Rewrite how data/code markers are interpreted, so the code can have constant islands essentially anywhere. This is necessary to accomodate custom AArch64 assembly code coming from mozjpeg. Allow any function to refer to the constant island owned by any other function. When this happens, we pull the constant island from the referred function and emit it as our own, so it will live nearby the code that refers to it, allowing us to freely reorder functions and code pieces. Make bolt more strict about not changing anything in non-simple ARM functions, as we need to preserve offsets for those functions we don't interpret their jump tables (currently any function with jump tables in ARM is non-simple and is left untouched). (cherry picked from FBD6402324)	2017-11-22 16:17:36 -08:00
Maksim Panchenko	b6cb112feb	[BOLT] New profile format Summary: A new profile that is more resilient to minor binary modifications. BranchData is eliminated. For calls, the data is converted into instruction annotations if the profile matches a function. If a profile cannot be matched, AllCallSites data should have call sites profiles. The new profile format is YAML, which is quite verbose. It still takes less space than the older format because we avoid function name repetition. The plan is to get rid of the old profile format eventually. merge-fdata does not work with the new format yet. (cherry picked from FBD6753747)	2017-12-13 23:12:01 -08:00
Rafael Auler	f8f52d01d0	[BOLT-AArch64] Support SPEC17 programs and organize AArch64 tests Summary: Add a few new relocation types to support a wider variety of binaries, add support for constant island duplication (so we can split functions in large binaries) and make LongJmp pass really precise with respect to layout, so we don't miss stubs insertions at the correct places for really large binaries. In LongJmp, introduce "freeze" annotations so fixBranches won't mess the jumps we carefully determined that needed a stub. (cherry picked from FBD6294390)	2017-11-09 16:59:18 -08:00
spupyrev	a599fe1bbc	[BOLT] a new block reordering algorithm Summary: A new block reordering algorithm, cache+, that is designed to optimize i-cache performance. On a high level, this algorithm is a greedy heuristic that merges clusters (ordered sequences) of basic blocks, similarly to how it is done in OptimizeCacheReorderAlgorithm. There are two important differences: (a) the metric that is optimized in the procedure, and (b) how two clusters are merged together. Initially all clusters are isolated basic blocks. On every iteration, we pick a pair of clusters whose merging yields the biggest increase in the ExtTSP metric (see CacheMetrics.cpp for exact implementation), which models how i-cache "friendly" a pecific cluster is. A pair of clusters giving the maximum gain is merged to a new clusters. The procedure stops when there is only one cluster left, or when merging does not increase ExtTSP. In the latter case, the remaining clusters are sorted by density. An important aspect is the way two clusters are merged. Unlike earlier algorithms (e.g., OptimizeCacheReorderAlgorithm or Pettis-Hansen), two clusters, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three clusters (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast. (cherry picked from FBD6466264)	2017-12-01 16:54:08 -08:00
Rafael Auler	1fa80594cf	[BOLT] Do not assign a LP to tail calls Summary: Do not assign a LP to tail calls. They are not calls in the view of an unwinder, they are just regular branches. We were hitting an assertion in BinaryFunction::removeConditionalTailCalls() complaining about landing pads in a CTC, however it was in fact a builtin_unreachable being conservatively treated as a CTC. (cherry picked from FBD6564957)	2017-12-13 19:08:43 -08:00
Rafael Auler	660daac2d0	[BOLT] Fix -simplify-rodata-loads wrt data chunks with relocs Summary: The pass was previously copying data that would change after layout because it had a relocation at the copied address. (cherry picked from FBD6541334)	2017-12-11 17:07:56 -08:00
Maksim Panchenko	85f5f4fb63	[BOLT] Fix debugging derp (cherry picked from FBD28110992)	2017-12-11 17:22:49 -08:00
Maksim Panchenko	67cef1f536	debug (cherry picked from FBD28110897)	2017-12-11 11:44:07 -08:00
Maksim Panchenko	d15b93bade	[BOLT] Major overhaul of profiling in BOLT Summary: Profile reading was tightly coupled with building CFG. Since I plan to move to a new profile format that will be associated with CFG it is critical to decouple the two phases. We now have read profile right after the cfg was constructed, but before it is "canonicalized", i.e. CTCs will till be there. After reading the profile, we do a post-processing pass that fixes CFG and does some post-processing for debug info, such as inference of fall-throughs, which is still required with the current format. Another good reason for decoupling is that we can use profile with CFG to more accurately record fall-through branches during aggregation. At the moment we use "Offset" annotations to facilitate location of instructions corresponding to the profile. This might not be super efficient. However, once we switch to the new profile format the offsets would be no longer needed. We might keep them for the aggregator, but if we have to trust LBR data that might not be strictly necessary. I've tried to make changes while keeping backwards compatibly. This makes it easier to verify correctness of the changes, but that also means that we lose accuracy of the profile. Some refactoring is included. Flag "-prof-compat-mode" (on by default) is used for bug-level backwards compatibility. Disable it for more accurate tracing. (cherry picked from FBD6506156)	2017-11-28 09:57:21 -08:00
Maksim Panchenko	b6f7c68a6c	[BOLT] Automatically detect and use relocations Summary: If relocations are available in the binary, use them by default. If "-relocs" is specified, then require relocations for further processing. Use "-relocs=0" to forcefully ignore relocations. Instead of `opts::Relocs` use `BinaryContext::HasRelocations` to check for the presence of the relocations. (cherry picked from FBD6530023)	2017-12-09 21:40:39 -08:00
Maksim Panchenko	2b9bafed83	[BOLT] Consistent DFS ordering for landing pads Summary: The list of landing pads in BinaryBasicBlock was sorted by their address in memory. As a result, the DFS order was not always deterministic. The change is to store landing pads in the order they appear in invoke instructions while keeping them unique. Also, add Throwers verification to validateCFG(). (cherry picked from FBD6529032)	2017-12-08 20:27:49 -08:00
Maksim Panchenko	10274633ee	[BOLT] Options to facilitate debugging Summary: Some helpful options: -print-dyno-stats-only while printing functions output dyno-stats and skip instructions -report-stale print a list of functions with a stale profile (cherry picked from FBD6505141)	2017-12-06 15:45:57 -08:00
Rafael Auler	70d44ab20a	[BOLT] Add REX prefix rebalancing pass Summary: Add a pass to rebalance the usage of REX prefixes, moving them from the hot code path to the cold path whenever possible. To do this, we rank the usage frequency of each register and exchange an X86 classic reg with an extended one (which requires a REX prefix) whenever the classic register is being used less times than the extended one. There are two versions of this pass: regular one will only consider RBX as classic and R12-R15 as extended registers because those are callee-saved, which means their scope is local to the function and therefore they can be easily interchanged within the function without further consequences. The aggressive version relies on liveness analysis to detect if the value of a register is being used as a caller-saved value (written to without being read first), which also is eligible for reallocation. However, it showed limited results and is not the default option because it is expensive. Currently, this pass does not update debug info. This means that if a substitution is made, the AT_LOCATION of a variable inside a function may be outdated and GDB will display the wrong value if you ask it to print the value of the affected variable. Updating DWARF involves a painful task of writing a new DWARF expression parser/writer similar to the one we already have for CFI expressions. I'll defer the task of writing this until we determine this optimization is enabled in production. So far, it is experimental to be combined with other optimizations to help us find a new set of optimizations that is beneficial. (cherry picked from FBD6476659)	2017-11-14 18:20:40 -08:00
Bill Nell	cd0a075a08	[BOLT] Fix ICP nested jump table handling and general stats. Summary: Load elimination for ICP wasn't handling nested jump tables correctly. It wasn't offseting the indices by the range of the nested table. I also wasn't computing some of the stats ICP correctly in all cases which was leading to weird results in the stats. (cherry picked from FBD6453693)	2017-11-29 17:38:39 -08:00
spupyrev	48a53a7b55	a new i-cache metric Summary: The diff introduces two measures for i-cache performance: a TSP measure (currently used for optimization) and an "extended" TSP measure that takes into account jumps between non-consecutive basic blocks. The two measures are computed for estimated addresses/sizes of basic blocks and for the actually omitted addresses/sizes. Intuitively, the Extended-TSP metric quantifies the expected number of i-cache misses for a given ordering of basic blocks. It has 5 parameters: - FallthroughWeight is the impact of fallthrough jumps on the score - ForwardWeight is the impact of forward (but not fallthrough) jumps - BackwardWeight is the impact of backward jumps - ForwardDistance is the max distance of a forward jump affecting the score - BackwardDistance is the max distance of a backward jump affecting the score We're still learning the "best" values for the options but default values look reasonable so far. (cherry picked from FBD6331418)	2017-11-14 16:51:24 -08:00
Rafael Auler	21eb2139ee	Introduce pass to reduce jump tables footprint Summary: Add a pass to identify indirect jumps to jump tables and reduce their entries size from 8 to 4 bytes. For PIC jump tables, it will convert the PIC code to non-PIC (since BOLT only processes static code, it makes no sense to use expensive PIC-style jumps in static code). Add corresponding improvements to register scavenging pass and add a MCInst matcher machinery. (cherry picked from FBD6421582)	2017-11-02 00:30:11 -07:00
Bill Nell	39a8c36697	[BOLT] Use getNumPrimeOperands in shortenInstruction. Summary: Apply maks' review comments (cherry picked from FBD6451164)	2017-11-30 13:30:49 -08:00
Bill Nell	a71b5700c0	[BOLT] Fix bug in shortening peephole. Summary: The arithmetic shortening code on x86 was broken. It would sometimes shorten instructions with immediate operands that wouldn't fit into 8 bits. (cherry picked from FBD6444699)	2017-11-29 17:40:14 -08:00
Bill Nell	0bab742949	[BOLT] Fix icp-top-callsites option, remove icp-always-on. Summary: The icp-top-callsites option was using basic block counts to pick the top callsites while the ICP main loop was using branch info from the targets of each call. These numbers do not exactly match up so there was a dispcrepancy in computing the top calls. I've switch top callsites over to use the same stats as the main loop. The icp-always-on option was redundant with -icp-top-callsites=100, so I removed it. (cherry picked from FBD6370977)	2017-11-19 11:17:57 -08:00
Bill Nell	591e0ef3ba	[BOLT] Add timers for non-optimization related phases. Summary: Add timers for non-optimization related phases. There are two new options, -time-build for disassembling functions and building CFGs, and -time-rewrite for phases in executeRewritePass(). (cherry picked from FBD6422006)	2017-11-27 18:00:24 -08:00
Rafael Auler	dc23def477	[PERF2BOLT] Fix aggregator wrt traces with REP RET Summary: Previously the perf2bolt aggregator was rejecting traces finishing with REP RET (return instruction with REP prefix) as a result of the migration from objdump output to LLVM disassembler, which decodes REP as a separate instruction. Add code to detect REP RET and treat it as a single return instruction. (cherry picked from FBD6417496)	2017-11-27 12:58:21 -08:00
Bill Nell	b2f132c7c2	[RFC] [BOLT] Use iterators for MC branch/call analysis code. Summary: Here's an implementation of an abstract instruction iterator for the branch/call analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think. It's a bit sloppy with constness and probably needs more tidying up. (cherry picked from FBD6244012)	2017-11-04 19:22:05 -07:00
Bill Nell	c4d7460ed6	[BOLT] Improve ICP for virtual method calls and jump tables using value profiling. Summary: Use value profiling data to remove the method pointer loads from vtables when doing ICP at virtual function and jump table callsites. The basic process is the following: 1. Work backwards from the callsite to find the most recent def of the call register. 2. Work back from the call register def to find the instruction where the vtable is loaded. 3. Find out of there is any value profiling data associated with the vtable load. If so, record all these addresses as potential vtables + method offsets. 4. Since the addresses extracted by #3 will be vtable + method offset, we need to figure out the method offset in order to determine the actual vtable base address. At this point I virtually execute all the instructions that occur between #3 and #2 that touch the method pointer register. The result of this execution should be the method offset. 5. Fetch the actual method address from the appropriate data section containing the vtable using the computed method offset. Make sure that this address maps to an actual function symbol. 6. Try to associate a vtable pointer with each target address in SymTargets. If every target has a vtable, then this is almost certainly a virtual method callsite. 7. Use the vtable address when generating the promoted call code. It's basically the same as regular ICP code except that the compare is against the vtable and not the method pointer. Additionally, the instructions to load up the method are dumped into the cold call block. For jump tables, the basic idea is the same. I use the memory profiling data to find the hottest slots in the jumptable and then use that information to compute the indices of the hottest entries. We can then compare the index register to the hot index values and avoid the load from the jump table. Note: I'm assuming the whole call is in a single BB. According to @rafaelauler, this isn't always the case on ARM. This also isn't always the case on X86 either. If there are non-trivial arguments that are passed by value, there could be branches in between the setup and the call. I'm going to leave fixing this until later since it makes things a bit more complicated. I've also fixed a bug where ICP was introducing a conditional tail call. I made sure that SCTC fixes these up afterwards. I have no idea why I made it introduce a CTC in the first place. (cherry picked from FBD6120768)	2017-10-20 12:11:34 -07:00
spupyrev	1475c4da71	speeding up caches for hfsort+ Summary: When running hfsort+, we invalidate too many cache entries, which leads to inefficiencies. It seems we only need to invalidate cache for pairs of clusters (Into, X) and (X, Into) when modifying cluster Into (for all clusters X). With the modification, we do not really need ShortCache, since it is computed only once per pair of clusters. (cherry picked from FBD6341039)	2017-11-15 14:17:39 -08:00
Maksim Panchenko	0836fa7d08	[BOLT] Fix handling of RememberState CFI Summary: When RememberState CFI happens to be the last CFI in a basic block, we used to set the state of the next basic block to a CFI prior to executing RememberState instruction. This contradicts comments in annotateCFIState() function and also differs form behaviour of getCFIStateAtInstr(). As a result we were getting code like the following: .LBB0121166 (21 instructions, align : 1) CFI State : 0 .... 0000001a: !CFI $1 ; OpOffset Reg6 -16 0000001a: !CFI $2 ; OpRememberState .... Successors: .Ltmp4167600, .Ltmp4167601 CFI State: 3 .Ltmp4167601 (13 instructions, align : 1) CFI State : 2 .... Notice that the state at the entry of the 2nd basic block is less than the state at the exit of the previous basic block. In practice we have never seen basic blocks where RememberState was the last CFI instruction in the basic block, and hence we've never run into this issue before. The fix is a synchronization of handling of last RememberState instruction by annotateCFIState() and getCFIStateAtInstr(). In the example above, the CFI state at the entry to the second BB will be 3 after this diff. (cherry picked from FBD6314916)	2017-11-13 11:05:47 -08:00
Bill Nell	7eaaaaba96	[BOLT] Add finer control of peephole pass. Summary: Add selective control over peephole options. This makes it easier to test which ones might have a positive effect. (cherry picked from FBD6289659)	2017-11-08 18:49:33 -08:00
Rafael Auler	a3b719e0f9	[BOLT] Fix ASAN bugs Summary: Fix a leak in DEBUGRewriter.cpp and an address out of bounds issue in edit distance calculation. (cherry picked from FBD6290026)	2017-11-08 14:29:20 -08:00
Bill Nell	e9aa6e1a33	[BOLT] Fix N-1'th sctc bug. Summary: The logic to append an unconditional branch at the end of a block that had the condition flipped on its conditional tail was broken. It should have been looking at the successor to PredBB instead of BB. It also wasn't skipping invalid blocks when finding the fallthrough block. This fixes the SCTC bug uncovered by @spupyrev's work on block reordering. (cherry picked from FBD6269493)	2017-11-07 16:00:26 -08:00
Maksim Panchenko	f8e6f66c1e	[BOLT] Fix segfault in debug print Summary: With "-debug" flag we are using a dump in intermediate state when basic block's list is initialized, but layout is not. In new isSplit() funciton we were checking the size() which uses basic block list, and then we were accessing the (uninitiazed) layout. Instead of checking size() we should be checking layout_size(). (cherry picked from FBD6277770)	2017-11-08 14:42:14 -08:00
spupyrev	a0c041f72a	[BOLT] Custom function alignment Summary: A new 'compact' function aligner that takes function sizes in consideration. The approach is based on the following assumptions: -- It is not desirable to introduce a large offset when aligning short functions, as it leads to a lot of "wasted" address space. -- For longer functions, the offset can be larger than the default 32 bytes; However, using 64 bytes for the offset still worsen performance, as again a lot of address space is wasted. -- Cold parts of functions can still use the default max-32 offset. The algorithm is switched on/off by flag 'use-compact-aligner' and is controlled by parameters align-functions-max-bytes and align-cold-functions-max-bytes described above. In my tests the best performance is produced with '-use-compact-aligner=true -align-functions-max-bytes=48 -align-cold-functions-max-bytes=32'. (cherry picked from FBD6194092)	2017-10-27 15:05:31 -07:00

1 2 3 4 5 ...

387 Commits