Bugzilla – Bug 209
[loadvn/inline/scalarrepl] Slow optimizations with extremely large basic blocks
Last modified: 2004-02-05 11:29:51
You need to log in before you can comment on or make changes to this bug.
When running gccas on this (large) testcase, the global optimizer takes a very long time (this is compiled in debug mode, so it's a ~3x slower than in release mode, but it's still bad): Total Execution Time: 415.0599 seconds (415.0464 wall clock) ---User Time--- --System Time-- --- Name --- 290.8299 ( 70.2%) 0.7900 ( 87.7%) Global Common Subexpression Elimination 67.7800 ( 16.3%) 0.0099 ( 1.1%) Scalar Replacement of Aggregates 46.3699 ( 11.1%) 0.0800 ( 8.8%) Function Integration/Inlining 7.9800 ( 1.9%) 0.0200 ( 2.2%) Combine redundant instructions 0.3999 ( 0.0%) 0.0000 ( 0.0%) Combine redundant instruction ... 3086 gcse - Number of instructions removed 8 inline - Number of functions deleted because all callers found 2002 inline - Number of functions inlined 10925 instcombine - Number of dead inst eliminated 19223 instcombine - Number of insts combined 7314 scalarrepl - Number of allocas promoted 2413 scalarrepl - Number of allocas broken up All of the rest of the passes take less than .3s to run. The GCSE slowdown is because the 'basicaa' pass currently has to recompute all of the information it needs for every query, which is very slow (there are some infrastructure improvements needed to fix this). There are many alias pairs that GCSE asks about. I'm not sure why SRoA is taking so long, besides the fact that it is doing a lot of expansions. The inlining pass can also probably be sped up. These two can be improved just by looking at the profiling output probably. This bug is just a placeholder so that I remember to get back to this when I have time. -Chris
Created an attachment (id=62) [details] Testcase Here's the C++ source code.
Created an attachment (id=63) [details] Here's the output of the C++ front-end This is the input to GCCAS that is causing it to take so long.
Wow, this is really bad. It looks like GCSE is not sped up substantially by optimization. Here's the timings for a release build: ---User Time--- --System Time-- --- Name --- 246.3399 ( 83.5%) 0.1799 ( 78.2%) Global Common Subexpression Elimination 25.5699 ( 8.6%) 0.0000 ( 0.0%) Scalar Replacement of Aggregates 21.7100 ( 7.3%) 0.0299 ( 13.0%) Function Integration/Inlining 0.8299 ( 0.2%) 0.0100 ( 4.3%) Combine redundant instructions 0.0899 ( 0.0%) 0.0100 ( 4.3%) Bytecode Writer ... -Chris
These two patches speed up SRoA/mem2reg from 42s to 0.62s on this testcase, which makes it take time comperable to all of the non-really-slow optimizations. http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011102.html http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011103.html -Chris
These patches speed up the inliner on this testcase from taking ~21s to taking ~0.8s on this this testcase: http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011109.html http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011111.html http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011124.html http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011129.html -Chris
This patch speeds up the GCSE pass by over 2x (57.14s -> 25.42s), by avoiding unnecessary work in the -load-vn pass. Note that without -load-vn, GCSE takes < 0.3 seconds to run. http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011136.html -Chris
Overall, the patches listed in this trail of this bug have fixed all of the noted efficiency problems in optimizers when they handle large basic blocks. Specifically, the inliner sped up from 21.7s -> 0.82s, SRoA sped up from 25.5s -> 0.46s, and loadvn/GCSE sped up from 246.3s -> 43.3s. This reduced the total time to gccas the testcase from >300s to 45s. All of these numbers are with a release build. As noted in the bug, the timings for loadvn/GCSE are still quite unreasonable, but this isn't specific to large-basic-block programs, and requires infrastructure improvements to fix (specifically GCSE needs a way to update VN/AA analyses on the fly). This will have to happen sometime in the future. -Chris
Hrm, on a whim, I tried running the testcase with -no-aa -load-vn -gcse, showing that alias analysis is taking a trivial amount of time < 1s, so the "infrastructure problem" that I blamed for the slow GCSE is not really the problem. Instead, it looks like load-vn is just being really silly, and needs to be fixed. *sigh* -Chris
Okay, with this final patch: http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011170.html The optimizer is finally able to handle the testcase respectably, taking ~2s to optimize it (as opposed to ~300s originally). I now get the following -time-passes output: Total Execution Time: 2.0800 seconds (2.0652 wall clock) ---User Time--- --System Time-- --- Name --- 0.8200 ( 39.6%) 0.0099 (100.0%) Function Integration/Inlining 0.5100 ( 24.6%) 0.0000 ( 0.0%) Global Common Subexpression Elimination 0.4699 ( 22.7%) 0.0000 ( 0.0%) Scalar Replacement of Aggregates 0.0599 ( 2.8%) 0.0000 ( 0.0%) Bytecode Writer 0.0500 ( 2.4%) 0.0000 ( 0.0%) Combine redundant instructions ... For reference, 'g++ -O2' takes 91s on the same testcase (with GCC 3.3). -Chris