First Last Prev Next    No search results available
Details
: [loadvn/inline/scalarrepl] Slow optimizations with extrem...
Bug#: 209
: libraries
: Scalar Optimizations
Status: RESOLVED
Resolution: FIXED
: All
: All
: 1.0
: P2
: normal
: 1.2

:
: quality-of-implementation
:
:
  Show dependency tree - Show dependency graph
People
Reporter: Chris Lattner <sabre@nondot.org>
Assigned To: Chris Lattner <sabre@nondot.org>

Attachments
Testcase (30.75 KB, text/plain)
2004-01-15 10:51, Chris Lattner
Details
Here's the output of the C++ front-end (42.50 KB, text/plain)
2004-01-15 10:52, Chris Lattner
Details


Note

You need to log in before you can comment on or make changes to this bug.

Related actions


Description:   Opened: 2004-01-15 10:50
When running gccas on this (large) testcase, the global optimizer takes a very
long time (this is compiled in debug mode, so it's a ~3x slower than in release
mode, but it's still bad):

  Total Execution Time: 415.0599 seconds (415.0464 wall clock)

   ---User Time---   --System Time--   --- Name ---
  290.8299 ( 70.2%)   0.7900 ( 87.7%)  Global Common Subexpression Elimination
  67.7800 ( 16.3%)   0.0099 (  1.1%)   Scalar Replacement of Aggregates
  46.3699 ( 11.1%)   0.0800 (  8.8%)   Function Integration/Inlining
   7.9800 (  1.9%)   0.0200 (  2.2%)   Combine redundant instructions
   0.3999 (  0.0%)   0.0000 (  0.0%)   Combine redundant instruction

...
  3086 gcse           - Number of instructions removed
     8 inline         - Number of functions deleted because all callers found
  2002 inline         - Number of functions inlined
 10925 instcombine    - Number of dead inst eliminated
 19223 instcombine    - Number of insts combined
  7314 scalarrepl     - Number of allocas promoted
  2413 scalarrepl     - Number of allocas broken up


All of the rest of the passes take less than .3s to run.

The GCSE slowdown is because the 'basicaa' pass currently has to recompute all
of the information it needs for every query, which is very slow (there are some
infrastructure improvements needed to fix this).  There are many alias pairs
that GCSE asks about.

I'm not sure why SRoA is taking so long, besides the fact that it is doing a lot
of expansions.  The inlining pass can also probably be sped up.  These two can
be improved just by looking at the profiling output probably.

This bug is just a placeholder so that I remember to get back to this when I
have time.

-Chris
------- Comment #1 From Chris Lattner 2004-01-15 10:51:26 -------
Created an attachment (id=62) [details]
Testcase

Here's the C++ source code.
------- Comment #2 From Chris Lattner 2004-01-15 10:52:25 -------
Created an attachment (id=63) [details]
Here's the output of the C++ front-end

This is the input to GCCAS that is causing it to take so long.
------- Comment #3 From Chris Lattner 2004-02-03 15:40:27 -------
Wow, this is really bad.  It looks like GCSE is not sped up substantially by
optimization.  Here's the timings for a release build:

   ---User Time---   --System Time--   --- Name ---
  246.3399 ( 83.5%)   0.1799 ( 78.2%)  Global Common Subexpression Elimination
   25.5699 (  8.6%)   0.0000 (  0.0%)  Scalar Replacement of Aggregates
   21.7100 (  7.3%)   0.0299 ( 13.0%)  Function Integration/Inlining
    0.8299 (  0.2%)   0.0100 (  4.3%)  Combine redundant instructions
    0.0899 (  0.0%)   0.0100 (  4.3%)  Bytecode Writer
...

-Chris
------- Comment #4 From Chris Lattner 2004-02-03 16:37:07 -------
These two patches speed up SRoA/mem2reg from 42s to 0.62s on this testcase,
which makes it take time comperable to all of the non-really-slow optimizations.

http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011102.html
http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011103.html

-Chris
------- Comment #6 From Chris Lattner 2004-02-04 18:39:52 -------
This patch speeds up the GCSE pass by over 2x (57.14s -> 25.42s), by avoiding
unnecessary work in the -load-vn pass.  Note that without -load-vn, GCSE takes <
0.3 seconds to run.

http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011136.html

-Chris
------- Comment #7 From Chris Lattner 2004-02-04 18:46:43 -------
Overall, the patches listed in this trail of this bug have fixed all of the
noted efficiency problems in optimizers when they handle large basic blocks.

Specifically, the inliner sped up from 21.7s -> 0.82s, SRoA sped up from 25.5s
-> 0.46s, and loadvn/GCSE sped up from 246.3s -> 43.3s.  This reduced the total
time to gccas the testcase from >300s to 45s.   All of these numbers are with a
release build.

As noted in the bug, the timings for loadvn/GCSE are still quite unreasonable,
but this isn't specific to large-basic-block programs, and requires
infrastructure improvements to fix (specifically GCSE needs a way to update
VN/AA analyses on the fly).  This will have to happen sometime in the future.

-Chris
------- Comment #8 From Chris Lattner 2004-02-04 22:59:07 -------
Hrm, on a whim, I tried running the testcase with -no-aa -load-vn -gcse, showing
that alias analysis is taking a trivial amount of time < 1s, so the
"infrastructure problem" that I blamed for the slow GCSE is not really the problem.

Instead, it looks like load-vn is just being really silly, and needs to be
fixed.  *sigh*

-Chris
------- Comment #9 From Chris Lattner 2004-02-05 11:29:51 -------
Okay, with this final patch:
http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040202/011170.html

The optimizer is finally able to handle the testcase respectably, taking ~2s to
optimize it (as opposed to ~300s originally).  I now get the following
-time-passes output:

  Total Execution Time: 2.0800 seconds (2.0652 wall clock)

   ---User Time---   --System Time--   --- Name ---
   0.8200 ( 39.6%)   0.0099 (100.0%)   Function Integration/Inlining
   0.5100 ( 24.6%)   0.0000 (  0.0%)   Global Common Subexpression Elimination
   0.4699 ( 22.7%)   0.0000 (  0.0%)   Scalar Replacement of Aggregates
   0.0599 (  2.8%)   0.0000 (  0.0%)   Bytecode Writer
   0.0500 (  2.4%)   0.0000 (  0.0%)   Combine redundant instructions
   ...

For reference, 'g++ -O2' takes 91s on the same testcase (with GCC 3.3).

-Chris

First Last Prev Next    No search results available