Commit 39ce5907 authored by Keith Randall's avatar Keith Randall

cmd/compile: rotate loops so conditional branch is at the end

Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: default avatarDavid Chase <drchase@google.com>
parent 6a48019e
......@@ -369,6 +369,7 @@ var passes = [...]pass{
{name: "late nilcheck", fn: nilcheckelim2},
{name: "flagalloc", fn: flagalloc, required: true}, // allocate flags register
{name: "regalloc", fn: regalloc, required: true}, // allocate int & float registers + stack slots
{name: "loop rotate", fn: loopRotate},
{name: "stackframe", fn: stackframe, required: true},
{name: "trim", fn: trim}, // remove empty blocks
}
......@@ -427,6 +428,8 @@ var passOrder = [...]constraint{
{"schedule", "flagalloc"},
// regalloc needs flags to be allocated first.
{"flagalloc", "regalloc"},
// loopRotate will confuse regalloc.
{"regalloc", "loop rotate"},
// stackframe needs to know about spilled registers.
{"regalloc", "stackframe"},
// trim needs regalloc to be done first.
......
......@@ -16,11 +16,6 @@ type loop struct {
children []*loop // loops nested directly within this loop. Initialized by assembleChildren().
exits []*Block // exits records blocks reached by exits from this loop. Initialized by findExits().
// Loops aren't that common, so rather than force regalloc to keep
// a map or slice for its data, just put it here.
spills []*Value
scratch int32
// Next three fields used by regalloc and/or
// aid in computation of inner-ness and list of blocks.
nBlocks int32 // Number of blocks in this loop but not within inner loops
......
// Copyright 2017 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package ssa
// loopRotate converts loops with a check-loop-condition-at-beginning
// to loops with a check-loop-condition-at-end.
// This helps loops avoid extra unnecessary jumps.
//
// loop:
// CMPQ ...
// JGE exit
// ...
// JMP loop
// exit:
//
// JMP entry
// loop:
// ...
// entry:
// CMPQ ...
// JLT loop
func loopRotate(f *Func) {
loopnest := f.loopnest()
if len(loopnest.loops) == 0 {
return
}
// Set of blocks we're moving, by ID.
move := map[ID]struct{}{}
// Map from block ID to the moving block that should
// come right after it.
after := map[ID]*Block{}
// Check each loop header and decide if we want to move it.
for _, loop := range loopnest.loops {
b := loop.header
var p *Block // b's in-loop predecessor
for _, e := range b.Preds {
if e.b.Kind != BlockPlain {
continue
}
if loopnest.b2l[e.b.ID] != loop {
continue
}
p = e.b
}
if p == nil || p == b {
continue
}
// Place b after p.
move[b.ID] = struct{}{}
after[p.ID] = b
}
// Move blocks to their destinations in a single pass.
// We rely here on the fact that loop headers must come
// before the rest of the loop. And that relies on the
// fact that we only identify reducible loops.
j := 0
for i, b := range f.Blocks {
if _, ok := move[b.ID]; ok {
continue
}
f.Blocks[j] = b
j++
if a := after[b.ID]; a != nil {
if j > i {
f.Fatalf("head before tail in loop %s", b)
}
f.Blocks[j] = a
j++
}
}
if j != len(f.Blocks) {
f.Fatalf("bad reordering in looprotate")
}
}
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment