tip compiler speed


roger peppe

unread,
May 19, 2015, 6:06:54 AM5/19/15
to golang-dev
I like to use tip to make sure all our stuff
still works there, but the compiler really
is quite a bit slower. How much speed improvement
can we hope for in Go 1.5?

Here's one example:

using go1.4.2:

% rm -r $GOPATH/pkg
% go-release go version
go version go1.4.2 linux/amd64
% time go-release go test -i
59.77user 11.08system 0:21.59elapsed 328%CPU (0avgtext+0avgdata
907840maxresident)k
43064inputs+348368outputs (7major+4004376minor)pagefaults 0swaps
% time go-release go test -c
5.74user 0.82system 0:06.65elapsed 98%CPU (0avgtext+0avgdata 862656maxresident)k
45056inputs+141240outputs (2major+320077minor)pagefaults 0swaps

using tip (b21ff39):

% rm -r $GOPATH/pkg
% go version
go version devel +b21ff39 Tue May 19 02:18:40 2015 +0000 linux/amd64
% time go test -i
217.04user 14.11system 1:17.14elapsed 299%CPU (0avgtext+0avgdata
1502748maxresident)k
13872inputs+373848outputs (61major+4461224minor)pagefaults 0swaps
% time go test -c
130.21user 8.25system 0:56.48elapsed 245%CPU (0avgtext+0avgdata
888336maxresident)k
120inputs+315560outputs (2major+2591109minor)pagefaults 0swaps

(to be fair, the 56.48 second case there seems to be an outlier;
the mode seems to be around 16s)

Russ Cox

unread,
May 19, 2015, 10:03:13 AM5/19/15
to roger peppe, golang-dev
On Tue, May 19, 2015 at 6:06 AM, roger peppe <rogp...@gmail.com> wrote:
I like to use tip to make sure all our stuff
still works there, but the compiler really
is quite a bit slower. How much speed improvement
can we hope for in Go 1.5?

You should expect that what's at tip is roughly what Go 1.5 will look like, for most aspects of the tree, including this one. If you see something that seems very wrong, please let us know, but for raw compilation speed 2-3x slower is not uncommon right now. 6s vs 56s is too much, though, unless your machine was swapping. You said that was an outlier, so I'm not worried, but if you consistently see 10x, I'd like to hear more details.

There are a variety of contributing factors, and I intend to write more about the performance characteristics at some later date (but in time for the Go 1.5 release); right now the focus is on shaking out the remaining bugs.

Thanks.
Russ

roger peppe

unread,
May 19, 2015, 11:38:53 AM5/19/15
to Russ Cox, golang-dev
Thanks. A good motivator!

Michael Hudson-Doyle

unread,
May 19, 2015, 7:43:44 PM5/19/15
to roger peppe, golang-dev
It's a total cop-out, but if you have enough RAM, running with GOGC=off
will probably be faster. The linker, and I suspect the compiler, are a
bad case for a mark & sweep GC: lots of small objects with pointers
between them, relatively few of which ever become garbage.
Alternatively, setting GOMAXPROCS=2 makes things a bit faster,
presumably because the GC can then do some of its mostly fruitless
marking in a separate thread.

Cheers,
mwh

Dave Cheney

unread,
May 19, 2015, 7:59:19 PM5/19/15
to Michael Hudson-Doyle, roger peppe, golang-dev
I agree with Michael.

Notwithstanding the fantastic improvements to code gen and gc
performance that Russ, Rick, Keith, Dmitry and Austin have been doing,
the compiler is a poor fit for the gc because it creates an intricate
mesh of *Nodes, none of which are unreferenced until the end of the
program, but all of which must be scanned. This is all work that the
old 1.4 compilers didn't have to do because they did bump pointer
malloc'ing, and then let the OS clean up the mess when the program
exited.

Turning off the GC entirely sounds unattractive, but as *Nodes are
never expected to be freed, has there been any consideration of allocating
them in a huge [nnn]Node array in the bss (I know the linker currently gets
this wrong), or in off-heap memory mmap'ed with MAP_ANON?

Thanks

Dave

Keith Randall

unread,
May 19, 2015, 8:04:44 PM5/19/15
to Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I don't think you could allocate Nodes in off-heap memory, because they contain pointers to in-heap memory. You would have to move the points-to transitive closure off-heap as well, and I don't think that is feasible.

Keith Randall

unread,
May 19, 2015, 8:05:42 PM5/19/15
to Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
bss is feasible, though.  I'm not sure what it gains you (the Nodes need to be scanned either way), but might be worth a try.

minux

unread,
May 19, 2015, 8:35:37 PM5/19/15
to Keith Randall, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I did two experiments on a linux/amd64 system with enough RAM:

1. make.bash
baseline:
68.97user 5.58system 0:28.80elapsed 258%CPU (0avgtext+0avgdata 551400maxresident)k
0inputs+836904outputs (0major+3229617minor)pagefaults 0swaps

If I disable GC in cmd/internal/gc only:
56.80user 6.62system 0:24.54elapsed 258%CPU (0avgtext+0avgdata 1011984maxresident)k
200inputs+837224outputs (0major+4387728minor)pagefaults 0swaps

If I disable GC in cmd/internal/ld only:
65.33user 5.33system 0:27.78elapsed 254%CPU (0avgtext+0avgdata 560556maxresident)k
0inputs+837000outputs (0major+3413689minor)pagefaults 0swaps

If I disable GC in both gc and ld:
53.40user 6.67system 0:23.71elapsed 253%CPU (0avgtext+0avgdata 1012040maxresident)k
0inputs+837280outputs (0major+4568796minor)pagefaults 0swaps


baseline:
26.96user 1.70system 0:12.79elapsed 224%CPU (0avgtext+0avgdata 205500maxresident)k
0inputs+117928outputs (0major+1133347minor)pagefaults 0swaps

GC disabled in gc only:
15.93user 2.20system 0:08.43elapsed 214%CPU (0avgtext+0avgdata 405796maxresident)k
0inputs+117928outputs (0major+1736495minor)pagefaults 0swaps

GC disabled in ld only:
26.62user 1.71system 0:12.69elapsed 223%CPU (0avgtext+0avgdata 256320maxresident)k
0inputs+117944outputs (0major+1148041minor)pagefaults 0swaps

GC disabled in both gc and ld:
15.22user 2.25system 0:07.99elapsed 218%CPU (0avgtext+0avgdata 405812maxresident)k
0inputs+117936outputs (0major+1751690minor)pagefaults 0swaps



The conclusion is that disabling GC in gc has the largest benefit, as it can reduce run time
by as much as 1/3, but RSS roughly doubles.

The RSS increase suggests that gc still generates a lot of garbage; if we can eliminate
that, then disabling GC for gc makes sense.

Austin Clements

unread,
May 20, 2015, 2:56:38 PM5/20/15
to Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
On Tue, May 19, 2015 at 7:59 PM, Dave Cheney <da...@cheney.net> wrote:
Turning off the GC entirely sounds unattractive, but as *Nodes are
never expected to be freed, is there any consideration for allocating
them in a huge [nnn]Node in the bss (I know the linker currently gets
this wrong), or in off heap memory mmap'ed from MAP_ANON ?

I've been thinking along similar lines. On my task list is to try a batch allocator for *Node. It doesn't have to go in the BSS (it's probably better if it doesn't). Just allocate Nodes in large slices (probably doubling up to some many megabyte limit) and dole out *Nodes from these slices. The GC will still have to scan these, but the scan will be linear and thanks to an optimization Russ just put in for situations just like this, it will ignore self-pointers.
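The batch allocator described above might be sketched as follows (a hypothetical illustration, not the actual CL; a fixed slab size stands in for the doubling-up-to-a-limit refinement):

```go
package main

import "fmt"

// Node stands in for the compiler's gc.Node; the real struct is much larger.
type Node struct {
	Op          int
	Left, Right *Node
}

// nodeBatch doles out *Nodes from large backing slices, so the heap sees a
// few big, linearly scannable allocations instead of millions of small ones.
// Nodes are never freed, matching what the old C compilers did.
type nodeBatch struct {
	free []Node // unallocated Nodes remaining in the current slab
}

func (b *nodeBatch) new() *Node {
	if len(b.free) == 0 {
		b.free = make([]Node, 1024) // fresh slab
	}
	n := &b.free[0]
	b.free = b.free[1:]
	return n
}

func main() {
	var b nodeBatch
	x := b.new()
	y := b.new()
	x.Left = y
	fmt.Println(x != y) // prints: true (distinct nodes from the same slab)
}
```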

Feel free to beat me to doing this experiment.

Brad Fitzpatrick

unread,
May 20, 2015, 2:59:46 PM5/20/15
to Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
Josh experimented along these lines in https://go-review.googlesource.com/#/c/9418/ but with *Reg

It was kinda gross, but maybe the new self-pointer optimizations will make it more worth it and pay for the grossness.

Josh Bleecher Snyder

unread,
May 20, 2015, 3:18:31 PM5/20/15
to Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I suspect that this will not help (but have not tried and would love
to be wrong). Here's why:

Running the CPU profiler, 40% of the time is spent on the line
mcentral.go:217. This is setting up the freelist for newly allocated
spans. Perhaps this is an artifact of CPU profiling (?). If not, it
suggests that the compiler performance would be significantly improved
by making setting up new spans cheaper, perhaps by enabling an
alternative representation of spans that can cheaply represent n steps
of a fixed size in a row. Something vaguely like:

type gclink struct {
	n    int       // number of remaining sequential pointers; 0 means use next instead
	step uintptr   // offset of next sequential pointer
	next gclinkptr // next set of sequential pointers
}

I don't know enough about the memory allocator to know whether this
would help, how difficult it might be to implement, or what other
havoc it might wreak.

I did try the obvious thing of unrolling the loop 8x, and it
(probably) helped a tiny bit. Compiling the rotate0.go test, I saw a
borderline-significant speed-up of 1.37% (n=100, p=0.06). However, the
loop is memory-bound, not CPU-bound, so it is unsurprising that
unrolling wouldn't have much impact.

If this effect is not an artifact of CPU profiling, addressing it
might help all short-lived programs that allocate heavily and free
little.

-josh

Rob Pike

unread,
May 20, 2015, 3:59:59 PM5/20/15
to Josh Bleecher Snyder, Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I am in favor of reducing the amount of memory used, but caution strongly against introducing things like custom allocation methods to reduce GC overhead. It took me a while to understand this, as I once built custom allocators and thought I was making things better. I did it in C, where it's often necessary, so I started doing it in Go too. But I was thinking like a C programmer, not a Go programmer, and more important, introducing hard-to-find C-like bugs such as use-after-free and aliasing. The third or fourth time I had to track one of these down in a Go program (I am a slow learner), I realized the category error I was making.

GC is hard partly because it can never make a mistake, and when complaining about the overhead it's easy to forget how much time it saves in the long run, and how valuable perfection is in technology.

I don't want us to go back to tracking down memory corruption bugs.

You might counter by saying that in the compiler you know what memory is doing and these bugs won't happen, but a) you don't and b) others really don't and c) they will and d) it's a mistake to encourage thinking like this as a solution to Go performance problems. What we do as a team influences our users.

Let the GC do its job. Help it by reducing memory use but don't work around it by building custom allocators.

-rob


Niklas Schnelle

unread,
May 20, 2015, 5:20:18 PM5/20/15
to golan...@googlegroups.com
I think there was a quote attributed to Niklaus Wirth that a compiler optimization is only worthwhile if it makes the compiler so much faster that it offsets its own cost. Since compilers do little floating-point computation, that rule of thumb is of course not useful for many of today's optimizations, but I believe we are seeing something similar with GC here. Maybe having the compiler as a good hard case is actually good for development, much like compiling the Linux kernel drives a lot of VFS improvements. I can definitely live with a few slowdowns as the GC improves.

Brad Fitzpatrick

unread,
May 20, 2015, 5:37:12 PM5/20/15
to Rob Pike, Josh Bleecher Snyder, Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I think you're just replying to Josh because his message was last in the thread at the time, but to clarify: he was discussing speeding up the runtime's allocator, not user code playing games with its own allocators.

Rob Pike

unread,
May 20, 2015, 5:45:49 PM5/20/15
to Brad Fitzpatrick, Josh Bleecher Snyder, Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I'm not sure who said it (Austin?) but there was talk of a custom batch (arena?) allocator for Nodes.

-rob

Brad Fitzpatrick

unread,
May 20, 2015, 5:47:13 PM5/20/15
to Rob Pike, Josh Bleecher Snyder, Austin Clements, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
Yes, this thread contains multiple topics.

Josh, why don't you start a new thread just about type gclink?

Dave Cheney

unread,
May 20, 2015, 5:51:16 PM5/20/15
to Rob Pike, Brad Fitzpatrick, Josh Bleecher Snyder, Austin Clements, Michael Hudson-Doyle, roger peppe, golang-dev
Yup, that was me.

Message received and understood rob.

Austin Clements

unread,
May 20, 2015, 6:07:41 PM5/20/15
to Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
In most cases I completely agree.

However, I think a batch allocator for *Node is different. I'm suggesting that we never free them and that we "allocate" out of a []Node. Hence, there are no use-after-free bugs because there's no free, the allocator itself is type- and memory-safe, and there are no subtle aliasing bugs. And we happen to know we can get away with not freeing them since that's what the C version of the compiler did.

I think it's worth trying. It's an easy change, I'm just in the middle of another finicky change right now. Probably it won't help at all and the GC will be vindicated. If it does help, we can think deep thoughts and decide what to do.

Austin Clements

unread,
May 20, 2015, 7:21:23 PM5/20/15
to Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I did the experiment. Batch allocation doesn't help.

name        old mean             new mean             delta
6gGoTypes   1.33s × (0.96,1.05)  1.29s × (0.87,1.09)  -3.62% (p=0.016 n=19+18)
GoBuildStd  15.5s × (0.98,1.02)  16.6s × (0.99,1.02)  +6.88% (p=0.000 n=17+17)

where 6gGoTypes runs
$ cd go/types; go tool 6g $(go list -f '{{join .GoFiles " "}}')
and GoBuildStd runs
$ go build -a std

The CL is at https://golang.org/cl/10291/ if anyone's curious.

minux

unread,
May 20, 2015, 7:38:49 PM5/20/15
to Austin Clements, Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
On Wed, May 20, 2015 at 7:21 PM, 'Austin Clements' via golang-dev <golan...@googlegroups.com> wrote:
I did the experiment. Batch allocation doesn't help.
I did the experiment for batching allocation of obj.Prog; the result is similar.

make.bash time showed less than a 0.1% change in real time, and go/types compilation
time didn't change.

minux

unread,
May 20, 2015, 7:51:02 PM5/20/15
to Austin Clements, Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
On Wed, May 20, 2015 at 7:38 PM, minux <mi...@golang.org> wrote:
On Wed, May 20, 2015 at 7:21 PM, 'Austin Clements' via golang-dev <golan...@googlegroups.com> wrote:
I did the experiment. Batch allocation doesn't help.
I did the experiment for batching allocation of obj.Prog; the result is similar.

make.bash time showed less than a 0.1% change in real time, and go/types compilation
time didn't change.
I tried the extreme case of compiling the output of the rotate0 test, and
batching Prog showed a 9% improvement in real time.

Batching Node showed a 17% improvement in real time.

Combining the two patches, the improvement in real time is 25%.

So batching is effective for very big programs (the input has 23320 lines and the output
is 6MB).

Unfortunately, most real-life Go programs are not that huge.

David Crawshaw

unread,
May 21, 2015, 10:10:50 AM5/21/15
to Austin Clements, Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I tried a similar experiment. As well as batch allocating, I replaced
all instances of *Node with a 4-byte identifier. This has the downside
that a Node is basically never stack-allocated, and the upside that the
Node object is smaller for 6g.

Surprisingly, the results are a wash. Timing go build -a std shows
results within 5% of tip.

The diff is enormous, so I won't bother uploading it.


Russ Cox

unread,
May 21, 2015, 4:26:59 PM5/21/15
to David Crawshaw, Austin Clements, Rob Pike, Josh Bleecher Snyder, Dave Cheney, Michael Hudson-Doyle, roger peppe, golang-dev
I want to echo what Rob said. Long-term solutions like reducing the size of unnecessarily large structs are good. Short-term overfitting to the current implementation, such as batching allocations or other tricks, is not as good. Unsafe tricks like moving chunks into persistent memory and the like are actively harmful: we are using Go precisely to avoid the kinds of bugs that can happen when you do manual memory management. Luckily, I think there is still plenty of gain to be had from good long-term solutions. Not all of it will happen for Go 1.5 of course.

Russ

Michael Hudson-Doyle

unread,
May 21, 2015, 9:44:52 PM5/21/15
to roger peppe, golang-dev
I spent a little while poking at the linker, rather than the compiler.
I tried a few things: shrinking LSym a couple of ways, by replacing
linked lists with slices (though I didn't replace the ones that get
sorted by listsort) and by combining boolean fields into a flags field,
and optimizing for the version==0 case in Linklookup. Nothing really
made a significant difference, and I won't bother uploading the patches
unless anyone is curious.

My observations:

1) Generating DWARF is slow (about 25% of runtime). I don't
understand this code at all and possibly there's some low hanging
fruit in there.
2) loading object files is slow (about 50% of runtime). We spend a
surprising amount of time (maybe 15% of runtime) in expandpkg!
3) Linklookup accounts for another 15% of runtime. Maybe having a
symbol table and referring to symbols by index would be faster? It would
be a huge and very ugly change to do throughout the linker, but maybe
it could work while loading an object file.

My unscientific feeling is that making the linker significantly faster
would require changing the object file format.

Hope these ramblings are useful/interesting.

Cheers,
mwh



Josh Bleecher Snyder

unread,
May 21, 2015, 9:54:42 PM5/21/15
to Michael Hudson-Doyle, roger peppe, golang-dev
> 1) Generating DWARF is slow (about 25% of runtime). I don't
> understand this code at all and possibly there's some low hanging
> fruit in there.

One cheap but I suspect unpalatable fix is to not generate DWARF by
default. This would also considerably shrink binaries.


> 2) loading object files is slow (about 50% of runtime). We spend a
> surprising amount of time (maybe 15% of runtime) in expandpkg!

There are a ton of []byte/string conversions going on in that code,
many of them unnecessary. Changing to use []byte more thoroughly would
help; I started on that but the change metastasized. And using
bytes.Replacer (golang.org/issue/9905) in expandpkg could speed it up
considerably. All too involved for 1.5, though, and I maintain hope
that cmd/newlink will make all this code obsolete one day.
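The Replacer idea can be sketched as follows. Note that bytes.Replacer was only a proposal (issue 9905); this sketch uses the existing strings.NewReplacer, and assumes the old linker's symbol-naming convention where the placeholder `"".` stands for the package being compiled:

```go
package main

import (
	"fmt"
	"strings"
)

// newExpander builds a reusable replacer that expands the `"".` import-path
// placeholder in symbol names to the real package path, the job expandpkg
// does per symbol. Building the Replacer once amortizes its setup cost across
// every symbol in an object file. (Sketch only; the real linker works on
// []byte buffers rather than strings.)
func newExpander(pkgpath string) *strings.Replacer {
	return strings.NewReplacer(`"".`, pkgpath+".")
}

func main() {
	expand := newExpander("cmd/internal/ld")
	fmt.Println(expand.Replace(`"".Linklookup`)) // prints: cmd/internal/ld.Linklookup
}
```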

-josh

andrewc...@gmail.com

unread,
May 21, 2015, 10:31:05 PM5/21/15
to golan...@googlegroups.com, rogp...@gmail.com

It's a total cop out, but if you have enough RAM running with GOGC=off
will probably be faster.

It would be interesting to see some of the peak memory usage stats. I think if people really care enough, they can do this themselves in a build script.

Aram Hăvărneanu

unread,
May 21, 2015, 10:38:43 PM5/21/15
to Josh Bleecher Snyder, Michael Hudson-Doyle, roger peppe, golang-dev
On Fri, May 22, 2015 at 3:53 AM, Josh Bleecher Snyder
<josh...@gmail.com> wrote:
> One cheap but I suspect unpalatable fix is to not generate DWARF by
> default. This would also considerably shrink binaries.

I am very saddened to hear this suggestion.

--
Aram Hăvărneanu