SSA performance numbers

Keith Randall

unread,

Mar 11, 2016, 12:09:00 AM3/11/16

to golang-dev

Here's the current performance numbers for the go1 benchmarks, comparing tip relative to go1.6:

name old time/op new time/op delta

BinaryTree17-12 3.16s ±14% 2.77s ± 3% -12.33% (p=0.000 n=10+9)

Fannkuch11-12 2.68s ± 1% 2.21s ± 1% -17.72% (p=0.000 n=10+9)

FmtFprintfEmpty-12 52.5ns ± 3% 50.2ns ± 2% -4.42% (p=0.000 n=10+10)

FmtFprintfString-12 175ns ± 1% 163ns ± 0% -6.55% (p=0.000 n=9+9)

FmtFprintfInt-12 170ns ± 6% 160ns ± 0% -5.77% (p=0.000 n=10+8)

FmtFprintfIntInt-12 272ns ± 0% 260ns ± 1% -4.62% (p=0.000 n=8+9)

FmtFprintfPrefixedInt-12 245ns ± 1% 228ns ± 0% -7.11% (p=0.000 n=9+8)

FmtFprintfFloat-12 328ns ± 1% 303ns ± 1% -7.48% (p=0.000 n=9+8)

FmtManyArgs-12 1.06µs ± 2% 0.96µs ± 0% -9.61% (p=0.000 n=10+9)

GobDecode-12 8.20ms ± 3% 7.75ms ± 2% -5.44% (p=0.000 n=8+9)

GobEncode-12 6.99ms ± 1% 6.42ms ± 2% -8.14% (p=0.000 n=10+9)

Gzip-12 336ms ± 1% 289ms ± 1% -13.97% (p=0.000 n=9+9)

Gunzip-12 44.2ms ± 1% 40.7ms ± 1% -7.84% (p=0.000 n=9+10)

HTTPClientServer-12 223µs ± 7% 208µs ± 6% -6.56% (p=0.000 n=10+10)

JSONEncode-12 17.7ms ± 1% 16.7ms ± 1% -5.58% (p=0.000 n=8+9)

JSONDecode-12 58.6ms ± 0% 54.3ms ± 1% -7.24% (p=0.000 n=9+8)

Mandelbrot200-12 4.18ms ± 3% 4.40ms ± 1% +5.24% (p=0.000 n=10+9)

GoParse-12 3.74ms ± 2% 3.56ms ± 2% -4.84% (p=0.000 n=9+10)

RegexpMatchEasy0_32-12 102ns ± 1% 79ns ± 0% -22.93% (p=0.000 n=10+10)

RegexpMatchEasy0_1K-12 352ns ± 1% 251ns ± 0% -28.60% (p=0.000 n=8+10)

RegexpMatchEasy1_32-12 87.5ns ± 0% 76.5ns ± 1% -12.60% (p=0.000 n=8+10)

RegexpMatchEasy1_1K-12 515ns ± 1% 410ns ± 3% -20.37% (p=0.000 n=10+10)

RegexpMatchMedium_32-12 138ns ± 1% 122ns ± 1% -11.50% (p=0.000 n=10+10)

RegexpMatchMedium_1K-12 40.4µs ± 1% 37.0µs ± 0% -8.43% (p=0.000 n=10+10)

RegexpMatchHard_32-12 2.19µs ± 1% 1.94µs ± 4% -11.42% (p=0.000 n=8+10)

RegexpMatchHard_1K-12 66.2µs ± 1% 58.7µs ± 2% -11.32% (p=0.000 n=9+10)

Revcomp-12 594ms ± 1% 416ms ± 8% -29.96% (p=0.000 n=8+10)

Template-12 70.5ms ± 1% 65.0ms ± 1% -7.87% (p=0.000 n=8+10)

TimeParse-12 358ns ± 1% 341ns ± 0% -4.55% (p=0.000 n=10+9)

TimeFormat-12 363ns ± 1% 369ns ± 7% ~ (p=0.186 n=9+10)

Attached are the same numbers in graphical form.

Tip compiler (with SSA internal checks off) is about 7% slower than go1.6 to compile net/http (go test -a -c -gcflags=-d=ssa/check/off net/http)

go1.png

Robert Griesemer

unread,

Mar 11, 2016, 12:26:17 AM3/11/16

to Keith Randall, golang-dev

Great improvements!

- gri

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brendan Tracey

unread,

Mar 11, 2016, 12:30:48 AM3/11/16

to Robert Griesemer, Keith Randall, golang-dev

Here was my post earlier today on issue 14511. Ddot is a numeric benchmark

for i, v := range a {

sum += v * b[i]

}

though, a and b can have strides instead of being only linear

DdotSmallBothUnitary-8   17.6ns ± 1%  15.6ns ± 2%  -11.28%  (p=0.008 n=5+5)
DdotSmallIncUni-8        21.9ns ± 1%  21.9ns ± 1%     ~     (p=0.952 n=5+5)
DdotSmallUniInc-8        21.2ns ± 1%  20.2ns ± 0%   -4.54%  (p=0.000 n=5+4)
DdotSmallBothInc-8       21.1ns ± 0%  20.8ns ± 1%   -1.42%  (p=0.016 n=5+5)
DdotMediumBothUnitary-8   851ns ± 1%   843ns ± 1%   -1.01%  (p=0.032 n=5+5)
DdotMediumIncUni-8       1.17µs ± 1%  0.95µs ± 0%  -18.32%  (p=0.008 n=5+5)
DdotMediumUniInc-8       1.12µs ± 0%  0.86µs ± 1%  -22.91%  (p=0.008 n=5+5)
DdotMediumBothInc-8      1.21µs ± 1%  0.99µs ± 2%  -18.72%  (p=0.008 n=5+5)
DdotLargeBothUnitary-8   85.9µs ± 1%  83.0µs ± 1%   -3.33%  (p=0.008 n=5+5)
DdotLargeIncUni-8         169µs ± 1%   154µs ± 1%   -8.97%  (p=0.008 n=5+5)
DdotLargeUniInc-8         121µs ± 1%   106µs ± 1%  -11.99%  (p=0.008 n=5+5)
DdotLargeBothInc-8        241µs ± 1%   230µs ± 1%   -4.26%  (p=0.008 n=5+5)
DdotHugeBothUnitary-8    10.6ms ± 1%  10.1ms ± 1%   -4.59%  (p=0.008 n=5+5)
DdotHugeIncUni-8         25.8ms ± 1%  25.6ms ± 3%     ~     (p=0.151 n=5+5)
DdotHugeUniInc-8         17.7ms ± 1%  16.7ms ± 1%   -5.81%  (p=0.016 n=5+4)
DdotHugeBothInc-8        33.0ms ± 0%  33.2ms ± 1%     ~     (p=0.151 n=5+5)

oju...@gmail.com

unread,

Mar 11, 2016, 7:35:19 AM3/11/16

to golang-dev

Congratulations, Keith. The speedup of generated code is truly noteworthy. I think the performance of the compiler proper can be improved over time, as you have proved more than once these days.

reini...@gmail.com

unread,

Mar 11, 2016, 11:30:22 AM3/11/16

to golang-dev

Am Freitag, 11. März 2016 06:09:00 UTC+1 schrieb Keith Randall:

Here's the current performance numbers for the go1 benchmarks, comparing tip relative to go1.6:

name old time/op new time/op delta

...

Mandelbrot200-12 4.18ms ± 3% 4.40ms ± 1% +5.24% (p=0.000 n=10+9)

Nice

This is the only outlier, and 4.18ms is long enough that it cannot be the SSA construction overhead alone.

Did you check what's the deal with that?

Keith Randall

unread,

Mar 11, 2016, 11:48:48 AM3/11/16

to reini...@gmail.com, golang-dev

Mandelbrot has a ~15 instruction inner loop. The old compiler's peephole

optimizer can get down to one less reg->reg move than the SSA compiler can.

I hope to get that out with another tweak to register allocation.

Note the go1 timings do not include compile time.

vess...@gmail.com

unread,

Mar 11, 2016, 1:09:36 PM3/11/16

to golang-dev

Keith, this is great.

I ran tip against our server tool which has the following workload:

Load and parse roughly 100GB of data into 200GB or RAM

Do some full dataset computations for caching

The code is heavily parallelized and reasonably well tuned. We instrumented it to exit after 'ready' status was achieved, and used the time utility to time it, so these numbers include freeing up OS RAM at end of the run.

I ran each version of the binary once and then threw away the numbers to let Linux do what it could on caching the file reads. Then I ran each three times. The results look great for tip.

Results vs go1.6:

Go 1.6 Tip

------------- ---------

25.2s 22.2s

27.3s 22.1s

25.3s 22.2s

So this is a big win for us, thanks, and it's nice to see these improvements show up in a large codebase with varied kinds of work.

There do seem to be some performance regressions for a few functions, but it might be noise, or old data on our part. If I can find any areas that seem much slower and can get it drilled down to something repeatable, I will follow up. In any event, these numbers include those regressions (if any) so there might be even more good news ahead.

Peter

Keith Randall

unread,

Mar 11, 2016, 1:16:31 PM3/11/16

to vess...@gmail.com, golang-dev

On Fri, Mar 11, 2016 at 10:06 AM, <vess...@gmail.com> wrote:

Keith, this is great.

I ran tip against our server tool which has the following workload:

Load and parse roughly 100GB of data into 200GB or RAM
Do some full dataset computations for caching

The code is heavily parallelized and reasonably well tuned. We instrumented it to exit after 'ready' status was achieved, and used the time utility to time it, so these numbers include freeing up OS RAM at end of the run.

I ran each version of the binary once and then threw away the numbers to let Linux do what it could on caching the file reads. Then I ran each three times. The results look great for tip.

Results vs go1.6:

Go 1.6 Tip
------------- ---------
25.2s 22.2s
27.3s 22.1s
25.3s 22.2s

So this is a big win for us, thanks, and it's nice to see these improvements show up in a large codebase with varied kinds of work.

There do seem to be some performance regressions for a few functions, but it might be noise, or old data on our part. If I can find any areas that seem much slower and can get it drilled down to something repeatable, I will follow up. In any event, these numbers include those regressions (if any) so there might be even more good news ahead.