Fast and dynamic encoding of Protocol Buffers in Go

Vincent Bernat

Protocol Buffers are a popular choice for serializing structured data due to their compact size, fast processing speed, language independence, and backward and forward compatibility. Alternatives exist, including Cap’n Proto, CBOR, and Avro.

Usually, data structures are described in a proto definition file (.proto). The protoc compiler and a language-specific plugin convert it into code:

$ head flow-4.proto
syntax = "proto3";
package decoder;
option go_package = "akvorado/inlet/flow/decoder";

message FlowMessagev4 {

  uint64 TimeReceived = 2;
  uint32 SequenceNum = 3;
  uint64 SamplingRate = 4;
  uint32 FlowDirection = 5;
$ protoc -I=. --plugin=protoc-gen-go --go_out=module=akvorado:. flow-4.proto
$ head inlet/flow/decoder/flow-4.pb.go
// Code generated by protoc-gen-go. DO NOT EDIT.
// versions:
//      protoc-gen-go v1.28.0
//      protoc        v3.21.12
// source: inlet/flow/data/schemas/flow-4.proto

package decoder

import (
        protoreflect "google.golang.org/protobuf/reflect/protoreflect"

Akvorado collects network flows using IPFIX or sFlow, decodes them with GoFlow2, encodes them to Protocol Buffers, and sends them to Kafka to be stored in a ClickHouse database. Collecting a new field, such as source and destination MAC addresses, requires modifications in multiple places, including the proto definition file and the ClickHouse migration code. Moreover, the cost is paid by all users.1 It would be nice to have an application-wide schema and let users enable or disable the fields they need.

While the main goal is flexibility, we do not want to sacrifice performance. On this front, the effort is quite a success: when upgrading from 1.6.4 to 1.7.1, decoding and encoding performance almost doubled! 🤗

goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ initial.txt  │              final.txt              │
                            │    sec/op    │   sec/op     vs base                │
Netflow/with_encoding-12      12.963µ ± 2%   7.836µ ± 1%  -39.55% (p=0.000 n=10)
Sflow/with_encoding-12         19.37µ ± 1%   10.15µ ± 2%  -47.63% (p=0.000 n=10)

Faster Protocol Buffers encoding

I use the following code to benchmark both the decoding and encoding process. Initially, the Decode() method is a thin layer above the GoFlow2 producer and stores the decoded data in the in-memory structure generated by protoc. Later, some of the data will be encoded directly during flow decoding. This is why we measure both the decoding and the encoding.2

func BenchmarkDecodeEncodeSflow(b *testing.B) {
    r := reporter.NewMock(b)
    sdecoder := sflow.New(r)
    data := helpers.ReadPcapPayload(b,
        filepath.Join("decoder", "sflow", "testdata", "data-1140.pcap"))

    for _, withEncoding := range []bool{true, false} {
        title := map[bool]string{
            true:  "with encoding",
            false: "without encoding",
        }[withEncoding]
        var got []*decoder.FlowMessage
        b.Run(title, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                got = sdecoder.Decode(decoder.RawFlow{
                    Payload: data,
                    Source: net.ParseIP("127.0.0.1"),
                })
                if withEncoding {
                    for _, flow := range got {
                        buf := []byte{}
                        buf = protowire.AppendVarint(buf, uint64(proto.Size(flow)))
                        proto.MarshalOptions{}.MarshalAppend(buf, flow)
                    }
                }
            }
        })
    }
}

The canonical Go implementation for Protocol Buffers, google.golang.org/protobuf, is not the most efficient one. For a long time, people were relying on gogoprotobuf. However, the project is now deprecated. A good replacement is vtprotobuf.3
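Adopting it is mostly mechanical: vtprotobuf ships as an additional protoc plugin that generates extra methods on the same message types. Assuming the messages are regenerated with its marshal and size features, the size-prefixed encoding from the benchmark above could be written with the generated methods, as in this sketch:

// Sketch only: prefix the message with its size, then append the bytes
// produced by the vtprotobuf-generated MarshalVT(), which avoids reflection.
buf := protowire.AppendVarint(nil, uint64(flow.SizeVT()))
encoded, err := flow.MarshalVT()
if err != nil {
    // handle the error in the real code path
}
buf = append(buf, encoded...)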

goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ initial.txt │             bench-2.txt             │
                            │   sec/op    │   sec/op     vs base                │
Netflow/with_encoding-12      12.96µ ± 2%   10.28µ ± 2%  -20.67% (p=0.000 n=10)
Netflow/without_encoding-12   8.935µ ± 2%   8.975µ ± 2%        ~ (p=0.143 n=10)
Sflow/with_encoding-12        19.37µ ± 1%   16.67µ ± 2%  -13.93% (p=0.000 n=10)
Sflow/without_encoding-12     14.62µ ± 3%   14.87µ ± 1%   +1.66% (p=0.007 n=10)

Dynamic Protocol Buffers encoding

We have our baseline. Let’s see how to encode our Protocol Buffers without a .proto file. The wire format is simple and relies heavily on variable-width integers.

Variable-width integers, or varints, are an efficient way of encoding unsigned integers using a variable number of bytes, from one to ten, with small values using fewer bytes. They work by splitting integers into 7-bit payloads and using the 8th bit as a continuation indicator, set to 1 for all payloads except the last.

[Figure: variable-width integer encoding in Protocol Buffers, showing the conversion of 150 to a varint]
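As a quick sanity check of this scheme, the protowire package (google.golang.org/protobuf/encoding/protowire) can be used to encode 150 and inspect the resulting bytes:

// 150 is 0b1001_0110: the low 7 bits go into the first byte with the
// continuation bit set (0x96), the remaining bit into the second byte (0x01).
buf := protowire.AppendVarint(nil, 150)
fmt.Printf("%#v\n", buf) // []byte{0x96, 0x1}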

For our usage, we only need two types: variable-width integers and byte sequences. A byte sequence is encoded by prefixing it with its length as a varint. When a message is encoded, each key-value pair is turned into a record consisting of a field number, a wire type, and a payload. The field number and the wire type are encoded together as a single variable-width integer, called a tag.

[Figure: a message encoded with Protocol Buffers, consisting of three varints and two byte sequences]

We use low-level functions from the protowire package, such as protowire.AppendTag(), protowire.AppendVarint(), and protowire.AppendBytes(), to build the output buffer.
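For illustration, here is how a record with one varint field and one byte-sequence field could be assembled with these functions; the field numbers are arbitrary:

buf := []byte{}
// Field 3, wire type varint: the tag packs the field number and the wire type.
buf = protowire.AppendTag(buf, 3, protowire.VarintType)
buf = protowire.AppendVarint(buf, 302)
// Field 6, wire type bytes: the payload is automatically prefixed with its length.
buf = protowire.AppendTag(buf, 6, protowire.BytesType)
buf = protowire.AppendBytes(buf, []byte("observed"))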

Our schema abstraction contains the appropriate information to encode a message (ProtobufIndex) and to generate a proto definition file (fields starting with Protobuf):

type Column struct {
    Key       ColumnKey
    Name      string
    Disabled  bool

    // […]
    // For protobuf.
    ProtobufIndex    protowire.Number
    ProtobufType     protoreflect.Kind // Uint64Kind, Uint32Kind, …
    ProtobufEnum     map[int]string
    ProtobufEnumName string
    ProtobufRepeated bool
}
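For illustration, a column definition could look like the following; the exact values are assumptions rather than the real schema:

// Hypothetical schema entry; the field number and kind are made up.
column := Column{
    Key:           ColumnSrcPort,
    Name:          "SrcPort",
    ProtobufIndex: 15,
    ProtobufType:  protoreflect.Uint32Kind,
}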

We have a few helper methods around the protowire functions to directly encode the fields while decoding the flows. They skip disabled fields or non-repeated fields already encoded. Here is an excerpt of the sFlow decoder:

sch.ProtobufAppendVarint(bf, schema.ColumnBytes, uint64(recordData.Base.Length))
sch.ProtobufAppendVarint(bf, schema.ColumnProto, uint64(recordData.Base.Protocol))
sch.ProtobufAppendVarint(bf, schema.ColumnSrcPort, uint64(recordData.Base.SrcPort))
sch.ProtobufAppendVarint(bf, schema.ColumnDstPort, uint64(recordData.Base.DstPort))
sch.ProtobufAppendVarint(bf, schema.ColumnEType, helpers.ETypeIPv4)
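A minimal sketch of what one of these helpers could look like, assuming the schema can look up a column by its key (the lookup helper is hypothetical):

// Hypothetical sketch: append a varint field unless the column is disabled
// or a non-repeated column has already been encoded for this flow.
func (schema *Schema) ProtobufAppendVarint(bf *FlowMessage, key ColumnKey, value uint64) {
    column := schema.LookupColumnByKey(key) // assumed lookup helper
    if column.Disabled {
        return
    }
    if !column.ProtobufRepeated && bf.protobufSet.Test(uint(column.ProtobufIndex)) {
        return
    }
    bf.protobuf = protowire.AppendTag(bf.protobuf, column.ProtobufIndex, protowire.VarintType)
    bf.protobuf = protowire.AppendVarint(bf.protobuf, value)
    bf.protobufSet.Set(uint(column.ProtobufIndex))
}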

Fields that are required later in the pipeline, such as source and destination addresses, are stored unencoded in a separate structure:

type FlowMessage struct {
    TimeReceived uint64
    SamplingRate uint32

    // For exporter classifier
    ExporterAddress netip.Addr

    // For interface classifier
    InIf  uint32
    OutIf uint32

    // For geolocation or BMP
    SrcAddr netip.Addr
    DstAddr netip.Addr
    NextHop netip.Addr

    // Core component may override them
    SrcAS     uint32
    DstAS     uint32
    GotASPath bool

    // protobuf is the protobuf representation for the information not contained above.
    protobuf      []byte
    protobufSet   bitset.BitSet
}

The protobuf slice holds the encoded data. It is initialized with a capacity of 500 bytes to avoid resizing during encoding. Some room is also reserved at the beginning so that the total size can be encoded as a variable-width integer. When encoding is finalized, the remaining fields are added and the message length is prefixed:

func (schema *Schema) ProtobufMarshal(bf *FlowMessage) []byte {
    schema.ProtobufAppendVarint(bf, ColumnTimeReceived, bf.TimeReceived)
    schema.ProtobufAppendVarint(bf, ColumnSamplingRate, uint64(bf.SamplingRate))
    schema.ProtobufAppendIP(bf, ColumnExporterAddress, bf.ExporterAddress)
    schema.ProtobufAppendVarint(bf, ColumnSrcAS, uint64(bf.SrcAS))
    schema.ProtobufAppendVarint(bf, ColumnDstAS, uint64(bf.DstAS))
    schema.ProtobufAppendIP(bf, ColumnSrcAddr, bf.SrcAddr)
    schema.ProtobufAppendIP(bf, ColumnDstAddr, bf.DstAddr)

    // Add the length and move it as a prefix: the varint-encoded payload
    // length is appended after the payload, then copied into the bytes
    // immediately preceding it; the returned slice starts at this prefix.
    end := len(bf.protobuf)
    payloadLen := end - maxSizeVarint
    bf.protobuf = protowire.AppendVarint(bf.protobuf, uint64(payloadLen))
    sizeLen := len(bf.protobuf) - end
    result := bf.protobuf[maxSizeVarint-sizeLen : end]
    copy(result, bf.protobuf[end:end+sizeLen])

    return result
}
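The reserved room mentioned above can be pictured through the assumed initialization of the buffer, with maxSizeVarint being the maximum length of a varint:

// A uint64 varint needs at most 10 bytes.
const maxSizeVarint = 10

// Sketch: the first maxSizeVarint bytes are placeholders for the length
// prefix; the 500-byte capacity avoids reallocations for typical messages.
bf.protobuf = make([]byte, maxSizeVarint, 500)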

Minimizing allocations is critical for maintaining encoding performance. The benchmarks should be run with the -benchmem flag to monitor allocation counts: each allocation puts additional pressure on the garbage collector. The Go profiler is a valuable tool for identifying areas of code that can be optimized:

$ go test -run=__nothing__ -bench=Netflow/with_encoding \
>         -benchmem -cpuprofile profile.out \
>         akvorado/inlet/flow
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
Netflow/with_encoding-12             143953              7955 ns/op            8256 B/op        134 allocs/op
PASS
ok      akvorado/inlet/flow     1.418s
$ go tool pprof profile.out
File: flow.test
Type: cpu
Time: Feb 4, 2023 at 8:12pm (CET)
Duration: 1.41s, Total samples = 2.08s (147.96%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web

After switching to the internal schema instead of the code generated from the proto definition file, performance improved. However, the comparison is not entirely fair: less information is being decoded, and GoFlow2 was previously decoding into its own structure, which was then copied into ours.

goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ bench-2.txt  │             bench-3.txt             │
                            │    sec/op    │   sec/op     vs base                │
Netflow/with_encoding-12      10.284µ ± 2%   7.758µ ± 3%  -24.56% (p=0.000 n=10)
Netflow/without_encoding-12    8.975µ ± 2%   7.304µ ± 2%  -18.61% (p=0.000 n=10)
Sflow/with_encoding-12         16.67µ ± 2%   14.26µ ± 1%  -14.50% (p=0.000 n=10)
Sflow/without_encoding-12      14.87µ ± 1%   13.56µ ± 2%   -8.80% (p=0.000 n=10)

As for testing, we use github.com/jhump/protoreflect: the protoparse package parses the proto definition file we generate and the dynamic package decodes the messages. Check the ProtobufDecode() method for more details.4
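A sketch of what this looks like with github.com/jhump/protoreflect, using the protoparse and dynamic packages; the file name, message name, and the lack of error handling around the descriptor lookup are assumptions:

// Hypothetical test helper: parse the generated definition and decode a
// raw message dynamically (without the length prefix added by ProtobufMarshal).
func decodeDynamically(payload []byte) (*dynamic.Message, error) {
    parser := protoparse.Parser{}
    fds, err := parser.ParseFiles("flow.proto") // assumed file name
    if err != nil {
        return nil, err
    }
    md := fds[0].FindMessage("decoder.FlowMessage") // assumed message name
    msg := dynamic.NewMessage(md)
    if err := msg.Unmarshal(payload); err != nil {
        return nil, err
    }
    return msg, nil
}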

To get the final figures, I also optimized the decoding in GoFlow2. It relied heavily on binary.Read(). This function may use reflection in some cases and each call allocates a byte array to read data. Replacing it with a more efficient version provides the following improvement:

goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
                            │ bench-3.txt  │             bench-4.txt             │
                            │    sec/op    │   sec/op     vs base                │
Netflow/with_encoding-12       7.758µ ± 3%   7.365µ ± 2%   -5.07% (p=0.000 n=10)
Netflow/without_encoding-12    7.304µ ± 2%   6.931µ ± 3%   -5.11% (p=0.000 n=10)
Sflow/with_encoding-12        14.256µ ± 1%   9.834µ ± 2%  -31.02% (p=0.000 n=10)
Sflow/without_encoding-12     13.559µ ± 2%   9.353µ ± 2%  -31.02% (p=0.000 n=10)
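To illustrate the kind of change involved, and not the actual GoFlow2 patch, fixed-size fields can be read directly from the byte slice with the byte-order helpers of encoding/binary instead of going through binary.Read():

// Hypothetical fixed-size header, for illustration only.
type flowHeader struct {
    Version uint16
    Count   uint16
    Uptime  uint32
}

// Before: binary.Read(bytes.NewReader(buf), binary.BigEndian, &hdr) fills
// the structure through reflection and allocates on every call.
// After: decode directly from the byte slice, without allocating.
func decodeHeader(buf []byte) flowHeader {
    return flowHeader{
        Version: binary.BigEndian.Uint16(buf[0:2]),
        Count:   binary.BigEndian.Uint16(buf[2:4]),
        Uptime:  binary.BigEndian.Uint32(buf[4:8]),
    }
}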

It is now easier to collect new data and the inlet component is faster! 🚅

Notice

Some paragraphs were editorialized by ChatGPT, using “editorialize and keep it short” as a prompt. The result was proofread by a human for correctness. The main idea is that ChatGPT should be better at English than me.


  1. While empty fields are not serialized to Protocol Buffers, empty columns in ClickHouse take some space, even if they compress well. Moreover, unused fields are still decoded and they may clutter the interface. ↩︎

  2. There is a similar benchmark function for NetFlow. The NetFlow and IPFIX protocols are less complex to decode than sFlow as they use a simpler TLV structure. ↩︎

  3. vtprotobuf generates more optimized Go code by removing an abstraction layer. It directly generates the code encoding each field to bytes:

    if m.OutIfSpeed != 0 {
        i = encodeVarint(dAtA, i, uint64(m.OutIfSpeed))
        i--
        dAtA[i] = 0x6
        i--
        dAtA[i] = 0xd8
    }
    
    ↩︎
  4. There is also a protoprint package to generate a proto definition file. I did not use it. ↩︎