simdjson-go: Parsing gigabytes of JSON per second in Go

Introduction

JSON has established itself as the "lingua franca" of the web. As such, JSON parsing performance is hugely important for many applications. Despite its simple and human-friendly nature, JSON is not a trivial format to parse at high speed.

Recently some new designs have been presented, one of which is simdjson by Daniel Lemire and Geoff Langdale. simdjson uses a novel two-stage approach that makes it possible to parse gigabytes of JSON per second on a single core. Leveraging so-called SIMD instructions allows more text and number crunching to be performed per instruction, which dramatically enhances performance.

At MinIO, we have been working on simdjson-go which is a port to Golang. You may know MinIO as a high performance object store based on the S3 API. Amazon has been steadily adding functionality to the S3 API and one of the more recent additions has been a feature called S3 Select. This feature essentially allows an application to drill down into the contents of (large) blob objects in formats such as JSON or CSV. As such we can use all the parsing performance that we can get, and so it made sense for us to do this work.

Performance-wise, simdjson-go runs on average at about 40% to 60% of the speed of simdjson. Compared to Golang's standard package encoding/json, simdjson-go is about 10x faster.

Features

simdjson-go is a validating parser, meaning that, among other things, it validates and checks numerical values, booleans, etc. These values are therefore available as the appropriate int and float64 representations after parsing.

Additionally simdjson-go has the following features:

  • No 4 GB object limit
  • Support for ndjson (newline delimited json)
  • Proper memory management
  • Pure Go (no need for cgo)

Performance vs simdjson

Based on the same set of JSON test files, the graph below shows a comparison between simdjson and simdjson-go (larger is better).

Performance vs encoding/json

Below is a performance comparison to Golang's standard package encoding/json based on the same set of JSON test files.

Design

simdjson-go follows the same two stage design as simdjson. During the first stage the structural elements ({, }, [, ], :, and ,) are detected and forwarded as offsets in the message buffer to the second stage. The second stage builds a tape format of the structure of the JSON document.

Note that in contrast to simdjson, simdjson-go outputs uint32 increments (as opposed to absolute values) to the second stage. This allows arbitrarily large JSON files to be parsed (as long as a single (string) element does not surpass 4 GB...).
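To illustrate why increments lift the 4 GB limit, here is a minimal sketch. The function name absoluteOffsets is hypothetical and not part of simdjson-go; the sketch only assumes that a consumer accumulates the uint32 increments into a wider counter.

```go
package main

import "fmt"

// absoluteOffsets is a hypothetical helper (not simdjson-go API). It turns
// uint32 increments into absolute positions by summing them into a uint64,
// so positions beyond 4 GB remain representable.
func absoluteOffsets(deltas []uint32) []uint64 {
	out := make([]uint64, 0, len(deltas))
	var pos uint64
	for _, d := range deltas {
		pos += uint64(d) // each increment fits in 32 bits, the running sum does not have to
		out = append(out, pos)
	}
	return out
}

func main() {
	// Two structural characters roughly 4 GB apart, then one 5 bytes later.
	fmt.Println(absoluteOffsets([]uint32{4000000000, 4000000000, 5}))
}
```

Only the distance between two consecutive structural characters has to fit in 32 bits, which is why a single very large string element remains the one limiting case.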

Also, for better performance, both stages run concurrently as separate goroutines, and a Go channel is used to communicate between the two stages.
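The concurrent arrangement can be sketched roughly as follows. This is not the library's code: the function names, the channel element type, and the buffer size are assumptions made for illustration, and the first stage here is a plain byte loop that ignores strings rather than SIMD code.

```go
package main

import "fmt"

// stage1 is a simplified stand-in for the first stage: it sends the index of
// every structural character to the channel and closes it when done.
// (Assumption: it ignores string contents, unlike the real parser.)
func stage1(buf []byte, out chan<- int) {
	defer close(out)
	for i, c := range buf {
		switch c {
		case '{', '}', '[', ']', ':', ',':
			out <- i
		}
	}
}

// stage2 consumes indexes as they arrive and, as a toy form of "structure
// building", tracks the maximum nesting depth of the document.
func stage2(buf []byte, in <-chan int) int {
	depth, maxDepth := 0, 0
	for i := range in {
		switch buf[i] {
		case '{', '[':
			depth++
			if depth > maxDepth {
				maxDepth = depth
			}
		case '}', ']':
			depth--
		}
	}
	return maxDepth
}

// maxNesting wires both stages together with a buffered channel.
func maxNesting(buf []byte) int {
	ch := make(chan int, 128) // buffer size chosen arbitrarily for the sketch
	go stage1(buf, ch)
	return stage2(buf, ch)
}

func main() {
	fmt.Println(maxNesting([]byte(`{"a":[1,{"b":2}]}`)))
}
```

The buffered channel lets the first stage run ahead of the second, so neither goroutine has to wait for the other on every single element.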

Stage 1

Stage 1 has been converted from the original C code (containing the SIMD intrinsics) to Golang assembly using c2goasm. It consists of five separate steps:

  • find_odd_backslash_sequences: detect backslash characters used to escape quotes
  • find_quote_mask_and_bits: generate a mask with bits turned on for characters between quotes
  • find_whitespace_and_structurals: generate a mask for whitespace plus a mask for the structural characters
  • finalize_structurals: combine the masks computed above into a final mask where each active bit represents the position of a structural character in the input message.
  • flatten_bits_incremental: output the active bits in the final mask as incremental offsets.

There is one final routine, find_structural_bits_in_slice, that ties it all together and is invoked with a slice of the message buffer in order to find the incremental offsets.
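The steps above can be approximated in scalar Go for a single block of up to 64 bytes. This is a simplified sketch, not the library's assembly: the function names are hypothetical, escaped characters are found with a byte loop instead of bit tricks, whitespace handling is omitted, and the prefix XOR is computed with shifts where the SIMD version relies on a carry-less multiplication.

```go
package main

import (
	"fmt"
	"math/bits"
)

// prefixXor sets bit i of the result to the XOR of bits 0..i of m. Applied to
// a mask of quote positions, it turns on the bits from each opening quote up
// to (but not including) its closing quote.
func prefixXor(m uint64) uint64 {
	m ^= m << 1
	m ^= m << 2
	m ^= m << 4
	m ^= m << 8
	m ^= m << 16
	m ^= m << 32
	return m
}

// structuralDeltas is a hypothetical scalar stand-in for stage 1 on one block
// of at most 64 bytes. It returns the positions of structural characters
// outside strings as increments relative to the previous position.
func structuralDeltas(block []byte) []uint32 {
	if len(block) > 64 {
		block = block[:64]
	}
	var quotes, structurals uint64
	escaped := false
	for i, c := range block {
		bit := uint64(1) << uint(i)
		if escaped { // character follows an unescaped backslash: skip it
			escaped = false
			continue
		}
		switch c {
		case '\\':
			escaped = true
		case '"':
			quotes |= bit
		case '{', '}', '[', ']', ':', ',':
			structurals |= bit
		}
	}

	inString := prefixXor(quotes)     // mask of characters between quotes
	final := structurals &^ inString // drop structurals that sit inside strings

	var deltas []uint32
	prev := 0
	for final != 0 {
		pos := bits.TrailingZeros64(final) // index of the lowest active bit
		deltas = append(deltas, uint32(pos-prev))
		prev = pos
		final &= final - 1 // clear the lowest active bit
	}
	return deltas
}

func main() {
	fmt.Println(structuralDeltas([]byte(`{"a":[1,"x,y"]}`)))
}
```

In the example input, the comma inside "x,y" is masked out by the quote mask and therefore does not appear among the increments.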

Stage 2

During Stage 2 the tape structure is constructed. It is essentially a single function that jumps around as it finds the various structural characters and builds the hierarchy of the JSON document that it encounters. The values of the JSON elements such as strings, integers, booleans etc. are parsed and written to the tape.
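As a rough illustration of what building a tape can look like, the sketch below records only the container hierarchy. The tapeEntry layout and buildTape function are invented for this example and do not reflect simdjson-go's actual tape encoding; values such as strings and numbers are skipped entirely.

```go
package main

import "fmt"

// tapeEntry is a toy tape element (assumed layout, not simdjson-go's).
type tapeEntry struct {
	Kind    byte   // one of '{', '}', '[', ']'
	Payload uint64 // tape index of the matching bracket
}

// buildTape walks the input once and writes one entry per bracket, linking
// each opening bracket to its closing bracket and vice versa.
func buildTape(buf []byte) ([]tapeEntry, error) {
	var tape []tapeEntry
	var stack []int // tape indexes of currently open containers
	inString, escaped := false, false
	for _, c := range buf {
		if inString { // skip string contents, honoring backslash escapes
			switch {
			case escaped:
				escaped = false
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue
		}
		switch c {
		case '"':
			inString = true
		case '{', '[':
			stack = append(stack, len(tape))
			tape = append(tape, tapeEntry{Kind: c})
		case '}', ']':
			if len(stack) == 0 {
				return nil, fmt.Errorf("unbalanced %q", c)
			}
			open := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			// In ASCII, '}' is '{'+2 and ']' is '['+2.
			if tape[open].Kind != c-2 {
				return nil, fmt.Errorf("mismatched %q", c)
			}
			tape[open].Payload = uint64(len(tape))
			tape = append(tape, tapeEntry{Kind: c, Payload: uint64(open)})
		}
	}
	if len(stack) != 0 {
		return nil, fmt.Errorf("unclosed container")
	}
	return tape, nil
}

func main() {
	tape, err := buildTape([]byte(`{"a":[1,2]}`))
	fmt.Println(len(tape), err)
}
```

Linking each container to its matching end is what would let a reader of such a tape skip over a whole object or array in one step instead of scanning through it.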

Usage and requirements

After successfully parsing the JSON contents, simdjson-go returns an iterator for navigating the tape structure. Here is a basic example of how to iterate:

for {
	// Advance to the next element and get its type.
	typ := iter.Advance()

	switch typ {
	case simdjson.TypeRoot:
		// Enter the root element.
		if typ, tmp, err = iter.Root(tmp); err != nil {
			return
		}

		if typ == simdjson.TypeObject {
			if obj, err = tmp.Object(obj); err != nil {
				return
			}

			// Look up key and print its value if it is a string.
			e := obj.FindKey(key, &elem)
			if e != nil && elem.Type == simdjson.TypeString {
				v, _ := elem.Iter.StringBytes()
				fmt.Println(string(v))
			}
		}

	default:
		// Any other type ends the loop.
		return
	}
}

In terms of requirements, simdjson-go requires a CPU that supports both AVX2 and CLMUL.

Conclusion

simdjson-go is open source and released under the Apache License v2.0. You can find the code on GitHub at github.com/minio/simdjson-go.

Give it a try. We welcome any feedback and/or contributions.
