Tokenizer

This is a pure Go port of OpenAI's tokenizer.

Usage

package main

import (
    "fmt"
    "github.com/tiktoken-go/tokenizer"
)

func main() {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        panic(err)
    }

    // this should print a list of token ids
    ids, _, err := enc.Encode("supercalifragilistic")
    if err != nil {
        panic(err)
    }
    fmt.Println(ids)

    // this should print the original string back
    text, err := enc.Decode(ids)
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}

Alternatively, you can use the included command-line tool:

> tokenizer -h

Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic

Todo

✅ port code
✅ cl100k_base encoding
✅ r50k_base encoding
✅ p50k_base encoding
✅ p50k_edit encoding
✅ tests
❌ handle special tokens
❌ gpt-2 model

Caveats

This library embeds OpenAI's vocabularies, which are not small (~4 MB), as Go maps. This differs from the Python version of tiktoken, which downloads the dictionaries and stores them in a cache folder.

However, since the dictionaries are compiled in during the Go build process, performance and start-up times should be better than downloading and loading them at runtime.
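
To make the trade-off concrete, here is a minimal sketch of the two approaches. It does not reproduce this library's code: the three-entry vocabulary, the "token rank" text format, and the names embeddedVocab, rawVocab and parseVocab are all hypothetical, chosen only for illustration. The map literal is compiled into the binary, while the string stands in for a file that would have to be fetched and parsed at start-up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// embeddedVocab is a hypothetical vocabulary written as a Go map literal,
// so it becomes part of the compiled binary and needs no parsing step.
var embeddedVocab = map[string]uint{
	"hello": 0,
	"world": 1,
	"!":     2,
}

// rawVocab stands in for a vocabulary file that would be downloaded and
// cached at runtime. The "token rank" line format is a simplified assumption.
const rawVocab = "hello 0\nworld 1\n! 2\n"

// parseVocab builds the lookup map from the raw text, which is the extra
// work a download-and-cache design has to do on every start-up.
func parseVocab(raw string) (map[string]uint, error) {
	vocab := make(map[string]uint)
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		rank, err := strconv.ParseUint(fields[1], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("bad rank %q: %w", fields[1], err)
		}
		vocab[fields[0]] = uint(rank)
	}
	return vocab, sc.Err()
}

func main() {
	loaded, err := parseVocab(rawVocab)
	if err != nil {
		panic(err)
	}
	// Both maps answer the same lookup; only the way they were built differs.
	fmt.Println(embeddedVocab["world"], loaded["world"])
}
```

Both maps end up with the same contents. The embedded variant trades a larger binary for having no download, cache, or parse step.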

Alternatives

Here is a list of other libraries that do something similar.

https://github.com/sugarme/tokenizer (A different tokenizer algorithm than OpenAI's)
https://github.com/pandodao/tokenizer-go (deprecated, calls into JavaScript)
https://github.com/pkoukk/tiktoken-go
