Transparent Telemetry for Open-Source Projects
Posted on Wednesday, February 8, 2023.

How do software developers understand which parts of their software are being used and whether they are performing as expected? The modern answer is telemetry, which means software sending data back to a collection server to answer those questions. This post is about why I believe telemetry is important for open-source projects, and what it might look like to approach telemetry in an open-source-friendly way. That leads to a new design I call transparent telemetry. If you are impatient, skip to the summary at the end. Other posts in the series detail the design and present various uses.

Why Telemetry?

Without telemetry, developers rely on bug reports and surveys to find out when their software isn’t working or how it is being used. Both of these techniques are too limited in their effectiveness. Let’s look at each in turn.

Bug reports are not enough. Users only file bug reports when they think something is broken. If a function is not behaving as documented, that’s a clear bug to report. But if a program is misbehaving in a way that doesn’t affect correctness, users are much less likely to notice. Statistics gathered by transparent telemetry make it possible for developers to notice that something is going wrong even when users do not.

For example, during the Go 1.14 release process in early 2020 we made a change to the way macOS Go distributions are built, as part of keeping them acceptable to Apple’s signing tools. Unfortunately, the way we made the change also made all the pre-compiled .a files shipped in the distribution appear stale to builds. The effect was that the go command rebuilt and cached the standard library on first run, which meant that compiling any program using package net (which uses cgo) required Xcode to be installed. So Go 1.14 and later unintentionally required Xcode to compile even trivial demo Go programs like a basic HTTP server. This is not the way we want Go to work on macOS.

On systems without Xcode, when go tried to invoke clang, macOS popped up a box explaining how to install it. Users simply accepted that this was necessary, perhaps even thinking go had displayed the popup. No one reported the bug over three years of Go releases. We didn’t notice and fix the problem until late 2022 while investigating something else. With telemetry for the miss rate in the cache of pre-compiled standard library packages, the impact would have been obvious: all Macs running Go 1.14 or later would have a pre-installed package miss rate of 100%.

This bug wasn’t caught by our unit tests because it was caused by the distribution build machines having a modified environment different from actual user machines. The unit tests ran in the same modified environment as the build and worked fine. These kinds of unexpected differences between developer machines and user machines are inevitable at scale. Instrumenting the software on user machines is the most reliable way to understand how well it is working.

Surveys are not enough. Surveys help us understand what users want to do with Go, but they are only a small sample and have limited resolution. Asking about usage of infrequently-used features on a survey wastes time for a majority of respondents, and it requires large response counts to get an accurate measurement.

For example, we announced in the Go 1.13 release notes that future releases would drop support for Native Client (GOOS=nacl). Similarly, we announced in the Go 1.15 release notes that future releases would drop support for hardware floating point on 32-bit Intel CPUs without SSE2 instructions (GO386=387). Both of those removals went off okay, retroactively proving that our instincts about how few people would be affected were correct.

On the other hand, we drafted an announcement for Go 1.18 removing -buildmode=shared, because it had essentially been broken since the introduction of modules, but when we issued Go 1.18 beta 1 we got feedback from at least a few people who were using it in some form. We still don’t know how many people are using it or whether it is worth the maintenance costs, so it lingers on. Another question is how long to keep supporting ARMv5 (GOARM=5), which doesn’t have modern atomic instructions. More recently, we announced that Go 1.20 will be the last release to support macOS High Sierra and were promptly asked to keep it around.

Usage information would help us make more informed decisions. It’s important to note the limitations of this usage information: if telemetry is disabled on all the machines that use the feature in question, or if the feature is only used on machines that don’t stay up long enough to report anything, then we won’t observe the usage. Telemetry is never perfect, but it’s a useful input to the decision and much better than guessing. A survey is not any better and usually worse: there is a limit to how many questions we can reasonably ask in a survey, and asking a question where 99% of people answer “no, I don’t use that” is a waste of most people’s time.

Why Telemetry For Open Source?

When you hear the word telemetry, if you’re like me, you may have a visceral negative reaction to a mental image of intrusive, detailed traces of your every keystroke and mouse click headed back to the developers of the software you’re using. And for good reason! That mental image sounds like it must be an exaggeration but turns out to be fairly accurate. (Citations: Kindle tracking individual page turns, VS Code telemetry logs, and .NET telemetry events.)

Open-source software projects have tended to avoid this kind of telemetry, for two reasons. The first is the significant privacy cost to users of collecting and storing detailed activity traces. The second is the fact that access to this data must be restricted, which would make the project less open than most strive to be. When the choice is between this kind of invasive tracking and doing nothing, doing nothing seems like an easy call. Still, doing nothing has real disadvantages. It means open-source developers like me tend not to understand as well how our software is used or how it performs. Then, because we lack that knowledge, we end up wasting time by maintaining features that aren’t used, hurting users by removing features that are still being used, and delivering a poorer user experience by failing to notice when our software is underperforming in real-world usage.

Some open-source projects have adopted traditional telemetry, with mixed success and varying levels of user pushback. For example: Audacity, GitLab, and Homebrew. Homebrew’s telemetry seems to be generally accepted by users, and VS Code’s detailed telemetry has not stopped it from being used by 74% of developers, as reported by the 2022 StackOverflow survey. It could even be that the benefits from telemetry are part of how VS Code’s developers have been able to build a tool that users like so much. Even so, the vast majority of projects, even large ones that would benefit, stay away from telemetry.

I believe that the choice between invasive tracking and doing nothing at all is a false dichotomy, and it’s harming open source. Not having basic information about how their software is used and how well it is performing puts open-source developers at a disadvantage compared to commercial software developers. Not having this information makes it more difficult to understand what’s important and what isn’t working, making prioritization that much harder. Not having clear prioritization in turn exacerbates the pre-existing problems with maintainer burnout.

Eric Raymond famously declared that “given enough eyeballs, all bugs are shallow,” which he explained as meaning that “[g]iven a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.” Perhaps this was true in 1997 (perhaps not), but it’s certainly not true today, as the Go macOS cache bug shows. A quarter century later, software is much larger, and open-source software is used by far more people who didn’t develop it and aren’t familiar with how it should and should not behave. Eyeballs don’t scale.

I believe that open-source software projects need to explore new telemetry designs that help developers get the information they need to work efficiently and effectively, without collecting invasive traces of detailed user activity.

Transparent Telemetry

This series of blog posts presents one such design, which I call transparent telemetry, because it collects as little as possible (kilobytes per year from each installation) and then publishes every bit that it collects, for public inspection and analysis.

I’d like to explore using this system, or one like it, in the Go toolchain, which I hope will help Go project developers and users alike. To be clear, I am only suggesting that the instrumentation be added to the Go command-line tools written and distributed by the Go team, such as the go command, the Go compiler, gopls, and govulncheck. I am not suggesting that instrumentation be added by the Go compiler to all Go programs in the world: that’s clearly inappropriate. Also, throughout these posts, “developer” refers to the authors of a given piece of software, while “user” refers to the users of that software. From the point of view of the Go toolchain, “developer” means a Go toolchain developer like me, while “user” means one of the millions of Go programmers using that toolchain.

With transparent telemetry, as programs from the Go toolchain run, they would increment counters for various events of interest (for example: cache hit, use of a given feature, measured latency in a given range) in a per-week on-disk file. These files hold only counter values: no user data and no user identifiers. Some counter names include a short stack trace (function names and line offsets only, no argument data).
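
To make that local counting step concrete, here is a minimal, self-contained sketch in Go, assuming a simple per-week JSON file of name-to-count pairs. The file location, format, and counter names are illustrative choices for this sketch only; a real implementation inside the toolchain would need to be much cheaper (for example, memory-mapped counters) and more careful about concurrent writers.

    // counters.go: an illustrative sketch of per-week local event counters.
    // The storage format (a small JSON file of name -> count) and the paths
    // and counter names below are assumptions for this sketch only.
    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "path/filepath"
        "time"
    )

    // weekFile returns the path of this week's counter file, named for the
    // Monday that starts the week.
    func weekFile() string {
        now := time.Now()
        monday := now.AddDate(0, 0, -int((now.Weekday()+6)%7))
        dir, err := os.UserConfigDir()
        if err != nil {
            dir = os.TempDir()
        }
        return filepath.Join(dir, "go", "telemetry", monday.Format("2006-01-02")+".json")
    }

    // inc adds one to the named counter in this week's file. Only counter
    // names and values are stored: no arguments, no user data, no identifiers.
    func inc(name string) error {
        file := weekFile()
        counts := map[string]int64{}
        if data, err := os.ReadFile(file); err == nil {
            json.Unmarshal(data, &counts)
        }
        counts[name]++
        out, err := json.MarshalIndent(counts, "", "\t")
        if err != nil {
            return err
        }
        if err := os.MkdirAll(filepath.Dir(file), 0o755); err != nil {
            return err
        }
        return os.WriteFile(file, out, 0o644)
    }

    func main() {
        // A cache miss like the Go 1.14 macOS bug would show up as this
        // counter being incremented on every build.
        inc("go/buildcache/stdlib-miss")
        fmt.Println("recorded in", weekFile())
    }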

The Go team at Google would run a collection server. Each week, with 10% probability (averaging ~5 times per year) the user’s Go installation would download a “collection configuration” to find out which counter values are of interest to the server and at what sample rate. The collection configuration would be served in a Go module validated using the Go checksum database, for added confidence that all clients are being served the same configuration. Based on the sample rates, the Go installation might send a report containing the counter values of interest. Typical sample rates would be around 2% (averaging ~1 report per installation per year), but very rare events could be sampled at a higher rate, up to the 10% limit. As more systems take part in transparent telemetry, the overall sample rate on any given system will decrease, because only a fixed number of samples is necessary.
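
One plausible way those probabilities could compose on the client, written as a sketch: a 10% chance each week of fetching the configuration, followed by a conditional draw so that the overall upload probability matches the configured sample rate, capped at 10%. The Config fields and function names here are assumptions, not a protocol definition.

    // upload.go: a sketch of the weekly client-side sampling decision.
    // Field and function names are illustrative assumptions.
    package main

    import (
        "fmt"
        "math/rand"
    )

    // Config stands in for the downloaded collection configuration.
    type Config struct {
        Counters   []string // counter names the server currently wants
        SampleRate float64  // overall upload probability, e.g. 0.02; never above 0.10
    }

    // weeklyUpload decides whether to send a report this week and, if so,
    // which counter values to include. Only 10% of weeks fetch the
    // configuration at all; dividing the configured rate by that 0.10 keeps
    // the overall upload probability equal to the configured rate.
    func weeklyUpload(fetchConfig func() Config, local map[string]int64) map[string]int64 {
        if rand.Float64() >= 0.10 {
            return nil // this installation does not even look at the config this week
        }
        cfg := fetchConfig()
        rate := cfg.SampleRate
        if rate > 0.10 {
            rate = 0.10 // capped by the 10% fetch probability
        }
        if rand.Float64() >= rate/0.10 {
            return nil // fetched the configuration but drew no upload this week
        }
        report := map[string]int64{}
        for _, name := range cfg.Counters {
            if v, ok := local[name]; ok {
                report[name] = v // only counters the configuration asks for
            }
        }
        return report
    }

    func main() {
        fetch := func() Config {
            return Config{Counters: []string{"go/buildcache/stdlib-miss"}, SampleRate: 0.02}
        }
        local := map[string]int64{"go/buildcache/stdlib-miss": 42}
        if report := weeklyUpload(fetch, local); report != nil {
            fmt.Println("would upload:", report)
        } else {
            fmt.Println("no report this week")
        }
    }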

The report would contain no ID of any form – no user login, no machine ID, no MAC address, no IP address, no IP address prefix, no geolocation information, no randomly-generated pseudo-ID, and no identifiers of any other kind. The report would contain basic information about the toolchain, such as its version and what operating system and architecture it was built for. The report could also contain coarse-grained information about the version of the host operating system (for example, “Windows 8”) and other tools the Go toolchain uses, such as the local C compiler (“gcc 2.95”).
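
Expressed as a data structure, an uploaded report might carry nothing more than fields like the ones below. The field names are placeholders for this sketch; the important part is what is missing: there is no field that could identify a user, a machine, or a network address.

    // Package report sketches the contents of an uploaded report as described
    // above. Field names are placeholders; the notable property is the absence
    // of any field identifying a user, machine, or network address.
    package report

    // Report is one week's sampled counters from one installation.
    type Report struct {
        GoVersion string            // toolchain version, e.g. "go1.20"
        GOOS      string            // target operating system, e.g. "darwin"
        GOARCH    string            // target architecture, e.g. "arm64"
        HostOS    string            // coarse host OS version, e.g. "Windows 8"
        Tools     map[string]string // other tools in use, e.g. {"cc": "gcc 2.95"}
        Counters  map[string]int64  // sampled counter name -> value
    }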

The server would collect each day’s uploaded reports, update telemetry graphs served publicly on go.dev, and post the full set of uploaded reports for public download, inspection, and analysis.
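
Publishing graphs from sampled reports involves one small piece of arithmetic worth spelling out: raw sums must be divided by the sample rate to estimate totals across all installations. The sketch below assumes the relevant sample rate is known for the day’s reports; how the real pipeline would track it is not specified here.

    // Package server sketches the daily aggregation step; scaling by the
    // sample rate is an assumption about how the public graphs would be
    // computed from sampled reports.
    package server

    // estimateTotals sums one day's uploaded counter maps and divides by the
    // sample rate to estimate activity across all installations: at a 2%
    // sample rate, observed sums are scaled up by a factor of 50.
    func estimateTotals(reports []map[string]int64, sampleRate float64) map[string]float64 {
        totals := map[string]float64{}
        for _, counters := range reports {
            for name, v := range counters {
                totals[name] += float64(v)
            }
        }
        for name := range totals {
            totals[name] /= sampleRate
        }
        return totals
    }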

Although the report would not include any identifiers, the TCP connection uploading the report would expose the system’s public IP address to the server if a proxy is not being used. This IP address would not be associated with the uploaded reports in any way. Standard system maintenance, including DoS prevention, might require logs that include the IP address, but uploaded reports would be kept separate from those logs. The privacy policy would be similar to the one used by the Go module mirror and checksum database.

The Go home page and download page already include a notice about the default use of the Go module mirror and a link to more information. That notice and link would be updated to disclose on-by-default telemetry. To opt out, users would set GOTELEMETRY=off in their environment or run a simple command like go env -w GOTELEMETRY=off. The first telemetry report is not sent until at least one week after installation, giving ample time to opt out. Opting out stops all collection and reporting: no “opt out” event is sent. Systems that install Go and then opt out within the first week are simply never observed at all.
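
A sketch of how that opt-out and the one-week quiet period might gate the client follows. The GOTELEMETRY setting comes from the text above; reading it directly from the process environment (rather than the go env configuration) and passing the installation time as a parameter are simplifications for this sketch.

    // Package optout sketches the two checks described above. Reading
    // GOTELEMETRY from the process environment (rather than the go env
    // configuration) is a simplification.
    package optout

    import (
        "os"
        "time"
    )

    // collectionEnabled reports whether any counting or reporting may happen.
    // When the user opts out, nothing is counted and nothing is sent,
    // including any kind of "opt out" event.
    func collectionEnabled() bool {
        return os.Getenv("GOTELEMETRY") != "off"
    }

    // uploadAllowed reports whether a report may be sent: never within the
    // first week after installation, so there is time to opt out before
    // anything leaves the machine.
    func uploadAllowed(installTime time.Time) bool {
        return collectionEnabled() && time.Since(installTime) >= 7*24*time.Hour
    }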

Summary

Transparent telemetry has the following key properties:

- It counts only events in the toolchain itself, amounting to kilobytes per year from each installation, and never records detailed traces of user activity; counter files hold no user data and no identifiers.
- Reports are sampled: a typical installation reports about once a year, and the sampling is controlled by a public collection configuration served as a Go module validated against the Go checksum database.
- Uploaded reports contain no identifiers of any kind and are kept separate from any server logs that record IP addresses.
- Everything that is collected is published: aggregate graphs on go.dev and the full set of uploaded reports, available for anyone to download, inspect, and analyze.
- Telemetry is disclosed on the Go download page, can be turned off with a single setting, and never uploads anything during the first week after installation.

Next Steps

For more detail about the design, see the next post. For more use cases, see the post after that.

Although these posts use Go as the example system using transparent telemetry, I hope that the ideas apply and can be adopted by other open-source projects too, in their own, separate collection systems. For example, even though VS Code collects high-resolution event traces (sometimes tens or hundreds of events per minute), a close reading of those traces shows hardly anything is new in each event. That is, VS Code suffers the reputational hit of collecting lots of data but appears to gather relatively little actual information. Perhaps using transparent telemetry in VS Code or a similar editor could offer the editor’s developers roughly equivalent insights and development velocity at a much lower privacy cost to users.

I am posting this series to start a discussion about how the Go toolchain can adopt telemetry in some form, to help its developers make better decisions about the development and maintenance of Go. I have written an implementation of local counter collection to convince myself it could be made cheap enough, but no other part of the system exists today in any form. I hope that the system can be built over the course of 2023.