@christoofar
Last active April 21, 2024 22:01
Wrapping a C library call in a defensive Go routine
This study focuses on the strategies used by the "xz backdoor", an extremely
complex piece of malware that carries its own x64 disassembler to find
critical locations in your code, then hijacks them by swapping in its own
code as the program runs.  Because this is a machine-code-based attack, code
written in any programming language is vulnerable.

Instead of targeting sshd directly, the xz backdoor rides in through
liblzma, which some distributions pull into sshd by way of libsystemd.
It then hijacks the GNU Dynamic Linker (ld.so) before sshd's main() even
runs or libcrypto.so (needed by sshd) is fully loaded, by installing an
audit hook of the kind LD_AUDIT provides.  This gives the xz backdoor
live information about what the linker is resolving, so it can activate
before the linker makes the relocated memory pages read-only.

In order to pull this off, the attacker:
- Had advanced knowledge of the GNU C compiler
- Was keenly aware of upcoming changes to ld.so (the GNU Dynamic Linker),
as the ld.so currently present on your computer, home router and smartphone
behaves slightly differently from the new release still undergoing testing.

To raise the bar even higher for the next attacker that comes, this 
study shows you how to force an attacker to deal with the added complexity 
of relocatable coroutines (goroutines), a runtime feature available in Go.

I've been thinking a lot about this as a CGo/gccgo dev: "What can a HLL programmer do against the likes of Jia Tan? They're attacking from the foundation software."

I'm not settled on this one, but wrapping calls to C libs in goroutines would probably raise the difficulty level of a direct hijack of your own Go code, because of the rapid context switches and the unpredictability of where the Go runtime will place the jumps and calls.

After 129,000 lines of asm, here is printf("Hello World") in Go down at the bottom: [screenshot]

Now, let's see what happens when we do this:

package main

import "time"

func main() {
	hello := "Hello world!"
	go func() {
		print(hello)
	}()
	// Give the goroutine time to run before main exits.
	time.Sleep(1 * time.Second)
}

Now we're asking the Go runtime to activate concurrency, and main itself gets split into two compact parts, with an anonymous function that disappears into the goroutine ecosphere (to get this to fit I'm stripping symbols): [screenshot] [screenshot]

Notice how nice and compact the goroutine is! There's not much an attacker can do here except try to intercept the ret and call instructions, and even then they need to make sure the runtime stack cleanup still happens or things will start to go crashycrashy.

So now let's make a C lib call but push it down into a goroutine wrapper, yet make it synchronous. And for fun, the data to the function will be passed via a channel, which brings in the communication/sync areas of the runtime with its maze of runtime functions. And since we're here, let's make it a full Go wrapper, with two channels and a goroutine bridge, and a done signaler.

package main

// #include <stdio.h>
// #include <stdlib.h>
// void printFromC(const char* str) {
//     printf("Received C string: %s\n", str);
// }
import "C"
import "unsafe"

func main() {
	myPrint("Hello from Go!")
}

func myPrint(hellostring string) {
	// Protect the C library call from Jia Tan and the NSA
	sendchan := make(chan string)
	recvchan := make(chan string)
	done := make(chan bool)
	
	// Outer goroutine: launches the two halves of the bridge.
	go func(sender chan string, receiver chan string) {
		go func(receive <-chan string) { // receive-only view of the bridge output
			callCPrint(receive)
			done <- true
		}(receiver)
		go func() {
			strToSend := <-sender // take the string from myPrint
			receiver <- strToSend // hand it to the goroutine that calls C
			close(sender)
		}()
	}(sendchan, recvchan)

	sendchan <- hellostring
	<-done
	return
}

func callCPrint(str <-chan string) {
	cStr := C.CString(<-str)
	defer C.free(unsafe.Pointer(cStr)) // Deallocate memory when done
	C.printFromC(cStr)
}

The main() in asm representation gets shorter: [screenshot]

But now there is some real fun going on in myPrint() as it's acting as a traffic cop moving the string along its way into the chaos of pthread, with its context switches and semaphores. myPrint is split by the compiler into 6 asm functions (one for each launch context and its anonymous function), to allow for their dynamic reallocation to the runtime.

[screenshot] That goes on for pages.

callCPrint then has a thunk going on, which can't get its data back to myPrint without going back through the runtime maze.

[screenshot]

I'm still not sold on this approach but I'm definitely willing to change my own behavior to make these creeps go away if the difficulty is raised high enough. And throwing CGo calls through a goroutine bridge still makes readable code to me.

christoofar commented Apr 13, 2024

Advantages to this approach:

  • Code is still readable
  • The Go runtime itself is baked into the binary statically, and it is large and very complex, weighing in at two Commodore-64s.
  • This wrapper style at least gives you a place to inspect "weirdness" from C library returns and to clear memory in one place.
  • You can break the data up in the wrapper myPrint() and then resync it right before the C call, then again in the reverse direction for the return. Probably a good idea to make a different receiver func that's also in a goroutine, that chans the data back to myPrint(), which is waiting for it.
  • Traditional security techniques to obfuscate in-memory data are not affected by this. For instance, if the string is a credit card number, I can use a PRNG and rotate the bits (not the bytes) N times before sending it through the goroutine bridge, then add a fourth channel that sends the number of rotations through some easy calculation (say, the positive or negative distance from a constant), then reverse the operation right before the C call. This will force the attacker to deal with the runtime and navigate through it to nail down an inject point, which is going to be very tough in the compact code inside the final goroutine, so there is no choice but to deal with the async introduced by pthread.
  • You can check for and clear any "surprise globals" in an easier spot (above, in the myPrint wrapper) that C libraries could leave behind (or nuke them at the C return to avoid traveling back through the runtime).
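
The bit-rotation idea in the list above can be sketched roughly as follows. This is only an illustration under assumptions: the names rotationBase, rotate and sendObfuscated are hypothetical, the rotation is done per byte with math/bits rather than across the whole bit string, the rotation count is passed in instead of coming from a PRNG, and the far goroutine simply returns the recovered bytes where the real code would make the C call.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotationBase is a hypothetical constant. The rotation count crosses the
// bridge as a signed distance from it, never as the raw count.
const rotationBase = 40

// rotate returns a copy of data with every byte rotated left by n bits.
// A negative n rotates right, which undoes an earlier left rotation.
func rotate(data []byte, n int) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = bits.RotateLeft8(b, n)
	}
	return out
}

// sendObfuscated pushes the rotated payload and the encoded rotation count
// through separate channels and returns what the far side recovers.
func sendObfuscated(secret []byte, n int) []byte {
	payload := make(chan []byte)
	distance := make(chan int)
	result := make(chan []byte)

	// Far side of the bridge: this is where the C call would happen.
	go func(p <-chan []byte, d <-chan int, r chan<- []byte) {
		n := rotationBase - <-d // decode the rotation count
		r <- rotate(<-p, -n)    // reverse the rotation right before use
	}(payload, distance, result)

	distance <- rotationBase - n
	payload <- rotate(secret, n)
	return <-result
}

func main() {
	recovered := sendObfuscated([]byte("4111 1111 1111 1111"), 3)
	fmt.Println(string(recovered))
}
```

The payload and the rotation count never travel on the same channel, so an observer of one channel alone does not see the cleartext.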

Disadvantages to this approach:

  • Takes longer to code
  • A single-parameter C call rarely happens, so you will probably have to wrap and thunk Go structs and get the pointers correct and use more channels to split the data up. For each split you will be making 2 more channels.
  • The Go runtime is hefty, making it a bad choice for very small and resource-constrained microcontrollers. Above that level the runtime performance is acceptable.
  • For stream processing in Go to C libraries you will probably have to set up another wrapper on top of this simple one if the underlying C library supports a change of behavior mid-flight. For instance, watching TV through the C function and you want to signal a change in the stream source without tearing down and reconstructing the bridges. You would have to do this anyway if the C library doesn't support that feature but you want the users of your Go wrapper to have it by hiding the tear-down and rebuild of the stream.
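
To give a feel for the extra channels a multi-parameter call costs, here is a rough sketch that splits a two-field struct across two channels and brings the result back on a third. Everything named here (resizeRequest, fakeCArea, callArea) is hypothetical, and fakeCArea is a plain Go stand-in so the sketch runs without cgo.

```go
package main

import "fmt"

// resizeRequest is a hypothetical two-field argument for a C call.
type resizeRequest struct {
	Width  int
	Height int
}

// fakeCArea stands in for the real C function.
func fakeCArea(width, height int) int {
	return width * height
}

// callArea flattens req into one channel per field, reassembles the values
// next to the stand-in C call, and returns the result on its own channel.
func callArea(req resizeRequest) int {
	widthChan := make(chan int)
	heightChan := make(chan int)
	resultChan := make(chan int)

	go func(w, h <-chan int, result chan<- int) {
		// Receives are evaluated left to right: width first, then height.
		result <- fakeCArea(<-w, <-h)
	}(widthChan, heightChan, resultChan)

	widthChan <- req.Width
	heightChan <- req.Height
	return <-resultChan
}

func main() {
	fmt.Println(callArea(resizeRequest{Width: 4, Height: 3}))
}
```

No struct pointer crosses the bridge; only the base-typed fields do, which is the price paid in channel count.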

christoofar commented Apr 13, 2024

Why this approach would frustrate Jia Tan:

Even if you wield the power of your own in-memory disassembler and you have captured LD_AUDIT to watch what gets loaded, this does not give you a cop-out to avoid the complexity of the decision-making in the Go runtime.

For starters:

  • Goroutines have their own independent context. They're usually written as anonymous functions, so their asm is easy to locate, but which CPU thread they will run on, let alone which internal runtime context (and even which stack) they will be called from, is far less obvious. That's why Go programmers debug with delve.

  • Fixing the direction of flow of the data channels creates another obstacle. It is more work than just bouncing a ctx around like a beach ball, but it's more secure, because channels can only pass data through the runtime. Patching around that is going to be a very challenging problem.

  • Goroutines have a compact structure. The gostack area is very small, which allows the Go runtime to make its reallocation movements very fast. Spinning up process threads is slow and tedious. In Go, creating a timeline-independent task is quick. You can even do simple things to really foul up Jia Tan, such as this:

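// callAsyncTimes, randomizer and launchfunc are defined elsewhere.
func whenDoILaunch() {
	for i := 0; i < callAsyncTimes(); i++ {
		i += randomizer() // returns int
		go func(i int) { // pass i by value so each goroutine gets its own copy
			if i == 0 {
				go launchfunc()
			}
		}(i)
	}
}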

It's nonsense code, but still fast. This runtime trick gives Jia Tan some nasty work: he has no choice but to travel everywhere to find where the launcher context in the runtime is, forcing him to refocus on your launcher, where the goroutine stacks are packed tight.

You could spice this up even more by C-wrapping the C library itself, introducing another headache. That would certainly benefit the simplicity of the Go code in the routine cases of nasty call-chain dependencies.
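
The point in the list above about fixing the direction of the data channels can be sketched like this. It is a minimal illustration, with hypothetical names (producer, consumer, relay), of how Go's directional channel types let the compiler reject any attempt to use a channel the wrong way round.

```go
package main

import "fmt"

// producer can only send on out. A receive such as <-out would not compile.
func producer(out chan<- string, msg string) {
	out <- msg
	close(out)
}

// consumer can only receive from in. A send such as in <- "x" would not compile.
func consumer(in <-chan string, done chan<- string) {
	var got string
	for s := range in {
		got += s
	}
	done <- got
}

// relay wires the two halves together and waits for the consumer's result.
func relay(msg string) string {
	data := make(chan string)
	done := make(chan string)
	go producer(data, msg)
	go consumer(data, done)
	return <-done
}

func main() {
	fmt.Println(relay("one-way only"))
}
```

The bidirectional channels created in relay are narrowed at each function boundary, so the direction is fixed by the type rather than by convention.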

So what are some gaps?

Biggest of all is Go's garbage collection. For the most secure data you really must clear values and any other scratch data structures immediately after use. You can emit your results to a buffered channel, but it's not a great idea to leave the cleanup to another goroutine, as that sets up another jump with pointer context passing that can be patched out.
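
A minimal sketch of clearing scratch data in the same goroutine that used it, assuming the secret is held in a []byte (Go strings are immutable and cannot be overwritten in place). The names fakeCConsume, wipe and useAndWipe are hypothetical, and fakeCConsume is a plain Go stand-in for the C call.

```go
package main

import "fmt"

// fakeCConsume stands in for a C call that reads the secret.
func fakeCConsume(secret []byte) int {
	return len(secret)
}

// wipe overwrites b in place with zeros.
func wipe(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// useAndWipe hands the secret to the call and zeroes the buffer before
// returning, so no other goroutine is trusted with the cleanup.
func useAndWipe(secret []byte) int {
	defer wipe(secret)
	return fakeCConsume(secret)
}

func main() {
	secret := []byte("hunter2")
	n := useAndWipe(secret)
	fmt.Println(n, secret)
}
```

This only clears the one buffer it is given; any copies made elsewhere (including conversions to string) are outside its reach.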

An obvious one: don't send critical data back across a ctx. Use channels!

You can see from my screenshots that main() is the obvious universal landing point. You should have almost nothing in your main() but your primary goroutine launch and whatever core system validation that must be checked before proceeding. I really wouldn't even read environment variables or the passed-in arguments to the software from main(). Do all your work from inside goroutines where the stack is packed tightly and harder to patch.
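
As a rough sketch of a near-empty main(), assuming hypothetical helpers named summarize and worker: main only launches the primary goroutine and waits, while the command-line arguments are read inside the goroutine.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// summarize is a hypothetical stand-in for the program's real work.
func summarize(args []string) string {
	return fmt.Sprintf("%d args: %s", len(args), strings.Join(args, " "))
}

// worker reads the arguments itself, so main never touches them.
func worker(result chan<- string) {
	result <- summarize(os.Args[1:])
}

func main() {
	result := make(chan string)
	go worker(result)
	fmt.Println(<-result)
}
```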

So, I think this method of just putting a goroutine moat around C libraries, taking the hit of immediately clearing memory, and relying on minimal channel buffer sizes is probably the safest way to do unsafe code in CGo.

christoofar commented Apr 13, 2024

What should not be done?

On large CGo wrappers it is super-common to throw the .h file into a stub translator. That is such a bad idea for a release, except for direct ports (like GTK). Just exposing the CGo stubs will likely make other Go developers mad and not use your library, particularly if the call-chain dependencies aren't obvious and easy, like this:

  pInterestingThingRequestor := C.createThingRequest(...)   // type is *HANDLEInterestingThingRequest

  *pInterestingThingRequestor.SetSomePrecondition(..., 0, ...)

  C.reticulateSplines(&pInterestingThingRequestor)
  ...

When you could have just handled this part in native Go for them:


func ReticulateSplinesFast(options InterestingOptions) InterestingData {
    // Translate and build out the C structures here
}

Don't pass a ctx to where the C call happens

It's worth saying this a third, fourth, fifth time. Avoid the Go context structure when you are closer to the C code. Use an error code channel (int) and flatten out your data into data channels before you head off to the area which calls C.

ctx defeats the entire purpose of protecting the C calls behind goroutine bridges. Context structures are fine inside the Go ecosphere, but you decided to leave it. And the xz backdoor relied on a data structure passed everywhere in the library to get around the problem of context switching. Don't open that door.
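
A minimal sketch of the alternative described above, with flat data going in on one channel and a plain int status coming back on another, and no context.Context anywhere near the call. The names fakeCWrite and writeThroughBridge are hypothetical, and fakeCWrite is a Go stand-in so the sketch runs without cgo; its status values are made up for the example.

```go
package main

import "fmt"

// fakeCWrite stands in for the real C function. It returns 0 on success
// and a nonzero status for an empty payload.
func fakeCWrite(payload []byte) int {
	if len(payload) == 0 {
		return 1
	}
	return 0
}

// writeThroughBridge passes only flat data in and an int status out.
func writeThroughBridge(payload []byte) int {
	dataChan := make(chan []byte)
	errChan := make(chan int)

	go func(in <-chan []byte, status chan<- int) {
		status <- fakeCWrite(<-in)
	}(dataChan, errChan)

	dataChan <- payload
	return <-errChan
}

func main() {
	fmt.Println(writeThroughBridge([]byte("data")), writeThroughBridge(nil))
}
```

The caller can translate the int into a Go error on its own side of the bridge, where richer structures are safe to use.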

christoofar commented Apr 19, 2024

For anyone who is interested:

I am applying the ideas in this study to new development: a Go-based app+API to liblzma.so: safexz

Unlike other CGo wrappers for liblzma, this one only allows the C code to send/receive data via one-way channels and strongly typed base types. No data coming from Go can be promoted into a pointer and misused from the C layer. So even if you had a backdoored version of liblzma.so (5.6.0 or 5.6.1) on your setup, you would be fine.

The Go-facing stubs are written and the underlying compress/decompress sequence is complete. I am now finishing manual stress testing of the compress/decompress in preparation for writing out the unit tests and a build sequence.

While I am still linking against liblzma.so at build time, I could (if I wished) set up the C library to load on invocation with dlopen() and then unload liblzma.so. As there has been a lot of renewed interest in Lasse Collin's work lately and Jia Tan's backdoor has been extricated from the test binaries and build process, there isn't a compelling reason to dlopen().

Since the xz format does not lend itself well to multiple stream structures (it is supported, but no one makes use of it), to stay compatible with how most people use xz I am debating throwing the standard/basic Go support for tar functions into this Go lib so that safexz can pack and unpack to .tar.xz in one shot rather than running through a tar | xz pipe. First, though, I want to make sure that there is buffer support in front of and behind STDIN and STDOUT, since compressors are usually fed data that way.
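
For the tar half of that idea, a minimal sketch using only the standard library's archive/tar might look like the following. The helper names (packTar, firstEntry, roundTrip) are hypothetical, and the bytes.Buffer stands in for whatever writer safexz would expose, which is not shown here; no xz compression happens in this sketch.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// packTar writes a single-file tar archive to w.
func packTar(w io.Writer, name string, content []byte) error {
	tw := tar.NewWriter(w)
	hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	if _, err := tw.Write(content); err != nil {
		return err
	}
	return tw.Close()
}

// firstEntry reads the name and body of the first entry in a tar stream.
func firstEntry(r io.Reader) (string, []byte, error) {
	tr := tar.NewReader(r)
	hdr, err := tr.Next()
	if err != nil {
		return "", nil, err
	}
	body, err := io.ReadAll(tr)
	return hdr.Name, body, err
}

// roundTrip packs one file into an in-memory buffer and reads it back.
func roundTrip(name string, content []byte) (string, []byte, error) {
	var buf bytes.Buffer // stand-in for the compressor's input stream
	if err := packTar(&buf, name, content); err != nil {
		return "", nil, err
	}
	return firstEntry(&buf)
}

func main() {
	name, body, err := roundTrip("hello.txt", []byte("hi"))
	fmt.Println(name, string(body), err)
}
```

Because packTar takes any io.Writer, the same function could feed a compressing writer directly instead of a buffer, which is the one-shot behavior the paragraph above describes.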

Direct xz compression for strings and []byte for in-memory/database applications is quite easy, so I intend after this is over to "eat my own dog food" and use safexz in my own work at ${DAY_JOB}.
