Tuesday, May 14, 2019

High level overview of Scudo

With this post, I am going to go through some high-level details about the architecture of the allocator and some of the security features it offers. Some notions will only be skimmed over, with the hope of covering them in more detail in a later post (time permitting).

Scudo is made up of the following components:

  • a "primary" allocator: this is a fast allocator, servicing smaller sized requests (configurable at compile time). It is "segregated", eg: chunks of the same size end up in the same memory region, that is compartmentalized from other regions (the separation is stronger on 64-bit, where a memory area is specifically reserved for the primary regions); chunks allocated by the primary are randomized to avoid predictable address sequences (note that the larger the size, the more predictable the addresses are to each other). A couple of side effects to this design, is that there is no such thing as coalescing contiguous blocks, and that the memory used by the primary is never unmapped - but it can be reclaimed. While we are trying to focus on 64-bit, there is a 32-bit primary, mostly due to Android;
  • a "secondary" allocator: which wraps the platform memory allocation primitives, and as such is slower and used to service larger sized allocations. Allocations fulfilled by the secondary are surrounded by guard pages;
  • local caches: these are thread-specific stashes, holding pointers to free blocks in order to relieve contention over the global free-list (a rough sketch of how a request is routed through a cache, the primary and the secondary follows this list). There are two models: exclusive and shared. With the exclusive model, there is a unique cache per thread, which is more memory hungry but mostly free of contention. With the shared model, threads share a set number of caches that can be dynamically reassigned at runtime based on contention - this uses less memory than the exclusive model and usually fits the needs of end-user platforms better;
  • a "quarantine": which can be equated to a heap wide delayed free-list, holding recently freed blocks for a time until a criteria is met (usually, a certain size is reached), before returning them to the primary or secondary for reuse. There is a thread-specific quarantine, and a global quarantine to avoid as much as possible global locking. This is the most impactful in terms of memory usage and to some extent performances: even smaller sized quarantines will have a large impact on a process RSS, and it effectively kills locality, making any sort of memory cache less useful. As such, it is disabled by default, and can be enabled on a per-process basis (and sized according to the process needs).

Now for some security "features":

  • strict size and alignment requirements: we enforce maximum sizes and alignment values, and also check that pointers provided are properly aligned; those are cheap checks that avoid integer overflows and catch low-hanging deallocation errors (or abuse);
  • each chunk is preceded by a header that stores basic information about the allocation, and is checksummed so that corruption can be detected (a simplified sketch of such a header follows this list of security features).
    While the in-band vs. out-of-band metadata debate divides people, the choice of an in-band header was made to be able to detect linear {over,under}flows (at least until we get memory tagging).
    The checksum of the header involves a global secret, the pointer being dealt with, and the content of the header - it is not meant to be cryptographically strong. As for the data stored in the header, it holds the size of the allocation, the state of the chunk (available, allocated, quarantined), its origin (malloc, new, new[]) and some internal data. Headers are manipulated atomically, to detect races between threads operating on the same chunk.
    As is usually the case with this type of mitigation, inconsistencies are only detected when the header is checked, which generally means that a heap operation has to occur on the chunk in question.
    Overall, this allows for several security checks:
    • ensure that a pointer being deallocated actually points to a chunk, otherwise the checksum verification will fail. Some other allocators gladly accept a pointer pointing into the middle of a chunk for deallocation; we do not;
    • ensure that the state of a chunk is consistent with the operation being carried out. This allows for detection of double-frees and the like;
    • ensure that a sized deallocation is valid for the targeted chunk, which allowed us to find an Intel C Compiler bug, and prevents related abuse;
    • ensure that the deallocation function is consistent with the allocation function that returned the targeted chunk (e.g., free/malloc, delete/new);
  • we randomize everything we can, to reduce predictability as much as possible; one of the side benefits of the thread caches is that they can make it more difficult for an attacker to get the chunks they want in the state they need, if they leverage allocation primitives in different threads;
  • guard pages are added when deemed useful;
  • we do not store pointers in free chunks - or anything else, for that matter. Our arrays of free pointers (what we call transfer batches) are located in a separate memory region;
  • the quarantine helps mitigate use-after-free to some extent, making it harder for an attacker to reuse a deallocated chunk. This mitigation only goes so far, as a chunk will end up being reused at some point in time (unless you have unlimited memory);
  • the non-standalone version of Scudo also offers the possibility of setting an RSS limit, which results in the allocator returning null pointers if said soft limit is exceeded (or aborting if a hard limit is set); this makes it possible to quickly check the resilience of an application to OOM conditions - I still have to add that feature to the standalone version.
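To illustrate the header-based checks described above, here is a small C++ sketch of a checksummed in-band header and the kind of validation that could happen on deallocation. The field layout, widths and checksum function are assumptions made up for this example and do not match Scudo's actual format.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

enum class State : uint8_t { Available, Allocated, Quarantined };
enum class Origin : uint8_t { Malloc, New, NewArray };

// Hypothetical in-band header preceding every chunk; Scudo's real layout and
// field widths differ.
struct Header {
  uint32_t Size;      // size of the allocation
  State ChunkState;   // available, allocated or quarantined
  Origin AllocOrigin; // malloc, new or new[]
  uint16_t Checksum;  // covers a global secret, the pointer and the fields above
};

// Would be randomized at process startup; not cryptographically strong.
static uint32_t GlobalSecret = 0xA5C3D107u;

uint16_t computeChecksum(const void *Ptr, const Header &H) {
  // Toy mixing function standing in for the real checksum.
  uint32_t C = GlobalSecret ^ static_cast<uint32_t>(reinterpret_cast<uintptr_t>(Ptr));
  C ^= H.Size * 2654435761u;
  C ^= (static_cast<uint32_t>(H.ChunkState) << 8) |
       static_cast<uint32_t>(H.AllocOrigin);
  return static_cast<uint16_t>(C ^ (C >> 16));
}

// Sketch of the consistency checks performed when a chunk is deallocated.
void checkOnDeallocate(void *UserPtr, Origin DeallocOrigin) {
  Header *H = reinterpret_cast<Header *>(UserPtr) - 1;
  if (H->Checksum != computeChecksum(UserPtr, *H)) {
    std::fprintf(stderr, "corrupted header or invalid pointer\n");
    std::abort();
  }
  if (H->ChunkState != State::Allocated) {
    std::fprintf(stderr, "invalid chunk state (double free?)\n");
    std::abort();
  }
  if (H->AllocOrigin != DeallocOrigin) {
    std::fprintf(stderr, "allocation/deallocation type mismatch\n");
    std::abort();
  }
}

int main() {
  // Hypothetical usage: carve a block, fill in its header, validate on free.
  void *Block = std::malloc(sizeof(Header) + 64);
  Header *H = static_cast<Header *>(Block);
  void *UserPtr = H + 1;
  *H = {64, State::Allocated, Origin::Malloc, 0};
  H->Checksum = computeChecksum(UserPtr, *H);
  checkOnDeallocate(UserPtr, Origin::Malloc); // passes; corruption would abort
  std::free(Block);
  return 0;
}

Keeping the checksum in-band like this is what allows a linear overflow from a neighboring chunk to be caught on the next heap operation that inspects the corrupted header.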

Friday, May 10, 2019

What is the Scudo hardened allocator?

I am going to make a small series of posts about the Scudo hardened allocator, starting with some general considerations then getting into technical details.

Scudo is a user-mode allocator which aims to provide additional mitigations against heap-based vulnerabilities, while maintaining good performance. It’s open-source, part of LLVM’s compiler-rt project, and external contributions are welcome.

Scudo is currently the default allocator in Fuchsia, is enabled in some components in Android, and is used in some Google production services. While it was initially implemented on top of some of sanitizer_common’s components, it is being rewritten to be standalone, without dependencies on other compiler-rt parts, for easier use (and additional performance and security benefits).

Why another allocator?

The journey started a few years ago while exploring the landscape of usermode allocators on Linux. It is no secret that Google uses tcmalloc, and in all honesty, the internal version blows everything else away. By a lot. But as was noted by my esteemed former colleagues Sean and Agustin, its resilience to abuse is ... lackluster, to say the least.

To understand our options, let’s have a look at a somewhat typical benchmark for production services at Google, involving a lot of asynchronous threading, protobufs, RPCs and other goodies, all of it running on a 72-core Xeon machine with 512GB of RAM (this is not meant to be the most rigorous of comparisons, but to give you an idea of what’s up). The first metric is the number of Queries Per Second; the second is the peak RSS of the program (as reported by /usr/bin/time).


Allocator               QPS (higher is better)   Max RSS (lower is better)
tcmalloc (internal)     410K                     357MB
                        356K                     1359MB
dlmalloc (glibc)        295K                     333MB
                        142K                     710MB
                        24K                      393MB
                        18K                      458MB
                        FATALERROR**
                        SIGSEGV***
scudo (standalone)      400K                     318MB

* hardened_malloc mostly targets Android, and currently only supports up to 4 arenas, so the comparison is not as relevant since that strongly impacts concurrency. Increasing that number leads to mmap() failures.
** Guarder only supports up to 128 threads by default; increasing that number results in mmap() failures. Limiting the number of threads is the only way I found to make it work, but then the results are not comparable to the others.
*** I really have no idea how real world payloads ever worked with those two.

tcmalloc & jemalloc are fast, but not resilient against heap based vulnerabilities. dlmalloc is, well, sometimes more secure than others, but not as fast. The secure allocators are underperforming, when working at all. I am not going to lie, some benchmarks are less favorable to Scudo, some others more, but this one is representative of one of our target use cases.
The idea of Scudo is to fall in the category of “as fast as possible while being resilient against heap based bugs”. Scudo is not the most secure allocator, but it will (hopefully) make exploitation harder, with a variety of configurable options that allow for increased security (but that comes with a cost in performance and memory footprint, like the Quarantine). It is also meant to be a good working ground for future mitigation (such a memory tagging, or GWP-ASan).

Origins

While various options for improving existing allocators were considered, a meeting with Kostya Serebryany led to the plan of record: building upon the existing sanitizer_common allocator to create a usermode allocator that would be part of LLVM’s compiler-rt project.

The sanitizer allocator, which serves as the base for those of ASan, TSan and LSan, was originally written by Kostya and Dmitry Vyukov, and featured some pretty neat tricks that made it fast and extensible.

A decent number of things had to be changed (memory was allocated from a fixed base address in a predictable fashion, overall memory consumption was on the higher side, etc). The original version targeted Linux only; support then came for other Google platforms, Android first, and then Fuchsia.

Aleksey Shlyapnikov did some work to make Scudo work on Solaris with SPARC ADI for some memory tagging research, but that work was never upstreamed. I will probably revisit that at some point. As for other platforms, they will be up to the community.

Fuchsia decided to adopt Scudo as its default libc allocator, which required rewriting the code to remove dependencies on sanitizer_common - and we are reaching the final stages of the upstreaming process.