Tuesday, May 14, 2019

High level overview of Scudo

In this post, I will go through some high-level details of the allocator's architecture and some of the security features it offers. Some notions will only be skimmed, in the hope of covering them in detail in a later post (depending on my free time).

Scudo is made up of the following components:

  • a "primary" allocator: this is a fast allocator, servicing smaller sized requests (the size limit is configurable at compile time). It is "segregated", i.e. chunks of the same size end up in the same memory region, which is compartmentalized from other regions (the separation is stronger on 64-bit, where a memory area is specifically reserved for the primary regions); chunks allocated by the primary are randomized to avoid predictable address sequences (note that the larger the size, the more predictable the addresses are relative to each other). A couple of side effects of this design are that there is no such thing as coalescing contiguous blocks, and that the memory used by the primary is never unmapped - but it can be reclaimed. While we are trying to focus on 64-bit, there is a 32-bit primary, mostly due to Android;
  • a "secondary" allocator: which wraps the platform memory allocation primitives, and as such is slower and used to service larger sized allocations. Allocations fulfilled by the secondary are surrounded by guard pages;
  • local caches: those are thread-specific stashes, holding pointers to free blocks in order to relieve contention over the global free-list. There are two models: exclusive and shared. With the exclusive model, there is a unique cache per thread, which is more memory hungry but mostly free of contention. With the shared model, threads share a set number of caches, which can be dynamically reassigned at runtime based on contention - this uses less memory than the exclusive model and usually better fits the needs of end-user platforms;
  • a "quarantine": this can be equated to a heap-wide delayed free-list, holding recently freed blocks until a criterion is met (usually, a certain size is reached), before returning them to the primary or secondary for reuse. There is a thread-specific quarantine and a global quarantine, to avoid global locking as much as possible. This is the most impactful component in terms of memory usage and, to some extent, performance: even small quarantines will have a large impact on a process's RSS, and the quarantine effectively kills locality, making any sort of memory cache less useful. As such, it is disabled by default, and can be enabled on a per-process basis (and sized according to the process's needs).
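
To make the quarantine idea a bit more concrete, here is a minimal Python sketch of a size-bounded delayed free-list. It is a toy model and not Scudo's actual code: the `Quarantine` class, its `max_bytes` limit and the oldest-first draining policy are assumptions made for illustration, and it ignores the thread-specific/global split.

```python
from collections import deque


class Quarantine:
    """Toy model of a delayed free-list bounded by total byte size."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # hypothetical size limit
        self.held = deque()         # (address, size) pairs, oldest first
        self.held_bytes = 0

    def put(self, address, size):
        """Quarantine a freed block; return the addresses released for reuse."""
        self.held.append((address, size))
        self.held_bytes += size
        released = []
        # Drain the oldest entries until we are back under the limit.
        while self.held_bytes > self.max_bytes:
            old_address, old_size = self.held.popleft()
            self.held_bytes -= old_size
            released.append(old_address)
        return released
```

In this model, a freed block only becomes available for reuse once enough later frees have pushed it out, which is also why the quarantine inflates RSS: everything it holds is memory that cannot be handed out yet.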

Now for some security "features":

  • strong size and alignment requirements: we enforce maximum sizes and alignment values, but also check that the pointers provided are properly aligned; those are cheap checks that avoid integer overflows and catch low-hanging deallocation errors (or abuse);
  • each chunk is preceded by a header, that stores basic information about the allocation, and is checksummed to be able to detect corruption.
    While the in-band vs out-of-band metadata debate divides people, the choice of an in-band header was made to be able to detect linear {over,under}flows (at least until we get memory tagging).
    The checksum of the header involves a global secret, the pointer being dealt with, and the content of the header - it is not meant to be cryptographically strong. As for the data stored in the header, it holds the size of the allocation, the state of the chunk (available, allocated, quarantined), its origin (malloc, new, new[]) and some internal data. Headers are manipulated atomically to detect race attempts between threads operating on the same chunk.
    As is usually the case with this type of mitigation, inconsistencies are only detected when the header is checked, which usually means that a heap operation has to occur on the chunk in question.
    Overall, this allows for several security checks:
    • ensure that a pointer being deallocated actually points to a chunk, otherwise the checksum verification will fail. Some other allocators gladly accept a pointer to the middle of a chunk for deallocation, but we do not;
    • ensure that the state of a chunk is consistent with the operation being carried out. This allows for detection of double-frees and the like;
    • ensure that a sized-deallocation is valid for the targeted chunk, which led to the discovery of an Intel C Compiler bug, and prevents related abuse;
    • ensure that the deallocation function is consistent with the allocation function that returned the targeted chunk (e.g. free/malloc, delete/new);
  • we randomize everything we can, to reduce predictability as much as possible; one of the side benefits of the thread caches is that they can make it more difficult for an attacker to get the chunks they want in the state they need, if they leverage allocation primitives in different threads;
  • guard pages are added when deemed useful;
  • we do not store pointers in free chunks, or anything else really. Our arrays of free pointers (what we call transfer batches) are located in a separate memory region;
  • the quarantine helps mitigate use-after-free to some extent, making it harder for an attacker to reuse a deallocated chunk. This mitigation only goes so far, as a chunk will end up being reused at some point in time (unless you have unlimited memory);
  • the non-standalone version of Scudo also offers the possibility to set an RSS limit, which results in the allocator returning null pointers if a soft limit is exceeded (or aborting if a hard limit is set); this makes it possible to quickly check the resilience of an application to OOM conditions - I still have to add that feature to the standalone version.
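
The header checks listed above can be sketched as follows. This is a simplified Python model under stated assumptions, not Scudo's implementation: the field layout, the `SECRET` value, the use of `zlib.crc32` as the checksum, the dictionary standing in for the in-band header and the error messages are all hypothetical.

```python
import struct
import zlib
from enum import IntEnum


class State(IntEnum):
    AVAILABLE = 0
    ALLOCATED = 1
    QUARANTINED = 2


class Origin(IntEnum):
    MALLOC = 0
    NEW = 1
    NEW_ARRAY = 2


SECRET = 0x5EC2E7C0DE  # stand-in for a per-process random secret


def checksum(pointer, state, origin, size):
    """Non-cryptographic checksum over the secret, the pointer and the header fields."""
    data = struct.pack("<QQBBQ", SECRET, pointer, state, origin, size)
    return zlib.crc32(data)


def make_header(pointer, origin, size):
    """Build the header of a freshly allocated chunk."""
    state = State.ALLOCATED
    return {"state": state, "origin": origin, "size": size,
            "checksum": checksum(pointer, state, origin, size)}


def deallocate(pointer, header, origin, size=None):
    """Validate a deallocation request; raise RuntimeError on any inconsistency."""
    expected = checksum(pointer, header["state"], header["origin"], header["size"])
    if header["checksum"] != expected:
        # Also catches pointers into the middle of a chunk: the pointer is
        # part of the checksummed data.
        raise RuntimeError("corrupted header or invalid pointer")
    if header["state"] != State.ALLOCATED:
        raise RuntimeError("invalid chunk state")  # e.g. a double-free
    if size is not None and size != header["size"]:
        raise RuntimeError("invalid sized deallocation")
    if header["origin"] != origin:
        raise RuntimeError("allocation type mismatch")  # e.g. free() after new
    # Move the chunk to its next state and re-seal the header.
    header["state"] = State.QUARANTINED
    header["checksum"] = checksum(pointer, header["state"], header["origin"], header["size"])
```

Because the pointer itself goes into the checksum, a header that is valid for one address does not validate for a neighbouring one, and because the state is part of the header, a second `deallocate` on the same chunk fails the state check rather than silently succeeding.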
