heap-memory

A crude attempt at memory management

The other day I had a bit of a challenge to deal with. My workplace makes embedded data collection devices which are built around the Texas Instruments CC2538 SoC (internal photos visible here) and run OpenThread. To date, everything we’ve made has been an externally-powered device, running off either DC power (9-30V) or mains (120/240V 50/60Hz AC). CC2592 range extender support was added to OpenThread for this device.

The CC2538, although very light on RAM (32KiB), gets the job done with some constraints. Necessity threw us a curve-ball the other day, we wanted a device that ran off a battery. That meant going into sleep mode periodically, deep sleep! The CC2538 has a number of operating modes:

  1. running mode (pretty much everything turned on)
  2. light sleep mode (clocks, CPU and power stays on, but we pause a few peripherals)
  3. deep sleep mode — this comes in four flavours
    • PM0: Much like light-sleep, but we’ve got the option to pause clocks to more peripherals
    • PM1: PM0, plus we halt the main system clock (32MHz crystal or 16MHz RC), halting the CPU
    • PM2: PM1 plus we power down the bottom 16KiB of RAM and some other internal peripherals
    • PM3: PM2 plus we turn off the 32kHz crystal used by the sleep timer and watchdog.

We wanted PM2, which meant while we could use the bottom 16KiB of RAM during run-time, the moment we went to sleep, we had to forget about whatever was kept in that bottom 16KiB RAM — since without power it would lose its state anyway.

The challenge

Managing RAM in a device like this is always a challenge. malloc() is generally frowned upon, however in some cases it’s a necessary evil. OpenThread internally uses mbedTLS and that, relies on having a heap. It can use one implemented by OpenThread, or one provided by you. Our code also uses malloc for some things, notably short-term tasks like downloading a new configuration file or for buffering serial traffic.

The big challenge is that OpenThread itself uses a little over 9KiB RAM. We have a 4KiB stack. We’ve got under 3KiB left. That’s bare-bones OpenThread. If you want JOINER support, for joining a mesh network, that pulls in DTLS, which by default, will tell OpenThread to static-allocate a 6KiB buffer.

9KiB becomes about 15KiB; plus the stack, that’s 19KiB. This is bigger than 16KiB — the linker gives up.

Using heap memory

There is a work-around that gets things linking; you can build OpenThread with the option OPENTHREAD_CONFIG_HEAP_EXTERNAL_ENABLE — if you set this to 1, OpenThread forgoes its own heap and just uses malloc / free instead, implemented by your toolchain.

OpenThread builds and links in 16KiB RAM, hooray… but then you try joining, and; NoBufs is the response. We’re out of RAM. Moving things to the heap just kicked the can down the road, we still need that 6KiB, but we only have under 3KiB to give it. Not enough.

We have a problem in that, the toolchain we use, is built on newlib, and while it implements malloc / free / realloc; it does so with a primitive called _sbrk(). We define a pointer initialised up the top of our .bss, and whenever malloc needs more memory for the heap, it calls _sbrk(N); we grab the value of our pointer, add N to it, and return the old value. Easy.

Except… we don’t just have one memory pool now, we have two. One of which, we cannot use all the time. OpenThread, via mbedTLS also winds up calling on malloc() very early in the initialisation (as early as the otInstanceInitSingle() call to initialise OpenThread). We need that block of RAM to wind up in the upper 16KiB that stays powered on — so we can’t start at address 0x2000:0000 and just skip over .data/.bss when we run out.

malloc() will also get mighty confused if we suddenly hand it an address that’s lower than the one we handed out previously. We can’t go backwards.

I looked at replacing malloc() with a dual-pool-aware version, but newlib is hard-coded in a few places to use its own malloc() and not a third-party one. picolibc might let us swap it out, but getting that integrated looked like a lot of work.

So we’re stuck with newlib‘s malloc() for better or worse.

The hybrid approach

One option, we can’t control what malloc the newlib functions use. So use newlib‘s malloc with _sbrk() to manage the upper heap. Wrap that malloc with our own creation that we pass to OpenThread: we implement otPlatCAlloc and otPlatFree — which are essentially, calloc and free wrappers.

The strategy is simple; first try the normal calloc, if that returns NULL, then use our own.

Re-purposing an existing allocator

The first rule of software engineering, don’t write code you don’t have to. So naturally I went looking for options.

Page upon page of “No man don’t do it!!!”

jemalloc looked promising at first, it is the FreeBSD malloc(), but that there, lies a problem — it’s a pretty complicated piece of code aimed at x86 computers with megabytes of RAM minimum. It used uint64_ts in a lot of places and seemed like it would have a pretty high overhead on a little CC2538.

I tried avr-libc‘s malloc — it’s far simpler, and actually is a free-list implementation like newlib‘s version, but there is a snag. See, AVR microcontrollers are 8-bit beasts, they don’t care about memory alignment. But the Cortex M3 does! avrlibc_malloc did its job, handed back a pointer, but then I wound up in a HARDFAULT condition because mbedTLS tried to access a 32-bit word that was offset by a few bytes.

A simple memory allocator

The approach I took was a crude one. I would allocate memory in fixed-sized “blocks”. I first ran the OpenThread code under a debugger and set a break-point on malloc to see what sizes it was asking for — mostly blocks around the 128 byte mark, sometimes bigger, sometimes smaller. 64-byte blocks would work pretty well, although for initial testing, I went the lazy route and used 8-byte blocks: uint64_ts.

In my .bss, I made an array of uint8_ts; size equal to the number of 8-byte blocks in the lower heap divided by 4. This would be my usage bitmap — each block was allocated two bits, which I accessed using bit-banding: one bit I called used, and that simply reported the block was being used. The second was called chained, and that indicated that the data stored in this block spilled over to the next block.

To malloc some memory, I’d simply look for a string of free blocks big enough. When it came to freeing memory, I simply started at the block referenced, and cleared bits until I got to a block whose chained bit was already cleared. Because I was using 8-byte blocks, everything was guaranteed to be aligned.

8-byte blocks in 16KiB (2048 blocks) wound up with 512 bytes of usage data. As I say, using 64-byte blocks would be better (only 256 blocks, which fits in 64 bytes), but this was a quick test. The other trick would be to use the very first few blocks to store that bitmap (for 64-byte blocks, we only need to reserve the first block).

The scheme is somewhat inspired by the buddy allocator scheme, but simpler.

Bit banding was simple enough; I defined my struct for accessing the bits:

struct lowheap_usage_t {
        uint32_t used;
        uint32_t chained;
};

and in my code, I used a C macro to do the arithmetic:

#define LOWHEAP_USAGE                                                   \
        ((struct lowheap_usage_t*)(((((uint32_t)&lowheap_usage_bytes)   \
                                     - 0x20000000)                      \
                                    * 32)                               \
                                   + 0x22000000))

The magic numbers here are:

  • 0x20000000: the start of SRAM on the CC2538
  • 0x22000000: the start of the SRAM bit-band region
  • 32: the width of each word in the CC2538

Then, in my malloc, I could simply call…

struct lowheap_usage_t* usage = LOWHEAP_USAGE;

…and treat usage like an array; where element 0 was the usage data for the very first block down the bottom of SRAM.

To implement a memory allocator, I needed five routines:

  • one that scanned through, and told me where the first free block was after a given block number (returning the block number) — static uint16_t lowheap_first_free(uint16_t block)
  • one that, given the start of a run of free blocks, told me how many blocks following it were free — static uint16_t lowheap_chunk_free_length(uint16_t block, uint16_t required)
  • one that, given the start of a run of chained used blocks, told me how many blocks were chained together — static uint16_t lowheap_chunk_used_length(uint16_t block)
  • one that, given a block number and count, would claim that number of blocks starting at the given starting point — static void lowheap_chunk_claim(uint16_t block, uint16_t length)
  • one that, given a starting block, would clear the used bit for that block, and if chained was set; clear it and repeat the step on the following block (and keep going until all blocks were freed) — static void lowheap_chunk_release(uint16_t block)

From here, implementing calloc was simple:

  1. first, try the newlib calloc and see if that succeeded. Return the pointer we’re given if it’s not NULL.
  2. if we’re still looking for memory, round up the memory requirement to the block size.
  3. initialise our starting block number (start_nr) by calling lowheap_first_free(0) to find the first block; then in a loop:
    • find the size of the free block (chunk_len) by calling lowheap_chunk_free_length(start_nr, required_blocks).
    • If the returned size is big enough, break out of the loop.
    • If not big enough, increment start_nr by the return value from lowheap_chunk_used_length(start_nr + chunk_len) to advance it past the too-small free block and the following used chunk.
    • Stop iterating of start_nr is equal to or greater than the total number of blocks in the heap.
  4. If start_nr winds up being past the end of the heap, fail with errno = ENOMEM and return NULL.
  5. Otherwise, we’re safe, call lowheap_chunk_claim(start_nr, required_blocks); to reserve our space, zero out the actual blocks allocated, then return the address of the first block cast to void*.

Implementing free was not a challenge either: either the pointer was above our heap, in which case we simply passed the pointer to newlib‘s free — or if it was in our heap space, we did some arithmetic to figure out which block that address was in, and passed that to lowheap_chunk_release().

I won’t publish the code because I didn’t get it working properly in the end, but I figured I’d put the notes here on how I put it together to re-visit in the future. Maybe the thoughts might inspire someone else. 🙂