Optimising Tail Recursion and Performance

Recursion is a common programming paradigm where a function calls itself. Tail recursion, a special case, occurs when the recursive call is the function’s last operation. It can be optimized into a loop, improving performance by eliminating function call overhead and stack growth.

Compiler optimisations

Modern compilers are incredibly complex and good at their jobs, some modern compilers will optimise tail recursion for us, meaning that we can enjoy the benefits of writing recursively whilst everything is taken care for us under the hood.

To prove that tail recursion is optimised we will set up a simple experiment.Below is a simple C program that recursively calls the function count() and adds 1 to our total num each time until we reach 500000 where we return.

#include <stdio.h>

int count(int num) {
    if(num == 500000) {
        printf("We did it!\n");
        return 0;
    }

    num = num + 1;

    return count(num);
}

int main() {
    count(0);
    return 0;
}

The program should print out We did it once we reach 500000. So let’s compile it with gcc, and then test it.

$ gcc recursion.c -o recursion
$ ./recursion
Segmentation fault (core dumped)

That’s not what we expected, we get the very descriptive error message Segmentation fault instead.

Taking a closer look and debugging our program in gdb shows the function count() called thousands of times, this is not what we expected. We expected the compiler to optimise our recursive function into a loop so we should not be seeing the count() function being called more than once.

It turns out after reading the documentation that the compiler doesn’t optimise our code by default, so let’s try again with optimisations turned on. The flag -O3 tells gcc to use maximum optimisation and apply all of them to our code. A list of optimisations and what they do is here ↗.

$ gcc recursion.c -O3 -o recursion
$ ./recursion
We did it!

Success! Our program runs perfectly without any core dumps.

Analysing the optimisations

We’re not done yet though! Let’s take a look at how the compiler actually optimises these calls. I’m dumping the assembly code generated by gcc by using the -S flag which tells gcc to stop just before it compiles the assembly code.

I’ve stripped away some of the assembly code and only left the relevant bits to make it easier to read.

$ gcc recursion.c -S -o recursion

count:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movl    %edi, -4(%rbp)
    cmpl    $500000, -4(%rbp)
    jne .L2
    movl    $.LC0, %edi
    call    puts
    movl    $0, %eax
    jmp .L3
.L2:
    addl    $1, -4(%rbp)
    movl    -4(%rbp), %eax
    movl    %eax, %edi
    call    count
.L3:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   count, .-count
    .globl  main
    .type   main, @function

As you can see in this unoptimised version, we have a call to count from within the count function.

The cmpl instruction computes a conditional (whether the current count is equal to 500000) and sets a flag. The instruction jne is a conditional branch that jumps to the location L2 if the flag is false, L2 then calls count.

The jmp call is an unconditional branch and terminates the program once the flag is true. This is clearly unoptimised and will add a new level to the stack each time count is called eventually leading to a stack overflow.

$ gcc recursion.c -S -O3 -o recursion

count:
.LFB24:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $.LC0, %edi
    call    puts
    xorl    %eax, %eax
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE24:
    .size   count, .-count
    .section    .text.startup,"ax",@progbits
    .p2align 4,,15
    .globl  main
    .type   main, @function

The compiler does a much better job with the optimisations on, there are no branches as in the previous code and there are no calls to the count function from within itself. So it has successfully optimised the code to exclude tail recursion.

Closing thoughts

Whilst gcc does a good job at optimising it does not apply optimisations by default. Languages such as Java do not optimise tail recursion at all. Therefore where recusrion is critical I strongly recommend you to not rely on the compilers optimisations. Having said that, there’s plenty of cases where using recursion makes your life much easier.

Optimising Tail Recursion and Performance

Compiler optimisations

Analysing the optimisations

Closing thoughts

Continue reading

Splitting text into subtitle pages is harder than it looks