Whole Kernel Static Analysis

Created: February 12, 2023
Updated: June 01, 2023

Content

  • Environment
  • Custom kernel
  • Intermediate Representation
    • Whole Program Bitcode
  • Static Analysis
    • Alias Analysis
      • Data-flow Analysis
      • Type-based Analysis

Environment

  1. QEMU - virtual machine

    apt install qemu-system-x86
    
  2. Disk - Ubuntu 22.04 Server Cloud Image + Seed

  3. Boot

    qemu-system-x86_64 -hda ubuntu-22.04-server-cloudimg-amd64.img -m 2G -enable-kvm -nographic -net nic -net user,hostfwd=tcp::2222-:22
    

Custom Kernel

  1. Build

    wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.8.9.tar.xz # stable at the time of writing
    tar -xf linux-6.8.9.tar.xz
    cd linux-6.8.9
    make defconfig
    make -j$(nproc)
    
  2. Boot

    qemu-system-x86_64 -hda ubuntu-22.04-server-cloudimg-amd64.img -m 2G -enable-kvm -nographic -net nic -net user,hostfwd=tcp::2222-:22 -kernel arch/x86/boot/bzImage -append "root=/dev/sda1 console=ttyS0"
    

Intermediate Representation

Various analysis tools use the LLVM bitcode representation of the source. Compiling a single source file into bitcode is straightforward.

  1. clang -S -emit-llvm <file.c> to get the readable bitcode.
  2. clang -c -emit-llvm <file.c> to get the binary bitcode.

However, compiling large projects like the Linux kernel into bitcode is not as straightforward.

At a high level, each *.c source file is compiled into an object file. Object files .o in a sub-system are linked together to into an intermediate built-in.a file. Finally, all the built-in.a files are linked together to get the final kernel image vmlinux.

Whole Program Bitcode

To get the bitcode representation of the whole kernel, each source is compiled to bitcode and then linked in the same order the kernel build system does. One way to do it us a build system instrumentation tool like rizsotto/Bear to capture the build commands and execute the modified command or use a tool like my personal favorite trailofbits/blight which instruments the build system with pre and post build hooks to capture the build commands and modify them on the fly.

There are tools that already do something similar like travitch/whole-program-llvm and SRI-CSL/gllvm and work for all large projects not only the kernel. But these tools did not work for the kernel for reasons I'll have to investigate later.

For now, I use a simple python script not that generates the whole program bitcode for the linux kernel. The script is available at akshithg/whole-kernel-bitcode

Static Analysis

Alias Analysis

Andersen's pointer analysis is a flow-insensitive, context-insensitive, and field-sensitive pointer analysis algorithm. It is a data-flow analysis that computes inclusion constraints between points-to sets of pointers.

andersens pointer analysis

StatementInclusion Constraint
a = bPointsToSet(b) ⊆ PointsToSet(a)
a = &bb ∈ PointsToSet(a)
a = *b∀c ∈ PointsToSet(b): PointsToSet(c) ⊆ PointsToSet(a)
*a = b∀c ∈ PointsToSet(a): PointsToSet(b) ⊆ PointsToSet(c)
InputsDescription
Aa set of n pointers
Sa set of m statements, where m ≤ 4*n^2 unique statements
OutputDescription
ExhaustiveReturn PointsToSet(a) for all pointers a ∈ A
On-demandReturn if b ∈ PointsToSet(a) for a specific pair a, b ∈ A

  1. [POPL 2021] The Fine-Grained and Parallel Complexity of Andersen's Pointer Analysis