Whole Kernel Bitcode

Building the Kernel

Generating whole-program LLVM bitcode for the Linux kernel requires compiling each source file to bitcode and linking the resulting bitcode files in the correct order. The first step is a regular kernel build.

wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.8.9.tar.xz # stable at the time of writing
tar -xf linux-6.8.9.tar.xz
cd linux-6.8.9
make defconfig
make -j$(nproc)

Running the Kernel

Once built, the kernel can be tested using QEMU.

qemu-system-x86_64 -hda ubuntu-22.04-server-cloudimg-amd64.img -m 2G -enable-kvm -nographic -net nic -net user,hostfwd=tcp::2222-:22 -kernel arch/x86/boot/bzImage -append "root=/dev/sda1 console=ttyS0"

LLVM Bitcode Representation

LLVM bitcode is widely used for program analysis, optimizations, and security research. It can be generated for individual source files using clang.

  1. clang -S -emit-llvm <file.c> -o <file.ll> to get the human-readable textual form (LLVM IR).
  2. clang -c -emit-llvm <file.c> -o <file.bc> to get the binary bitcode.

Challenges in Whole-Kernel Bitcode Generation

Compiling an entire project like the Linux kernel to LLVM bitcode is non-trivial because of its complex build system. The kernel build process follows these steps:

  1. Each .c source file is compiled into an object file (.o).
  2. Object files within a subsystem are linked into an archive (built-in.a).
  3. All built-in.a files are linked to generate the final kernel image (vmlinux).

To preserve the correct build order, we must capture and modify the build process dynamically.

Generating Whole-Kernel Bitcode

To obtain bitcode for the entire kernel, each source file must be compiled to LLVM bitcode before linking. The most practical way to achieve this is to instrument the kernel build system, so that the compiler flags and ordering the build already uses can be reused.

Build System Instrumentation

Tools like rizsotto/Bear and trailofbits/blight help capture and modify build commands dynamically:

  • Bear records all executed compilation commands, allowing you to re-run them with modified flags.
      bear -- make -j$(nproc)   # Bear 3.x syntax; Bear 2.x uses: bear make -j$(nproc)
      jq '.[] | .command // .arguments' compile_commands.json   # entries carry either a command string or an arguments list
  • Blight provides pre- and post-build hooks, enabling live modification of build commands.
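
As a rough illustration of "re-running with modified flags", the sketch below rewrites one compile_commands.json entry so that it emits bitcode instead of a native object. The function name to_bitcode_argv and the lists of flags to drop are assumptions made for this example; they are not part of Bear or Blight, and a real kernel build may need more flags filtered out.

```python
import shlex

# Flags dropped because they only concern the native object or its
# dependency file. This selection is an assumption, not an exhaustive list.
TAKES_VALUE = {"-o", "-MF", "-MT", "-MQ"}
DROP_EXACT = {"-MD", "-MMD", "-MP"}
DROP_PREFIXES = ("-Wp,-MD", "-Wp,-MMD")


def to_bitcode_argv(entry, compiler="clang"):
    """Turn one compile_commands.json entry into an argv that emits bitcode."""
    # An entry carries either an "arguments" list or a "command" string.
    argv = list(entry.get("arguments") or shlex.split(entry["command"]))
    args = iter(argv[1:])  # argv[0] is the compiler that was recorded
    kept, output = [], None
    for arg in args:
        if arg in TAKES_VALUE:
            value = next(args, None)  # consume the flag's value as well
            if arg == "-o":
                output = value
            continue
        if arg in DROP_EXACT or arg.startswith(DROP_PREFIXES):
            continue
        kept.append(arg)
    if output is None:
        raise ValueError("entry has no -o output to derive a .bc name from")
    # Place the bitcode next to the object: foo.o becomes foo.bc.
    bitcode = output[:-2] + ".bc" if output.endswith(".o") else output + ".bc"
    return [compiler, *kept, "-emit-llvm", "-o", bitcode]
```

The returned list can be passed to subprocess.run with cwd set to the entry's "directory" field, since recorded paths are relative to it.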

Whole-Program Bitcode Extraction

There are existing tools designed to extract whole-program bitcode from large projects. travitch/whole-program-llvm and SRI-CSL/gllvm come to mind.

However, these tools did not work out of the box for me with the kernel. As a workaround, I built an LLVM + Python tool to generate the whole-kernel bitcode.

My Approach

The tool relies only on the compiler and the kernel build artifacts to generate the whole-kernel bitcode:

  1. Compiles the kernel with LLVM/clang.
  2. Uses the compile_commands.json produced during the clang build to compile each source file into a bitcode file.
  3. Reads the .built-in.a.cmd files generated by the kernel build system to find the correct order in which to link the bitcode files.
  4. Links the bitcode files in that order to generate the whole-kernel bitcode.
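
Steps 3 and 4 can be sketched roughly as follows. This is not the code from the tool itself. It assumes each .built-in.a.cmd file records a command of the form printf "<dir>/%s " <members> | xargs <ar> ..., which is what recent kernels appear to write but which may differ between versions. The helper names (parse_members, cmd_path, link_order, llvm_link_argv) and the output name vmlinux.bc are hypothetical.

```python
import re

# Assumed shape of a .built-in.a.cmd file (may vary across kernel versions):
#   savedcmd_drivers/built-in.a := rm -f drivers/built-in.a; \
#       printf "drivers/%s " base/built-in.a core.o | xargs llvm-ar cDPrST ...
PRINTF_RE = re.compile(r'printf\s+"([^"%]*)%s\s*"\s+(.*?)\s*\|\s*xargs')


def parse_members(cmd_text):
    """Return the archive members, in order, recorded in one .cmd file."""
    match = PRINTF_RE.search(cmd_text)
    if match is None:
        return []  # empty archive, or a format this sketch does not know
    prefix, names = match.groups()
    return [prefix + name for name in names.split()]


def cmd_path(archive):
    """Map dir/built-in.a to dir/.built-in.a.cmd."""
    head, _, tail = archive.rpartition("/")
    return f"{head}/.{tail}.cmd" if head else f".{tail}.cmd"


def link_order(archive, read_text):
    """Expand nested built-in.a archives depth-first into an ordered .bc list."""
    ordered = []
    for member in parse_members(read_text(cmd_path(archive))):
        if member.endswith("built-in.a"):
            ordered.extend(link_order(member, read_text))
        elif member.endswith(".o"):
            ordered.append(member[:-2] + ".bc")
    return ordered


def llvm_link_argv(bitcode_files, output="vmlinux.bc"):
    """Build the llvm-link command line for the ordered bitcode files."""
    return ["llvm-link", "-o", output, *bitcode_files]
```

Here read_text is any callable that returns the contents of a path, for example lambda p: open(p).read() when run from the kernel build directory. The sketch ignores composite objects that are themselves linked from several .o files, which a complete tool would also need to expand.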

You can find the tool at akshithg/whole-kernel-bitcode.