Detecting malicious contributions in open-source projects

Created: September 16, 2022
Updated: September 16, 2022
Tags: research

I worked as a Research Intern at the Computer Science Laboratory at SRI International in the summer of 2022.

The main research objective we started off with was:

How do we protect the integrity of open-source software projects from malicious actors and influence operations within the community?

The motivation for this research is that open-source software has become a critical part of our infrastructure, and we have seen multiple attacks on open-source projects that resulted in supply chain attacks and other security incidents downstream. With this larger goal in mind, we first tried to tackle a smaller problem:

Can we detect a malicious set of patches in the Linux kernel repository?

The team had already built a graph-based AI model that provided a great deal of insight into the social dynamics of the Linux kernel community. It was able to detect incidents such as the Hypocrite Commits and other influence operations with low false-positive rates.

My task was to enrich this model using additional information from actual code changes. My plan involved:

  1. Setting up a repository-mining infrastructure to extract code changes from the Linux repository.

  2. Using the concept of change graphs [1] to represent the changes made by a patch.

  3. Classifying the changes across different dimensions, both qualitative and quantitative, and adding the resulting scores to the patch graph.

  4. Taking advantage of the graph structure by using GNNs to train a model that classifies patches as malicious or benign.
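
As a rough illustration of step 1, here is a minimal sketch of mining per-commit change statistics with plain `git log --numstat`. This is not the infrastructure we actually built; the `CommitChange` record, the `COMMIT_MARK` separator, and the function names are hypothetical and exist only for this example.

```python
import subprocess
from dataclasses import dataclass, field

# Hypothetical marker that lets us tell commit header lines from numstat lines.
COMMIT_MARK = "@@commit@@"


@dataclass
class CommitChange:
    sha: str
    author: str
    files: dict = field(default_factory=dict)  # path -> (lines added, lines removed)


def parse_numstat_log(text):
    """Parse the output of `git log --numstat --format=@@commit@@%H%x09%ae`."""
    commits = []
    for line in text.splitlines():
        if line.startswith(COMMIT_MARK):
            sha, author = line[len(COMMIT_MARK):].split("\t", 1)
            commits.append(CommitChange(sha, author))
        elif line.strip() and commits:
            added, removed, path = line.split("\t", 2)
            # git reports "-" instead of a count for binary files; treat as 0.
            counts = tuple(0 if n == "-" else int(n) for n in (added, removed))
            commits[-1].files[path] = counts
    return commits


def mine_range(repo_path, rev_range):
    """Run git on a local clone and parse the changes in a revision range."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat",
         f"--format={COMMIT_MARK}%H%x09%ae", rev_range],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_numstat_log(out)
```

Calling `mine_range` on a local kernel clone with a revision range such as `"v5.10..v5.11"` would return one record per commit. Keeping the parser separate from the `git` invocation makes it easy to test on canned output.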

Step 1 was relatively simple once we figured out the right abstractions and tools. Step 2 seemed simple at first, but it turned out to be a lot more work than I had estimated. The main challenge was producing bitcode while the kernel's build dependencies evolve along with the codebase. We partially solved this thanks to TuxMake [2], a wonderful build tool that anyone building the kernel multiple times a day should be using. We also scoped the problem down by limiting the analysis to a fixed range of commits (still a huge dataset by any measure).

Step 3 was the most interesting part of the project. We had to come up with a set of features which could be extracted from the code changes.
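
To give a flavor of what such features can look like, here is a small sketch that derives a few quantitative features, plus one crude qualitative signal, from a unified diff. The feature names and the `SENSITIVE_CALLS` list are assumptions made up for this example and are not the feature set we used in the project.

```python
import re

# Hypothetical list of calls worth flagging; a real feature set would be richer.
SENSITIVE_CALLS = re.compile(r"\b(kfree|kmalloc|copy_from_user|memcpy)\s*\(")


def diff_features(diff_text):
    """Compute simple per-patch features from unified diff text."""
    feats = {"files": 0, "hunks": 0, "added": 0, "removed": 0,
             "sensitive_lines": 0}
    in_hunk = False  # the "---"/"+++" file headers appear before the first hunk
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            feats["files"] += 1
            in_hunk = False
        elif line.startswith("@@"):
            feats["hunks"] += 1
            in_hunk = True
        elif in_hunk and line[:1] in ("+", "-"):
            feats["added" if line[0] == "+" else "removed"] += 1
            # Qualitative signal: the changed line touches a flagged call.
            if SENSITIVE_CALLS.search(line):
                feats["sensitive_lines"] += 1
    return feats
```

The resulting dictionary can be attached to the corresponding patch node as attributes, which is one simple way to add scores to a graph before training on it.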



[1] Change graphs are a novel representation of code changes that is easier to analyze than the source code itself.