Conf42 Prompt Engineering 2024 - Online

- premiere 5PM GMT

Can LLMs Understand the Linux Kernel? An LLM-Powered Approach to Understanding Large Codebases

Abstract

Feel lost in massive codebases like the kernel? RAG and fine-tuning fall short on some questions. Since development is a social event, what if we could survey every developer about each commit? With Code-Survey, we treat AI as survey participants, turning unstructured data into actionable insights.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone. My name is Yusheng Zheng. I'm the founder of eunomia-bpf. Today, I'm excited to share our project, Code-Survey, which uses language models to help us understand large codebases. First, I will talk about what it actually means to understand a codebase. We know large language models can help us understand code to some extent, but they often struggle when it comes to answering questions about really big code repos like the Linux kernel. Today, I will go over another approach to exploring Git repos or big codebases, and I will show an example of how we try to gain insights from the kernel. I will also discuss some best practices we've learned and our next steps.
First, we all know that LLMs can help us find bugs and write code, so they do understand code to a certain degree. But understanding code isn't just about fixing bugs or writing functions. Sometimes we are trying to answer more high-level questions about the code and how it evolves. For example, you might want to know how new features impact stability or performance, or whether there are stages in a component's life cycle. We may ask how certain features have changed over time, which parts of the code are most likely to have bugs, or how different features and components depend on each other. These high-level questions can help us with things like software design, bug studies, finding root causes of issues, improving the code review process, and making software more stable. They can also help us design new debugging tools or CI tools, or make it easier for newcomers to understand open source projects.
So why are we using large language models for this? Traditional methods are limited. Data analysis and manual code review are very time-consuming, and even then, they often don't tell the full story. There are so many insights hidden in comments, emails, and other kinds of unstructured data.
This data produced during software development could be really valuable, but if we rely only on traditional methods, we are often limited by things like scale, bias, and subjectivity.
If you just try to apply large language models to these use cases, they face their own challenges when it comes to understanding large codebases. For example, with in-context learning, we can put snippets of code into the model's context, but this has strict limits due to the context length. We can't load an entire repo into it. There is another approach called RAG, where we combine language models with external sources of knowledge, but it relies mostly on accurate retrieval, so it's really only good for small, precise questions about specific code snippets. Fine-tuning is another option, where you adapt models with specific data, but it's costly and time-consuming, and there's a risk the model might get too narrow and miss the big picture. These approaches are generally not ideal for answering high-level design questions that span multiple files or the whole repo. Even more so, a feature may have a long history in the repo's development.
That's why we are proposing a new approach we call Code-Survey. Code-Survey is an LLM-driven methodology and tool for analyzing codebases or Git repos. The basic idea is this: what if we could ask every developer to take a survey about everything they contributed during the development process? So the idea is that we can treat language models as if they were human participants in a survey, because, in a way, software development is a social activity. Using this approach, we can take everything involved in software development, such as commits, emails, and discussions, and turn it into a structured dataset. Then we can run queries, perform quantitative analysis of the development process, and visualize the results.
Here is how Code-Survey works.
First, we have human experts and LLM agents working together to come up with high-level questions. These questions help us figure out what we need to know. Then, we identify the data types and sources required to answer these questions. After that, we design survey questions based on the data we collected. The LLMs then use these questions to process the data and answer the survey. After this, we have either human experts or more LLM agents evaluate the survey results to make sure they are accurate. If needed, we refine the questions or the survey process based on the feedback. Finally, we end up with a dataset that is tagged and organized, allowing us to analyze the data in a structured way.
For example, we create a survey definition for classifying commits, which asks for the main type of each commit with several choices like bug fix or new feature and so on. We also provide some context for the survey, like a description and a hint, so the language model understands the background. These survey definitions are then turned into forms, like this one, which include the commit details, the description, and the question we ask. With this prompt, we can process thousands of commits to generate a meaningful dataset, or even hundreds of thousands of commits.
Now, let's look at our case study on the eBPF subsystem in the Linux kernel. eBPF is a subsystem that allows you to run programs in the kernel. It's like having a virtual machine in the kernel. So why did we pick eBPF? First, because we know a lot about it, and because our organization mainly works on eBPF, we can ask kernel experts to confirm our findings. Second, eBPF is a rapidly evolving subsystem with a lot of complexity and functionality. We applied Code-Survey to analyze the evolution of eBPF and gain insights that we are interested in. Here are some of the initial results.
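As a side note on the survey definitions and forms described above, the following is a minimal Python sketch of how one multiple-choice question could be turned into a text form for a single commit. It is not the project's actual schema or prompt: the `SurveyQuestion` class, the `render_form` function, the field names, and the sample commit id and message are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SurveyQuestion:
    """One multiple-choice survey question (hypothetical schema)."""
    qid: str
    text: str
    choices: list[str]
    hint: str = ""


def render_form(question: SurveyQuestion, commit_id: str, commit_message: str) -> str:
    """Render one survey form as a plain-text prompt for a single commit."""
    lines = [
        "You are answering a survey about a Linux kernel eBPF commit.",
        f"Commit: {commit_id}",
        "Commit message:",
        commit_message.strip(),
        "",
        f"Question: {question.text}",
    ]
    if question.hint:
        lines.append(f"Hint: {question.hint}")
    lines.append("Choose exactly one of:")
    lines += [f"- {choice}" for choice in question.choices]
    return "\n".join(lines)


# Hypothetical question modeled on the "main type of this commit" example.
COMMIT_TYPE = SurveyQuestion(
    qid="commit_type",
    text="What is the main type of this commit?",
    choices=["bug fix", "new feature", "not sure"],
    hint="Judge by the commit message; pick 'not sure' if it is ambiguous.",
)

# Made-up commit id and message, used only to show the rendered form.
form = render_form(COMMIT_TYPE, "abc123", "bpf: fix example issue\n")
print(form)
```

The rendered string would then be sent to a language model once per commit, and the chosen option stored next to the commit id to build the tagged dataset.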
For example, we observe feature changes over time. Since eBPF is event-driven, it can attach to multiple kinds of events, like network events, tracing events, profiling events, or security hooks. We can see how these events have changed over time, which reflects the evolving use cases for eBPF. Initially, eBPF was designed for network-related features, but now it has expanded to include more features like struct_ops or security management and control. You can see that in the figures.
We also looked into the interdependency between features and components: for example, when new features are added, which components need to be modified. We found that the syscall interface and the libbpf library are closely related. Some features require changes to both libbpf helpers and kernel functions. These are also tightly coupled with the eBPF verifier: adding new helpers or kfuncs usually involves updating the verifier to add more logic. When we add runtime features in eBPF, like BPF link [unclear], this often requires updates to multiple components at the same time.
We also found that adding new eBPF instructions often introduces more types of bugs in the eBPF verifier. In this figure, we compare the total number of bugs in the verifier with the addition of new instruction features, and there is a clear overlap between them. If you are interested in a more detailed report, it's available on GitHub and in our arXiv paper.
Now let's talk about some best practices we've learned. Using predefined tags or categories helps us standardize the responses, which makes them clearer and reduces ambiguity. We implement workflows for LLM agents to review and refine their answers, which improves accuracy. We also allow the models to answer "not sure" if they don't have enough information, which prevents them from giving misleading answers or hallucinating.
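The predefined tags and the "not sure" option can be illustrated with a small Python sketch that maps a model's free-text reply onto a fixed set of categories. The tag list, the `normalize_answer` helper, and the sample replies are assumptions made for this illustration. Sending unrecognized replies to "not sure" is one possible design choice, not necessarily what the project does.

```python
from collections import Counter

# Hypothetical predefined tags; "not sure" is an explicit, allowed answer.
ALLOWED_TAGS = ("bug fix", "new feature", "not sure")


def normalize_answer(raw, allowed=ALLOWED_TAGS, fallback="not sure"):
    """Map a raw model reply onto one predefined tag.

    Surrounding whitespace, quotes, and trailing periods are dropped and the
    text is lower-cased. Anything that is still not an allowed tag falls back
    to "not sure" instead of being stored as a new, ambiguous category.
    """
    cleaned = raw.strip().strip(".\"'").lower()
    return cleaned if cleaned in allowed else fallback


# Made-up replies, only to show the normalization.
raw_answers = ["Bug fix.", "new feature", "Probably a cleanup?", "'Not sure'"]
tags = [normalize_answer(answer) for answer in raw_answers]
tally = Counter(tags)
print(tally)
```

Keeping the category set closed means the resulting column can be counted and grouped directly, and a high share of "not sure" answers is a signal that the question or its context needs refining.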
We start with pilot testing and refine our survey iteratively, making sure the questions are clear and logically consistent. Finally, we make sure to evaluate the data for consistency and catch any contradictory answers. This approach helps us improve the reliability and accuracy of the data generated by LLMs. These practices apply not just to Code-Survey but also to other tagging tasks.
However, Code-Survey also has some limitations. For example, LLMs can make mistakes, so they may sometimes misinterpret information or hallucinate and provide incorrect answers. Human expertise is still sometimes needed to help design the survey questions and validate the results. The quality of the results also depends on the data. If the data is incomplete or there are not enough sources, it can leave gaps in our understanding.
For next steps, we plan to automate our process even more so it requires less human effort. We are working on making it decentralized and scalable, and we are also refining the process using more LLMs to improve efficiency and accuracy. Another priority is building a robust framework for evaluation, which makes the validation more reliable. And finally, we are planning to apply Code-Survey to other projects like LLVM and Kubernetes to see how well it scales.
Thanks for listening. You can find our Git repo and arXiv paper at these links.
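The consistency check mentioned among the best practices could look roughly like the Python sketch below. It assumes a hypothetical survey with two related questions per commit, `commit_type` and `fixes_bug`, and flags rows where the two answers contradict each other. The field names, the rule, and the sample rows are invented for illustration and are not taken from the project's dataset.

```python
def find_contradictions(rows):
    """Return the ids of commits whose survey answers contradict each other.

    Rule used here (an assumption): a commit tagged as a "bug fix" should not
    also be answered with "no" for the question of whether it fixes a bug.
    """
    flagged = []
    for row in rows:
        if row["commit_type"] == "bug fix" and row["fixes_bug"] == "no":
            flagged.append(row["commit"])
    return flagged


# Made-up survey rows, only to exercise the check.
rows = [
    {"commit": "c1", "commit_type": "bug fix", "fixes_bug": "yes"},
    {"commit": "c2", "commit_type": "bug fix", "fixes_bug": "no"},
    {"commit": "c3", "commit_type": "new feature", "fixes_bug": "not sure"},
]

flagged = find_contradictions(rows)
print(flagged)
```

Flagged rows can then be re-asked or passed to a human expert, which fits the review-and-refine loop described in the talk.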

Yusheng Zheng

Founder @ eunomia-bpf
