Transcript
Hi, everyone.
My name is Yusheng Zheng.
I'm the founder of eunomia-bpf.
Today, I'm excited to share our project, CodeSurvey, which uses language models
to help us understand large codebases.
First, I will start by talking about what it actually means to understand codebases.
We know large language models can help us understand code to some extent,
but they often struggle when it comes to answering questions about really
big code repositories like the Linux kernel.
Today, I will go over another approach to explore git repos or big codebases,
and I will show an example of how we try to gain insights from the kernel.
And I will also discuss some best practices we've
learned and our next steps.
First, we all know that LLMs can help us find bugs and write code, so they
do understand code to a certain degree.
But understanding code isn't just about fixing bugs or writing functions.
Sometimes we are trying to answer higher-level questions
about the code and how it evolves.
For example, you might want to know how new features impact stability
or performance, or whether there are stages in a component's life cycle.
We may ask how certain features have changed over time, which parts of the
code are most likely to have bugs, or how different features and components
depend on each other. These high-level questions can help us with things like
software design, bug studies, finding root causes of issues, improving the
code review process, and making software more stable.
They can also help us design new debugging tools or CI tools or
make it easier for newcomers to understand open source projects.
So why are we using large language models for this?
Traditional methods are limited.
Data analysis and manual code review are very time consuming, and even then,
they often don't tell the full story.
There are so many insights hidden in comments, emails, and other
kinds of unstructured data.
This data produced during software development could be really valuable,
but if we only rely on traditional methods, we are often limited by things
like scale, bias, and subjectivity.
If you just try to apply large language models to these use cases, they
also face their own challenges when it comes to understanding large codebases.
For example, with in-context learning, we can put snippets of code into the
model's context, but this has strict limits due to the context length.
We can't load an entire repository into it.
There is another approach called RAG, where we combine large language models
with external sources of knowledge, but it mostly relies on accurate
retrieval of information, so it's really only good for small, precise
questions about specific code snippets.
Fine-tuning is another option, where you adapt models with specific data,
but it's costly, time consuming, and there's a risk it might get too
narrow and miss the big picture.
These approaches are generally not ideal for answering high-level design
questions, such as how a feature evolved or how components depend on each
other, especially since a repo may have a long development history.
That's why we are proposing a new approach we call CodeSurvey.
CodeSurvey is an LLM-driven methodology, or tool, for analyzing
codebases or git repos.
The basic idea is this: what if we could
ask every developer to take a survey about everything they contribute
during the development process?
So the idea is that we can treat large language models as if they were
human participants in a survey.
Because in a way, software development is a social activity.
Using this approach, we can take everything involved in software
development, such as commits, emails, and discussions, and turn it into a structured dataset.
Then we can run queries, perform quantitative analysis on the development
process, and visualize the results.
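To make that concrete, here is a minimal sketch, not CodeSurvey's actual schema, of what a structured record for a single commit could look like, and how a high-level question then becomes a simple query. The field names and example values are made up for illustration.

```python
# Rough sketch, not CodeSurvey's actual schema: each commit becomes one
# structured record once the LLM has answered the survey about it.
from dataclasses import dataclass, field

@dataclass
class CommitRecord:
    sha: str                       # commit hash
    subject: str                   # first line of the commit message
    author: str
    date: str                      # e.g. "2023-05-01"
    commit_type: str               # LLM-assigned tag, e.g. "bug_fix", "new_feature"
    components: list = field(default_factory=list)  # e.g. ["verifier", "libbpf"]

# With thousands of tagged commits, high-level questions become simple queries.
records = [
    CommitRecord("abc123", "bpf: fix verifier bounds check", "a@example.org",
                 "2023-05-01", "bug_fix", ["verifier"]),
    CommitRecord("def456", "libbpf: add helper wrapper", "b@example.org",
                 "2023-06-10", "new_feature", ["libbpf", "helpers"]),
]
verifier_bug_fixes = [r for r in records
                      if r.commit_type == "bug_fix" and "verifier" in r.components]
print(len(verifier_bug_fixes))
```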
Here is how CodeSurvey works.
First, we have human experts and LLM agents working together to come
up with high-level questions.
These questions can help us figure out what we need to know.
Then, we identify the required data types and sources to answer these questions.
After that, we design survey questions based on the data we collected.
The LLMs then use these questions to process the data and answer the survey.
After this, we have either human experts or more LLM agents evaluate the survey
results to make sure they are accurate.
If needed, we refine the questions or the survey process based on the feedback.
Finally, we end up with a dataset that is tagged and organized, allowing us to
analyze the data in a structured way.
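As a rough sketch of that loop, the steps could be wired together like this. The `ask_llm`, `evaluate`, and `refine` helpers are hypothetical placeholders for the model calls and the human or LLM review described above, not CodeSurvey's real implementation.

```python
# Minimal sketch of the CodeSurvey loop; ask_llm, evaluate, and refine are
# placeholders, not the project's real implementation.

def ask_llm(prompt: str) -> str:
    # Placeholder: a real run would call an LLM API here.
    return "not_sure"

def run_survey(commits, prompt_template):
    tagged = []
    for commit in commits:                              # commit: dict of fields
        prompt = prompt_template.format(**commit)       # fill the survey form
        tagged.append({**commit, "answer": ask_llm(prompt)})
    return tagged

def evaluate(tagged):
    # Placeholder: human experts or a second LLM pass spot-check a sample and
    # report agreement plus any questions that turned out to be ambiguous.
    return {"agreement": 0.9, "flagged_questions": []}

def refine(prompt_template, report):
    # Placeholder: rewrite unclear questions based on the evaluation feedback.
    return prompt_template

def code_survey(commits, prompt_template, max_rounds=3):
    tagged = []
    for _ in range(max_rounds):
        tagged = run_survey(commits, prompt_template)   # LLMs fill out the survey
        report = evaluate(tagged)                       # check accuracy on a sample
        if report["agreement"] >= 0.85:                 # accurate enough: stop
            break
        prompt_template = refine(prompt_template, report)
    return tagged                                       # tagged, structured dataset
```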
For example, we create a survey definition to classify commits, asking
about the main type of each commit with several choices like bug
fix, new feature, and so on.
We also provide some context for the survey so the language model
understands the background, like a description and hints.
These survey definitions are then turned into forms, like this one,
which include the commit details, the description, and the questions we ask.
With this prompt, we can process thousands of commits, or even hundreds
of thousands of commits, to generate a meaningful dataset.
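The real survey definitions live in the CodeSurvey repo; the snippet below is only an illustrative sketch of what such a definition and the rendered form could look like, with made-up choices and commit details.

```python
# Illustrative sketch only; see the CodeSurvey repo for the real definitions.
survey = {
    "context": "You are a Linux kernel developer answering a survey about "
               "your own commits to the eBPF subsystem.",
    "question": "What is the main type of this commit?",
    "choices": ["bug_fix", "new_feature", "cleanup", "documentation",
                "test", "not_sure"],
}

form_template = """{context}

Commit: {sha}
Subject: {subject}
Message:
{message}

{question}
Answer with exactly one of: {choices}.
If there is not enough information, answer "not_sure".
"""

commit = {
    "sha": "abc123",
    "subject": "bpf: fix verifier bounds check",
    "message": "Fix an off-by-one error in the verifier's bounds check ...",
}
form = form_template.format(**commit,
                            context=survey["context"],
                            question=survey["question"],
                            choices=", ".join(survey["choices"]))
print(form)  # this rendered form is what the LLM fills in, commit by commit
```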
Now, let's look at our case study on the eBPF subsystem in the Linux kernel.
eBPF is a subsystem that allows you to run programs in the kernel.
It's like having a virtual machine in the kernel.
So why do we pick eBPF?
First, because we know a lot about it, which means we can ask domain
experts to confirm our findings, since our organization
mainly works on eBPF ourselves.
Second, eBPF is also a rapidly evolving subsystem with a lot
of complexity and functionality.
We applied CodeSurvey to analyze the evolution of eBPF and gain insights
that users are interested in.
Here are some of the initial results.
For example, we observe feature changes over time.
Since eBPF is event driven, it can attach to multiple events, like network
events, tracing events, profiling events, or security hooks.
We could see how these events have changed over time, which reflects
the evolving use cases for eBPF.
Initially, eBPF was designed for network-related features, but now it has expanded
to include more features like struct_ops or security and management controls.
You can see that in the figure.
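A sketch of the kind of aggregation behind this result: once commits carry an LLM-assigned event-type tag, counting them per year is straightforward. The tags and records below are invented for illustration.

```python
# Sketch of the aggregation behind "feature changes over time": count
# survey-tagged commits per year and attach-event type. Records are invented.
from collections import Counter

tagged_commits = [
    {"year": 2015, "event_type": "networking"},
    {"year": 2020, "event_type": "tracing"},
    {"year": 2023, "event_type": "security"},
    # ... hundreds of thousands of survey-tagged commits ...
]

per_year = Counter((c["year"], c["event_type"]) for c in tagged_commits)
for (year, event_type), count in sorted(per_year.items()):
    print(year, event_type, count)   # feed this into any plotting library
```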
We also look into the interdependencies between features and components:
for example, when new features are added, which components need to be modified.
We found that the syscall interface and the libbpf library are closely related.
Some features require changes to both libbpf helpers and kernel functions.
They are also tightly coupled with the eBPF verifier: adding new helpers or
kfuncs usually involves updating the eBPF verifier to add more logic.
When we add runtime features in eBPF, like bpf_link, this
often requires updates to multiple components at the same time.
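One way to surface such coupling, sketched here with invented component tags, is to count how often two components are touched by the same commit.

```python
# Sketch of a component co-occurrence count: how often two components are
# modified by the same commit. Component tags are invented for illustration.
from collections import Counter
from itertools import combinations

commit_components = [
    ["syscall_interface", "libbpf"],
    ["verifier", "helpers"],
    ["libbpf", "verifier", "syscall_interface"],
]

pair_counts = Counter()
for components in commit_components:
    for a, b in combinations(sorted(set(components)), 2):
        pair_counts[(a, b)] += 1

# High counts suggest tightly coupled components.
print(pair_counts.most_common(5))
```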
We also found, for example, that adding new eBPF
instructions often introduces more types of bugs in the eBPF verifier.
In this figure, we compare the total number of bugs in the verifier with the
addition of new instruction features, and there is a clear overlap between them.
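The comparison behind that figure can be sketched as two yearly series placed side by side; the numbers below are placeholders, not the study's data.

```python
# Two yearly series placed side by side: verifier bug-fix commits versus
# commits adding new instructions. All numbers are placeholders.
verifier_bug_fixes = {2019: 12, 2020: 18, 2021: 25, 2022: 22}
new_instruction_features = {2019: 2, 2020: 4, 2021: 6, 2022: 5}

for year in sorted(verifier_bug_fixes):
    print(year,
          "bug fixes:", verifier_bug_fixes[year],
          "new instructions:", new_instruction_features.get(year, 0))
# Plotting both series on the same axis makes the overlap visible.
```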
If you are interested in a more detailed report, it's available
on GitHub and in our arXiv paper.
Now let's talk about some best practices we've learned.
Using predefined tags or categories helps us standardize the
responses, which makes them clearer and reduces ambiguity.
We implement workflows for LLM agents to review and refine their
answers, which improves accuracy.
We also allow the models to answer "not sure"
if they don't have enough information, which prevents them from giving
misleading answers or hallucinating.
We start with pilot testing and refine our survey iteratively,
making sure the questions are clear and logically consistent.
Finally, we make sure to evaluate the data for consistency and
catch any contradictory answers.
This approach helps us improve the reliability and accuracy
of the data generated by LLMs.
It can also be applied beyond CodeSurvey, to other tagging and annotation tasks.
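As an illustration of those practices, a small validation pass might enforce the predefined tag set, honor the "not sure" option, and flag contradictory answers for review. The tag names and rules here are assumptions, not CodeSurvey's actual checks.

```python
# Hypothetical validation pass, not CodeSurvey's actual checks: enforce the
# predefined tag set (including "not_sure") and flag contradictory answers.
ALLOWED_TYPES = {"bug_fix", "new_feature", "cleanup", "documentation",
                 "test", "not_sure"}

def validate(record):
    issues = []
    if record["commit_type"] not in ALLOWED_TYPES:
        issues.append(f"unknown tag: {record['commit_type']}")
    # Example consistency rule: a commit answered "not_sure" should not also
    # carry a confident list of affected components.
    if record["commit_type"] == "not_sure" and record.get("components"):
        issues.append("answered not_sure but still listed components")
    return issues

answers = [
    {"sha": "abc123", "commit_type": "bug_fix", "components": ["verifier"]},
    {"sha": "def456", "commit_type": "not_sure", "components": ["libbpf"]},
]
for record in answers:
    for issue in validate(record):
        print(record["sha"], "->", issue)   # flagged records go back for review
```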
However, CodeSurvey also has some limitations.
For example, LLMs can make mistakes, so they may sometimes misinterpret
information or hallucinate, providing incorrect answers.
Human expertise is still sometimes needed to help with the design of
survey questions and validate results.
The quality of the results also depends on the data.
If the data is incomplete or there are not enough sources, it can leave gaps
in our understanding.
As a next step, we plan to automate our process even more so that it
requires less human effort.
We are working on making it decentralized and scalable, and we are also
refining the process using more LLM agents to improve efficiency.
Another priority is building a robust framework for evaluation, which
makes the validation more reliable.
And finally, we are planning to apply CodeSurvey to other
projects like LLVM and Kubernetes to see how well it generalizes.
Thanks for listening.
You can find our GitHub repo and arXiv paper at these links.