Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Pavel, and today we're going to talk about NLP techniques for getting more insights from git commit messages. I'll show you what we can do with git commit message history to learn more about our projects, our team members, a project's maturity stage, or even a whole portfolio of projects. I hope this video will be interesting for team leads, managers, and HR folks who are interested in getting more context about their projects and organization.
The use cases I will be describing here are theoretical. We will use an open source project for the analysis, but I hope the examples are close enough to the real processes in software development companies.
Before we start, let me introduce myself. My name is Pavel Perfilov. I have 15+ years of experience in fintech, and during my career I have worked as a developer, engineer, project manager, and product manager. I hold a master's degree in finance and a master's degree in computer science. I'm very enthusiastic about data engineering and the practical usage of ML. A small disclaimer here: I'm not representing any of my employers, and I'm speaking for myself. Again, the examples I will be showing you are theoretical, and the projects we will use for the analysis are open source.
Feel free to reach out to me on LinkedIn and download the notebook from my GitHub. Let's begin with the theory.
Here are the four building blocks of classical management: planning, organizing, leading, and controlling; and the four building blocks of people management: recruiting, training, evaluating, and motivating. Is that enough to start managing people and projects? It's just theory, and it misses information about the culture and environment, and it misses the emotions and sentiments of individuals.
I'll give you a practical example of the problem. Imagine that a software development company hires a new project manager, and he gets five projects which have already been running for quite some time. He needs to read and process a huge amount of information to get up to speed. Most likely the main sources of information would be Jira tickets, requirements, the project plan, and the documentation, and he would need to talk to many people to get an overview. But that might not be colorful enough to get a sense of what is going on in reality.
From a time perspective, it might take a few months or even a year to get some understanding of people's behavior, get a feel for their emotions, and learn about individual profiles and communication styles so as to become efficient in the team. But how can we get these insights faster? I'll try to answer these questions in this video, and we'll be using NLP on one non-obvious data source: git commit messages.
Let's look at git commit messages from the angle of the different roles in the team. Most of these roles would not use them as a data source: too noisy, too low level, a lot of text, and most people would not be able to extract meaningful information. But NLP can help with that. From my personal experience, I can tell you that commit messages can produce enough insights for all of the managerial roles in the company. I'll try to show you some examples to prove it.
Okay, now we understand the problem, and there are a lot of questions and inspiration, but how can we turn data into insights? Let's talk about NLP. What is NLP? NLP stands for natural language processing, which helps to turn words, sentences, or any text into numbers. We'll skip the theory, as I want to focus on practical usage. NLP techniques are used for sentiment analysis and categorization of text: they can tag data, classify data, and provide emotional levels. Here are some Python libraries which I will be showing you, and there are many more libraries which are not in the scope of this video. Let's begin with coding. Here are the libraries that you would need to install
to run the notebook. We select the Python kernel, import NLTK and pandas, and download the pandas repo from GitHub. This repo of the most popular data science library takes some time to download. Okay, let's run the second cell. Let's grab the messages, the commit times,
and emails. All right, we have a result: the resulting dataframe has the three columns I expected, and its shape shows about 35 thousand commits. Let's preprocess the messages.
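As a rough sketch of this extraction step (the `git log` format string, the separator, and the column names here are my assumptions, not necessarily the notebook's exact code):

```python
import pandas as pd

# Sketch: parse `git log` output into the three-column DataFrame used in the
# talk. In a real repo the text would come from something like:
#   git log --pretty=format:'%ad|%ae|%s' --date=iso
def parse_git_log(text):
    rows = [line.split("|", 2) for line in text.strip().splitlines()]
    df = pd.DataFrame(rows, columns=["commit_time", "email", "message"])
    df["commit_time"] = pd.to_datetime(df["commit_time"])
    return df

sample_log = (
    "2022-05-01 10:00:00 +0000|dev@example.com|BUG: fix off-by-one in parser\n"
    "2022-05-02 11:30:00 +0000|dev2@example.com|ENH: add rolling window option"
)
df = parse_git_log(sample_log)
print(df.shape)  # (2, 3)
```

On the real repo this produces one row per commit, ready for the preprocessing below.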
Let's delete the git keywords, the CI/CD keywords, emails, HTTP links, and merge pull request messages. We need to make sure the message text looks good before we start doing the sentiment analysis.
We also see a lot of abbreviations here, like DOC and a few others, so it might make sense to clean these up as well. Here is the cleaned version of the message: we just apply a regex to delete the words that we don't want. We also extracted the abbreviations; here is the longest one. It seems like the developer was a little bit annoyed by something here.
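A cleaning sketch along these lines might look as follows; the patterns are illustrative assumptions, not the notebook's exact regexes:

```python
import re

# Strip HTTP links, email addresses, and pull-request boilerplate, then pull
# out the ALL-CAPS abbreviations used as commit prefixes.
NOISE_PATTERNS = [
    r"https?://\S+",             # HTTP links
    r"\S+@\S+\.\S+",             # email addresses
    r"Merge pull request #\d+",  # pull-request boilerplate
]

def clean_message(msg):
    for pattern in NOISE_PATTERNS:
        msg = re.sub(pattern, "", msg, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", msg).strip()  # collapse leftover whitespace

def extract_abbreviations(msg):
    return re.findall(r"\b[A-Z]{2,}\b", msg)

msg = "BUG: fix crash, see https://example.com (reported by a@b.com)"
print(clean_message(msg))
print(extract_abbreviations(msg))  # ['BUG']
```

Applied over the whole message column, this gives the cleaned text used for the rest of the analysis.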
Let's start with descriptive statistics. Here's a chart showing the number of contributions per year and the number of unique contributors, unique developers, per year. It reminds me very much of the classical product life cycle: it does look like the project has passed its peak and reached maturity. Let's look at the seasonality.
Are there any patterns? Indeed there are: in the summer there are fewer contributions. And let's look at the top contributors. It seems there are about seven main contributors who contribute constantly. We could extract the word frequencies as
well. But what do we see here? Not a lot of meaningful words; there are words like 'to', 'in', and 'for'. There is a concept of stop words in NLP: a stop word is a word that should be deleted because it doesn't add any additional information to the sentence. Let's check for stop words. Yes, indeed, quite a lot of the words are marked as stop words. After we delete the stop words, the vocabulary looks as we would expect.
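A sketch of this stop-word filtering before counting frequencies; the notebook uses NLTK's English list (`nltk.corpus.stopwords.words("english")`), and a tiny inline subset stands in here so the example is self-contained:

```python
from collections import Counter

# Tiny stand-in subset of NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "to", "in", "for", "of", "and", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

messages = ["fix the bug in the parser", "add support for the new parser"]
tokens = [t for m in messages for t in m.split()]
freq = Counter(remove_stop_words(tokens))
print(freq.most_common(2))  # [('parser', 2), ...]
```

After this filtering, the frequency table is dominated by content words instead of filler.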
Okay, let's move on to tokenization and lemmatization. This step basically standardizes the form of the message, taking the part of speech into account. The NLTK library has a built-in WordNet lemmatizer. You can explore the WordNet lexical database from Princeton University: search for a word, and it gives you the part of speech and the explanation of the word as it appears in the dictionary. So let's apply the tokenizing functions, tag the words by part of speech, and try to count the words again, because this will be more appropriate and more filtered. Yeah, here's how a message looks after lemmatizing.
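One small helper usually accompanies this step: mapping the Penn Treebank tags from `nltk.pos_tag` onto WordNet's four part-of-speech codes. The full NLTK calls are sketched in the comment (they need the punkt, tagger, and wordnet downloads):

```python
# Helper commonly paired with NLTK's WordNetLemmatizer: it maps Penn Treebank
# POS tags (as produced by nltk.pos_tag) onto WordNet's four POS codes.
def treebank_to_wordnet(tag):
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else defaults to noun

# With NLTK installed and its data downloaded, the full step would be roughly:
#   import nltk
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("wordnet")
#   from nltk import word_tokenize, pos_tag
#   from nltk.stem import WordNetLemmatizer
#   lem = WordNetLemmatizer()
#   tagged = pos_tag(word_tokenize("fixed failing tests"))
#   lemmas = [lem.lemmatize(w, treebank_to_wordnet(t)) for w, t in tagged]

print(treebank_to_wordnet("VBD"))  # v
```

Passing the mapped tag matters: without it, the lemmatizer treats every word as a noun and leaves verbs like "fixed" unchanged.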
It's very standardized; there is pretty much no noise at all. And here are the most frequent words in our data. It does look like a developer's vocabulary. Just to compare, here is the original message versus the lemmatized message. By using some lemmas, we can classify messages as features and as bugs. Okay. And we can build a vocabulary for
the bugs and the features. Let's see the statistics for features and bugs over time. Here we clearly see that the project kickstarted in 2012, there was a stable period of development until 2020, and in 2022 there was rapid growth in features.
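A minimal sketch of that lemma-based classification; the keyword sets are illustrative assumptions, not the notebook's exact vocabularies:

```python
# Illustrative lemma sets; the notebook builds similar vocabularies from the data.
BUG_LEMMAS = {"bug", "fix", "error", "crash", "regression"}
FEATURE_LEMMAS = {"add", "feature", "implement", "support", "enhancement"}

def classify(lemmas):
    words = set(lemmas)
    if words & BUG_LEMMAS:       # any bug lemma present?
        return "bug"
    if words & FEATURE_LEMMAS:   # any feature lemma present?
        return "feature"
    return "other"

print(classify(["fix", "crash", "groupby"]))    # bug
print(classify(["add", "support", "parquet"]))  # feature
```

Grouping these labels by commit year then gives the features-versus-bugs timeline described above.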
As we are looking at sentiments, the simplest way of finding negative sentiment is to search for bad words. Let's try to find them. Oh yeah, indeed, there are quite a few commits with bad words, and there are a few developers who use bad words more frequently than others.
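A trivial version of that search might look like this; the word list here is a tiny hypothetical sample, and a real profanity list would be much longer:

```python
# Tiny hypothetical bad-word list standing in for a real profanity list.
BAD_WORDS = {"damn", "stupid", "wtf"}

def has_bad_word(message):
    # Flag a message if any whitespace-separated token is in the list.
    return any(word in BAD_WORDS for word in message.lower().split())

print(has_bad_word("damn flaky test again"))   # True
print(has_bad_word("refactor the io module"))  # False
```

Grouping the flagged commits by author email gives the per-developer counts mentioned above.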
Yeah, we can analyze this. I hope your organization has a policy around that, but commit messages with bad words definitely look negative, and they carry negative sentiment and negative emotions. Now let's run the sentiment analysis. For sentiments we would use the same WordNet dictionary: SentiWordNet adds some additional information on top of the words and parts of speech, so we can get a negative score and a positive score for every single word, like this. As you can see, the negative words include 'error', 'problem', 'difference', and the positive words are 'improving', 'refinement', and so on.
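As a sketch of the scoring step, the per-word scores below are hypothetical stand-ins; in the notebook they would come from NLTK's SentiWordNet lookups (e.g. `swn.senti_synsets(word)`):

```python
# Hypothetical (positive, negative) scores per word, standing in for
# SentiWordNet lookups so the sketch is self-contained.
SCORES = {"error": (0.0, 0.625), "problem": (0.0, 0.5), "improve": (0.625, 0.0)}

def message_score(tokens):
    pos = sum(SCORES.get(t, (0.0, 0.0))[0] for t in tokens)
    neg = sum(SCORES.get(t, (0.0, 0.0))[1] for t in tokens)
    return pos - neg  # > 0 leans positive, < 0 leans negative

print(message_score(["improve", "docs"]))      # 0.625
print(message_score(["error", "in", "join"]))  # -0.625
```

Summing these per-message scores per period gives the totals and averages discussed next.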
We can calculate the total score and the average score per period. As we can see, around 2014 there was a representative positive sentiment. And we can calculate these scores per person per period to see if there is any dynamic. Let's plot the charts.
Okay, we can see that the green developer was improving his negative score, and the orange guy was also improving his score. We can get some context about what these people were doing, talk to them, and maybe get some more feedback in the organization.
Let's look at the sentiments with another nice library, called TextBlob, which provides quite nice features: you don't need to write a lot of code to extract polarity and subjectivity. Let's add the polarity and subjectivity fields to our dataset. Here's how it looks.
There's a polarity column over here, and the polarity can be positive or negative. And here is the polarity over the years. It looks like a sinusoid, a very interesting pattern. After 2013, the negative polarity goes down and the positive polarity goes up; likely at this time, developers were very satisfied with the project.
And we can calculate the dynamics of the changes in polarity: red and green, as features are delivered and bugs are fixed. We can look at the polarity of individual contributors, and we can calculate the ratio, including the ratio of subjectivity, so you can make a judgment. We can look at the polarity of the overall project per year, and it's interesting to see that the polarity of the bugs and the polarity of the features are different: the features have a more positive polarity, biased towards the right-hand side.
Let's look at deep learning models and try to get some emotions out of our git commit messages. The easiest way is to run existing models with the transformers library. You can get the models from the Hugging Face website; there are a lot of models, available to everyone, and you can download any of them and run it. Let's try to find a model. We search for the model; there is a description over here and 1.5 billion downloads, and we can try the API as well.
So we have this model, and we just downloaded it. Sometimes the models are very big; this one might be around one gigabyte. As you can see, it provides us with attributes of the sentence: emotions like love, annoyance, and anger. Let's take a sample of our data frame, because it's too big at 35K commits. Let's take just 2000 and try to enrich the messages with emotions using the pre-trained model. It might take some time to run; on my laptop I usually get the results within about five minutes. Okay, we got emotions.
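A sketch of that enrichment with a Hugging Face pipeline; the checkpoint name below is an assumption (any GoEmotions-style classifier from the Hub would do), and running the model needs `pip install transformers torch` plus a network connection:

```python
# Pure helper: pick the highest-scoring label from a pipeline result.
def top_emotion(scores):
    # scores: a list of {"label": ..., "score": ...} dicts from the pipeline
    return max(scores, key=lambda s: s["score"])["label"]

def tag_emotions(messages):
    # Downloads the checkpoint on first use; assumed model name below.
    from transformers import pipeline
    clf = pipeline("text-classification",
                   model="SamLowe/roberta-base-go_emotions",  # assumed checkpoint
                   top_k=None)  # return scores for all emotion labels
    return {m: top_emotion(clf([m])[0]) for m in messages}

# The helper itself can be exercised without downloading anything:
print(top_emotion([{"label": "annoyance", "score": 0.8},
                   {"label": "neutral", "score": 0.2}]))  # annoyance
```

Calling `tag_emotions` on the 2000-message sample would produce the per-message emotion labels discussed below.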
Here are the top emotions we get from our 2000 messages, and we can do some further analysis: group the data and look at the dynamics of these emotions. Let's look at particular examples. Here's confusion; I think the confusion is caused by one word, at least in the second sentence.
Okay, let's look at some others. We can select any; let's look at anger. The anger is probably caused by this line and the capital letters. The model would have to be fine-tuned, because git commit messages are very specific. Let's look at disgust: it's not very clear why this emotion popped up. But let's look at the dynamics of our emotions and see how they look per developer. The top emotions per developer are, of course, neutral and approval. Let's drop those first two columns and look at the remaining emotions, and the remaining ones are annoyance and disapproval.
Let's look at the dynamics of every single emotion over time. You can see that annoyance correlates a lot with the dynamics of the project, and disapproval does as well. There are not a lot of positive emotions, by the way. Let's look at the last cycle again: in 2020, annoyance was at its top, among the highest numbers of contributions. So yeah, developers probably don't much like the periods when a lot of features and a lot of bug fixes are being submitted, which creates pressure on them. And let's look at the heat map graph. Oh yeah, annoyance is at the top, and disapproval as well, then disappointment and a little bit of surprise. In 2014 there was a lot of surprise, sadness, and anger. The positive emotions are not very present. And we can look at the dynamics in a different chart, just to see how the scores are growing or falling.
Yeah, the next thing I wanted to show you is summarization. Again, we will be using a Hugging Face model. There are a bunch of models, and we will use one of the most popular: Facebook's BART model trained on the CNN Daily Mail news dataset. Let's see what we get with this summarization. The idea here is to reduce the amount of text that we need to read, so we run the summarization function over the text.
If you need to summarize a huge text whose small pieces have very different contexts, I would recommend running the summarization two or three times. Basically, in the first layer you run it on the original messages, then you combine all the summaries you got, and then you run the summarization again as a second layer. That would improve the quality of the output you get; otherwise the output might be very messy. Now let's run it over
there. Let's take a sample: one top contributor and his ten last messages. Let's enrich them, get the summary of every individual message, and then get a summary of the joint text. It might take some time to process.
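The two-layer idea can be sketched independently of any particular model. `summarize` below is any callable; with Hugging Face it would be `pipeline("summarization", model="facebook/bart-large-cnn")`, and a toy stand-in is used here so the sketch runs on its own:

```python
# Two-layer summarization sketch: summarize every message (layer 1), join the
# partial summaries, then summarize the joined text (layer 2).
def two_layer_summary(messages, summarize, joiner=" "):
    first_layer = [summarize(m) for m in messages]
    return summarize(joiner.join(first_layer))

# Toy stand-in summarizer: keep the first five words of a text.
toy_summarize = lambda text: " ".join(text.split()[:5])

msgs = ["Refactor the internal block manager to simplify the indexing paths",
        "Fix a regression in groupby aggregation with categorical keys"]
print(two_layer_summary(msgs, toy_summarize))
```

Swapping the toy callable for the BART pipeline gives the per-message and combined summaries shown next.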
Okay, we got the results. These are the individual summaries for every single message. Looking at them, the messages are a little bit cleaner and clearer. Then there is the summary of the last ten messages combined, so we can see what the person was busy with, and we can specify what the length of the output message should be. Here are the results. Yeah, the text looks better than it used to, and it's a nice summary.
But again, the model we were using was trained on news. Let's try to use the ChatGPT API. It's quite fun, and it provides nice-quality summaries. We can also specify how many tokens we need in the output, and we specify the content, which is basically the prompt request as we would write it to the chatbot; the text to summarize is the joint commit messages over the past period. Then I change the prompt a little, and we see the output. We can play with the prompt: if I want an emotional response, I can make it so, and I can ask ChatGPT to make it shorter. And I can ask it to summarize in a binary way: what is bad and what is good. We get the result.
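A rough sketch of such a call follows; the model name, prompt wording, and helper names are my assumptions rather than the notebook's exact code, and actually running the call needs `pip install openai` and an OPENAI_API_KEY in the environment:

```python
# Sketch of summarizing commit messages through the OpenAI chat API.
def build_prompt(messages, style="what is bad and what is good"):
    joined = "\n".join(messages)
    return f"Summarize these git commit messages, in a binary way ({style}):\n{joined}"

def summarize_with_openai(messages, max_tokens=150):
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",    # assumed model name
        max_tokens=max_tokens,  # caps the length of the reply
        messages=[{"role": "user", "content": build_prompt(messages)}],
    )
    return resp.choices[0].message.content

# The prompt builder itself runs without an API key:
print(build_prompt(["BUG: fix join on empty frame", "ENH: add new engine"]))
```

Varying the prompt string is what produces the shorter, emotional, or good/bad-structured variants described above.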
The result is very structured. I highly recommend you try this out: copy my notebook and run it over your repos to get some insights that you have never seen before. Most of the words in dev slang have negative sentiment, so don't be surprised if you get horrible scores; check the original messages and the data that you get. NLP programming is very iterative, so be ready. I hope my video was interesting. Thanks, Conf42, for hosting me.