Conf42 JavaScript 2024 - Online


Who likes unit testing? You code, let GenAI DevTools worry about the test

Abstract

How can developers harness GenAI code generation tools to build better software? Learn how to automate the tedious task of writing test code. Gain insights into leveraging AI-driven tools to make test generation easier and faster, and to create higher-quality software. Be the testing hero your team needs!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. My name is Sharon Barr, and I'm going to talk about unit tests today. Who likes unit testing? Actually, most developers don't like testing at all. So let's see how generative AI dev tools can worry about the tests for you.

But first, a little bit about myself. I lived in the Bay Area for 17 years. I worked for Yahoo, I was the first VP of Engineering at Couchbase, a NoSQL database company, and I was CTO at a life-science startup called Bina Technologies, acquired by Roche. I'm a former developer, and I recently founded Early. At Early, we build high-quality, working unit tests for pull requests, at scale. I like to call myself the Chief Test Code Generator here; I've generated over half a million lines of test code in the past year. I love travel and outdoor activities; the Tour du Mont Blanc was one of the recent highlights.

So let's dive into our talk. We all know the CrowdStrike outage, where a tiny bug led to a massive crisis. If you read about what happened there, a simple array out-of-bounds access caused a null-pointer exception, and because the software was running at a low level of the Windows operating system, it caused the blue screen of death. Since then, we all know what happened: the highest-cost bug ever, with over 5 billion dollars in losses. The CEO of Delta alone said the outage cost them 500 million dollars in revenue.

If we dive a little into the anatomy of bugs, you can see that the cost of a bug can go up exponentially. If you catch it early, close to coding, which is where unit tests live, the cost is pretty low. If you catch it post-production, the cost goes through the roof, and CrowdStrike is an extreme example.

Let's look at an analogy here. Henry Ford used to say, "If I had asked people what they wanted, they would have said a faster horse." Henry Ford listened to the problem: people were trying to move from point A to point B. So he invented the car. And when people had cars in their hands, they asked: what can I do with a car that I could not have done with horses? I can work in the next town and come back to sleep at home with my family. I can build a travel business. I can invent a truck and move goods. If you look at what's happening today with the new wave of GenAI tools focusing on the problem of code quality, the question is similar: for test code generation we were used to horses, and now we are starting to have cars. How can we use these tools to generate test code automatically? The focus is that the problem is not the testing; the problem is code quality, and the testing tools are helping us achieve that end goal.

Now, if you look at how developers spend their time, based on different studies: about 30 percent of the time is spent on new code, half of it on test-related tasks and half on application code; about 20 percent on maintenance, again half application, half tests; about 13 percent on debugging, finding issues and fixing them; a quarter of the time on meetings, security, and people matters; and coffee, lunch, and breaks take about 10 percent. So overall, about a third of a developer's time is spent on tests, test-related tasks, or fixing bugs. So clearly, developers want to test their code.
The question is, how can they spend their time effectively? Look at the types of tests that exist: functional tests such as unit, component, integration, system, and UAT; non-functional tests such as security, localization, and usability; manual tests, such as exploratory and ad hoc, which can happen throughout the life cycle; automated tests, white box and black box; and many other types. So really, what will bring developers the highest value for their time in terms of minimizing the number of bugs in their code?

Today, let's look at unit tests. What is the value of unit test generation? As we said, it helps you catch bugs very early, where the cost is lowest. It frees up a lot of the developer's time, so you can spend it on application code, on higher-level tasks, on bringing value to the business. It speeds up release cycles, because you're catching bugs earlier. And it applies to multiple use cases during development; we'll talk about a few examples.

If you look at the dev tools, there are roughly two groups. There are assistant dev tools for test code generation, like GitHub Copilot and Cursor, plus other companies that are raising a lot of money to solve this problem. And there are test-code AI agents; we'll talk about what an AI agent means down the road. First, let's look at a couple of these tools. Today I'm going to show you two of the more popular tools from these two distinct areas. Cursor is a fork of Visual Studio Code. It's an assistant tool that helps you generate code across the board; it's general purpose: all languages, all types of code, front end, back end, testing, unit testing. On the other side we have Early, which is an AI agent. The difference with an AI agent is that it's very specific to one task and tries to do it better than anyone else; in this case, generating unit tests.

So let's switch to an example. I'll enlarge my screen. We'll start by looking at the backend code, the server code we wrote around the user services, and we'll generate unit tests using Cursor. You install Cursor like any other application, the same way you'd install VS Code. What makes Cursor so popular is that it's very tuned for using AI in your code development. If you look at the configuration, there are general parameters; there is a paid pro version; you can select different models, as they support most of the popular models out there; and you can plug in your own API key.

So how do I use it? I go to my method; for example, let's look at checkMaxNumberOfUsers. Let me quickly explain the method we're going to use as an example. It's a method that returns if you have capacity for more users and throws an exception if you don't. If the user is already defined, you don't need to check; you just return. Otherwise, you compare the user count to the maximum number of users that you get from configuration, and if the user count is above the maximum, you throw an exception. Going back to the code: if I highlight the code, I can see Chat and Edit here. Chat opens the Cursor tab, which lets you interact with the LLM in a very easy manner with your code.
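Since the demo code itself isn't visible in a transcript, here is a minimal TypeScript sketch of what a method like this could look like. All the names (UserService, checkMaxNumberOfUsers, UserRepository, ConfigService, UserLimitExceededError) are assumptions for illustration, not the actual code from the talk.

```typescript
// Hypothetical reconstruction of the method under test; the shapes of
// UserRepository and ConfigService are assumed, not shown in the talk.
export class UserLimitExceededError extends Error {}

export interface UserRepository {
  findByEmail(email: string): Promise<object | null>;
  countUsers(): Promise<number>;
}

export interface ConfigService {
  getMaxNumberOfUsers(): number;
}

export class UserService {
  constructor(
    private userRepository: UserRepository,
    private configService: ConfigService,
  ) {}

  // Throws if there is no capacity for another user.
  async checkMaxNumberOfUsers(email: string): Promise<void> {
    // An existing user consumes no new capacity, so there is nothing to check.
    if (await this.userRepository.findByEmail(email)) {
      return;
    }
    const userCount = await this.userRepository.countUsers();
    const maxUsers = this.configService.getMaxNumberOfUsers();
    // Assumption: being at the limit also throws, matching the edge
    // cases exercised later in the demo.
    if (userCount >= maxUsers) {
      throw new UserLimitExceededError(`User limit of ${maxUsers} reached`);
    }
  }
}
```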
So here I have the method, and I'll ask it to please generate happy-path and edge-case unit tests for this method. Cursor will immediately start generating these tests. As soon as it's done, I can create a new test file. So let's apply: the code is copied over, a new file is created, and I'll change the file name to the test naming format used on this project. Now I can see all the tests that were generated.

We can see that the tests actually have some compilation issues; we'll look at them, but first let's fix those. If I highlight an error, I can do a quick fix: fix with AI. Here there's an object where the LLM didn't get a property right, so it's fixing the properties. Let's fix this one; there are a few such cases. It would be nice if there were a "fix all" feature, but apparently there isn't, so we'll fix this one as well: fix with AI, fixing the object. Those were three cases where it didn't get the object right. There's another case where it didn't get the type right: this field is a bigint, and it put in a plain number. A quick fix corrects the ID, and now it's 1n. Then we have a similar problem to before, where the object isn't right, so it edits the object and I accept. Then initialBalance probably needs to be a decimal; accepting that, the decimal issue is a simple fix, we just add the library. So, after generating the tests, we've fixed all the issues.

Now I can look at another extension here, for Jest. Jest is the test framework we're using, and the extension visualizes all the tests for me. In this case I'm running all the tests, and I can look at the user service test file. Let me find it... okay, here it is. What I can see is that there are four red tests covering the cases (I'll show a sketch of what these four tests look like at the end of this passage). Let's see: should not throw an error if the user already exists, which first initializes the objects and asserts it doesn't throw; should not throw an error if the user count is below the limit, the happy path; and the other edge cases check whether the count is at the limit and whether the count exceeds the limit.

So that's one way, called code assist: we asked it to generate tests. It wasn't completely running; there are still issues, by the way, with some missing dependencies. But let's pause here and try another tool. This was an assistant tool; the other tool we'll try is Early. Early, unlike Cursor, is an extension, an add-on to Visual Studio Code or your IDE. You install it from the marketplace and sign in. What you see here is a view of all your files and the classes and methods within each file. Looking at this method, I can also see whether a unit test was generated for that file, see the test, and see the coverage of the test at the method level, not just at the file level like most tools. So if you look at checkMaxNumberOfUsers, I can now generate tests for this method, and I can also generate them for other methods. When I generate a test, I can do it through the magic wand or through the context menu.
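As promised above, here is a sketch of the kind of Jest test file this flow ends up with, written against the hypothetical UserService from the earlier sketch. The four cases mirror the ones named in the demo; the fixture helper and assertions are illustrative.

```typescript
// user-service.test.ts: illustrative only; imports the hypothetical
// UserService sketched earlier from an assumed path.
import { UserService } from './user-service';

describe('checkMaxNumberOfUsers', () => {
  // Helper that builds the service with mocked collaborators.
  const makeService = (userCount: number, maxUsers: number, userExists = false) => {
    const userRepository = {
      findByEmail: jest.fn().mockResolvedValue(userExists ? { id: 1n } : null), // bigint id, as in the quick fix
      countUsers: jest.fn().mockResolvedValue(userCount),
    };
    const configService = { getMaxNumberOfUsers: jest.fn().mockReturnValue(maxUsers) };
    return new UserService(userRepository, configService);
  };

  it('should not throw an error if the user already exists', async () => {
    await expect(makeService(10, 5, true).checkMaxNumberOfUsers('a@b.com')).resolves.toBeUndefined();
  });

  it('should not throw an error if the user count is below the limit', async () => {
    await expect(makeService(3, 5).checkMaxNumberOfUsers('a@b.com')).resolves.toBeUndefined();
  });

  it('should throw an error if the user count is at the limit', async () => {
    await expect(makeService(5, 5).checkMaxNumberOfUsers('a@b.com')).rejects.toThrow();
  });

  it('should throw an error if the user count exceeds the limit', async () => {
    await expect(makeService(6, 5).checkMaxNumberOfUsers('a@b.com')).rejects.toThrow();
  });
});
```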
When Early's test generation starts, a window also opens that helps me increase the quality of the code, not just generate tests. It can suggest documentation for the method, which I can choose to insert. It can also suggest improvements to the code: consider caching the max-number-of-users value to reduce database calls (a sketch of this idea follows at the end of this passage). That's another good idea. Then the tests are generated. Once they're ready, I have a "go to test" link, and at the same time a coverage calculation happens behind the scenes to compute the coverage for these tests.

Let's have a look at these tests. Normally, high coverage means the tests are actually compiling, running, and good. You can see that an agent can do a deeper dive into the code: it can generate mocks to build better tests. You can see different types of tests. There's a happy-path test; "should not throw an error if the user already exists," which is what we had before, now grouped; "should not throw an error if user count is below maximum"; and so on. If I look at the Jest extension, I can see the tests there too. What's nice is that if I run them all, I can see whether they're actually running: green, red, or not compiling. It takes a little time, because while I record, my computer uses the CPU heavily and that slows everything down, but normally it finishes in a few seconds. We can see the number of green tests: I have two happy paths, should not throw an error if the user exists, should throw an error at the user limit, and so on. That explains why I got 100 percent coverage at the method level for all three methods for which I generated tests.

One more thing: I can see some red tests, which means there might be an issue with some of them. This can be expected; sometimes a red test is an issue with the test itself, and sometimes a red test actually indicates a bug. "Expected email, received undefined" could be a real bug in the code.

So we saw two tools: one is a code assist, one is an agent. Now let's see how an agent and an assistant can collaborate. Say I'd like to check whether this is a real problem, and let the assistant, in this case Cursor, try to fix the test and understand why it's red. If it's a problem in the test, maybe it can fix it. If it can't fix it, it could be a real bug worth looking into, but that's for a different presentation.

So let's go back and discuss what we've seen: the differences between the types of AI dev tools. I'd like to talk a little about assistants versus AI agents, terms we hear a lot. An assistant is a general-purpose tool where the user interacts with the LLM. It can be a really amazing user experience, like Cursor, or more simplified; ChatGPT is still very good. The user asks a question, gets a response, asks again, gets a response, analyzes it, until they're satisfied with the results. As we transition to the world of agents, we start looking at what humans do at the task level.
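Coming back for a moment to the caching improvement the agent suggested: a minimal sketch of that idea, reusing the hypothetical ConfigService shape from the first example, might look like this.

```typescript
// Sketch of the agent's suggestion: cache the configured limit so that
// repeated capacity checks don't hit the database/config store each time.
import { ConfigService } from './user-service'; // assumed path, as above

export class CachingConfigService implements ConfigService {
  private cachedMaxUsers?: number;

  constructor(private inner: ConfigService) {}

  getMaxNumberOfUsers(): number {
    // Fetch once, then serve subsequent calls from the cache.
    if (this.cachedMaxUsers === undefined) {
      this.cachedMaxUsers = this.inner.getMaxNumberOfUsers();
    }
    return this.cachedMaxUsers;
  }
}
```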
Returning to assistants versus agents: our daily job, or any job in general, is composed of many different tasks. One-minute tasks, five-minute tasks, thirty-minute tasks, all the way to completing the job. When we talk about AI agents, the goal of an agent is not to replace the human, but to completely take over a task the human is doing. In this case, the human asks the product or solution to perform a task. The solution talks to the LLM multiple times, talks to itself, improves the results, and when ready hands them back to the user, gets feedback, builds trust, and keeps improving until it gives back a result that is working, high quality, and complete.

If we take that approach and compare test code assistants to test code agents: on output, assistants generate code suggestions, as we've seen, whereas AI agents generate working code, everything. On quality, assistants are low to medium; even after you fix the output, sometimes it works and sometimes you still need to do more logical work. Agents produce high-quality, comprehensive output: happy path, edge cases, mocks. On effort, assistants still require you to continuously fix things, though using the AI is still much faster than writing from scratch; agents are closer to completion and save you a lot of time. Even if you need the assistant after the agent has completed its part, it's finishing something that started at 90 percent, not 50 percent. And on impact, you can generate not one test suggestion at a time, but dozens of quality working tests for your files, many methods, your whole project, very quickly.

Now let's see how this really helps us as developers in the flow. There are various use cases, and if you use them wisely they can help you develop much more complicated applications, with higher quality, at higher speed. First, do continuous testing during development: test early and often. Then test your PR: you can do it continuously, or when you think your PR is done, generate all the tests, find the bugs behind the red tests, and protect against future bugs with the green tests. Then there's testing private methods, something that can be fairly complicated if you try to do it with end-to-end testing, because your end-to-end tests would have to touch so many branches of the code, whether through your use cases or your data, which becomes pretty painful. Now you can turn the private method into a public method, generate unit tests for it, test it, return it to private, and you've tested it very easily. And the last one is TDD, test-driven development, which can be done very differently, in a simplified manner.

In fact, let me talk about that a little, using a different example. In this case it's a to-do app that manages different projects, and I've already prepared a flow of how developers might build an updateProject capability. It starts with a simple method signature (I called this step the "simple doc" version). The user can generate some documentation, and they can generate tests using the chat, or using the agent by clicking on the magic wand. That's the first step. What we'll see is that once tests are generated, they're somewhat basic. I can then insert the documentation, try again, and generate tests, and this time the generated tests are more sophisticated. Then comes the implementation.
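To make that TDD flow concrete, here is a sketch of a plausible starting point: a bare signature plus documentation that gives the generator enough to produce edge cases like the "404 if the project is not found" one mentioned in the next step. updateProject, ProjectStore, and NotFoundError are assumed names, not the demo's actual code.

```typescript
// TDD step 1-2 sketch: signature and documentation first, tests next,
// implementation last. All names here are illustrative assumptions.
export class NotFoundError extends Error {}

export interface Project {
  id: string;
  name: string;
  done: boolean;
}

export interface ProjectStore {
  get(id: string): Promise<Project | null>;
  save(project: Project): Promise<void>;
}

/**
 * Updates an existing project in the to-do app.
 * @param id - the project to update
 * @param changes - partial fields to merge into the stored project
 * @throws NotFoundError (surfaced as a 404) when the project does not exist
 */
export async function updateProject(
  store: ProjectStore,
  id: string,
  changes: Partial<Project>,
): Promise<Project> {
  const existing = await store.get(id);
  if (!existing) {
    throw new NotFoundError(`Project ${id} not found`);
  }
  const updated: Project = { ...existing, ...changes, id };
  await store.save(updated);
  return updated;
}
```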
Next, I start implementing my method and continuously generating tests, and what we'll see is that the tests slowly become better and more comprehensive. Finally, I add all the edge cases and generate more tests for those. Looking at the outcome across the examples: for the simplified version, with very simple documentation, I get a happy path and basic edge cases. Then the tests become more elaborate. If I look at step 3 and step 4, where I've started implementing the cases: in step 3, where I have the happy path, the unit tests generated for the happy path are green and the unit tests for the edge cases are red. And when I finish the implementation, I have four green tests.

Going back to the presentation, just to summarize the TDD use case: you can see the different versions of the method. In step one, very simple tests. Once I have more documentation, the tests become more precise; instead of just simple edge cases, the generated edge cases are more specific, like "should return 404 if the project is not found," and so on. And when I'm ready and have the full implementation, I have all the test cases and also high coverage for that particular case. So that's another big example. Again, cars versus horses: it's very interesting to see how developers will adapt these tools and use them in different ways to solve the problem of bugs.

The final step is: how do you test the tests? High coverage does not necessarily mean high quality, and there are different techniques for that. The one I'll talk about today is called mutation testing. Briefly, mutation testing introduces changes to your code, mutants, or bugs if you will, and then runs your unit tests against those changes. It expects your unit tests to fail, and if they don't fail, it means you either have a bad test or not enough coverage. Ultimately it produces a mutation score.

Let me show you an example as I run it. It's heavy on compute, so I hope it'll run nicely here. We generated tests for three files and have 100 percent coverage, and I have the command line here to run the mutation test for the user service file, which is this file. Ah, one thing: in order to run mutation tests, you cannot have red tests, because if you have red tests it won't run them. So every time I have a red test, I'll skip it; if I had more time, I'd actually try to fix it. And those over there are the tests from Cursor, so I'll even delete them. Let's just verify everything is green, and try again, running Stryker. What we'll see is that Stryker starts working pretty heavily to generate mutants for this file, the user service file we saw. It calculated that there are 34 mutants, and as we speak it's generating more mutants and running the tests. When it finishes, it will generate a report that tells us the mutation score for that particular file, given all the tests that exist. Just to fast-forward, I'll show you the results... actually, it's almost finished. Nice.
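For anyone who wants to reproduce a run like this, a minimal StrykerJS setup for a Jest project might look like the following sketch; the path under mutate is a placeholder standing in for the demo's user-service file, not its real location.

```javascript
// stryker.conf.js: a minimal StrykerJS configuration sketch.
// Install first: npm i -D @stryker-mutator/core @stryker-mutator/jest-runner
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
module.exports = {
  mutate: ['src/services/user-service.ts'], // placeholder path
  testRunner: 'jest',
  reporters: ['clear-text', 'progress', 'html'],
  coverageAnalysis: 'perTest',
};
// Then run: npx stryker run
```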
Okay, so what we're seeing here is a mutation score of 96.55, which is pretty nice. Overall there were 36 mutants: only one survived, which means we missed it; there were some errors, which don't count toward the score; and 28 were killed. So most mutants were killed, and that's how it arrived at the mutation score, the share of valid mutants that were killed: 28 out of 29, about 96.55 percent.

One thing we'll do now: let's delete some of the test files we created. We'll remove the createUser test and the isCurrentAdmin test, go back, and recalculate the coverage. You see, now I don't have links to the test files I just deleted, and we'll wait to see the updated coverage. It was around 87 percent for the file when we started, if I recall correctly. Let me refresh it again. What we can see now is that we have 87 percent coverage; those methods are below one hundred percent: checkMaxNumberOfUsers is at 66 percent and isCurrentAdmin is at zero percent. So tests are missing for three methods in this file, but the coverage is still pretty high, 87 percent.

So let's run the mutation tool, Stryker, a really fantastic tool by the way, and see what score it comes up with now. It's the same 34 mutants; obviously there are just fewer unit tests. And we see that the mutation score dropped significantly: only six mutants were killed, seven survived, and six have no coverage. Almost half of the mutants are not being tested, just because these three unit tests are missing. That gives you an indication of how important unit tests are for quality, because they really touch everything in your code.

To summarize: testing has never been so easy. Really, go out and try all the AI dev tools you have to generate test code for you, and mainly learn how to use them effectively, how to embed them into your development flow. The outcome is that you deliver high-quality code faster and become the 100x engineer of the future. I hope you enjoyed this talk. If you have any questions or want to contact me, I'm happy to chat: sharon@startearly.ai. Thank you so much for listening.
...

Sharon Barr

Co-founder & CEO @ Early



