Conf42 Incident Management 2024 - Online


Debugging in the ML/AI Space: Strategies for Upcoming Talents


Abstract

In this talk, we’ll explore proven strategies for debugging issues in the ML/AI space, aimed at upcoming talent. From foundational concepts to practical skills, learn how to shape their growth, foster their problem-solving abilities, and prepare them for real-world AI challenges.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Shraddha. In this talk, my co-speaker and I will be sharing some practical tips for breaking into the AI/ML space. This session is perfect for those who are new to the field or have recently graduated, for industry professionals who are planning to transition into a new role, or for anyone interested in learning more about the AI/ML space. I hope you find it valuable and learn something new. If at any point in the talk you have questions, feel free to reach out to us on LinkedIn; you can find our IDs tagged on the slide.

Let's start with the high-level objectives for this talk. We will start by talking about why debugging is an invaluable skill: you will clearly stand out if you can identify and solve complex issues that others cannot. Then we will discuss the gap that exists between academic ML and practical ML. This is particularly interesting because new graduates are often surprised when they discover that machine learning is more than building models; the scale of model serving and delivery is not something fully covered in graduate school. Lastly, as promised, we will share some practical tips and tricks to break into the space.

Debugging as a skill is valuable for engineers in any space, but in AI specifically it is an invaluable skill in today's automation-driven world, and in this slide I'm going to talk about why.

First of all, AI is everywhere. From social media algorithms to recommendation systems and even self-driving cars, AI is all around us. As companies automate more tasks using ML, the demand for people who can troubleshoot these systems will go up. If an AI makes a mistake or isn't working as expected, it can impact millions or even billions of users around the world, so having the ability to spot and fix those issues is very crucial and valuable.

Secondly, models aren't perfect. ML models are built on data, and data can be messy, incomplete, and biased. Debugging helps identify when a model isn't learning what it should, or when it's making incorrect predictions due to underlying data issues. For example, think about a recommendation system on a streaming platform like YouTube or Netflix. If it starts recommending shows and videos that users don't like because of a bug in the model, the platform could lose engagement and customers. That directly impacts the revenue the platform generates, and ultimately revenue is everything. Debugging the system keeps it running smoothly and makes sure we get the engagement we need from users.

Thirdly, debugging keeps systems running efficiently. As automation grows, performance matters. There is a direct resource and monetary cost involved in serving these heavy models. A well-debugged system runs faster, uses fewer resources, and scales better. So if you can debug and optimize a model, you can save a company a lot of money and time by improving performance and reducing downtime.

Continuing from the last slide, my fourth point is that it's all about continuous improvement. ML models evolve over time. They need to be retrained with new data, and as that happens, new bugs and issues can emerge. Think of it like tuning a car engine: even after the initial build, you need to keep making adjustments so it runs smoothly, especially as conditions change, and the conditions do keep changing all the time.

The next point is that we want to prevent big mistakes from happening.
Automation means that systems are often making decisions without human intervention. If an AI system makes a mistake, the consequences can be large scale. For example, think of AI used in finance: if it incorrectly starts flagging transactions as fraudulent, this can cause chaos for customers. Debugging ensures that such critical systems make accurate decisions by minimizing errors.

And the last point is that it's a rare, high-demand skill. Debugging ML and AI models is a highly specialized skill. As automation continues to grow, companies will need experts who can dive into these complex models and data pipelines to find and fix issues. Overall, in an age where automation is taking over, the ability to debug these complex models becomes like a superpower. It's a skill that's already in high demand and can open up a lot of career opportunities in the tech world.

Let's move on to the next topic: the gaps that exist between academic ML and practical ML. Let's take a pause and look at these numbers, because they are mind-blowing. Netflix: over 277 million members worldwide. YouTube: 500 million daily active users. Instagram: 2 billion monthly active users, with 500 million engaging with the app on a daily basis. Think about the scale. Each request, mind you, each request here is a user asking for a customized feed, a personalized set of recommendations made specifically for them. But this customization isn't magic. It happens by ranking the millions and billions of options available in the dataset for a given user. Think about the number of combinations out of a million options. How do you pick the top 10 most suitable for a given user? How do you pick them such that the user actually ends up taking the action we want them to take?

Now compare these numbers with the datasets used in graduate schools. Off the top of my head, the most popular datasets I can remember, at least the ones I used during my time, were the Iris dataset and MNIST. Iris has about 150 samples, consisting of iris flowers from three different species. MNIST contains grayscale images of handwritten digits from 0 to 9; it has about 60,000 training samples, and each image is 28 by 28 pixels. So we are talking about 60,000 to 70,000 samples versus millions of daily active users and billions of requests on an hourly basis. The scale is mind-blowing.

Next, we'll talk about the heavy-duty models that work behind the scenes. Ranking is done using complex models with very complicated architectures, and they are trained on a recurring basis. The recurring schedule can be hourly, daily, or custom, for example every six hours; it depends on the type of model, the availability of data, and how the entire data pipeline is set up. Every time a model trains, it doesn't start from scratch. Instead, it begins training from the last snapshot and learns new weights based on the latest trends and data patterns. BERT, GPT-3, T5, as you can see on your screen, are used for use cases like text summarization, chatbots, recommendation systems, and sentiment analysis, and each of them has billions of parameters.
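To make the warm-start idea concrete, here is a minimal sketch of a recurring retraining job, assuming a PyTorch-style setup; the function name, checkpoint keys, and data loader are hypothetical placeholders, not any platform's actual pipeline:

```python
# Minimal warm-start retraining sketch (PyTorch-style, illustrative only).
# Paths, the model/optimizer objects, and the data loader are placeholders.
import torch

def retrain_from_last_snapshot(model, optimizer, data_loader, snapshot_path, loss_fn):
    # Resume from the last published snapshot instead of training from scratch.
    checkpoint = torch.load(snapshot_path)
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])

    model.train()
    for features, labels in data_loader:          # only the newest data window
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()                           # weights updated via backpropagation
        optimizer.step()

    # Publish a new snapshot for the next recurring run (hourly/daily/custom).
    torch.save({"model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, snapshot_path)
```

The design choice here is simply that each run fine-tunes the previous snapshot on fresh data rather than re-learning billions of parameters from zero, which is what makes hourly or daily schedules affordable.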
ResNet and YOLO are used for image classification and object detection in computer vision tasks; YOLO, in particular, is used for real-time object detection in video streams. All of them have very specific use cases, complex model architectures, and millions to billions of parameters. What do these parameters mean? Each parameter is a weight that needs to be updated every time you retrain a model. Every time new data comes into the pipeline, it triggers a new training run, and the model updates all of these millions or billions of parameters. This process involves calculating and optimizing weights, and all these parameters are learned during the backpropagation phase of training. Compare this with your graduate-school models: their parameters are not in the billions. It's the scale that matters here.

Next, I want to highlight the privacy and sensitivity filtering that takes place in the training data pipelines. Unlike in university, where you already have a dataset available and use it directly for your use case, performing all kinds of data augmentation and pre-processing on it but not filtering it based on privacy and sensitive information, production has a lot of rules, and they change from country to country and region to region. There is a privacy and sensitive-data filtering phase where you have to filter out data based on the specific rules in that region or country. These laws are developed by individual countries or unions. For example, GDPR is considered one of the strictest and most comprehensive privacy laws globally. It was devised by the European Union and protects European residents: it mandates user consent for data processing, gives users the right to data access and deletion, and imposes heavy fines for non-compliance. In fact, the other laws listed in the table here, the CCPA, Canada's Bill C-27, Brazil's LGPD, and Australia's Privacy Act reform, have all in some sense been inspired by GDPR. GDPR took the lead and was the first of its kind, and then California, Canada, Brazil, Australia, and many other countries followed and came up with their own laws to protect their residents, so that residents have more control over what data is collected about them, can delete their data, and can opt out.

So far we have talked about what kinds of models and parameters are trained in the industry, what kind of data filtering happens, and why debugging all of this is complex. Now let's talk about what happens once the data is filtered and ready and you have fresh models trained on that clean data. What happens next? In graduate school, I think that's where your task ends, right?
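As a rough illustration of what a privacy and sensitivity filtering step can look like, here is a small sketch; the rule table, field names, and consent flag are invented for the example, and real pipelines encode requirements that privacy and legal teams define per region:

```python
# Illustrative region-aware privacy filter for a training-data pipeline.
# The rule set and record fields are hypothetical; real pipelines encode
# legal requirements (GDPR, CCPA, LGPD, ...) reviewed by privacy/legal teams.
from typing import Optional

SENSITIVE_FIELDS_BY_REGION = {
    "EU": {"precise_location", "health_signals", "political_interest"},  # GDPR-style
    "US-CA": {"precise_location", "health_signals"},                     # CCPA-style
    "BR": {"precise_location", "health_signals"},                        # LGPD-style
}

def filter_record(record: dict) -> Optional[dict]:
    """Drop records without consent and strip region-restricted fields."""
    if not record.get("user_consented", False):
        return None  # consent is mandatory before the record can be used for training
    restricted = SENSITIVE_FIELDS_BY_REGION.get(record.get("region"), set())
    return {k: v for k, v in record.items() if k not in restricted}

def build_training_set(raw_records):
    # Keep only the records that survive the consent and field-level filters.
    return [r for r in (filter_record(rec) for rec in raw_records) if r is not None]
```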
You have your model, which is trained and giving out predictions. You look at the confusion matrix, at the true positives, false negatives, and everything else, you look at the recall and precision metrics, and you decide if the model is good enough. But in production this is not where things end; once model training finishes, that's just the start of the delivery process.

The model will produce a prediction value for each item, for example a video, a post, or a product: it predicts what kind of engagement that item could get if it is shown to a given user. But what happens after that? There are other business rules that apply after the prediction value is put out. For example, on YouTube you cannot have more than two videos from the same channel in a user's timeline. Don't quote me on the exact number, but I'm sure there is some restriction on how many videos from the same channel can be shown in a user's timeline. Next, you have ads, which are shown alongside the organic YouTube content. Ads are a whole different space where advertisers put money so that their videos can be shown alongside the organic content coming from other channels. So how do you create a timeline where the advertiser's budget is exhausted and the advertiser gets what they want, the engagement on their website and enough return on whatever they are investing, while at the same time users engage organically on the platform and enjoy the content created by other creators, not just advertisements? No one likes advertisements, but we want to make sure that advertisements get enough engagement and, at the same time, that the organic videos keep users engaged. How do you create a timeline, and how do you diversify it, so that the user does not feel overwhelmed? Not all videos should be on the same topic; you want to diversify. You want to exhaust the advertiser's budget. You don't want to show more than a certain number of videos from the same channel in a user's timeline. And there are many other business rules that come into the picture when it comes to ranking and building the timeline.

Next, I want to talk about continuous monitoring and real-world validation. Unlike in school, where you measure accuracy using simple test datasets, real-time performance validation is not really possible. Even if your model predicts some value, you can't say with confidence whether it is true or not. Let's say it predicts with high confidence that if this user sees this video, he will click on it and watch it fully. Even if your model says that, you can't fully know if it is accurate; in an ideal world you'd be able to see it immediately, but in practice you can only confirm it after weeks or months of monitoring performance and revenue metrics. And even though you can have engagement metrics at the individual user level, no one in the industry has time to go through every user, because we are talking about millions of users, right? So we only look at aggregated metrics.
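To illustrate the post-prediction business rules described above (a per-channel cap and blending ads into the organic feed), here is a minimal re-ranking sketch; the rule values and field names are made up for the example and are not how any specific platform actually does it:

```python
# Illustrative post-prediction re-ranking with simple business rules.
# The rule values (max 2 per channel, 1 ad per 5 organic items) are invented
# for the example; real platforms tune these constraints carefully.

def apply_business_rules(scored_items, max_per_channel=2, ad_slot_every=5):
    """scored_items: list of dicts with 'score', 'channel_id', 'is_ad' keys,
    already sorted by predicted engagement score (highest first)."""
    timeline, per_channel, since_last_ad = [], {}, 0
    ads = [i for i in scored_items if i["is_ad"]]
    organic = [i for i in scored_items if not i["is_ad"]]

    for item in organic:
        channel = item["channel_id"]
        if per_channel.get(channel, 0) >= max_per_channel:
            continue  # diversity rule: cap items from the same channel
        timeline.append(item)
        per_channel[channel] = per_channel.get(channel, 0) + 1
        since_last_ad += 1
        if since_last_ad >= ad_slot_every and ads:
            timeline.append(ads.pop(0))  # blend an ad into the organic feed
            since_last_ad = 0
    return timeline
```

The point of the sketch is simply that the highest-scoring items are not necessarily what the user sees; the final timeline is the model's scores filtered through business constraints.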
Creating those aggregated metrics, understanding whether engagement is improving or staying neutral, and monitoring them on a daily basis is really important. There can also be bias within the models. For example, early versions of ChatGPT were criticized as biased towards certain political ideologies, and that did not come up until the model was actually put out for users to use. That's when the model owners started getting feedback from users and the media about what was happening, and because of that they had to iterate on the model and create a new version that was less biased.

That brings me to the next point, which is A/B testing. A/B testing is the process of running a small test of your change on a small subset of users before landing the change in production. And even after you run an A/B test, it's not like you can just deploy your model. You need to prove value to your stakeholders; stakeholders here means product managers, engineering managers, and directors. You have to make sure your model is actually doing what you expect it to do. For example, if you stated that your model will improve engagement metrics, then you need to show that it actually improved engagement metrics. If you said it will improve prediction performance, you have to show that, and you also have to prove that it does not cause any other disruption in the system and that all the top-line and prediction metrics look normal. And when you're running A/B tests, you can hit all sorts of issues. Your A/B test can suffer from dilution because of parallel running experiments, or your test version may not be set up correctly: you have trained a model that is supposed to be picked in your test arm, but it doesn't get picked because there is heavy fallback to the production model, so the test is not doing what you expect it to do. These are the kinds of issues you run into when running A/B tests and trying to build a new model architecture and put it into production.

Lastly, I want to say that in theory the goal is to maximize accuracy and minimize loss, but in production you need to consider system efficiency, privacy, real-world constraints, and business metrics. This is why real-world ML is so different: it's all about making trade-offs and ensuring that the model delivers value, not just high accuracy. With that, I will pass the mic to my co-speaker, Sunandan, who will share some practical tips and tricks to break into the ML space.

Thank you, Shraddha, for the wonderful introduction to the vast difference between the scale of ML in graduate school versus in production. Hi, I'm Sunandan, and I'm going to talk about the practical debugging skills that upcoming talent can use in a production environment to debug AI or ML systems. We need both short-term and long-term strategies to avoid on-call fatigue when dealing with these systems. Before we do that, let's see what a typical ML system looks like from a high level.
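Before the system overview, here is a small sketch of the A/B readout Shraddha described: aggregated click-through rates per arm, the relative lift, and a fallback-rate check that can surface dilution. The event fields and model name are hypothetical, chosen only to illustrate the idea:

```python
# Illustrative A/B readout: aggregated engagement lift plus a fallback-rate
# check to catch dilution (test traffic silently served by the production model).
# Field names ('arm', 'clicked', 'served_model') are hypothetical.

def ab_readout(events, test_model_name="test_model"):
    stats = {"control": {"n": 0, "clicks": 0},
             "test": {"n": 0, "clicks": 0, "fallbacks": 0}}
    for e in events:
        arm = stats[e["arm"]]
        arm["n"] += 1
        arm["clicks"] += int(e["clicked"])
        if e["arm"] == "test" and e["served_model"] != test_model_name:
            arm["fallbacks"] += 1  # test user was actually served the production model

    ctr = {a: s["clicks"] / max(s["n"], 1) for a, s in stats.items()}
    lift = (ctr["test"] - ctr["control"]) / max(ctr["control"], 1e-9)
    fallback_rate = stats["test"]["fallbacks"] / max(stats["test"]["n"], 1)
    return {"ctr": ctr, "lift": lift, "fallback_rate": fallback_rate}
```

A high fallback rate here would be one signal that the experiment is diluted and the measured lift understates the model's real effect.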
Any typical machine learning system in a production environment has several sub-components which work together to give us the desired predictions. Steps such as model training, snapshot creation, snapshot validation, and inference systems are not simple components but whole, large distributed systems with their own components and levels of complexity. For instance, we don't use raw data to train a model directly. Instead, we first clean, filter, apply privacy rules, and make transformations to prepare the data for model training. If there is a failure in the data system, we can end up with corrupted features. This can lead to low-quality snapshots being created during model training. If these snapshots are deployed to production, they will produce poor predictions, which in turn can affect the behavior of the software products. This can result in a different set of events and outcomes, which can generate more bad-quality data. In other words, data corruption can have a ripple effect, impacting not only the model's performance but also the overall behavior of the software products that rely on it. The impact is similar for any other component in this ecosystem.

Visually, these failures, represented by fire icons, cascade to different parts of the system pretty quickly. For example, issues in training can lead to bad predictions, which can corrupt the latest deployed snapshot. Since the outcome of ML models is also used in future ML predictions, issues in training data today, if not mitigated on time, will lead to corrupted snapshots and predictions in the future too. These interdependencies make backtracing issues in ML models really hard. As Shraddha explained in her session, due to the scale of these ML systems and, consequently, the adverse impact a bad ML system can have, it is imperative that on-calls respond to issues quickly and with proven strategies so that issues are mitigated as soon as possible.

So what should junior engineers do to make sure they don't get overwhelmed? There are six invaluable debugging techniques which, in the short term and the long term, will not only help junior engineers tackle their on-call issues faster but also make sure they're progressing in their careers. In the next few slides, let's see what these techniques help us achieve.

The most critical issues that engineers need to deal with, called SEVs or site events, need the highest level of attention. A SEV is a critical incident where a system is malfunctioning, causing service disruption or degraded performance, and the severity depends on how bad the impact is. The training focus for engineers in this area should be to recognize the high-priority issues that impact fairness. This is typically done by creating clear guidelines for deciding which incidents qualify as a SEV and what the severity of the SEV should be. The junior engineer should also feel empowered to file follow-up tasks to prevent future SEVs, even if the changes are difficult to make immediately. Why is this important? Because knowing when to escalate issues is crucial in industries like finance or healthcare, where biases can have serious implications. A pro tip I can offer: encourage junior engineers to document the impact and the debugging steps they are taking, which will help the senior engineers respond faster when they join the incident.
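One way to stop a corrupted snapshot from cascading, as described above, is a validation gate between training and deployment. The following sketch is illustrative only, with made-up metric names and thresholds; real systems would compare against whatever quality metrics the team already tracks:

```python
# Illustrative snapshot validation gate: block deployment when a freshly
# trained snapshot looks worse than the current production snapshot.
# Thresholds and metric names are invented for the example.

def validate_snapshot(candidate_metrics, production_metrics,
                      max_auc_drop=0.01, max_calibration_error=0.05):
    checks = {
        "auc_ok": candidate_metrics["auc"] >= production_metrics["auc"] - max_auc_drop,
        "calibration_ok": abs(candidate_metrics["calibration_error"]) <= max_calibration_error,
        "feature_coverage_ok": candidate_metrics["feature_coverage"] >= 0.99,
    }
    return all(checks.values()), checks

# Example: if validation fails, keep serving the previous snapshot and page the on-call.
passed, details = validate_snapshot(
    {"auc": 0.781, "calibration_error": 0.02, "feature_coverage": 0.995},
    {"auc": 0.792, "calibration_error": 0.01, "feature_coverage": 0.999},
)
# passed is False here because AUC dropped by more than the allowed 0.01.
```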
This also creates a virtuous cycle, as these steps can then be integrated into the runbook for future reference. Filing incident reports on time and effectively is just one part of the puzzle. Engineers should also take time to understand the underlying model and serving architecture. It is not necessary for engineers to understand each and every part of the system; however, it is crucial that they understand the common failure points, which will help in narrowing down issues faster. For example, Amazon's AI hiring tool favored male candidates due to a bias introduced in the training data pipeline, which was not identified early in an architecture review. Walkthrough exercises or demo runs that show how such issues can be diagnosed are useful here; for example, a fire drill can be conducted, and diagnosing issues like data latency between feature stores and serving layers could be one such walkthrough.

Once the engineer has a good grasp of the model's architecture, the next step is to learn the end-to-end training and deployment workflows. Concepts like data pre-processing, model training, evaluation, and deployment are things they will face when they're on call or when they're called upon to help mitigate production issues in their models. Understanding the end-to-end flow helps in identifying the root cause faster in real-life issues. To do this, having engineers sharpen their skills on a simpler dataset will teach them how to manage deployments, and those skills will be invaluable in showing them the ropes of the system. An example of how understanding the end-to-end workflow can be useful: when Google Photos started labeling Black individuals as gorillas, investigation found it was due to problems in the training data and model evaluation stages. Another example: if a junior engineer is retraining a chatbot model, they should understand where the new data is coming from, how the model is updated, and what downstream services are impacted.

Running effective queries is another skill set junior engineers should continue to work on iteratively as they work on their projects. Knowing what to query is a human skill, and it takes lots of practice and trial and error to hone it. An example of a query that often comes up in production is figuring out why a recommendation system is returning irrelevant results. This could be due to missing user preferences, bad training data, a bad model architecture, some issue in the inference system, or something else. Junior engineers can shadow senior engineers debugging the issue and learn from the experts what smoking guns and signs to look for. Building the intuition for figuring out failures is a skill set that will help engineers in the long run.
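As an example of the kind of query an engineer might run while debugging irrelevant recommendations, here is a sketch that tracks how often a user-preference feature is missing at serving time; the table, column names, and use of SQLite are assumptions for illustration, not any particular platform's logging schema:

```python
# Illustrative debugging query: when a recommendation system returns irrelevant
# results, one early check is whether user-preference features are missing at
# serving time. Table and column names are hypothetical.
import sqlite3  # stand-in for whatever query engine the team actually uses

QUERY = """
SELECT
    served_date,
    COUNT(*)                                               AS requests,
    SUM(CASE WHEN user_pref_vector IS NULL THEN 1 ELSE 0 END) * 1.0
        / COUNT(*)                                         AS missing_pref_rate
FROM serving_logs
WHERE served_date >= DATE('now', '-7 days')
GROUP BY served_date
ORDER BY served_date;
"""

def missing_preference_rate(db_path="serving_logs.db"):
    with sqlite3.connect(db_path) as conn:
        # A sudden jump in missing_pref_rate points at the feature pipeline
        # rather than the model architecture or the inference system.
        return conn.execute(QUERY).fetchall()
```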
Mentors play a crucial role in helping individuals develop their skills and knowledge in any field. In the context of debugging complex systems, mentors can provide valuable guidance and support to help individuals overcome challenges and improve their problem-solving abilities. Finding a mentor can be done through various channels, such as within the organization or by connecting with professionals on LinkedIn or other engineering communities. It is important to find a mentor who has experience and expertise in the specific area of interest and who can provide constructive feedback and support. One example of how mentors can help is by encouraging shadowing sessions or attendance at debugging forums, where engineers can learn from real-world incidents and gain practical experience in debugging complex systems. In addition to finding a mentor, it is also important to join relevant communities and forums where individuals can connect with peers and learn from their experiences. Communities like KDnuggets and the AI Breakfast Club offer valuable resources and networking opportunities for engineers interested in AI and machine learning.

To become a successful AI/ML engineer, it is important to develop the key skills necessary for success in this field. Skills like debugging expertise, understanding system design, and strong querying skills are just one part of the skill set. Engineers should also look to specialize in a specific area; that could be MLOps, data engineering, or model interpretability. It's also very important that an AI/ML engineer stays ahead of the curve through continuous learning, as new tools and frameworks are being developed all the time. Finally, it is very important to understand the real-world impact of these systems. Models like the COMPAS system have had significant real-world consequences, and fixing issues with such models can have a major impact on a company's reputation and bottom line. By understanding the potential impact of their work, AI and ML engineers can ensure that they're making a positive contribution.

I want to mention that none of this is possible without the combination of a great culture, a great team, and great tools. A great culture serves as the foundation for open communication and values diverse perspectives. Junior engineers should feel comfortable knowing that the company culture respects diversity of background and technical skills and values the mantra that there are no stupid questions. Complementing this is a great team, where individuals leverage their unique strengths and collaborate effectively towards common goals. This can only be achieved by leading through example: managers and tech leads should ensure that the team takes care of each other during on-call situations and that on-calls get the help they need to progress through issue mitigation. Lastly, great tooling enables the team to avoid repetitive tasks and lets team members focus on high-impact work. Automation, wherever possible, to reduce the possibility of human error should be a prime focus of the team roadmap to build a great debugging culture for the product in the long run.

And that concludes my talk. I hope the listeners gained some valuable insights into the debugging techniques required to succeed in ML production systems. Please don't hesitate to reach out to us on LinkedIn with any questions. Thank you.
...

Sunandan Barman

Production Engineer @ Meta

Sunandan Barman's LinkedIn account

Shraddha Kulkarni

Software Engineer ML @ Meta

Shraddha Kulkarni's LinkedIn account


