Conf42 Prompt Engineering 2024 - Online


The Disruptive Potential of On-Device Large Language Models

Abstract

Discover the game-changing potential of on-device large language models! Learn how innovations from Apple, Mistral AI, and Google's Gemma 2B are reshaping AI, optimizing performance, reducing costs, and enhancing privacy. Imagine a world with human-like conversations, right on your device!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Rishabh Mehra, and today I'll be talking about on-device large language models, which I believe are a path to safer and more efficient AI systems.

Let me start by describing who I am and why I'm talking about this topic. I studied computer science at Stanford, where I researched computer vision systems under Professor Fei-Fei Li, and we published papers at MLHC and NeurIPS in the computer vision and healthcare fields. From there, I went to Apple, where I worked on on-device machine learning models to make iOS smarter as a system, as well as on prototyping efforts for Apple Intelligence, the large language model system Apple launched recently. Now I'm working on Pinnacle, a startup at the intersection of all three of those fields. As a fun fact, I'm also into rock climbing, which is why it shows up in the pie chart there.

All right, now let's look at the table of contents for today. We'll start with why on-device LLMs matter and why we're talking about them. We'll cover the state of the art, what's out there, and we'll see some examples of how these models perform. We'll go through some real-world use cases that are already deployed. Then we'll talk about what an MVP would look like and, in my opinion, the path to that MVP. And we'll conclude with where I see these models in the next five years and the applications they'll enable.

All right, let's start with the first topic: why on-device LLMs matter. Let's begin by understanding what large language models are. The way I like to define them, at a very high level, is through a three-word game I used to play with my sister when I was young. I would say three words, my sister would follow up with the three words she thought would best come next, then I would say three more words, and so on, until we formed a full story. Interestingly, large language models are trying to do the same thing: you give them a bunch of words, and they reply with the words they think make the most logical sense given what you said. Of course, this is a very high-level definition, but that's the task they're trying to perform.

Now let's talk about three major problems I believe large language models have. The first is privacy invasion. Privacy invasion occurs because the kinds of applications that use large language models need personal data: you're often telling the model information, or providing it through your computer screen, which it then processes. This raises major privacy concerns, because you might be sharing your data not only with an LLM provider like OpenAI but also with an intermediary like a startup, so you need to trust both of those companies.

The second is that large language models have a massive carbon footprint. For example, one article estimated that just training GPT-4 produced emissions equivalent to driving 18 million miles in a gasoline car, and another estimate put a single inference call to GPT-4 at the equivalent of charging a mobile phone 60 times over.

The third problem is their ability to mimic humans. Because large language models produce language much the way humans do, they enable problems such as deepfakes. The internet is flooded with deepfakes, and the possibility of fraud increases a lot. For example, imagine getting a call in the voice of your mother, and your mother tells you she's hurt and she needs some money.
You'll probably transfer the money, but later you'll realize it was just a large language model talking in your mother's voice. This is a very big concern, but we won't cover it today, because it's a whole other talk; we'll cover the first and second points.

Moving these large language models from the internet to on-device solves the first two problems. On-device models are off the grid: they don't need the internet or any external connection, and there's no middleman. Everything is on your device, so there's no privacy risk anymore. And because these models have to run on-device, they're automatically smaller, and smaller means greener: they have a lower carbon footprint. That's just by design, because you have to make them smaller to run on a personal computer.

All right, now let's talk about state-of-the-art LLMs. Where is this technology today, and what does it look like? We'll answer that through two demos.

For the first demo, I'll give the following prompt to the LLM: "Hello, I'll be giving a presentation with the topic 'On-device LLMs: a path to safer and more efficient AI systems.' Can you give me a one-line opener? Make it engaging and thought-provoking." I copy the prompt, and then I use Ollama, which is a tool for running large language models on-device. The model I'm running is Llama 3.1, a model by Meta with 8 billion parameters. It loads into my computer's memory, I paste the prompt and run it, and it gives me a reasonable response. Let's read the first one: "As we increasingly rely on AI, can we afford not to rethink the way we build models, from the edges of our networks to the edges of our devices?" Makes sense. It's nice. It's not perfect, but it's a good start.

Next, before we get a baseline, we try to run this on Llama 3.1 70B, a much bigger model by Meta, and it starts loading into memory. But as I soon realize, it doesn't actually fit in this computer's memory. I open Activity Monitor and see the model taking a huge amount of memory, already 32 GB, which is more than this computer has, so it simply cannot run on my device. (For a sense of scale: 70 billion parameters at 16-bit precision is roughly 140 GB of weights, and even 4-bit quantization still needs around 35 GB.) So we've seen that the practical model size on-device today is somewhere around 7 to 10 billion parameters.

Now, to get a baseline, we run the same prompt in GPT-4o. I paste the prompt, run it, and get the response: "Imagine a smartphone that's not just smart, but a powerhouse of intelligent privacy. Welcome to the new frontier with on-device LLMs." A perfect opener, exactly what I wanted, and that's what we've come to expect from GPT-4o. But of course, this is still running in the cloud. As we saw, Llama 8B produced a response that made sense, but it's not at the level of GPT-4o yet.

All right, for the second test, we'll do a coding challenge: we'll get these models to create Minesweeper, the popular game, and see how the different models do. Let's start by copying the prompt I created to make the models write Minesweeper. We'll again start with Llama 8B, the tiny model by Meta that runs fully on my device. As you can see, it produces code that makes sense: it writes the init function, print board, create mines, count adjacent mines, and reveal. All of this makes sense for a typical Minesweeper game.
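The generated code itself isn't reproduced in this transcript, but the structure being described looks roughly like the minimal sketch below: an initializer that places mines, a board printer, an adjacent-mine counter, a reveal step, and a play loop. Every name and detail here is illustrative, not the model's actual output.

```python
import random

class Minesweeper:
    """Minimal Minesweeper skeleton (illustrative reconstruction)."""

    def __init__(self, size=5, num_mines=5):
        self.size = size
        self.revealed = set()
        self.mines = self.create_mines(num_mines)

    def create_mines(self, num_mines):
        # Pick `num_mines` distinct cells at random.
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        return set(random.sample(cells, num_mines))

    def count_adjacent_mines(self, r, c):
        # Count mines in the up-to-8 neighboring cells.
        return sum((r + dr, c + dc) in self.mines
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))

    def reveal(self, r, c):
        self.revealed.add((r, c))

    def print_board(self):
        # Revealed cells show their adjacent-mine count; hidden cells show '#'.
        for r in range(self.size):
            print(" ".join(
                str(self.count_adjacent_mines(r, c)) if (r, c) in self.revealed
                else "#"
                for c in range(self.size)))

    def play(self):
        # Loop until the player hits a mine or reveals every safe cell.
        safe_cells = self.size * self.size - len(self.mines)
        while len(self.revealed) < safe_cells:
            self.print_board()
            r, c = map(int, input("row col: ").split())
            if (r, c) in self.mines:
                print("Boom! You hit a mine.")
                return
            self.reveal(r, c)
        print("You win!")

if __name__ == "__main__":
    Minesweeper().play()
```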
Now it's finishing producing this code, and it's written the play function, which is a while loop that runs the game until you hit a mine, of course. I can go ahead and copy the code, paste it into a Python file called minesweeper_llama8b.py, and run it. Unfortunately for Llama, it doesn't run. It only took me 15 to 20 minutes to correct, though, so it wasn't too far off, and other iterations produced slightly better code that sometimes even ran.

Okay, now to get a baseline, let's go to GPT-4o and paste the prompt. As you can see, it generates the code very quickly. Of course, this is again running on a server and we expect it to be great, but just for a baseline, let's see what it does. The code looks similar to what Llama produced, but as we know, the quality is expected to be a lot better. I do the same thing: copy the code, go back to the terminal, create a file called minesweeper_gpt4o.py, paste the code, and run it. And here we get a perfectly working version of Minesweeper. I type 3 4, hit a mine, and lose on the first turn. So let's play once more. This time I can run through an entire game of Minesweeper, going through the different cells I've chosen. This part is fast-forwarded, and eventually I enter 3 3, hit a mine, and the game ends.

All right, now the interesting part. We're going to paste the same prompt into a very different model, called CodeQwen, by the Alibaba group in China. It's only a 7-billion-parameter model, so it's a bit smaller than the Llama model we saw earlier and able to run fully on-device. The interesting thing about this model is that it's customized and tuned to do well on coding tasks. Its strength is coding, and it outperforms Llama, as we'll soon see. Right now it's writing code: it creates the board, counts the mines, reveals cells, a similar structure to what Llama was producing. It runs a bit more slowly, even though it has fewer parameters; presumably it's not as optimized for MacBooks as Llama is. So I'll give it a second to finish coding. Almost there. All right, there we go: it's finished the code, along with some description. I copy the code, create a Python file called minesweeper_codeqwen7b.py, paste it in, and run it. Interestingly, it shows me the locations of the mines up front, which is a bit weird, but at least the code runs. Then I enter the location of a mine, 0 3, but it doesn't end the game. Then I enter 0 0, which is not the location of a mine, and that works well. So essentially it's playing Minesweeper correctly, except that it shows the mines at the beginning and hitting a mine doesn't end the game. I played around with this code after the demo, and it took me less than five minutes to make it fully functional, so it was definitely higher quality than what Llama 8B had created for me.

All right, we've looked at these models qualitatively, but let's look at some numbers. Sorry, the bottom-right number is hidden by my video, but it's 78. Let's start from the top left with Llama 3.1, the 8-billion-parameter model, and look at the chat reasoning and coding tasks, so how well the models perform on chat tasks and coding tasks. Llama 3.1 8B got a 73 on chat and about the same on coding.
The 70-billion-parameter model got an 86 on chat, an amazing score for a model that size, and about the same on coding. GPT-4o got an 86.7 on chat and an 87.8 on coding, so again, those two are close to each other. But CodeQwen, the tiny model by the Alibaba group, got a 78 on the coding benchmark, much higher than the 8-billion-parameter Llama model, as we saw earlier. This shows us the importance of having task-specific models, which we'll talk about more in the coming slides as well.

Next, let's explore the current real-world use cases for on-device large language models. These are use cases already out there in the wild. Here's a list of them; some are large language models, and some are related technologies that use tech similar to large language models. Style transfer: this is already out there. If you want to convert your email into a friendlier or more professional email, that's already possible today with large language models. Speech-to-text: converting your audio to text is a very important task, and it works really well fully on-device today, probably almost as good as the server-side models, to be honest. Summarization: summarization does work on-device today. It's not as good as server-side, but it's pretty good; you can get 80 percent of the value you'd want out of it. And finally, translation: same as summarization, it works really well on-device, slightly worse than server-side, but it's almost there. Now let's actually look at some of these use cases in action through a demo.
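To make one of these use cases concrete in code before the demo: here is a minimal sketch of on-device style transfer, assuming the ollama Python client (pip install ollama) and the same locally pulled llama3.1 model as the earlier demo. The helper name and prompt wording are illustrative, not part of the talk.

```python
# Minimal on-device style transfer. Assumes the `ollama` Python client
# (pip install ollama) and a model pulled locally with `ollama pull llama3.1`.
# Nothing leaves the machine: Ollama serves the model on-device.
import ollama

def rewrite_professionally(text: str) -> str:
    # Hypothetical helper: the system prompt below is illustrative.
    response = ollama.chat(
        model="llama3.1",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's text in a professional tone. "
                        "Reply with the rewritten text only."},
            {"role": "user", "content": text},
        ],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    draft = "hey alex, can't make the meeting, stuff came up, reschedule?"
    print(rewrite_professionally(draft))
```

Swapping the system prompt turns the same pattern into summarization or translation.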
All right, so let's go through the use cases we saw before through the eyes of Apple Intelligence. Apple Intelligence is a suite of features Apple launched in iOS 18.1; some of the features in this demo also came before that. As you can see, the phone is in airplane mode, so there's no internet: this is fully on-device, not running a model on a server or anything. What I'm doing here is typing an email in the Notes app: "Hi Alex, Rish this side. Good to meet you, but I don't think now is the right time to collaborate. While I love your vision, the product isn't the right fit right now. Hope to collaborate someday. Rish." Not great English, but it's a start. So I can select this, go to Writing Tools, which is one of the new features Apple launched, and click Proofread. Here Apple suggests I change a couple of things: "Rish" with a comma, "this side," "good to meet," and "some day" combined into a single word. Not super helpful, but it's okay. So let's try another writing tool. This time I select Writing Tools and click Rewrite, which, as expected, rewrites the message. It says, "Rish, it's great to meet you on this side," so it thinks Rish is the other person instead of me, which is incorrect. But look at the second sentence: "However, I don't believe this is the appropriate time to collaborate." That sentence and the rest of the email are a lot better than before. So it's getting there; it's actually improving the email the way I want. Let's just remove "Rish" to make the job simpler for the LLM. This time when I select the email, I go to Writing Tools but change it to a friendly tone, so style transfer this time. And I get the response: "Hi Alex. It's great to meet you. I think we're both on the same page about wanting to collaborate in the future. I just don't think now is the right time to work together on this particular project. I hope we can still find a way to work with each other in the future." So that worked flawlessly.

Next, let's test dictation. Dictation is essentially speech-to-text. I'm saying a sentence you can't hear: "Hi, this is a test to see how dictation works." It again worked flawlessly, fully on-device; airplane mode is still on.

Next, let's try selecting this entire email and creating a summary of it. I don't want to read through the whole email, so I go to Writing Tools and click Summary. Unfortunately, that does not work without the internet. Just to make sure it works with the internet, let's turn the internet on, go to Writing Tools, and click Summary. And I get a summary telling me about the events with Uber, Bloomberg, Citadel, etc., and that RSVPs are required for all of them. So I got all the information I needed.

Next, let's look at Spotlight. "Meet Alex at 10 a.m." Airplane mode is on again, and I get a meeting suggestion. So there's NLP matching going on here, even if it may not be a full large language model; that's unknown.

Next, let's look at Translate. I try to translate "hello," and it doesn't work without the internet. So I turn on the internet, click the three dots, click Download Languages, scroll down, and simply download Spanish onto my iPhone; it also downloads English. Now it should work on-device. I go back into airplane mode and write, "Can I get the churros?" and it translates to Spanish without the internet, because airplane mode is on, and presumably this is correct. Then I write "thank you," and it says "gracias," which I know is correct.

Next, I have this whole list of notifications. Airplane mode is on, I tap to show less, and I get a summary of the notifications. So summarization in this context works without the internet as well. All right, that was a rapid-fire tour of a bunch of Apple Intelligence features that work without the internet and hence run fully on-device.

Next, let's talk about the path to an MVP. The MVP, in my opinion, is reaching GPT-4o-level performance, meaning accuracy, on-device. That would be game-changing: a product that's genuinely valuable in everyday life. To reach this target, I believe we'll have to improve the hardware to be able to run 30 billion parameters on-device. I think these 8-to-10-billion-parameter models are still too small to hold enough information, but we see promising results from 60-to-70-billion-parameter models, and I'm sure we can make them smaller, perhaps to around 30 billion. Models also need to be more specialized. As we saw, CodeQwen absolutely outperformed Llama 3.1 8B on coding, despite having a similar number of parameters, just because it was coding-specific. And you can go one level deeper: make a model Python-specific instead of coding-specific in general, and it might be even better.

Next, we need to improve adapters. Adapters are things like LoRA adapters, which, again, I can't go into in detail, that would be a whole other talk, but essentially they add a layer of specialization without having to train a whole new model. Say you have a Python-specialized model: you can then use a LoRA adapter to turn it into a model for creating games in Python. So we can become even more specialized using adapters. In terms of software innovations, we've already talked about the second and third points, which are adapter mechanisms and specialization.
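Since adapters keep coming up, here is a rough sketch of the LoRA idea in PyTorch: the pretrained weight matrix is frozen, and only a small low-rank update is trained, so a specialization is a small add-on rather than a new model. This is a conceptual sketch of the technique, not how Apple or any particular library implements it.

```python
# Conceptual LoRA sketch: y = W x + (alpha/r) * B A x, with W frozen.
# Only A and B are trained; illustrative, not a production adapter.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        # Only these two small matrices are trained:
        # roughly r * (in + out) parameters instead of in * out.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # 65,536 vs 16,777,216 in the full matrix
```

Because the trainable update is only a few megabytes, a device could keep one base model and swap in a Python adapter, a game-writing adapter, and so on.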
But one more thing that needs innovation is enhanced compression techniques. Not only can we reduce parameter counts, we can also compress these models into smaller and smaller footprints, pruning parameters and using better compression techniques. Then, on the hardware side, we need optimized processing units to run bigger and bigger models: instead of 10-billion-parameter models, we should be able to run 30 billion or more. We also need enhanced memory architectures, meaning the RAM system, because these models need to be extremely quick, so they need a really solid memory architecture. I'm not a hardware expert, so I can't go into detail on these topics, but at a high level, from what I've discussed with hardware experts, these are the two main frontiers we need to push.

All right, finally, let's say all of this happens: we have 30-billion-parameter models running fully on-device, specialized, with adapters. What's the outcome? Let's talk about the real-world use cases these models will enable.

The first is augmented workflows. These models will become deeply integrated into everyone's work life. You'll be able to share your entire work context with them, because they don't send any data to a server, and they'll streamline every single task in your life, whether that's emails, data analysis, presentations, filling out spreadsheets, or writing code.

The next is on-device therapy. On-device therapy will get better with these models, because you'll be able to share more, and you'll have advice available at all times, in your pocket or on your computer.

Then legal assistance. This will help lawyers, because they'll be able to share client data with these models at any time to get information, and they can trust the models because they run fully on-device. But it will also help the layman get legal assistance quickly, without having to engage a lawyer. Again, they can trust the model because it's fully on their device, so they can have that same lawyer-client relationship with it, knowing it will not breach their trust.

And finally, medical diagnosis. Medical diagnosis will again be empowered by these models, because you'll be more willing to share your personal data and your entire medical history with them. They're fully on your device, so you don't have to worry about where your data goes, and you can get really accessible healthcare diagnosis, and potentially even treatment, using these models.

That's all I have for today. Thank you for coming to the talk. If you have any questions or follow-ups, please feel free to email me at rish@pinnacle.co. Thank you so much. Bye.

Rishab Mehra

Founder & CTO @ Pinnacle
