Conf42 Machine Learning 2024 - Online

Global Intelligence Pipeline: How We’re Crafting Inference at the Edge

Abstract

AI is a hot and rapidly developing topic in the cloud space. This session covers market trends, future directions, training, and inference. Get up to date on the latest AI trends and news, including the value the cloud can deliver to AI and Gcore's Inference at the Edge, with real-life scenarios where the company has helped customers using AI.

Summary

  • Gcore is a global cloud, edge, and AI solutions provider. Our vision is to connect the world to AI anywhere, anytime. The cornerstone of the company is our secure, low-latency network. Today, I'm going to focus on our AI-as-a-service offerings.
  • We leverage our CDN network with over 180 points of presence. A user request takes an average of just 30 milliseconds end to end to reach our network, and around 50 milliseconds in total to reach an inference node where the actual machine inference takes place. Combined with an optimized model, the full inference round trip completes in under half a second.
  • Demos include a real-time face avatar and real-time translation from English to Luxembourgish, created and powered by the cutting-edge technology of Gcore AI cloud. From the moment the request was sent, the processing was almost real time.
  • LetzAI is a creative platform that lets you generate images of anything. You can take a series of images generated by the platform and combine them into a video, which really shows the power of generative AI running on GPU infrastructure.
  • The Inference at the Edge product is being launched in beta this week. It gives you serverless AI compute with very low latency around the world, on a pay-as-you-go model, with DDoS endpoint protection. If you'd like to try it, I'd love to hear from you.

Transcript

This transcript was autogenerated.
Hi everyone, my name is Michele Taroni, Head of AI Products at Gcore. First of all, thank you very much to the organizers of Conf42 for having me here today. I'm really excited to talk to you about our pioneering work building our global intelligence pipeline and its cornerstone technology, Inference at the Edge. During the talk, feel free to ping me on Slack with any questions or comments and I'll do my best to answer. But for now, let's dive in.

As I'm here on behalf of Gcore, let me begin by telling you a little bit more about us. We are a global cloud, edge, and AI solutions provider headquartered in Luxembourg, but with a global presence, and our vision is to connect the world to AI anywhere, anytime. The cornerstone of the company is our secure, low-latency network, which consists of over 180 CDN points of presence and 25 cloud locations, giving us an average response time of just 30 milliseconds. We offer a range of cloud services covering infrastructure as a service and platform as a service, but today, as we are at a machine learning conference, I'm going to focus on our AI-as-a-service offerings and some of the work we've done there.

To begin, we offer a range of AI services including GPU bare metal and virtual machine instances, managed Kubernetes clusters, and also a 5G platform that helps reduce the latency between your device and the model. All of these services underpin what we call our global intelligence pipeline, which is designed to follow the steps taken by machine learning practitioners to train, deploy, and scale AI applications in production. Starting on the left there, you can see that we offer direct access to high-performance AI infrastructure backed by Nvidia GPUs. This provides the raw computing power needed for intense AI training workloads. We then also offer a managed Kubernetes service which can utilize the GPU nodes to help you orchestrate your machine learning and AI workloads. We then enable the deployment of workspaces in MLOps platforms from our vendor partners, which helps you manage your machine learning lifecycle. And finally, we specialize in the serving and inference of pre-trained AI models, and that's going to be the focus of my talk today.

So, as many of you know, when it comes to training your large machine learning and AI models, you really care about compute performance in order to accelerate your research and development. To give you the best possible performance, we partner with Nvidia so that all of our GPU clusters are powered by 8-way A100 or H100 GPUs with NVLink. For really large training jobs that span multiple compute nodes, it's equally important to have a very fast interconnect providing direct GPU connection across nodes, so all of our clusters are also connected with the latest InfiniBand. Now, once you've trained or perhaps fine-tuned a large model for your use case, the question arises: where can I serve my model with low latency? This is a really important question for several reasons, the main one being that as you go from research and development to production business applications of AI, your primary compute workload will move from training to inference. For many business-critical applications, the end user needs a real-time response, no matter where they are in the world.
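Before moving on to inference, here is a minimal sketch of what the multi-node training setup described a moment ago typically looks like in code, assuming PyTorch with the NCCL backend (which uses NVLink within a node and InfiniBand across nodes when available). The model and launch command are illustrative placeholders, not Gcore-specific code.

# Minimal multi-node training initialization sketch.
# Assumes a launch such as: torchrun --nnodes=<N> --nproc-per-node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL selects NVLink within a node and InfiniBand across nodes when present.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # placeholder for a real model
    model = DDP(model, device_ids=[local_rank])  # gradients synchronized across all GPUs

    # ... training loop would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()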
And this is becoming more and more important as the market is, of course, rapidly evolving, with an increasing number of businesses not only testing out AI in pilots or proofs of concept, but actually adopting AI applications in full-scale production at enterprise scale. As an example of that, you can see here a survey of enterprise leaders from McKinsey last year, which showed that over half of them are already adopting AI. It shows just how prevalent AI is becoming, not only in pilots, but in actual production. This not only drives the demand for inference compute, but also the need for scalable, reliable, and secure infrastructure that can maintain a very high level of service availability.

To give you an example, the application most of us are most familiar with is probably chatbots. Take the example here: if you take an off-the-shelf model such as Mistral 7B and run it locally, with a reasonably standard question of around 200 tokens and an answer of around 20 tokens back, that inference took around 250 milliseconds. To an end user, that really feels like real time. But now, supposing you were sitting, say, in Tokyo, and you were submitting that request to a cloud data center in, say, the United States, then the network latency alone could easily double or treble that time. So when you then add in the inference time, the end-to-end processing could get close to a second, which would really damage the user experience. Of course, there are many other use cases where you really care about a real-time response, whether that's, for example, autonomous driving, any sort of virtual live streaming, or even quality inspection use cases in industries such as mining or manufacturing. For all these applications, you of course need high-performance compute for fast inference, but you also need very low latency, and in many cases it's also important for regulatory or privacy reasons that the data is processed locally. For us, that means in an edge location that is geographically close to the end user.

So now, supposing you're looking to run inference on your model, if you want to achieve that in real time, then there are three core requirements you have to work on. First, you need distributed, powerful compute infrastructure around the world. Second, you need a very low latency backbone which connects that compute. Third, you need runtime deployment and reconfigurability.

Let's start with the inference. When we looked at running inference at the edge here at Gcore, we faced several challenges that required innovative solutions to optimize performance and efficiency. I'll focus on LLMs here, as they are the technology that gets the most attention at the moment, but of course these techniques are also relevant to other technologies. The first strategy that we used was operator fusion: by merging adjacent operators, this helps streamline computational tasks, which often leads to better latency, enhancing the responsiveness of AI models. We also employed quantization, which involves compressing the activations and weights of neural networks so they require fewer bits. Another strategy is compression, with techniques such as sparsity, where we trim unnecessary connections, or distillation, where we train smaller models to mimic larger ones.
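As a rough illustration of the quantization idea just described, here is a minimal sketch using PyTorch's post-training dynamic quantization; the toy model is a placeholder, and this is an example of the general technique rather than the optimization stack used at Gcore.

# Post-training dynamic quantization sketch: Linear weights stored as int8,
# which shrinks the model and often speeds up CPU inference.
import torch
import torch.nn as nn

# Toy model standing in for a real pre-trained network.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

quantized = torch.quantization.quantize_dynamic(
    model,              # float32 model to convert
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # the quantized model is a drop-in replacement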
Of course, with all these techniques, you've also got to balance the performance of the model against its accuracy and make sure you don't lose too much accuracy. And the fourth technique, which of course is very well suited to GPUs, is parallelization. Typically, we've implemented tensor parallelism, distributing computation across multiple devices, and also pipeline parallelism to help efficiently manage larger models and scale up our AI capabilities.

Once we've optimized the model itself, the second piece of the puzzle is the network latency. On this slide, I'm showing you a map of the growth of 5G around the globe. The reason I'm sharing that is that over the last few years, 5G has enabled many of us, perhaps all of us, to stream really high-quality content, such as video, straight to our mobile devices, because of the ultra-low latency that 5G provides. We've taken those learnings and looked to apply them to AI as well. By leveraging our CDN network with 180 points of presence, a user request will take just an average of 30 milliseconds end to end to get to our network and back, and perhaps a total of around 50 milliseconds to then get to an inference node where the actual machine inference takes place. That round trip of around 50 milliseconds is very fast compared to perhaps a few hundred milliseconds in a typical public cloud setting. What that means is that, provided you've also optimized the model inference so that it takes perhaps between 100 and 400 milliseconds, you can achieve an end-to-end processing time of under half a second. That's important because around half a second is roughly the threshold at which we as humans perceive a response to be real time; any slower and you start to notice the lag. So, to put that into context: if you combine a very low latency network with optimized inference, you get a round trip in under half a second, and that's applicable to a number of technologies. Automatic speech recognition, object detection, and text-to-speech are all examples where this enables a real-time response.

Then, for the final piece of the puzzle, we've had to work on runtime deployment and reconfigurability in a simple and scalable manner to really enable this inference at the edge, at scale. We achieved this using container-as-a-service technology, so that we provide a single anycast endpoint that can be easily connected to a developer's application. What that means is that no matter where in the world an end user is, the request to that endpoint is routed to the nearest CDN node and from there to the nearest inference node. If we delve into a little bit more detail, you can see what's happening under the hood; I apologize that the diagram is a little bit small, but hopefully you can follow it. At the top here, suppose you have two end users, and let's say one of them is in Rio de Janeiro and one in, say, Kyoto. The anycast endpoint is available globally. When each of those users sends a request, that request goes to their local CDN node: for the user in Kyoto there might be a CDN node in Kyoto itself, and similarly the request from Rio de Janeiro would go to the Rio de Janeiro CDN node. On top of that, we developed smart routing technology, which routes the request to the nearest available inference region.
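To give a sense of what connecting to such a single global endpoint could look like from the developer's side, here is a hypothetical Python client; the URL, header, and payload format are assumptions made for illustration rather than a documented API, and the timing check simply reflects the budget described above (roughly 50 milliseconds of network round trip plus 100 to 400 milliseconds of inference).

# Hypothetical client call to a single, globally routed inference endpoint.
import time
import requests

ENDPOINT = "https://inference.example.com/v1/predict"  # placeholder URL, not a real service
headers = {"Authorization": "Bearer <API_KEY>"}         # placeholder credential
payload = {"prompt": "Summarize edge inference in one sentence."}

start = time.perf_counter()
response = requests.post(ENDPOINT, json=payload, headers=headers, timeout=5)
elapsed_ms = (time.perf_counter() - start) * 1000

# Target budget from the talk: ~50 ms network round trip + 100-400 ms inference,
# i.e. an end-to-end response in under roughly half a second.
print(f"status={response.status_code} latency={elapsed_ms:.0f} ms")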
Those inference regions are where we have high-performance compute, typically GPUs, that does the actual optimized inference. What that means is that not only do we guarantee the fastest response time for each of these users individually, but we also ensure that the processing, and the model itself, stay local. In particular, for the user in Kyoto, the inference will take place in, say, Tokyo, and their data and that model will only ever be in Tokyo. Likewise for the user in Rio de Janeiro in Brazil: their model and their inference would also stay local in Brazil. That also means you could potentially have a different model in each region. For example, you could have a model that was trained only on Japanese user data in Japan, and similarly for other world regions. That also helps maintain the data privacy and locality that I was talking about earlier.

The final piece of the puzzle that I want to talk about here is that in each of these regions, we have also developed autoscaling functionality. What that means is that the amount of compute scales up and down with demand. That's really important because it means that you can scale up the compute when you need it, when there's lots of demand, but equally importantly, you can scale back down again when there isn't, so you're not paying for compute when you don't use it.

Tying all these things together, I'll just summarize by putting together our full end-to-end architecture. You'll see that we've combined high-performance AI training infrastructure, which could be in a public cloud setting, though for some customers where privacy is really important we can also offer it in a private cloud setting. If you then combine that with the low-latency network and the Inference at the Edge technology we're talking about today, you have a comprehensive global AI architecture that we believe is fit for today's most demanding and scalable AI applications.

With that, that's probably enough of me talking, so I'll showcase a few demos of the technology in action. The first of those is an example with a real-time face avatar. At the top here I'll be entering some prompts, and what you should look out for is the way that the avatar of the person changes in real time as I type in those prompts. For example, here I'm going to type in "portrait of Elon Musk", and you see that the face has changed to Elon Musk almost immediately. Now as I type "Bill Gates", again that change was done in real time. And now moving on to Madonna and a few more examples here. Again, what's really impressive is how quickly the change happens compared to the typing of the prompt. There's a Jeff Bezos when he had hair.

Another example I can show you here is real-time translation from English to Luxembourgish. Now, like most people in the world, I don't speak Luxembourgish, which is a local language of Luxembourg. But if I was in our head office and I needed to quickly translate something, then no problem, I could use a translation tool like this: "Hello there. Allow me to introduce myself. I am an advanced machine learning model specifically designed for voice-to-text translation from English to Luxembourgish. I have been created and powered by the cutting-edge technology of Gcore AI cloud." Great. I'll press send, and you'll see that in less than a second, less than half a second I'd say,
the translation happened and I received the Luxembourgish translation of those words. Again, from the minute I pressed send, the processing was almost real time.

The final example I'll show you today is some work we've done with our colleagues at LetzAI. LetzAI is a creative platform that allows you to generate images of anything simply by describing it in a text prompt. We've worked with LetzAI over the last few months, and they've been utilizing our H100 AI infrastructure and inference services to both train their generative AI models and serve them to their customers worldwide. What you see here is an example of a single image that was generated by the platform. But I think what's really impressive is that you can actually take a series of images generated by the platform and combine them together to form a video. So this is an example of a video generated with LetzAI, which I think really shows the power of generative AI powered by GPU infrastructure: "The rookie sensation Ricky Malone surprises with an unbelievable pole position at Silverstone. It looks like Ricky Malone is joining forces with the legendary F1 team for the 1986 season. Ricky, after our crash, watching you lose yourself was hard. Not just because of us, but seeing you give in to darkness. It wasn't only about you. I was hurt too, trapped by resentment. Encouraging you to race again, to face that trauma, it's for both of us. I need to see you conquer this. Not to go back to what we had, but to find closure for both of us to move on. This is about letting go, forgiving each other, healing. By helping you return to the track, I'm also finding my way back. It's about finding peace, Ricky, for you and for me." So there you go, a completely generated video that I certainly found very impressive.

We're almost at the end of the talk here, and I just want to let you know that if you're interested in learning more about the technology and our infrastructure, the Inference at the Edge product is actually being launched in beta this week. This lets you deploy your model globally with a single endpoint, as I explained, or you can choose an open-source model out of the box from a model catalog, which includes popular models such as Mistral and Stable Diffusion. This gives you serverless AI compute with very low latency around the world on a pay-as-you-go model, with DDoS endpoint protection. And as this is our beta service, it's also free. So if you'd like to try it, I'd love to hear from you, to find out about your machine learning use case and see whether edge computing could be the right solution for you.

I'll just end by saying that, of course, our work doesn't end here. The team is striving to push the limits of our network and bring down the latency to just a few tens of milliseconds, and we believe this will further benefit really mission-critical use cases of AI at the edge. So with that, I hope you've enjoyed the talk. Thank you for joining me today. And again, please do reach out if you have any questions or would like to discuss any of the technologies that I've spoken about today, but for now, have a great conference.
...

Michele Taroni

@ Gcore


