Transcript
Hi everyone, my name is Nikhil Taroni, head of AI products at Gcore. First of all, thank you very much to the organizers of Conf42 for having me here today. I'm really excited to talk to you about our pioneering work building our global intelligence pipeline and its cornerstone technology, inference at the edge. During the talk, feel free to ping me on Slack with any questions or comments and I'll do my best to answer. But for now, let's dive in.
As I'm here on behalf of Gcore, let me begin by telling you a little bit more about us. We are a global cloud, edge, and AI solutions provider headquartered in Luxembourg, but with a global presence, and our vision is to connect the world to AI anywhere, anytime. The cornerstone of the company is our secure, low-latency network, which consists of over 180 CDN points of presence and 25 cloud locations, giving us an average response time of just 30 milliseconds.
We offer a range of cloud services covering infrastructure as a service and platform as a service. But today, as we're at a machine learning conference, I'm going to focus on our AI-as-a-service offerings and some of the work we've done there. To begin, we offer a range of AI services, including GPU bare metal and virtual machine instances, managed Kubernetes clusters, and also a 5G platform that helps reduce the latency between your device and the model. All of these services underpin what we call our global intelligence pipeline, which is designed to follow the steps taken by machine learning practitioners to train, deploy, and scale AI applications
in production. So starting on the left there, you can see that we offer direct access to high-performance AI infrastructure backed by NVIDIA GPUs. This provides the raw computing power needed for intensive AI training workloads. We then also offer a managed Kubernetes service, which can utilize the GPU nodes to help you orchestrate your machine learning and AI workloads. We then enable the deployment of workspaces in MLOps platforms from our vendor partners, which helps you manage your machine learning lifecycle. And finally, we specialize in the serving and inference of pre-trained AI models, and that's going to be the focus of my
talk today. So, as many of you know, when it comes to training your large machine learning and AI models, you really care about compute performance in order to accelerate your research and development. So to give you the best possible performance, we partner with NVIDIA so that all of our GPU clusters are powered by 8-way A100 or H100 GPUs with NVLink.
But for really large training jobs that span multiple compute nodes,
it's equally important to have a very fast interconnect providing direct GPU-to-GPU connections across multiple nodes. All of our clusters are also connected with the latest InfiniBand.
But now, once you've trained or perhaps fine tuned a large model
for your use case, the question arises,
where can I serve my model with low latency?
This is a really important question for several reasons, the main one being that as you go from research and development to production of a business AI application, your primary compute workload will move from training to inference. For many business-critical applications, the end user needs a real-time response, no matter
where they are in the world. And this is becoming more and more important
as the market is, of course, rapidly evolving, with an increasing number of businesses not only testing out AI in pilots or proofs of concept, but actually adopting AI applications in full-scale production at enterprise scale. So as an example of that,
you can see here a survey of enterprise leaders from McKinsey last year, and it showed that over half of them are already adopting AI. And again, it just shows how prevalent AI is becoming, not only in pilots, but in actual production.
This not only drives the demand for inference compute,
but also the need for scalable, reliable, and secure
infrastructure that can maintain a very high level of service availability.
So, to give you an example, the application most of us are most familiar with is probably chatbots. So take the example here: I've taken an off-the-shelf model such as Mistral 7B and run it locally, with a reasonably standard question of around 200 tokens and an answer of around 20 tokens back, and running that locally took around 250 milliseconds. Now, to an end user, that really feels like real time. But now, supposing you were sitting, say, in Tokyo, and you were submitting that request to a data center in the cloud in, say, the United States, then just the network latency could easily double or treble that time. So when you then add in the inference time, the end-to-end processing could get close to a second, which would really damage the user experience.
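Just to make that local measurement concrete, here is a rough sketch of how you might time a prompt like that yourself with Hugging Face Transformers. The checkpoint name and prompt are placeholders rather than the exact setup from the demo, and the numbers you get will depend entirely on your hardware.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; not necessarily the exact build used in the talk.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# In the talk the question was around 200 tokens; any prompt works for timing.
prompt = "Explain in a couple of sentences why latency matters for chatbots."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
elapsed_ms = (time.perf_counter() - start) * 1000

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed_ms:.0f} ms")
```

The remote-cloud scenario described above simply adds the network round trip on top of whatever number this prints.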
Of course, there are many other use cases where you really
care about a real time response, whether that's,
for example, autonomous driving or any sort of virtual live
streaming or even quality inspection use cases in industries
such as mining or manufacturing. For all these applications,
you of course need high-performance compute for fast inference, but you also need very low latency. And in many cases, it's also important for regulatory or privacy reasons that the data is processed locally. For us, that means in an edge location that is geographically close to the end user. So now,
supposing you're looking to run inference on your model, if you want to achieve that in real time, then there are three core requirements you have to work on. The first of those is that you need a distributed, powerful compute infrastructure around the world. But you also then need a very low-latency backbone which connects that compute. And you then also need runtime deployment and reconfigurability.
So let's start with the inference. When we looked at running inference at the edge here at Gcore, we faced several challenges that required innovative solutions to help optimize performance and efficiency. I'll focus on LLMs here, as they are the technology that gets the most attention at the moment, but of course these techniques are also relevant to other model types. The first strategy that we used was operator fusion. By merging adjacent operators, we streamline computational tasks, which often reduces latency and enhances the responsiveness of AI models. We also employed quantization, which involves compressing the activations and weights of neural networks so they require fewer bits. Another strategy is compression, with techniques such as sparsity, where we trim unnecessary connections, or distillation, where we train smaller models to mimic larger ones.
Of course, with all these techniques, you've also got to balance the performance of the model against its accuracy and make sure you don't lose too much accuracy. And finally, the fourth technique, which is of course very well suited to GPUs, is parallelization. Typically, we've implemented tensor parallelism, distributing computation across multiple devices, and also pipeline parallelism to help efficiently manage larger models and scale up our AI capabilities.
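To give a flavour of what quantization looks like in code, here is a minimal PyTorch sketch that converts the linear layers of a toy model to int8 weights using dynamic quantization. This is purely illustrative and not the tooling behind our production stack; for large LLMs you would typically use dedicated libraries and check the accuracy impact carefully.

```python
import torch
import torch.nn as nn

# A small stand-in model; in practice this would be a transformer, not a toy MLP.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Dynamic quantization stores the Linear weights as int8 and dequantizes them on
# the fly, shrinking the memory footprint and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # same output shape, smaller weights
```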
So once we've optimized the compute side of the model, the second piece of the puzzle is the network latency. Now, on this slide, I'm just
showing you a map which shows the growth of 5G
around the globe. And the reason I'm sharing that is,
of course, over the last few years, 5G has enabled
many of us, perhaps all of us, to stream really high
quality content, such as videos, straight to our mobile devices,
because of the ultra low latency that 5G
provides. So we've taken those learnings and looked to apply them to AI as well. By leveraging our CDN network with 180 points of presence, a user request will take just an average of 30 milliseconds end to end to get to our network and back, and then perhaps a total of around 50 milliseconds to get to an inference node where the actual machine inference takes place. That round trip of around 50 milliseconds is very fast compared to perhaps a few hundred milliseconds in a typical public cloud setting. So what that means is that, provided you've also optimized the model inference so that it takes perhaps between 100 and 400 milliseconds, you can achieve an end-to-end processing time of under half a second.
And that's important because around half a second is about the threshold at which we, as humans, still perceive a response to be in real time. Any slower and you start to notice the lag.
So again, to put that into more context: if you combine a very low-latency network with optimized inference, you get a round trip in under half a second, and that's applicable to a number of technologies. So, for example, automatic speech recognition, object detection, or text-to-speech are all examples where this enables a real-time response. So then, for the final piece of the puzzle,
we've had to work on runtime deployment
and reconfigurability in a simple and scalable
manner to really enable this inference at the edge, at scale.
So to do this, we used container-as-a-service technology, so that we provide a single anycast endpoint that can be easily connected to a developer's application. What that means is that no matter where in the world an end user is, a request to that endpoint is routed to the nearest CDN node and from there to the nearest inference node.
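To give a feel for what that single anycast endpoint means from the developer's side, here is a hedged client sketch. The hostname, path, and payload shape are purely illustrative placeholders rather than the actual API; the point is simply that the application always talks to one global endpoint and the routing happens behind it.

```python
import requests

# Hypothetical global endpoint; the real URL and request schema will differ.
ENDPOINT = "https://inference.example-edge.ai/v1/generate"

def ask(prompt: str) -> str:
    # The same hostname is announced via anycast everywhere, so this request
    # lands on the nearest CDN node and is forwarded to the closest inference region.
    response = requests.post(
        ENDPOINT,
        json={"model": "mistral-7b-instruct", "prompt": prompt, "max_tokens": 64},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()["output"]

print(ask("Summarise today's production schedule in one sentence."))
```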
So if we delve into a little bit more detail, you can see what's happening here under the hood. I apologize that it's a little bit small, but hopefully you can follow it.
So at the top here, supposing you have two end users, and let's say one of those is in Rio de Janeiro and one of those is in, say, Kyoto. Now, the anycast endpoint is available globally. When each of those users sends a request, that request will go to their local CDN node. For the user in Kyoto, there might be a CDN node in Kyoto itself. And similarly for Rio de Janeiro, that request would go to the Rio de Janeiro CDN node. Now, after that,
we've developed smart routing technology, and what that does is route the request to the nearest available inference region. That's where we have some high-performance compute, typically GPUs, that will do the actual optimized inference. Now, what that means is that not only does it guarantee the fastest response time for each of these users individually, but it also ensures that the processing and the model itself stay local.
So in particular, for the user in Kyoto, the inference will take place in, say, Tokyo, and their data and that model will only ever be in Tokyo. And likewise for the user in Rio, in Brazil, their model and their inference would also stay local to Brazil. That also means you could potentially have a different model in each region. So, for example, you could have a model that was trained only on Japanese user data in Japan, and similarly for other world regions. And so that also helps maintain that sort of data privacy and locality that I was talking about earlier.
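As a purely illustrative sketch of the routing idea, assuming a hypothetical table of inference regions and a rough client location, here is how picking the nearest available region might look. In reality, smart routing of this kind operates at the network layer rather than in application code like this.

```python
import math

# Hypothetical inference regions: name -> (latitude, longitude, has_capacity)
REGIONS = {
    "tokyo":     (35.68, 139.69, True),
    "sao-paulo": (-23.55, -46.63, True),
    "frankfurt": (50.11, 8.68, False),  # e.g. temporarily out of capacity
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on Earth, in kilometres.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_available_region(client_lat, client_lon):
    # Pick the closest region that currently has free capacity.
    return min(
        (haversine_km(client_lat, client_lon, lat, lon), name)
        for name, (lat, lon, free) in REGIONS.items() if free
    )[1]

print(nearest_available_region(35.01, 135.77))   # user near Kyoto -> tokyo
print(nearest_available_region(-22.91, -43.17))  # user in Rio -> sao-paulo
```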
The final piece of the puzzle that I want to talk about here is that in each of these regions, we've also developed autoscaling functionality. What that means is that the amount of compute scales up and down with demand. And that's really important, because it means you can scale up the compute when you need it, when there's lots of demand. But equally importantly, you can scale back down again when there isn't, which also means you're not paying for compute when you don't use it.
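And as a toy example of the kind of decision an autoscaler makes, here is a small sketch that sizes the number of replicas from a hypothetical queue-depth metric. The real autoscaling in an inference region is of course handled by the container platform, not by hand-rolled logic like this.

```python
import math

# Hypothetical policy: each replica should handle at most 8 queued requests,
# and we always keep between 1 and 16 replicas running.
MIN_REPLICAS, MAX_REPLICAS = 1, 16
TARGET_QUEUE_PER_REPLICA = 8

def desired_replicas(queued_requests: int) -> int:
    needed = math.ceil(queued_requests / TARGET_QUEUE_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, needed))

print(desired_replicas(50))  # heavy demand -> scale up to 7 replicas
print(desired_replicas(3))   # quiet period -> scale back down to 1 replica
```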
So, tying all these things together, I'll just summarize by putting together our full end-to-end architecture. You'll see that we've combined high-performance AI training infrastructure, which could be in a public cloud setting, though for some customers where privacy is really important, we can also offer it in a private cloud setting. And then, if you combine that with this low-latency network and the inference-at-the-edge technology we're talking about today, you have a comprehensive global AI architecture that we believe is fit for today's most demanding and scalable AI applications. So with that,
that's probably enough of me talking. What I'll show here are a few demos that showcase the technology in action. So the first of those is an example of a real-time face avatar. What you'll see is that at the top here, I'll be entering some prompts, and what you should look out for is the way that the avatar of the person changes in real time as I type in those prompts.
So, for example, here I'm going to type in portrait of Elon Musk, and you see that the face has changed to Elon Musk almost immediately. And now, as I type Bill Gates, again that change was done in real time. Now moving on to Madonna, and a few more examples here. But again, what's really impressive is the way that the change happens in real time as the prompt is typed. There's Jeff Bezos when he had hair. So, another example I
can show you here is of real-time translation from English to Luxembourgish. Now, like most people in the world, I don't speak Luxembourgish, which is the local language of Luxembourg. But if I was in our head office and I needed to quickly translate something, then no problem, I could use a typical translation tool. Hello there. Allow me to introduce myself. I am an advanced machine learning model specifically designed for voice-to-text translation from English to Luxembourgish. I have been created and powered by the cutting-edge technology of the Gcore AI cloud. Great. I'll press send, and you'll see that in less than a second, less than half a second I'd say, the translation happened and I've received the Luxembourgish translation of those words. Again, it really shows that from the minute I pressed send, the processing was almost in real time.
So the final example I'll show you today is some work we've done with our colleagues at LetzAI. LetzAI is a creative platform that allows you to generate images of anything simply by typing it as a text prompt. We've worked with LetzAI over the last few months, and they've been utilizing our H100 AI infrastructure and inference services to both train their generative models and also serve them to their customers worldwide.
So what you see here is an example of a single image that was generated by the platform. But I think what's really impressive is that you can actually take a series of images generated by the platform and then combine them together to form a video. So this is an example of a video generated with LetzAI, which I think really shows the power of generative AI powered by GPU infrastructure.
The rookie sensation Ricky Malone surprises with an unbelievable pole position at Silverstone. It looks like Ricky Malone is joining forces with the legendary F1 team for the 1986 season.
Ricky, after our crash, watching you lose yourself
was hard. Not just because of us, but seeing you
give into darkness. It wasn't only about you.
I was hurt, too, trapped by resentment,
encouraging you to race again, to face that trauma.
It's for both of us. I need to see you conquer this.
Not to go back to what we had, but to find closure
for both of us to move on. This is about letting go,
forgiving each other, healing. By helping you return
to the track, I'm also finding my way back. It's about finding
peace, Ricky, for you and for me.
So there you go. A completely generated video that I
certainly found very impressive. So we're almost at the end of the talk
here, and I just want to let you know that, if you're interested in learning more about the technology, our inference-at-the-edge product is actually being launched in beta this week. This lets you deploy your model globally with a single endpoint, as I explained, or you can also choose an open-source model out of the box from a model catalog, which includes popular models such as Mistral and Stable Diffusion. So this gives you serverless AI compute with very low latency around the world, on a pay-as-you-go model, with DDoS endpoint protection. But also, as this is our
beta service, it's also free. So if you'd like to try
it, I'd love to hear from you, find out about your machine learning
use case, and see whether edge computing could
be the right solution for you. And I'll just end by saying that, of course, our work doesn't end here. The team is striving to push the limits of our network and bring the latency down to just a few tens of milliseconds, and we believe this will further benefit really mission-critical use cases of AI at the edge.
So with that, I hope you've enjoyed the talk. Thank you for
joining me today. And again, please do reach out if you have any
questions or would like to discuss any of the technologies that I've spoken
about today, but for now, have a great conference.