Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and a warm welcome to PROMPT Engineering Conference 2024.
I am Bethany Jebchumba and I'll be taking you through inside multimodal models.
A little bit about myself before we get started.
I am a cloud advocate at Microsoft and in case you need to reach out
to me after the session you can find me on the internet at Bethany Jeb.
In this session, we'll be covering a lot around multimodal models: understanding
the inner workings of these models, understanding what modality is, and
following up with a couple of demos to showcase how you can apply multimodal
models in your day-to-day life.
We are currently living in the era of AI, and over the last few years,
we have seen the speed of innovation in the AI space be truly incredible.
There are breakthroughs happening many times a week across the entire industry.
More than 50 percent of organizations have adopted AI in at least one business
area, and 70 percent of employees are ready to test out AI and see what it is
about. The idea behind generative AI has been that language models can generate
new content, and in the majority of cases, that content is of one modality.
For example, content in natural language when interacting with chat, or
models that work with the modality of vision or speech only.
These models are amazing, but in reality, we as humans interact with the real
world using multiple senses. And since computers have been unable to interact
across multiple modalities, we human beings have had to learn how to interact
with computers, rather than the other way around.
But now everything is changing.
The question around multimodality is this: we as humans can grasp all these
different modalities. We can see, read different languages, and speak out loud,
almost at the same time. So why can't large language models do the same?
The modalities currently available include language, where models such as the
GPTs understand and generate natural language; speech, where models such as
Whisper understand and transcribe spoken language; and vision, where models
such as DALL-E generate new images from textual prompts.
But now, when large multimodal models come into play, models can process
multiple data types, for example text, images, audio, and video,
and reason across various scenarios.
Before we go further into the inner workings of multimodal models, the first
thing we'll do is define the technology and the architecture behind them.
The first one is the attention mechanism, whereby models are able to weigh the
importance of different parts of the input data and provide better contextual
understanding. The attention mechanism directs focus to what is relevant at
hand. It can build relationships within a single modality or across modalities,
giving contextual understanding and providing insight into which text is more
valuable than the other based on the question asked.
And then there's the transformer architecture, which relies on these attention
mechanisms to transform data and ensures you can handle multiple modalities.
For example, you can define modality-specific encoders that feed into a shared
transformer layer, and once their outputs are in that shared layer, the model
can perform multimodal transformation, as in the sketch below.
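To make the idea concrete, here is a minimal PyTorch sketch of that pattern: each modality gets its own encoder projecting into a shared embedding space, and a shared transformer with self-attention fuses the two token sequences. All module names, sizes, and shapes here are illustrative assumptions, not the architecture of any particular production model.

```python
# A tiny sketch: modality-specific encoders feeding a shared transformer layer.
import torch
import torch.nn as nn

class TinyMultimodalFusion(nn.Module):
    def __init__(self, d_model=256, text_vocab=10_000, image_feat_dim=512):
        super().__init__()
        # Modality-specific encoders: each maps raw inputs into d_model vectors.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        # Shared transformer layers that attend across both modalities at once.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_patch_feats):
        # text_ids: (batch, text_len) token ids
        # image_patch_feats: (batch, num_patches, image_feat_dim) patch features
        text_tokens = self.text_embed(text_ids)
        image_tokens = self.image_proj(image_patch_feats)
        # Concatenate both sequences so attention can relate text to image parts.
        fused = torch.cat([text_tokens, image_tokens], dim=1)
        return self.shared_transformer(fused)

# Example: a batch of 2 "sentences" of 8 tokens plus 16 image patches each.
model = TinyMultimodalFusion()
out = model(torch.randint(0, 10_000, (2, 8)), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 24, 256])
```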
Then on to example multimodal models. I'll talk about two today: CLIP, which is
Contrastive Language-Image Pre-training, and DALL-E.
How CLIP works is you take a lot of image and caption pairs, encode each with a
different encoder, and place the embeddings in the same vector space, where you
can now match the pairs that are similar. The similarity of pairs that do not
match is minimized, ensuring you get a better response based on whatever your
input is. For example, if you want to classify images or generate a new image,
it becomes simpler to do so because you can take the caption and have an image
encoded or decoded for you.
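As a hedged illustration of that matching step, here is a short sketch using the open-source openai/clip-vit-base-patch32 checkpoint via Hugging Face Transformers. The checkpoint, library, image file, and candidate captions are my assumptions; the talk does not prescribe a specific implementation.

```python
# Zero-shot matching of one image against several captions with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cookies.jpg")  # hypothetical local image
captions = ["a tray of homemade cookies", "a slice of chocolate cake", "a dog"]

# Encode image and captions separately; CLIP places both in one vector space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-to-text similarity; softmax gives a probability
# over the candidate captions, i.e. zero-shot image classification.
probs = outputs.logits_per_image.softmax(dim=-1)
print({caption: float(p) for caption, p in zip(captions, probs[0])})
```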
And then the next one is DALL-E. How DALL-E works is you give a prompt and get
an image back based on the prompt you've given. The mechanism behind this is
the training data: the model has billions of parameters and is trained on vast
numbers of image and text pairs, which give it the association between textual
and visual information, therefore allowing you to be as creative as possible.
And the more the data, the better the performance of the model, up to, of
course, a certain level. At the base of DALL-E is the transformer architecture,
which ensures you have the attention mechanism to decode text prompts and
transform them into images.
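For reference, a minimal sketch of calling a DALL-E deployment on Azure OpenAI with the openai Python SDK. The endpoint, API version, deployment name, and prompt below are placeholder assumptions you would swap for your own resource's values.

```python
# Generate an image from a text prompt via an Azure OpenAI DALL-E deployment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption: use a version your resource supports
)

result = client.images.generate(
    model="dall-e-3",  # assumption: the name of your DALL-E deployment
    prompt="a beautifully decorated cake on a wooden table",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # URL of the generated image
```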
We've talked a bit about the theory.
So let's see multimodality in action.
Before we go into it, the first thing I'll mention is that you can access all
these models on Azure OpenAI. Some of the models you can access on Azure OpenAI
for specific modalities include GPT-4 or GPT-3.5 Turbo for your text
generation, Whisper for your audio transcription, and DALL-E for image
generation. But today we'll be exploring GPT-4o for our multimodal examples.
So on to the demo. One way you can apply GPT-4o, which is a multimodal model,
is, for example, getting detailed descriptions of an image. So I will upload an
image of my cookies that I recently baked and ask what is in the image. The
response should be a description of what exactly the image is about, and it
shows a baking tray with several cookies on it. The cookies appear to be
homemade, with a slightly golden-brown color. They're uneven, which is very
characteristic of many homemade cookies.
So you can ask, okay, how can I improve my baking based on the feedback I got
from the model, and it will give me ideas of what I should probably do.
Let's give it a second as it loads. It's still generating the response, so
let's give it a moment and hear the response.
So first, it says to ensure you read the recipe, use quality ingredients, be
accurate in your measurements, and so much more, which ensures your baking is
way better.
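A sketch of how that flow could look in code (not the exact demo code): the local photo is base64-encoded and sent to a GPT-4o deployment on Azure OpenAI as an image_url part next to the text question. The deployment name, API version, and file name are assumptions; the later sketches reuse this same helper with different prompts.

```python
# Ask GPT-4o on Azure OpenAI a question about a local image.
import base64
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption
)

def ask_about_image(image_path: str, question: str) -> str:
    """Send one image plus one question to GPT-4o and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the name of your GPT-4o deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The cookie demo: describe the photo, then ask for baking feedback.
print(ask_about_image("cookies.jpg", "What is in this image?"))
print(ask_about_image("cookies.jpg", "How can I improve my baking based on this photo?"))
```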
The next way is you can use the model to detect if there's something wrong with
a particular image. I can upload any of these images; let's upload this cake
that was generated by DALL-E. And I can ask: is there an anomaly in this image?
A good use case for this, especially in areas that experience things like
droughts or floods, is that you can upload different images and understand what
is happening there.
Also, this cake does not have any anomalies.
It's aesthetically pleasing and everything is well done.
So yay, no anomalies in the cake.
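The anomaly check is the same pattern with a different question. A brief sketch, reusing the hypothetical ask_about_image helper from the earlier GPT-4o sketch (the file name is an assumption):

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image("dalle_cake.png", "Is there an anomaly in this image?"))
```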
And then the other thing it can do is help you understand and generate graphs.
So I have a Power BI dashboard here, and I can ask the model to explain the
dashboard and what else I can add to it. Once that's done, I can upload it, and
it will give me a description of this dashboard, which shows a retail analysis
of sales: the sales based on the number of stores, the sales that happened that
year, where the most sales came from, and more details around the sales in the
store.
So let's give our model a few seconds to finish replying, and then we can check
the answer.
Okay, seems like there's a problem with my... yep, here comes the reply.
So we can see these are the sections and what each section indicates: the top
section, the middle, and then the bottom section. And it's also telling us what
additional metrics we can create.
So I can probably ask, how can I incorporate the sales trend over multiple
years, and it gives me step-by-step instructions to create a multi-year trend
analysis to see the long-term growth and decline of sales over the period of
time.
Then, once you're done creating and generating your different graphs, you can
also convert them into code. For example, if you have one of the graphs that
you really like and you want to convert it to code, that's something you can
do, as in the sketch below.
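One way you might phrase that, again reusing the hypothetical ask_about_image helper from the earlier sketch; the file name and prompt wording are assumptions, not the exact demo prompts:

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image(
    "retail_dashboard.png",
    "Explain this dashboard, suggest additional metrics such as a multi-year "
    "sales trend, and write Python (matplotlib) code that recreates the top chart.",
))
```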
Another way you can use multimodality is translating between different
languages. Sometimes you find a sticker in a very different language, maybe on
your clothes, or maybe a poster on the wall, or you're visiting a new country
and you're seeing all these posters and you want to get a translation of them.
You can just take a picture, upload it to GPT-4o, and get a response based on
that.
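A quick sketch of the translation idea, once more reusing the hypothetical ask_about_image helper (the file name is an assumption):

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image("poster.jpg", "Translate the text in this image into English."))
```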
And then the last bit of how I use this on a personal level is for my designs
and my drawings, to get feedback: okay, this is a drawing I made, can you
convert this for me? Or using it on my handwriting: are you able to give me
feedback based on this, if there's anything I need to change about my
handwriting?
So those are the different use cases for GPT-4o. As we wrap up, we've covered
what exactly GPT-4o is about, what you can do with the different multimodal
models, and how you can implement them.
So my final parting thoughts: you can join the Azure AI community on Discord to
learn more about these multimodal models and how you can build them, and also
get started with the Generative AI for Beginners course, which gets you to
learn the basics, so you can continue forward and create more robust
applications.
Thank you so much, and have a great day.