Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and a warm welcome to PROMPT Engineering Conference 2024.
I am Bethany Jebchumba and I'll be taking you through inside multimodal models.
A little bit about myself before we get started.
I am a cloud advocate at Microsoft and in case you need to reach out
to me after the session you can find me on the internet at Bethany Jeb.
In this session, we'll be covering a lot around multimodal models: understanding
the inner workings of these models, understanding what modality is, and
following up with a couple of demos to showcase how you can apply multimodal
models in your day-to-day life.
We are currently living in the era of AI, and over the last few years,
we have seen the speed of innovation in the AI space be truly incredible.
There are breakthroughs happening many times a week across the entire industry.
More than 50 percent of organizations have adopted AI in at least one business
area, and 70 percent of employees are ready to test out AI and see what it is
about. The idea behind generative AI has been that language models can generate
new content, and in the majority of cases, that content is of one modality.
For example, content in natural language when interacting with chat, or
models that work with the modality of vision or speech only.
These models are amazing, but in reality, we as humans interact with the real
world using multiple senses. And since computers have been unable to interact
across multiple modalities, we human beings have had to learn how to interact
with computers, rather than the other way around.
But now everything is changing.
The question around multimodality is this: we as humans can grasp all these
different modalities. We can see, read different languages, and speak out loud,
almost at the same time. So why can't large language models do the same?
The modalities currently available include language, where models such as the
GPTs understand and generate natural language; speech, where models such as
Whisper understand and transcribe spoken language; and vision, where models
such as DALL-E generate new images from textual prompts.
But now, when large multimodal models come into play, models can process
multiple data types, for example text, images, audio, and video,
and reason across various scenarios.
Before we go further into the inner workings of multimodal models, the first
thing we'll do is define the technology and the architecture behind them.
The first one is the attention mechanism, whereby models are able to weigh the
importance of different parts of the input data and provide better contextual
understanding. The attention mechanism directs focus to what is relevant at
hand. It can build relationships within a single modality or across modalities,
giving contextual understanding and providing insight into which text is more
valuable than the other based on the question asked.
And then there's the transformer architecture, which relies on these attention
mechanisms to transform data and ensures you can handle multiple modalities.
For example, you can define modality-specific encoders that feed into a shared
transformer layer, and once their outputs are in that shared layer, the model
can perform multimodal transformation, as in the sketch below.
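To make the idea concrete, here is a minimal PyTorch sketch of that pattern: each modality gets its own encoder projecting into a shared embedding space, and a shared transformer with self-attention fuses the two token sequences. All module names, sizes, and shapes here are illustrative assumptions, not the architecture of any particular production model.

```python
# A tiny sketch: modality-specific encoders feeding a shared transformer layer.
import torch
import torch.nn as nn

class TinyMultimodalFusion(nn.Module):
    def __init__(self, d_model=256, text_vocab=10_000, image_feat_dim=512):
        super().__init__()
        # Modality-specific encoders: each maps raw inputs into d_model vectors.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        # Shared transformer layers that attend across both modalities at once.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_patch_feats):
        # text_ids: (batch, text_len) token ids
        # image_patch_feats: (batch, num_patches, image_feat_dim) patch features
        text_tokens = self.text_embed(text_ids)
        image_tokens = self.image_proj(image_patch_feats)
        # Concatenate both sequences so attention can relate text to image parts.
        fused = torch.cat([text_tokens, image_tokens], dim=1)
        return self.shared_transformer(fused)

# Example: a batch of 2 "sentences" of 8 tokens plus 16 image patches each.
model = TinyMultimodalFusion()
out = model(torch.randint(0, 10_000, (2, 8)), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 24, 256])
```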
Then on to example multimodal models. I'll talk about two today: CLIP, which is
Contrastive Language-Image Pre-training, and DALL-E.
How CLIP works is you take a lot of image and caption pairs, encode each with a
different encoder, and place the embeddings in the same vector space, where you
can now match the pairs that are similar. The similarity of pairs that do not
match is minimized, ensuring you get a better response based on whatever your
input is. For example, if you want to classify images or generate a new image,
it becomes simpler to do so because you can take the caption and have an image
encoded or decoded for you.
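As a hedged illustration of that matching step, here is a short sketch using the open-source openai/clip-vit-base-patch32 checkpoint via Hugging Face Transformers. The checkpoint, library, image file, and candidate captions are my assumptions; the talk does not prescribe a specific implementation.

```python
# Zero-shot matching of one image against several captions with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cookies.jpg")  # hypothetical local image
captions = ["a tray of homemade cookies", "a slice of chocolate cake", "a dog"]

# Encode image and captions separately; CLIP places both in one vector space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-to-text similarity; softmax gives a probability
# over the candidate captions, i.e. zero-shot image classification.
probs = outputs.logits_per_image.softmax(dim=-1)
print({caption: float(p) for caption, p in zip(captions, probs[0])})
```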
And then the next one is DALL-E. How DALL-E works is you give a prompt and get
an image back based on the prompt you've given. The mechanism behind this is
the training data: the model has billions of parameters and is trained on vast
numbers of image and text pairs, which give it the association between textual
and visual information, therefore allowing you to be as creative as possible.
And the more the data, the better the performance of the model, up to, of
course, a certain level. At the base of DALL-E is the transformer architecture,
which ensures you have the attention mechanism to decode text prompts and
transform them into images.
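For reference, a minimal sketch of calling a DALL-E deployment on Azure OpenAI with the openai Python SDK. The endpoint, API version, deployment name, and prompt below are placeholder assumptions you would swap for your own resource's values.

```python
# Generate an image from a text prompt via an Azure OpenAI DALL-E deployment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption: use a version your resource supports
)

result = client.images.generate(
    model="dall-e-3",  # assumption: the name of your DALL-E deployment
    prompt="a beautifully decorated cake on a wooden table",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # URL of the generated image
```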
We've talked a bit about the theory.
So let's see multimodality in action.
Before we go into it, the first thing I'll mention is that you can access all
these models on Azure OpenAI. Some of the models you can access on Azure OpenAI
for specific modalities include GPT-4 or GPT-3.5 Turbo for your text
generation, Whisper for your audio transcription, and DALL-E for image
generation. But today we'll be exploring GPT-4o for our multimodal examples.
So on to the demo. One way you can apply GPT-4o, which is a multimodal model,
is, for example, getting detailed descriptions of an image. So I will upload an
image of my cookies that I recently baked and ask what is in the image. The
response should be a description of what exactly the image is about, and it
shows a baking tray with several cookies on it. The cookies appear to be
homemade, with a slightly golden-brown color. They're uneven, which is very
characteristic of many homemade cookies.
So you can ask, okay, how can I improve my baking based on the feedback I got
from the model, and it will give me ideas of what I should probably do.
Let's give it a second as it loads. It's still generating the response, so
let's give it a moment and hear the response.
So first, it says to ensure you read the recipe, use quality ingredients, be
accurate in your measurements, and so much more, which ensures your baking is
way better.
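A sketch of how that flow could look in code (not the exact demo code): the local photo is base64-encoded and sent to a GPT-4o deployment on Azure OpenAI as an image_url part next to the text question. The deployment name, API version, and file name are assumptions; the later sketches reuse this same helper with different prompts.

```python
# Ask GPT-4o on Azure OpenAI a question about a local image.
import base64
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption
)

def ask_about_image(image_path: str, question: str) -> str:
    """Send one image plus one question to GPT-4o and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the name of your GPT-4o deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The cookie demo: describe the photo, then ask for baking feedback.
print(ask_about_image("cookies.jpg", "What is in this image?"))
print(ask_about_image("cookies.jpg", "How can I improve my baking based on this photo?"))
```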
The next way is you can use the model to detect if there's something wrong with
a particular image. I can upload any of these images; let's upload this cake
that was generated by DALL-E. And I can ask: is there an anomaly in this image?
A good use case for this, especially in areas that experience things like
droughts or floods, is that you can upload different images and understand what
is happening there.
Also, this cake does not have any anomalies.
It's aesthetically pleasing and everything is well done.
So yay, no anomalies in the cake.
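The anomaly check is the same pattern with a different question. A brief sketch, reusing the hypothetical ask_about_image helper from the earlier GPT-4o sketch (the file name is an assumption):

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image("dalle_cake.png", "Is there an anomaly in this image?"))
```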
And then the other thing it can do is help you understand and generate graphs.
So I have a Power BI dashboard here, and I can ask the model to explain the
dashboard and what else I can add to it. Once that's done, I can upload it, and
it will give me a description of this dashboard, which shows a retail analysis
of sales: the sales based on the number of stores, the sales that happened that
year, where the most sales came from, and more details around the sales in the
store.
So let's give our model a few seconds to finish replying, and then we can check
the answer.
Okay, seems like there's a problem with my... yep, here comes the reply.
So we can see these are the sections and what each section indicates: the top
section, the middle, and then the bottom section. And it's also telling us what
additional metrics we can create.
So I can probably ask, how can I incorporate the sales trend over multiple
years, and it gives me step-by-step instructions to create a multi-year trend
analysis to see the long-term growth and decline of sales over the period of
time.
Then, once you're done creating and generating your different graphs, you can
also convert them into code. For example, if you have one of the graphs that
you really like and you want to convert it to code, that's something you can
do, as in the sketch below.
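One way you might phrase that, again reusing the hypothetical ask_about_image helper from the earlier sketch; the file name and prompt wording are assumptions, not the exact demo prompts:

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image(
    "retail_dashboard.png",
    "Explain this dashboard, suggest additional metrics such as a multi-year "
    "sales trend, and write Python (matplotlib) code that recreates the top chart.",
))
```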
Another way you can use multimodality is translating between different
languages. Sometimes you find a sticker in a very different language, maybe on
your clothes, or maybe a poster on the wall, or you're visiting a new country
and you're seeing all these posters and you want to get a translation of them.
You can just take a picture, upload it to GPT-4o, and get a response based on
that.
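A quick sketch of the translation idea, once more reusing the hypothetical ask_about_image helper (the file name is an assumption):

```python
# Reuses the hypothetical ask_about_image helper sketched earlier.
print(ask_about_image("poster.jpg", "Translate the text in this image into English."))
```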
And then the last bit of how I use this on a personal level is for my designs
and my drawings, to get feedback: okay, this is a drawing I made, can you
convert this for me? Or using it on my handwriting: are you able to give me
feedback based on this, if there's anything I need to change about my
handwriting?
So those are the different use cases for GPT-4o. As we wrap up, we've covered
what exactly GPT-4o is about, what you can do with the different multimodal
models, and how you can implement them.
So my final parting thoughts: you can join the Azure AI community on Discord to
learn more about these multimodal models and how you can build them, and also
get started with the Generative AI for Beginners course, which gets you to
learn the basics, so you can continue forward and create more robust
applications.
Thank you so much, and have a great day.