Conf42 Python 2024 - Online

Decoding Algorithmic Artistry: Navigating the Technical Landscape of Visual Generative AI

Abstract

My research focuses on text-to-3D scene generation via voxel grid optimization. Broadly, this means creating software that generates fully interactive 3D scenes from text prompts. I’d like to talk about generative AI, my research, and the potential of AI-powered 3D workflows at a technical level.

Summary

  • Jayram Palamadai is an AI researcher based in Chicago. He will discuss the capabilities of visual generative AI. He'll also discuss its future implications, its current limitations, and several ways researchers are preparing for the rise of photorealistic AI.
  • Diffusion models are a powerful class of generative models that are capable of producing high quality images. The most popular consumer diffusion models on the market currently are DALL-E 3, Midjourney, and Stable Diffusion. These implementations have broad applications in art, design and related fields.
  • AI could very well follow the paradigm established by smartphones of the last two or so decades. What does the future look like with such powerful AI now becoming so readily available? And how does this affect the various aspects of the media we consume?
  • DALL-E can't generate an image of ramen without chopsticks. The DALL-E model is highly reliant on training data. Can AI truly be creative? Or does it just rely on the creativity of its training data?
  • In this last scenario, I gave DALL-E the task of illustrating Winograd schemas. These are basically phrases that test DALL-E's ability to interpret and use context clues. Along with the rise of visual generative AI, we also see a natural increase in the spread of misinformation and fraudulent activity.
  • C2PA is a proposed media specification that would be able to verify the origin and edit history of any media. This would ensure that every pixel edited would leave a footprint. OpenAI has pledged to adhere to the C2PA standard once Sora is formally released to the public.
  • TikTok is one of the first platforms that requires users to tag AI-generated content. Quick summary: diffusion models are a type of generative AI that are highly effective in creating realistic media. Within the next decade, this will likely have far-reaching implications in nearly every industry that relies on digital media.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Jayram Palamadai, and I'm an AI researcher based in Chicago. I'll begin by introducing visual generative AI at a technical level, then talk about several interesting papers that have been published in the last twelve months that really demonstrate the capabilities of visual generative AI. Finally, I'll discuss its future implications, its current limitations, and several ways researchers are preparing for the rise of photorealistic generative AI. Let's begin.

Let me give you a brief introduction. Diffusion models are a powerful class of generative models that are capable of producing high quality images, and these implementations have broad applications in art, design and related fields. The most popular consumer diffusion models on the market currently are DALL-E 3, Midjourney and Stable Diffusion. In seconds, these powerful models can take in text and output high quality images. And most recently, OpenAI has released Sora, which is a text-to-video platform that really blows away any previous models. This was actually so recent, in fact, that I had to modify a decent portion of my slideshow and the talk.

So what are diffusion models and how do they work? Diffusion essentially boils down to two steps: a forward process that gradually adds noise to an input image, and a reverse process that attempts to remove noise from an input image. We begin with an input image that we'll call x0. This is the original image, or the image at step zero. We then add noise to our original image to form x1, the image at step one. We can do this again and again, in this case forming x2, until we are left with pure random noise. A value known as the schedule controls the rate at which noise is added. Multiple approaches have been used, but the most straightforward is a linear schedule, which adds the same amount of noise in each step. There are also nonlinear schedules that change the rate at which noise is added depending on where you are in the forward process. Most notably, researchers at OpenAI used a nonlinear schedule that adds noise more gradually at the beginning of the forward process, and this really improved the effectiveness of the diffusion model by preserving image quality for longer. This process forms a Markov chain, wherein the image at any given time step depends only on the previous image.

In the reverse process, we start with an image from some time step, let's say x2, and feed it into the diffusion model, which essentially predicts the noise that separates this image from the original image. We can then subtract this noise to get some approximation of the original image. To get a better approximation, we counterintuitively add some noise back to land at a step between the original image and our input, let's say x1, and then once again predict the noise that separates x1 from x0, the original image. Once we subtract this value again, we're left with an even better approximation of our input image. Through this process, diffusion models enable nearly every modern visual generative AI on the market. Used in combination with other neural components like text encoders, this forms the basis of DALL-E 3, Stable Diffusion and Imagen, which are some of the most powerful visual generative AI platforms currently available.
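To make the forward and reverse processes above concrete, here is a minimal NumPy sketch of a DDPM-style diffusion loop. It assumes a standard linear beta schedule and uses a stand-in for the noise-prediction network (the real component would be a trained U-Net), so treat it as an illustration of the math rather than a working generator.

import numpy as np

# Toy setup: T diffusion steps over a small grayscale "image".
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule: noise added per step
# (a nonlinear, cosine-style schedule would replace this linspace)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal retention

def forward_diffuse(x0, t, rng=np.random.default_rng()):
    """Forward process: sample x_t directly from x_0 (the Markov chain has a closed form)."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def reverse_step(model, xt, t, rng=np.random.default_rng()):
    """One reverse step: predict the noise in x_t, subtract it,
    then add a small amount of noise back unless we are at step 0."""
    eps_hat = model(xt, t)                              # predicted noise
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t])   # estimate of x_{t-1}
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

# Usage sketch: noise an all-ones "image" to step 500, then take one reverse step.
dummy_model = lambda xt, t: np.zeros_like(xt)           # stand-in for a trained network
x0 = np.ones((32, 32))
x500, _ = forward_diffuse(x0, t=500)
x499 = reverse_step(dummy_model, x500, t=500)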
Let's talk about the current research. Nvidia is popping up everywhere these days, and really they got into visual generative AI, and AI more generally, because their hardware is highly specialized for AI. GPUs are essential to training large AI models, and Nvidia is, of course, a GPU manufacturer. Nvidia Picasso is a platform that allows the creation of custom generative AI for visual content. This includes images, videos, 3D models, and 360-degree content, and it allows enterprise users to train custom visual generative AI models through the Nvidia Edify platform. So, for example, a stock photo company could train an Edify model on their current library to automatically generate AI stock content. And in fact, several such companies have entered partnerships with Nvidia: Shutterstock, Getty Images, and Adobe have been among the earliest adopters of the Nvidia Picasso platform.

I'll first give some background on neural radiance fields, or NeRFs. A NeRF is a neural network that can synthesize views of complex 3D scenes from just a partial set of views. In other words, it can predict angles of a 3D scene that it was not trained on, and this is a very powerful and versatile method for 3D scene representation. It works by taking input images and interpolating between them to create a complete scene. A NeRF is trained to map directly from a 5D input, which is three coordinates in x, y and z plus two coordinates for viewing direction, to a 4D output: three color channels plus opacity. What I'm showing you right now is a project called Live 3D Portrait, a project by Nvidia that presents a new method to render photorealistic 3D facial representations from just a single face portrait, and it does this in real time. Given a single RGB input, their image encoder directly produces a NeRF of the target face. This method is quite fast: they achieved 24 frames per second on consumer hardware, and it produces much higher quality results than previous methods. What you're seeing right now is an input image on the left, and on the right is an output model from the LP3D model. This is just one example of current research.

Moving on. This project is a system that learns physically simulated tennis skills from large-scale demonstrations of real tennis play, harvested at scale from broadcast videos. The approach is based on a hierarchical model that combines several policy networks to steer the character in a motion embedding learned from these broadcast videos. The system can learn complex tennis skills and realistically chain together multiple shots into extended rallies. Importantly, this model only uses simple rewards and doesn't require explicit annotations on stroke types and movements; that's what makes it so powerful, since it reduces dependency on human annotators. The model is applicable to other motion capture applications as well, and it works despite the relatively low quality of motion extracted from the broadcast videos. The end result is that the system can synthesize two completely simulated characters playing extended tennis rallies with realistic physics. Diffusion models really struggle with realistic physics, and more on that shortly.
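Going back to the NeRF input/output mapping described earlier (a 5D input of position plus viewing direction, a 4D output of color plus opacity), here is a minimal PyTorch sketch. The layer sizes are arbitrary, and it omits positional encoding and volume rendering, so it is a simplification of the idea rather than the architecture used in the Nvidia work.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: (x, y, z, theta, phi) -> (r, g, b, opacity).
    Real NeRFs add positional encoding, a separate density branch, and
    volume rendering along camera rays; this keeps only the core mapping."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, coords_and_dirs):
        out = self.net(coords_and_dirs)
        rgb = torch.sigmoid(out[..., :3])   # color channels in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative opacity/density
        return rgb, sigma

# Usage sketch: query 1024 sample points along camera rays.
model = TinyNeRF()
samples = torch.rand(1024, 5)               # (x, y, z) plus two viewing angles
rgb, sigma = model(samples)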
Aside from image and video generation, diffusion models have also been used in a variety of applications outside of this domain. Very recently, researchers at Tsinghua University have used diffusion models to generate 3D models. In other words, each of the models pictured here can be moved around and manipulated in six degrees of freedom.

So what are the future implications? For the sake of demonstration, I'm going to use videos generated by Sora, because they are by far the most advanced videos currently available. What does the future look like with such powerful AI now becoming so readily available? And how does this affect the various aspects of the media we consume? Maybe this marks the end of an entire tier of digital creators, or maybe it simply augments the capabilities of skilled videographers. We've seen the rise of several innovations that really shifted the landscape of entire industries, the most impactful of which is likely sitting in your hand or pocket right now. And visual generative AI could very well follow the paradigm established by smartphones over the last two or so decades. By placing a camera in the hands of pretty much every single person on the planet, the smartphone significantly reduced the barrier of entry for photography. And similarly, it could very well be the case that videography and even 3D modeling continue along this trajectory without necessarily displacing a significant portion of the industry.

But what about licensing drone stock footage? We've seen that Sora is capable of generating highly realistic footage that mimics the sort of footage you would get from a drone that costs several hundred dollars. Here's another parallel, with Photoshop. Before Photoshop, it was much easier to accept the credibility of images, to take them at face value. And at least for now, it's possible to find weaknesses in generated video, especially scenes that involve object interactions and realistic physics. This AI content isn't really fooling anyone; no one is going to see this particular video and think it's real content. We can see the chair morphing and floating around. But within the next couple of years, it will be very hard, if not impossible, to discern these sorts of images with the naked eye, so it will prompt the creation of several important forensics tools, and more on that shortly. We can also see a few strange artifacts here: for example, a flame that doesn't move with the wind, wax that morphs downward as opposed to simply melting, and a character that doesn't really react to touching a flame.

We could also see other cultural shifts, like being desensitized to AI content. This is predictable, because it happens with everything. And actually, I lied about this video here. This is a drone video, collected by a drone over the city of Chicago. And the odd thing is, it doesn't really seem out of place next to the quality brought by advanced generative AI. So what does that mean for the industry?

Let's go over some of the current limitations of visual generative AI, and I'd encourage you to keep these in mind but stay updated, because I'd expect that many of these issues will have been vastly improved even within the next year or two. So why are we staring at several images of ramen? I asked DALL-E several times to generate an image of ramen without chopsticks, and it's completely unable to, and I'd encourage you to try this yourself. That seems pretty odd at first glance, but it really makes sense when you consider the underlying diffusion architecture.
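If you'd like to try this experiment yourself, here is a small sketch using the OpenAI Python SDK (v1-style client); the prompt wording is just what I'd assume for a quick probe, and you'll need your own API key set in the environment.

# A quick prompt-following probe, assuming the OpenAI Python SDK (v1.x)
# and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

probe_prompts = [
    "A bowl of ramen with no chopsticks anywhere in the image.",
    "A photorealistic bowl of ramen on an empty table, absolutely no utensils.",
]

for prompt in probe_prompts:
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        n=1,
        size="1024x1024",
    )
    # Open each URL and check whether chopsticks still sneak into the image.
    print(prompt, "->", result.data[0].url)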
The DALL-E model is highly reliant on training data, and in this case, it appears that the training data has chopsticks in every image of ramen. So DALL-E is actually unable to create an image of ramen without chopsticks somehow making their way in. And this begs the question: can AI truly be creative? Or does it just rely on the creativity of its training data? This is all the more relevant when considering the ongoing battle between large AI companies and individual artists. Artists are unhappy about the usage of their artwork, and as far as I know, human artists don't mindlessly synthesize images they've seen before. I don't have any radically different ideas on this particular debate, but it's definitely worth taking into account the underlying mechanics of diffusion models.

In this last scenario, I gave DALL-E the task of illustrating Winograd schemas, which are basically phrases that test the model's ability to interpret and use context clues. DALL-E was actually able to get several of these, which surprised me. I basically told DALL-E to depict the following scenarios. In the top left, the scenario I used was: I dropped a bowling ball on a table, and it broke. This requires understanding of bowling balls, tables, and what it means for them to interact. Obviously, if I asked a human to explain what this means, they'd probably tell me that bowling balls are heavy, tables can be fragile, and especially when you drop a bowling ball from a height, it's very possible that the table breaks. But that requires understanding of these objects and the interactions at play, and this is actually very impressive for a visual generative AI model, because DALL-E 1 and 2 especially struggled with this. In the top right, I used the phrase: my trophy couldn't fit into the brown suitcase because it was too large. Once again, this requires understanding of suitcases and objects. Really, I think ChatGPT is serving as an intermediary here, sort of guiding DALL-E, so it's perhaps reducing the computational load on DALL-E specifically, but this is still pretty impressive. In the bottom left, I used the phrase: the large ball crashed right through the table because it was made of styrofoam. And this is really the first one that I saw DALL-E struggle with. It appears that everything in this image is made of styrofoam, the table and the ball, so that doesn't really make sense. This is similar to the first prompt, but it incorporated different materials, and it looks like DALL-E is still struggling with that a little bit. But the fourth one surprised me by far the most. The phrase I used was: the painting in Mark's living room shows an oak tree; it is to the right of a house. This is a very complex sentence, probably a little bit confusing even for humans, because it refers to several objects and their relative positions. And it looks like DALL-E nailed this. I was very surprised to see that. So DALL-E 4 may in fact mark the end of gullible visual generative AI, but I suppose we'll have to wait and see. The rate at which this field is advancing is remarkable.

Now let's see how researchers are planning to navigate the growth of AI. Along with the rise of visual generative AI, we will also see a natural increase in the spread of misinformation and fraudulent activity, for example, security footage and evidence tampering.
This could require additional forensic tools specifically designed to mitigate generative AI. If I asked Sora to create security footage of a bank robbery that never happened, that could be very dangerous. I'm sure OpenAI will impose restrictions and use ChatGPT as a bouncer, sort of filtering the inputs the model is allowed to take in. But nonetheless, I'm excited to see not only the capabilities of Sora and the safety protocols OpenAI puts in place, but also the lengths to which people will go to find vulnerabilities in them. This has actually been pretty amusing with ChatGPT: for example, asking ChatGPT to give you a recipe for napalm brownies. It's generally gotten better at dealing with these prompts, but I still know for a fact, because I tried, that it's possible to get ChatGPT to do some pretty wacky stuff that OpenAI probably wouldn't endorse.

In addition to evidence tampering, what about advertisements for fake products? These could lead to widespread scams, as well as false information or propaganda designed to mimic educational videos. That's another way visual generative AI could conceivably be applied within the next couple of years. And in addition to low quality, mass-produced content churned out with the help of generative AI, the entire content creation process, from start to finish, can be streamlined: writing a script, generating stock footage, perhaps with the usage of Sora, and editing it all together seamlessly. I'd also like to briefly touch on the images that I used here. These images were generated by DALL-E with prompts about misinformation, and I think it did a pretty good job.

So does this mean that all media goes downhill from here? Not quite. I am personally quite optimistic, because there are several safeguards being discussed right now that could drastically reduce the impact of AI-generated misinformation. For example, C2PA is a proposed media specification that would be able to verify the origin and edit history of any media. The corporations shown at the bottom are pushing for legislation mandating the usage of verification standards, and the effort is supported by several of the largest media giants. This would ensure that every pixel edited would leave a footprint. Here's how it would work. Let's say we have a camera that uses the C2PA standard. We could put its footage into Photoshop, which would then create a seal. This could then be uploaded to YouTube, and you would be able to click on an icon in the top right. This would show you exactly where the content came from, and it would let you decide for yourself whether or not it's legit. You'd see whether it came from the New York Times or from someone pretending to be the New York Times.
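To illustrate the "every edit leaves a footprint" idea in code, here is a deliberately simplified sketch of a hash-chained, signed edit history. It is a conceptual stand-in only: the real C2PA specification defines its own manifest format, certificates, and asymmetric signatures, none of which are reproduced here.

# Conceptual sketch of a provenance chain: each tool appends a signed record
# binding the current content hash to the previous record. This is NOT the
# real C2PA manifest or signing scheme, just an illustration of the idea.
import hashlib, hmac, json

SIGNING_KEY = b"demo-key-held-by-the-capture-device-or-tool"  # placeholder

def add_provenance_record(chain, content_bytes, actor, action):
    record = {
        "actor": actor,                                   # e.g. camera, editor
        "action": action,                                 # e.g. "captured", "cropped"
        "content_hash": hashlib.sha256(content_bytes).hexdigest(),
        "prev_signature": chain[-1]["signature"] if chain else None,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return chain + [record]

def verify_chain(chain):
    prev_sig = None
    for record in chain:
        body = {k: v for k, v in record.items() if k != "signature"}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        if record["signature"] != expected or record["prev_signature"] != prev_sig:
            return False
        prev_sig = record["signature"]
    return True

# Usage: capture, then edit, then verify the full history.
chain = add_provenance_record([], b"raw sensor data", "camera", "captured")
chain = add_provenance_record(chain, b"edited pixels", "photo editor", "color graded")
print(verify_chain(chain))   # True unless a record was altered after signing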
And this is actually very promising. In the case of generative AI, OpenAI or any other company that uses C2PA would stamp its media with a unique signature that would ensure it's accurately classified as having come from that company. Specifically, it's worth mentioning that OpenAI has pledged to adhere to the C2PA standard once Sora is formally released to the public.

So what else can we do to prevent the rise of AI-generated misinformation? Well, like many things, the solution turns out to be more AI. TikTok is one of the first platforms that requires users to tag AI-generated content, and this is very convenient because it forms a large set of labeled content that could then be used to train a large media classifier. In fact, the licensing of such a tool could prove to be highly valuable several years down the line.
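As a rough sketch of what training such a media classifier could look like, here is a minimal PyTorch fine-tuning loop. The dataset/ folder with real/ and ai_generated/ subfolders is hypothetical, and a pretrained ResNet-18 is used purely as a convenient backbone; this is not any platform's actual detector.

# Hypothetical fine-tuning sketch for a real-vs-AI image classifier,
# assuming a local dataset/ folder with real/ and ai_generated/ subfolders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("dataset/", transform=transform)   # 2 classes
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)                     # real vs AI

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                                     # one epoch, for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()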
A quick summary: diffusion models are a type of generative AI that are highly effective at creating realistic media. This media can be used in a variety of ways, from filmmaking to education and far beyond. Within the next decade, this will likely have far-reaching implications in nearly every industry that relies on digital media. And there are several proposed countermeasures to AI-generated content, including metadata verification and AI-based classifiers. I really enjoyed preparing this talk, and I'm really grateful for the opportunity to share my thoughts. Please feel free to reach out to me with any questions. My name is Jayram Palamadai, and thank you for watching my session.

Jayram Palamadai

@ University of Chicago


