Transcript
Hi, my name is Jayram Palamadai, and I'm an AI researcher based in Chicago. I'll begin by introducing visual generative AI at a technical level, and I'll then talk
about several interesting papers that have been published in the last twelve
months that really demonstrate the capabilities of visual generative AI.
Finally, I'll discuss its future implications,
its current limitations, and several ways researchers are preparing
for the rise of photorealistic generative AI. Let's begin.
Let me give you a brief introduction.
Diffusion models are a powerful class of
generative models that are capable of producing high quality
images, and these implementations have
broad applications in art, design and
related fields. The most popular consumer diffusion models on the market currently are DALL-E 3, Midjourney, and Stable Diffusion. In seconds, these powerful models can take in text and output high-quality images.
And most recently, OpenAI has released Sora, which is a text-to-video platform that really blows away any previous models. This was so recent, in fact, that I had to modify a decent portion of my slideshow and the talk. So what are diffusion models, and how do they work?
Diffusion essentially boils down to two steps. We have
a forward process that gradually adds noise to an input image,
and then we have a reverse process that attempts to remove
noise from an input image.
So we begin with an input image that we'll call x0. This is the original image, or the image at step zero. We then add noise to our original image to form x1, the image at step one. We can do this again and again, in this case forming x2, and so on until we're left with pure random noise.
And a value known as the schedule controls the rate at which noise is added. There have been multiple approaches used, but the most straightforward is a linear schedule, which adds the same amount of noise in each step. There are also nonlinear schedules that change the rate at which noise is added depending on where you are in the forward process. Most notably, researchers at OpenAI used a nonlinear, cosine-style schedule that adds noise gently at the beginning and ramps up toward the final time step, and this really improved the effectiveness of the diffusion model by preserving image quality for longer.
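To make this concrete, here's a minimal Python sketch of the forward process and the two kinds of schedules, using NumPy. The function names and the toy 8x8 image are my own illustrations rather than code from any actual diffusion implementation; the cosine-style schedule just follows the general shape described above.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Linear schedule: the per-step noise level grows evenly from start to end.
    return np.linspace(beta_start, beta_end, T)

def cosine_alpha_bars(T, s=0.008):
    # Cosine-style schedule: the cumulative signal level alpha_bar_t decays
    # slowly at first, so image content is preserved for longer.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]

def forward_diffuse(x0, t, alpha_bars, rng):
    # Jump straight from x0 to x_t using the closed form
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * noise

# Example: noise a toy 8x8 "image" to step 500 of 1000 under each schedule.
rng = np.random.default_rng(0)
T = 1000
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
x_t_linear = forward_diffuse(x0, 500, np.cumprod(1 - linear_betas(T)), rng)
x_t_cosine = forward_diffuse(x0, 500, cosine_alpha_bars(T), rng)
```

The key point is that each noised image depends only on the previous state and the schedule, which is exactly the Markov chain property I'll mention next.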
This forward process forms a Markov chain, wherein the image at any given time step depends only on the previous image. In the reverse process, we start with an image from some time step, let's say x2, and this image is fed into the diffusion model, which essentially predicts the noise that separates this image from the original image.
So we feed x2 into the diffusion model and get a prediction of the noise. We can then subtract this noise to get some approximation of the original image. But we want a better approximation, so, counterintuitively, we add some noise back to get to a step between the original image and our original input, let's say x1. Then once again we predict the noise that separates x1 from x0, the original image, and once we subtract this value, we're left with an even better approximation of our input image.
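Here's a rough Python sketch of that reverse loop, again only for illustration: predict_noise stands in for the trained neural network and isn't a real API, and the update rule follows the standard DDPM-style formulation rather than any specific product's implementation.

```python
import numpy as np

def reverse_diffusion(x_T, predict_noise, betas, rng):
    # Minimal DDPM-style sampling loop. `predict_noise(x_t, t)` is a placeholder
    # for the trained model that predicts the noise present at step t.
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = x_T
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        # Subtract the predicted noise to move toward the clean image.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Counterintuitively, add a little fresh noise back before the next step.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

# Example with a dummy "model" that predicts zero noise everywhere.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x_T = rng.standard_normal((8, 8))
x_0_hat = reverse_diffusion(x_T, lambda x, t: np.zeros_like(x), betas, rng)
```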
Diffusion models, through this process, enable nearly every modern visual generative AI on the market. Used in combination with other neural components like text encoders, this forms the basis of DALL-E 3, Stable Diffusion, and Imagen, which are some of the most powerful visual generative AI platforms currently available.
Let's talk about the current research.
Nvidia is popping up everywhere these days,
and really they got into visual generative AI and AI
more generally because their hardware is highly specialized
for AI. GPUs are essential to training large
AI models, and Nvidia is, of course, a GPU
manufacturer. Nvidia Picasso
is a platform that allows the creation of custom
generative AI for visual content. So this
includes images, videos, 3D models, and 360-degree content. This allows enterprise users to train custom visual generative AI models through the Nvidia Edify platform. So, for example, a stock photo company could train an Edify model on their current library to automatically generate AI stock content. And in fact,
several such companies have entered partnerships
with Nvidia. So, for example, Shutterstock, Getty Images, and Adobe
have been the earliest adopters of the Nvidia Picasso platform.
So I'll first give some background on neural radiance fields, or NeRFs. A NeRF is a neural network that can synthesize views of complex 3D scenes from just a partial set of views. In other words, it can predict angles of a 3D scene that it was not trained on, and this is a very powerful and versatile method for 3D scene representation. It works by taking input images and interpolating between them to create a complete scene. A NeRF is trained to map directly from a 5D input, which is three spatial coordinates in x, y, and z plus two coordinates for viewing direction, to a 4D output: three color channels and opacity.
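As a shape-level illustration only, here's a toy Python stand-in for that 5D-to-4D mapping. A real NeRF uses positional encodings, a much deeper trained network, and volume rendering to turn these outputs into pixels; none of that is shown here, and all the names are hypothetical.

```python
import numpy as np

class TinyNeRF:
    # Toy stand-in for a NeRF MLP: maps a 5D input (position x, y, z plus two
    # viewing-direction angles) to a 4D output (R, G, B color and opacity).
    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((5, hidden)) * 0.1
        self.W2 = rng.standard_normal((hidden, 4)) * 0.1

    def __call__(self, xyz, view_dir):
        inp = np.concatenate([xyz, view_dir], axis=-1)   # shape (..., 5)
        h = np.maximum(inp @ self.W1, 0.0)               # ReLU hidden layer
        out = h @ self.W2                                # shape (..., 4)
        rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))        # colors squashed to [0, 1]
        sigma = np.maximum(out[..., 3], 0.0)             # non-negative opacity/density
        return rgb, sigma

# Query the (untrained) field at one 3D point seen from one viewing direction.
field = TinyNeRF()
rgb, sigma = field(np.array([0.1, -0.2, 0.5]), np.array([0.0, 1.57]))
```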
And what I'm showing you right now is a project from Nvidia called Live 3D Portrait (LP3D), which presents a new method to render photorealistic 3D facial representations from just a single face portrait, and it does this in real time. Given a single RGB input, their image encoder directly produces a NeRF of the target face. And this
method is quite fast. So they achieved 24 frames
per second on consumer hardware, and it produces
much higher quality results than previous methods.
So what you're seeing right now is an input image on the left, and on the right is the output from the LP3D model.
And so this is just an example of current
research. Moving on.
This project is a system that learns physically simulated
tennis skills from large scale demonstrations of real tennis play.
And this is harvested at a large scale from broadcast videos.
The approach is based on a hierarchical model that combines several policy networks to steer the character within a motion embedding learned from these broadcast videos. And the system
can learn complex tennis skills and realistically chain together multiple
shots into extended rallies.
Importantly, this model uses only simple rewards and doesn't require explicit annotations of stroke types and movements. That's what makes it so powerful: it reduces dependency on human annotators.
And this model is applicable to other motion capture applications
as well. It also works despite the relatively
low quality of motion extracted from the broadcast videos.
And the end result of this is that the
system can synthesize two completely simulated
characters playing extended tennis rallies with realistic
physics. Diffusion models, by contrast, really struggle with realistic physics; more on that shortly.
So, aside from image and video generation, diffusion models have
also been used in a variety of applications outside
of this domain. Very recently, researchers at Tsinghua University have used diffusion models to generate 3D models. In other words,
each of the models pictured here can be moved around and
manipulated in six degrees of freedom.
So what are the future implications? For the
sake of demonstration, I'm going to use videos generated by Sora,
because they are by far the most advanced videos currently available.
So what does the future look like with such powerful AI now
becoming so readily available? And how does this affect the
various aspects of the media we consume? Maybe this marks
the end of an entire tier of digital creators, or maybe it simply
augments the capabilities of skilled videographers.
We've seen the rise of several innovations that really shifted the landscape
of entire industries, the most impactful of which is likely sitting in your hand or pocket, if it's not on the charger. And visual generative AI could very well follow the paradigm established by smartphones over the last two or so decades. By placing
a camera in the hands of pretty much every single person on the
planet, it significantly reduced the barrier of entry for
photography. And similarly, it could very well be the case that videography and even 3D modeling continue along this trajectory without necessarily
displacing a significant portion of the industry.
But what about licensing drone stock footage?
We've seen that Sora is capable of generating highly realistic footage that mimics the sort of footage you would get from a drone that costs several hundred dollars.
So here's another parallel with Photoshop.
Before Photoshop, it was much easier to accept the credibility of images, to take them at face value. And at least for now, it's possible to find weaknesses in generated video, especially scenes that involve object interactions and
realistic physics. This AI
content isn't really fooling anyone. No one is going to
see this particular video and think this
is real content. We can see the chair morphing and
floating around, but within the next couple of years,
it will be very hard, if not impossible, to discern these
sorts of images with the naked eye.
So it will prompt the creation of several important forensic tools; more on that shortly.
We can also see a few strange artifacts here with,
for example, a flame that doesn't move with the wind,
wax that morphs downwards rather than melting and dripping, and a character that doesn't really react
to touching a flame.
So we could also see other cultural shifts,
like being desensitized to AI content. This is
predictable because it happens with everything. And actually,
I lied about this video here. This is a drone
video. This was collected by a drone over the city of Chicago.
And the odd thing is, it doesn't really seem out of place
with the quality brought by advanced generative AI.
So what does that mean for the industry?
So let's go over some of the current limitations of visual generative
AI, and I'd encourage you to keep this in
mind, but stay updated, because I'd expect that many of
these issues will have been vastly improved even within the next year
or two.
So why are we staring at several images of ramen?
I asked DALL-E several times to generate an image of ramen without chopsticks, and it was completely unable to. And I'd encourage you to try this yourself.
That seems pretty odd at first glance, but it really makes sense when
you consider the underlying diffusion architecture.
The DALL-E model is highly reliant on training data, and in this case, it appears that the training data has chopsticks in every image of ramen. So DALL-E is actually unable to create an image without chopsticks somehow making their way into it. And this raises the question: can AI truly
be creative? Or does it just rely on the creativity of
its training data? And this is more relevant when
considering the ongoing battle between large AI companies and individual artists, who are unhappy about the usage of their artwork. And as far as
I know, human artists don't mindlessly synthesize images they've
seen before. I don't have any radically different ideas
on this particular debate, but it's definitely worth taking into account the
underlying mechanics of diffusion models.
In this last scenario, I gave DALL-E the task of illustrating Winograd schemas, and these are basically phrases that test DALL-E's ability to interpret and use context clues. DALL-E was actually able to get several of these, which surprised me. I basically told DALL-E to illustrate the following scenarios. So, in the top
left, the scenario I used was "I dropped a bowling ball on a table, and it broke." This requires an understanding of bowling balls, tables, and what this interaction means. Obviously, if I asked a human to explain what this means, they'd probably tell me that bowling balls are heavy,
tables can be fragile, and especially when you drop a
bowling ball from a height, it's very possible that
the table breaks. But that requires understanding
of these objects and the interactions at play.
And this is actually very impressive for a visual generative AI model, because DALL-E 1 and 2 especially struggled with this. In the top right, I used the phrase "my trophy couldn't fit into the brown suitcase because it was too large."
Once again, this requires understanding of suitcases
and objects.
Really, I think ChatGPT is serving as an intermediary here, sort of guiding DALL-E. So it's perhaps reducing the computational load on DALL-E specifically,
but this is still pretty impressive.
In the bottom left, I used the phrase "the large ball crashed right through the table because it was made of styrofoam." And this is really the first one that DALL-E struggled with. It appears that everything in this room is made of styrofoam, the table and the ball, so that doesn't really make sense. This is similar to the first prompt, but it incorporated different materials, and it looks like DALL-E is still struggling with that a little bit. But the fourth one
surprised me by far the most. The phrase I used was "the painting in Mark's living room shows an oak tree. It is to the right of a house." This is a very complex sentence, probably a little bit confusing even for humans, because it refers to several objects and their relative positions. And it looks like DALL-E nailed this. I was very surprised to see this.
So DALL-E 4 may in fact mark the end of gullible visual generative AI, but I suppose we'll have to wait and see; the rate at which it's advancing is remarkable. Now let's see how researchers are planning to navigate AI growth.
So along with the rise of visual generative AI, we also
see a natural increase in the spread of misinformation and fraudulent activity.
So, for example, security footage and evidence
tampering. This could require additional forensic tools specifically
designed to mitigate generative AI. If I asked Sora to create security footage of a bank robbery that didn't actually happen, that could be very dangerous.
And I'm sure OpenAI will impose restrictions and use ChatGPT as a bouncer, sort of filtering the inputs that Sora is allowed to take in. But nonetheless, I am excited to see the capabilities of Sora, and the lengths to which people will go to break the safety protocols that OpenAI puts in place.
So this has actually been pretty amusing with ChatGPT. For example, asking ChatGPT to give you a recipe for napalm brownies. It's generally gotten better at dealing with these prompts, but I still know for a fact, because I tried, that it is possible to get ChatGPT to do some pretty wacky stuff that OpenAI probably wouldn't endorse.
So in addition to evidence tampering,
what about advertisements for fake products?
This could lead to widespread scamming and also
false information or propaganda designed to mimic educational videos.
That's another way that visual generative AI could conceivably be applied within the next couple of years, alongside low-quality, mass-produced content churned out with the help of generative AI. With the help of visual generative AI, the entire process of content creation, from start to finish, can be streamlined: writing a script, generating stock footage, perhaps with the usage of Sora, and editing it all together seamlessly.
And I'd also like to briefly touch on the images that I used here. These images were generated by DALL-E with prompts about misinformation, and I think it did a pretty good job. So does this mean that all media goes downhill from here?
Not quite. I am personally
quite optimistic because there are several safeguards
being discussed right now that could drastically reduce the impact
of AI-generated misinformation. For example, C2PA is a proposed media specification that would be able to verify the origin and edit history of any media. The corporations shown at the bottom are pushing for legislation mandating the usage of verification standards, and the standard is supported by several of the largest media giants. This would ensure that every edited pixel would leave a footprint.
So here's how it would work. Let's say
we have a camera that uses the C2PA standard. We could bring the footage into Photoshop, which would then create a seal recording the edits. This could then be uploaded to YouTube, and you would be able to click on an icon in the top right. This would show you
exactly where the content came from,
and it would let you decide for yourself whether or not it's
legit. So you'd see whether or not it came from
the New York Times or someone pretending to be the New York Times.
And this is actually very promising.
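To give a feel for the idea (and only the idea; this is not the actual C2PA manifest format, which relies on certificates and cryptographic signatures rather than bare hashes), here's a toy Python sketch of a provenance chain in which every step leaves a footprint that tampering would break.

```python
import hashlib
import json

def provenance_record(prev_record, actor, action, content_bytes):
    # Toy provenance entry: who did what, a hash of the resulting content, and
    # a hash of the previous entry, so altering any earlier step breaks the chain.
    prev_hash = None
    if prev_record is not None:
        prev_hash = hashlib.sha256(
            json.dumps(prev_record, sort_keys=True).encode()
        ).hexdigest()
    return {
        "actor": actor,                                    # e.g. "camera", "Photoshop"
        "action": action,                                  # e.g. "capture", "crop"
        "content_hash": hashlib.sha256(content_bytes).hexdigest(),
        "prev": prev_hash,
    }

# Capture on a camera, edit in Photoshop, publish to YouTube: three chained records.
raw = b"raw sensor bytes"
edited = raw + b" color-corrected"
r0 = provenance_record(None, "camera", "capture", raw)
r1 = provenance_record(r0, "Photoshop", "color-correct", edited)
r2 = provenance_record(r1, "YouTube", "publish", edited)
```

A viewer, or the icon YouTube would show in the corner, could then walk this chain back to the original capture device or, in the generative AI case, back to the company whose model produced the content.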
In the case of generative AI, OpenAI or any other company that uses C2PA would stamp its media with a unique signature ensuring it's accurately classified as having come from that company. Specifically, it's worth mentioning that OpenAI has pledged to adhere to the C2PA standard once Sora is formally released to the public. So what
else can we do to prevent the rise of AI-generated misinformation? Well, like many things, the solution turns out to be more AI. TikTok is one of the first platforms that requires users to tag AI-generated content.
And this is very convenient because it forms
a large set of labeled content that could then be used to train
a large media classifier. And in
fact, the licensing of such a tool could prove to be highly valuable
several years down the line.
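As a sketch of what that could look like (entirely synthetic data and a deliberately simple model; a real detector would be a deep network trained on the actual tagged videos), here's how labeled content could feed a binary AI-versus-real classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset: one feature vector per video (imagine embeddings or
# frequency statistics) and a label harvested from the platform's AI tag
# (1 = tagged as AI-generated, 0 = not). Everything here is randomly generated.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 128))
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A simple linear classifier as a placeholder for a much larger media classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```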
So, a quick summary: diffusion models are a type of generative AI that are highly effective at creating realistic media. This media can be used in a variety of ways, from filmmaking to education and far beyond. Within the next decade, this will likely have far-reaching implications in nearly every industry that relies on digital media. And there are several proposed countermeasures to AI-generated content, including metadata verification and AI content classifiers.
I really enjoyed preparing this talk and I'm really grateful
for the opportunity to share my thoughts.
Please feel free to reach out to me with any questions.
My name is Jayram Palamadai and thank you for watching
my session.