Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and let's talk about unlocking the power of LLMs and building a Hugging Face agent.
So first of all, as you might know, Hugging Face Transformers is a popular state-of-the-art machine learning library for PyTorch, TensorFlow and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision and audio. That was just a quick introduction; you might already be familiar with it.
And let's go to our agenda. Today we're going to talk about what agents are, what tools are, how to set up and initialize the agent, how to use predefined tools such as translation, image captioning and text to speech, what other predefined, curated tools exist, and, very interestingly, how we can create our own custom tool. So let's get started.
Right. First, what are agents? Let's think about the general meaning of the word: an agent is a person you might hire to perform different tasks. For example, an agent can assist you in writing publications, calling someone, or publishing posts on social media. The general idea is that an agent is an assistant that simplifies your life,
right? And if we go back to the Hugging Face idea of an agent: an agent is a large language model, or LLM, and we prompt it, asking it to perform a specific set of tasks, and the agent can be equipped with different tools. We will talk in just a minute about what these different tools are. Why is this possible? A large language model can generate a small piece of text or a small piece of code in a very good and efficient way. If we ask it to generate a whole script, it might not be so good at it, but generating three or four lines of code it can deal with. If we leverage this ability to generate a small piece of code or text and equip the agent with different tools, we can use its power, as you will see. And now
let's talk about what tools are. If you have a toolbox in your garage, you might have something like this, and this, and this, and hammers. All these tools are for specific tasks; each tool does a specific job. And you might know (I'm not very good at it, but you might know) how to use each tool and for which task.
It's the same idea if we think about tools for LLMs. A tool is something simple: it represents a single function with a name and a description, right? Each tool has its own name, and we have a description of the tool and of how we can use it. And each tool, each function, is dedicated to one very simple task. If we put this together (this picture is from the official Hugging Face Agents tutorial), we might have the following structure. We have an instruction, something we task the agent with or prompt the agent for, and it is translated into a prompt.
In this particular example, we ask the agent to read out loud the content of the image. If we think about it conceptually, we might want to first understand what is in the image and generate a text description; that is the first step. The second step is to read this text out loud. So this creates a prompt, and our agent, which is literally a language model, understands it. It has a toolbox right here with different tools, and the agent understands that it can use the image captioner to caption the image and the text-to-speech tool to read the text out loud. It generates code, runs it through a Python interpreter, and the text gets voiced.
So this is how it works. But if we think in general, why should we care, why might this be interesting for us? I would say that this is a great interaction experience: we don't even need to know how to code. We can leverage the concept of prompting versus coding, so we can prompt and get the code out of our prompt. It is also a great instrument for chained outputs. If you think about it, we can, for example, generate an image, then add some elements to the image, then maybe resize the image, or generate an image caption, translate it, and get it voiced. Because there are very different tools, and because we can add our own custom tools, it is very flexible, and we will learn how to add our custom tools.
And now let's go a little bit hands-on and learn how to set up and initialize the agent. As a prerequisite, we need a Hugging Face token and, depending on the agent we are going to use (in our case the OpenAI agent), an OpenAI API key, plus a bit of code. So let's see how it looks. Let me open the Colab here and run a simple setup. I use the latest version of transformers here, and I will need to pass my Hugging Face token, which you can get for free: go to the Hugging Face Hub and create your token there, a read or write token; for this particular activity a read token is enough. While it's running, I should also add that I will upload this code to a GitHub repository, so you will have this notebook in Jupyter notebook format and you can use it and play with it. Let me just grab my token and paste it here. I'm logged in, and I will pip install the OpenAI library here, and then I'm going to the agent initialization. I will use the OpenAI agent, and I need my OpenAI key here; let me grab it, let me paste it. Voila, we have it.
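A minimal sketch of that setup, assuming the original Transformers Agents API with the OpenAiAgent class; the login helper, model name, and placeholder key are illustrative assumptions:

```python
# Install the libraries used in this notebook (Colab style):
# !pip install transformers openai

from huggingface_hub import notebook_login
from transformers import OpenAiAgent

# Log in with a free Hugging Face token (a read token is enough here).
notebook_login()

# Initialize the agent with an OpenAI API key; the model name is an assumption,
# any completion model supported by OpenAiAgent will do.
agent = OpenAiAgent(model="text-davinci-003", api_key="sk-...")
```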
Now let's go back to the presentation and see what the predefined tools are. The first tool we are going to look at is the image captioning tool. It's very simple: we can take pretty much anything as an image and generate a caption for it. As you can see, we just call the command agent.run with a natural language prompt, and we pass an image as a variable here. Let's go and see how it looks.
First I will just quickly choose a picture; I will use some food as the picture, and I will generate a description for my picture in English. As you can see here, while the agent is running my code, it generates an explanation. So you can see the explanation, "I will use the following tool", and it chooses the image captioner tool, then it generates code, then it runs the code, and we can see the output here: a plate of food with eggs, bread and a cup of coffee, which is true.
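A minimal sketch of that call, assuming the picture is loaded as a PIL image; the file name is just a placeholder:

```python
from PIL import Image

# Load the uploaded picture (the file name is a placeholder).
image = Image.open("food.jpg")

# The agent picks the image captioner tool, writes a few lines of code,
# and runs them in a Python interpreter.
caption = agent.run("Generate a description of the image in English", image=image)
print(caption)  # e.g. "a plate of food with eggs, bread and a cup of coffee"
```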
Now let's go back to the presentation and have a look at the translation and audio generation tools. The agent can translate to and from over 80 languages and can voice the text easily. I'm not going to spend much time on this slide; I'm going to show it to you.
A little side note about the translation. Under the hood, the translation is handled by Meta's No Language Left Behind (NLLB) model. They claim it covers approximately 200 languages, but when I checked last time, not all of those languages were available for translation; some of them, at least with agents, were generating errors. So I just pasted a list of languages that worked for me, and if you want to use a specific language you might want to check it beforehand. In my list you can see that there are approximately 80 languages, which I believe is still good.
So going back to our tools: we can run it all together, translating the text and reading it out loud in one go, or we can run it separately. First let's run it together, in one go. And you can see again an explanation and the code generated by the agent, so you can see what's happening under the hood. Also, if you want to build a chain of inputs and outputs, don't forget to save your outputs as variables, so you can hand them as inputs to the next command. And we can see the translated text here. I'm not going to read it out loud because it is in Spanish. Yes, but I have a tool that can read it out loud.
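A minimal sketch of that chained run; the prompt wording is illustrative, and saving the intermediate output as a variable is what lets us hand it to the next command:

```python
# Translate the caption and read it out loud in one go.
agent.run("Translate the following text to Spanish and read it out loud", text=caption)

# Keep the translated text around so it can be fed into the next command.
translated = agent.run("Translate the following text to Spanish", text=caption)
```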
And let's hear how it sounds.
Unplodo de kameta con huevos paniunataza de cafa.
I'm not very proficient in Spanish,
but to my mind it doesn't sound like Spanish.
It sounds like an English version of Spanish, or something like that.
So maybe we can try to do it separately: first do the translation and then do the voicing. So I will take my translated text as a variable here, and I'm trying to get the audio. And let's try: Plato de Komeda conjuevos panionitaza de cafa.
No, I believe it's not very Spanish, but we will see what we can do here. And I promised you that we were going to talk about the other predefined, curated set of tools. Hugging Face provides you with several tools based on Transformers models, and we can have different tools.
Let's take a look at them. The first is the document question answering tool: you can have a document in an image format and ask a question based on it; under the hood, the Transformers model used for it is Donut. The next is text question answering: you can have a long text passed in as a string and ask a question based on it; the Transformers model used for this task is Flan-T5. The next is image question answering: you can pass an image and ask what is in the image, or a specific question about the image itself; the Transformer model operating under the hood is ViLT.
That is just so we understand what's going on under the hood. Next, we can have image segmentation: we can output the segmentation mask of the image, for example detect animals or detect nature in the image, and the model is CLIPSeg.
Also, in our example we had text to speech, and we can have the reverse task: we can take speech and transcribe it to text, and the Transformers model here is Whisper. We can also have zero-shot text classification: if we don't have many labels for a classification task, we can provide a text and a list of labels and use BART to classify the text. And, pretty straightforwardly, we can have text summarization: we can pass in a long text and get its summary in just two or three sentences; this task is also handled by BART.
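A couple of hedged examples of calling these curated tools through the agent; the file name, prompt wording, and variable names are illustrative:

```python
long_text = open("article.txt").read()  # placeholder source text

# Text summarization (BART under the hood).
summary = agent.run("Summarize the following text in two or three sentences", text=long_text)

# Text question answering (Flan-T5 under the hood).
answer = agent.run("Answer the question based on the text", text=long_text,
                   question="What is the main topic?")
```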
We can also have tools that are not based on Transformers. They are called Transformers-agnostic because they can use different models or just an ordinary Python script. For example, we might use the text downloader tool: we just provide a URL and download the text. Is an advanced model used under the hood? No, just a Python script that goes there and downloads it. Pretty simple.
We can also have different flavors of Stable Diffusion for text-to-image generation or image transformation, like the example you can see here: first we create an image, and then we realize that we might want to have this image changed a little bit, so we ask the agent to add a rock to this image.
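A minimal sketch of that chained image workflow; the prompts are illustrative, and the intermediate image is kept as a variable so it can be transformed in the next step:

```python
# Generate an image with the Stable Diffusion based text-to-image tool.
picture = agent.run("Generate an image of a boat on a lake")

# Transform the result: the previous output is handed back in as a variable.
updated_picture = agent.run("Transform the image so that there is a rock in it", image=picture)
```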
We can also have text-to-video generation, which is also a flavor of Stable Diffusion here. So yes, and as I already said, we can also create a custom tool. If we think about the agent as an octopus (I believe it has legs, yes), we can have an infinite octopus and we can add legs to this octopus. These legs are our custom tools, so we can extend the possibilities, the power, of our agent. And we can also push this leg to the Hub so other members can benefit from our interesting tool. And let's see how to do it.
As you might remember, our voicing tool was not very powerful, so I'm going to use the Google Text-to-Speech library to generate audio based on text, tailored to a specific language. I already installed this library here, I'm importing it, and I have a simple function using this library: I pass a text as one variable and a language as a second variable, and I have audio as the output. Pretty straightforward.
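A minimal sketch of such a helper, assuming the gTTS library; saving to a file and returning its path is an assumption about how the notebook plays the audio:

```python
from gtts import gTTS

def voice_text(text: str, lang: str = "es") -> str:
    """Generate speech for the text in the given language code and return the audio file path."""
    tts = gTTS(text=text, lang=lang)
    path = "speech.mp3"  # placeholder output file
    tts.save(path)
    return path
```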
And how can we wrap this function? How can we create a tool based on this function? We import the class named Tool from transformers, and we inherit from it: we derive our tool class from this Tool parent class.
Here is how it will look. We create this class and we pass a name and a description. The name is what the tool is called, like image captioner; here we have something like "Google voice in multiple languages", which describes how it works. And we have a description so the agent can understand what this tool is for. So this is the name and this is the description (for example, a hammer is for nails): a tool that can voice a word or a phrase in a given language, what it takes as input and what it outputs.
And here we should have a description of our input format and a description of our output format, and we should have a call, the function itself, defining how it operates. And we can already try it.
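A hedged sketch of such a wrapper, assuming the original Transformers Agents Tool API where inputs and outputs are lists of modality names and the body lives in __call__; the tool name and description strings are illustrative:

```python
from transformers import Tool

class GoogleVoiceTool(Tool):
    # Name the agent uses to refer to the tool (illustrative).
    name = "google_voicer"
    # Description the agent reads to decide when this tool applies.
    description = (
        "This is a tool that voices a word or a phrase in a given language. "
        "It takes a text and a language code as inputs and returns the path to an audio file."
    )
    inputs = ["text", "text"]
    outputs = ["audio"]

    def __call__(self, text: str, lang: str) -> str:
        # Reuse the gTTS helper sketched above.
        return voice_text(text, lang)
```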
So if I pass a language: yes, comida, comida. I believe it's "food" in Spanish, and it already sounds like Spanish. And as you might notice, I'm not using "Spanish" as the language name here, I'm using a language code, and that's why I will be using the langcodes library to translate from the human, natural name of the language to the language code.
But first let's expand our octopus: let's add this tool to our agent. So I'm going to reinitialize, restart, the agent, and I will need the OpenAI key once again. And I will use langcodes to translate the target language, which will be Spanish here. Yes, it translates "Spanish", or some other language, into a language code compatible with our tool.
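A minimal sketch of extending the agent with the custom tool and of the final run, assuming OpenAiAgent accepts an additional_tools argument and that langcodes.find() resolves a language name to its code; the prompt wording is illustrative:

```python
import langcodes
from transformers import OpenAiAgent

# Map the human-readable language name to a code compatible with gTTS.
target_lang = langcodes.find("Spanish").language  # "es"

# Reinitialize the agent with the extra "leg": our custom voicing tool.
agent = OpenAiAgent(
    model="text-davinci-003",             # illustrative model name
    api_key="sk-...",                     # your OpenAI API key
    additional_tools=[GoogleVoiceTool()],
)

# Voice the previously translated text in the target language with the new tool.
agent.run("Voice the following text in the given language", text=translated, lang=target_lang)
```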
And I'm going to run this command as a prompt. I'm going to give a prompt to our OpenAI agent and ask it to generate the voicing of the text that we already obtained in the previous steps, to voice it in Spanish. Un plato de comida, congue natasa de casse. I would say it sounds more Spanish, right? So that was a walkthrough of how to create your custom tool. And if you use a little bit of imagination, you can think about expanding it and using different tools, even funny tools like fetching an image of a cat from the Internet. It can be pretty much anything that we can code in Python, so our agent can be really powerful with this. And this is the conclusion and final thoughts, a quick recap of our talk.
Agents are still experimental; they are still a little bit brittle for many use cases, especially when it comes to complex prompting with different steps, so the output can in some cases be unexpected. We need to practice writing correct prompts for using the correct tools. But they are promising: as our agents are getting smarter, we can have a lot of tools. Agents are easily extendable, and they are also easy to start with. As we saw, we can simply leverage an agent in just a few lines of code and already use a great set of predefined tools. They are quite varied, and we can build some interesting chains, and that is what can make our agent really smart.
And also some further ideas. If you like the idea of agents, you might want to experiment more with prompting, writing advanced prompts, and you can also bring your custom model as an agent: try different OpenAI or just open source LLMs as the agent; maybe they will give you better results. And also, if you like the overall concept of smart agents and smart assistants, you can also check the agents of LangChain or Amazon Bedrock; they also provide some capabilities for empowering an LLM to act on your behalf. So yes, and as I already said, you can find my code, my slides and the links used in this presentation in my GitHub repository huggingface_agents, and let's stay in touch. Thank you.