Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi and welcome to this talk on SRE best practices
for API design. I'm Navendu and in this
session we will look into how development teams
can build reliable APIs. We will look into what reliability
means for APIs and reliability issues
in a traditional API design. We will also look into
how SRE fits into API development pipelines,
and we will top it off with SRE and DevOps-centric
best practices for API development with an
API gateway. Before we
move on, a little bit about me. I am Navendu
and I'm a developer advocate at API7.ai.
I currently contribute to Apache APISIX,
which is a cloud native API gateway.
I was also a Cloud Native Computing Foundation open source maintainer
and I also help Google Summer of Code and LFX
mentees start their open
source contribution journey. And you can reach out to me
on Twitter, I'm mostly active there. If you have any questions or
if you would like to discuss things further, feel free to reach out to me
over at Twitter. All right, let's start
the session by discussing reliability.
What does it mean to be reliable?
So if you
are a seller of an API, you might
have SLAs,
you might quote to your customers that your
API has a 99.9% uptime.
But uptime can be a myopic view of
what reliability entails.
And even in the case of uptime,
it is achieved by making sure
that your services don't crash.
There is something more to uptime, or rather uptime is
the result of some other factors.
So what does it mean to be reliable?
So when talking about reliability, a lot of
terms get tossed around.
These are consistency,
especially in case of APIs. You need
to have your APIs consistent so that
the client applications can produce reproducible results
with your API, and you need to make it
available. So availability directly translates to
what could be the uptime. So we want to make sure that
your API is available
all the time or as expected, and the
consumers of the API don't have app
crashes due to a
lack of response from your API. And low
latency: a service with high
latency is almost equal to a service that is not working.
So for a client or for a consumer,
it basically translates to a
failed application. So latency is also an important factor
when it comes to reliability. And security:
secure APIs and secure services are
like the pillars
of reliability when it comes to APIs.
And on top of that, you also need to ensure you have
the status of your API. This goes for
both the development teams and the consumers of the API.
Both of them should be aware of how
the API is performing and what its status is.
Is it up right now, or are there some
redirects configured, that sort of thing? We will look into this
further. So I want to emphasize
the point that reliability is more than just
uptime. And for
this talk I will use the term microservice
loosely. It need
not be a cloud native microservice, it can be your application
servers or anything that is serving
your API to your consumers. So traditionally
you will have more than one client
for your API, and I am representing it here
by different applications running on different platforms.
Yes. Let's look into
some of the problems you face with
traditional API architectures that
will be of concern to you as a site reliability
engineer. So we talked about
all the different pillars of reliability or
different aspects of reliability,
and if you want to improve
the reliability of your services, you have to do something about it.
So in a traditional API architecture,
if you want to do something about it, what you will end up doing is
you will have to configure or you'll have to add something
new to each of your services, each of your endpoints.
So these endpoints
basically could be written in multiple programming languages.
They could be using different libraries, all sorts of things.
So it is not plug and play.
It is more of a tedious job that can waste
a lot of developer hours,
and they are not centralized. So when
you see something like this, it immediately
pops into our mind that this should have been centralized. But in
this case, as the API scales, you also have to
scale the structure you have set up to ensure
reliability, which is not feasible, which is not sustainable.
So if you are setting up monitoring,
you'll end up having to monitor every service
or maybe every request to the service or
every endpoint in the service. And if you want to set up security,
the same goes, you will have to configure your
security for each of your services. And if you want to set
up something like authentication, it is also not centralized and you will
end up having to configure it directly
on all your services, which, needless to say,
is a lot of burden for the developers and
for the maintenance team who work on
it afterwards. And we
can even imagine how difficult it would be to
make new releases. So it will be a
tiring job because you
have to ensure very little downtime or
zero downtime. And we want to ensure that no requests are
interrupted while transitioning to this new
version of the API. So from a traditional
perspective, this seems too difficult to handle.
What is the solution? What can we
do to overcome this?
That is where we introduce API gateways.
So API gateways have been around for a
really long time, ever since the API development
model was popularized,
and they have been gaining wide adoption ever since
people started moving from monoliths to microservice-based
architectures. So what
do you mean by an API gateway and why should you care about it?
Now, if we go back to our services,
you have a lot of services and you end up having to apply all
of your configurations,
like monitoring, tracing,
security, authentication and traffic control and all
sorts of things, directly to your microservices. And that is
where an API gateway steps in.
So an API gateway acts as a common
entry point for all of your traffic. And in turn,
an API gateway has
some configuration, and based on that configuration, it routes the traffic
to your backend, to your services.
So an API gateway in essence abstracts
out all the configuration you need on
your APIs. When talking in terms of observability,
it abstracts out all the burden from
each of the individual services into one standalone
instance, and it can be managed centrally.
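
The single entry point described above can be sketched in a few lines. This is only an illustrative in-process model, not how any particular gateway is implemented: the `Gateway` class, the `require_api_key` plugin, the route prefixes and the `demo-key` value are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A request is modelled as (path, headers); a response as (status, body).
Request = Tuple[str, Dict[str, str]]
Response = Tuple[int, str]
Handler = Callable[[Request], Response]
# A plugin returns a Response to short-circuit, or None to continue.
Plugin = Callable[[Request], Optional[Response]]


@dataclass
class Gateway:
    """Hypothetical gateway: one entry point, shared plugins, many routes."""
    routes: Dict[str, Handler] = field(default_factory=dict)
    plugins: List[Plugin] = field(default_factory=list)

    def handle(self, request: Request) -> Response:
        # Cross-cutting concerns run once here, not in every service.
        for plugin in self.plugins:
            early = plugin(request)
            if early is not None:
                return early
        path, _ = request
        # Longest-prefix match picks the backend service.
        for prefix in sorted(self.routes, key=len, reverse=True):
            if path.startswith(prefix):
                return self.routes[prefix](request)
        return (404, "no route")


def require_api_key(request: Request) -> Optional[Response]:
    """Example plugin: reject requests without the (made-up) demo key."""
    _, headers = request
    if headers.get("x-api-key") != "demo-key":
        return (401, "unauthorized")
    return None


gateway = Gateway(
    routes={
        "/orders": lambda req: (200, "orders service"),
        "/users": lambda req: (200, "users service"),
    },
    plugins=[require_api_key],
)
```

Adding a new concern, such as logging, would mean appending one more plugin to the list rather than touching each service.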
So an API gateway does a lot of functions.
So it manages authentication,
it deals with your security, it can be configured
to allow for monitoring and observability,
and it can also be used for traffic control,
among a lot of other things.
So an API gateway is quite useful.
And with that in mind,
let's look at reliability, some of the reliability
best practices for API gateways,
and there are a lot of vendor neutral and
open source API gateways available.
As I mentioned, I am one of the maintainers
of the Apache APISIX project,
which is also a cloud native API gateway.
But throughout this talk,
I'll be talking about API gateways at a high
level, and you can use any of the API gateways of
your choice, or you can even go for cloud
providers' API gateways. So let's look
at reliability best practices with these API gateways.
So authentication and security, as we discussed
in the earlier section of this talk, are quite
essential. And the
first thing is user authentication.
So user authentication, or authenticated requests,
are a proven way to secure your client-API interactions.
And when it comes to monitoring, authenticated requests also
help monitor your APIs in a very fine-grained
manner. The picture is self-explanatory.
We have all traffic routed
through the API gateway and the API gateway will handle
the authentication. So you can have basic authentication
like a JWT
or a cookie in the header, or go from something
basic to something like
authentication providers like Active Directory
and all sorts of things, or maybe even
authentication. So basically, the API gateway takes care
of all of your authentication needs,
and once your client is authenticated,
the info gained
from the authentication can be used for
the algorithms in the service, or it can
be used later on in your backend or in your services.
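
As a rough sketch of that flow, the gateway can verify a signed token once and pass the caller's identity on to the backend. The token format here is a simplified HMAC-signed string, not a real JWT, and the `SECRET` value and the `x-consumer-name` header are assumptions made up for the example.

```python
import base64
import hashlib
import hmac
from typing import Dict, Optional

SECRET = b"demo-secret"  # assumption: a secret known only to the gateway


def sign(consumer: str) -> str:
    """Issue a token of the form <base64 consumer>.<hex HMAC-SHA256>."""
    payload = base64.urlsafe_b64encode(consumer.encode()).decode()
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify(token: str) -> Optional[str]:
    """Return the consumer name if the signature is valid, else None."""
    payload, sep, mac = token.partition(".")
    if not sep:
        return None
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the MAC matched.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return base64.urlsafe_b64decode(payload.encode()).decode()


def authenticate(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Gateway step: return None to reject, or headers with identity added."""
    auth = headers.get("authorization", "")
    if not auth.startswith("Bearer "):
        return None
    consumer = verify(auth[len("Bearer "):])
    if consumer is None:
        return None
    # Hypothetical header name: backends read the identity from here.
    return {**headers, "x-consumer-name": consumer}
```

Because the identity travels with the request, the same value can also label metrics and logs per consumer, which is what makes fine-grained monitoring possible.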
And the next important aspect of security
is rate limiting. This is something that some
of you might not have thought of from a reliability
perspective. But rate limiting is also quite
important, mainly because it avoids
intentional or even unintentional misuse
of your APIs, like denial of service attacks.
And it also helps improve scalability as
your API encounters traffic spikes,
especially unexpected traffic spikes.
So rate limiting is quite important.
So basically all
of your requests will be routed through your API gateway.
And if
your services can't handle a set
of requests, what you can do is you
can block those requests. So if there are too many requests,
they won't be processed and you can either reject those
requests or you can delay
those requests. So based on your configuration or based
on what you are trying to do, you can do either. By reject,
I mean you will entirely suspend
those requests and you will probably return a 500
range status code back to your client. Or you can maybe
delay those requests: if you
can tolerate some level of latency, or if your client application
can tolerate some level of latency, you can delay those requests until your
services are able to handle those requests.
So it is like first come, first served.
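
The reject and delay options can be sketched with a token bucket, which is one common rate limiting algorithm and not necessarily the one a given gateway uses. The class and function names are hypothetical, the clock is passed in as `now` so the behaviour is easy to follow, and the default rejection code of 503 is just one choice from the 500 range (429 Too Many Requests is another widely used option).

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Reject mode: True if the request may pass right now."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def delay_needed(self, now: float) -> float:
        """Delay mode: seconds to wait until one token is available."""
        self._refill(now)
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate


def handle(bucket: TokenBucket, now: float, reject_code: int = 503) -> int:
    """Return 200 if admitted, otherwise the configured rejection code."""
    return 200 if bucket.try_acquire(now) else reject_code
```

In delay mode a gateway would hold the request for `delay_needed(now)` seconds instead of rejecting it, trading extra latency for fewer failed calls.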
So you can work based on that, and
there are even other ways in which you can
ensure security and authentication. But I will leave that
to you to explore and I will move on to the
monitoring and observability part of our discussion.
So monitoring and observability deals
with tracing,
logging and metrics.
So we have our API gateway,
and by monitoring
you can track your reliability metrics. We talked about
some metrics, and setting up a monitoring tool
directly on your API gateway means you can monitor all of your traffic
for your reliability metrics.
And the API logs and your traces
give detailed information about one particular request. So a trace tracks
the entire request throughout your API, from your
API gateway, through your services and back to the client. So both
can give detailed information about
the different reliability metrics and you can know how your API is
performing. And setting up
monitoring also helps you to know when your API has
failed or when there is an error. Instead
of it silently failing, with monitoring set up and alerts
set up, you can
come in and fix it quite quickly
and get the system up and running again.
Later on we will discuss some circuit breaking
mechanisms, but basically setting up monitoring
can go a long way. It
can also help in knowing your
traffic. So when to scale, when not to scale, those kinds of
metrics are key here as well.
So going back: we can set up tracing,
we can set up logging and we can set up metrics.
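
To make the metrics part concrete, here is a minimal sketch of a collector that a gateway could feed once per proxied request. It is a toy in-memory model with hypothetical names (`GatewayMetrics`, `should_alert`) and made-up thresholds; a real setup would export these numbers to a monitoring system instead.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class GatewayMetrics:
    """Hypothetical in-memory collector fed once per proxied request."""
    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, status: int, latency_ms: float) -> None:
        self.latencies_ms.append(latency_ms)
        if status >= 500:  # count server-side failures only
            self.errors += 1

    @property
    def total(self) -> int:
        return len(self.latencies_ms)

    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of observed latencies, p in (0, 100]."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    def should_alert(self, max_error_rate: float, max_p95_ms: float) -> bool:
        """Alert instead of failing silently when either limit is crossed."""
        return (self.error_rate() > max_error_rate
                or self.percentile(95) > max_p95_ms)
```

Since every request passes through the gateway, one collector like this covers all services at once, and the same numbers can inform when to scale.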
Now let's look at version control and zero downtime.
Maybe this is
more straightforward to think about when it comes to reliability,
especially in the case of zero downtime. How do you ensure
that your services stay up all
the time? So let's
first look at the version control aspect.
So when you are releasing a new version of
your API, how do you do that? So there is a
release strategy called canary release.
So basically what you can do with an API
gateway is direct all of your
traffic to an upstream. An upstream here represents
all of your backend or your services.
So you have an upstream on version
one and you are trying
to introduce a new version two, but you haven't tested
it with production traffic before, so you want to ensure
that it works perfectly before you
deploy it completely. So you
don't want to have to roll back to the previous version when
something fails, you have to ensure that it will work.
So initially what we will do is we will get
all traffic to our API gateway and it will direct all traffic
to an upstream, to the initial version
of our upstream. That is how it will be functioning normally.
And when we have a new version ready, what we
will do is direct a little traffic,
a small part of the traffic, to the
new upstream. So an API gateway can be configured to do
this dynamically. Based
on the results from this traffic to the new version, if it is
working fine, you can slowly increase the traffic
to the new version until you have
all the traffic directed to the new version.
So this will ensure that your services stay up
all the time, and it will also ensure
that the new version you have released works
perfectly. And in case something
fails, or in case there are some
issues, you still have your previous upstream in standby,
and you can go back to it quite easily by just changing the
configuration in your API gateway.
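
The canary flow can be sketched as a weighted split between two upstreams. This is a simplified, deterministic model with hypothetical names (`CanaryRouter`, `upstream-v1`, `upstream-v2`) and made-up thresholds; real gateways typically express the same idea as weights in their route configuration.

```python
from dataclasses import dataclass


@dataclass
class CanaryRouter:
    """Sends `canary_percent` of every 100 requests to the new upstream."""
    stable: str = "upstream-v1"   # hypothetical upstream names
    canary: str = "upstream-v2"
    canary_percent: int = 0
    seen: int = 0

    def pick(self) -> str:
        before = (self.seen * self.canary_percent) // 100
        self.seen += 1
        after = (self.seen * self.canary_percent) // 100
        # The running quota ticks up exactly canary_percent times per 100.
        return self.canary if after > before else self.stable

    def adjust(self, canary_error_rate: float, step: int = 10,
               max_error_rate: float = 0.01) -> None:
        """Shift more traffic if the canary is healthy, else roll back."""
        if canary_error_rate > max_error_rate:
            self.canary_percent = 0  # all traffic returns to the old version
        else:
            self.canary_percent = min(100, self.canary_percent + step)
        self.seen = 0  # restart the quota with the new weight
```

Rolling back is only a configuration change here (`canary_percent = 0`), which mirrors the point that the previous upstream stays available while the new one is being evaluated.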
Now let's talk about circuit breaking. Circuit breaking
seems like an electrical engineering concept,
but circuit breaking is quite essential
in modern software
architectures. So basically you
have multiple upstreams, and all these upstreams
do the same thing. Our API gateway acts
as a load balancer
for your upstreams. And if
one upstream service is unavailable,
or maybe it is experiencing high latency,
it needs to be cut off. Because if you don't
cut it off, requests coming to
the failed upstream will
stagnate and cause resource
exhaustion, and the gateway or the service will
keep retrying the request.
So this can cause a chain reaction and it can even
cascade into all of your other upstreams. So your
whole system may
fall like domino tiles,
so it needs to be cut off.
So your upstream has gone down.
And what the circuit breaking functionality
of an API gateway does is it cuts off
all traffic to your failed
upstream, and it instead routes
all traffic to your fully functioning upstream.
And once the upstream is back,
or once some time has
passed, what the API gateway does is it tries to check
the status of the upstream, and if it is working
fine, it can go back to the healthy state
and the traffic can again be sent
to this upstream and it can again be functioning
as normal.
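
The cut-off and recovery cycle described above is usually modelled as a small state machine. The sketch below is a simplified per-upstream breaker with hypothetical names and default values; the clock is passed in as `now`, and the half-open state here lets any request through as a probe, whereas production breakers often allow only a limited number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    """Per-upstream breaker: closed -> open -> half-open -> closed."""
    failure_threshold: int = 3          # consecutive failures before opening
    reset_timeout: float = 30.0         # seconds to wait before a probe
    failures: int = 0
    opened_at: Optional[float] = None   # None means the circuit is closed

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def allow(self, now: float) -> bool:
        """Closed and half-open circuits let a request through."""
        return self.state(now) != "open"

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None       # healthy again: close the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold or self.opened_at is not None:
            self.opened_at = now        # open, or re-open after a failed probe
```

A load balancer would call `allow` before picking an upstream and skip any whose breaker is open, so traffic flows only to the healthy ones.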
And finally, there is
also this case of reporting status,
or creating
new APIs, or changing APIs. So what happens
when you change the path of your API? How does it
affect users and how can you ensure reliability
in such a case? So say a
client is used to sending
requests to one particular path,
and the path is no longer there,
or you are trying to change the path for whatever reason,
or maybe the services changed, or maybe things change.
So basically what happens here is your
old path is no longer there, but instead you have a
new path, but the client users don't
know this path directly. You either have to
talk to them beforehand, or provide documentation on this
change, or something like
that. But in most cases, it can
be a tedious process to change the client code.
So how do you handle such cases? How do you let
your client know that this is the new API endpoint?
That is where an API gateway comes in. So in your
normal use case, when you are going to
the old API path, the API gateway directs all
traffic to your API endpoint, and when
it is no longer there, what you can do is change
the configuration of your API gateway to redirect
traffic from this path to your
new path. So every time a
user goes to this particular endpoint,
the API gateway is configured to redirect the user to
the new API. And you can even give
a redirect status
code before redirecting, and you can even send,
let's say, a message saying that, okay, this old API path
is being deprecated and this is the new path.
Please change this and you can get on with
that. But still, it will be backwards compatible as the
users of the old API will still be able
to access the new endpoint without
having to change any of their client code.
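
A redirect rule of this kind can be sketched as a lookup from deprecated paths to their replacements. The paths and the `resolve` function are hypothetical; 308 Permanent Redirect is used because it tells clients to repeat the same method on the new URL, though other 3xx codes may suit other cases.

```python
from typing import Dict, Tuple

# Hypothetical mapping of deprecated paths to their replacements.
REDIRECTS: Dict[str, str] = {"/v1/orders": "/v2/orders"}


def resolve(path: str) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a request path."""
    for old, new in REDIRECTS.items():
        # Match the old path itself or anything nested under it.
        if path == old or path.startswith(old + "/"):
            target = new + path[len(old):]
            headers = {"Location": target}
            return (308, headers, f"{old} is deprecated; use {new}")
    return (200, {}, "ok")
```

Clients that follow redirects keep working unchanged, while the message in the body tells their developers which path to move to.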
So let's wind up this session with a quick summary,
and let's look at the key takeaways. So
we started this discussion by talking about
reliability, and we decided that reliability is more than
just uptime, and it is also about consistency,
availability, low latency, security and status.
We also looked into API gateways and how
they overcome the issues faced by
traditional API architectures. We also looked at
how API gateways can help with best practices
for reliability in the areas of authentication and security,
monitoring and observability, version control
and zero downtime.
That's it. And if you'd like to learn more, you can
check out the Apache APISIX documentation.
APISIX is a free and open source API gateway
hosted by the Apache Software Foundation. And there
are also other free and open source API gateways
available out there. And you can also reach
out to me on Twitter. Here is my Twitter handle if you
have any questions, or I'll also be hanging out in the Discord
channel where you can ask questions.
So thank you,