Transcript
This transcript was autogenerated.
Hi folks, and welcome to this session as part of the Conf42 Cloud Native conference. In this session today we're going to be talking about how we can improve reliability using health checks and dependency management within our applications and workloads. So let's get stuck into it then. Before we go any further, a quick background on myself. My name is Andrew Robinson. I'm a principal solutions architect at Amazon Web Services and part of the AWS Well-Architected team. In the team that I'm in, we work with our customers and our partners to help them build secure, high-performing, resilient and efficient infrastructure that they can run their applications and workloads on. I've worked in the technology industry for the last 14 years, and my background is mainly in infrastructure and reliability. I've also spent some time in data management as well. The workload that we're going
to be looking at for the purpose of this is a web
application. This web application works as a recommendation engine.
Let's just have a quick look then at the data flow of how users would walk through this application. So first of all, we start with our users up here. They connect in through an Internet gateway, which is our public-facing endpoint. This then sends them to an Elastic Load Balancer. This load balancer will take those incoming connections and distribute them out across a pool of servers, or in this case Amazon EC2 instances. You'll note that we have three separate EC2 instances, and each one of these is running in a separate availability zone. We've got multiple instances so that in the event of a failure of one of those instances we can still service user requests, and they're in different availability zones so that we've got some level of separation.
If you're not familiar with an availability zone in AWS, think of it as akin
to a data center or a collection of data centers that are joined
together using high throughput, low latency links
to provide you with different geographical areas that you could
deploy that workload into within a single AWS region.
You'll note that the instances are in an Auto Scaling group. This Auto Scaling group allows us to scale up and scale down to meet demand, improving reliability because we're able to better handle user requests by making sure we've got the appropriate number of resources available. But it also provides us with the ability to replace a failed instance. So if one of these instances has an issue, maybe there's a failure of underlying hardware, maybe there's a configuration issue and that instance goes offline or becomes unavailable, or, as we may see later, fails a health check, we then have the ability to replace that instance automatically to continue serving user requests.
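As a rough illustration of that Auto Scaling side of things, the sketch below shows how a group like ours could be created with boto3. It isn't the lab's actual setup; the names, ARNs, region and sizes are placeholders, but the HealthCheckType of ELB is the piece that lets the group replace instances that fail the load balancer's health check.

```python
import boto3

# Illustrative sketch only -- names, ARNs and region are placeholders.
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="recommendation-web-asg",
    LaunchTemplate={"LaunchTemplateName": "recommendation-web", "Version": "$Latest"},
    MinSize=3,
    MaxSize=6,
    DesiredCapacity=3,
    # One subnet per availability zone, so instances are spread across AZs.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    # Register new instances with the load balancer's target group.
    TargetGroupARNs=["arn:aws:elasticloadbalancing:eu-west-1:111122223333:targetgroup/recommendation/abc123"],
    # Use the load balancer's health check, not just EC2 status checks,
    # so an instance failing its health check gets replaced automatically.
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```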
I mentioned that this is a recommendation engine, and the actual recommendation engine that we use is an external API call. So we have a recommendation service that sits external to our application, and this is where our recommendations come from. In our case we're using Amazon DynamoDB as a NoSQL key-value database, and this stores our recommendations. But this could be any service external to the workload or application. It could be an external company, maybe a payment provider, or it could be an external service within your own organization that you're calling.
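To make the later code examples concrete, here's a hedged sketch of what that recommendation call could look like with boto3 and DynamoDB. The table name, key and attribute names are made up for illustration; the real service just needs to return something we can parse, like a TV show and a username.

```python
import boto3

# Table, key and attribute names here are illustrative, not the lab's actual schema.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("RecommendationService")

def get_recommendation(user_id: str) -> dict:
    """Call the external recommendation service (DynamoDB in this sketch)."""
    response = table.get_item(Key={"UserID": user_id})
    item = response["Item"]  # raises KeyError if there's no item, i.e. the call fails
    return {"TvShow": item["TvShow"], "UserName": item["UserName"]}
```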
Just as a final point, you'll also notice that we have these NAT gateways. These provide our EC2 instances, or servers, with external Internet connectivity. This is needed in our case to be able to call our recommendation service. Let's dive a little bit deeper then into some best practices for
how the infrastructure of this has been built, and then we'll dive into the
code that's running on those instances and show you how we can implement some
of those health checks and dependency monitoring that we mentioned. So some best practices
to get us started. First, high availability in network connectivity. At AWS we take care of some of this for you: at the physical level, we make sure that there are multiple connections going into the different data centers that make up our availability zones.
However, at the logical level you will still need to do some implementation.
As you saw earlier, we've got NAT gateways that we use here, and we've got multiple NAT gateways. So in the event of a network failure in one of our availability zones, we can still route traffic through another NAT gateway in another availability zone, and that instance can still communicate with our external API service. We also want to deploy to multiple
locations. This helps give us a smaller fault domain or a smaller
blast radius for errors that may occur, and also means that in
the event of a failure in one of those areas, we can still continue to serve our application. We've done this at the availability zone level in this case, and as I mentioned earlier, an availability zone is a collection of data centers, or a single data center, that has low-latency, high-throughput connectivity to other data centers in the same availability zone. So you could think of it as akin to a single fault domain. If you need to for workload purposes, you may want to look at deploying this to multiple geographic regions
across the globe, maybe to service users in different areas, but also because that can help to improve reliability. It does come with some additional management overhead, though, because you then have multiple AWS regions that you're managing your workload within.
Finally, we're making sure that we're using loosely coupled dependencies. We've done this in this scenario by placing the API call for our recommendation engine external to the actual application that's running on our instances. We could go further and use systems like queuing or data streaming to move data asynchronously, and that would then help create the additional loose coupling that we need.
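Just to illustrate that queuing idea, a minimal sketch with Amazon SQS might look like the following. The queue URL is a placeholder and this isn't part of the workload we're walking through today; it simply shows how a request could be handed off asynchronously instead of being a synchronous call.

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")  # region is illustrative

# Placeholder queue URL -- this workload doesn't actually use SQS.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/recommendation-requests"

def enqueue_recommendation_request(user_id: str) -> None:
    """Hand the request to a queue instead of calling the dependency directly."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=user_id)
    # A separate consumer reads from the queue and calls the recommendation
    # service, so a slow dependency no longer blocks the web tier.
```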
But for this purpose, we're just making an external API call. So let's
jump into the code that's running on these instances. And if you're not a developer (I'm not either), please don't worry, we're not going to be going through all of the code here. We're just going to be looking at some extracts, and I'll be explaining what all of this code does. The code in question here is all written in Python. This is the language that I'm most familiar with, which is the reason that I chose it. You can achieve the same thing in multiple other languages, but for the purpose of this I've just chosen Python. I find it the easiest one to explain, and hopefully that will make it easier for everybody to understand
what we're trying to achieve here. So our first basic health check that
we're doing is with our load balancer. We can have a path that we specify in our load balancer that the load balancer uses to check the health of those servers or instances it's connecting to. In this case, this is on the /healthcheck path. So we're looking for any connections coming in on that /healthcheck path, and if they do come in, we're sending back an HTTP 200 response, which is a success. That means that when our load balancer does its health check routinely, on whatever period we specify, every 30 seconds, every 60 seconds, if it then successfully connects to this URL with the health check path for that instance, we'll get a 200 response back. So that gives us some idea that the instance and the application is running.
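As a minimal sketch of that shallow health check, assuming a small Flask app (the framework and route name are illustrative, not necessarily what the lab code uses):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/healthcheck")
def healthcheck():
    # Shallow check: if the process can serve this route at all, report success.
    # The load balancer treats an HTTP 200 as "this instance is healthy".
    return "Instance is healthy", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```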
But this is a fairly simple health check, and it only tells us that. What could we do to make this a little bit more meaningful? We could look at doing deeper health checks, and that's what we'll have a look at here. We're still looking
on our health check path for any health check connections
coming in from our load balancer. What we're doing is we're setting this variable called is_healthy immediately to false, and we do that, as you'll see later on, so that if anything goes wrong with our health check process, the load balancer will get an error code back from the instance, and that means it knows the instance isn't healthy. What we actually then do is we use a try statement to make our call. So we're making our get recommendation call as part of our health check. Our health check is now going to be checking on the health of our dependency as well as the health of the actual application itself. We're just looking then for a response, and as I mentioned, we're looking for a TV show and a username that our recommendation engine provides. If we get this back, we set our is_healthy to a true value, so that we've no longer got that false value that was set there. And then we just have an exception clause that catches any errors that we've got and provides us with a traceback error code that we can use.
Carrying on, still wrapped here in the same health check handler, we just have an if and an else statement. Our if looks at is_healthy, and if it's been set to a true value, we send a 200 response with a content type of HTML, set a message of success, and send some metadata. That means that we're not only checking that our application, and therefore our instance, is healthy; we're now also checking that we can successfully call that external API, meaning that the external API is healthy because it's providing us with a valid response. We then just have an else statement. So if anything else happens, we send a 503 error, and we include that exception error message from earlier so that we know what the error is. In this case, we'll be sending that 503 error back to our load balancer, and our load balancer will then mark this instance or server as being unhealthy and won't send any traffic to it.
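Putting that together, a hedged sketch of the deeper health check might look like this. It reuses the same Flask-style handler as the earlier sketch, and the get_recommendation stub stands in for the real dependency call; the exact shape of the lab's code will differ.

```python
import traceback
from flask import Flask

app = Flask(__name__)

def get_recommendation(user_id: str) -> dict:
    # Stand-in for the real dependency call (for example, a DynamoDB lookup).
    # Replace with the actual client code; it should raise if the call fails.
    return {"TvShow": "I Love Lucy", "UserName": "placeholder"}

@app.route("/healthcheck")
def deep_healthcheck():
    # Start from "unhealthy" so that any failure below produces an error response.
    is_healthy = False
    error_detail = ""
    try:
        # Exercise the dependency as part of the health check.
        recommendation = get_recommendation("1")
        if recommendation.get("TvShow") and recommendation.get("UserName"):
            is_healthy = True
    except Exception:
        error_detail = traceback.format_exc()

    if is_healthy:
        return "Success: dependency reachable", 200, {"Content-Type": "text/html"}
    # A 503 tells the load balancer to mark this instance as unhealthy.
    return f"Health check failed:\n{error_detail}", 503, {"Content-Type": "text/html"}
```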
I mentioned there that it won't send traffic to this instance. This is a behavior that we call fail closed. This means that in the event of that instance being unhealthy, the load balancer will no longer send any traffic to it. So you can see here that this instance is now marked as being unhealthy, with health checks failing with a 503 code. The two other instances are still showing as healthy, so any users connecting in will be sent to those two instances, and they'll still be able to use the application as before. But the load balancer will not send any traffic to that instance. We then have a choice: we can either choose to replace that instance straight away, or we can have a threshold for the number of failed health checks we tolerate before we decide to replace it. If we replace it straight away, that of course means that the instance is taken out of service and then a new one will be built to replace it. However, if we wait and the instance comes back healthy again, we could then resume sending traffic to it.
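Those knobs, the health check path, how often it runs, and how many failures we tolerate before acting, live on the load balancer's target group. A hedged boto3 sketch (the ARN, region and numbers are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")  # region is illustrative

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:eu-west-1:111122223333:targetgroup/recommendation/abc123",
    HealthCheckPath="/healthcheck",
    HealthCheckIntervalSeconds=30,  # how often each instance is probed
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,        # consecutive passes before marking healthy again
    UnhealthyThresholdCount=3,      # consecutive failures before marking unhealthy
)
```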
What happens, though, if all the instances behind our load balancer fail? In this case, we revert to a behavior called fail open, and this means that requests will be sent to all targets. All targets being unhealthy means all of them have failed their health check, but because our load balancer is configured to fail open, we will route those requests to all targets anyway. Now sometimes
that's helpful. For example, if you have an external dependency that you're making a call to, which may be slow to respond, you may have instances that flap in and out of being healthy or unhealthy. Now, as those instances are flapping, that might not trigger the threshold to take a single instance out, and you may end up with all of the instances flapping at the same time, or all going unhealthy at the same time, and then your application isn't available. Because we have this standard fail open behavior, we need to make sure that we're testing our dependencies. We need to make sure that we're doing partial testing as well, so that we're testing both a standard health check and also the deeper health check, and we've got a true idea of what those instances are doing and what the application state is.
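One way to keep an eye on this from the outside is to ask the load balancer for the health state of each target. A hedged sketch with boto3 (the target group ARN and region are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")  # region is illustrative
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:eu-west-1:111122223333:targetgroup/recommendation/abc123"

def summarize_target_health() -> None:
    """Print each target's health state and whether the group is failing open."""
    response = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    states = []
    for description in response["TargetHealthDescriptions"]:
        target_id = description["Target"]["Id"]
        state = description["TargetHealth"]["State"]  # e.g. healthy, unhealthy
        print(f"{target_id}: {state}")
        states.append(state)
    if states and all(state == "unhealthy" for state in states):
        # With every target unhealthy, the load balancer fails open and
        # routes requests to all of them anyway.
        print("All targets unhealthy -- the load balancer is failing open")
```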
Next, we'll have a look at dependencies. As we mentioned earlier, we've got a dependency within our application, which is our external API that we're making a call to to get our recommendations. In our code here, when we get a request that comes in from a user, we're making a call to this get recommendation function, and then we're parsing the values of the response, a TV show and a username, that we get back from our recommendation engine.
Now this is called a hard dependency because if this dependency
call fails, users will get an HTTP 502
or 503 error. That means that they can't actually get anywhere
with our application. It doesn't work, it won't do what they
need it to. What we can do is change this into
a soft dependency. So we would have a try statement and
in here, we'd have that same code that we just had on our previous slide
that would try to make that recommendation call, but this time we add an
exception clause. What this means is if that call fails for
any reason, we would provide the customer with a static response,
and we'd recommend the TV show I Love Lucy to them. Now, we would then also provide them with some diagnostic information in their browser, which would just say: we can't provide you with a personalized recommendation at the moment; if this problem persists, please contact us. And then we'd provide them with details of the error code. Now, yes, this does mean they're
not going to get a personalized recommendation, but it does mean that they'll still be
able to access that application. And if this application forms part of a
larger app that you're building, the whole application will still continue
to function. The recommendation engine just might not recommend exactly what they want, but they'd know that, and we'd be giving them a predefined static response, meaning that they can still continue to do what they need to do.
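A hedged sketch of that soft dependency pattern is below. The get_recommendation stub stands in for the real call and is made to fail so you can see the fallback path; the static show and the wording of the message are just illustrative.

```python
import traceback

def get_recommendation(user_id: str) -> dict:
    # Stand-in for the real dependency call; fails here to show the fallback path.
    raise ConnectionError("recommendation service unreachable")

def render_recommendation(user_id: str) -> str:
    try:
        # Preferred path: a personalized recommendation from the dependency.
        rec = get_recommendation(user_id)
        return f"Recommended for {rec['UserName']}: {rec['TvShow']}"
    except Exception:
        # Soft dependency: fall back to a predetermined static response,
        # plus enough diagnostic detail for the user to report the problem.
        error_detail = traceback.format_exc()
        return (
            "Recommended for you: I Love Lucy. "
            "We can't provide a personalized recommendation at the moment; "
            "if this problem persists, please contact us.\n"
            f"Diagnostic information:\n{error_detail}"
        )

print(render_recommendation("1"))  # with the failing stub, the static fallback is returned
```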
So, having a look then at some of these best practices from the previous slides that we went through. When component dependencies
are unhealthy, the component should still function,
but in a degraded manner, as we saw in our previous example. If you can't
provide a dynamic response, use a predetermined static response
so that you can still provide your customers and users with something.
We should continuously monitor all components that make up our workload
to detect failures, and then use some type of automated
system to be able to become aware of that degradation and take
appropriate action. In our case, that's our load balancer. Our load balancer
detects that an instance has become unhealthy and then removes that instance
from being able to have traffic sent to it. We also use our load balancer
to make sure that if we have an unhealthy or a failed resource, we have healthy resources available so that we can continue to serve requests, and our Auto Scaling group helps to provide those resources that our load balancer can
then send the traffic to. You'll note the health check that we had earlier operates at the application layer, or the data plane. This indicates
the capability of the application rather than the underlying infrastructure.
We want to make sure that our application is running
rather than focusing our health checks on the underlying infrastructure.
A health check URL, as you saw in our examples earlier,
should be configured to be used by the load balancer so
that we can check the health status of the application. We should also look
at having processes that can be automatically
and rapidly brought in to mitigate any impact on our workload's availability. This should remove the cognitive burden that
we place on our engineers so that when we're looking at these errors,
we have enough information to go on to make an informed decision about
what the problem is. An example of how we can do this is by providing
that static response and including the exception traceback
errors in any error messages that we provide so that we have more
detail on what the application issue is. We should also look
at fail open and fail closed behaviors. When an individual server fails a health check, the load balancer should stop sending traffic to that server or instance immediately. But when all servers fail, we should revert to fail open and allow traffic to all servers. As I mentioned, there are some caveats around this, like making sure that we're testing the different failure modes of the health checks and the dependencies, so that we can see exactly what's going on within our application.
To wrap up with some conclusions: you will find servers and software fail for very different and very weird reasons. Physical servers will have hardware failures. Software will have bugs. Having multiple layers of checks, from lightweight passive monitoring to deeper health checks, is needed to catch the different types of unexpected errors that we can see.
When a failure happens, we should detect it, take the affected server out of service as quickly as we can, and make sure that we're sending traffic only to healthy instances. Doing this level of automation against an entire fleet of servers does come with some caveats. We should use rate-limited thresholds or circuit breakers to turn off automation and bring humans into the decision-making process.
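As a toy illustration of that kind of circuit breaker, not anything from the lab, a sketch in Python could look like this: if too many automated replacements happen inside a short window, stop and page a human instead.

```python
import time

class ReplacementCircuitBreaker:
    """Toy circuit breaker: pause automated instance replacement if it fires
    too often in a short window, and bring a human into the loop instead."""

    def __init__(self, max_replacements: int = 3, window_seconds: int = 600):
        self.max_replacements = max_replacements
        self.window_seconds = window_seconds
        self.replacement_times = []

    def allow_replacement(self) -> bool:
        now = time.time()
        # Keep only the replacements that happened inside the sliding window.
        self.replacement_times = [t for t in self.replacement_times
                                  if now - t < self.window_seconds]
        if len(self.replacement_times) >= self.max_replacements:
            print("Circuit open: pausing automation and paging a human")
            return False
        self.replacement_times.append(now)
        return True
```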
That's like the example where we use fail open: we're providing a safety net so that we can bring a human in to help us with the diagnosis. So using these fail open behaviors when we have all servers in a pool unhealthy can really help to provide that additional safety net that we need to make sure that we're able to correctly diagnose the issue and return our application to a healthy state.
So finally, a couple of calls to action, folks. If
you've heard anything in today's session that interests you,
I'd recommend going and having a look at the Amazon Builders Library
and this specific article on implementing health checks. It goes into
much more detail than what I've covered in this session,
and talks through a little bit more about how Amazon uses some of these technologies
to improve the reliability of the workloads that our customers
use. There's also a collection of multiple other articles on the Builders Library
that will help you with understanding how Amazon and AWS
implement some of these best practices. The actual architecture that
I went through today and the code is all available as part of a
series of AWS Well-Architected labs that we've published.
The specific lab for this session is available following
this link, but there's a collection of over 100 labs covering all
of the different areas of reliability, security,
cost optimization, performance efficiency, and operational
excellence that you can go and access. The labs are all
open sourced on GitHub, so you can take the code and you
can use it to help yourself learn. With that, I'd like to thank you all for your time attending this session today. I really hope you do enjoy the rest of the Conf42 Cloud Native conference, and I look forward to seeing you at the next one. Thanks again everybody, and bye.