Transcript
This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again.

Good morning, good afternoon, good evening, wherever you are in the world. My name
is Nick Hodges. I'm the developer advocate at Rollbar, and I'm here
today to talk to you about building self healing
systems and how they work. All that fun stuff about
automating the process of managing your application.
Before we get rolling, I'll talk a little bit about myself. As I said,
I'm a developer advocate. I preach the good news about Rollbar
throughout the developer community. I'm a longtime developer and
development manager. I'm a Delphi guy way back.
Anybody out there remember Delphi? I've written
a couple of books on Delphi. I did Turbo Pascal to start out, and lately I've been working on Angular and TypeScript, a really cool language and platform for developing web applications.
I'm a former naval intelligence officer, did that for twelve years.
I will tell you that it is about 142
times more boring than they show it
on tv, in the movies. And finally, I'm obsessed with
pistachio nuts. If you come over to our house on a Sunday afternoon to watch
NFL football, don't get between me and the pistachio nuts
bowl. Before we get started, I just want to talk a little bit about Rollbar's mission and values. Our mission is to help developers build software quickly and painlessly so they can get back to building cool new features instead of spending a lot of time fixing bugs. We like to think about
ways to make developers more productive and to
make it as fun and interesting as possible. Our values are
honesty, transparency, pragmatism, and dependability.
We like to show those in everything we do, from the sales process to
presentations like this. So if you see us not following those values,
hold us to it. So the topic for the day,
of course, is self healing systems. Self healing systems are kind of self explanatory, but I scoured around the Internet and this is the best definition I found: it's the idea that a system can recognize that it's not working right and fix itself without any external assistance.
So this could mean something like turning off a feature flag or
rolling back to a previous release, or somehow,
some way getting itself into a working state.
So self healing systems are something relatively new, mainly because at this point, with SaaS software, we have control over the entire application. In the old days you used to have to distribute executables out to your customers. And of course everybody's system was different. Everybody had their own individual application that they were running. And sometimes a bug would appear on a system out there, and there was no way for it to self heal, because you would have to reproduce the customer's environment as best you could in order to reproduce the bug. Self healing systems are possible because SaaS systems are internal to our environment.
They're built inside our own cloud. The client application
that is built runs on the web and you have complete
control over that client application. Now it's still true that you deploy mobile
applications and they will suffer the same problem with versioning and bugs,
but at least with your web system and your built in
back end, you have complete control over the system, enabling it
to self heal. Before we talk about self healing systems specifically, we can talk about the levels of self healing systems. In other words, what are the three basic levels of ways that systems get repaired?
One of course is the manual change. A bug is reported to you by a
customer. You open up the product in the debugger, in the old school
way, you find the bug, you fix it, and you return a fixed version to the customer. That happens entirely manually. There's no automation process going on.
Level two, if you will, is some automation plus manual steps. It requires that you have some automation involved. For instance, you may have a system that monitors your website to let you know when there's an error, but it doesn't really help you or point you in the direction of where that error is. You have to manually fix the error yourself and deploy the system manually, or deploy it through your CI/CD pipeline.
Level three is the fully automated system, the self healing system. That is a system that is able to detect errors and resolve them, not in the sense of fixing the bug, but in the sense of dealing with it in such a way that the state of the system is returned to equilibrium so that it works properly. That's a fully automated system, and we use the term self healing to describe level three.
So what are the requirements for a system to be self healing? I've talked a
little bit about it already. One is that it has to always be up; that is, the application front end has to be available and working properly at all times.
You don't want a system to be down for maintenance,
because then it really isn't self healing.
If a change to the deployment process, or a change to the way in which the application is run, requires the application to be brought down and not function, then it can't really be self healing. In addition, a self healing system needs monitoring. Without monitoring, you can't self heal, because
you don't even know that there's a problem. You need to see that problem in
order to be able to fix it. And obviously, for a self healing system to
work, it needs to have a restoration mechanism. It needs to
have a way for it to be able to repair itself
or get itself back into a known working state.
These three things are what a self healing system needs
in order to actually function as a self
healing system. So now we can talk about the characteristics of a self healing system. First, we need the ability to
deploy to any server, not just your specific server that you're
running and managing yourself, but to any server. In other words,
we need to deploy it to a cloud environment.
So your servers need to be cattle, not pets. If you're not familiar with
the term cattle and pets, it refers
to the ways servers are handled. For instance, if you have a pet server,
that's a server you own, you take care of it, you feed it,
you manage it, you care for it, you give it a name, all that
kind of stuff. It's a pet. Cattle servers, by contrast, are just a herd of servers where you don't necessarily care so much about any individual member, but about the herd as a whole. And so you deploy
your application to cattle when you deploy it to
a cloud environment that has many, many systems that
run your application instead of just one specific pet server.
So for a self healing system to work, it needs the ability to
deploy to a cattle based system. You also
need the ability to deploy without downtime. This is normally done
through blue green deployments, where in the
most simple form, you have a blue machine and a green machine. The blue machine
is running and the green machine is updated.
And then the load balancer, or whatever type of system you're using to manage your traffic flow, switches from blue to green in one instant, and no downtime occurs as a result. And then the blue machine can be updated to be
a green machine. And then you can turn both of those machines on,
and the load balancer knows about everything that's up and automatically
feeds traffic to working systems.
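To make that flow concrete, here is a minimal sketch in TypeScript. The LoadBalancer interface, setActiveTarget, and the /health endpoint are illustrative assumptions, not a specific vendor's API; the point is simply that traffic only flips once the freshly updated environment passes a health check.

```typescript
// A minimal blue-green cutover sketch. The LoadBalancer interface and the
// /health endpoint are hypothetical stand-ins for whatever your proxy or
// cloud load balancer (an nginx upstream, a target group, etc.) exposes.

type Color = "blue" | "green";

interface LoadBalancer {
  // Point all traffic at one environment; assumed to switch atomically.
  setActiveTarget(target: Color): Promise<void>;
}

async function isHealthy(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    return res.ok;
  } catch {
    return false;
  }
}

// The idle environment has already been updated with the new release;
// only flip traffic once it reports healthy, so users never see downtime.
async function cutOver(lb: LoadBalancer, idle: Color, idleUrl: string): Promise<void> {
  if (!(await isHealthy(idleUrl))) {
    throw new Error(`${idle} environment failed its health check; keeping current target`);
  }
  await lb.setActiveTarget(idle); // instant switch, no downtime
}
```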
Third, we need the ability to monitor the hardware needs and
adjust dynamically. This is where cattle versus pets come in,
right? If you monitor the hardware needs and see that your hardware, say your CPUs, are being stressed, or that there's too much network traffic going through any specific machine or group of machines, you can spin up new machines, new systems, in order to dynamically adjust and respond as the demand for your application increases. Finally, there is the ability to
monitor services in real time and react to failure, i.e., the chaos engineering portion. You need to be able to track the services that your application requires, and if any of them goes down, you need to be able to react to the failure of that particular system. You need to be able to respond correctly if one of your services suddenly fails. In order to be self healing, you need to be able to deal with the failure of any of your given microservices. And finally, you need the ability to predict the future and take action based upon that knowledge. Of course, we don't really know the future, but I'm talking about things like knowing that every Monday morning at 09:00 your system demand increases drastically. You know that, and so you can schedule the increase of your cattle. You can buy more cattle, i.e., more servers, and spin them up ahead of any particular time that you know is going to require more processing power because of the demand on your application.
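As a sketch of both of those ideas, reactive scaling and scheduled "we know Monday morning is busy" scaling, consider something like the following. The CloudProvider interface, the group name, and the thresholds are illustrative assumptions; in practice an auto scaling group or a Kubernetes autoscaler plays this role.

```typescript
// A minimal autoscaling sketch. CloudProvider is a hypothetical stand-in for
// your real infrastructure API; the thresholds and counts are illustrative.

interface CloudProvider {
  getAverageCpuPercent(group: string): Promise<number>;
  getDesiredInstanceCount(group: string): Promise<number>;
  setDesiredInstanceCount(group: string, count: number): Promise<void>;
}

const GROUP = "web-cattle"; // the herd, not a pet

// Reactive scaling: respond to measured load.
async function scaleOnDemand(cloud: CloudProvider): Promise<void> {
  const cpu = await cloud.getAverageCpuPercent(GROUP);
  const current = await cloud.getDesiredInstanceCount(GROUP);
  if (cpu > 80) {
    await cloud.setDesiredInstanceCount(GROUP, current + 2); // grow the herd
  } else if (cpu < 20 && current > 2) {
    await cloud.setDesiredInstanceCount(GROUP, current - 1); // shrink it again
  }
}

// Predictive scaling: demand spikes every Monday at 09:00, so buy more
// cattle shortly before then rather than waiting for the CPUs to be stressed.
async function scaleOnSchedule(cloud: CloudProvider, now: Date): Promise<void> {
  const monday = now.getDay() === 1;
  const justBeforeNine = now.getHours() === 8 && now.getMinutes() >= 45;
  if (monday && justBeforeNine) {
    const current = await cloud.getDesiredInstanceCount(GROUP);
    await cloud.setDesiredInstanceCount(GROUP, Math.max(current, 10));
  }
}
```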
So what is the common thread between the problem and the solution?
Error tracking. That's what's required in order to
both create a self healing system and to solve
the problem of an error that happens.
In order to know that you need to self heal, you need to know
that there's something to heal from. You need to know that there's an error.
And the ability to accurately track and report errors is critical to self
healing systems. Errors are part of the problem, because they create the need for self healing systems. And error tracking is part of the solution, because by knowing and understanding your errors, you can properly respond to them, know whether the system needs to react to any particular error, and know how it might react to different kinds of errors using automation.
Now, errors are everywhere. They occur in every application.
Sometimes they're a problem, sometimes they're not a big deal. In a SaaS
application, errors occur probably almost
constantly. There may be errors that happen
that are probably not a big deal, and you don't need to worry about them.
Someday you'll take care of them. Maybe you don't even need to take care of
them. Maybe there's an error that pops up that is inconsequential
and doesn't have any effect on your application at all. Maybe there's an error that's
critical and needs to be dealt with immediately.
But in any event, errors are everywhere. And so let's talk
a little bit about the nature of errors and the nature of error tracking,
because obviously, that's what is the critical portion here of what we're
talking about with self healing systems. So here's an example of a
relatively crude and imperfect error
detection system. When I was a kid,
we had Christmas lights that looked like this instead of the little teeny ones that
we have now. And they were connected in what I believe you'd call series, so that each bulb's ability to light up depended on the bulb before it.
And if any one bulb on the entire string failed,
the entire string failed to
light up because the electricity couldn't flow through every single bulb.
And this was a very crude way to detect an error.
You would know that there was an error in one of the bulbs by the
fact that the light string didn't work, but you
wouldn't know which bulb. So you'd literally have to get a known good bulb and replace every bulb in the string in order to find the one that was bad. Sometimes more than one bulb was bad, so the string wouldn't light even after you'd replaced a bad bulb with a good one. And so this is a
very crude and difficult way to detect errors. And this
might be analogous to your ability to detect errors on a remote machine, for instance a customer's machine: the classic "works on my machine, but it doesn't work on the customer's machine" problem. Those issues are mostly gone today with the advent of SaaS applications.
But to a large degree, this was a way that errors were detected in
software systems. Now, here's another system that reports a
problem, and it tells you that your car
is overheating. It doesn't tell you why, but it does at least
tell you that your car is overheating. Overheating is a symptom
of any number of different issues with your car, but it
is a slightly better way of reporting an error, in that you know your car is overheating, and you know that there's a problem, so you can turn the car off and fix the problem. But again,
this doesn't work for a self healing system.
Probably a better metaphor for a self healing system is the notion of a
circuit breaker. We all understand circuit breakers. We have them in our homes.
A particular circuit in your house's electrical system becomes overloaded, the amperage is too high because you've plugged 700 different items into it, and the circuit breaker blows.
Well, this is a very effective technique for error reporting because, first, it solves the problem of the error. It prevents a
house fire, for instance, whereas if the system did not have a
circuit breaker or a fuse, if you will,
then the electricity in your house might overload in that
particular circuit and start a fire, which is very bad. So the circuit breaker prevents
a fire. It also tells you exactly which circuit the error is
occurring on and allows you to go and make changes
to the system that you've got plugged in to that particular circuit.
So circuit breakers are a good symbol
for self healing systems, because an error reported
by some type of circuit breaker mechanism, a critical error
that is fired, allows you to respond quickly
and algorithmically to the particular problem. For instance, with a SaaS application, the firing of a critical error, i.e., a circuit breaker snapping, may result in you rolling back to a previous version, or turning off the feature flag for the feature that the error occurred in.
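The same idea shows up in software. Here is a minimal software circuit breaker sketch in TypeScript, not any particular library's API; the onTrip hook is where an automated response, such as turning off a feature flag or kicking off a rollback, would be wired in.

```typescript
// A minimal software circuit breaker. After `threshold` consecutive failures
// the breaker opens: calls fail fast instead of hammering a broken dependency,
// and onTrip fires so something automated can react. (A real implementation
// would also support a half-open state for probing recovery.)

type BreakerState = "closed" | "open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;

  constructor(
    private readonly threshold: number,
    private readonly resetAfterMs: number,
    private readonly onTrip: () => void,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      this.failures = 0; // any success resets the count
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.state = "open";
        this.onTrip(); // the "breaker has blown" signal
        setTimeout(() => {
          this.state = "closed";
          this.failures = 0;
        }, this.resetAfterMs);
      }
      throw err;
    }
  }
}
```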
So the circuit breaker is actually a pretty good error reporting
mechanism, allowing you to deal with errors
in an automated way, because all too often this
is kind of what happens. In other words, everything's going along great, and all
of a sudden, boom, you've got a problem.
Response times degrade badly, there's something wrong, but you don't know what to do. You have no idea why that particular thing is happening. So you need an error reporting system that gives you clean, clear information about why something like your transaction time here has suddenly jumped from a very acceptable level to levels that are not at all acceptable.
That type of error reporting is critical to what it
is you want to do in terms of building a self
healing system. So this is the problem that
we face, right? We have this sudden influx of errors. We have
this sudden problem that we notice, and we don't necessarily know why it's occurring.
Now, traditional monitoring, i.e., log files and different types of monitoring systems, doesn't necessarily give you what you need to understand and respond, because there's a huge amount of data coming in, and that data needs to be processed in ways that human beings aren't very good at. Nobody has ever enjoyed
spending time poring over log files, looking for patterns or trying
to figure out exactly where the error occurred and why the error occurred and what
was going on when the error did occur. So we need a better
way to see where we're headed when we want to find
out where an error occurred, why it
occurred, and how we can know exactly what the problem is and where
we can go to fix it. You can't automate the response for every error,
but you can automate the response for some errors, if you have that particular level of intelligence, knowledge, and trust in the way a specific error is tracked, managed, and reported to you. If you trust that the error is truly critical and that it does require a feature flag being turned off, then you can do that automatically, which is very powerful, and it keeps people from being woken up in the middle of the night, which is a very good thing.
Now, one thing that factors into all of this is that a developer's job
has changed over the years. It used to be that our
process limited change. We did waterfall.
We manufactured the software and actually put it in a box
and it had to be perfect before release. When we were done was the point when we said, okay, we're going to ship this thing. And shipping actually involved burning it onto a CD and mailing that CD, the box, and the manual out to customers, who then opened the box, pulled out the CD, and loaded it onto their individual machines one at a time. So perfection was sought there through compiled languages, unit testing, strong QA, and the ability, or the desire anyway, to fix every single known bug before shipping. But now we're
much more agile. We do research into what the issue is and
we fix it on demand. We can fix a bug
immediately with a SaaS system. We improve with every release, every release is slightly better, and we're never done, because the application is constantly iterating on a minute-by-minute basis, really. A company like Amazon will deploy thousands of times a day to improve the quality and features of their system. The process actually encourages change instead of limiting it.
We fix the bugs that matter the most first,
and we achieve fast iterations with automated releases
and automated feedback. This is the method and the
means by which automated systems are possible.
So these new types of systems have error reporting that's very different
from previous methods. The errors come in as different types, different categories, with different pieces of information, and they're reported to us in a very generic way. They may have color to them, for example, when they happen, but they come into our system, into a log file, in a very colorless way. And what you really want is for those errors
to be grouped together and classified in
such a way that you can tell whether they're new, reactivated, resolved, critical, not so critical, or information only. So in order to respond
in a self healing manner, you need to know what those critical errors are
versus what those just informational errors are that occur inside
of SaaS systems. You need to group and
organize and manage them. And that's something that's very difficult for humans
to do as they look through log files. But it's something that's very easy for a system like Rollbar to do using machine learning. It's a very powerful capability: grouping the errors by giving each error a fingerprint and organizing those errors according to those fingerprints. So this is
a real basic look at the way Rollbar works and allows for self healing systems. It's really straightforward. First, it sees errors. It allows you to track errors by grouping them and seeing them as they occur in very specific patterns. So if there are ten different errors coming into your system, Rollbar can see the distinction between all ten of them and can report on the frequency and characteristics of those errors.
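As a deliberately simplified picture of what a fingerprint can mean, the sketch below hashes an error's class plus the top of its stack trace, so repeated occurrences of the same underlying bug land in the same group. Rollbar's real grouping is far more sophisticated than this; the code is only meant to illustrate the idea.

```typescript
import { createHash } from "node:crypto";

// Toy fingerprint: hash the error name and the first few stack frames,
// with line/column numbers stripped so small code shifts don't split groups.
function fingerprint(err: Error): string {
  const frames = (err.stack ?? "")
    .split("\n")
    .slice(1, 4)
    .map((frame) => frame.replace(/:\d+:\d+\)?$/, ""))
    .join("|");
  return createHash("sha256").update(`${err.name}|${frames}`).digest("hex");
}

// Occurrences sharing a fingerprint are counted as one item, so ten distinct
// errors show up as ten groups, each with its own frequency.
const occurrences = new Map<string, number>();

function record(err: Error): void {
  const key = fingerprint(err);
  occurrences.set(key, (occurrences.get(key) ?? 0) + 1);
}
```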
So if you have an exception that you've raised and marked as a critical exception, one that you know you need to immediately fix, deal with, and respond to, Rollbar can identify that type of error for you and allow a hook to take place, a webhook or some other type of hook, letting you respond to that particular type of error because of that grouping and matching of errors and the fingerprint that each error gets.
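For example, reporting an exception at the critical level with the Rollbar Node SDK might look roughly like this; the chargeCustomer and submitPayment functions are hypothetical, and the exact configuration options should be checked against the current SDK documentation.

```typescript
import Rollbar from "rollbar";

// Rough sketch of severity-tagged reporting with the Rollbar Node SDK.
const rollbar = new Rollbar({
  accessToken: "POST_SERVER_ITEM_ACCESS_TOKEN", // your project token
  captureUncaught: true,            // report uncaught exceptions automatically
  captureUnhandledRejections: true, // and unhandled promise rejections
  environment: "production",
});

// Hypothetical downstream call, used only for illustration.
async function submitPayment(orderId: string): Promise<void> {
  // ... call the payment service ...
}

async function chargeCustomer(orderId: string): Promise<void> {
  try {
    await submitPayment(orderId);
  } catch (err) {
    // Marked critical: this is exactly the kind of error we want automation
    // (a webhook, a rollback job, a feature-flag kill switch) to react to.
    rollbar.critical("Payment service failed", err as Error, { orderId });
    throw err;
  }
}
```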
So say Rollbar then determines that your error is a critical error. There's a service that's down, and the reason you're not able to access it is an exception saying that the service isn't getting what it expected. Some change in your most recent deployment has caused a service to not respond correctly, because you're not asking correctly for information. That change in your system has caused an error, and you know that perhaps the way to deal with that is to roll back to the previous known working state and send a message, an email, a Slack message, or a PagerDuty alert to the developer in question. The system is thus healing itself and reporting back that there was a problem.
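As a sketch of what that automated response can look like on the receiving end, here is a hypothetical webhook handler. The payload fields and the disableFeatureFlag, rollBackRelease, and notifyOnCall helpers are illustrative assumptions standing in for your own deploy tooling and paging integration, not a documented payload format.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical stand-ins for your feature-flag service, deploy tooling,
// and chat/paging integration.
async function disableFeatureFlag(flag: string): Promise<void> { /* ... */ }
async function rollBackRelease(target: string): Promise<void> { /* ... */ }
async function notifyOnCall(message: string): Promise<void> { /* ... */ }

// An error tracker (Rollbar or otherwise) calls this URL when a matching
// item occurs; the field names below are assumptions for illustration.
app.post("/hooks/critical-error", async (req, res) => {
  const { level, fingerprint, featureFlag } = req.body ?? {};

  if (level === "critical") {
    if (featureFlag) {
      await disableFeatureFlag(featureFlag);        // kill just the bad feature
    } else {
      await rollBackRelease("previous-known-good"); // or revert the whole deploy
    }
    await notifyOnCall(`Self-healed after critical error ${fingerprint}`);
  }
  res.sendStatus(204); // acknowledge the hook either way
});

app.listen(3000);
```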
This is a great example of saving somebody a phone call in the middle of the night from support because of an error. In the olden days,
back in the olden days, say five years ago, ten years ago,
one of the things you would have to do when an error occurred was call somebody to get them to fix it, particularly if it was a
critical error where customers were, say, losing money or customers
weren't able to spend money, which of course would be a big problem for your
business. But because you can now, through tools like Rollbar, identify and recognize specific errors as critical, the self healing system can fix that by returning the system to a known good state, and somebody can just come in in the morning and deal with the underlying bug.
So finally, yes, Rollbar can automatically roll back a release. It can automatically turn off a feature flag, and it can automatically recreate an environment that you know was working correctly before the deploy happened. Now, you might wonder, why would
you not know this right away? You might. And you might know immediately
that a problem occurs with a new feature or with a new release,
but you might not. Maybe the problem occurs because of a character set or a time zone that is different from yours, one that you haven't tested with, perhaps. And maybe that happened halfway around the world, in Japan versus the United States, or vice versa. And maybe the error occurs simply because a customer does something in a specific way that no one has ever thought of. Customers are known for being chaotic, chaotic neutral probably, my guess would be, and there might even be some chaotic evils out there. But in
any event, you want to be able to deal with those errors and roll back to a known good state, whatever that means, whether it be a feature flag
or a previous version or a previous release, in order
to be able to correctly and properly respond
and heal the system without human intervention and
without having to call somebody at 02:00 in the morning.
So, in summary, self healing systems are possible. They're only
going to get more capable as we move forward.
Tracking and learning about errors is the critical part of any self healing system. Knowing what the nature of the problem is, how often it's occurring, and where it's occurring is all very, very critical. And Rollbar is the system that lets you self heal.
So I'd like to thank you for your attention today. Please feel free
to reach out to me. You can contact me on my contact
page at try rollbar.com nick.
And thank you for your attendance and your attention.
I'm happy to answer any questions.