Transcript
This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems,
and observing changes, exceptions, and errors in
real time, allows you to not only experiment with confidence
but respond instantly to get things working again.
Good afternoon, good evening, good night, wherever you are
joining us today. Thank you for joining us at Conf42
Chaos Engineering 2022.
My name is Rain Leander. I am a developer advocate with
Cockroach Labs, and full disclosure,
this is a talk about failure,
mistakes, lessons learned.
And I want to start by recognizing the
two major reasons why people come to
chaos engineering. One is that,
like me, you're curious about
this chaos engineering thing, and you'd like to apply it
to projects that you already have or projects
that you're planning to have. And you think, you know what, I'm going
to get my hands dirty. I'm going to start playing with the tools that are
out there. And in this experiment, I played with Gremlin.
Yes, there might be a part two to this experiment,
and you'll understand why soon enough. The other
reason why people might come to chaos engineering
is because some catastrophic event
has happened in production, and they realize that
they want to set up sufficient checks and balances
to prevent that from ever happening again. I will
say, thank goodness I have not had that experience
directly. I have worked for companies that
have had that experience, and I
have not enjoyed being the one on pager
duty. But in this case, for this experiment here
I was the curious student making
a mess, learning from things and whatnot.
That's where I'm coming from. And I think that's important to acknowledge first,
because if you're watching this and you've experienced a
catastrophic event, maybe you want to watch this on one and
a half speed or more, and really just
start playing with your environment and the different
tools available. So let's start out with
what happened. Initially,
I had heard of chaos engineering. I work at
Cockroach Labs,
and I found this
Medium post, which I should
probably share the link to, which I will do later,
basically about hardening CockroachDB. And I read through it, and it
was about taking Gremlin and, using
the AWS platform, installing three
instances of Ubuntu within a VPC so you can enable
private and public networking.
It was using CockroachDB dedicated, the
open source version, which is important later,
and Gremlin. And so I signed up for
a Gremlin free account. I played with it a little bit. I noticed
a few things, which I'll go into later,
but then I decided, okay, so my plan is, instead of
using the AWS platform,
I kind of would like to avoid that platform because
my experience with AWS is not that high. And so I wanted to
keep it simple, maybe install it on virtual systems
on my laptop, maybe use a
platform that I'm used to working with, like Netlify or
Vercel or Heroku, but basically
not the AWS platform.
Instead of installing CockroachDB on
three Ubuntu systems, I wanted to
use CockroachDB Serverless, which is
a service where you don't have to worry
about the actual installation
or anything like this. It's a free service you can sign up for. Super easy.
I already was skeptical based on playing with
Gremlin, but I'll tell you why in a minute. And then
I thought, you know what? Yeah, I'm going to use Gremlin
because I did go in, I played around, I was like yeah, this could work,
this could work. So what
actually happened is
first of all, CockroachDB Serverless,
there are lots of benefits to it. It's made so that a
developer like myself can spin up
an application, point it to the database and not
worry about it, not worry about sharding or scalability or replication,
nothing. And because of that, it's kind
of like how when a Linux developer
on Ubuntu, who's used to being able to go into terminal and
hack on nodes and whatnot,
then gets a Mac or a Windows system
from the 90s where that kind of access wasn't really as
available. And unfortunately within serverless
there's not a way for an administrator
to then go into the database and
allow access to Gremlin.
For example, because Gremlin requires a specific secret
key, a team key, certain identifications
just like any API, so that it can tell what
you are observing.
So there was that. So then I switched to
CockroachDB dedicated, the open source version
that I could download, and I downloaded it onto a Docker
image, onto my Mac,
which is by the way, what I was using this whole time
was my laptop because I wanted to
replicate this as cheaply and as simply
as possible.
So number two was that
I was like, okay, so I've got
CockroachDB installed in Docker on
my system. Can't use Serverless, oh well,
I've got CockroachDB installed. So then I go
to the Gremlin aspect and I'm like, okay, so how do I get Gremlin
into my system? And it turns out that Gremlin
doesn't like Macs, and I
didn't notice that when I was setting up the project. So all of a sudden
I now had to use Docker again to install
Gremlin, which is fine. It worked like a dream,
even had a demo. These docs, by the way,
are available on cockroachlabs.com and gremlin.com
respectively. It was pretty smooth, except that
again, the reason why I wasn't using AWS was because I
wasn't that comfortable. And now suddenly I was using Docker, which I'm
also not very comfortable with. All of this
may have been avoided if I had more Docker
experience. And if you're watching this right now and
you have that Docker experience, maybe it was
as simple as combining a YAML file.
But for me, I did not know how to take my
Docker images with CockroachDB and
my Docker images with Gremlin and
get them to observe each other or communicate
in any way. The other
aspect of that was that I found that with CockroachDB
on Docker, one of the nodes was constantly
failing. And this may have been my
version, because the open source version is a little bit older than the latest version.
It could have been any number of reasons. Maybe my laptop just
can't hang, it doesn't have enough resources on it, but for whatever reason,
one node was constantly failing, and
it didn't matter if I reinstalled or if I dedicated more
resources; it was not there.
So I decided to take out the Docker aspect completely
and I switched to Kubernetes, specifically minikube.
And that resolved the node issue with CockroachDB.
I know, I have no idea, to be clear,
why it was failing on Docker but not minikube.
But again, the docs for the minikube
installation of CockroachDB are on cockroachlabs.com.
And then finally, because the talk
was coming up, I decided to embrace AWS and
see if I could get it to work exactly the way the original writer on
medium had gotten it to work. And I went and
created three instances
on AWS and
started to install cockroach.
It was an incredibly painful
6 hours of my life, and I decided
that instead of bashing my head against the
things that I don't know, that I would instead package
everything up and bring it here to
Conf42 Chaos Engineering.
Now here's what I learned.
If the plan doesn't work, change the plan, not the goal.
And two: losers don't quit. Winners fail until
they succeed. And those two things are: fall down
seven times, get back up eight. Don't give up.
So while I haven't given up on this project,
I indeed have a deadline. This conference is
the 10th of March, and as of when
I recorded this, this is where I am.
I also need to find out why that
one node kept failing on Docker and
if that is a systemic thing that is maybe resolved
in a more recent version of CockroachDB, or if it's
something to do with my Docker setup. And then
I realized I don't know Docker as well as I thought I did.
I have spun up Docker images before, I have
put applications on Docker before, but apparently this was
just beyond my knowledge and
so it was humbling.
So let's talk about the next steps.
One, I'd like to try those original
platforms that I had in mind: Netlify, Vercel,
Heroku. I would like to spin up an application,
a simple one, it can be just a leaderboard or a
website, on one of these
platforms that I'm more comfortable with and see
if I can get Gremlin to recognize
it and test it. This would require
that Netlify, Vercel, or Heroku allow
that kind of observation and testing.
Netlify and Gremlin do have a brand new
relationship, but not enough to have documentation
yet. So I have hope. There
are other chaos engineering tools besides Gremlin.
There's Chaos Toolkit, there's Jepsen,
there's Litmus. They're probably at this conference.
I'm going to hear about tons of other tools.
It may be that those other tools are the answer.
If you have experience with other chaos engineering tools
that you think you could help me with, I would love to hear from you.
Maybe there's something else about Docker. Maybe there's someone
watching this who's like, "I am a Docker expert and this sounds
intriguing," and maybe I do just need to merge
a YAML file. That's a possible
next step. And then finally,
ultimately, the next step is to present at Conf42
Chaos Engineering and ask for help.
I'm Rain Leander on most of
the socials. My email address is
rain@cockroachlabs.com, and I
would love to collaborate with you. Let's try these next
steps. Or if you know of something, if you work for
Gremlin, I would love to hear from you.
I hope you have come along with me
on this journey. I hope that when you run into issues that
you don't give up, especially if you're coming from a place
of curiosity, of adventurous learning, rather than
from running from a catastrophic event. I hope you
don't give up even though your motivation
may not be exactly the same as
if you were trying to fix something majorly wrong.
If you have any questions, I'm Rain on Discord,
and thank you to Conf42
so much, the organizers and sponsors of this conference.
And I will see you online.