Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE,
a developer, a quality
engineer who wants to tackle the challenge of improving reliability
in your DevOps? You can enable your DevOps for reliability
with ChaosNative. Create your free account at
ChaosNative Litmus Cloud. Hello
everyone, and thank you for joining me here today at Conf42
Site Reliability Engineering. My name is Robert Barron,
and I'm an AIOps and site reliability engineering
solution architect at IBM. What that means
is basically that I help IBM's clients adopt SRE
for themselves. Now, it's not very often
when one's hobbies align with one's professional
interests. And in this case, I'm very lucky because the history of space
exploration is something that I've always been fascinated with.
And what I'm going to do is take you on a little story, a little
historical storytelling expedition to show you how
the International Space Station is actually very similar to the
microservices that we are developing today, except that some of
them are in the cloud and some of them are high above the
cloud. So let's start off with thinking about
why we need a space station in the first place. In the early days
of space exploration, the 1960s and
early 70s, basically, we were going out into space,
seeing what it's like, seeing what we can do there, seeing how we could function
in space. Only later, we started actually
working in space, doing things with not merely exploration,
but also productive activities that would bring resources
back to Earth. So this is very similar
to the early days of development versus actually getting
into a production environment and generating additional
value beyond what we've put in.
So another way of looking at it is a space mission. For example,
going to the moon is basically launching the spacecraft,
going through a process of deployment,
reaching our target, doing whatever we want
to do there momentarily, and returning. It's very much like
a CI/CD pipeline deploying a
new feature where what we're concentrating on is the process of
deployment itself and not so much on what we're doing with
what we've deployed, because in the context of a space flight,
it's done. So the
big difference between a spacecraft and a space station is that a spacecraft
is a temporary activity, whereas a space station
is a permanent presence. We've got multiple crews doing
multiple things in the space station over time, replacing themselves,
modifying the space station itself as opposed to the
spacecraft, which is one thing that we've developed.
We deploy and we get back. So we can look at
a spacecraft as a stateless process.
You can either look at it as a CI/CD process, or you can look
at it as a function that is doing something. But a
space station is a full application. It's got a lot of
data in it. It's very stateful.
It's something that, if there's a problem with it, we can't just say,
okay, we're going to try the next time, because we've invested so
much in this. We need it to work. We're going to retry after our failures,
not retry from start as we did with a failed
spacecraft mission. For example, the famous Apollo
13 disaster in space, an explosion on the way to the
moon. They didn't recover Apollo 13 itself.
They replicated its mission in a future Apollo
mission, Apollo 14.
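The contrast between retrying from the start and retrying after a failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_stateless` and `Station` are hypothetical names invented here, not anything from the talk or from a real library.

```python
from dataclasses import dataclass, field


def run_stateless(mission, max_attempts=3):
    """Retry a stateless job from scratch: nothing survives a failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return mission(attempt)
        except RuntimeError:
            continue  # discard this attempt and launch a fresh one
    raise RuntimeError("all attempts failed")


@dataclass
class Station:
    """A stateful job: completed steps are remembered across failures."""
    completed: list = field(default_factory=list)

    def run(self, steps):
        for name, action in steps:
            if name in self.completed:
                continue  # finished before the failure, so do not redo it
            action()  # may raise; earlier steps stay recorded
            self.completed.append(name)
```

Calling `run` a second time after an exception resumes at the step that failed, which is the "retry after our failures" behaviour, while `run_stateless` simply flies the whole mission again.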
Now, if we look at the space stations, we can see that we
have at least three generations of space stations which
were developed. The first ones in the 1970s,
were monolithic space stations. The entire station
was launched at once into space. In many cases, it couldn't
be reprovisioned. And once a few missions
were performed, that was the end of the space station. Salyut
6 and 7 were transitional space
stations, where a central station was launched and various sidecar
components were added, which gave additional capabilities,
especially in engineering experimentation
and scientific collection.
Whereas more modern space stations, beginning with Mir from
the 1980s and the ISS and Tiangong
today, are very modular space stations, where you construct
them in stages. There were nearly 50 flights, both of the space
shuttle and of regular rockets,
which launched various components to build
up the International Space Station. The modules have been moved around
to recalibrate them, put them in better positions
for whatever work they need to be done, and sometimes modules become obsolete
and are replaced. Now, if we look
at America's first space station,
Skylab. Skylab was launched using the same technology
that got the Americans to the moon, the Saturn V.
In fact, the top third of the Saturn V was transformed
into the Skylab
space station. It was so large that they actually had space inside
to test a jetpack.
The entire Skylab
was launched at once with all the scientific equipment, with all
the supplies that they needed, everything in a large
monolith, just to illustrate the size,
the internal size of Skylab, you can see the astronauts
exercising, running a treadmill, which was the inside of
the space station. The problem is,
of course, that there's a lot of empty space in a space station like
this. Whereas if you look at the International
Space Station, while it is much larger overall
than Skylab, famously compared
to the size of a football field, you can see that each
of its components is actually much smaller than
the large mass, the large monolith that Skylab was.
And these pieces fit together, each of them with their own role,
with their own goal, with their own targeted
mission. But each of them is, in itself,
much smaller than Skylab was. While the station is larger,
there's a lot less open space. It's a lot less
roomy than Skylab was,
and that's because it was developed in a modular
fashion to be brought up piece by piece, starting off with the engineering
components, then adding more and more scientific
and engineering exploration capabilities
as time goes by. So this is the blueprint, number one,
from 1998, where the space station started out.
The first component was launched in 1998,
and it was only completed in 2011.
This short film shows us the various
components. Each additional component that you see here is
another launch of the space shuttle or another launch of
a rocket. And you can see that pieces are being added. Step by
step. Pieces are being moved from location to location
because, for example, the solar panels start off in the
center of the space station when there's not a lot of requirements for power.
But as we need more power, more solar panels are added, and they
are reconfigured into different places so that the station remains
balanced. And if you have time
to read the names of these components, you can see that we have more
and more scientific components being added. We have more and
more components which have commercial applications,
allowing ground based companies to add their
own explorational payloads
to the space station over time. Whereas the first components,
the original core of the space station, was all
the life support and engineering components that were required.
Unlike the monolith of Skylab, each of the components you
see here has a dedicated goal. It can be the Zvezda
service module, which holds much of the engineering, life support,
and functional capabilities of the space station. Or it
can be the Destiny or Columbus scientific laboratories,
which perform scientific experiments. Some components are laser
focused on specific things, such as the solar panels,
the robotic arms, or the airlocks, which cannot be repurposed for
anything else. But other components do have flexibility,
especially since the station is filled with standard payload
racks, which means that new scientific experiments or technical tests
can be brought up on spacecraft to the station and replace the
older ones. It's quite remarkable
how similar the space station is today to the design of over 20
years ago. Most of the components which were decided on in 1998
do exist in some form or another. Other components,
such as a dedicated living area, do not.
Along the way, they decided that there was no necessity for
an entire component just for astronauts to sleep in, and the
astronauts sleep in various areas that they found
within the space station. I'd like
to go into a number of resiliency use cases so we can see how the
station operates day to day, and what can be
more natural than the oxygen that the astronauts breathe.
Just to be on the safe side, there are multiple redundant and
complementary oxygen solutions. The first one,
which is what the station started with in 1998, was based on
the 1980s Mir space station, which preceded the
International Space Station. It converts water into oxygen.
However, it does have a technical byproduct,
which can cause clogging and other issues in the system.
This is technical debt that has been plaguing the
station since the very beginning.
In 2006, another system was
added called the oxygen generation system, which also uses
the same general idea to convert water to oxygen. But the byproduct
that's created requires a lot less maintenance.
And a new system from 2018
uses a completely different solution, converting carbon dioxide
to oxygen. And not only that, it can also create
more water for Elektron and the oxygen generation
system. So we actually see here a
progression of starting off with a system that we know
works, but has technical debt, another system, which improves
on it, and a third system, which is now eliminating
the technical debt completely, not solving the problem by creating
a better or simpler byproduct, but completely changing
the mechanism that they use to create oxygen, which means that
not only will the problem be solved
more easily, but it won't come up in the first place.
When there are problems and these systems don't work,
then there are emergency oxygen sources. You can
see on the right here chemical bottles that are used
to create oxygen, or even simple
bottled oxygen, which is found in the station or
docked spacecraft. Despite a number of issues
with the oxygen generation systems,
primarily with Elektron,
because it's based on the oldest technology,
there's never been a severe problem with
the oxygen, with the health and the breathing of the astronauts
in the station, throughout the over 20 years that it's been
working. However, there is technical debt
in the system. Elektron is supposed to generate over
half the oxygen for the
space station, and it is very old technology. It's very difficult
to find experts on Earth who are still familiar with the
system, and also due to the design of the
Russian part of the space station, where the pieces are less
modular than on the American side,
it's much more difficult to replace the components, which is why
the new solutions, especially the ESA solution,
are coming in and will take up more,
generate more and more of the oxygen of the station as time goes by.
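The layering of primary, secondary, and emergency oxygen sources resembles a fallback chain in software. Below is a minimal Python sketch of that idea; `first_available`, the source functions, and the output numbers are all made up for illustration and do not describe how the station's systems are actually controlled.

```python
def first_available(sources):
    """Return (name, output) from the first source that works, in priority order."""
    errors = {}
    for name, produce in sources:
        try:
            return name, produce()
        except Exception as exc:
            errors[name] = str(exc)  # remember why this source was skipped
    raise RuntimeError(f"no oxygen source available: {errors}")


# Hypothetical sources with arbitrary output values.
def elektron():
    raise RuntimeError("clogged")  # simulate the oldest system failing


def oxygen_generation_system():
    return 5.0


def bottled_oxygen():
    return 1.0


chain = [
    ("elektron", elektron),
    ("ogs", oxygen_generation_system),
    ("bottled", bottled_oxygen),
]
```

With this chain, `first_available(chain)` skips the failing first source and returns the second one. The emergency source is only reached if everything above it fails.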
So here's an interesting edge case:
the spacesuits used to walk in space. Every spacewalk
is pre planned to the very last detail, including who are
the astronauts who are going to be on the spacewalk. One of the
reasons for this is that you need to customize the two piece suit
to suit the size of the astronaut. An astronaut
might want a medium upper and a large lower, or a
small lower and a large upper or any other
combination that will suit their size.
Now, the ISS only has a limited set of
pieces of these different spacesuits. And in 2019,
there was a launch failure, which meant that the right astronaut
who was planned to go on the spacewalk, didn't reach the space
station in time. Now, they still wanted to do the spacewalk,
but then they discovered that the scheduled astronauts,
two women, would not be able to build two
spacesuits in the sizes that they needed.
So the spacewalk was postponed again till
the right size spacesuit could be sent up into
the space station for them to build two spacesuits which
suited them. The fact of the matter is that because most of the astronauts
were men, most of the spare pieces of a spacesuit
were sized larger than the two astronauts who were then scheduled
to do the spacewalk. While the image we have of an
astronaut is that of a superhuman who can do anything, we would
like to give them a hand. One of the most interesting components on the ISS
is CIMON, an independently flying assistant which
can keep up with an astronaut and assist him or her with whatever
they're doing. This can range from showing documentation
or a troubleshooting manual to broadcasting music for the astronaut.
CIMON can keep track of the astronaut and position itself
so it's easy for the astronaut to read the document CIMON is
displaying. Over the years, the computers we've been
able to launch into space have become more powerful, and the network
speeds are faster. In fact, while CIMON has
a powerful processor of its own, most of the work,
especially the AI analysis, is offloaded and executed
by Watson on the IBM Cloud hundreds of kilometers
below the station. While we've discussed a number
of the technical things which happen in the space station, there are also a couple
of procedures that we should be aware of.
The space station didn't actually start in 1998.
It was first proposed in 1969, but it
got bogged down in budgetary issues and political issues,
and it was announced in 1984 and canceled in 1993.
And nothing actually happened with the space station
for decades, except a lot of talking and a lot of
money wasted on just designing in
place instead of construction.
What did work? The International Space Station.
Adding the twist of international cooperation between countries,
especially the United States and Russia, is the
thing that made the space station happen. It wasn't the exploration,
it wasn't the scientific advancements, it wasn't
the engineering capabilities, it wasn't the
commercial aspects and possibilities. No,
it was the politics of countries working together,
cooperating and creating something jointly.
So to a great extent, the business of the space station
is being an international space station.
And in the same way, when we go into creating any application that
we're developing, we need to understand what it is we're trying
to do. We're not always trying to sell the newest
widget at the lowest price. We might be wanting to
do something that is politically more complex,
which means that we need to be able to align the reliability goals
that we have to this target. For a long time, the space station
was basically supporting itself,
but wasn't doing much experimentation because
those components had not yet been launched. But still,
humans started being in the space station, working in the space station
as early as possible, because there was value simply in being
there. The smallest things can cause the
largest headaches. As site reliability engineers, we're always
conscious of the fact that we want to learn from mistakes, not just find
someone to blame, but to understand the underlying reason that the
problem occurred. Well, here's one
example of why this is sometimes difficult. In 2018,
an air leak was detected in the space station.
After lengthy examinations, the source of the leak was found: a
hole in the side of one of the spacecraft which had recently docked
with the station. Now, the immediate suspect in
the case of a small hole in a spacecraft is a meteorite
or another piece of space junk hitting it. Just a case of bad luck
and statistics. That's why the station can survive multiple such
strikes, and the astronauts can patch up any hole quite
quickly. However, in this case,
it quite obviously was not a random piece of metal which punched
the hole it was drilled. But how could
a spacecraft fly into space with a hole drilled into it?
There are basically two possibilities. The first is that after the
spacecraft docked with the space station, an astronaut took a drill
and drilled a hole in the spacecraft,
or an engineer did the same thing on the ground,
applied a patch which passed the pressure tests on the ground,
and failed a few weeks later up in space.
But why would either of them do something like this? It's hard to say.
Perhaps it was sabotage. Perhaps it was user error,
a slipped drill and a cover up instead of a proper fix.
In any case, no public summary of the cause of the issue has ever been
published. While there has been a certain amount of blame
game going around in the press, I'm not going to go into any details.
I just wanted to remind you that while we should always try to remain technical
and detached and blameless, sometimes we
won't be able to remain as detached as we like from
the political processes which are hovering above us.
Here are some of the lessons which I hope you've seen during this session.
The first one, which we learned from Skylab, is that monoliths are
simpler, even if they might be wasteful and more expensive in
the long term. When you choose your MVP,
it might be a spacecraft, a small, stateless solution.
It might be a monolithic space station, or it might be a modular space
station. Don't decide your technology before you decide what
you want to do with it. Technical debt
is the biggest problem that we have in the industry. It's crippling.
You have to be sure that you know how to transfer your knowledge.
Don't get into the situation that the Russian space agency is
in today, when they have virtually no one with the skills to support the
old Elektron oxygen system.
Remove old technology when you can, replace it with
new technology. If you can avoid problems instead of
solving them again and again, bring in something
else that will make the problem completely disappear,
like the Europeans are doing with the new oxygen
system, which does not require water at all.
Lower the cost of learning. Technology is going forward at
an ever-increasing rate and
we can't all hire only astronauts to solve our problems.
This is where AI can help by pointing out what the right documentation
is, what the right troubleshooting procedures are, helping us
find how to solve problems faster.
The topology of our systems is ever changing.
No matter which diagram I show you of the International Space Station,
chances are it's going to be a wrong diagram because something has
happened in the last few weeks. And in cloud
native development, these last few weeks could be
the last few seconds. Make sure that you have
redundant solutions and backups for cases when you can't get rid of
your technical debts. Be ready for something to fail and
have a solution in place to solve it.
Make sure you've got good resource management. You never know when you might
need a new size of spacesuit. You never know when you might need a new
node for your Kubernetes cluster or a new
runtime for your system.
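The spacesuit story is essentially a capacity check that was run too late. A small Python sketch of checking demand against stock ahead of time might look like the following; `missing_pieces` is a hypothetical helper, and the inventory in the usage note is invented for illustration rather than taken from the actual ISS manifest.

```python
from collections import Counter


def missing_pieces(inventory, crew_needs):
    """Return a Counter of pieces the crew needs that the stock cannot cover.

    inventory:  mapping of (part, size) -> number in stock
    crew_needs: one list of (part, size) pieces per crew member
    """
    required = Counter()
    for needs in crew_needs:
        required.update(needs)
    # Counter subtraction keeps only positive shortfalls.
    return required - Counter(inventory)
```

For example, with an assumed stock of `{("upper", "M"): 1, ("lower", "M"): 2}` and two crew members who each need a medium upper and a medium lower, the function reports a shortfall of one medium upper. The same shape of check applies to nodes in a Kubernetes cluster or any other pooled resource.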
It's impossible to have a completely blameless post incident
analysis simply because we're humans and simply because
politics is part of technology. But try not to
blame astronauts, not to blame people directly.
Keep it as process driven as possible, and remember
that the technology is cool, the deployment is where the fun
is, but operations and production is what
keeps the business going, gets the money coming in, makes our
clients happy and gives us support
to go on for another day and a new version of the product.
Technology is cool, but the business and the politics of the
business is vital. Keep up with the technologies,
adopt the new things that you can, but don't make it your
goal. The space station before the International
Space Station was constantly reinventing itself using
the latest and greatest technologies, but it never got off the
ground. So make sure that your solutions can
get to the cloud and beyond. Now if you enjoyed this session,
and I really hope you did, I've collected some
links to further reading which might interest you.
I didn't really want to get into all the gory details of each and
every component in the space station or all the flights
which were made in order to build it up piece by piece.
If you're interested in that, then you can go read more
about it. The reference guide to the International Space Station is published
by NASA. It's available online. Just google it.
The link is very long. I write a blog about
these things and similar lessons. Lessons from the Lunar Landing
chateau to site reliability engineers I think there's a lot of
things that NASA learned in the
thousands which is relevant to the work that we do
as site reliability engineers today. There's a lot we can learn from them, a lot
of things we can be inspired by, and this is my collection
of such lessons. NASA has actually created
its own database, public database of significant incidents in human
spaceflights. Again, a link down here from
the perspective of what IBM is doing in this domain.
Here are two links which will lead you down the rabbit hole
into a lot of further information about modern
service management and operations, site reliability engineering,
AI operations, ChatOps, which is my favorite,
and so on. And one last link about Kubernetes
on the space station. IBM working together with
NASA, with HP, with other partners in order
to deploy a unique version
of cloud computing far, far above the cloud.
Thank you and enjoy the rest of
the conference.