Abstract
Setting up a system for monitoring production availability in multiple regions might seem pretty straightforward if we’re looking at this strictly from a tooling perspective: Postman, Datadog, New Relic, AWS, Uptrends, Cloudflare (and the list goes on) all offer solutions that are super easy to set up, manage and maintain. However, the real challenges lie behind the use of the tool:
* How do we select the regions and locations to monitor?
* Do we want to monitor exact locations (e.g. a city) or broad areas (e.g. countries)?
* What do we consider a failure (typical HTTP error codes, certain time thresholds)?
* Do all the selected areas have the same importance to the business?
* Do all the monitored areas trigger the alerts with the same priority?
* How often do we want our platform to be monitored?
* Who owns the actions to fix this in case of failure?
* How do I flag these monitors so I don’t mess up company data?
* How do we design the monitors so that, in case of failure, we don’t get a cascade of alerts?
In this session I plan to answer all of the above questions, combining them with examples from my personal work experience on the topic.
An agenda would look something like this:
* About the Speaker
* Why is the topic important?
* Building your own strategy: things to consider
* Choosing the tool
* Showcase of an already working solution
* Conclusions
* Q&A on Discord
After the session you’ll be able to:
* Get a good grip on why this topic is important and in which context
* Understand how to create a strategy for monitoring different world areas that applies to the realities of your own company
* Understand what tools you can use to achieve that
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, welcome to my session today. I'm Andre,
and I'm super grateful to be part of this event. I'm currently saying hi
from Barcelona, Spain. About myself as a professional:
I've always enjoyed a creative and data-driven way of looking
at technology through my blogs and talks, and I hope to inspire many of
you IT professionals out there watching this to bring
value to your organizations in new, diverse and ingenious ways.
Previously I was part of a platform team at Typeform that was called
Software Engineering Tools and Infrastructure. Our main aim was
to support the organization with processes, self-built or existing market
tools, standards, infrastructure metrics throughout
the whole software delivery lifecycle, which would ultimately lead to a more stable and
reliable platform. Today, I'm going
to share with you a story about scaling up in 2021.
Things have changed considerably since the first typeform
was created in 2013. Typeform is growing as a
brand right now, and you can imagine that the platform behind
it has also grown to support what it is today. As our
product started to mature and our user base to grow, and all of
this not in just one specific part of the world but in many,
we realized that our platform was facing new challenges that we needed to tackle
if we wanted to continue to grow as a global brand.
One of these challenges is pretty straightforward to guess:
basically, having your product available throughout
different regions of the world. This is one problem you don't really notice
until you start to have a huge number of customers or users in parts
of the world where you were not really expecting them. You think your
platform is up because your production monitoring system
hits your data center, the same data center
where your application is deployed. But that's
not the case, and these customers are opening support tickets,
which take a bit more time for engineers to understand and resolve.
As a company that wants to grow and scale, we need to prepare our platform
not just to react better in this situation, but to
be able to proactively tackle these problems if they occur.
In this context, there are many improvements that you could bring to your platform,
and it was pretty clear for us that one of the things
we needed to look at, apart from our current spotlight,
was availability in different parts of the world,
and of course to have things coupled with our alerting
system, so in case we do have downtime, we can react
and repair much faster.
There are many tools out there that offer an effortless, or almost
effortless, solution to this problem, and most of them are right.
Obviously each one of them comes with its own set of features
that makes it unique, like integrations, the set of locations it offers,
pricing rules, et cetera. Speaking strictly
from a technical perspective, there is indeed very low complexity in
spinning any of these solutions up, and I really love when
technology lets us focus on
what is most meaningful: solving problems.
So if the challenge is not setting up the tool,
where is the challenge? Well, the challenge
is actually about all the rest of the things but the tool.
As in everything in engineering, it is about
understanding what needs to be built and how it's going to be used before
laying a finger on the implementation.
Up next I will share with you ten things you need to be asking yourself
so you can build up your own strategy for monitoring your
platform from different locations, together with some tips and tricks
from the lessons that I've learned. The first one is the
most difficult: of all the locations in this world,
which ones to monitor? Obviously, if you are not a
telco company or something similar, it is not really feasible to
monitor each tiny location on the globe.
Look at what matters to your business. Things like top X locations
by user base, top X customer locations by visits, and
top X locations of customers by premium accounts should
help you understand where you need to put your focus. Look at your SLAs
and legal agreements. Do you need to provide something in a special
location? Our approach at Typeform was to have a baseline that is
a mix of our top locations by visits and our top locations by
number of customers. So what does a location
actually mean? Is it a city, a country, a region?
It can be any of the options, or any combination of the ones
that I've just mentioned, and this depends mostly on how your business
operates. Imagine that you are a food delivery company with a strong presence
in a metropolis or a big city.
Monitoring those particular cities might
make more sense than monitoring whole regions or
countries, where if something fails in that big city,
you might not get to notice it.
We've settled for a mix between countries and regions, but we
are aware that this might change in the future.
This question might appear before or after selecting the locations.
So should the chosen locations all be treated in
the same way? Meaning, should I trigger alerts, for example, in the same
way for all of my locations? Ideally,
yes: treat all the locations the same way and alert
the same way whether your platform is down in Barcelona or in San Francisco.
Realistically speaking, this goes a bit like any
other production incident or bug. We want to fix everything, but where
do we start? First, using your company's
priority system can help you. At Typeform, currently we are treating
all of the locations in the same way because we have a very small set
of them, but due to the mix of regions and countries,
some locations end up being a bit more important than others.
How frequently should these monitors execute against the production instance
of my platform? Looking at traffic is the best
way to tackle this. A couple of users per minute might not give
you enough reason to monitor often, but as we're talking about
scaling platforms here, and the chosen locations generally have high
traffic, once per minute should be enough to
start with. Intervals longer than a minute might
leave you unaware for much longer that you are down in a particular area,
and this is not convenient. Generally at Typeform we are
even trying to go more granular than a minute,
but if we execute each minute, we'll end up with around
45,000 entries per month and around half
a million per year in the database, among other things that
we do know about. Even the security rules might
block so much activity.
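To put the once-per-minute cadence into numbers, here is a quick back-of-the-envelope sketch. The function name and the example figures for intervals and location counts are illustrative assumptions, not Typeform's actual setup.

```python
# Rough volume estimate for a synthetic monitor (illustrative only).

def executions_per_period(interval_minutes: int, days: int, locations: int = 1) -> int:
    """Number of monitor runs in `days` days, for a monitor that runs
    every `interval_minutes` minutes from each of `locations` locations."""
    total_minutes = days * 24 * 60
    return (total_minutes // interval_minutes) * locations


# One location, one run per minute:
print(executions_per_period(1, 30))    # 43200 runs in a 30-day month
print(executions_per_period(1, 365))   # 525600 runs in a year

# The same cadence from six locations (a hypothetical count):
print(executions_per_period(1, 30, locations=6))  # 259200 runs per month
```

Every one of those runs is a request that the rest of your systems will see, which is why the volume matters beyond the monitoring tool itself.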
Yes, and that's a lot. The only way forward is to start flagging
this traffic so it doesn't interfere with other systems in place
at your company. This can be done via allowlisting the IP,
adding certain bypass headers, user agents,
parameterized requests and so on. At Typeform, the different systems
this might interfere with all have different requirements in this case,
so the solution is a mix of several of the
aforementioned ways. Should this monitoring system create an alert
each time a failure is generated? I'd say no,
not really, and this is for a couple of reasons. One of the most
important ones is that the servers at the location of the tool you
are using might not have availability at the moment
they are trying to reach your platform, but this doesn't mean that
your users are affected by that or have problems
accessing your website. This shouldn't create an alert, as it would
spam the engineers, and they will challenge the robustness
and usefulness of the tool.
A good practice would be to trigger an alert only if a number of consecutive
failures is identified for that location.
You could set up a retry in case of failure or wait
for the next execution, depending on the frequency of the monitor that
you choose for your platform.
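The consecutive-failures rule can be sketched roughly as follows. Most monitoring tools offer this as a built-in setting, so this is only an illustration of the logic; the class name, the threshold of 3, and the location labels are assumptions made for the example.

```python
# Alert only after N consecutive failures per location (illustrative sketch).

class ConsecutiveFailureAlerter:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold   # hypothetical default, tune per monitor frequency
        self._streaks = {}           # location -> current run of consecutive failures

    def record(self, location: str, ok: bool) -> bool:
        """Record one monitor result. Returns True exactly when the failure
        streak for this location reaches the threshold, so one alert is
        raised per outage instead of one per failed check."""
        if ok:
            self._streaks[location] = 0
            return False
        streak = self._streaks.get(location, 0) + 1
        self._streaks[location] = streak
        return streak == self.threshold


alerter = ConsecutiveFailureAlerter(threshold=3)
for ok in [False, True, False, False, False, False]:
    if alerter.record("barcelona", ok):
        print("ALERT: platform unreachable from barcelona")
```

In this run the single isolated failure is ignored, and the alert fires once, on the third failure in a row.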
Okay, we've talked a lot about failures, but what
are they? What should we consider a failure in this case?
All these tools come with a default understanding of what a failure is,
which is pretty standardized between them as far as I've seen:
basically errors in HTTP requests,
timeouts, everything around this area. Of course
you can add your own failure criteria. For example,
maybe you have some custom error codes that you want to remove or
add on top of the default options. You may want to lower the
timeouts or figure out some latency
thresholds, or assert something in the response.
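As a sketch of how default and custom failure criteria might be combined, consider the following. The class names, field names and default values are assumptions made for illustration and do not come from any particular tool.

```python
# Default plus custom failure criteria for one monitor result (illustrative sketch).
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    status_code: Optional[int]   # None when no response arrived (timeout, connection error)
    latency_ms: float
    body: str = ""


@dataclass(frozen=True)
class FailureCriteria:
    max_latency_ms: float = 5000.0                  # hypothetical latency threshold
    extra_failure_codes: frozenset = frozenset()    # codes to add on top of the defaults
    ignored_codes: frozenset = frozenset()          # default error codes to remove
    required_text: Optional[str] = None             # optional assertion on the response body


def is_failure(result: CheckResult, criteria: FailureCriteria = FailureCriteria()) -> bool:
    if result.status_code is None:
        return True                                  # no response at all
    if result.status_code in criteria.extra_failure_codes:
        return True                                  # custom failure code
    if result.status_code >= 400 and result.status_code not in criteria.ignored_codes:
        return True                                  # default: HTTP 4xx/5xx
    if result.latency_ms > criteria.max_latency_ms:
        return True                                  # too slow
    if criteria.required_text is not None and criteria.required_text not in result.body:
        return True                                  # body assertion failed
    return False


print(is_failure(CheckResult(200, 120.0)))   # False
print(is_failure(CheckResult(503, 80.0)))    # True
```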
All of this also depends on your SLAs and general policy regarding
performance at your company. If you're starting and you just want
to explore how you could tweak this, I'd really recommend
going with the defaults in the beginning and learning from
how your system behaves before you go custom. We're
talking a lot about failures, but who should react first?
If there is an availability problem in a specific world area,
who should fix it? Well, in an ideal DevOps
environment it would definitely be the team
owning that particular part of the system.
Realities in different engineering organizations make this
question really depend on who has the access and
knowledge to make those changes. This can be an SRE
team, ops, incident management, or a platform team.
Nevertheless, it's critical to have somebody to react if
this happens. At Typeform, at first our team was
the one to be alerted so we could understand who was best
to contact in case a failure was identified. We have
tried having the teams owning that part of the platform
be the ones to react first,
but we have adapted to improve the
reaction to these alerts.
So what if there is downtime in all of the areas
because, let's say, a defect is introduced in production?
Isn't this going to trigger a huge quantity of redundant alerts?
If you go for a default setup, all these monitors are
going to fail and each is going to trigger an alert that will
be yet another alert in the clutter of alerts that your
on-call engineer needs to go through. What a nightmare.
This is not ideal. This is a problem that generally appears
in scaling engineering organizations, and luckily there is software out there that
can help you with grouping alerts based on the root cause or
affected area, or even using AI. For example,
I was reading about Moogsoft the other day, and it
looks like they are able to group all of these alerts
over there. As a side note, though, starting small
is better. Having the alerts for availability in different world areas
set up is a big step, and grouping them with
other potentially failing alerts is definitely another.
Keep those separate. And what exactly should we monitor?
Is it an endpoint? A complete user journey? All the endpoints
you have exposed? For example, should I get
a certain response after hitting an endpoint or completing a
journey? What is it that we should be looking for
in these monitors? First of all, think about the purpose
of this: checking the availability of the system from different parts of
the globe. Modern tech organizations already have some testing
going on which already tests some complete
or partial user journeys in the UI,
checks some API responses, et cetera.
A good practice here is to keep it simple.
The purpose is to check if a particular part of the platform is
accessible from different regions. Think about
what is the minimum that you can do to achieve that.
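To illustrate how small such a check can be, here is a minimal sketch using only the Python standard library. The header names and values are hypothetical and stand in for whatever flagging mechanism (bypass headers, user agents) your own systems expect. The `opener` parameter exists only so the function can be exercised without network access.

```python
# Minimal availability check: one flagged GET request (illustrative sketch).
import urllib.request

# Hypothetical identifying headers so other systems can recognize monitor traffic.
MONITOR_HEADERS = {
    "User-Agent": "synthetic-availability-monitor/1.0",
    "X-Synthetic-Monitor": "true",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the identifying headers."""
    return urllib.request.Request(url, headers=MONITOR_HEADERS, method="GET")


def is_reachable(url: str, timeout_s: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a non-error status within the timeout."""
    try:
        with opener(build_request(url), timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except OSError:
        # urllib's URLError and HTTPError (raised for 4xx/5xx) and socket
        # timeouts are all subclasses of OSError.
        return False
```

A monitoring tool would run the equivalent of `is_reachable("https://example.com/health")` from each chosen location, where the URL here is a placeholder.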
And from my experience, most of the time a simple
API call like a GET or just a ping to a URL should
do the trick. Secondly,
look at the architecture of your platform so you can choose
the exact parts of it that should
be hit by the monitors. In a monolithic architecture,
one monitor doing a GET on your main URLs
should really do the trick. In a microservice architecture,
things get a bit more complicated, and it might not be
feasible to create monitors for each of the endpoints that you are exposing,
just because of the quantity of microservices and how rapidly
they change. So understanding
how those things are deployed and where, and
what the connections between them are, would allow you to understand
the risk and reduce the quantity of monitors to a very
few, but I would say powerful, ones that will allow you to avoid
downtime in different parts of the globe. To conclude
my talk today, I'd like to stress the following aspects.
As a scaling organization, you need to look at things differently
than before. So before jumping into
setting up everything in any of the tools out there, look at the
current needs of your company for checking availability
in different world areas today, but also from the perspective
of the near future. Although there are
some best practices, the most important aspect is looking
at the specific context of your organization while designing the
strategy for this. Choose a tool
that can help you beyond just today. Last but
not least, in a scaling organization, there are many things
that can be done much better. So if you choose to monitor the
availability of your platform from different world areas,
start small and then figure out some of the
details along the way. This will be much more impactful
and will keep you focused. If you want to reach
out to me for help setting this up, or just to drop me
a line and chat, feel free to add me on LinkedIn. Also do
check out my blog for my experiences and opinions regarding
work in the software development industry. Last but
not least, I'd like to hear what you loved and hated about my session
today, and I invite you to do it via the feedback form in
the QR code that is on the screen, powered
by our new product at Typeform that is called VideoAsk.
I hope you enjoy it. Thanks for joining me today,
and enjoy the rest of the conference. See you
around.