Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, thank you very much for attending the webinar. My name is Quiz,
and I'm going to share Lumigo's experience with serverless CI/CD.
At Lumigo we are 100% serverless in production, with hundreds of
millions of Lambda invocations per month. And of course, serverless is
not only Lambdas; we are heavy users of DynamoDB,
SQS, SNS, S3, Athena, Kinesis,
EventBridge, and I'm sure I missed something. We are really the poster child
of serverless in AWS. Not only that, our CI/CD flow is also 100%
serverless, with tens of deployments per week. We don't
have a single server that we manage to drive our flow. Everything is outsourced
either to AWS or to an external service provider. In this session you'll
learn how we drive our CI/CD flow at Lumigo, how to make
the entire flow robust enough, and I'll share some tips on how we cut
our CI/CD flow time by two thirds, and we'll talk about a
unique flow for phased deployments we implemented internally. Across the session
you'll see many links to blog posts we wrote on the subject.
As preparation for this session I gathered some statistics so you'll
better understand the CI/CD numbers Lumigo is facing. We have around
20,000 individual integration tests that we run per month and
around 30 deployments to production per week. The
pain points, which I'll also talk about during the talk, are the time
it takes us to deploy the tests and run them. By the way, we are
trying to be metrics driven and we gather quite a lot of information about the
quality and performance of our continuous deployment and development.
I'll share with you some of our dashboards. A couple of words about me: I'm an
AWS Serverless Hero, and I'm leading Lumigo's R&D.
I've been using serverless for the past four years, mostly with Python.
I'm obsessed with serverless testing and serverless development flows.
I wrote a couple of blog posts and have given a couple of sessions on the
subject. I have 15 years in the industry, working mainly on backend and
mobile applications across various verticals. In my spare time I like to play card games,
although most of the time I lose. A couple of words about Lumigo.
It is a SaaS platform for AWS monitoring and observability,
heavily focused on serverless workloads. So, the agenda: we are going to
talk about the famous infinity loop and see how to adapt it to a
serverless CI/CD flow, and talk about our best practices. Unfortunately,
we won't have any time for Q&A here, but you are more than welcome
to contact me either through Twitter or through my email address,
which is going to appear at the end. Serverless is different. It affects
the way CI/CD is conducted. It's distributed, with maybe
hundreds of components. Therefore, the orchestration of building
a serverless environment from scratch each time is difficult.
Many components usually means that frequent deployments are the norm,
and for many of the components you can't run them on a regular Linux machine.
SQS, SNS, and Kinesis are AWS services.
We don't have containers for them, only mocks, so they have to run
in a dedicated environment in order to test the real behavior. The infinity
loop starts with a plan, goes to code, then to a build process.
By the way, in Lumigo we use Python, therefore the build process
is skipped, but in other runtimes like Java or .NET
you actually have to create a running artifact. From there we move to testing:
unit testing, integration testing, and end-to-end
testing. When everything passes (and I'll elaborate later on what
"everything" means), we release our artifacts, deploy them
to production, and start monitoring them. Before starting the flow,
I want to go over a couple of guidelines we have internally in Lumigo that
affect our process. Each of our developers has an AWS
environment in their own name; we're using AWS Organizations to manage them
and consolidate the billing. If the code is not tested
on a real AWS environment, then it's not considered ready.
Originally we had a shared environment with different name prefixes for
each developer's resources, but very quickly
we started stepping on each other's toes and decided to separate the
environments. We still have a single environment for CI/CD-driven integration
tests. We are serverless first, which means we will always prefer to choose
a serverless service instead of managing it by ourselves.
And by serverless I mean true serverless, not just managed services.
We don't want to handle sizing and operations or pay for unused capacity.
By serverless we mean across the technology stack, things like CI/CD,
code quality, and so on. I'm not talking only about serverless
in production; we don't want servers anywhere. We prefer to outsource everything
that is not at the core of our product. No dedicated QA
and Ops, which means everything is being done by the developers from
start to finish. In the infinity loop that you saw earlier, the same developer
takes the ticket across all stages. No QA and
no Ops means we invest heavily in automation, across the
board. We've talked about environments, and these are the environments we have in
Lumigo. Each developer has their own personal laptop and their personal
AWS environment where they can run their code. We do have
a couple of shared integration environments that are part of the automated
CI/CD process. We have production environments, which are composed of
an environment our customers use and a monitoring environment that runs
our own product and monitors our production. We are eating our own
dog food extensively, which helps us both to find potential issues ahead of
time and to sharpen our product. As for the technology stack that drives
our CI/CD: we're using CircleCI (we even wrote a joint blog post
on how we use them for our deployment) and the Serverless Framework. Most of
our code is written in Python, but we do have some services that are written
in Node.js. So let's start with the infinity loop. When Lumigo
started and we were small, we used Kanban to drive
our workflow. We had a long list of tasks prioritized.
Each developer picked the top one. But as we grew, upper
management wanted more visibility, so we moved to a more detailed Scrum.
Each of our sprints is a week long. We keep them short on purpose
to create the feeling of things moving fast, but we don't
want to wait for the end of the sprint to deliver. We are very
continuous-delivery oriented: when a piece of code passes through all of
our gates, it's pushed to production. Again, a lot of responsibility
falls on the developer's shoulders. Originally we used Trello as
a ticket tracker, but as the team grew and the complexity of the task grew
as well, we moved to GM. I can't say that I'm satisfied with the move,
but that's what we have and we live with it. We're using
GitHub to store our code and using the GitHub flow, in
which you have only master and feature branches. Each merge from a feature
branch to master means a deployment to production. Again, and it's going
to come back over and over, a lot of responsibility is put
on the developer. At the beginning, we had a very heated discussion regarding
mono-repo versus multi-repo. We chose multi-repo because it was the most suited
for service deployments. That is, each change in a repo
means a deployment, and it forces the developer to
think in a more service-oriented way. You don't read directly from
a DynamoDB table that does not belong to your service just because you can
import a DAL or call functions directly. Instead, you use common
practices to access remote resources: API Gateway,
Lambdas, queues, and so on. We wrote a lengthy post on the never-ending battle
between mono-repo and multi-repo; you can find it on our blog.
In one of the slides I mentioned that each one of our developers has their
own AWS environment. Right now we have around 20 services, and we need
to orchestrate the deployments of these services so our developers can update
or install them in their environment. We've created
an internal orchestration tool in Python which we call the
uber deploy (not related to Uber, by the way) that does
the following: it pulls the relevant code from git depending on the branch
you choose, installs the relevant requirements for each service, and
installs the various services in parallel according to a predefined
order. The uber deploy tool enables our developers to easily
install these services in their environment, so no one needs
to know the various dependencies and the order of deployment, and it does it
faster than doing it manually. By the way, we use this tool only in the
developer and integration environments. In production, each service
is deployed on its own when it's updated. This is purely a development
tool.
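As a rough sketch of the idea (this is not Lumigo's actual tool; the service names, wave ordering, and paths below are hypothetical, and it assumes each service is deployed with the Serverless Framework CLI), such an orchestrator could look roughly like this:

```python
# Hypothetical sketch of an "uber deploy" style orchestrator -- not Lumigo's internal tool.
# Assumes each service is a git repo deployed with the Serverless Framework CLI ("sls deploy").
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Services grouped into waves: everything in a wave deploys in parallel,
# and the waves themselves run in a predefined order (e.g. shared infra first).
DEPLOY_WAVES = [
    ["shared-infra"],
    ["ingestion-service", "tracing-service", "alerts-service"],
    ["frontend-api"],
]

def deploy_service(name: str, branch: str, stage: str) -> None:
    repo_dir = f"./{name}"
    # Pull the relevant code for the chosen branch.
    subprocess.run(["git", "-C", repo_dir, "checkout", branch], check=True)
    subprocess.run(["git", "-C", repo_dir, "pull"], check=True)
    # Install the service's own requirements.
    subprocess.run(["pip", "install", "-r", f"{repo_dir}/requirements.txt"], check=True)
    # Deploy to the developer's personal stage.
    subprocess.run(["sls", "deploy", "--stage", stage], cwd=repo_dir, check=True)

def uber_deploy(branch: str, stage: str) -> None:
    for wave in DEPLOY_WAVES:
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            # Deploy every service in the wave in parallel; fail fast on the first error.
            for future in [pool.submit(deploy_service, s, branch, stage) for s in wave]:
                future.result()

if __name__ == "__main__":
    uber_deploy(branch="master", stage="dev-alice")
```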
We believe in automated testing; without QA, it's mandatory and we can't skip it. We have three types of tests: unit tests,
which the developer runs locally; integration tests, which the developer runs on
their AWS environment and which are also run as part of the CI
flow in a dedicated environment; and end-to-end testing
with Cypress, which, again, is run on an AWS environment.
Because testing in the cloud is slower than testing locally, we prefer
to detect as many issues as possible before pushing the code to the remote.
We use git pre-commit hooks to automatically run tests
and linting. For Python we use pre-commit and for Node we use Husky.
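They use the pre-commit framework and Husky rather than hand-rolled hooks, but just to illustrate the idea of gating a commit on linting and fast tests, a hand-written `.git/hooks/pre-commit` script in Python might look something like this (the tool choices and test path are assumptions):

```python
#!/usr/bin/env python3
# Illustrative stand-in for a pre-commit hook (the talk uses the pre-commit framework and Husky;
# this hand-rolled .git/hooks/pre-commit script only shows the idea of gating commits).
import subprocess
import sys

CHECKS = [
    ["black", "--check", "."],       # formatting
    ["flake8", "."],                 # linting
    ["pytest", "-q", "tests/unit"],  # fast unit tests only -- hypothetical path
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"pre-commit check failed: {' '.join(cmd)}")
            return result.returncode  # non-zero exit aborts the commit
    return 0

if __name__ == "__main__":
    sys.exit(main())
```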
One of the hardest things when running tests is handling external
services. People usually ask me: why are we running
our integration tests in the cloud? Use mocks; there are a lot of
mocks that mimic the behavior of the various AWS services.
Well, we tried it and it didn't work well, for a couple of reasons.
Some services that we use don't have good mocks: they don't really mimic the
true behavior of the service and don't always include the
latest API. Some mock infrastructures, like LocalStack,
are complicated; there are a lot of rough edges surrounding them. I prefer to spend my time
on real testing. Some of these mocks have bugs, and it happened
to us more than once that things didn't work well, only to later find
out that they work perfectly when running on AWS. I prefer
not to waste my time debugging mocks. So as a rule of
thumb, we don't use service mocks for integration testing. We always
run things in a real AWS environment.
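To make the "real resources, no mocks" point concrete, a minimal pytest-style integration test against a real deployed DynamoDB table could look like the sketch below (the table name and item shape are hypothetical; credentials and region come from the developer's own AWS environment):

```python
# Minimal sketch of an integration test that hits a real AWS environment instead of a mock.
# Table name and item shape are hypothetical; it targets the developer's own account.
import uuid
import boto3

def test_store_and_read_item():
    table = boto3.resource("dynamodb").Table("dev-alice-events")  # hypothetical table
    event_id = str(uuid.uuid4())

    # Write through the real service API...
    table.put_item(Item={"id": event_id, "payload": "hello"})

    # ...and read it back from the real table, so the real behavior
    # (IAM, limits, consistency) is what gets exercised.
    item = table.get_item(Key={"id": event_id}, ConsistentRead=True)["Item"]
    assert item["payload"] == "hello"
```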
So this is our testing stack: we use Black and Flake8 for Python,
and Prettier and ESLint for Node. We use static analysis
for Python, although I'll be honest, it helps us more with
readability and less with actually catching type errors, though it does sometimes
succeed in catching issues. And for unit testing we use pytest,
Mocha, and Jest. We have two types of integration
tests: very thorough API-based testing, in which our test is
driven only by API calls with no UI interaction, using Node.js
and Mocha to drive the flow; and very specific critical
flows that are end to end, which also include the UI flows like
onboarding, login, and so on, with Cypress and
Jest. We have two main problems when running our tests. We have 20 services
with around 200 Lambdas, multiple DynamoDB tables,
and Kinesis. Deployment takes a lot of time; by a lot of time
I mean around one hour. The tests themselves are also slow; there are many
asynchronous flows. The tests originally took us around an hour and a half
to run. These numbers are not good. They don't allow us to quickly
deliver our features and, above all, get quick feedback in case
of a failure. So first, we've tackled the second
problem: running the tests. We are using the power of parallelism.
We've duplicated our integration-test AWS environment:
each time we do a deployment for testing, we actually deploy
our code to three environments, and we run each of
our tests in a different environment so the tests don't interfere with one another
and we can run them in parallel. The nice part here is that because
we are serverless, it does not cost us any extra money except
for Kinesis, which is quite a painful point:
Kinesis requires at least one shard to operate, so we do pay for it.
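A rough sketch of what fanning the suite out over duplicated environments can look like (the environment names, test paths, and the TARGET_ENV convention are all hypothetical):

```python
# Hypothetical sketch of splitting the integration-test suite across duplicated AWS
# environments so the chunks run in parallel without interfering with each other.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One duplicated environment per test chunk -- names and paths are made up.
TEST_CHUNKS = {
    "integration-env-1": "tests/integration/ingestion",
    "integration-env-2": "tests/integration/tracing",
    "integration-env-3": "tests/integration/alerts",
}

def run_chunk(env_name: str, test_path: str) -> int:
    # Each chunk points its tests at a different environment via an environment variable.
    env = {**os.environ, "TARGET_ENV": env_name}
    return subprocess.run(["pytest", "-q", test_path], env=env).returncode

with ThreadPoolExecutor(max_workers=len(TEST_CHUNKS)) as pool:
    return_codes = list(pool.map(run_chunk, TEST_CHUNKS.keys(), TEST_CHUNKS.values()))

# Fail the CI step if any chunk failed.
if any(code != 0 for code in return_codes):
    raise SystemExit("at least one integration-test chunk failed")
```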
By using this parallel flow, we managed to reduce the integration test time
from an hour and a half to around 40 minutes, less than half the time. The nice
part about it is that it's fully scalable: more tests just means adding more
AWS environments. Another change we did just recently
is the ability to reuse existing stacks after deployment. We pin the latest
hash of the repo as a tag on the CloudFormation stack, so when the
uber deploy runs, it checks whether the code changed, and in case it
didn't, it skips the deployment of that specific service. It reduces
the redeployment time from around 30 minutes to 5 minutes.
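A minimal sketch of that check, assuming the commit hash is stored as a CloudFormation stack tag (the tag key, stack naming, and deploy command are assumptions, not the actual tool):

```python
# Sketch of the "skip unchanged services" idea: the deployed commit hash lives as a tag on the
# service's CloudFormation stack, and redeployment is skipped when the repo hash hasn't changed.
# Tag key and stack naming are hypothetical; assumes the stack already exists.
import subprocess
from typing import Optional

import boto3

HASH_TAG = "deployed-git-hash"

def current_repo_hash(repo_dir: str) -> str:
    out = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def deployed_hash(stack_name: str) -> Optional[str]:
    stack = boto3.client("cloudformation").describe_stacks(StackName=stack_name)["Stacks"][0]
    tags = {tag["Key"]: tag["Value"] for tag in stack.get("Tags", [])}
    return tags.get(HASH_TAG)

def deploy_if_changed(service_dir: str, stack_name: str, stage: str) -> None:
    if deployed_hash(stack_name) == current_repo_hash(service_dir):
        print(f"{stack_name}: code unchanged, skipping deployment")
        return
    # The deployment itself would also re-tag the stack with the new hash
    # (for example via stackTags in serverless.yml).
    subprocess.run(["sls", "deploy", "--stage", stage], cwd=service_dir, check=True)
```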
Right now our biggest obstacle is the initial deployment; it unfortunately
takes quite a lot of time, and we are trying
to tackle this issue as well. Another flow that we've developed internally
is a staging flow. One of our components is an SDK which is
embedded by our customers. The SDK is very delicate, and a bug there
means a Lambda will crash. So we wanted to first deploy the
SDK on our own system. As I mentioned, we are dogfooding our own platform,
so we created a flow in which, as a first step, we release
an alpha version to npm. The alpha is not seen by our customers. It then
deploys the alpha version to our environment and triggers a step function
which automatically, in case no issues are found in the staging environment,
releases the final version. At any moment we are able to
stop the step function, like a red button, in case we
find an issue in the SDK.
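That "red button" maps naturally onto Step Functions' StopExecution API; a minimal sketch (the state machine ARN below is hypothetical) could be:

```python
# Sketch of a "red button" that aborts the in-flight release execution when an issue is found
# in the staged SDK. The state machine ARN is hypothetical.
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:sdk-release"

def abort_sdk_release(cause: str) -> None:
    # Find the currently running release execution(s) and stop them before they promote the alpha.
    running = sfn.list_executions(
        stateMachineArn=STATE_MACHINE_ARN, statusFilter="RUNNING"
    )["executions"]
    for execution in running:
        sfn.stop_execution(
            executionArn=execution["executionArn"],
            error="ManualAbort",
            cause=cause,
        )
```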
Integration metrics are important, and they give us the ability to pinpoint potential issues.
We are feeding the results of our tests to Elasticsearch.
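As an illustration of what that feed might look like with the Elasticsearch Python client (the index name, document shape, and endpoint are assumptions), see the sketch below:

```python
# Rough sketch of feeding test results into Elasticsearch so they can be aggregated by
# test name and branch. Index name, document shape, and endpoint are hypothetical.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search.example.internal:9200")  # hypothetical endpoint

def report_test_result(test_name: str, branch: str, passed: bool, duration_sec: float) -> None:
    # 8.x-style index call; older clients take body= instead of document=.
    es.index(
        index="ci-test-results",
        document={
            "test": test_name,
            "branch": branch,
            "status": "passed" if passed else "failed",
            "duration_sec": duration_sec,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```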
At the top we are able to aggregate metrics like the number of
failed or succeeded tests. The really interesting part
is the distribution of failed tests: we can see a breakdown of the various
tests according to the branches that ran them. If we see a test that fails
in multiple branches, it's a hint for us that something is not working well in
the test and it requires a fix. The idea is that some branches might
have bugs in them, and that's why some tests might fail; but if the
same test fails many times in multiple branches, it probably means the test
itself has an issue. Another metric we are gathering is deployment problems.
The deployments are not bulletproof. For example, a deployment
might fail because we are trying to provision multiple EventBridge resources
at the same time, and AWS has a limit on that. Although we
do have internal retries, they don't always work. So this
overview gives us visibility into the number of failures that happen due to deployment
issues. At the end of the day, we are cleaning our environments, for
two main reasons: some services, like Kinesis, cost money, and an AWS
environment has a limit on the number of resources you are allowed to
provision. So cleaning is mandatory, and it's hard.
We still haven't found a good process for it. We're using a combination
of aws-nuke and the lumigo-cli right now. It's very slow.
It takes a couple of hours to clean an environment.
Unfortunately, right now I don't have any good and fast solution
for this problem.
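This is not aws-nuke or the lumigo-cli, but as a minimal boto3 sketch of one piece of such a cleanup (deleting CloudFormation stacks whose names carry an assumed test-environment prefix):

```python
# Not aws-nuke or the lumigo-cli -- just a boto3 sketch of one part of cleaning an account:
# tearing down all CloudFormation stacks whose names carry a (hypothetical) test-env prefix.
import boto3

cfn = boto3.client("cloudformation")

def delete_test_stacks(prefix: str = "integration-env-") -> None:
    paginator = cfn.get_paginator("describe_stacks")
    for page in paginator.paginate():
        for stack in page["Stacks"]:
            name = stack["StackName"]
            if name.startswith(prefix):
                print(f"deleting {name}")
                # delete_stack is asynchronous; completion would have to be polled separately.
                cfn.delete_stack(StackName=name)
```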
So we are ready to release. We have a couple of release gates. Some are manual, but most are
automatic. Code review is manual and the rest of the steps are automated.
When gates pass, the developer clicks on merge and the deploy begins.
Here you can see the various gates we're using; we rely on GitHub checks for gating.
So that's it. Everything is in production now. What's next?
Monitoring. Of course, monitoring is hard in serverless for
a couple of reasons. With many microservices, you need to see the full story;
the root cause might be somewhere down the stack in various services.
There is no server to SSH into, so it's hard to collect the relevant details that
help you better understand what's going on. There are a lot of new technologies;
suddenly disk space and CPU do not play a role, and there are a lot
of new metrics and new jargon to learn. So we are using
our own product in Lumigo to monitor and to do
root cause analysis. As I mentioned earlier, we are eating our own
dog food, and I do want to show you a quick demo of what Lumigo
looks like and how we use it on a daily and a weekly basis.
Okay, so Lumigo is an observability and troubleshooting platform
aimed mainly at serverless workloads. It fuses data from multiple sources
and combines it into a coherent view. Right now we use logs,
specifically AWS CloudWatch, the AWS API,
and an SDK that the user can embed with zero effort in
their code, which collects more telemetry data on the compute instances.
We have two major production flows. The first is Slack alerts, which indicate that we
have an issue we should handle immediately, and it's monitored
by an R&D developer. We define the alerts in the alert configuration,
and when an alert is received, you are able to configure it to work against
Slack, against PagerDuty, or against an email. In addition, there are also
a lot of other integrations you can work with. We also have a weekly meeting
in which we go over a list of issues which occurred during the last
seven days. So we have a real-time flow, where an issue
arrives through Slack, and we have a weekly meeting, where we
go over a list of issues and try to better understand what issues
happened and whether they affected specific customers that we want to handle.
One of the nice things that the Lumigo platform enables
you to do is to drill down into each issue type and each issue,
and to better understand the entire story of what
happened and why something ran the way it ran. You're actually able
to zoom into each one of your Lambdas and the resources
that you use, and actually read the return value and the event details.
You can see the external requests that you are making against external
resources. For example, here you can see a DynamoDB request;
you can see the actual query and the response. And above
all, we can actually detect issues in your Lambda, show the
stack trace, and let you see the various variables
that were defined, to better debug
the issues that you've encountered. So this is Lumigo.
I want to quickly go and summarize
what we have right now. So, to summarize: use the power of serverless
to parallelize your testing, which keeps costs down; prefer integration
tests with real resources and not mocks; and allow easy orchestration
for your developers. And before finishing,
I want to tell you about a nice open source tool developed by Lumigo.
It's a Swiss Army knife for serverless, with many useful commands:
for example, commands that enable you to switch between AWS environments,
the ability to clear your account, tailing EventBridge, and the list goes
on. Give it a try. That's it. Thank you very much. Again,
if you have any questions, you're more than welcome to email me or
ping me through Twitter. Thank you very much and bye.