Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone.
How is it going today?
We are here for the talk on serverless workflow orchestration
on AWS, or Amazon Web Services, as we know it.
Before I begin, let's start with a small introduction about myself.
I am Bharat Vishal Tiwari, currently an SDE2 at Amazon.
I have over 12 years of experience in the software development industry and
two master's degrees in electrical engineering and computer science
from Arizona State University.
I recently published an article on techtimes.com on a related topic.
The link is at the bottom of the slide.
So do feel free to take a look when you get a chance.
I am someone who likes to stay updated with the latest tech trends and
talk about microservices, DevOps, machine learning, Gen AI, you name it.
If you are someone who likes to talk about these subjects, definitely
feel free to reach out and connect with me on LinkedIn,
and I would love to discuss these topics with you.
Let's get started with the talk then.
Serverless Workflow Orchestration on Amazon Web Services.
Let's start with a quote, like many other talks.
By orchestrating services, businesses can unlock agility, quickly adapt to
changing customer needs, and deliver innovative solutions faster than ever before.
How, you ask?
I'm sure after this talk, you will be in a position to answer that
question much better than you are now.
So let's start with today's agenda.
We start by looking at some of the concepts: we break the title
down word by word and look into each part individually.
Then we move on to talking about orchestration in AWS, how it's done, and
understand the current services available.
Finally, we conclude the talk with best practices that we should keep in mind
when orchestrating on AWS; these are the things that will help us design a
resilient system and shape the future.
So let's get started with some of the concepts.
So what's serverless workflow orchestration?
Let's break it down word by word and look at what is serverless.
Serverless is where we build and run applications
without thinking about servers.
But what is serverless and why do we need it?
Serverless computing is a cloud computing execution model where the cloud provider
dynamically manages all server resources.
What that means is less worry for you
about how to provision a server and maintain it.
The physical servers are still used.
It's not like we are getting rid of servers, but they are
abstracted away from the developers.
Unlike earlier times where people used to estimate, order hardware, and deploy
their services on that, scale them as per the peak time or customer traffic needs,
these things are abstracted away from you, so you can focus more on development.
The main value proposition is focusing on business outcomes while abstracting
away the mechanism of computing.
You don't need to worry about where this code is running
or how this code is running.
You can just say, I want a certain amount of compute, you give
them the code to run, and it's done.
What are the benefits?
Benefits include pay per use.
You pay for what you use, you pay for the compute, you pay for the
execution time of your function.
You get automatic scaling out of the box.
So you don't have to worry about those peak times, those times
where customer traffic can increase or the spikes that can come in.
There is automatic scaling, you can configure it beforehand and it takes
care of scaling your hardware for you.
What that means is reduced operation and infrastructure cost, less worrying
about operation and infrastructure, less money required to spend there
because we pay for what we use.
It also means, for startups or new ideas, faster time to market, because you don't
have to wait for the provisioning of servers, the ordering of hardware,
the time it arrives, and so on,
making it ready for your software.
You worry about what functions you want to put in.
You worry about the features you are going to launch.
You worry about the solution you want to give, the problem you want to
solve, and serverless enables you to reach the market faster.
Let's look at some of the AWS serverless services that are available today.
I have a link at the bottom if you want to go to the official
documentation and look at those, but some of the common ones are CloudFront,
the CDN network from Amazon Web Services.
Route 53, the DNS hosting service.
API Gateway, which is the entry point for your application on cloud.
VPC, your personal piece of the cloud cut out for you, in which your systems reside.
There are services for application and mobile development like Amplify and AppSync.
There are services which help you orchestrate or choreograph your logic:
Step Functions, EventBridge, SQS, SNS. You have DynamoDB for your database, you have S3.
These are some of the most famous names in the serverless world.
Then you have compute in the form of Lambda, where you can give your
code and run it as functions.
You have Fargate, where you can run your containerized applications without
worrying about the infrastructure.
You have identity management services like Cognito, where you can store your
users' data and authenticate them.
You have other operational and development tools and related services.
It's a lot on one slide, but if you have any questions about any of these services
or would like to discuss them in detail, do feel free to reach out and
I'll be happy to talk about them with you.
Next, let's look at what workflows are.
A workflow is a sequence of tasks that are part of a larger process or goal,
a series of actions that accomplish a particular task,
serving as a fundamental unit of work.
So let's take an example.
Let's think about making coffee.
You take the coffee powder, brew the coffee, heat milk, add
milk to your coffee, maybe add some sugar, and have the coffee.
So this process of making coffee is a workflow, and the different steps you took,
like taking the coffee powder, brewing it, and heating the milk,
are the different tasks or steps in your workflow.
Workflows are designed to simplify and automate tasks by combining multiple
actions into a coherent sequence.
In various contexts, workflows serve different purposes.
They can be either manual or automated, and are dynamic in nature, with different
paths taken based on previous steps or results.
They can be used for an ETL job, for CI/CD automation, or maybe for implementing
a function for an e-commerce website, and so on.
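To make this concrete, here is a minimal sketch of the coffee workflow as an Amazon States Language state machine, the notation we will see later in the talk; the Lambda function names and ARNs are placeholders, not real services.

{
  "Comment": "Coffee workflow: each state is one task in the sequence",
  "StartAt": "BrewCoffee",
  "States": {
    "BrewCoffee": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:BrewCoffee",
      "Next": "HeatMilk"
    },
    "HeatMilk": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HeatMilk",
      "Next": "AddMilkAndSugar"
    },
    "AddMilkAndSugar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:AddMilkAndSugar",
      "End": true
    }
  }
}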
Next, let's look at what microservices are.
Microservices is a software architectural approach that structures applications
as a collection of small, independent services that communicate over well-defined APIs.
Each service runs in its own process and focuses on doing one thing well,
making them simple and granular.
What are the key characteristics of microservices?
Microservices allow autonomous operation and technology diversity.
Each microservice can be built using a different technology.
They all have their own independent databases.
What it means is teams can operate independently, using the
"you build it, you run it" DevOps model.
There are common scenarios where you might have to enable communications
between different microservices.
And if you think about it,
each microservice talking to every other microservice means
a lot of communication.
So to put some order in the chaos, there are two common patterns that
are followed for microservices.
One is called orchestration.
The other one is called choreography.
Orchestration is where a central service acts as a brain to coordinate the logic.
So let's say we talk about an ordering scenario.
If a customer places an order, we need to notify the customer.
We need to prepare the order and maybe we need to generate some metrics.
So there are different approaches we can take for that.
The orchestration approach is where a central service takes care of calling
a notification service to notify the customer, calling a different
service, maybe, to prepare the order, and an analytics service to
generate the metrics regarding that.
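As a sketch of that orchestration pattern, a central Step Functions workflow could fan out to the three services with a Parallel state; the service names and ARNs here are hypothetical.

{
  "Comment": "Central orchestrator coordinates notification, preparation, and metrics",
  "StartAt": "HandleOrder",
  "States": {
    "HandleOrder": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "NotifyCustomer",
          "States": {
            "NotifyCustomer": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:NotifyCustomer",
              "End": true
            }
          }
        },
        {
          "StartAt": "PrepareOrder",
          "States": {
            "PrepareOrder": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:PrepareOrder",
              "End": true
            }
          }
        },
        {
          "StartAt": "EmitMetrics",
          "States": {
            "EmitMetrics": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:EmitMetrics",
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}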
Then we have choreography, where each service acts autonomously.
In this kind of pattern, we have an event broker in between,
with the different microservices acting on the events.
This is often known as event-driven design as well.
Let's compare these two and see what are the differences.
In orchestration, the control is explicit and managed by the orchestrator.
Think about an orchestra where different musical instruments are playing
and there's a person standing in front who is directing them.
That person is the orchestrator,
and the model is known as orchestration.
In choreography, control is implicit and managed by individual services.
All the services in orchestration communicate directly with the
orchestrating service, the central brain.
While in choreography, it's more event based communication.
Orchestration is simpler for defining workflows.
Choreography has more complex interactions, but
simpler service autonomy.
Orchestration allows centralized error handling, whereas choreography
requires distributed error handling.
Orchestration can be considered less flexible due to central
control, whereas choreography is highly flexible and adaptive.
But there are certain scenarios where either of these is useful.
Orchestration especially is useful when you want a clear view of what
happened and when it happened,
and a clear view of the flow of things happening.
In choreography it's also possible, but you have to adopt
complex monitoring for that.
So today's talk is about orchestration. Let's move on and see what
the different use cases are.
Whenever you can think of a workflow, you can think of orchestration.
So for example, you raise an IT service request, someone approves the
request, and then the software is installed or the service is delivered.
This is an example of where orchestration can be used.
In compliance: many industries have rigorous legal compliance requirements.
Orchestration can automate the needed data collection from multiple departments,
generate templated reports, and ensure that the right people get them.
Similarly, we can imagine orchestration in employee
onboarding, offboarding, and software development as well.
This brings us to the second part of the talk: orchestration
in AWS with AWS Step Functions.
Before we move on to Step Functions, let's look at the top orchestration tools
used in the market today.
So there is Apache Airflow.
Apache Airflow is an open source tool for scheduling and monitoring
workflows developed by Airbnb.
It uses directed acyclic graphs to manage complex data pipelines effectively.
Then we have AWS Step Functions, which we will be looking
into from the next slides.
AWS Step Functions is a serverless orchestration service that
lets you combine AWS services to build and scale distributed
applications using state machines.
We have Google Workflows, which is a powerful orchestration
service from Google Cloud.
We have Microsoft Power Automate, which is another offering from Azure.
Then we have some other ones like Dagster and Argo, which are also
very useful in their own fields.
So let's move on and talk about why we should use AWS step functions.
Step Functions allows low-code or no-code
workflows to be created using Workflow Studio in the console,
which we'll look into in the next slide.
It is highly scalable.
The solution can easily scale to meet the demand of enterprise
level applications and workflows.
It's reliable, built on the dependable infrastructure of AWS, which
has been battle tested and is being used widely today.
It provides high availability and fault tolerance for orchestrated workflows.
It offers flexibility.
Developers can create workflow logic using familiar programming
patterns and seamlessly integrate with various AWS tools and services
that they are already used to using.
It's a cost-effective solution:
by carefully choosing the type of Step Functions workflow, you
can keep costs pretty low.
And another benefit that comes with it is AWS CDK.
With AWS CDK, the deployment can be made much simpler: you write the infrastructure
as code, lint it, and generate a CloudFormation template with easy validation
before deployment.
So how does an AWS Step Functions workflow look?
On the right you see an example of what a Step Functions graph looks like.
You basically create workflows to build distributed applications,
automate processes, orchestrate microservices, and create data
and machine learning pipelines.
A few key concepts from AWS Step Functions are executions, tasks, and activities.
Every instance of the workflow which is executed is known as an execution,
and we have tasks and different activities in Step Functions.
In the step function console, you can visualize, edit and debug
your application's workflow.
You can examine the state of each step in your workflow to make sure
your application runs in order and as expected.
You can retry the workflow from the state
where it failed for any reason,
and a lot more.
So what are the components of an AWS Step Functions workflow?
There is a basic one where a request and response are required:
we call service B and expect a response from there.
There are decision components where you need to decide
whether to go to state C or state D.
There is a retry task where, for example, if there is a retryable
failure, you might want your workflow to automatically retry the task,
maybe immediately, maybe with a backoff; that's where
you use the retry task component.
There can be a requirement to add a human in the loop.
For example, we talked about the IT services use case: between raising a
request and getting the request fulfilled,
there might be a need to add a human who approves the request.
There can be a need to process data in parallel;
there we have a component which allows us to achieve that.
And then finally, we can also dynamically
process items with Map-style operations.
So all these components put together give us an AWS
Step Functions workflow, which can help us achieve our business logic.
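As a brief sketch of how a couple of these components look in a definition, here is a hypothetical workflow combining a retry component and a decision component; the function ARN and the $.status field are assumptions for illustration.

{
  "Comment": "Retry component on the task, decision component to pick state C or D",
  "StartAt": "CallServiceB",
  "States": {
    "CallServiceB": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ServiceB",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Next": "RouteOnResult"
    },
    "RouteOnResult": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "OK",
          "Next": "StateC"
        }
      ],
      "Default": "StateD"
    },
    "StateC": { "Type": "Succeed" },
    "StateD": { "Type": "Fail", "Error": "UnexpectedStatus", "Cause": "Routed to the failure branch" }
  }
}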
So how do you develop one on Step Functions?
There are two options available.
The low-code or no-code option is going to the AWS console and using the
easy-to-use drag-and-drop Workflow Studio,
where you can just drag in the different components that you
need for your workflow, integrate with the different services you
need, and you get the workflow.
The other option, for more technical users, is to use Amazon States
Language, where you can basically use a JSON-like syntax to define the
states, define the inputs and outputs and the behavior, and control the flow of logic.
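As a minimal sketch of that JSON-like syntax, here is a single-state machine showing how the standard InputPath, ResultPath, and OutputPath fields control a state's inputs and outputs; the payload shape and function ARN are made up for illustration.

{
  "Comment": "Select $.order as input, attach the task result under $.pricing",
  "StartAt": "PriceOrder",
  "States": {
    "PriceOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:PriceOrder",
      "InputPath": "$.order",
      "ResultPath": "$.pricing",
      "OutputPath": "$",
      "End": true
    }
  }
}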
What are some of the use cases for AWS Step Functions?
As we have been discussing in this talk, we can use it to orchestrate microservices,
allowing the breakdown of a complex application into smaller independent
services that can be developed, tested, and deployed independently.
It can be used for data processing:
Step Functions can be used to process large volumes of data or perform tasks
that need to be done periodically.
It offers easy integration with AWS Glue, for example.
We can use it for machine learning use cases.
Step Functions can be used to build and manage data pipelines,
allowing you to move data
between different sources and destinations in a reliable and scalable manner.
It has integration with Amazon Bedrock,
and you can use that to build machine learning use cases.
You can build event-driven architectures.
You have features in AWS Step Functions to perform async steps, which can be
useful to build event-driven architectures.
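For the data processing use case, here is a hedged sketch of a Map state that fans out over a batch of records with bounded concurrency; the item shape and function ARN are assumptions.

{
  "Comment": "Process each record in $.records in parallel, at most 10 at a time",
  "StartAt": "ProcessRecords",
  "States": {
    "ProcessRecords": {
      "Type": "Map",
      "ItemsPath": "$.records",
      "MaxConcurrency": 10,
      "Iterator": {
        "StartAt": "TransformRecord",
        "States": {
          "TransformRecord": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:TransformRecord",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}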
There are multiple happy customers of AWS Step Functions.
I have named a few here.
You can find the information on how they use it and what they were
able to achieve and how highly they speak about it by going to
the website for AWS Step Functions.
This brings us to the final part of today's talk: best practices that we need
to keep in mind when designing systems
on AWS Step Functions.
Let's look at how we can design for scale and performance.
The first thing we need to decide is whether we want to use standard
workflows or express workflows.
Standard workflows are workflows that can run for up to a year.
These guarantee exactly-once execution.
These are charged on the number of state transitions
and are better suited for non-idempotent, long-running workflows.
For instance, think about long-running executions where there's
a human-in-the-loop approval required.
It can be approved today, tomorrow, maybe three days down the line.
So you would like to use a standard workflow there.
Also, think about idempotency.
Is it okay if the same step is executed twice?
Maybe, maybe not.
Think about, maybe, a payment workflow.
You want to make sure that you can track the payment made and everything
that happened with that transaction, and you don't want to run it again
without your knowledge.
So that's somewhere you would use a standard workflow.
Then we have express workflows.
These are the comparatively newer option available,
and these workflows are limited to five minutes.
They guarantee at-least-once execution for async and at-most-once for sync workflows.
The cost is pretty low, $1 for a million executions.
And they can be used to perform high-volume processing workloads,
with TPS allowed in the thousands.
Think about using these when you have something that is idempotent and
can be executed and finished quickly.
Or maybe use both.
Standard workflows can act as parent workflows to invoke
express workflows synchronously.
Keep in mind, the reverse is not true:
standard workflows can be parents for express workflows, but not the other way around.
This method of designing the workflow combines the strengths of both workflow
types: it offers a reliable workflow while maintaining cost efficiency
and performance optimization.
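A sketch of that parent-child pattern: the standard parent calls the startExecution.sync:2 service integration to run an express child synchronously and wait for its output; the child state machine name is hypothetical.

{
  "Comment": "Standard parent invokes an Express child synchronously",
  "StartAt": "RunExpressChild",
  "States": {
    "RunExpressChild": {
      "Type": "Task",
      "Resource": "arn:aws:states:::states:startExecution.sync:2",
      "Parameters": {
        "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:HighVolumeExpressChild",
        "Input": {
          "orderId.$": "$.orderId"
        }
      },
      "End": true
    }
  }
}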
The next thing for performance and scale is
doing the right service integration.
Consider Lambda when you have to run a large number of tasks
in parallel or concurrently.
But think about Fargate when you have something that runs
for more than 15 minutes, which is Lambda's maximum execution time.
Think about containerized solutions like ECS or EKS if you want
more control on the containers.
All Lambda functions in a step function must be designed to be idempotent.
Lambda function names should not be specified explicitly:
the names can have a prefix assigned to them when CloudFormation deploys them,
so you need to be aware of that.
Version control is crucial for both Lambdas and step function definitions.
For DynamoDB interactions, use optimistic locking,
transactions, or conditional writes to handle race conditions.
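As a sketch of the conditional-write option through the direct DynamoDB integration, assuming a hypothetical Payments table keyed on paymentId: the write succeeds only if the item does not already exist, which guards against a duplicate retry.

{
  "Comment": "Record the payment only if it has not been recorded before",
  "StartAt": "RecordPayment",
  "States": {
    "RecordPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "Payments",
        "Item": {
          "paymentId": { "S.$": "$.paymentId" }
        },
        "ConditionExpression": "attribute_not_exists(paymentId)"
      },
      "End": true
    }
  }
}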
Moving on.
Beware of timeouts.
Amazon States Language doesn't specify a timeout for state machine definitions,
which means the workflow that you develop can be waiting in the
same state for an indefinite time, without failing or
without giving any notification to you.
For callbacks with a task token,
the method that is used for async invocations,
use heartbeats: add the HeartbeatSeconds field in the task state definition.
Retry is an error handling option.
Exceptions should be categorized into retriable exceptions, like an SQS
dependency exception, and non-retriable exceptions, like a null pointer exception,
to simplify the step function graph.
When configuring dependencies, always set timeout and retry policies,
especially when connecting to other services like CloudWatch.
Proactively handle transient Lambda exceptions in your state machine
to retry invoking your Lambda functions or to catch the error.
Monitor and optimize.
The key to scale and performance is to monitor your systems,
and the AWS CloudWatch service
is used to monitor the performance of Step Functions workflows.
This will help you identify any bottlenecks or issues that may be
impacting performance and allow you to take corrective actions as needed.
We'll look at some of the CloudWatch metrics down the line.
The next best practice is about security.
For ensuring the security of your step functions, use IAM roles for tasks.
Encrypt sensitive data, both at rest and in transit.
Use CloudTrail to monitor step functions.
Use resource-level permissions
to make sure the right services have access to the right resources.
Enable CloudWatch logging to debug the step functions and identify what
went wrong or what can be made better.
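As a sketch of resource-level permissions, here is an IAM policy statement that lets a caller start executions of one specific state machine and nothing else; the account, region, and state machine name are placeholders.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StartOnlyTheOrderWorkflow",
      "Effect": "Allow",
      "Action": ["states:StartExecution"],
      "Resource": "arn:aws:states:us-east-1:123456789012:stateMachine:OrderWorkflow"
    }
  ]
}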
Next in best practices, we'll talk about operational excellence.
When you have to pass a large payload to your step function, think about
using S3 as storage and passing the S3 ARNs instead of passing the large
payloads directly in the step functions.
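A sketch of that pattern: the state passes only the bucket and key, and the Lambda function (hypothetical here) is expected to fetch the object from S3 itself.

{
  "Comment": "Pass a pointer to the payload in S3, not the payload itself",
  "StartAt": "ProcessLargeFile",
  "States": {
    "ProcessLargeFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessLargeFile",
      "Parameters": {
        "bucket": "my-payload-bucket",
        "key.$": "$.objectKey"
      },
      "End": true
    }
  }
}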
Use CloudWatch for monitoring.
There are a few key metrics
available in CloudWatch that can help you monitor your step functions better;
use them to tune the performance of your step functions.
These are state transitions, throttled state transitions, execution duration
(especially the execution duration can be used to tune the performance
of the step functions), throttled execution starts, and task failures.
Next, let's look at reliability best practices.
Some of the best practices for reliability are: handle timeouts gracefully.
Beware of the event history quota:
there is a limit to how many events can be
recorded in a step function execution's history,
and if you are about to go over the quota, start a new workflow execution.
Use retries and error handling wherever possible, like we talked about.
Use idempotent tasks.
Use CloudWatch alarms to monitor your load and fine-tune your step functions.
Use CloudTrail for logging.
And finally, test your workflows.
There is a great feature available where you can test each step in your
step function individually with ease.
Make use of those features.
Next, we'll talk about some of the cost optimization best practices.
Like we talked about earlier: standard versus express workflows.
Think about the business use case.
Think about whether it is long-running and non-idempotent, or short-running and
needing high throughput, or maybe a mix of both.
Carefully organizing the workflow properly will help you save cost.
Monitor and optimize the usage.
This will also help you improve on cost.
Use tagging for cost allocation.
Tagging is another way of keeping track of who is using the resource
and how much of the resource is being used.
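As a sketch of tagging at deployment time, here is a CloudFormation state machine resource carrying cost-allocation tags; the role ARN and tag values are placeholders, and the definition body is elided.

{
  "OrderWorkflow": {
    "Type": "AWS::StepFunctions::StateMachine",
    "Properties": {
      "StateMachineType": "EXPRESS",
      "RoleArn": "arn:aws:iam::123456789012:role/OrderWorkflowRole",
      "DefinitionString": "{ ... }",
      "Tags": [
        { "Key": "team", "Value": "payments" },
        { "Key": "cost-center", "Value": "cc-1234" }
      ]
    }
  }
}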
With that, I conclude this talk.
If you agree with what I said, that's great.
If you don't agree, feel free to connect with me or reach out to
me and we can discuss further.
Ping me if you want to have a chat over coffee or discuss
something related to technology.
You can email me at the email provided here or you can
reach out to me on LinkedIn.
Looking forward to hearing from some of you, maybe more of you.
Thank you for staying with me and listening to me and helping
me deliver this talk.
With this, I will end the talk.
Thank you.