Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Serverless Workflow Orchestration on AWS

Abstract

The talk will cover microservice orchestration, focusing on designing systems for performance and scalability. We’ll discuss best practices for developing complex workflows with tools like AWS Step Functions and Lambda. Join me in unlocking the potential of serverless workflow orchestration.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. How is it going today? We are here for the talk on serverless workflow orchestration on AWS, or Amazon Web Services as we know it. Before I begin, let's start with a small introduction about myself. I am Bharat Vishal Tiwary, currently an SDE II at Amazon. I have over 12 years of experience in the software development industry and two master's degrees, in electrical engineering and in computer science, from Arizona State University. I recently published an article on techtimes.com on a related topic; the link is at the bottom of the slide, so do feel free to take a look when you get a chance. I am someone who likes to stay updated with the latest tech trends and talk about microservices, DevOps, machine learning, Gen AI, you name it. If you are someone who likes to talk about these subjects, definitely feel free to reach out and connect with me on LinkedIn, and I would love to discuss these topics with you.

Let's get started with the talk then: Serverless Workflow Orchestration on Amazon Web Services. Let's begin with a quote, like many other talks: "By orchestrating services, businesses can unlock agility, quickly adapt to changing customer needs, and deliver innovative solutions faster than ever before." How, you ask? I'm sure that after this talk you will be in a position to answer that question much better than you are now.

So here is today's agenda. We start by looking at some of the underlying concepts, breaking the title down word by word and looking at each part individually. Then we move on to orchestration in AWS, how it's done, and the services currently available. Finally, we conclude with best practices to keep in mind when orchestrating on AWS; these are the things that will help us design resilient systems and shape the future.

Let's get started with the concepts. What is serverless workflow orchestration? Breaking it down word by word, we start with serverless. Serverless is where we build and run applications without thinking about servers. But what is serverless exactly, and why do we need it? Serverless computing is a cloud computing execution model where the cloud provider dynamically manages all server resources. What that means is less worry for you about how to provision and maintain a server. Physical servers are still used; it's not that we are getting rid of servers, but they are abstracted away from the developers. Unlike earlier times, when people used to estimate capacity, order hardware, deploy their services on it, and scale it for peak customer traffic, these concerns are abstracted away from you, so you can focus more on development. The main value proposition is focusing on business outcomes while abstracting away the mechanics of computing. You don't need to worry about where or how your code is running. You just say you want a certain amount of compute, you hand over code to run, and it's done.

What are the benefits? Benefits include pay per use: you pay for what you use, for the compute, for the execution time of your function. You get automatic scaling out of the box, so you don't have to worry about peak times or spikes in customer traffic; you can configure scaling beforehand, and the platform takes care of scaling the hardware for you.
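To make the "you hand over code to run" idea concrete, here is a minimal sketch of a Lambda-style handler in TypeScript; the event shape and the order-processing logic are illustrative assumptions, not something from the talk.

```typescript
// Minimal AWS Lambda handler: you supply only this function, and the platform
// provisions, runs, and scales the compute. The event shape (a hypothetical
// order payload) and the business logic are made up for illustration.
export const handler = async (event: { orderId: string }) => {
  console.log(`processing order ${event.orderId}`);
  // Business logic only: no provisioning, patching, or scaling code anywhere.
  return { statusCode: 200, body: JSON.stringify({ processed: event.orderId }) };
};
```

This single function is the whole deployable unit; where it runs and how many copies run is the platform's problem.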
All of this means reduced operations and infrastructure cost: less worrying about operations and infrastructure, and less money spent there, because we pay for what we use. It also means, for startups or new ideas, faster time to market, because you don't have to wait for server provisioning, hardware ordering and delivery, and making it all ready for your software. You worry about the functions you want to write, the features you are going to launch, the solution you want to deliver and the problem you want to solve, and serverless enables you to reach the market faster.

Let's look at some of the AWS serverless services available today. There is a link at the bottom if you want to go to the official documentation, but some of the common ones are: CloudFront, the CDN from Amazon Web Services; Route 53, the DNS hosting service; API Gateway, the entry point for your application on the cloud; and VPC, your personal piece of the cloud carved out for you, in which your systems reside. There are services for application and mobile development, like Amplify and AppSync. There are services that help you orchestrate or choreograph your logic: Step Functions, EventBridge, SQS, SNS. For data you have DynamoDB and S3. These are some of the most famous names in the serverless world. Then you have compute in the form of Lambda, where you hand over your code and run it as functions, and Fargate, where you can run your containerized applications without worrying about the infrastructure. You have identity management with Cognito, where you can store your users' data and authenticate them. And there are other operational and development tools and related services. It's a lot for one slide, but if you have any questions about any of these services or would like to discuss them in detail, do feel free to reach out and I'll be happy to talk about them with you.

Next, let's look at what workflows are. A workflow is a sequence of tasks that are part of a larger process or goal: a series of actions that accomplish a particular task, serving as a fundamental unit of work. Let's take an example: making coffee. You take the coffee powder, brew the coffee, heat milk, add the milk to your coffee, maybe add some sugar, and have the coffee. This process of making coffee is a workflow, and the different steps you took, taking the coffee powder, brewing it, heating the milk, are the different tasks or steps in your workflow. Workflows are designed to simplify and automate tasks by combining multiple actions into a coherent sequence. They can be either manual or automated, and they are dynamic in nature, with different paths taken based on previous steps or results. In various contexts, workflows serve different purposes: they can be used for an ETL job, for CI/CD automation, for implementing functionality for an e-commerce website, and so on.
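To tie the workflow definition above to code, here is a toy TypeScript sketch of the coffee example; the task functions and their outputs are, of course, made up.

```typescript
// Toy sketch of the coffee workflow: each function is a task, and the
// workflow is the ordered sequence that chains the tasks together.
async function brewCoffee(powder: string): Promise<string> {
  return `coffee brewed from ${powder}`; // task: brew the coffee powder
}

async function heatMilk(): Promise<string> {
  return "hot milk"; // task: heat the milk
}

async function makeCoffee(): Promise<string> {
  const coffee = await brewCoffee("arabica"); // step 1: take powder and brew
  const milk = await heatMilk();              // step 2: heat the milk
  return `${coffee} + ${milk} + sugar`;       // step 3: combine, sweeten, serve
}

makeCoffee().then(console.log);
```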
Next, let's look at what microservices are. Microservices is a software architectural approach that structures applications as a collection of small, independent services that communicate over well-defined APIs. Each service runs in its own process and focuses on doing one thing well, which keeps services simple and granular.

What are the key characteristics of microservices? Microservices allow autonomous operation and technology diversity: each microservice can be built using a different technology, and each has its own independent database. What this means is that teams can operate independently, using the "you build it, you run it" DevOps model.

There are common scenarios where you have to enable communication between different microservices, and if you think about it, every microservice talking to every other microservice means a lot of communication. So, to put some order in the chaos, there are two common patterns for microservices. One is called orchestration; the other is called choreography.

Orchestration is where a central service acts as a brain to coordinate the logic. Take an ordering scenario: if a customer places an order, we need to notify the customer, prepare the order, and maybe generate some metrics. With the orchestration approach, a central service takes care of calling a notification service to send the notification, a preparation service to prepare the order, and an analytics service to generate the metrics. Then we have choreography, where each service acts autonomously. In this pattern there is an event broker in between, with the different microservices acting on the events. This is often known as event-driven design as well.

Let's compare the two and see the differences. In orchestration, the control is explicit and managed by the orchestrator. Think about an orchestra, where different instruments are playing and there's a person standing in front directing them: that person is the orchestrator, and the model is known as orchestration. In choreography, control is implicit and managed by the individual services. In orchestration, all the services communicate directly with the orchestrating service, the central brain, while in choreography the communication is event-based. Orchestration is simpler for defining workflows; choreography has more complex interactions but simpler service autonomy. Orchestration allows centralized error handling, whereas choreography requires distributed error handling. Orchestration can be considered less flexible due to the central control, whereas choreography is highly flexible and adaptive. There are scenarios where each is useful; orchestration especially shines when you want a clear view of what happened and when it happened, a clear view of the flow of things. That's possible in choreography too, but you have to adopt more complex monitoring for it.

Today's talk is about orchestration, so let's move on and look at the different use cases. Whenever you can think of a workflow, you can think of orchestration. For example, you raise an IT service request, someone approves the request, and then the software is installed or the service is delivered; this is an example of where orchestration can be used. In compliance, many industries have rigorous legal requirements; orchestration can automate the needed data collection from multiple departments, generate templated reports, and make sure they get filed with the right people. Similarly, we can imagine orchestration in employee onboarding and offboarding, and in software development as well.
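Before we turn to the AWS specifics, here is a hedged TypeScript sketch contrasting the two patterns for the ordering scenario above; the service stubs and the in-memory subscriber list are hypothetical stand-ins for real services and a real event broker.

```typescript
// Hypothetical service stubs for the ordering scenario.
interface Order { id: string }

async function notifyCustomer(order: Order): Promise<void> { /* notification service */ }
async function prepareOrder(order: Order): Promise<void> { /* preparation service */ }
async function recordMetrics(order: Order): Promise<void> { /* analytics service */ }

// Orchestration: one central coordinator explicitly calls each service, so
// the whole flow (and its error handling) is visible in one place.
async function placeOrderOrchestrated(order: Order): Promise<void> {
  await notifyCustomer(order);
  await prepareOrder(order);
  await recordMetrics(order);
}

// Choreography: the producer only publishes an event; each service subscribes
// and reacts on its own. A plain array stands in for a real event broker.
type OrderPlacedHandler = (order: Order) => Promise<void>;
const subscribers: OrderPlacedHandler[] = [notifyCustomer, prepareOrder, recordMetrics];

async function placeOrderChoreographed(order: Order): Promise<void> {
  // "OrderPlaced" fan-out: no service knows about the others.
  await Promise.all(subscribers.map((handle) => handle(order)));
}
```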
This brings us to the second part of the talk: orchestration in AWS with AWS Step Functions. Before we move on to Step Functions, let's look at the top orchestration tools used in the market today. There is Apache Airflow, an open-source tool for scheduling and monitoring workflows, originally developed at Airbnb; it uses directed acyclic graphs to manage complex data pipelines effectively. Then we have AWS Step Functions, which we will be looking into over the next slides: a serverless orchestration service that lets you combine AWS services to build and scale distributed applications using state machines. We have Google Workflows, a powerful orchestration service from Google Cloud, and Microsoft Power Automate, another offering from Microsoft. Then there are others like Dagster and Argo, which are also very useful in their own fields.

So let's talk about why we should use AWS Step Functions. Step Functions allows low-code or no-code workflows to be created using Workflow Studio in the console, which we'll look into on the next slide. It is highly scalable: the solution can easily scale to meet the demands of enterprise-level applications and workflows. It's reliable, built on the dependable, battle-tested infrastructure of AWS that is widely used today, and it provides high availability and fault tolerance for orchestrated workflows. It offers flexibility: developers can create workflow logic using familiar programming patterns and seamlessly integrate with the various AWS tools and services they already use. It's a cost-effective solution: by carefully choosing the type of Step Functions workflow, it can be pretty economical. And another benefit that comes with it is AWS CDK: with AWS CDK, deployment can be made much simpler by writing infrastructure as code, linting it, and generating CloudFormation templates with easy validation before deployment.

So what does an AWS Step Functions workflow look like? On the right you see an example of a Step Functions graph. With Step Functions you create workflows to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines. A few key concepts in AWS Step Functions are executions, tasks, and activities; an execution is each instance of the workflow that runs. In the Step Functions console you can visualize, edit, and debug your application's workflow. You can examine the state of each step to make sure your application runs in order and as expected, you can retry the workflow from the state where it failed for any reason, and a lot more.

So what are the components of an AWS Step Functions workflow? There is a basic request-response component, where we call service B and expect a response from it. There are decision components, where you need to decide whether to go to state C or state D. There is a retry component: for example, if there is a retryable failure, you might want your workflow to automatically retry the task, maybe immediately, maybe with a backoff; that's where you use the retry component. There can be a requirement to add a human in the loop: in the IT services use case we talked about, between raising a request and getting it fulfilled, there might be a need for a human who approves the request.
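As a rough illustration of some of these components, here is a sketch using AWS CDK v2 in TypeScript that wires a request-response task with a retry policy into a decision state; the construct names, the inline function body, and the retry values are all illustrative assumptions.

```typescript
import * as cdk from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

const app = new cdk.App();
const stack = new cdk.Stack(app, "ComponentsStack");

// A stand-in for "service B": a tiny inline Lambda function.
const serviceB = new lambda.Function(stack, "ServiceB", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "index.handler",
  code: lambda.Code.fromInline("exports.handler = async () => ({ route: 'c' });"),
});

// Request-response component: call service B and wait for its reply.
const callServiceB = new tasks.LambdaInvoke(stack, "CallServiceB", {
  lambdaFunction: serviceB,
  outputPath: "$.Payload", // unwrap the Lambda response
});

// Retry component: retry transient Lambda errors with exponential backoff.
callServiceB.addRetry({
  errors: ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
  interval: cdk.Duration.seconds(2),
  maxAttempts: 3,
  backoffRate: 2,
});

// Decision component: branch to state C or state D based on the reply.
const decide = new sfn.Choice(stack, "Decide")
  .when(sfn.Condition.stringEquals("$.route", "c"), new sfn.Pass(stack, "StateC"))
  .otherwise(new sfn.Pass(stack, "StateD"));

new sfn.StateMachine(stack, "ComponentsWorkflow", {
  definitionBody: sfn.DefinitionBody.fromChainable(callServiceB.next(decide)),
});
```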
There is also the need to process data in parallel, and there is a component that allows us to achieve that. And finally, we can process data dynamically with map-style operations. All these components put together give us an AWS Step Functions workflow that can help us implement our business logic.

So how do you develop a workflow on Step Functions? There are two options available. The low-code or no-code option is to go to the AWS console and use the easy drag-and-drop Workflow Studio, where you can just drag in the different components you need, integrate with the different services you need, and you get your workflow. The other option, for more technical users, is to use Amazon States Language, where you use a JSON-like syntax to define the states, define the inputs, outputs, and behavior, and control the flow of logic.

What are some of the use cases for AWS Step Functions? As we are discussing in this talk, we can use it to orchestrate microservices, allowing the breakdown of a complex application into smaller independent services that can be developed, tested, and deployed independently. It can be used for data processing: Step Functions can process large volumes of data or perform tasks that need to be done periodically, and it offers easy integration with AWS Glue, for example. We can use it for machine learning use cases: Step Functions can be used to build and manage data pipelines, allowing you to move data between different sources and destinations in a reliable and scalable manner, and it has integration with Amazon Bedrock, which you can use to build machine learning use cases. You can build event-driven architectures: Step Functions has features to perform asynchronous steps, which can be useful for event-driven architecture. There are multiple happy customers of AWS Step Functions; I have named a few here. You can find information on how they use it, what they were able to achieve, and how highly they speak of it on the AWS Step Functions website.

This brings us to the final part of today's talk: best practices to keep in mind when designing systems on AWS Step Functions. Let's look at how we can design for scale and performance. The first thing to decide is whether we want to use standard workflows or express workflows. Standard workflows can run for up to a year. They guarantee exactly-once execution, are charged by the number of state transitions, and are better suited for non-idempotent, long-running workflows. For instance, think about long-running executions where a human-in-the-loop approval is required: it can be approved today, tomorrow, maybe three days down the line, so you would use a standard workflow there. Also think about idempotency: is it okay if the same step is executed twice? Maybe, maybe not. Think about a payment workflow: you want to be able to track the payment made and everything that happened with that transaction, and you don't want it to run again without your knowledge. That's somewhere you would use a standard workflow. Then we have express workflows, the comparatively newer option. These workflows are limited to five minutes, and they guarantee at-least-once execution for asynchronous invocations and at-most-once for synchronous ones. The cost is pretty low, about $1 per million executions, and they can be used for high-volume processing workloads, with TPS allowed in the thousands.
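The workflow type is chosen when the state machine is created. Here is a minimal CDK sketch of that choice, assuming the same toy Amazon States Language definition backs both machines; every name is made up.

```typescript
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";

const app = new cdk.App();
const stack = new cdk.Stack(app, "WorkflowTypesStack");

// A toy Amazon States Language definition, inlined as a JSON string.
const asl = JSON.stringify({
  StartAt: "DoWork",
  States: { DoWork: { Type: "Pass", End: true } },
});

// Express: up to 5 minutes, at-least-once (async), built for high throughput.
new sfn.StateMachine(stack, "HighVolumePath", {
  stateMachineType: sfn.StateMachineType.EXPRESS,
  definitionBody: sfn.DefinitionBody.fromString(asl),
});

// Standard: up to a year, exactly-once, suited to approvals and payments.
new sfn.StateMachine(stack, "LongRunningPath", {
  stateMachineType: sfn.StateMachineType.STANDARD,
  definitionBody: sfn.DefinitionBody.fromString(asl),
});
```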
Think about using express workflows when you have something that is idempotent and can be executed and finished quickly. Or maybe use both: standard workflows can act as parent workflows and invoke express workflows synchronously. Keep in mind that the reverse is not true; a standard workflow can invoke an express workflow, but not the other way around. This way of designing the workflow combines the strengths of both workflow types: it offers a reliable workflow while maintaining cost efficiency and performance.

The next thing for performance and scale is doing the right service integrations. Consider Lambda when you have to run a large number of tasks in parallel or concurrently, but think about Fargate when you have something that runs for more than 15 minutes, and about containerized solutions like ECS or EKS if you want more control over the containers. All Lambda functions in a Step Functions workflow should be designed to be idempotent. Lambda function names should not be specified explicitly: the names can have prefixes assigned to them when CloudFormation deploys them, so you need to be aware of that. Version control is crucial for both Lambdas and Step Functions definitions. For DynamoDB interactions, use optimistic locking, transactions, or conditional writes to handle race conditions.

Moving on: beware of timeouts. Amazon States Language doesn't specify a default timeout in state machine definitions, which means the workflow you develop can wait in the same state indefinitely, without failing and without giving you any notification. For callbacks with a task token, the method used for asynchronous invocations, use heartbeats and add the HeartbeatSeconds field to the task state definition. Retry is an error-handling option: exceptions should be categorized into retriable exceptions, like an SQS dependency exception, and non-retriable exceptions, like a null pointer exception, to simplify the Step Functions graph. When configuring dependencies, always set timeout and retry policies, especially when connecting to other services like CloudWatch. Proactively handle transient Lambda exceptions in your state machine, either retrying the Lambda invocation or catching the error.

Monitor and optimize. The key to scale and performance is monitoring your systems, and the AWS CloudWatch service is used to monitor the performance of Step Functions workflows. This will help you identify any bottlenecks or issues that may be impacting performance and allow you to take corrective action as needed. We'll look at some of the CloudWatch metrics shortly.

The next best practice is security. To secure your Step Functions, use IAM roles for tasks. Encrypt sensitive data, both at rest and in transit. Use CloudTrail to monitor Step Functions, and use resource-level permissions to make sure the right services have access to the right resources. Enable CloudWatch logging to debug the Step Functions and identify what went wrong or what can be made better.

Next in best practices, we'll talk about operational excellence. When you have to pass a large payload through your Step Functions workflow, think about using S3 as storage and passing the S3 ARNs instead of passing the large payloads directly between states.
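A hedged sketch of that payload-offloading advice: a Lambda handler that writes the large payload to S3 and returns only a reference for the next state to consume. The bucket name, key scheme, and event shape are illustrative assumptions.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "my-workflow-payloads"; // hypothetical bucket name

// Instead of returning a large payload to the state machine, store it in S3
// and return only a reference; downstream states fetch the object themselves.
export const handler = async (event: { orderId: string; payload: object }) => {
  const key = `payloads/${event.orderId}.json`;
  await s3.send(
    new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: JSON.stringify(event.payload),
    })
  );
  return { payloadLocation: { bucket: BUCKET, key } };
};
```

This keeps each state's input and output small, which matters because Step Functions limits the size of the data passed between states.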
Use CloudWatch for monitoring: there are a few key metrics available that can help you monitor your Step Functions better and even tune their performance. These are state transitions, throttled state transitions, execution duration (the execution duration especially can be used to tune the performance of your Step Functions), throttled execution starts, and task failures.

Next, let's look at reliability best practices. Handle timeouts gracefully. Beware of the execution event history quota: there is a limit to how many events a Step Functions execution can record, and if you are about to go over the quota, start a new workflow execution. Use retries and error handling wherever possible, like we talked about. Use idempotent tasks. Use CloudWatch alarms to monitor your load and fine-tune your Step Functions. Use CloudTrail for logging. And finally, test your workflows: there is a great feature that lets you test each step of your Step Functions workflow individually with ease, so make use of it.

Next, some cost optimization best practices. As we discussed earlier with standard versus express workflows, think about the business use case: is it long running and non-idempotent, or short running and needing high throughput, or maybe a mix of both? Organizing the workflow properly will help you save cost. Monitor and optimize the usage; this will also help you improve on cost. And use tagging for cost allocation: tagging is another way of keeping track of who is using a resource and how much of it is being used.

With that, I conclude this talk. If you agree with what I said, that's great. If you don't agree, feel free to connect with me or reach out and we can discuss further. Ping me if you want to have a chat over coffee or discuss something related to technology. You can email me at the address provided here, or you can reach out to me on LinkedIn. Looking forward to hearing from some of you, maybe more of you. Thank you for staying with me, listening, and helping me deliver this talk. With this, I will end the talk. Thank you.
...

Bharat Vishal Tiwary

Software Development Engineer @ Amazon

Bharat Vishal Tiwary's LinkedIn account


