Conf42 Kube Native 2024 - Online

- premiere 5PM GMT

Leading High-Impact Engineering Launches

Abstract

In this talk, I outline what I learned about leading critical, high-impact engineering launches, based on my experience captaining the launch of paid sharing at Netflix.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Thanks for joining me today at Conf42 Kube Native 2024. My name is Ramneet, and I will be delivering a talk on leading high-impact launches: strategies for mastering critical engineering rollouts. This talk is based on my learnings from leading major initiatives such as paid sharing at Netflix. I will dive into the roles and responsibilities of a launch captain and share specific strategies and tactics that you can use to master highly critical launches in your own organization. So let's get started.
Now, what do we mean by a critical launch here? Every launch is different, and some require a lot more rigor than others. Such launches typically share some common characteristics. One is large-scale user and revenue impact: the launch is going to affect a large proportion of your customer base and has high revenue implications. This also means that the launch could be high visibility, that is, it is tracked for a long time and very closely by senior leadership, which makes it even more important to communicate clearly when seeking input and sharing updates. Such launches may also be a huge cross-functional effort: they involve an army of cross-functional teams coming together to deliver, which brings coordination as a skill to the forefront.
Now, you may think that this can be a responsibility managed by a couple of key stakeholders. But that may not be the right approach, and it can result in diffusion of responsibility, or something known as the bystander effect, where people are less likely to take action when other people are involved. To avoid this, I would highly encourage you to consider a dedicated role: that of a launch captain.
So let's talk about the responsibilities. There are two primary responsibilities for a captain. One is to ensure launch readiness and smooth execution. The other is to be the focal point of communication for all cross-functional partners and leadership.
Now you must be thinking: who should be the launch captain? In a team that's filled with engineers, data scientists, designers, and other cross-functional team members, who is best equipped to execute this role? At Netflix, we typically have senior engineers play this role. But it can practically be anyone on the team who is close to the solution, although there are definitely some must-haves that help ensure your success in this role. The first is that you should clearly understand the product feature, the intended business impact, and the engineering systems involved. You should be able to look at the launch from different perspectives: that of a user, a customer service professional, an engineer, a data scientist, and so on. You should be able to visualize the launch to identify the expected trends and the not-so-expected trends, that is, failures and risks. And you should be a great collaborator, because ultimately you will have to lead the cross-functional team towards a common goal of smooth execution and success.
Now let's go a little deeper into the foundation. The way I see it, launch captaincy has three supporting pillars for success. The first is strategy. A launch captain starts playing an active role in the weeks leading up to the launch. While the entire team is heads down on delivering the right solution, making sure it's scalable and testing it, the launch captain is looking further ahead, visualizing and preparing for the launch. As part of this, the captain drafts a launch plan that captures the details of how the launch should be executed and identifies metrics that will signal how the launch is progressing. It is always good to develop a mental model of what the metrics should look like in a successful launch. Identifying potential risks and mitigations is an important part of this role and ensures minimal surprises on the day of the launch. The last part is about reviewing functional and operational readiness.
Functional readiness means the solution matches the product specifications and there are no blockers. It is in your best interest as a launch captain to keep tabs on functional readiness and hold regular checkpoints to flag risks. Last but not least is reviewing operational readiness. Operational readiness is important because, in a way, it determines your effectiveness in navigating the launch. It ensures that you have the right metrics and dashboards in place to assess how the launch is progressing and to relay the status with confidence to the larger group and the leadership.
The other two pillars are collaboration and communication. In projects such as these, there will be a huge cross-functional team, and you will have to work with all of them to ensure readiness on all fronts. For example, one of the user-facing features could be one where you expect users to reach out to customer support. In that case, customer service knowledge base articles and FAQs should be updated in time for the launch. And if the launch is sensitive in nature, you will want to coordinate these actions very closely. This underlines the importance of collaboration. As part of this, you also need to identify key stakeholders and their roles in supporting the launch. The launch captain alone cannot navigate the entire launch and will always defer to domain and system experts to weigh in on issues or validations.
The last pillar is communication. As I mentioned, these launches can be closely tracked by senior leadership and can have larger implications for other business outcomes. That is, the stakes are high. Clear and timely communication is key to ensuring we cover the risks and keep the group updated and empowered to help where needed. Launch rooms for such launches can have many teams, and navigating those discussions while keeping the group focused on the outcome becomes extremely important.
Now, having shared all the context and responsibilities, this role may seem daunting to some. In the next section, I'm going to talk about specific steps you can take to master each phase of the launch, and we will start with pre-launch.
The first thing you can do in the pre-launch phase is run readiness checks across teams and systems. This could be a simple asynchronous check-in on Slack, or any other collaboration tool that you use, where every team reports its status to indicate whether it is on track for the launch. Teams that are not on track should be asked for the specific steps that need to be taken, or the issues that need to be resolved, for them to get back on track. These blocking issues could be related to dependencies on other teams or to bandwidth constraints. The intent here is to identify and surface the risks up front. It also informs and empowers the larger group and the leadership to help unblock the launch whenever needed.
The second part is reviewing operational readiness. As I said before, this is important to ensure you have all the right signals and metrics in place to navigate the launch. Identify key engineering and product stakeholders who will be the points of contact for the health and readiness of their respective teams and domains. These stakeholders will also play a key role in ensuring that the launch captain has all the required metrics and signals available to monitor the launch. Collaborate with these people to identify the relevant metrics, map them to a launch dashboard, and reach a consensus on the expected trends. The dashboards are important and should answer all the expected frequently asked questions, such as how many users signed up or how many users bought an item, depending on your use case. These launch dashboards should have a high-level view and troubleshooting views that allow more detailed debugging of issues.
The next part of the pre-launch phase is to review the launch logistics.
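As an illustration of the asynchronous readiness check described above, the following is a minimal sketch of how team check-ins could be rolled up into one summary for the launch captain. Everything in it is hypothetical: the `Status` values, the `ReadinessReport` fields, the `summarize` function, and the team names are assumptions made for the example, not something shown in the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ON_TRACK = "on_track"
    AT_RISK = "at_risk"
    BLOCKED = "blocked"


@dataclass
class ReadinessReport:
    team: str
    status: Status
    # Steps or issues that must be resolved for the team to be back on track.
    blockers: list[str] = field(default_factory=list)


def summarize(reports: list[ReadinessReport], expected_teams: set[str]) -> dict:
    """Roll individual check-ins up into a summary the captain can share."""
    reported = {r.team for r in reports}
    not_on_track = [r for r in reports if r.status is not Status.ON_TRACK]
    missing = sorted(expected_teams - reported)
    return {
        # Ready only when every expected team reported and all are on track.
        "ready": not not_on_track and not missing,
        "missing_reports": missing,
        "open_blockers": {r.team: r.blockers for r in not_on_track},
    }


# Hypothetical check-in: one team is at risk and one has not reported yet.
reports = [
    ReadinessReport("payments", Status.ON_TRACK),
    ReadinessReport("customer-support", Status.AT_RISK, ["FAQ articles not yet updated"]),
]
summary = summarize(reports, {"payments", "customer-support", "data-science"})
```

Keeping missing reports separate from open blockers reflects the intent of surfacing risks up front: a team that has not reported is a different kind of risk from a team that has reported a known blocker.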
And the main part of launch logistics is a launch runbook. The launch runbook is the definitive guide, or source of truth, for executing the launch. All the pre-launch preparation and the work that you do ultimately gets captured in this runbook. It includes the sequence of steps needed to execute the launch, along with all the other details. I would highly encourage you to document everything in the runbook, from minor steps to reminders. Critical launches get intense, so always assume you will not recall all the minor details at the right time. Add them to your runbook. We will cover the runbook in detail in the coming slides. Now, while the launch runbook is primarily drafted by the launch captain using inputs from other team members, it is good practice to review the final runbook with the extended team and ensure alignment and a shared understanding of the overall sequence, as well as each individual's roles and responsibilities.
The other part of launch logistics is to plan for the launch day. You should set up calendar invites for the launch day and for the launch check-ins, clearly marking the individuals who are required to be present and to share the metrics that they are responsible for.
So let's talk a little bit about the runbook. As we said, this is going to be the source of truth for managing the launch, and I have added a snippet to show you what kind of information can go into a runbook. We can capture an overview of the product or feature that we are trying to launch, and the teams involved, along with the launch accountability metrics, which are primarily for operational support. There has to be a pre-launch checklist, which lists all the items that need to be executed in a specific order. These could be items such as a go/no-go from all the stakeholders, configurations that need to be enabled,
systems that need to be pre-scaled, and teams that need to be notified before starting the launch. The runbook then has a launch day section, which covers the specific steps that you need to execute and how to validate each step during the launch. All the monitoring dashboards are captured in the runbook, along with mitigation and rollback scenarios, so that you don't have to do last-minute brainstorming on how to roll back something that has gone wrong. There's also a section for post-launch. In the post-launch phase, you will have launch check-ins to capture metrics, which go into the runbook, along with issues that are under investigation, tech debt, and learnings that the group will share later.
Let's come to the launch day. Putting in all this effort prior to the launch should ensure that the launch goes smoothly and that, even if there are unexpected issues, the team is ready with possible mitigation measures. On the day of the launch, as the launch captain, you should use the runbook to guide the team through all the launch proceedings. Lead the launch room with clarity and empower the group to voice their concerns or opinions while ensuring the launch stays on track. There can be scenarios with conflicting opinions. Moderate the room to ensure everyone's voice is heard while staying on track to meet the objective for that day, which could be, for instance, a 20 percent ramp by a particular time. The launch captain should feel empowered to hear dissent. And in case there is no consensus in the room, you should be the informed captain who decides on the next steps while clearly articulating the risks and the trade-offs.
Let's jump to monitoring. Any high-impact launch usually has a staged or ramped rollout, which means we start rolling out to a small percentage of users and then slowly increase this percentage over a span of hours or days.
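One common way to implement a percentage ramp like this is to hash each user into a stable bucket and compare the bucket to the current ramp percentage. The sketch below illustrates that general technique only; it is not taken from the talk and is not a description of how Netflix ramps launches. The function names, the feature name `new-feature`, and the user ID format are all hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, ramp_percent: int) -> bool:
    """A user is in the launch when their bucket falls below the current ramp."""
    return rollout_bucket(user_id, feature) < ramp_percent


# Hypothetical ramp: 5 percent first, then 20 percent.
users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if is_enabled(u, "new-feature", 5)}
at_20 = {u for u in users if is_enabled(u, "new-feature", 20)}
```

Because a user's bucket never changes, everyone included at 5 percent is still included at 20 percent, so raising the ramp only adds users and never flips an existing user back to the old experience.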
In the early stages of the launch, when there is less traffic, it is always useful to spot check the first few instances of success or failure to verify that everything looks good. It is good to document the success and failure metrics and also compute the success and failure percentages. This helps us flag any deviations as the traffic ramps up, since we would expect the success and failure percentages to stay in roughly the same range as we ramp up.
It is awesome to see a launch go smoothly without any issues, but that may not be the case always, or even often. During the launch, each team will be monitoring its metrics. The launch captain or any of the team members can notice an anomaly, which can be an unexpected trend or a spike in errors. When this happens, the launch captain should identify and call upon specific teams to share their insights on the potential issue to help assess two things: severity and impact. The next step is to summarize the issue for the extended group and clearly communicate the impact and the severity, flagging whether this is a launch blocker, a critical issue, or a major issue. If it is a critical or a major issue, capture it under issues under investigation in the runbook, and designate a small subset of the team to follow up on it. If it is a launch blocker and needs more thorough investigation, lead a smaller group into a separate war room. While the investigation is in progress, keep the group informed of updates and help brainstorm possible resolution mechanisms.
Now, concluding the launch day. Hoping that the launch is successful, always conclude the launch day with a status update and metrics. Call out potential risks, the issues encountered, and the next steps. It is possible that some teams have different support models in the early phases of the launch.
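The ramp monitoring described earlier in this part of the talk, where failure percentages are documented at low traffic and compared as the ramp grows, can be sketched as follows. This is a minimal illustration under stated assumptions: `StageMetrics`, `deviates`, the one-percentage-point tolerance, and all the counts are made up for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    ramp_percent: int
    successes: int
    failures: int

    @property
    def failure_rate(self) -> float:
        total = self.successes + self.failures
        return self.failures / total if total else 0.0


def deviates(baseline: StageMetrics, current: StageMetrics, tolerance: float = 0.01) -> bool:
    """Flag a stage whose failure rate moves more than `tolerance`
    (absolute, so 0.01 means one percentage point) away from the baseline."""
    return abs(current.failure_rate - baseline.failure_rate) > tolerance


# Hypothetical counts recorded at three ramp stages.
early = StageMetrics(ramp_percent=1, successes=980, failures=20)         # 2% failures
steady = StageMetrics(ramp_percent=5, successes=4_900, failures=100)     # 2% failures
spiking = StageMetrics(ramp_percent=20, successes=19_000, failures=1_000)  # 5% failures

steady_flag = deviates(early, steady)    # same range as the baseline
spiking_flag = deviates(early, spiking)  # three points above the baseline
```

Comparing rates rather than raw counts is what makes the early, low-traffic stages usable as a baseline: the counts grow with the ramp, but the percentages are expected to stay in roughly the same range.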
Always align on the support models and the points of contact for each team in case there is any issue in the late hours of the day. Last but not least, the initial few days of the launch can be taxing. Always include a thank-you note and a reminder for the team to rest and recover for the coming days.
Now let's talk about the post-launch phase. As we celebrate the launch and step into the post-launch phase, we should transition from active to passive monitoring. That is, let the automated alerts take over once the launch is stable, and scale down the real-time, active monitoring. We should also use this time to shift the team's focus towards tuning pageable and email alerts, which are going to be the backbone of monitoring. We can also use this time to follow up on any critical or major issues that were reported during the launch. It is a good and healthy practice to initiate retrospectives to gather feedback and learnings for the future.
This brings us to the end of the presentation. Thank you, everyone. I hope this was useful for all of you. Again, my name is Ramneet. If you want to follow up with me and have a discussion, I'm very happy to connect on LinkedIn. Thank you and have a good day.

Ramneet Bhatia

Senior Software Engineer @ Netflix



