Conf42 Cloud Native 2025 - Online

- premiere 5PM GMT

Cloud-Native ML Infrastructure: Building Resilient Apache Spark Clusters on Kubernetes for AI Workloads

Abstract

Running AI/ML workloads on cloud-native infrastructure raises challenges in scalability, resource utilization, and operational complexity. This talk shows how to architect production-grade Apache Spark clusters on Kubernetes that meet the demands of modern AI/ML applications while adhering to cloud-native principles.

Key Topics

  • Implementing cloud-native patterns for Spark on Kubernetes: statelessness, declarative configurations, and immutable infrastructure
  • Designing resilient architectures for ML workloads using Kubernetes operators and custom controllers
  • Container optimization strategies for Spark executors running AI/ML workloads
  • Network optimization and storage patterns for large-scale data processing
  • Implementing GitOps workflows for Spark-based ML infrastructure
  • Monitoring and observability solutions for distributed ML training

Attendees will gain practical insights into building cloud-native data infrastructure that scales effectively for AI/ML workloads, with real-world examples of Kubernetes configurations, deployment patterns, and operational best practices.
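
As a rough illustration of the "declarative configuration" idea in the topic list (not taken from the talk itself), the sketch below keeps a Spark-on-Kubernetes job's settings in one plain mapping and renders them into a spark-submit command. The property names are standard Spark configuration keys; the function name, namespace, image, service account, API server address, and application path are all hypothetical placeholders.

```python
from typing import Dict, List


def build_spark_submit_args(
    api_server: str,
    app_resource: str,
    conf: Dict[str, str],
    app_name: str = "ml-feature-job",  # hypothetical job name
) -> List[str]:
    """Render a declarative config mapping into a spark-submit argument list."""
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Spark's Kubernetes master URL scheme
        "--deploy-mode", "cluster",
        "--name", app_name,
    ]
    # Sort keys so the same config always renders to the same command line,
    # which keeps diffs small when the mapping lives in version control.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_resource)
    return args


# All values below are assumptions chosen for illustration only.
SPARK_CONF = {
    "spark.kubernetes.namespace": "ml-jobs",
    "spark.kubernetes.container.image": "registry.example.com/spark-ml:1.0.0",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    "spark.executor.instances": "4",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "2",
}

args = build_spark_submit_args(
    api_server="https://kubernetes.example.com:6443",
    app_resource="local:///opt/spark/app/train.py",
    conf=SPARK_CONF,
)
print(" ".join(args))
```

Keeping the settings in a single reviewed mapping, rather than scattered across ad hoc command lines, is one simple way to make a job definition reproducible; a Kubernetes operator or GitOps pipeline would typically express the same information as a YAML manifest instead.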

...

Anant Kumar

Tech Lead @ Salesforce



