Conf42 Platform Engineering 2024 - Online

- premiere 5PM GMT

Building Efficient and Frugal Image Data Processing Pipelines


Abstract

Having trouble processing large datasets of images? Want to achieve lightning-fast image processing without breaking the bank? Let me walk you through creating data pipelines that are lean, mean, and optimized for speed and cost.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I'm going to talk about building efficient and frugal image data processing pipelines, based on my experience at Amazon. Here is the agenda: we will first go over image processing use cases in industry, then the main challenges we face, the current state-of-the-art solutions, and the future innovations ahead.

There are many applications of image data: medical imaging, object recognition for vehicles, product images and recommendations for social media and e-commerce. And recently, with AI, its importance continues to grow, specifically for multimodal AI training.

Most of the challenges come from large image datasets: the high storage cost of keeping vast amounts of image data, the very intense computation needed to process it, and long processing times, which can become a bottleneck in the infrastructure for any real-time or time-sensitive application. Network bandwidth is another one; transferring that data over the network is expensive, and most of the time we are very limited on those resources. So the goal of an efficient and frugal pipeline is to minimize processing time, optimize storage cost and utilization, and achieve better resource allocation and throughput.

The first important thing is choosing the right image format. There are several popular image formats to consider, each with its own advantages and trade-offs, mostly in terms of compression and image quality. Here are some commonly used formats. JPEG is the popular choice for photographs and images with complex color gradients; it offers a good balance between file size and image quality. PNG is more versatile and lossless, suitable for images with sharp edges and transparency.

I'd like to go into more detail about the popular choice, JPEG, which we use a lot in our image processing infrastructure. It stands for Joint Photographic Experts Group, which was actually new to me as well. It's a widely used lossy compression format for photographs. As I said, it has a good balance between size and quality, and it also supports adjustable compression levels, allowing fine-grained control over the trade-off between file size and image quality. It's widely supported in web browsers and editing software. PNG, also new to me, stands for Portable Network Graphics. Unlike JPEG, PNG preserves all image data, ensuring that no quality is lost even after multiple edits. It supports a wider range of colors and transparency than JPEG, making it ideal for logos, icons, and graphics.

There is also a new generation of image formats. For example, WebP offers both lossy and lossless compression; HEIC uses High Efficiency Video Coding (HEVC) for compression; and AVIF, based on the AV1 video codec, is very promising, with even better compression than WebP and HEIC.

In the realm of image compression, there is a fundamental trade-off between file size and image quality. Lossy compression can achieve higher compression ratios, and it does so by selectively discarding some image data.
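The adjustable JPEG quality level mentioned above is easiest to see with a quick experiment. Here is a minimal sketch using the Pillow library (my own illustration, not code from the talk); the input file name and quality values are hypothetical.

```python
# Minimal sketch: comparing JPEG quality levels and PNG output with Pillow.
# Assumptions (not from the talk): Pillow is installed and "input.jpg" exists locally.
from pathlib import Path

from PIL import Image

source = Image.open("input.jpg").convert("RGB")

# JPEG: lossy, with an adjustable quality knob (higher = larger file, better quality).
for quality in (50, 75, 95):
    out = Path(f"photo_q{quality}.jpg")
    source.save(out, format="JPEG", quality=quality, optimize=True)
    print(f"JPEG quality={quality}: {out.stat().st_size / 1024:.1f} KiB")

# PNG: lossless, so no quality parameter; typically much larger for photographic content.
png_out = Path("photo.png")
source.save(png_out, format="PNG", optimize=True)
print(f"PNG (lossless): {png_out.stat().st_size / 1024:.1f} KiB")
```

Running this on a typical photograph shows the size/quality trade-off directly: the quality-50 JPEG is usually a fraction of the size of the PNG, at the cost of some discarded detail.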
Lossy compression is mostly suitable for applications where minor quality loss is acceptable. Lossless compression preserves all image data but results in larger files, so it is ideal for applications where image fidelity is critical, such as medical imaging and archival.

There are also a number of recommended compression libraries and tools that provide pre-built functionality for encoding and decoding various image formats. They enable efficient compression and decompression within our applications, including optimized algorithms and hardware acceleration capabilities, which can improve performance. They also isolate the low-level image handling, which lets us focus on our business use case when dealing with image processing.

Another way to choose the right image format is adaptive compression. Adaptive compression optimizes image storage and bandwidth by dynamically adjusting compression levels based on the content and the quality requirements, and the benefits are a better user experience and cost savings. This is how it works: it first analyzes the content to identify areas with varying levels of detail and complexity. Based on that, it applies different compression levels; for example, a smooth background can be compressed more, while faces and edges need less compression to preserve visual quality. The final encoded image ends up smaller while maintaining acceptable quality.

When it comes to infrastructure, it's very popular to leverage cloud-native services for storage and the other systems I will go over later. Cloud computing offers lots of services that enhance the efficiency and cost-effectiveness of image processing pipelines. The popular choices offer scalability, durability, cost-effectiveness, and accessibility: the image data can be accessed from anywhere with an internet connection, facilitating collaboration and remote processing.

We also use a lot of serverless computing for on-demand image processing; the popular choice is AWS Lambda, which is what we mostly use. It lets us run code in response to events without provisioning or managing servers, and this on-demand execution model is a good fit for image processing tasks such as resizing, thumbnail generation, and format conversion.

There are also cloud-based image processing APIs that are very handy. We use Amazon Rekognition a lot, and Google and Azure offer comparable services. They help us with object detection, image classification, facial recognition, and content moderation, which saves us a lot of effort building and maintaining our own models.

Here are some strategies for optimizing cloud costs: use spot instances to take advantage of unused cloud capacity; right-size resources by choosing the appropriate instance type that matches the workload, to avoid over-provisioning; leverage reserved instances, committing to specific instance types for a longer duration in exchange for discounted pricing; monitor and analyze usage patterns so we can identify where to optimize iteratively; and set budget alerts.
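The talk doesn't include code, but as a rough illustration of the on-demand Lambda use case, here is a minimal sketch of an S3-triggered thumbnail function; the bucket name, thumbnail size, and the assumption that Pillow is packaged with the function are all mine, not from the talk.

```python
# Minimal sketch of an S3-triggered AWS Lambda handler that generates thumbnails.
# Assumptions (not from the talk): Pillow is bundled with the function or provided
# via a layer, and "my-thumbnails-bucket" is a purely illustrative destination.
import io
from urllib.parse import unquote_plus

import boto3
from PIL import Image

s3 = boto3.client("s3")
THUMBNAIL_BUCKET = "my-thumbnails-bucket"  # hypothetical destination bucket
THUMBNAIL_SIZE = (256, 256)                # max width/height, aspect ratio preserved


def handler(event, context):
    # S3 put notifications carry the source bucket and object key in the event payload.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the original image into memory.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(body)).convert("RGB")

        # Resize in place; thumbnail() never upscales and keeps the aspect ratio.
        image.thumbnail(THUMBNAIL_SIZE)

        # Re-encode as JPEG and upload to the thumbnail bucket under the same key.
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=85, optimize=True)
        s3.put_object(Bucket=THUMBNAIL_BUCKET, Key=key, Body=buffer.getvalue(),
                      ContentType="image/jpeg")

    return {"status": "ok", "processed": len(event["Records"])}
```

Wired to S3 object-created notifications, a function like this runs only when new images arrive, which matches the pay-per-invocation model and avoids keeping servers idle.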
Definitely have alerts to notify you when cloud spending exceeds the limits, which can happen quite often.

Next is distributed processing and parallelism, which is also very helpful. That is where we divide the image data into smaller chunks and process them in parallel to achieve better throughput and efficiency. This approach lets us leverage the combined processing power of multiple machines and definitely speeds up the pipeline in the end. To give an example of parallelizing an image processing task: we can break it into independent subtasks such as resizing, filtering, and feature extraction, run those independent steps simultaneously or concurrently, and then, optionally, synchronize the results of the subtasks to obtain the final output.

When it comes to distributed processing and parallelism, there are also several well-known frameworks we can leverage. Apache Spark is a versatile distributed computing framework well suited for big data processing; it offers high-level APIs for working with large datasets. Dask is a flexible library for parallel computing in Python. And Ray is a general-purpose cluster computing framework; it offers a simple API for parallelizing Python code and supports various machine learning libraries. These frameworks provide the tools and abstractions to distribute your image processing workload across multiple machines and enable you to process large datasets efficiently.

The next one is containerization and orchestration. These two technologies can simplify the deployment and management of image processing pipelines. Containerization, like Docker, lets you package your application and its dependencies into portable containers; these containers run consistently across different environments, ensuring reproducibility and eliminating compatibility issues. Orchestration, such as Kubernetes, automates the deployment, scaling, and management of containerized applications. This simplifies the process of deploying your image processing pipeline across multiple machines and ensures it runs reliably, even in the face of failures.

Continuous optimization and cost monitoring are the crucial part of staying efficient and budget friendly. Profiling tools help you identify performance bottlenecks by measuring the execution time of different stages. Once we identify a bottleneck, we can apply various techniques: algorithmic improvements, using more efficient algorithms or data structures; hardware acceleration, leveraging GPUs or specialized hardware accelerators to offload computationally intensive tasks; and parameter tuning, experimenting with different parameter settings to find the optimal configuration for the workload.

Caching and memoization are also very valuable for optimizing image processing pipelines. With caching, we store the results of expensive operations, which is very beneficial for image transformations and feature extractions that are applied repeatedly to the same images. Memoization caches function call results based on their input arguments, preventing redundant computations when a function is called multiple times with the same inputs.
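To make the memoization point concrete, here is a small sketch using Python's built-in functools.lru_cache; the "feature extraction" body and the file name are hypothetical stand-ins, not anything shown in the talk.

```python
# Minimal memoization sketch with functools.lru_cache.
# Assumption (not from the talk): extract_features is a stand-in for a genuinely
# expensive operation such as decoding, resizing, or running a model on an image.
import hashlib
from functools import lru_cache


@lru_cache(maxsize=1024)
def extract_features(image_path: str) -> str:
    # Pretend this is expensive; here we just hash the file contents.
    with open(image_path, "rb") as f:
        data = f.read()
    return hashlib.sha256(data).hexdigest()


# First call computes and caches; repeated calls with the same argument are free.
print(extract_features("cat.jpg"))
print(extract_features("cat.jpg"))      # served from the in-memory cache
print(extract_features.cache_info())    # e.g. hits=1, misses=1
```

The same idea extends to an external cache (for example a key-value store keyed by image hash) when the pipeline runs across multiple machines.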
That is mostly what we do to optimize computationally intensive and frequently repeated operations.

Cloud providers also offer a lot of tools and services to track resource usage and the associated costs, which gives us valuable insight into our financial footprint. There are many of them; we mostly use CloudWatch from AWS. It helps us gain better visibility into our cloud spend and make informed decisions, most of the time in important business leadership meetings.

The future of image processing is very bright, with lots of advances in artificial intelligence and machine learning; techniques like deep learning and neural networks enable sophisticated image analysis tasks. However, image processing technology can have ethical implications, such as privacy, bias, and transparency, that need to be carefully addressed to ensure the responsible and ethical use of image data. That's a hard and controversial topic: where to strike the balance between innovation and ethical considerations. For now, we evaluate the risks case by case, looking mostly at who the customers of our images are, how much benefit we can bring to them, and the risks we would be taking, in order to make a balanced and informed choice.

Here is the overview of my presentation. First, we went through the image data use cases and their applications. Second, we went through the main challenges of large image datasets. Then we went through the state-of-the-art solutions in the industry: choosing the right formats, compression techniques, using cloud-native services, distributed processing and parallelism, containers and orchestration, and how to optimize and monitor costs. And then we ended with some more provocative thoughts about the future of image processing and the choices we need to make in order to achieve both innovation and safety.

Thank you so much for joining this presentation. Here are my contact details, so feel free to reach out with any questions or future follow-ups, or if you want to collaborate more on image processing pipelines. I hope you have a good time at the conference.

Anna Heng Chi

Software Engineer @ Amazon

Anna Heng Chi's LinkedIn account


