Turning AI into API: Extracting Structured Data from Images with OCR and LLM

Abstract

Discover how combining OCR with LLMs can transform unstructured image data into structured, usable formats. This talk demonstrates building AI pipelines to automate data extraction, improve workflows, and enhance decision-making, turning visual information into business-ready insights.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello, everyone. Thank you for joining this talk. My name is Vladimir. I'm a software engineer at Wales Corp. In this session, I will be talking about how to turn AI into APIs. Big disclaimer: this talk is from my perspective as a software engineer, not a data scientist. I will be demonstrating how software engineers can use OCR and LLMs to build their own applications.

Before we dive into the demo, let me first give you a quick introduction to OCR. OCR stands for Optical Character Recognition, which is the process of converting images of typed text into machine-readable text. For example, if you have a scanned document or any other image with text, OCR can extract that text for further use in your application. Of course, there are many options that provide OCR as a service, such as Amazon Textract or Google Cloud Vision API. But OCR is not enough; there is a problem. OCR usually converts images into unstructured text, but we actually need structured data to work with in order to build a good interface for our clients. Most APIs represent data as JSON or some other serializable structure. So it would be nice to extract text from images, find the available data in this text, and represent it as JSON, right?

And in 2024, we can actually do this. This is where LLMs can really help us. Many of them offer an API for developers. For example, OpenAI provides many features through their API, and one of the most interesting and useful features, in my opinion, is function calling. Function calling is extremely useful for a large number of use cases. It also has JSON mode. The model's output is guaranteed to be valid JSON when JSON mode is enabled. This is exactly what we need, right? How can we use it? I will show you. Let's go ahead to the demo.

And now I'm going to get my hands dirty and write some code. I'll be using TypeScript for this demo. First, we have to set up an application to handle HTTP requests from our clients.
I will be using Deno, which is a modern JavaScript runtime. Think of it like Node.js, but better. Let's open my terminal. Okay. I'm going to use deno init to create the basic structure of our application. We have to give it a name. I'm going to call it receipt scanner API. Why not? All right, Deno has executed this command. Now let's follow the instructions. I will navigate to the project folder and, as you can see, there are three files here. The first file is deno.json, which is just a configuration file. The second file is main.ts, where we will write our code. The third file is just a file for tests, but I don't need tests in this talk, so I will delete it. Okay. Good. Now let's open main.ts. I will use Vim for that. We are seeing a few lines of an example here, but I'm going to delete them.

So now I will be using Deno.serve. What is Deno.serve? Deno actually has a built-in HTTP server, and this function just runs that server. If we check the definition of this function, we can see that it accepts a handler function. This handler needs to accept requests from clients and return the appropriate responses. All right, now let's go ahead and code a simple echo server to get started. I'll just show you how it works. Okay. We can run it like this. Okay. I'm going to use curl to test this server. All right. It works.

So now let me explain again what exactly we're going to create. Let me open my browser. Here is an overview of how it will work. I will be using Tesseract.js for OCR because it's free and open source. I could use Amazon Textract, for example, but in this case it doesn't really matter. Let's go through the flow quickly, step by step. First of all, the client sends a request with the URL of the receipt. Then Tesseract extracts text from this receipt. The extracted text is passed to ChatGPT, along with a well-structured prompt. ChatGPT does some magic and applies our custom function to put the data into JSON.
And that JSON is sent back to the client as a response. Quite simple. So now I'm going to use Tesseract.js. First, let's take a look at the documentation, which includes a brief example of how to use it. The first thing is, we need to create a worker and then pass a URL to the recognize function. The worker will automatically download the image and extract the text from it. Okay. For this demo, I have a receipt that I will be working with. This one. I just want to extract a description, unit price, and amount for each item. Additionally, I want to capture the subtotal and total. We'll use ChatGPT for that. Okay. Let's go back to my terminal. I'm going to import Tesseract from npm, and then I will create a worker and use this worker on every request. Good. Let's test it. I'm going to use curl. Good.

Okay, for now, our application can extract text from images. However, this text is unstructured and contains a lot of irrelevant information. We will use ChatGPT to extract only the key information and put it into JSON. Okay, to use the OpenAI API, we need an API key. You can easily create your own API key on the official website. It's called the OpenAI developer platform. There is also an API reference here. Okay, I'm going to use the official JavaScript library for OpenAI. This one. Let's open my editor, and I'm going to import OpenAI from npm. We have to put an API key here. I'm going to load my API key from my system environment.

Okay, let's create a ChatGPT completion. We will use GPT-3.5. I will provide two messages. Each message has a specific role, but I will not discuss that in this talk. The first message, which has the system role, establishes the main chat context: the GPT model will understand that it should act like a helpful receipt analysis tool. In the second message, I will put the prompt. So we need to come up with a good prompt here for ChatGPT. For example, like this.
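The code shown on screen is not captured in this transcript. As a rough sketch of the two-message setup described here, the messages could be assembled along these lines. The names `SYSTEM_PROMPT` and `buildMessages`, and the exact prompt wording, are hypothetical and chosen only for illustration.

```typescript
// Hypothetical sketch: names and prompt wording are assumptions,
// not the code shown in the talk.

type ChatMessage = { role: "system" | "user"; content: string };

// The system message sets the overall context for the model.
const SYSTEM_PROMPT = "You are a helpful receipt analysis tool.";

// Builds the two messages: a system message, then a user message that
// concatenates an instruction prompt with the raw OCR text.
function buildMessages(ocrText: string): ChatMessage[] {
  const prompt =
    "Extract every line item (description, unit price, amount) plus the " +
    "subtotal and total from the following receipt text:\n\n";
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: prompt + ocrText },
  ];
}
```

The resulting array would be passed as the `messages` field of the chat completion request, with the OCR output supplied as `ocrText`.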
Note that I concatenate my prompt with the Tesseract result here. You can come up with your own prompt based on your use cases. This is the most creative part, actually. I need only one choice, so I'm going to put one here, and I'm also going to turn on JSON mode. Now we need to ask ChatGPT to map the data into a specific JSON format, and for that we will use the function tool. Let me show how. I will copy and paste a piece of code here. I can use it here. The most important thing here is the JSON schema, this part. Here we define receipt items as a JSON schema. Every item has a description, unit price, and amount, which is a number. All these properties are required, and it also has subtotal and total as numbers.

Once we get the completion here, we can grab the first choice from it like this. I'm going to remove it. I take the first choice here and look for a tool call that has this name. If we can't find it, we will send an internal server error. Otherwise, we will take the arguments and send them back to our client here. And that's it. I'm skipping important things like argument validation and token optimization, because it's just a demo.

Okay, let's test it. I'm going to run it here and use curl. Wow, it works. Unbelievable. Let's compare this JSON with the original image. Let me open the browser. I'm going to have to open this, yeah. It gets one amount a little bit wrong, but it extracts everything we need, as you can see. And that's it. That's all you need to get started using OCR and ChatGPT in your project. If you run into any issues or would like to discuss this talk, just reach out to me on Twitter. Thank you so much for your time. I really appreciate it. Enjoy the rest of the conference. Bye bye.
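The JSON schema and the tool-call lookup described in the talk might look roughly like the sketch below. This is not the speaker's code: the tool name `save_receipt`, the property name `unit_price`, the helper `extractReceiptArguments`, and the choice to mark `subtotal` and `total` as required are all assumptions. The `CompletionLike` type models only the small part of a chat completion response that the lookup reads.

```typescript
// Hypothetical sketch: tool name, property names, and helper are assumptions.
const TOOL_NAME = "save_receipt";

// Function tool definition whose `parameters` field is a JSON schema
// describing the receipt structure the model should fill in.
const receiptTool = {
  type: "function",
  function: {
    name: TOOL_NAME,
    description: "Store the structured contents of a receipt",
    parameters: {
      type: "object",
      properties: {
        items: {
          type: "array",
          items: {
            type: "object",
            properties: {
              description: { type: "string" },
              unit_price: { type: "number" },
              amount: { type: "number" },
            },
            required: ["description", "unit_price", "amount"],
          },
        },
        subtotal: { type: "number" },
        total: { type: "number" },
      },
      required: ["items", "subtotal", "total"],
    },
  },
} as const;

// Minimal shape of the response fields used below.
type ToolCall = { function: { name: string; arguments: string } };
type CompletionLike = {
  choices: { message: { tool_calls?: ToolCall[] } }[];
};

// Takes the first choice, looks for a call to our tool, and parses its
// arguments. Returns null when the call is missing or is not valid JSON;
// a handler would answer with an internal server error in that case.
function extractReceiptArguments(completion: CompletionLike): unknown {
  const call = completion.choices[0]?.message.tool_calls?.find(
    (c) => c.function.name === TOOL_NAME,
  );
  if (!call) return null;
  try {
    return JSON.parse(call.function.arguments);
  } catch {
    return null;
  }
}
```

Parsing the arguments only confirms that they are syntactically valid JSON. It does not check them against the schema, which is the argument validation step the talk explicitly leaves out.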

Conf42 Prompt Engineering 2024 - Online

November 14 2024 - premiere 5PM GMT

Vladimir Pesterev

Rust Developer
