Conf42 Machine Learning 2021 - Online

Search through your data with Weaviate and out-of-the-box machine learning models

Abstract

This talk is an introduction to the vector search engine Weaviate. You will learn how storing data as vectors enables semantic search and automatic data classification. The talk touches on topics like the underlying vector storage mechanism and how the pre-trained language vectorization model enables this. In addition, the presentation includes live demos to show the power of Weaviate and how you can get started with your own datasets. No prior technical knowledge is required; all concepts are illustrated with real use case examples and live demos.

Most data is unstructured. Additionally, data is often stored without context, meaning and relation to concepts in the real world. This makes the data difficult to index, classify and search through. While this is traditionally solved by manual effort or expensive machine learning models, Weaviate takes another approach to this problem. Weaviate is a vector search engine, which stores data as vectors and automatically adds context and meaning to new data. This makes it possible to search through the data without using exact matching keywords. Moreover, data can be automatically classified.

Weaviate is completely open source, has a built-in machine learning model, has a graph-like data model, is completely API-based and is cloud-native. Weaviate offers a GraphQL API next to RESTful endpoints to interact with the data in an intuitive manner. Additionally, Python, Go, Java and JavaScript clients are available to facilitate interaction between Weaviate and your applications. GraphQL and client examples will be shown in the presentation.

Summary

  • Weaviate is a cloud native, modular, real time vector search engine. It uses machine learning to understand the data that is in it. You can combine semantic search with traditional search. It has full CRUD support for both data and vectors.
  • Weaviate uses semantics and context rather than exact matching keywords. With Weaviate you can combine these kinds of vector searches with traditional scalar search. You can also ask how certain Weaviate is that a result matches the query.
  • You can use Weaviate for a big variety of use cases due to the flexibility of choosing your own machine learning model. In this presentation you learned that with the open source vector search engine Weaviate you can search through unstructured data.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Welcome to my presentation on vector search with Weaviate. I'm Laura, a community solution engineer at SeMI Technologies, and in this presentation I will introduce you to our open source vector search engine, Weaviate.
First, let's take a look at data, and particularly at unstructured data. Unstructured data is data that is not organized in a predefined manner. Take, for example, big pieces of text. We learned that 93% of your data stays unused and is unstructured, and that 80% of businesses don't know how to use their unstructured data in favor of their business. Why is this so difficult? One thing that is difficult is searching through unstructured data, for example to answer business questions.
Let me give you a simple example of searching through unstructured data. If you want to find information in unstructured text, you will need to use exact matching of keywords to find an answer. If you look, for example, for a wine that fits with your seafood dinner, while the wine in your database only tells you that it is good with fish, you will most likely not find this wine. If you instead use a vector search engine like Weaviate, you can find information in unstructured data based on semantics.
Compare this with Google Search. If you ask Google a very abstract question, it might find an answer. The question here, "What color of wine is Chardonnay?", is very abstract. Still, Google Search finds exactly this answer from a particular data node. So the question is, how does Google find exactly this answer from exactly this data node? How can we predict the relation between this answer and the question that we asked? And in addition, how can we do this so fast? So the main question here is: what if you could do the same with your own data in a simple and secure way? The answer that we came up with is Weaviate.
Weaviate is a database that uses machine learning to understand the data that is in it. Weaviate is a cloud native, modular, real time vector search engine that is built to scale your machine learning models. So as I said, Weaviate is a vector search engine. So first, let's dive into what vector search actually is. Weaviate stores data as vectors, which are placed in a space in relation to other data objects. Machine learning models are used to compute a vector for each data object and also for each semantic query. So Weaviate really tries to understand your data and your queries in more detail.
This is how Weaviate works with a text vectorization module. A pretrained model, for example a fastText or BERT transformer model, can compute vectors from known concepts, for example our daily human language. You can add your own data to a Weaviate instance, and all this data will be vectorized using these machine learning models and stored as vectors, together with the data objects themselves, in Weaviate. The data objects will then be indexed by the machine learning models and placed in the high dimensional vector space. Then you can perform, for example, a search query, which will also be vectorized using Weaviate's machine learning models. For example, let's find a wine that fits with seafood. Weaviate does its nearest neighbor similarity search and finds the objects that lie nearest to your search vector. So the answer that lies closest to the vector of this question will be returned.
With Weaviate, you can do the following tasks with unstructured data. You can search through data, you can discover answers to your specific questions, you can classify and label your data automatically with machine learning models, and Weaviate can predict relations in your database. The vector database Weaviate has full CRUD support for both data and vectors, and you can combine vector search and scalar filters, which means you can combine semantic search with traditional search.
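The nearest-vector lookup described above can be illustrated with a minimal Python sketch. It is not how Weaviate indexes or searches internally: the three-dimensional vectors are hand-made stand-ins for what a vectorization model would produce, the wine descriptions are hypothetical, and the search is a plain brute-force comparison using cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical data objects with made-up vectors (assumption: a real
# vectorization module would compute these, in far more dimensions).
WINES = {
    "Chardonnay: good with fish": [0.9, 0.1, 0.0],
    "Cabernet: pairs with red meat": [0.1, 0.9, 0.1],
    "Port: serve with dessert": [0.0, 0.2, 0.9],
}

def nearest(query_vector, objects, limit=1):
    """Return the `limit` objects most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in objects.items()
    ]
    scored.sort(reverse=True)
    return scored[:limit]

# Made-up vector standing in for the query "a wine that fits with seafood".
seafood_query = [0.8, 0.2, 0.1]
best = nearest(seafood_query, WINES)
print(best[0][1])
```

The query shares no keyword with "good with fish", yet that object is returned because its vector lies closest to the query vector.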
It has a GraphQL and RESTful API, and Weaviate supports multiple data types, like text but also images. This is all possible through Weaviate modules. Modules can be attached to the Weaviate core vector database to enable the features I just described on the previous slide. For example, you can choose to use an image vectorization module to index images and search through them, but you can also attach a question answering module. You can also attach any transformer NLP model, or you can even attach your own machine learning or NLP models. So this allows you to use Weaviate for scaling your own machine learning models to production scale as well.
So now let's move on to a demo. I will use a data set of news articles for this demo, and you can also find this demo data set running on our website. You can get here from any code example or query example. Over here there's a really simple query to first just see what kind of articles we have in this data set. I can perform a Get query to get all the articles. Here I just want to see their titles, their URLs and their word counts. And here I get a list, in random order, of all the articles. So of course, nothing special or magic is happening here. And just to show you that it is a vector database, you can also query the whole vector of a data object. So you get the long list of vector values. So now let's only show the title. As I said, this is just scalar search, so I don't do any machine learning magic here.
But now let's add a semantic filter. For example, let's see if the data set has any articles regarding housing prices. I can perform a nearText query, and this filter is added by a specific text vectorization module. So let's see if there are articles about housing prices. You can see this is a very abstract question. Here, the list of articles is ordered by relevance to the search query.
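As a rough sketch, the two demo queries above (a plain Get query and a nearText query) could be written as GraphQL strings like the ones this Python helper builds. The class name `Article` and the property names `title`, `url` and `wordCount` are assumptions based on what is said in the demo, and the helper `get_query` is hypothetical; it only builds the query text and does not contact a Weaviate instance.

```python
import json

def get_query(class_name, properties, near_text_concepts=None):
    """Build a GraphQL Get query string, optionally with a nearText filter."""
    args = ""
    if near_text_concepts:
        # json.dumps renders the list of concepts as a quoted GraphQL list
        args = "(nearText: {concepts: %s})" % json.dumps(near_text_concepts)
    fields = " ".join(properties)
    return "{ Get { %s%s { %s } } }" % (class_name, args, fields)

# Plain scalar query: list articles with a few properties (names assumed).
list_articles = get_query("Article", ["title", "url", "wordCount"])

# Same class, also asking for the stored vector of each object.
with_vectors = get_query("Article", ["title", "_additional { vector }"])

# Semantic query: articles near the concept "housing prices".
housing = get_query("Article", ["title"], near_text_concepts=["housing prices"])

print(list_articles)
print(housing)
```

Such a string would typically be sent to the GraphQL endpoint of a running instance, either through one of the clients or as the `query` field of an HTTP POST body.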
So we have, for example, something about housing becoming the biggest asset class, something else about housing prices, expensive housing, et cetera. And note that the query, housing prices, is not an exact match of any of the words here in the title. So here you can see that Weaviate uses semantics and context rather than exact matching keywords. We can also ask how certain Weaviate is that a result matches. This is called certainty. So here you see that the first result is around 87% certain to match the search query. And then we can also filter based on this. So now I will show only results that are above 80% certain.
Now you can see that this is a very abstract query, and we can also make it a bit more concrete, for example to see the prices of houses in Greece. And you can see that there's only one result returned now, because we made the query more concrete, and it is about Athens. So we can see that Weaviate matches Greece here with its capital, without saying anything about Greece.
So as I said, Weaviate not only stores vectors, it also stores the whole data object. This means you can combine these kinds of vector searches with traditional scalar search. For example, I will add some properties first. Each article appears in a publication, so we can see that this article, for example, appeared in the Financial Times. This is a graph relation in the database. And now I can add a scalar filter, combining it with the already existing vector query. So this is a where search. Now I'm querying for housing prices again, and I want the results to have appeared in this publication, The Economist. So let's see. This first result is 87% certain, this is the title, and we can see that it appeared in The Economist.
We can also ask questions to Weaviate, or to the data in Weaviate, if we have a question answering module available or enabled.
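The combination of a semantic filter, a certainty threshold and a scalar filter on the publication could look roughly like the query built below. This is a sketch under assumptions: the class `Article`, the reference property `inPublication`, the class `Publication` and its `name` property are guesses at the demo schema, the publication name is taken from the demo as spoken, and the helper functions are hypothetical. The code only constructs the GraphQL text.

```python
import json

def near_text_arg(concepts, certainty=None):
    """nearText argument, optionally keeping only results above a certainty."""
    parts = ["concepts: " + json.dumps(concepts)]
    if certainty is not None:
        parts.append("certainty: %s" % certainty)
    return "nearText: {" + ", ".join(parts) + "}"

def where_equal_arg(path, value):
    """Scalar where filter: the property at `path` must equal `value`."""
    return "where: {path: %s, operator: Equal, valueString: %s}" % (
        json.dumps(path),
        json.dumps(value),
    )

def article_query(concepts, certainty=None, publication=None):
    """Combine the vector search with an optional filter on the publication."""
    args = [near_text_arg(concepts, certainty)]
    if publication is not None:
        # Assumed schema: Article --inPublication--> Publication.name
        args.append(
            where_equal_arg(["inPublication", "Publication", "name"], publication)
        )
    return "{ Get { Article(%s) { title _additional { certainty } } } }" % ", ".join(args)

query = article_query(["housing prices"], certainty=0.8, publication="The Economist")
print(query)
```

Keeping the two filters as separate arguments mirrors the demo: the nearText part ranks by meaning, while the where part narrows the results by an exact scalar value.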
I can also show this. I will remove the previous filters, so I can ask a question, also in a filter. For example, what was the monkey doing in the Neuralink video? I will not query the certainty here, but I want to see the answer. And the answer here is: he was playing MindPong. If we limit the result to one and also ask for the summary, this is the summary of the article. So the result is that the monkey was playing MindPong, and the answer was found somewhere in this whole summary, which is of course the bit of unstructured text that we have here.
Okay, so now let's go back to the presentation. You can use Weaviate for a big variety of use cases due to the flexibility of choosing your own machine learning model, or also keeping it very general. So this was my presentation. Thank you for listening and watching. In this presentation you learned that with the open source vector search engine Weaviate you can search through unstructured data, and in addition you can use it to bring your own machine learning models to production scale. Thank you and see you.
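For reference, the question-answering query from the demo could be sketched as follows. The class name `Article` and the `summary` property are assumptions about the demo schema, `ask_query` is a hypothetical helper, and the query only makes sense when a question answering module is enabled on the instance. The code builds the GraphQL text and does not execute it.

```python
import json

def ask_query(question, limit=1):
    """Build a Get query that asks a question and returns the extracted answer."""
    return (
        "{ Get { Article(ask: {question: %s}, limit: %d) "
        "{ summary _additional { answer { result } } } } }"
        % (json.dumps(question), limit)
    )

monkey = ask_query("What was the monkey doing in the Neuralink video?")
print(monkey)
```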

Laura Ham

Community Solution Engineer @ SeMI Technologies



