Conf42 Machine Learning 2023 - Online

From Data to Discovery: Unveiling Clustering in BERTopic Topic Modeling

Abstract

Want to learn how you can process the huge amounts of open data available on websites like Google Reviews and Amazon to build intelligent topic models and understand what customers are saying about your products and services? BERTopic to the rescue!

Summary

  • Session on "From Data to Discovery: Unveiling Clustering in BERTopic Topic Modeling", presented by Abhiram and co-presenter Jaspal. Both are reachable on Twitter; their handles are listed below.
  • People buy the Alexa Echo Dot mostly online, and also in offline stores, and leave reviews on sites like Amazon and Twitter. This is where topic modeling comes into play: it can process millions of reviews and intelligently pick out topics of interest for an organization like Amazon to act on.
  • BERTopic works on unstructured data and can take advantage of transformer models like BERT and other embedding models. Embedding is followed by dimensionality reduction and then clustering. Each of these steps can be swapped for a more advanced algorithm.
  • Abhiram: Let's pick up the Amazon Alexa reviews data set, which is available on Kaggle. We then install BERTopic and check our reviews data frame. The code snippets and notebooks are available on GitHub for your reference. I will now pass it on to Jaspal for a detailed overview of HDBSCAN.
  • HDBSCAN stands for hierarchical density-based spatial clustering of applications with noise. It helps us find high-density regions inside a data set. The distance it uses is the mutual reachability distance, which helps to identify the dense regions.
  • The assumption here is that every document contains one topic. Many cloud vendors provide BERTopic, and you can run it with a really simple architecture. Hope you took away some material that will be useful for your future research.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to the session "From Data to Discovery: Unveiling Clustering in BERTopic Topic Modeling". This session will be presented by me, Abhiram, as well as my co-presenter Jaspal. So, a little bit about ourselves. I work as a cloud machine learning engineer at Collinson and hold a master's in data science from King's College London. Previously I was working as a research fellow at SAP Labs in Bangalore. Additionally, I have published courses on LinkedIn Learning as an instructor on Rust programming, with over 40,000 participants. I love to volunteer at DataKind Bangalore on nonprofit products and projects, and I also play badminton and love to listen to 80s rock music in my free time. Along with me is Jaspal, who works as a lead data scientist with me at Collinson. His expertise lies in the areas of AI, Python, AWS, and building data products from scratch. He holds a CBA in advanced business analytics from the Indian School of Business. He loves to play football, and his favorite football club is Arsenal. Both of us are reachable on Twitter, and our handles are mentioned below. So let's look at today's agenda. First we will present the problem statement and the topic modeling use case, following which we present why BERTopic is best suited to solve this problem and the end-to-end flow, or the different processes that are involved in BERTopic. One such process is clustering, which is an integral part of the BERTopic technique, and here we will explain HDBSCAN. Then we will go into a hands-on session where we look at Amazon Alexa reviews and how we can get topics of interest from that data set, published on Kaggle. Then we close it off by discussing the future scope of topic modeling. So let's get started. Let's imagine that you have a product like the Alexa Echo Dot, which is basically a Bluetooth-enabled voice assistant.
People buy it mostly online, and also in offline stores, and leave reviews on websites like Amazon and also on Twitter, so both e-commerce and social media sites. In this scenario, the first review, on the Amazon platform, is a positive review that talks about the usability of the Echo Dot and how it's helping the customer's kids learn current affairs and general knowledge and also improve their English skills. So this is definitely a pleasure point for Amazon. Whereas in the tweet below, a customer is complaining that the Echo Dot is not working on his phone and that customer care is not picking up his calls, both of which are crucial pain points that Amazon has to address. It is not humanly possible to sift through hundreds and thousands of reviews like these and pick out pain points. That is where topic modeling comes into play: it can process millions of reviews and intelligently pick out topics of interest for an organization like Amazon to process and take action on. So now that we've established a use case, or the problem statement of why topic modeling is necessary, we look at why BERTopic is the most suitable one, or one of the most suitable ones, in this area right now. The first thing about such topic modeling is that it has to work on unstructured data. As you saw in the review samples before, the reviews are unstructured text, and they might have emojis, they might have hashtags; all of these need to be processed, and BERTopic is capable of doing that. Also, it is capable of taking advantage of transformer models like BERT and other embedding models to efficiently convert the text into contextual embeddings, which is a vector format or a numeric format. It also offers modularity. For example, on the image there are different processes listed that happen as part of the BERTopic technique, and embedding is one such process.
Dimensionality reduction is another process, followed by clustering. All three of these processes can easily be swapped for the most advanced or state-of-the-art algorithm out there. For example, take UMAP for dimensionality reduction: if you're not satisfied with UMAP and you want to try out something like PCA, which is principal component analysis, that can be used as well. And for now, the state of the art for clustering seems to be HDBSCAN. But a year from now, or even two months from now, you never know: a clustering algorithm that is far more advanced may come up, and you just have to plug and play, that is, replace HDBSCAN with the latest algorithm, and the whole BERTopic technique will work seamlessly as expected. That is why one of the crucial aspects of this technique is that new advancements in clustering can be adopted very easily. An improved version of TF-IDF, c-TF-IDF, is used as the weighting scheme for extracting topic representations in this technique, and that is also one of the motivating factors for using BERTopic. c-TF-IDF works quite well in extracting topic representations from clusters of documents without relying on centroid-based extraction, which, as you may be aware, has its own share of problems. Moving on to the end-to-end flow: what are the different parts of this whole picture of BERTopic? We pick up untokenized reviews and convert them into a numerical format in the vectorization process, using sentence transformers, a count vectorizer, or a similar technique. Then we do dimensionality reduction, which reduces these high-dimensional vectors to fewer dimensions so that they can be processed by the clustering algorithm. We will be looking at clustering and HDBSCAN in detail, and Jaspal will be presenting that in a short while.
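The c-TF-IDF weighting mentioned above can be illustrated with a small standard-library sketch. It assumes the commonly documented class-based formula, weight = tf × log(1 + A / f), where tf is a term's count inside one cluster, f is its count across all clusters, and A is the average number of words per cluster. The function names and the toy reviews are made up for illustration, and the BERTopic library's own implementation differs in details such as tokenization and normalization.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters: dict of cluster id -> list of document strings.
    Returns dict of cluster id -> {term: weight}."""
    # Treat every cluster as one long document and count its terms.
    term_counts = {
        cid: Counter(" ".join(docs).lower().split())
        for cid, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)
    # Average number of words per cluster (A in the formula).
    avg_words = sum(total_freq.values()) / len(term_counts)
    return {
        cid: {t: tf * math.log(1 + avg_words / total_freq[t])
              for t, tf in counts.items()}
        for cid, counts in term_counts.items()
    }


def top_terms(weights, cid, n=3):
    # Highest weight first; ties broken alphabetically for stable output.
    ranked = sorted(weights[cid].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical toy clusters, not taken from the Alexa data set.
clusters = {
    0: ["love the music", "music sounds great", "love playing music"],
    1: ["great gift", "bought as a gift", "gift for mom"],
}
weights = c_tf_idf(clusters)
print(top_terms(weights, 0))
print(top_terms(weights, 1))
```

Words that appear in both clusters, such as "great", receive a lower weight than words of the same count that appear in only one cluster, which is what lets the top-ranked terms act as a topic description.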
And the last part is the importance of words, or the representation with term frequency and inverse document frequency; that is also taken care of. And finally we get the required topics. So, a quick preview of clustering here. You may have heard of a famous clustering algorithm called k-means, right? It lets you select how many clusters you would like and forces every single point into a cluster, so that there are no outliers at all. But in real-life situations that is not realistic. Also, k-means assumes that the data points form neat Gaussian spheres or circles, right? That is also not the case; the shape of these clusters may be different in the real world. And that is where HDBSCAN is quite popular and quite effective. There are other algorithms, like agglomerative clustering, that can take care of these things as well. But it depends on how you plug and play these algorithms and see what works best for your use case. So let's jump into the hands-on part, starting with the data set description. Here we pick up the Amazon Alexa reviews data set, which is available on Kaggle. This data set has five columns, and for us, the verified reviews column contains the textual reviews; that is the column we will be working with today. So without further ado, let's jump to the Jupyter notebook on Google Colab. All right, so this is the Kaggle website where the data set is available, and this is how I have retrieved the data. Now don't worry, this notebook is available on GitHub for your perusal after the session. So first I get the data, and this is how the data looks after I load it into a pandas data frame. We are focused on this verified reviews column. Then we install BERTopic and again check our reviews data frame. And this is where the magic happens.
So first we instantiate a vectorizer model, which basically allows us to define certain parameters, such as the n-gram range and other configurations. And this is where we declare the BERTopic model, with the language and certain config options that are described in the documentation. And on line 19 we call the fit function; fit_transform basically gets us the topics and the probabilities. Moving ahead, this is how our topics look. The most popular topic, with 284 occurrences, is about Alexa and love, and then there are Echo Dot, music, and smart hub. All of these topics have been generated organically from our data set. And then over here we have the BERTopic distance map. All these circles are representations of topics. For example, this circle refers to topic 30, where the keywords are hub, hue, and plus, and if you go here, this is topic 15, which talks about Hulu and streaming. The one next to it, topic 10, talks about stick and Fire Stick. We can also zoom in to see why these topics are so close to each other. As we can see, there are two different topics over here: the first one is about the Fire Stick and the second one is about streaming, so they cover similar topics of interest. To wrap this up, the distance between these circles indicates how similar or dissimilar one topic is to another. There is also a hierarchical representation of all the topics that have been generated. Here you can see that easy setup, or easy set, is one of the topics that got picked up, and all the related topics are listed together to form the bigger hierarchy.
So moving ahead, we have this final visualization of the topic word scores, where we display the most frequently occurring topics and, within those topics, the different keywords and the frequency of their occurrence. Over here you can see that Alexa is without doubt the most popular topic. One of the interesting things that comes out of this is the gift topic, topic four, which talks about love, gift, and bought, which tells us that this product has been purchased by customers as a gift for their friends and family. So that's about it on the hands-on part of BERTopic. These code snippets and notebooks are available on GitHub for your reference. I will now pass it on to Jaspal for a detailed overview of HDBSCAN. Over to you, Jaspal.
Thank you, Abhiram, for your explanation of BERTopic. Now get ready with your swimsuits, everyone, because we are going to take a deep dive into HDBSCAN. So what exactly is HDBSCAN? The full form is hierarchical density-based spatial clustering of applications with noise. Quite a mouthful, right? Basically, HDBSCAN is a long acronym, and it's a clustering algorithm; to be exact, it's a density-based clustering algorithm. To understand HDBSCAN, we first need to know what DBSCAN is. On the right-hand side of your screen, you see data points scattered around the graph, and small circles around the data points. That's what DBSCAN does: it draws a small circle around every data point, and if there is another data point within that circle, it moves on to it, and it keeps traversing as long as it finds a next data point. What this does is help us find high-density regions inside a data set, like you see on the graph here, where there are a lot of data points close to each other.
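The circle-and-traverse idea described here can be sketched in a few lines of standard-library Python. This is a minimal, unoptimized illustration of the classic DBSCAN procedure, not the implementation used in the talk: `eps` is the circle radius, `min_pts` is the number of points (including the point itself) a circle must contain for the point to count as dense, and the sample points are invented.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Every point (including i itself) inside the circle of radius eps.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1                # i is a dense (core) point: new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # keep traversing from dense points
    return labels


# Hypothetical data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The result depends directly on the fixed radius `eps`, which is the parameter that has to be chosen up front.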
There are a lot of circles, you see, and that is the result of the DBSCAN algorithm. It's a good algorithm: it is quite robust, flexible, and quite resistant to outliers as well. We don't need to predefine the number of clusters, which is a really good advantage. But it has a couple of problems. First, you need to define the radius of the circle. That is okay if you have already seen the data set, but what if you have not reduced the dimensionality and have not seen the data? How do you know what the radius of the circle should be? That's one problem. The other is that it's a bit slow. So what if there was a mechanism where we don't need to define a fixed radius for the circle, but could still identify dense regions? That would also help us be fast. And that's what HDBSCAN does. Now, we have an old friend in the KNN algorithm, and we take help from this old friend of ours. The KNN algorithm basically identifies a point's nearest neighbors. So we define a k and, using KNN, draw a circle around each data point so that it covers its k nearest neighbors; the distance from the data point at the center to the last data point in that circle becomes the radius. That is the core distance. When we define this minimum number of points, we find that in some places the circle is huge: as you see on the right-hand side, the green circle is really large, while on the left-hand side the blue circle is small because the data points are closer to each other. Now, the distance between two clusters is defined by another distance called the mutual reachability distance. To explain this, we need to imagine a scenario. Imagine yourself with your friend inside a park with a lot of other people. Now, we know that some human beings are short and some are tall, and so their arm lengths are different as well.
Now imagine the arm length being the core distance of a human being. Another human being within that arm's distance is within your mutual reachability distance. That's the concept, basically: how many others are within your radius, and that's where dense regions are formed. If they are outside your radius, then they are far away from you. So this mutual reachability distance helps us to identify the dense regions. Once we have this mutual reachability distance, we can find out which items are closer to each other, and by what relation. When we are at a park, you have your arm's length to identify who is close to you and who is not. Now let's take that park example to a different place. Let's imagine a room full of objects scattered around. You have a string and you want to tie these objects to each other, but using the shortest possible paths. How would you do that? You'd do it by calculating the distances between these objects, and you already have the mutual reachability distance defined between each pair of objects. There is an algorithm called the minimum spanning tree algorithm, which finds the shortest possible paths for you, and it is really useful for identifying the density and the hierarchy of the clusters. Now, you might ask: okay, density comes from the mutual reachability distance, but how do we get the hierarchy? For that, I need you to imagine another scene. Imagine that you have tied all the objects together using the minimum spanning tree algorithm. What you do next is pick up these objects, and you see a hierarchy is formed, something like this. The objects that are closer to each other come together and form a cluster. The objects that are farther from each other lie lower in the hierarchy, and that's the way the spatial clustering happens.
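The arm's-length and string analogies correspond to three small computations, sketched below with the standard library only. The sketch assumes the usual definitions: the core distance of a point is the distance to its k-th nearest neighbor, and the mutual reachability distance between points a and b is max(core(a), core(b), d(a, b)). Whether the point itself counts as a neighbor varies between implementations; here it does not. Prim's algorithm is used for the minimum spanning tree as one possible choice, and the four sample points are made up.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[k])  # dists[0] is the point itself (0.0)
    return cores


def mutual_reachability(points, k):
    """Dense matrix of max(core(a), core(b), d(a, b))."""
    cores = core_distances(points, k)
    n = len(points)
    return [[0.0 if i == j
             else max(cores[i], cores[j], math.dist(points[i], points[j]))
             for j in range(n)]
            for i in range(n)]


def minimum_spanning_tree(dist):
    """Prim's algorithm; returns a list of (i, j, weight) edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge that connects the tree to a point outside it.
        w, i, j = min((dist[i][j], i, j)
                      for i in in_tree
                      for j in range(n) if j not in in_tree)
        edges.append((i, j, w))
        in_tree.add(j)
    return edges


# Hypothetical data: three nearby points and one far-away point.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
cores = core_distances(points, k=2)
mst = minimum_spanning_tree(mutual_reachability(points, k=2))
print(cores)  # [2.0, 1.0, 2.0, 9.0]
print(mst)    # [(0, 1, 2.0), (0, 2, 2.0), (1, 3, 9.0)]
```

Sorting the tree's edges by weight and cutting the heaviest ones first is what produces the hierarchy: the edge of weight 9.0 separates the isolated point long before the three nearby points are split apart.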
Now, let's say you have the hierarchy, but you are still not happy with the number of clusters, because there are a lot of levels in the hierarchy. How do you decide what your optimal set of clusters should be? For that, you need to define the minimum number of objects you need in one cluster. Once you define that, you get a pruned tree, something like this. This tree is the hierarchy of the objects after removing the clusters that are not big enough, and I have plotted it against a lambda value. Now you'll ask, what is the lambda value? Once you have pruned the tree, you want to capture the clusters that are most meaningful or reliable. You do that by using the stability score. The stability score is calculated for each cluster in the hierarchy based on two factors: one is the density of the points, and the other is the persistence, basically how long the cluster lasts in the hierarchy. Here, when you look at the y-axis, you can see which clusters have the highest stability score. As you can see, the first two clearly have the highest stability scores, and the width and the color of a cluster give you the number of points in it. The third cluster we choose is the green one on the right-hand side, which is quite stable and also has a lot of points in it. These are the clusters that are chosen based on the stability score. Once we have chosen these clusters, this is how the final clustering looks. There are a few points that are grayish in color. They might be a little difficult to see, but they are there: somewhere near the greens, on the edges of the greens, those gray points are the outliers. But we are able to find well-defined clusters here. Now, just a recap of what we did. We transformed the space according to density to identify the highest-density regions.
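The talk does not spell out the lambda value. In the HDBSCAN documentation it is defined as the inverse of distance, λ = 1 / distance, and a cluster's stability is the sum, over its points, of the λ at which each point drops out of the cluster minus the λ at which the cluster was born. The sketch below assumes those definitions; the function names and the distances are invented for illustration.

```python
def lambda_value(distance):
    """Lambda is the inverse of distance: denser means larger lambda."""
    return 1.0 / distance


def stability(birth_distance, leave_distances):
    """Sum over points of (lambda when the point leaves - lambda at birth)."""
    lam_birth = lambda_value(birth_distance)
    return sum(lambda_value(d) - lam_birth for d in leave_distances)


# Hypothetical clusters, both split off from the tree at distance 4.0.
# In the tight cluster the points stay together down to small distances;
# in the loose one they fall away earlier.
tight = stability(4.0, [1.0, 1.0, 2.0])
loose = stability(4.0, [2.0, 2.0, 2.0])
print(tight, loose)  # 1.75 0.75
```

A cluster whose points persist over a wide range of λ gets a higher score, which matches the description of stability as a combination of density and persistence.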
Then we built a minimum spanning tree, and then we constructed a cluster hierarchy. Then we condensed this hierarchy based on the minimum cluster size, and then we extracted the stable clusters from the condensed tree. So that's what HDBSCAN does. Now let's look at the performance of HDBSCAN. As you can see in this graph, HDBSCAN performs better than DBSCAN: the blue line is DBSCAN and the green line is HDBSCAN. On the x-axis you see the number of data points, and on the y-axis the time taken to perform the clustering. As you can see, Fastcluster is not that good with larger numbers of data points; DBSCAN is better than that, and HDBSCAN is even better. And then the last two are k-means. K-means is the fastest, no doubt about it, but it's not that great at giving you good clusters. Now let's see the strengths and weaknesses of HDBSCAN. HDBSCAN focuses on high-density clustering, and as a result it reduces the noise problem. The minimum cluster size parameter can be set, and it's relatively faster than others, as we saw in the performance graph. It does have problems handling large amounts of data, but you can use cuML HDBSCAN, which uses GPU acceleration, and I've given the link in the description as well, so you can look at it in your free time. Now, knitting it back to what we discussed previously with BERTopic. We know that every component of BERTopic is modular, scalable, and flexible. If we were not using HDBSCAN, we could have used some other algorithm, so clustering is modular, which gives you a really good advantage. And the assumption here is that every document contains one topic. So when we give it documents to analyze, we get only one topic for every document.
Also, ChatGPT might be impacting topic modeling. ChatGPT does the job, but there are biases, ethical issues, and so on. Topic modeling is just topic modeling: you identify the important topics and you get them. So it's really useful for that, and it's easier to operationalize as well. There are a lot of cloud vendors that provide BERTopic, and you can run it using a really simple architecture, using lambda functions and the like, and it's really quick. I have given some references and resources as well. So yeah, that's about it. Thank you so much for joining our session; hope you enjoyed it. Hope you took some of the material with you, which might be useful for your further research in the future. And thank you so much for joining, see you later,
...

Abhiram Ravikumar

Cloud Machine Learning Engineer @ Collinson

Abhiram Ravikumar's LinkedIn account Abhiram Ravikumar's twitter account

Jaspal Singh Jhass

Lead Data Scientist @ Collinson

Jaspal Singh Jhass's LinkedIn account


