Conf42 Python 2025 - Online

- premiere 5PM GMT

Precision Metrics Framework: Context-Aware Evaluation for Search and Recommendation Systems


Abstract

Discover how to revolutionize search and recommendation systems with a cutting-edge metrics framework! Learn to tackle position bias, creative interactions, and attention decay using advanced normalization techniques, boosting engagement accuracy by 20%+ and optimization success by 28%.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. Thank you for joining me today. I'm Aditya Singh, and I'm excited to present a framework that's transforming how we evaluate search and recommendation systems. In the next 15 to 20 minutes, I'll show you why traditional metrics are failing us, how we can do better, and the dramatic improvements we achieved in practice.

The challenge. Let's start with the fundamental question: why do we need to evolve beyond traditional metrics? Traditional engagement metrics like CTR and conversion rate, while valuable, fail to capture three critical aspects of modern user behavior. First, position bias significantly distorts our understanding; research shows it can inflate CTR. Think about it: an item placed at the top of the page usually gets more attention, but does that truly reflect its relevance? Second, users interact with content in increasingly complex ways: scrolling, hovering, partial views, scrolling past and then coming back. These are simply missed by click measurements. Third, with the rise of multimodal content, videos, interactive elements, and rich media, we need metrics that understand how different types of content influence user attention and engagement.

So let's look at why traditional metrics fall short. Take CTR as an example. While it provides a baseline engagement metric, it fails to account for three crucial contextual factors. First, algorithmic ranking bias, where our own system's decisions influence user behavior. Second, complex interaction patterns: how users naturally navigate through content. And third, the interplay between item positioning and user selection behavior. Our data shows these distortion factors have significant impact: position-based attention decay varies by up to 30%, contextual item influence affects engagement by 25 to 40%, temporal patterns cause metric fluctuations of up to 42%, and device context requires adjustments of approximately 25%. These aren't just academic concerns; they directly affect our ability to optimize for user experience.

To address these challenges, we have developed three core mathematical models. Don't worry if you're not a math person; I'll break them down in practical terms. But before we go there, consider raw CTR again. Just because something is in the top position, it will have a higher raw CTR, and a higher conversion rate too. But when you normalize that engagement, given the fact that the item was shown in the top position, you find that the click-through or engagement rates aren't too different from the baseline. In the leftmost histogram, the red bar is the raw CTR observed for an item in the top position, and the blue bar is the normalized rate. Because that item was at the top, it was expected to get some engagement; but is that engagement coming because it's truly engaging content, or because of where it was placed? This kind of gap in the engagement metrics we collect on our platforms can lead to skewed behavior and feedback loops in our systems: if something shows up at position one, it directly gets a boost.
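To make that normalization concrete, here is a minimal Python sketch. The per-position baseline CTRs and the function name are illustrative assumptions, not numbers or code from the talk:

```python
# Per-position baseline CTRs, e.g. estimated from historical logs.
# These values are invented for illustration.
position_baseline_ctr = {1: 0.120, 2: 0.075, 3: 0.052, 4: 0.038, 5: 0.031}

def normalized_ctr(clicks: int, impressions: int, position: int) -> float:
    """Raw CTR divided by the expected CTR for that slot; a score of 1.0
    means 'performs exactly as an average item would in this position'."""
    raw_ctr = clicks / impressions
    return raw_ctr / position_baseline_ctr[position]

# A 12% raw CTR in slot 1 is merely baseline for that slot...
print(normalized_ctr(120, 1000, 1))  # 1.0
# ...while a 6% raw CTR in slot 5 is ~1.9x baseline: genuinely engaging.
print(normalized_ctr(60, 1000, 5))   # ~1.94
```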
And because of that boost, the model, or whatever learned feedback mechanism has been built into the system, can be confused about why the item was popular. These models help us identify whether the popularity or engagement is coming from position, from the surrounding context, from the type of content, from the platform, and so on. So this is what we are dealing with.

As I said, we have mathematical models, one for each of three independent dimensions we have identified. First is our position bias model. It's a very simple probability model in which the probability of a click at position i is defined as the item's relevance multiplied by beta_i, where you can think of beta_i as an attention discount factor. It tells us how much we need to adjust for an item's position: whatever the item's observed engagement is, we discount it by the positional attention factor to learn how relevant and engaging that content truly is. For instance, we found that position five needs about a 3x adjustment to reflect the true engagement rate.
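As a rough illustration of that first model, here is a sketch assuming a log-shaped attention discount. The functional form is my assumption; in practice beta would be estimated from logged or randomized exposure data:

```python
import math

def attention_discount(position: int) -> float:
    """beta_i: attention discount factor for a ranked slot. The log-shaped
    falloff here is a common assumption, not the talk's fitted curve."""
    return 1.0 / math.log2(position + 1)

def debiased_engagement(observed_ctr: float, position: int) -> float:
    """Invert P(click | position i) = relevance * beta_i to recover a
    position-independent relevance estimate from observed engagement."""
    return observed_ctr / attention_discount(position)

# 1 / attention_discount(5) = log2(6) ≈ 2.6, in the ballpark of the
# ~3x adjustment cited for position five.
print(debiased_engagement(0.03, 5))  # ≈ 0.0776
```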
Second, the temporal decay function. In this formulation we capture how users' attention diminishes over time, but with an interesting twist: attention doesn't decay linearly. We found that users have bursts of renewed interest, especially when they encounter highly relevant content.

Third, we come to our cross-item influence model. This relation helps us understand how items affect each other's performance. For example, a compelling video can boost engagement with related text content by up to 40%; but if that text were not accompanied by an engaging video, it would hardly get any attention.

Here we can look at the user attention decay analysis: what we have modeled alongside what is actually observed in data. You can see a lot of peculiarities that exist in a platform. For example, on a mobile device, when the page loads there may be only three content slots you can display to the user. The impression counts and engagement rates for the fourth item will then be very different from those observed for the first three. In this one subsystem the fourth position behaved a little differently from the first three, and you can see the pattern repeat at position eight. These peculiarities also differ within a platform: you can, say, show the same content on web, where you have more real estate and can display multiple pieces of content, whereas on a mobile device the real estate is limited and you show only a few. Even these presentation, UI, and platform differences affect how users interact with your content. Just to give an overview, the relative attention score on the y-axis is a normalized measure combining viewport timing (how long the item was visible), active engagements (how many clicks and hovers the user made), and scroll velocity near the item.

You can see that attention is higher in the top three positions, then there is some variance, and then it stabilizes at some rate. You can also see differences between what our theoretical model predicts and the actual observed user behavior, and that gap is essentially what we want to measure.

This brings us to a more comprehensive framework for contextualized evaluation. Position bias needs to be corrected, and there are multiple ways of doing it: some are correction mechanisms, while the ones we just saw are modeling mechanisms. One more important thing I'd like to cover, which we didn't earlier, is the viewport tracking method. Things like scroll depth tracking, the time for which content was visible, and the attention score are crucial in figuring out how users are spending their attention currency. For scroll depth tracking, we measure how deeply the user engaged by taking the viewport bottom and normalizing it by the page height. For visibility time, we take the difference between the end and start time of each visibility span and sum over all spans for a piece of content. And the attention score is simply the active viewing time divided by the total time. Usual stuff, but these measures go a long way toward what we are looking for.
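Those three viewport measures translate almost directly into code. A minimal sketch follows; the field and function names are hypothetical, since real trackers vary:

```python
from dataclasses import dataclass

@dataclass
class VisibilitySpan:
    start_ms: float  # timestamp the item entered the viewport
    end_ms: float    # timestamp the item left the viewport

def scroll_depth(viewport_bottom_px: float, page_height_px: float) -> float:
    """Deepest point the user reached, normalized by page height (0..1)."""
    return min(viewport_bottom_px / page_height_px, 1.0)

def visibility_time_ms(spans: list[VisibilitySpan]) -> float:
    """Total time an item was visible: sum of (end - start) over all spans."""
    return sum(s.end_ms - s.start_ms for s in spans)

def attention_score(active_time_ms: float, total_time_ms: float) -> float:
    """Share of time spent actively engaged (hover, scroll, click)."""
    return active_time_ms / total_time_ms if total_time_ms else 0.0
```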
Now let's dive into what we have learned about user attention patterns. The image displays a fascinating phenomenon that challenges conventional wisdom. You would expect the item ranked fourth to be more engaging by default than the item ranked fifth. But with a tile-like grid presentation, if the fourth item, which you believe is more relevant, shows up all the way at the top right, while the fifth item shows up directly below the top-ranked content, you'll see that the default engagement rate, purely because of position, is higher at position five than at position four.

The observations are pretty evident. Top positions receive three to five times more attention. There is a left-side bias in horizontal layouts, at least where the script is read from left to right; where the script is read from right to left, there might be a right-side bias instead. There are significant drop-off points in the UI: where there are folds or layout changes, we see significant drops in engagement. And, as I said earlier, there are huge differences between attention patterns on mobile devices and desktops: desktops have larger real estate, while mobile devices have a completely different interface. Content quality and user intent act as powerful moderating factors, but position bias can still overpower them. There also exists something called an attention cliff: as we noted earlier, there is a drop in attention or engagement at certain positions, very particular to how the content was presented to the users.

Related to that, there is a sort of psychological threshold after which users switch from a systematic to an exploratory browsing mode. It can show up when a user decides they have scrolled enough pages: that page becomes the unit that gets attention, and subsequent pages do not. You have probably seen this with Google search: almost no one navigates to page two, and almost no one goes to page three or four unless they are looking for something very specific. Switching from result page one to page two is one such attention cliff. If content is getting engagement on the second page, its engagement rate could have been much higher had it been shown at a relevant position on page one. These things are very particular to how the whole system is orchestrated, how the platform exists, and why and how users interact with it. It also differs between users: different users interact with things very differently, and different generations, millennials, Gen Z, and so on, interact with content in significantly different ways.

Before we go on, I want to say that all of this starts with a data collection layer. It is our foundation: every observation here rests on an excellent data collection layer that gives us things like viewport tracking. A lot of people don't have it; I would highly suggest you build sophisticated viewport tracking and capture as many events as possible on your platform. That means viewport tracking for precise visibility measurement; interaction events that show how users engage; user data for contextual understanding, that is, saving the user journey as they interact with the content, what happened in their session after they logged in, and so on; and content data to classify the different formats. For example, a user logs in, interacts with a video, and two positions later there is a text item. We have observed that engagement rates on subsequent content can change after an engaging piece has been shown, depending on what format it is.

Before we go to format-level analysis, let me also talk about how we mitigate position bias. There are multiple ways to do it; we have already covered some models and metrics for measuring it. Beyond those, viewport visibility patterns are analyzed to identify areas of high user attention, which requires sophisticated logging of all the events happening in the UI. And scroll depth distribution data is used to adjust item scores based on the relative likelihood that users ever reached them. These corrections can be applied apart from your traditional ranking, or built into your ranking model.
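As a sketch of that last idea, here is one way scroll-depth distribution data might be used to exposure-adjust item scores. The distribution values and names are illustrative assumptions, not the talk's implementation:

```python
# Fraction of sessions in which users scrolled at least this deep
# (depth as a fraction of page height, 0..1). Values are invented.
reach_prob_by_depth = {0.25: 0.95, 0.50: 0.70, 0.75: 0.40, 1.00: 0.15}

def exposure_adjusted_score(engagement_rate: float, item_depth: float) -> float:
    """Divide the observed rate by the fraction of users who plausibly
    saw the item, so deep items aren't penalized for never being seen."""
    # Pick the closest recorded depth bucket at or below the item's reach.
    bucket = min(d for d in reach_prob_by_depth if d >= item_depth)
    return engagement_rate / reach_prob_by_depth[bucket]

# An item at 75% page depth with a 2% observed rate scores 0.05,
# comparable to a top-of-page item with a 5% rate.
print(exposure_adjusted_score(0.02, 0.75))  # 0.05
```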
All of these models can be implemented with very different approaches, but in whatever form they are applied, they reduce position bias by up to 30%. It can be even higher; at least we have seen 30%, and this leads to accurate measurement of item relevance and performance once position bias is removed.

Let's go to the next type of interaction: cross-format interaction. Moving to content format interactions, our format interaction matrix reveals some surprising patterns. On the left-hand side is the content that was shown first, and along the top is the content subsequently shown to the user. The numbers aren't exactly what I have seen on different platforms, because some things cannot be shared exactly, but it looks something like this: the format of an item that has been shown to the user affects how the user interacts with subsequent content.

Videos emerge as the strongest influence drivers. We have observed that short-form video, Reels, TikTok, and the like, creates what we call an attention halo effect: it boosts engagement with neighboring content by 70 to 80 percent, but the timing is crucial, because the effect decays over about three or four positions. You can use this kind of information to build models that repeatedly spike the engagement we get from the user. Interactive elements, where the user can click, tap, or zoom, create what we call engagement momentum: users who interact with one interactive element are 75 percent more likely to engage with others, which creates opportunities for strategic content placement. Think about it: if you're designing an ad creative, strategically placed and intelligently made interactive elements can do wonders. And perhaps most surprisingly, we discovered that complementary format pairs, like an infographic followed by detailed text, can boost mutual engagement by up to 40%. This isn't just about variety; it's about cognitive synergies, uncovered by studying user behavior.

Now let's look at concrete results. Our viewport-aware scoring system has transformed how we measure engagement and delivered three key improvements. First, precision in attention measurement: our tracking systems capture fine-grained, sub-second visibility data, revealing users making crucial engagement decisions in as little as 300 milliseconds. Second, we account for what we call engagement windows, the optimal periods when users are most receptive to certain content types; for example, video content performs 35 percent better when presented after text-based engagement. And third, by combining these insights into our mathematical models, we have achieved a 28 percent increase in optimization success rates across different platforms and content types. Accounting for content formats matters; the cross-format interaction strength is really strong.
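A toy sketch of such a format interaction matrix is below. All the numbers are invented for illustration (the talk notes the real values can't be shared), and the linear decay over the three-to-four-position halo is my assumption:

```python
# Keys: (format shown first, format shown next).
# Values: multiplicative boost to expected engagement on the next item.
influence = {
    ("short_video", "text"): 1.75,         # ~70-80% "attention halo"
    ("short_video", "image"): 1.70,
    ("interactive", "interactive"): 1.75,  # engagement momentum
    ("infographic", "text"): 1.40,         # complementary format pair
}

def expected_boost(prev_fmt: str, next_fmt: str, distance: int) -> float:
    """Boost on an item `distance` slots after a preceding item, fading
    to nothing by roughly the fourth position away."""
    base = influence.get((prev_fmt, next_fmt), 1.0)
    decay = max(0.0, 1.0 - (distance - 1) / 4)
    return 1.0 + (base - 1.0) * decay

print(expected_boost("short_video", "text", 1))  # 1.75 (adjacent slot)
print(expected_boost("short_video", "text", 5))  # 1.0 (halo has faded)
```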
To repeat what I said earlier: the framework estimates what the actual strength of these cross-format effects is, and based on that we can optimize how to use that strength and create the synergies I was referring to. Temporal attention decay is modeled to capture the diminishing effects of user attention over time, particularly over the long term, and these insights inform the evaluation process, leading to significant improvements in optimization decisions.
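As a toy illustration of what a decay-with-renewal curve might look like, here is a sketch. The exponential form, half-life, and renewal gain are my assumptions, not the talk's fitted model:

```python
import math

def attention_at(t_seconds: float,
                 half_life_s: float = 30.0,
                 renewal_times: tuple[float, ...] = (),
                 renewal_gain: float = 0.4) -> float:
    """Exponential baseline decay plus a decaying burst at each renewal
    time, i.e. a moment when highly relevant content re-captured
    the user's attention."""
    k = math.log(2) / half_life_s
    score = math.exp(-k * t_seconds)
    for r in renewal_times:
        if t_seconds >= r:
            score += renewal_gain * math.exp(-k * (t_seconds - r))
    return score

# Without renewal, attention halves every 30s; a burst at t=60s lifts it.
print(attention_at(60.0))                         # 0.25
print(attention_at(60.0, renewal_times=(60.0,)))  # 0.65
```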
As I covered, all of this needs precise viewport tracking and session-level dynamics of how users are interacting with the platform within each session; and for every metric we compute, we need to know how accurate the measurement is and where it is actually coming from. All of it depends on good data collection: capturing real-world viewport visibility data, spatio-temporal interaction mapping, and finally controlled experiments in which users are exposed to certain UI elements, content types, pairs, or combinations. These are the ways we learn more and optimize the system better.

The real-world impact of this framework has been substantial. Let me share three key metrics that demonstrate its value. Prediction reliability has improved by 35%; in practical terms, this means we are significantly better at forecasting which content will resonate with users, and for content teams it translates to more efficient resource allocation and better editorial decisions. Our optimization success rate has increased by 28%; teams using this framework make better-informed decisions about content placement and format selection, and one team reduced their A/B testing cycle time by 40 percent while achieving better results. More importantly, these aren't just metric improvements; we are seeing tangible user experience enhancements, with a 15 percent increase in session depth and a 23 percent improvement in content discovery.

Let's go to the next slide. Looking ahead, we are exploring four exciting directions that will further enhance this framework. I see I have only listed three here, but let me go through them. First, advanced ML models, which let us do personalized normalization, because we learn a user profile and how that particular user interacts with our platform. Second, multimodal evaluation: we are expanding the framework to handle emerging formats like augmented reality and interactive visualizations, including the development of new metrics specifically for immersive content experiences. Third, real-time adaptation of the framework to evolving user behavior, handled by the next generation of models, which adjust through online training based on how users are interacting with the platform right now; think of it as an adaptive normalization system. And the fourth is privacy-first analytics: whatever we learn, we can learn in a federated way, so that user privacy is preserved. We still learn how the user interacts, but in a very privacy-preserving manner.

Key takeaways. As we conclude, I want to emphasize three critical points. First, context matters more than ever: traditional metrics alone cannot capture the complexity of modern user interactions, and our framework provides the necessary context to make meaningful optimization decisions. Second, normalization and correction aren't just technical necessities; they are fundamental to understanding true content performance in today's multi-format, multi-device world. Finally, data-driven insights, when properly contextualized, lead to significantly better optimization decisions. We have seen this across different platforms, content types, and user segments.

Beyond these three takeaways, there are some advanced metric components I have added to the slides; feel free to take a look and ping me with any questions. The implementation architecture is also in the slides; it's pretty standard and covers everything I have already talked about. The future ML directions slide is attached as well: advanced ML models, real-time adaptation, multimodal enhancement, and privacy-first analytics, as we went through.

With this, I would say thank you. Thank you for your attention. I look forward to your questions and to discussing how these approaches might benefit your specific use cases. Thank you.
...

Aditya Singh

MS in Computer Science @ UW - Madison

Aditya Singh's LinkedIn account


