Abstract
Amassing large amounts of data is in the very nature of most tech companies. However, in companies which don’t have a dedicated ML focus, there is often relatively little effort put into thoroughly analysing this data and finding uses for it. In fact, there exists a misconception that ML is “irrelevant” to certain problems, and thus spending time on ML data analysis is a waste.
The real waste, as Anisha and I discovered, lies in ignoring this data. By using our large body of historical data and implementing a simple random forest regression, we were able to discover a method of routing Lob mail pieces which, as time will tell, could have a significant impact on our mail delivery timings.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi. This is our talk: Don't Waste Data.
All right, everyone, so we
are going to start off by talking about a classic problem.
We all remember the traveling salesman problem that we studied during our
undergrad days. But after that,
that entire problem took a backseat right
after we turned in our test papers and things like that,
but here at Lob, we take that traveling salesman problem, or as we call it, the traveling mail piece problem, a little bit more seriously. As a print and mail company, we generate mail pieces digitally and route them to our print partners to, as we say in our company motto, connect the offline and online worlds.
And we have to deal with this every single day.
So, as we know, USPS mail delivery times have been impacted severely during the pandemic, and we want to make sure that the mail pieces we are responsible for get to their destinations as fast as possible.
So in this case, just like the traveling salesman
problem, the mailpieces make multiple hops through the various
distribution centers that are across the entire USPS network.
Some of these hops, as we call them, are beyond our control, because we literally do not control the USPS, but some of the hops are within our control. That includes which print partner is the origination point for a given mail piece.
That's why we spent a lot of time working on a routing engine, which uses a modified geolocation algorithm to determine which print partners are the closest to the destination of the mail piece. That could potentially cut down transit time, right?
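To give a flavor of what a purely geolocation-based selection looks like, here is a minimal sketch that picks the print partner with the smallest great-circle distance to the destination. This is an illustration of the general idea only, not Lob's actual modified algorithm, and the partner and destination data structures are assumptions:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def closest_partner(destination, partners):
    # Pick the print partner geographically closest to the mail piece's destination.
    return min(
        partners,
        key=lambda p: haversine_miles(p["lat"], p["lon"], destination["lat"], destination["lon"]),
    )
```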
But is that right? Over to Aditi to talk more about this problem.
As it so happens,
in a twist worthy of M. Night Shyamalan,
our current approach may not be the best one.
Only time and more performance results will tell. But as
Anisha and I will discuss, sometimes the most obvious solution
isn't necessarily the best one. Sometimes it takes
creative thinking and data analysis to unearth surprising
improvements. Our project began on a whim.
Like many companies, Lob.com has databases of
information on the mail pieces we send. And like many
companies which aren't strictly ML oriented, we've conducted very
little analysis on this data using machine learning tools.
There's a misconception that machine learning is somehow
inaccessible, that it needs expert guidance and
a dedicated ML team to really implement.
As a budding ML student, someone who has taken a couple of ML
classes but is still an amateur, I saw untapped
potential for improvements to a number of our systems,
foremost being our routing engine. Is it possible, I wondered, that we were taking a less efficient approach to print partner selection? Could data analysis tell a different story than the one we had assumed would hold true? Over to Anisha.
All right, so before we talk about the analysis,
or what Aditi and I unearthed,
I want us to get some baselines here around the
context of the data that we are talking about. So the USPS mail distribution network splits nearly 42,000 destination zip codes in the US into various fixed hierarchies and webs, the topmost being the 29 national distribution centers (NDCs), followed by 450 regional sectional center facilities (SCFs). Millions of mail pieces move through these systems every day, and each mail piece can generate up to 105 different operations codes that translate to different business logic around what stage of the mailing process that mail piece is in. So, as you
can see, this particular system has been built
to create messy data. So Aditi and I brainstormed to pick out the highest-fidelity signals that we could from these messy ops codes, focusing specifically on the mail entry and exit
signals. That helped us translate them into a simple metric: what is the transit time that a mail piece takes when it goes through a particular partner and a particular NDC and SCF network? We calculated the difference between the earliest and the latest ops code timestamps for each mail piece in order to figure out this transit time, and that was our target metric.
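As a rough sketch of that preprocessing step (the file name and column names like mailpiece_id and event_timestamp are hypothetical, not from our actual pipeline), the target metric can be computed along these lines:

```python
import pandas as pd

# One row per (mail piece, ops code event), with a timestamp for each event.
events = pd.read_csv("usps_ops_code_events.csv", parse_dates=["event_timestamp"])

# Transit time per mail piece = latest event timestamp minus earliest event timestamp.
transit = (
    events.groupby("mailpiece_id")["event_timestamp"]
    .agg(first_event="min", last_event="max")
    .reset_index()
)
transit["transit_days"] = (transit["last_event"] - transit["first_event"]).dt.days
```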
We did discuss a more elaborate version of this project that involves using vectors of these ops codes and looking at specific network routes. But for the purpose of the first iteration, we narrowed down to the zip codes associated with specific partners with solid historical data, and the corresponding destination zip codes for the mail pieces. Over to Aditi to talk in detail about our model.
Once we
had determined the shape of the data and the features we should focus on,
I set about creating a model. With a wealth of ML tools available across languages like Python and Julia, this was hardly a daunting task. For this, I used scikit-learn, one of the most popular ML libraries around, and plugged the data into a random forest regression. As mentioned
earlier, I used the zip codes of the print partner and the destination of the mail piece as inputs. Our output target was the metric we had calculated during preprocessing: the difference in days between the earliest and latest USPS events recorded for each mail piece. In other words, the mail piece's time in transit. Even just using 50% of the data, or 500k data points, due to Databricks memory constraints, the difference between the model's prediction error and the average error calculated using non-machine-learning methods was astonishing. The error margin dropped from 1.43 days to 0.86. Print partners
that were geographically further from the destinations sometimes yielded lower times in transit, a conclusion which may not seem intuitive, but one which does take into account historical data in a way our current routing algorithm does not.
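For a sense of how little code this takes, here is a minimal sketch of the kind of scikit-learn random forest regression described above. The file name, column names, baseline, and train/test split are assumptions for illustration, not the exact code from our hackathon project:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical columns: print partner zip, destination zip, and the
# transit time in days computed during preprocessing.
df = pd.read_csv("mailpiece_transit.csv")
X = df[["partner_zip", "destination_zip"]]
y = df["transit_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compare the model's error against a naive baseline that always
# predicts the historical average transit time.
baseline = [y_train.mean()] * len(y_test)
print("baseline MAE:", mean_absolute_error(y_test, baseline))
print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))
```

Tree-based models can split on raw zip codes reasonably well, but treating them as plain integers here is a simplification, not a recommendation.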
This approach can also account for seasonal fluctuations in a way that geolocation doesn't necessarily, because those seasonal patterns will show up in the historical data.
So one print partner might be closer to
a destination, but maybe it gets flooded around Christmas,
and it's actually faster to send it to another print partner which is a
little further away. And that's something that you can
get from the data that we found, but not necessarily something that
would show up using just a strict geolocation
approach. Although we are in the early stages of
this project and have not yet plugged this into the live router to aggregate
performance data, these preliminary numbers justify
our decision to analyze existing historical data,
even though our routing algorithm already seemed to be optimally
configured according to common sense.
So what's the takeaway? This project began as something shockingly
simple. No in-depth knowledge of ML was
required. I didn't have to do any calculus or
write my own custom model. The most challenging part was
deciding which data features to use. Although Anisha and
I have plans to further refine our work, which could add complexity,
the baseline project was finished in a matter of two days during a work hackathon, which really leads to the point of this talk: our efforts can and could easily be replicated. In a digital world packed with data about practically anything, a company doesn't need to have an
ML focus, or a dedicated ML department,
or even an ML expert in order to make breakthroughs and
improvements to their processes using machine learning tools.
All you really need to do is pull up the scikit-learn documentation
and put your data to work.