Abstract
Amassing large amounts of data is in the very nature of most tech companies. However, in companies which don’t have a dedicated ML focus, there is often relatively little effort put into thoroughly analysing this data and finding uses for it. In fact, there exists a misconception that ML is “irrelevant” to certain problems, and thus spending time on ML data analysis is a waste.
The real waste, as Anisha and I discovered, lies in ignoring this data. By using our large body of historical data and implementing a simple random forest regression, we were able to discover a method of routing Lob mail pieces which, as time will tell, could have a significant impact on our mail delivery timings.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi. This is our talk: Don't Waste Data.
All right, everyone, so we
are going to start off by talking about a classic problem.
We all remember the traveling salesman problem that we studied during our
undergrad days. But after that,
that entire problem took a backseat right
after we turned in our test papers and things like that,
but here at Lob, we take that traveling salesman problem, or as we call it, the traveling mail piece problem, a little bit more seriously. As a print and mail company, we generate mail pieces digitally and route them to our print partners to, as we say in our company motto, connect the offline and online worlds.
And we have to deal with this every single day.
So, as we know, USPS mail delivery times have been impacted severely during the pandemic, and we want to make sure that the mail pieces we are responsible for get to their destinations as fast as possible.
So in this case, just like the traveling salesman
problem, the mailpieces make multiple hops through the various
distribution centers that are across the entire USPS network.
Some of these hops, as we call them, are beyond our control, because we literally do not control the USPS, but some of the hops are within our control. That includes which print partner is the origination point for a given mail piece.
That's why we spent a lot of time working on a routing engine, which uses a modified geolocation algorithm to determine which print partners are the closest to the destination of the mail piece. That could potentially cut down transit time, right?
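To give a flavor of what a purely geolocation-based selection looks like, here is a minimal sketch that picks the print partner with the smallest great-circle distance to the destination. This is an illustration of the general idea only, not Lob's actual modified algorithm, and the partner and destination data structures are assumptions:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def closest_partner(destination, partners):
    # Pick the print partner geographically closest to the mail piece's destination.
    return min(
        partners,
        key=lambda p: haversine_miles(p["lat"], p["lon"], destination["lat"], destination["lon"]),
    )
```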
But is that right? Over to Aditi to talk more about this problem.
As it so happens,
in a twist worthy of M. Night Shyamalan,
our current approach may not be the best one.
Only time and more performance results will tell. But as
Anisha and I will discuss, sometimes the most obvious solution
isn't necessarily the best one. Sometimes it takes
creative thinking and data analysis to unearth surprising
improvements. Our project began on a whim.
Like many companies, Lob.com has databases of
information on the mail pieces we send. And like many
companies which aren't strictly ML oriented, we've conducted very
little analysis on this data using machine learning tools.
There's a misconception that machine learning is somehow
inaccessible, that it needs expert guidance and
a dedicated ML team to really implement.
As a budding ML student, someone who has taken a couple of ML
classes but is still an amateur, I saw untapped
potential for improvements to a number of our systems,
foremost being our routing engine. Is it possible, I wondered, that we were taking a less efficient approach to print partner selection? Could data analysis tell a different story than the one we had assumed would hold true? Over to Anisha.
All right, so before we talk about the analysis,
or what Aditi and I unearthed,
I want us to get some baselines here around the
context of the data that we are talking about. So the USPS mail distribution network splits nearly 42,000 destination zip codes in the US into various fixed hierarchies and webs, the topmost being the 29 national distribution centers (NDCs), followed by 450 regional sectional center facilities (SCFs). Millions of mail pieces move through these systems every day, and each mail piece can generate up to 105 different operations codes that translate to different business logic around what stage of the mailing process that mail piece is in. So, as you
can see, this particular system has been built
to create messy data. So Aditi and I brainstormed to pick out the highest-fidelity signals that we could from these messy ops codes, focusing specifically on the mail entry and exit
signals. That helped us translate them into a simple metric: what is the transit time that a mail piece takes when it goes through a particular partner and a particular NDC and SCF network? We calculated the difference between the earliest and the latest ops code timestamps for each mail piece in order to figure out this transit time, and that was our target metric.
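As a rough sketch of that preprocessing step (the file name and column names like mailpiece_id and event_timestamp are hypothetical, not from our actual pipeline), the target metric can be computed along these lines:

```python
import pandas as pd

# One row per (mail piece, ops code event), with a timestamp for each event.
events = pd.read_csv("usps_ops_code_events.csv", parse_dates=["event_timestamp"])

# Transit time per mail piece = latest event timestamp minus earliest event timestamp.
transit = (
    events.groupby("mailpiece_id")["event_timestamp"]
    .agg(first_event="min", last_event="max")
    .reset_index()
)
transit["transit_days"] = (transit["last_event"] - transit["first_event"]).dt.days
```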
We did discuss a more elaborate version of this project that involves using vectors of these ops codes and looking at specific network routes. But for the purpose of the first iteration, we narrowed down to the zip codes associated with specific partners with solid historical data, and the corresponding destination zip codes for the mail pieces. Over to Aditi to talk in detail about our model.
Once we
had determined the shape of the data and the features we should focus on,
I set about creating a model. With a wealth of ML tools available across languages like Python and Julia, this was hardly a daunting task. For this, I used scikit-learn, one of the most popular ML libraries around, and plugged the data into a random forest regression. As mentioned
earlier, I used the zip codes of the print partner and the destination of the mail piece as inputs. Our output target was the metric we had calculated during preprocessing: the difference in days between the earliest and latest USPS events recorded for each mail piece. In other words, the mail piece's time in transit. Even just using 50% of the data, or 500k data points, due to Databricks memory constraints, the difference between the model's prediction error and the average error calculated using non-machine-learning methods was astonishing. The error margin dropped from 1.43 days to 0.86. Print partners
that were geographically further from the destinations sometimes yielded lower times in transit, a conclusion which may not seem intuitive, but one which does take into account historical data in a way our current routing algorithm does not.
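For a sense of how little code this takes, here is a minimal sketch of the kind of scikit-learn random forest regression described above. The file name, column names, baseline, and train/test split are assumptions for illustration, not the exact code from our hackathon project:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical columns: print partner zip, destination zip, and the
# transit time in days computed during preprocessing.
df = pd.read_csv("mailpiece_transit.csv")
X = df[["partner_zip", "destination_zip"]]
y = df["transit_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compare the model's error against a naive baseline that always
# predicts the historical average transit time.
baseline = [y_train.mean()] * len(y_test)
print("baseline MAE:", mean_absolute_error(y_test, baseline))
print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))
```

Tree-based models can split on raw zip codes reasonably well, but treating them as plain integers here is a simplification, not a recommendation.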
This approach can also account for seasonal fluctuations in a way that geolocation doesn't necessarily, because those seasonal patterns will show up in the historical data.
So one print partner might be closer to
a destination, but maybe it gets flooded around Christmas,
and it's actually faster to send it to another print partner which is a
little further away. And that's something that you can
get from the data that we found, but not necessarily something that
would show up using just a strict geolocation
approach. Although we are in the early stages of
this project and have not yet plugged this into the live router to aggregate
performance data, these preliminary numbers justify
our decision to analyze existing historical data,
even though our routing algorithm already seemed to be optimally
configured according to common sense.
So what's the takeaway? This project began as something shockingly
simple. No in-depth knowledge of ML was
required. I didn't have to do any calculus or
write my own custom model. The most challenging part was
deciding which data features to use. Although Anisha and
I have plans to further refine our work, which could add complexity,
the baseline project was finished in a matter of two days during a work hackathon, which really leads to the point of this talk: our efforts can and could easily be replicated. In a digital world packed with data about practically anything, a company doesn't need to have an
ML focus, or a dedicated ML department,
or even an ML expert in order to make breakthroughs and
improvements to their processes using machine learning tools.
All you really need to do is pull up the scikit-learn documentation
and put your data to work.