Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, everybody. I'm Alex. I lead the data science
and analytics team here at Kando, and I'm very happy to be
with you today at this year's edition of Python
Web conference, presenting what we do
today with data science, algorithms,
and machine learning, in order to generate intelligence
in the wastewater. So I want to begin
by breaking these words down a little bit. So what's
wastewater intelligence? Right. So,
wastewater, I think most of us know, right? We use the restroom,
we use the sink, we use a shower. Water goes down the drain. Boom,
success. Wastewater created. That's one part of it. The other
part, maybe less familiar to some of us, is what's
called industrial wastewater. Industries everywhere
use water just like we do, and some of them use water in their processes.
And after these processes are complete, water finds
its way into the wastewater collection system,
where it needs to get treated. Now, these processes
can be such that they
introduce various pollutants into the wastewater,
which is typically okay. The wastewater facilities know and are able to
deal with those pollutants, but every now and then,
pollutants exceed either types or various
limits, and then they are potentially very detrimental
to processes. We'll talk about that also in a few words soon
enough. So, we understand wastewater a
little better. Now, what's intelligence? What information
are we looking to get from our wastewater? So one
piece of information is to understand what's going on,
right? The whole system is underground. Right. We're mostly oblivious to it.
We have no idea what's going on there. So initially, we want to be able
to understand what's taking place. Then we want to be able to understand
whether or not something out of the ordinary is taking place.
And once we understand points one and two,
we would very much like to be able. By we, I mean wastewater
treatment facilities that need to do something with
whatever it is that's coming their way, they need
to be able to understand what steps they can and should take in
order to mitigate whatever potential damage is coming right now or
that has happened sometime in the past.
Okay, so why is the whole thing interesting? Right?
You might ask legitimately, do we really need to know this?
So, first of all, I think that specifically now,
with global warming, climate change, et cetera, I don't really need to
convince anyone that water is important. So we are all aware
that water is a very finite resource. And specifically,
drinking water or water that we, as humans can use is
rare throughout the world. It's relatively scarce,
and at least I've learned over
my relatively short time so far at Kando that treating
wastewater is a very complicated process. It's typically composed
of multiple steps. Each one is responsible
for addressing one type or another of
either pollution or contaminant that is taking place
in the wastewater. So getting water from
being a waste to actually being usable again, either for
irrigation or just for reintroduction into the water cycle,
is a very, very delicate and nontrivial
process. So a
lot of the events that can be introduced into the wastewater,
as I mentioned previously, by various industrial facilities,
are able to damage this delicate process and either render the
wastewater treatment facilities themselves, in extreme cases,
inoperable, or just cause degradation
in the effluent water quality. So the water that is
later used to irrigate our crops or
that is introduced into reservoirs,
rivers, oceans, et cetera, is polluted,
basically hurting all of us and our lives
day to day. So what do our users
need to know when Kando is providing this platform? What is this platform
telling the wastewater treatment facility? So, first of all,
we're informing, or we're ideally informing them
that something is happening, right? So be aware that there is
an event taking place or there is a disturbance in
the force somewhere throughout the collection system, either somewhere
close to the producer facilities or somewhere further downstream.
Then we need to be able to tell them what it is that's
happening, right? So the event is of such and such category
type intensity, what the pollution is, ideally,
what is the potential damage of this pollution. And then
just as importantly, if we're able to, we need to tell
them where the pollution is coming from. And this is important for two reasons.
First of all, they need to be able to kind
of verify that this pollution makes sense in terms of the pollution
source. And the other, much clearer, is that this is their
way to prevent this from happening in the future, right?
So once they know which facility is causing such
or other pollution event, they're able to contact the
facility, make sure that the processes that take place
in that facility are correct and that the treatment or pretreatment that
needs to happen to whatever water is being discharged into
the waste is taking place.
And lastly, if there is something illicit going on or
some malfunction or something that needs to be addressed
more severely, then there are, of course, legal and regulatory
approaches that can be taken. But in order for them to be applied,
we need to know, or they need to know who to apply them to.
Okay? So fine. I gave a very nice
elevator pitch of what Kando is and what it does, and it took me
a good several minutes, and now you all know, and now you all want to
buy one of our systems and install them in your homes and be happy with
them, of course. But this is a machine
learning and data science track. So what are our machines learning?
What is the data science here?
So let's
start with addressing our data sources, right?
The first and kind of probably easiest to understand is just open
source data, right? So when we deploy our system at a specific location,
we need to be aware of where it is we're deployed, what is
the combination of sanitary to
industrial facilities,
how many people live, wherever it is they live,
what the topology of the ground or the surface is,
so that we are able to understand how the sewage network is built.
Once we understand where it is we're located, we start
collecting information. I'll go into the deployment process also in
a few steps, a little bit further.
Then we start gathering our own information,
right? We have a bunch of sensors, some of them customary,
off the shelf. Others are designed
and built specifically to our specifications. Those sensors
generate a lot of signals. These signals need to be processed
and understood over time. And then lastly,
the very nontrivial bit of information that is partially
proprietary and partially open source is data
about lab samples. Right? So we've
built our own database, and we also access externally
available data sources, that tell us the results of
a sampling process, which basically consists of taking a little bit
of wastewater and sending it for analysis in a lab. You can
see a sample of such an analysis on the right here.
We are able to collect a lot of information about what
pollutants are found, where and when.
Okay, so these are kind of the main data sources that
we're working with day to day. Okay, great.
So now we understand what data goes into
our bellies, but what do we do with that information? So the first question
we need to answer is how our system needs to be deployed.
Right? So we need to understand, going into a new region, a new area,
a new wastewater treatment facility,
which locations need to be monitored, and how.
Right. The second question we need to understand
is what constitutes an event, right? So we have
a lot of signals. Which of these signals are interesting
and to what extent? Next,
once we found something that is interesting, we need to be able
to understand what that is, whether or not it's something that requires direct action,
or is something that we just need to pass on as information.
And lastly, as I mentioned, having found something
and understood what that something is, we would very
much like to be able to pinpoint specifically what that information source is.
Right. Where this pollution is coming from, and understand who its
creator is. Okay, so let's
dive into a little bit of the nuts and bolts. One question
is the deployment information, right. And for that,
we need to combine information about the
wastewater network, which is something that we typically get from our
customers, and a lot of open
source data. Right. Who are the people? Where do they live?
What information do we have about industrial
facilities, to what sectors they belong, what these sectors do, how they
do it. Once we understand all that, we're able to generate
a map that says, well, this location is
very important. This location is not as important here. We need to have
finer resolution. And over there, we can just kind of get a typical
overall glance, and that will be enough for us. So we understand
what facilities are located where, what
their potential pollution may be,
and where those potential pollutants are
gathered, such that we can focus on relevant areas.
Once we figure out where we need to be and we've traded off
the resolution versus the cost of deployment,
et cetera, we actually start monitoring. Right. And monitoring basically
means generating a lot of time series through different
sensors. And as you can see, a tiny example over here,
it's not that easy to know when something is taking
place that is out of the ordinary versus kind of just the
regular bits and pieces of what happens throughout the day.
So in order to facilitate this
understanding of what an event is, we basically
have a three step process. One step generates candidates
using metrics for outlier detection that kind of identify
interesting bits of our signals. And we put those interesting bits
of our signals to the side. Then the top candidates, chosen using,
again, some scoring process that we have,
are sent internally to
expert labelers who tell us what the relevant signals are,
in cases that they know. They don't always know, but a
lot of times they do. So they're able to tell us, well, this is one
type of an event, this is something or other. This is a
pollution of this type and so on and so forth.
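
To make the candidate-generation step a bit more concrete, here is a minimal sketch of one common outlier metric: a robust score based on the median and the median absolute deviation (MAD) of a trailing window. The talk does not say which metrics are actually used, so the metric, the function names, the threshold, and the readings are all assumptions for illustration.

```python
from statistics import median


def robust_scores(values, window=12):
    """Score each point by its distance from the trailing-window median,
    measured in units of that window's median absolute deviation (MAD)."""
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 3:
            scores.append(0.0)  # not enough context to judge this point
            continue
        center = median(history)
        mad = median(abs(x - center) for x in history)
        # Floor the scale so a perfectly flat history does not divide by zero.
        scores.append(abs(value - center) / max(mad, 1e-6))
    return scores


def top_candidates(values, k=3, threshold=5.0, window=12):
    """Return the indices of the k highest-scoring points above the threshold."""
    scores = robust_scores(values, window)
    flagged = [i for i, score in enumerate(scores) if score > threshold]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)[:k]


# Made-up pH-like readings with one obvious excursion at index 8.
readings = [7.0, 7.1, 6.9, 7.0, 7.2, 7.0, 6.9, 7.1, 3.2, 7.0, 7.1]
candidates = top_candidates(readings)
```

In a flow like the one described, the indices in `candidates` would be the "interesting bits" that get put aside and passed on to the expert labelers.
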
Taking all this information, we are now able
to proceed to the third step in which
we enrich our data set by matching the
known patterns to a lot of unknown data, where we can
tell externally what needs to be relatively similar.
And once we know where to look, we know exactly what to look for.
We're able to get a lot more information relating to
a specific label. Of course, again with internal kind of validation
and corrections.
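
As a rough illustration of the enrichment step, the sketch below slides a labeled pattern over an unlabeled signal and keeps the windows whose shape correlates strongly with it. Pearson correlation is only one possible similarity measure and is assumed here, not taken from the talk; the template, the signal, and the 0.9 cutoff are made up.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def find_matches(signal, template, min_corr=0.9):
    """Slide the labeled template over an unlabeled signal and return
    (start_index, correlation) for every window that resembles it."""
    width = len(template)
    matches = []
    for start in range(len(signal) - width + 1):
        corr = pearson(signal[start:start + width], template)
        if corr >= min_corr:
            matches.append((start, corr))
    return matches


# Hypothetical labeled pattern: a sharp drop followed by a slow recovery.
template = [7.0, 4.0, 5.0, 6.0, 7.0]
# Hypothetical unlabeled signal that contains a similar shape starting at index 3.
signal = [7.0, 7.1, 7.0, 7.2, 4.5, 5.3, 6.2, 7.1, 7.0, 7.0]
matches = find_matches(signal, template)
```

Each match would still go through the internal validation and corrections mentioned above before it is added to the labeled data set.
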
All right, so we know where to look. We know kind
of what we're looking for. Now that we found samples of
interesting data, we need to be able to classify them. In order to classify
our samples, we typically, pardon me, go in two different
directions. One direction, in completely classical
machine learning, is regression. And in regression, what we do
is we train an easily obtainable
source of information to match a very difficult to
obtain source of information, such that in
having trained a specific subset of locations and
events to that system,
we can now deploy the relatively easily
obtainable sensors that generate a lot of data instead
of the very difficult and cumbersome sensors that are very
accurate but very hard to maintain. And this allows us
to be able to analyze signals that otherwise
would be either very expensive or in
yet other cases, almost impossible to obtain.
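
A minimal sketch of this regression idea, assuming a single cheap sensor and a plain least-squares line (the talk does not specify the model, and real deployments would likely use more features and a library such as scikit-learn). The paired readings below are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical paired data from locations where both sensors were deployed.
cheap_sensor = [10.0, 20.0, 30.0, 40.0, 50.0]         # easy-to-obtain signal
reference_values = [52.0, 98.0, 151.0, 199.0, 250.0]  # hard-to-obtain measurement

slope, intercept = fit_line(cheap_sensor, reference_values)


def estimate(reading):
    """Estimate the hard-to-measure quantity from a cheap sensor reading."""
    return slope * reading + intercept


estimated = estimate(35.0)
```

Once fitted on the subset of locations that have both kinds of measurement, `estimate` can be applied wherever only the cheap sensor is installed.
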
That's the regression direction. And then the classification is,
as I mentioned previously, when we have
built a large enough data set of labeled or semilabeled information,
we're now able to, in real time, classify events to
belong to different sources of
pollution. Sorry, different types of pollution.
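
For the classification direction, one of the simplest possible models is a nearest-centroid classifier, sketched below. This is an assumed stand-in, not the classifier described in the talk; the two features and the labels are hypothetical.

```python
from math import dist


def train_centroids(samples):
    """Average the feature vectors of each label into one centroid per label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(features, centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))


# Hypothetical labeled events: (minimum pH, peak conductivity) -> pollution type.
labeled_events = [
    ((3.1, 1.2), "acidic"),
    ((3.5, 1.0), "acidic"),
    ((7.0, 4.8), "saline"),
    ((7.2, 5.2), "saline"),
]
centroids = train_centroids(labeled_events)
label = classify((3.3, 1.4), centroids)
```

A new event is summarized into the same features and assigned to the closest known pollution type, which is cheap enough to run in real time.
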
Okay, great.
The next point I'd like to get to, of course, is localization.
So we know where we're looking at. But at those
locations, we typically have a distribution of our
sensors such that we have very
broad coverage closer to the wastewater treatment facility.
And as we go closer to the industrial facilities
themselves, the coverage is obviously lower.
And we may be focused on specific regions or on specific producers,
but typically, we won't have coverage
that would be enough to identify every source on
its own. So typically, or a lot of the times, the information of
an event taking place comes from somewhere downstream, and then
we need to start building our
ladder in order to climb further and further upstream in
order to do that. So once we identify an event and we're able to
classify it to belong to a specific type,
we, from our open source data, can relate which are
the most probable pollutants to generate this information.
And having that information, we can now use our
signals in order to climb upstream, match patterns
through various metrics, and get to the point where we're able
to point out the specific or most likely source
of our pollution to our user.
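
The upstream climb can be sketched as a greedy walk over the sewer network: at each junction, follow the branch whose signal looks most like the detected event. The network layout, the node names, and the similarity scores below are all hypothetical, and the greedy rule is an assumption rather than the method described in the talk.

```python
# Hypothetical network: each node lists the nodes that drain directly into it.
UPSTREAM = {
    "treatment_plant": ["junction_1", "junction_2"],
    "junction_1": ["factory_a", "factory_b"],
    "junction_2": ["homes_c"],
}

# Hypothetical similarity (0 to 1) between the event pattern and each node's signal.
SIMILARITY = {
    "junction_1": 0.92,
    "junction_2": 0.15,
    "factory_a": 0.31,
    "factory_b": 0.88,
    "homes_c": 0.10,
}


def trace_upstream(start, upstream, similarity, min_similarity=0.5):
    """Climb from the detection point toward the most likely source,
    following the best-matching branch at every junction."""
    path = [start]
    node = start
    while upstream.get(node):
        best = max(upstream[node], key=lambda n: similarity.get(n, 0.0))
        if similarity.get(best, 0.0) < min_similarity:
            break  # the trail goes cold; stop at the last confident node
        path.append(best)
        node = best
    return path


path = trace_upstream("treatment_plant", UPSTREAM, SIMILARITY)
```

The last node in `path` is the most likely source under these assumptions; when no branch matches well enough, the walk stops early and the last confident location is reported instead.
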
Okay, so we've gone through the
entire process of what
it is we need to see, where we need to see it when we find
something, what it is we find, and finally,
where the information is coming from. This is basically
the entire pipeline that we have, at least from a data science
and machine learning perspective.
I didn't go a lot into the kind of the code behind it because
some of the information is, or some of the algorithms we use
are typical scikit-learn, NetworkX, et cetera,
algorithms, and others are proprietary
that we kind of built either based on various
time series tools or something that we built completely from scratch.
But this is more or less the end to end of this
process and in the end of it, the main takeaway
here that I would like for you to go away with, or at least to
continue to the next session with, is that the
essence of what we do is to combine relatively
easily obtainable data, whether through proprietary
or available sensors, with intelligent
processing techniques that allow us to focus on where to look, what to
look for, and to identify what it is we see,
in order to be able to give clear and understandable
information to our users that are then able to
drive change with the industries around them.
And basically the bottom line is that based on this, we're able
to give everyone where we're deployed, of course,
cleaner, better water quality, which is one of
the reasons that going to work
at Kando is a lot of fun. Thank you very
much. It was a pleasure for me to speak with you and please
feel free to reach out and I'll be happy to try and answer whatever
questions you might have. Thank you.