Transcript
            
            
            
            
            
            
Hi, I'm Piyush Verma. I am CTO and co-founder at Last9. This talk is about an SRE diary. As SREs, we are trained to be on pager all the time. I mean, I've held a pager for almost a decade now, in fact from before it was even called SRE. But this talk is about the opposite end of it, where we talk about all the times when the pager did not ring. And that's where I say: make a sound when it breaks.

Why do I say that? If you look at these photos, most of these things, the upper half, would make a sound when they break: starting from the sonic boom of an aircraft, to a balloon, to a heart, and then the RAM and the BIOS. Not sure how many of us would associate a RAM with breaking and making a sound, because I haven't seen modern computers do that; I've not even seen a BIOS screen in ages. So that could also be a case where people don't associate with it. But in the era prior to cloud computing, when these things broke, they made a sound, so it was very easy to diagnose that something had gone wrong. That's probably not the case anymore.
            
            
            
Most of the failures, the flavors, come in the form of software, human, network, process, and culture. I'm going to talk about everything but software failures. A software failure is the easiest to identify. There will be a 500 error somewhere, which will be caught using some sort of regular expression, some sort of forwarding rule. It will reach a PagerDuty, a VictorOps or some other tool of this manner, an incident management system which has been set up. And that would ring your phone, ring your pager, ring your email or something like that, which works pretty well.
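As an illustration of that forwarding-rule idea, here is a minimal sketch in Python: it scans log lines for a 5xx status with a regular expression and posts an event to a hypothetical incident-management webhook (the URL and payload shape are placeholders, not any specific PagerDuty or VictorOps API).

```python
import re
import sys

import requests  # third-party: pip install requests

# Hypothetical webhook; real incident tools expose similar HTTP event APIs,
# but this URL and payload shape are placeholders.
ALERT_WEBHOOK = "https://alerts.example.internal/v1/events"

# "Some sort of regular expression": match a 5xx status in an access-log line,
# e.g. ... "GET /login HTTP/1.1" 500 1234
FIVE_XX = re.compile(r'HTTP/[\d.]+"\s+(5\d{2})\s')


def forward(line: str) -> None:
    """Post an event to the incident tool for any log line carrying a 5xx."""
    match = FIVE_XX.search(line)
    if match:
        requests.post(
            ALERT_WEBHOOK,
            json={"severity": "critical", "status": match.group(1), "line": line.strip()},
            timeout=5,
        )


if __name__ == "__main__":
    # Typical wiring: tail -F access.log | python forward_5xx.py
    for log_line in sys.stdin:
        forward(log_line)
```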
            
            
            
All the other failure flavors, starting from human, network, process, or culture, are the ones which make really interesting RCAs. A software failure is very easy to identify: you find something in a log line. A culture failure, on the other hand, is something that you identify pretty late. And as you go from top to bottom, you realize that the error that actually shows up, or the failure that eventually shows up, is very latent. A culture failure is the slowest to identify. A software failure is the easiest to identify. A human failure is slower than a software failure, but way faster than a culture failure.

So I'm going to split this talk into a few parts where I speak of a few outages. Some of you may have heard this talk, but this is a slightly different variation of my earlier talk. These outages traverse some very simple scenarios which otherwise, as SREs, we would bypass thinking, hey, I could use X tool for it, I could use Y tool for it, I could use Z tool for it. But those tools do not address the real underlying cause of a failure. I want to dissect these outages and go one level further to understand why they failed in the first place.

Because as SREs, we have three principles that we follow. The very first thing is we rush towards a fire. We say, we've got to bring this down; we're going to mitigate this and fix it as fast as possible. That's the first one, obviously. The second question, a very important one that we have to ask ourselves, is: where else is this breaking? Because chances are, our software, our situation, our framework is a byproduct of the practices and culture that we follow. So if something is failing in one place, there's a very high likelihood that it fails in another place as well. That's the second question we have to ask, because the intent is to prevent another fire from happening. And the third one, the really important one that we mostly fail to answer, is: how do I prevent this from happening one more time? Unlike a product, reliability cannot be improved one feature at a time. It cannot be done one failure at a time: a failure happens, I fix something, then another failure happens, I fix something. Sadly, it doesn't work that way, because customers do not give us that many chances.
            
            
            
The first outage that I want to speak about: customer service is reporting login to be down. We check Datadog, Papertrail, New Relic, CloudWatch. Everything looks perfectly okay. Most of these hint towards the fact that everything looks okay: servers look okay, load looks okay, errors look okay. But we can't figure out why login is down. Now here's an interesting fact. Login, if it's down, means that it is not accessible. So for requests that don't reach us, our inside-out monitoring systems, which sit in the form of a sophisticated Prometheus chain, a Datadog chain, a New Relic chain, are not going to buzz either, because something that doesn't hit you doesn't create an alarm on failure. You don't know what you don't know.
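One hedged way to close that gap is an outside-in synthetic probe that exercises the same path a user would. A minimal sketch, assuming a placeholder /login URL and that it runs from outside your own network:

```python
import sys

import requests  # pip install requests

# Placeholder endpoint. Run this from *outside* your own network (another
# cloud, a cron box, a laptop) so it exercises the same path a user does.
LOGIN_URL = "https://app.example.com/login"


def probe() -> bool:
    """True if /login answers with a non-5xx status within five seconds."""
    try:
        return requests.get(LOGIN_URL, timeout=5).status_code < 500
    except requests.RequestException:
        # Connection refused or timed out: exactly the "request never reached
        # us" case that inside-out Prometheus/Datadog/New Relic setups miss.
        return False


if __name__ == "__main__":
    if not probe():
        # Non-zero exit so whatever schedules this can turn it into a page.
        sys.exit("login probe failed: endpoint unreachable or erroring")
```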
            
            
            
So what gives? Meanwhile, on Twitter, there's a lot of sound being made. I mean, surprisingly, I've seen that my automated system alerts are usually slower than people manually identifying that something is broken. So on Twitter, there's a lot of sound happening. What was the real cause? After a bit of debugging, we identified that a DevOps person had manually altered a security group, accidentally deleting the 443 rule as well. Quite possible to happen, because ever since the COVID lockdown started, people started working from home, and when they work from home, you are always whitelisting some IP address or the other. And those cloud security group operation tabs usually let you make a change with a single click, so you can accidentally end up deleting a rule that you should not have, which results in a failure.
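For this specific failure mode, a small guard that re-checks the load balancer's security group for an inbound 443 rule would have made a sound. A minimal sketch using boto3, with a hypothetical security group ID:

```python
import boto3  # pip install boto3

# Hypothetical ID of the inbound security group attached to the load balancer.
LB_SECURITY_GROUP_ID = "sg-0123456789abcdef0"


def has_https_inbound(group_id: str) -> bool:
    """Return True if the group still allows inbound TCP 443."""
    ec2 = boto3.client("ec2")
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    for rule in group["IpPermissions"]:
        if rule.get("IpProtocol") not in ("tcp", "-1"):  # "-1" means all traffic
            continue
        if rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535):
            return True
    return False


if __name__ == "__main__":
    if not has_https_inbound(LB_SECURITY_GROUP_ID):
        raise SystemExit(f"{LB_SECURITY_GROUP_ID}: inbound 443 rule is missing")
```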
            
            
            
What was the real root cause here? How do we prevent this from happening again? Well, the obvious second question that we asked was, okay, where else is this happening? We may have deleted other rules as well. But what's the real root cause? If we have to dissect this, it's not setting up another tool. It's certainly not setting up an audit trail policy, because, well, that could be one of the changes as well, that you set up an audit trail policy, but then somebody may still miss having a look at it. I mean, the answer to a human problem cannot be another human reviewing it every time, because if one human has missed it, another human is going to miss something as well. Then what is the real root cause?

The real root cause here is that we do not have this culture, or rather we allow exceptions where things can be edited manually. Now, if these cloud states were being maintained religiously using a simple Terraform script or a Pulumi script, it is highly likely that the change would have been recorded somewhere and a rollback would also have been possible. But that wasn't the case, and we had these exceptions of being able to manually go in and change something, even if it was: hey, I just have one small change to make, why do I really need to go via the entire Terraform route? Because it takes longer. Because every time I have to run an apply operation, it takes around ten minutes just to sync all the data providers and fetch the real state, and then it tells me, here's a diff. So it gets in the way of my fast application of a change. The side effect of that is that every once in a while we're going to make a mistake like this, which ends up resulting in a bigger outage.

In this particular case, it wasn't just that /login was down. Imagine deleting the 443 rule from the inbound of a security group of a load balancer: it's not just that login is down, everything was down. It's just that login was the endpoint reported at that point in time, because one of the customers had complained about it. So the real impact was larger, to a degree where we can't even tell how big the outage was, unless we now start auditing the request logs from the client-side agents and so on, which themselves are at times missing. So we don't even know how much business we lost. We just don't know anything about it. So the only way to overcome the situation is to build a practice where we say that no matter what happens, we are not going to allow ourselves to make these manual short circuits.
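One way to make that practice enforceable rather than aspirational is to detect out-of-band edits continuously. A minimal sketch that runs terraform plan -detailed-exitcode on a schedule and treats exit code 2 (live state differs from code) as a pageable drift signal, assuming it runs in the right Terraform working directory:

```python
import subprocess
import sys

# -detailed-exitcode makes `terraform plan` exit with:
#   0 = no changes, 1 = error, 2 = live infrastructure differs from code.
PLAN = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def check_drift() -> int:
    """Run a plan and report drift; assumes the Terraform workspace is already set up."""
    result = subprocess.run(PLAN, capture_output=True, text=True)
    if result.returncode == 2:
        print("Drift detected: something changed outside Terraform")
        print(result.stdout)
    elif result.returncode == 1:
        print("terraform plan failed:", result.stderr, file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    # Wire this into cron or CI; a non-zero exit can page the same way a 500 does.
    sys.exit(check_drift())
```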
            
            
            
This still looks pretty simple and easy, so I want to cover another outage, outage number two. This is from when we were dealing with data centers; there weren't really these cloud providers here. Not that it changes the problem definition, but it's an important addition to the scope of the problem. Around 25 hours before a new country launch that we were supposed to do, PagerDuty goes off. We check our servers and we see that there are certain 500s which are reported on Elasticsearch. Now, because we are in a tight data center environment, log analysis isn't really a luxury we have, because you have to go over a firewall hop, et cetera. So one of us just decides to actually start copying the logs, which comes in really handy a few minutes later. A few minutes later, what happens is that the 500s just stop coming in. PagerDuty auto-resolves. Everything looks good. Though, to our curious minds, we are still wondering why something stopped working and how it auto-resolved. So we are still debugging it.
            
            
            
Five minutes later, PagerDuty goes off again. Again, the public API is unreachable. We do get our Pingdom alerts; all the alerts that we had set up come in. Looks like something fishy, because this is clearly right before a public launch. So we start checking Rundeck, because what is the most important question that we ask when something fails? The most important question that I have asked when something stops working is: hey, who changed what? Because a system in its stable, stationary state doesn't really break that often. It's only when it is subjected to a change that it starts breaking. The change could be that something was altered, or that a new traffic source was added. But it's one of these changes that actually breaks a system. We also ask ourselves this important question: hey, is my firewall down? Why is the firewall important here? Well, because it's a data center deployment, so one of the changes could have happened on a firewall as well. The inbound connection could have dropped, but that wasn't down either. Okay, standard checklist: check Grafana, nothing wrong. Check Stackdriver, because we were still able to send some data there, nothing wrong. Check all servers, nothing wrong. Check load, nothing wrong. Check Docker, has anything restarted? Nothing restarted. Check APM. Everything looks okay.
            
            
            
Now, since we had copied the logs, we start realizing that there were database errors only on some of the requests, not on everything, which looked really suspicious when we took a look at it. We had all the tools that we wanted. We had Elasticsearch available, we had Stackdriver available, we had Sentry available, we had Prometheus available, we had SREs available. One of the ways to solve a problem is to throw more bodies at it, and we had really all the SREs that we wanted; we were a team of ten people. But we couldn't find the problem. Twenty hours later, after a lot of toil, we found that the mount command hadn't run on one of the DB shards. Now, why did this happen?
            
            
            
These were data center machines that we had provisioned ourselves, and one of the machines was coming back from a fix the previous night. We had an issue where a mount command would mount a temporary file system, and it wasn't persisted across a reboot. We had applied a fix across ten machines, but the machine that was broken at that point in time did not have that Ansible fix on it; we did not run Ansible on it. Just before the country launch, we decided, okay, let's insert this machine so that we have it available in case load arrives as well. So when the machine went in, it did not have that one particular fix that we had applied. Data goes onto that shard as well. The machine wasn't fixed properly, so it rebooted again, and when it rebooted, that data was gone. It was just a slice of data. So every time requests would hit that slice of data, they would result in an error. The load balancer would see too many errors and cut the machine off, send a PagerDuty alert, and divert the traffic. Then the health check comes in, and the health check obviously is not checking that data, so the machine would come back into circulation, and this cycle goes on and on.
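To illustrate why the cycle kept going (this is an illustration, not the fix we actually landed on): a health endpoint that only answers "the process is up" will happily re-admit a shard that has lost its data. A minimal sketch of a deeper check, with hypothetical /data and canary paths, that also touches the data the shard is supposed to serve:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths: the mount that vanished on reboot, and a small canary
# file the application writes once the shard's data is actually loaded.
DATA_MOUNT = "/data"
CANARY_FILE = "/data/canary"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        # A shallow check only proves the process is up. This one also
        # verifies the mount is present and the canary is readable, so a
        # shard that lost its slice of data stays out of rotation instead
        # of cycling back in.
        healthy = os.path.ismount(DATA_MOUNT) and os.path.isfile(CANARY_FILE)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"data missing")


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```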
            
            
            
Pretty interesting story, but what's the real root cause here? The real root cause, if you try to dissect this further, in this particular case was not that we weren't running any automation; we had state-of-the-art Ansible configuration. If I ask myself, could I have avoided this outage? Probably not. What did that outage teach me so that it doesn't happen a second time around? This is what is important. Most of the time you won't be able to avoid or avert an outage the first time, because they're so unique in how they happen. It's going to happen. But how can we prevent this from happening again? And when I say this, I don't mean this particular error, I mean this class of errors. And the only way to solve that, we realized, is that a simple script could have done the job. If only we had a startup script, an osquery check, which would just verify the configuration drift of the machine, we could probably have saved this.
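A minimal sketch of that kind of startup check, assuming osqueryi is installed and a hypothetical list of mounts the Ansible role is supposed to guarantee; the same thing could be a few lines of shell or a plain os.path.ismount() call:

```python
import json
import subprocess
import sys

# Mount points this machine must have, per the Ansible role that should have
# configured it (hypothetical list for the DB shard in this story).
EXPECTED_MOUNTS = {"/data"}


def mounted_paths() -> set:
    """Ask osquery for the currently mounted paths (assumes osqueryi is installed)."""
    out = subprocess.run(
        ["osqueryi", "--json", "SELECT path FROM mounts;"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {row["path"] for row in json.loads(out)}


if __name__ == "__main__":
    # Run from a boot-time unit, *before* the node registers with the load balancer.
    missing = EXPECTED_MOUNTS - mounted_paths()
    if missing:
        sys.exit(f"configuration drift: expected mounts not present: {sorted(missing)}")
```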
            
            
            
Now, one of the big questions that we ask while something is failing is: how was it working so far? This is one of the most interesting questions. Then the next question is: what else is breaking? On the lines of these two outages, I want to cover another one, outage number three. This outage is about a distributed lock that was going wrong in production. We used to have a sort of hosted, automated Terraform which would allow multiple jobs to run at the same time, and there was a lock in place. But despite the lock, multiple jobs would come in, and the lock contention was failing almost to a degree that it looked like there was a lock, but it was pretty useless. So, to define the ideal behavior of the lock: we were using etcd for cluster-wide lock maintenance, and the standard compare-and-swap algorithm was being used. How does the compare-and-swap algorithm work? If I just quickly walk through it: I set in a value, which is one. Using the previous value of one, I set in a value of two. Everything works well. If I now try to put in a value of three as a fresh key, it will say the key already exists, because it already exists. Or if I try to set a three against a stale previous value, it would say the compare operation failed. So it's a pretty standard compare-and-swap API, which works.
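For reference, a minimal sketch of that compare-and-swap lock against etcd's v2 keys HTTP API, with a placeholder endpoint and key name; the prevValue condition is what makes only one runner win, and the ttl is the expiry discussed just below:

```python
import requests  # pip install requests

# etcd v2 keys API; the endpoint and key names are placeholders.
ETCD = "http://127.0.0.1:2379/v2/keys"


def acquire_lock(name: str, ttl_seconds: int = 30) -> bool:
    """Flip the lock from 'stopped' to 'started' only if nobody else has.

    prevValue turns the write into a compare-and-swap: etcd rejects it with a
    "Compare failed" error if another runner already won. The TTL frees the
    lock automatically if the winning process dies before releasing it.
    """
    resp = requests.put(
        f"{ETCD}/locks/{name}",
        params={"prevValue": "stopped"},
        data={"value": "started", "ttl": ttl_seconds},
        timeout=5,
    )
    return resp.ok


def release_lock(name: str) -> None:
    # Setting it back to 'stopped' (no TTL) hands the lock to the next runner.
    requests.put(f"{ETCD}/locks/{name}", data={"value": "stopped"}, timeout=5)


if __name__ == "__main__":
    if acquire_lock("terraform-apply"):
        try:
            print("lock acquired, running the job")
        finally:
            release_lock("terraform-apply")
    else:
        print("another runner holds the lock")
```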
            
            
            
But what was really happening here? We started with a default key which is "stopped", and both processes are trying to set a key of "started" with the previous value of "stopped". The expected behavior is that only one should win: the previous value is "stopped", I set a new value of "started", and only one process should go through. But what's really happening is that runner A acquires the lock, with a TTL. The TTL is very important, because every time you have a lock-based system, in the absence of a time-to-live expiry, what ends up happening is that the process which acquired the lock may die. So it's extremely important that the process which has set the lock also sets an expiry with it, so that in case the process dies, the lock is freed up and available for others to use. Which works well. The only difference being that when A tries to update the status, we get "key not found", and when B tries to update the status again, we also get "key not found", which is very weird, because the process which had set the lock should be able to unset it and free it, or at the very least both of them should be able to go ahead and hit the contention. But this is very unique here, because both of the processes say: I've not found the key.
            
            
            
A says "key not found", B says "key not found". So this is where we all become full Stack Overflow developers. We start hunting our errors, we start hunting the diagnosis, and we land on a GitHub page, a very interesting one, where it mentions that etcd has a problem, that TTLs are expiring too soon, which is related to an etcd election. It looks like a fairly technical and sophisticated explanation, and we say, all right, that makes sense, this must be it. So a GitHub ticket on an open source system more or less tells us that, hey, etcd is the problem. Our solution is, well, we have to replace etcd with Consul, for reasons of a better API, better state maintenance, et cetera. And we almost convince ourselves that this is the right way to go. We make the change; it's a week-long sprint. We spend a lot of time and effort at replacing it, and things go back to normal. Is that the real reason, though?
            
            
            
One of us decides that, look, if this was the reason, how was it working earlier? That's the most important question, right, that we ask: how was it working earlier? We're not convinced that this could really be the way to go about it. So we dig deeper, and one of us really dug deep to find out that the only problem that existed on those servers was that the clock was adrift. One of the machines was running behind the other machines of the cluster in time. Now, why would this fail? Because the leader at that point in time was setting a TTL, and the TTL was very short, shorter than the drift. By the time the rest of the cluster would get that value, the time had already expired. So the subsequent operations that went into the cluster would say that the key is not found. Something as elementary as a basic checklist. This is something that we all know: clocks are important, ntpd is extremely important. But this is where we forget to have basic checklists in place. And we almost ended up convincing ourselves that, hey, etcd was the problem, and we should replace it with an entirely new cluster solution based on a new consensus algorithm. It worked well, but it just wasted so many hours of the entire team. It just baffles me today that a very elementary thing, the first thing that we learn when we start debugging servers, that a time drift could actually cause a lot of issues, and this is the first time that we really experienced it. This is one of those bugs that you hit every five years and then you forget about it. But what was the real reason here?
            
            
            
The real reason was not automation. The real reason was not any SRE tooling. It obviously wasn't running on Kubernetes, which many a time I find offered as the solution. It was a simple checklist of our basic first principles that could have done the job, a very simple one which asks: is an NTP check installed or not? And that could have saved this.
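That check doesn't need anything fancier than asking an NTP server how far off the local clock is. A minimal sketch using the ntplib package, with a drift threshold that is an assumption and should sit well below any lock TTL you rely on:

```python
import sys

import ntplib  # pip install ntplib

# Threshold is an assumption: keep it well below any lock TTL you depend on.
MAX_OFFSET_SECONDS = 0.5


def clock_offset(server: str = "pool.ntp.org") -> float:
    """Offset between this machine's clock and an NTP server, in seconds."""
    return ntplib.NTPClient().request(server, version=3, timeout=5).offset


if __name__ == "__main__":
    offset = clock_offset()
    if abs(offset) > MAX_OFFSET_SECONDS:
        sys.exit(f"clock drift {offset:+.3f}s exceeds {MAX_OFFSET_SECONDS}s; check ntpd/chrony")
    print(f"clock ok, offset {offset:+.3f}s")
```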
            
            
            
And that's the real root cause behind it. And this is what I want to highlight here: most of the time, our SRE journey assumes that another monitoring system, another charting system, another alerting system could have saved us from either of these problems that we just described. And these are the failures with real business cost here, because each one of them led either to a delay in our project or to a downtime which really affected the business. And this isn't about SLOs. This isn't about SLIs, because that's typically the material we find when we talk about SRE on the Internet. This is about getting the first principles right.

The first principle is: hey, we're not going to make any changes manually on a server, no matter what. And if the reason is that we are being too slow in applying changes, if Terraform is too slow, we need to learn how to make it faster. But we are not going to violate our own policy on making manual changes. It's not about anomaly detection or any fancy algorithms there. The second outage is probably a derivative of the fact that we did not have a simple configuration validator. A simple, basic tool like osquery, which runs at the start of a system and checks if my configuration has drifted, could have been the answer. The third one is the simplest one, but the toughest of all. We spent our time checking timestamps in Raft and Paxos, but we didn't even have to go that far. The answer was that our own server's timestamp was off. It could have been caught by a simple check which says, hey, is NTP installed and active on this system, which could have been done using a very simple, elementary bash script as well.

The fact that we don't push ourselves to ask these questions, and instead look for answers in these tools, is where most of the time we forget that what we see as errors are actually a byproduct of things that we have failed to do, or rather, since failed is too strong a word, things we have ignored to take care of at the start of our scaffolding, of how we build this entire thing.
            
            
            
Which takes me to outage number four, and this isn't a real outage. This is the one that you are going to face tomorrow. What will be the real root cause behind it? The real root cause behind that one is that we didn't learn anything from the previous one. There is a very profound line from Henry Ford: the only real mistake is the one from which we learn nothing. And that would be the reason for our next outage.

Why is this important? Using all these principles is what we practice at Last9. What we have built is the Last9 knowledge graph. Our theory, our thesis, says that, hey, all these systems are actually connected, just like the World Wide Web is. If you try diagnosing a problem after a situation has happened, and there is no way of understanding these relations, it's almost impossible to tell the impact of it. Example: my S3 bucket permissions are public right now, and this is something that I want to fix. While I know the fix is very simple, I do not know what the cascading impact of that is going to be on the rest of my system. This is extremely hard to predict right now, because we don't look at it as a graph, we do not look at it as a connected set of things. And this is exactly what we're trying to solve at Last9. That's what we have built the knowledge graph for: it basically goes through the system and builds these components and their relationships, so that any cascading impact is understood very clearly.
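As a toy illustration of the idea (not our actual implementation), a handful of made-up components in a directed graph is enough to answer the "what does this cascade into" question:

```python
import networkx as nx  # pip install networkx

# Toy graph: components as nodes, "Y consumes X" as an edge X -> Y.
# All names are made up; a real graph is discovered, not typed in by hand.
graph = nx.DiGraph()
graph.add_edge("s3://static-assets", "cdn")            # CDN pulls from the bucket
graph.add_edge("cdn", "web-frontend")                  # frontend serves via the CDN
graph.add_edge("s3://static-assets", "signup-emails")  # emails embed bucket URLs
graph.add_edge("web-frontend", "login")                # login lives on the frontend


def blast_radius(component: str) -> list:
    """Everything reachable from a component, i.e. what a change can cascade into."""
    return sorted(nx.descendants(graph, component))


if __name__ == "__main__":
    # "My S3 bucket permissions are public and I want to fix it:
    #  what could that change ripple out to?"
    print(blast_radius("s3://static-assets"))
    # ['cdn', 'login', 'signup-emails', 'web-frontend']
```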
            
            
            
To give an example of how we put it into use: we basically load the graph and try to find out, hey, what does the spread look like for my particular instances right now? And if I run that query, it quickly tells me that the split is four-four-three. Clearly there is an uneven split across one of the availability zones, which may result in a failure if one of the zones were to go down. Well, in this particular case, because I wanted an odd number of machines, this is a perfect scenario. But was this the case a few days back? Clearly not. What did it look like? It was two and three, which is an even number and probably not the right way to split my application.
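For comparison, the same spread question can be asked directly of the cloud API. A minimal sketch with boto3, assuming a hypothetical tag that identifies the instances of the service in question:

```python
from collections import Counter

import boto3  # pip install boto3


def az_spread(tag_key: str = "role", tag_value: str = "db-shard") -> Counter:
    """Count running instances per availability zone for one tagged service.

    The tag filter is a placeholder for however you label the instances whose
    spread you care about.
    """
    pages = boto3.client("ec2").get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    counts = Counter()
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts


if __name__ == "__main__":
    spread = az_spread()
    print(dict(spread))  # e.g. {'ap-south-1a': 4, 'ap-south-1b': 4, 'ap-south-1c': 3}
    if spread and max(spread.values()) - min(spread.values()) > 1:
        print("warning: uneven spread across availability zones")
```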
            
            
            
Questions like these, and being able to answer them quickly, are what allow us to make these systems reliable. That's what we do at Last9. We haven't open-sourced this thing yet, but we are more than happy to: if you want to give it a shot, if you want to try this out, just drop us a line and we are extremely, extremely happy to set that up. And obviously there's a desire to actually put this out in the open source domain once it matures enough. That's all for me.