Transcript
Hello everyone, this is Aman Sharma, and today we are going to talk about a very interesting topic that has been very close to my heart, which is: data, data everywhere, no time to think. Well, this is like the proverb and saying that we usually have, "water, water everywhere and not a drop to drink." I think the perception of finding insights out of data is kind of the same. So this is going to be an interesting journey that we are going to take together today, and we are going to discuss a lot of things, but in a very summarized and very fun manner. So let's directly dive into it.
Firstly, let me introduce myself. My name is Aman Sharma. I am co-founder and CTO, chief technology officer, at Twimbit. Twimbit is a platform where the world can create and discover research insights, and I lead the technology team overall to create the platform and the SaaS products. I am also a member at DeepLearning.AI, which helps bring technical knowledge and understanding of deep learning to students. I am also a mentor and entrepreneur, and in general someone who can span across different knowledge and technology themes. I advise startups and different organizations on their technology approach and how to adopt new technology methods into their tech stacks. So well, enough about me. You can find more about me on my handle, that is Amantech, and my website, amanintech, and you can find all the details about me and my work over there.
So let's start with our first question: what, basically, is the key difference between data and insights? Well, the first time I was seeing this popularity of data science, and what on earth it actually is, it was very confusing for me. Like, there is data and you can directly see it. But actually being inside data is like sailing a boat in a deep ocean, right? You are always surrounded by these different streams of options, different streams of data, and you are not able to identify what you are actually looking for unless there is a lighthouse which gives you a direction and shows you all the pearls and all the different treasures that are hidden deep beneath the deep ocean of data. Well, I think the big data, or any data, that a company possesses has a similar challenge. And the key objective over here is to identify the key insights which can help people make decisions, and make better decisions, and also help them navigate through the tough course of finding key insights.
Well, that is going to be our agenda for the talk, and it has been divided in such a manner that we'll cover the brief problem that we have on hand, what the different scopes of these problems usually cover, what possible solutions already exist, and what else we could do about it. Then we'll also cover a new approach, that is data science sprints, which I myself have seen over the course of time, like how different organizations are adopting them. Also, we'll see the difference between the approaches of code and no-code tools; today there are many no-code tools available which help in data science. And then we'll see how to do visualizations to better explain what you mean by the data. Then we'll see an approach that is called dashboarding, that is, bringing everybody on board onto the same idea so that everybody is aware of what data we are talking about. And then finally, documenting the steps that go along, and how better documenting helps in overcoming the challenge of non-transparency in data. So let's begin with our problems.
The first problem is the problem of goal clarity. Well, simply explained, the goal clarity problem is when teams that are working on a similar objective don't have the main idea of what they are trying to achieve; as I've written in the definition as well, it is important to keep everybody aligned to ultimately achieve and improve the service. Now, the key symptoms that you might see when you are facing this kind of challenge are that your teams are often losing track. They are always asking this common question: why are we here again? Like, what was the main theme of what we were discussing? Everyone has a different perception; somebody sees it as one challenge, somebody sees it as another. And of course, when people see these goals differently, the outcomes that arrive out of them will also be different, ultimately leading to a poor ROI, which is the main reason why teams then kind of get demotivated and don't go down the data science path as often. Now the solution, as a very generalized approach, is to first of all identify the main goal and then communicate that goal clearly across the whole team, which can help everybody get onto the same page.
Now the second problem that I see is poor planning. Once your goal is clear, teams always struggle with how to plan a project that can help them achieve that particular functionality. And this is the main reason for any chaotic situations, and it often also leads to abandonment. I have seen projects, even in our organization and in other organizations as well, where often the deadline is too long and there is no decision arrived at at the right time, and teams then tend to just abandon the project and move ahead. And this also leads to a lot of wastage of resources and time, right? So the symptoms that you would commonly see in organizations for this kind of problem are that they are missing deadlines all the time, the results that are yielded are not proper, and they are always questioning the resources: the resources are not right, the technical skills are not right, and that question keeps repeating. Well, the solution for this is again a three-step approach: a better project structure, understanding how to divide the project into proper timelines, and also making sure that the deliverables are very well defined, that they are very lean and the scope is not too broad. And then, once you come up with that, always stick to the timelines and limit the scope to them. We'll talk about this approach in sprints: how you can make better timelines and how you can make team structures better.
Then you have the dirty data problem. Well, this is something that I have seen with almost all projects: the vastness of the data expands, and there is bad data mixed in with the good data at the same time, right? So you spend too much time in data processing. That is one of the key things that you would see in teams, that they are always struggling with it, always trying to clean the data; sometimes teams even have to do it manually, right? And also, the ratio between the whole data set that you have and the amount of insights you gain will always be low, because your data is already polluted with so much dirty data that you are not able to get the actual insights from the main data source. So it's very important to always clean these data sources. And for this, the solutions are advanced tools, which are data preprocessing tools; we'll talk about them as well. And also the source ingestion, like how you are capturing the data: if it is analytics data from a website, you have to rethink how you are calibrating it, how you are capturing the insights from that website or app; and if it's a surveying tool, how you are capturing that data. So all those source ingestion tools need to be improved. And doing these three things in parallel can help in overcoming the dirty data problem.
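To make the data preprocessing part a bit more concrete, here is a minimal sketch of what a cleaning step could look like in Python with pandas; the file name and column names are purely hypothetical placeholders for whatever your ingestion source actually produces.

```python
import pandas as pd

# Hypothetical raw export from an analytics or survey ingestion source.
raw = pd.read_csv("raw_web_analytics.csv")

# Basic cleaning: drop exact duplicates and rows missing the fields we need.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["user_id", "gender", "session_length"])
)

# Normalize obviously dirty values, e.g. inconsistent gender labels.
clean["gender"] = clean["gender"].str.strip().str.lower().map(
    {"m": "male", "male": "male", "f": "female", "female": "female"}
)
clean = clean.dropna(subset=["gender"])  # anything unmapped is treated as dirty

# Enforce types so later querying and visualization steps do not break.
clean["session_length"] = pd.to_numeric(clean["session_length"], errors="coerce")
clean = clean.dropna(subset=["session_length"])

clean.to_csv("clean_web_analytics.csv", index=False)
print(f"kept {len(clean)} of {len(raw)} rows after cleaning")
```

The point is not the specific rules but that the cleaning logic lives in one repeatable script rather than being done by hand each time.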
Then we have, on the other side, challenges that are not project related but more technical. Now, how I define the technical challenges is like this: data is important for any company, whether it's a startup or a big organization. But the challenge for small or mid-size organizations or teams is that they are limited in the tools that they can use, limited in the resources that they have, and limited in the talent that they have. And for big companies, the challenge goes beyond that and they have privacy issues: they cannot take all the decisions themselves, since they are dependent upon GDPR and other data protection policies, and they are not able to rectify their path quickly. And the symptoms that you would see are that people are complaining about having too few people, there is a slower turnaround time, where the amount of time you are putting in versus the amount of results that are coming out is not very good. You are always complaining about system inefficiencies, like how different systems are not working properly. And of course, people are complaining that there is less transparency between the technical team who is working on these challenges and the non-technical team who is actually there to reap the benefits of the data, and ultimately there is a silo that forms between these two teams. Now, the solution to this, which has been a tried-and-tested method for me over the course of time, is this new wave of no-code tools that can be adopted by any organization, whether they have good technical bandwidth or not. So, no-code tools, and also documenting your steps along the way. It's helpful for scaling teams, but it's also helpful for people to understand how the conclusion arrived at from a certain data set was made. So the documentation step is really underrated in the industry, but it needs to be really highlighted over here: every step of the process needs to be documented and readable by everybody. Ultimately, this kind of brings transparency into the teams, right? And teams are more flexible about discussing different priorities and options, ultimately leading to fewer technical challenges in the team.
Right. Data science is often limited to only the technical
people. That was a notion before, right? And insufficiency in the representation
of data also leads to poor decision making. So, for example,
the person you put in charge of finding insights
of the data, he was not very good at visualizations from his side.
He has presented the insights in the right manner. But the person who
is there to make decisions out of this data is not able to understand that
data very properly. Right? And this often leads to non
judgment. Like there are judgment issues in this clearly,
right? And you are not able to understand what actually this data is
trying to tell me. Right? So you often complain about that the data is unreadable.
Again, you will see poor decisions making out of him. And also
then the stakeholder is always thinking about, like, data science is too complicated,
let's just skip it at all. Right? Solution to this is, again,
no code is a better method to bring nontechnical people on
board to any data science project. And the turnaround
time, from technical to nontechnical people can be reduced by just using no code
tools, a better visualization techniques that we are going to emphasize and talk
about in this presentation as well. And finally, a proper feedback mechanism
that every time the project ends, how do people discuss? They come together
and they discuss about what were the good things that we did in this project,
whats are the bad things we did in this project. Right.
Last problem, but not the least, which is kind of the culmination of all these different problems that we saw, and that is the problem of silos. Inside the organization there are walls, invisible walls, that are built between the data science team and the non-data-science team. And often these walls create these problems of poor inter-department communication, right? And ultimately, when there is less communication, people are not talking about the data that often, or they are not transparent about what the approach is. Of course, it leads to lower growth of the organization, because the person who is there to make decisions doesn't know data science, but the person who is there to do data science doesn't have the capability to take decisions. So the wall is developed, and now nobody is able to reap the benefits and the overall performance is going down, right? Again, the solution to this is a combination of all the strategies that we discussed. The first one is dashboarding: having dashboards for the internal team so the data is available 24/7 for anybody at any time. Then automation systems: how we can reduce the dependency on the technical team to always be there to present data. Feedbacks, as we discussed already: a feedback mechanism that can properly help people navigate through these steps. And then finally, a documentation method, so everybody knows how the process is going. Well, that is, overall, the different themes of problems.
Let's dive into the solutions now. The first solution is not directly mentioned anywhere; I kind of tried to derive it from this book by Jake Knapp called Sprint. Now, sprint is a method that people often use in technical teams who are into product development as well, but it has not been used that much in small teams or organizations which have data science as their bread and butter. They often tend to go with more of an agile methodology for how they want to work on it. Well, this sprint approach kind of inspired me: it was a method of doing projects and testing ideas in just five days in different organizations, including Google and different ventures that Google invests in. So I picked some of the techniques from there, combined them with some data science approaches, and kind of tried to come up with this sprint approach that works a lot better than before.
So there are four steps to this sprint approach: clear the goal, plan well, execute, and test and improvise. All the stages are divided into these four steps. Everything starts with the introduction of the project. Everybody comes together and discusses: what is the main idea over here? What is the problem that we are trying to solve? So they set a long-term goal. A long-term goal could be a long-term question, like trying to improve the user consumption method on the platform, or trying to minimize the cost. So that's like a long-term, aspirational goal. Then you set some kind of sprint questions: what are the questions you are trying to answer over here, right? These could be very direct, like you are trying to understand the male versus female ratio in the data. So you are kind of, like, trying to be exact over here. Then what you do is create a question bank of the questions you have. Of course, these questions should be limited; don't try to exceed 20 or 30 or more than that, because ultimately that would lead to longer timelines. The whole target of doing a sprint is to achieve a limited scope in a limited timeline, and having the fixed timeline and fixed scope to do it. Finally, make a map of how you are going to arrive at a solution. We will see how to create a diagram or a map for this as well. The kind of map you can imagine is: you have data, how the data is flowing, how you will flow it through different systems. So this is kind of a projection that you are doing in the initial steps of your project discussions, just rough estimates, so that everybody starts imagining what resources are required to do that. This, of course, will help you in clarifying the goal.
The second step is now starting with planning, right? So the first thing within this is to talk with experts. If your team already has experts, data science engineers, experts, go talk to them. But don't just take what they are saying as the final solution, because they might be limited in their understanding of the project as well. So listen to them, keep the thoughts, but ultimately you are the decision maker in it. And you can go to other outside help as well: you can talk with other people about how somebody else would have solved that problem, go to different forums, so that could help. Then what you have to do is pick a small target. So out of the questions that we discussed, what you are doing now, for starters, is picking a small target, and then you have to see how you can arrive at multiple conclusions from the same small set of questions. Now, for everybody who is in the team: I am imagining that the team is usually of the size of four to five people. Two of them are pure technical, hands-on people who are writing the code, two of them are into data and visualization, and one of them could be a manager. So ideally it works well with a five-person team. Now what you have to do is ask everybody in the team how they think they might go about the solution, what the different approaches are that they think they could adopt. Don't discuss it out loud; let everybody write on a sticky note and stick it to a board, and then let people vote for these approaches. That would help us identify which approach we can go for. Once an approach is identified, the second thing you have to do is create a flow diagram. Now, this flow diagram is a little bit different from the map that we discussed in the previous step. The flow diagram is more like: now you have started to discuss that this is the data that has to come through; if the data needs preprocessing, you have to add a preprocessing step; if the data needs some more big data solutions or processing on it, that would also be captured. We'll discuss more about how to do diagramming in the upcoming slides as well.
Now you are clear about what you are actually trying to solve, and you have created a flow diagram as well. So this was the first step, the second was plan well, and the third is the execute step. Now you have a plan in action; now you want to execute it, bring everybody on board, and arrive at a conclusion as soon as possible. So you have to set the deliverables out of it, which are similar to what you set as sprint questions: you pick a target and then you try to set deliverables out of it. Fix these deliverables; don't let anybody add more deliverables to them. Do another sprint, or maybe a future project, to cover that. But for now, fix these deliverables and then set a proper timeline. A timeline could be one to two weeks, which ideally works for a sprint; it could be three weeks as well if you think the scope is a little big. Divide these tasks across the team; this is normal project management 101. Then meet regularly: decide in advance what the meeting points would be and what the meeting agenda would be, depending upon how your project progresses, and always do a health check of how different members of the team and different aspects of the project are doing.
Then you have the last step, that is test and improvise. Once you have the data in place, you now want to test whether your hypothesis is working or not. Instead of just going into plain dashboarding and trying to display data, first of all have a small MVP or a test report to test whether your hypotheses were right or not. Go back to the main stakeholder and ask whether it's right or not. There would be some minor changes that you might need; do these changes, bring back the data, and then present this data on a live dashboard. So this is presenting the findings. Collect feedback in that group, improvise over this feedback if it can be done in the same sprint, and then finally document these learnings. Now, you can see there is a big blue arrow that goes back to the introduction of the project. So every time you find these learnings, discuss them again when the next sprint is going to start: these were the feedbacks, these were the learnings that we got from the last project, and this is how it's going to help. Again, I would highly recommend going through the book by Jake Knapp, that is Sprint; it would really help you understand how you can arrive at quick decisions, how you can make small projects, and how to create this sprint approach and add it to your organization.
Now, we were discussing diagramming a lot, right? To me, I think diagrams are really underrated when it comes to different teams. I have not seen anybody who is very enthusiastic about, okay, let's create a diagram and let's solve the problem by creating a diagram. But what diagrams basically do is get everybody on point, get everybody clear about the thoughts, and bring everybody to the same page. They also help set realistic expectations and timelines in terms of what people think about how long things take, right? If you're looking at just a bunch of code, then it doesn't make sense and it doesn't help people estimate the resources properly. But if you diagram something, it's visually appealing and it helps people make decisions faster. And of course, once you have realistic, achievable goals that you can set from the diagrams or timelines that you have, it also helps you in estimating the resources. So I'm going to show you a small, quick demo of how you can create diagrams. Again, it's complete diagram 101, but I would highly recommend going over small videos on UML diagrams or flowchart diagrams for how to do it.
The way I always go about diagrams is that I place the main components, the key findings or the key components of the whole diagram, first. So for example, let's say I am trying to find the male versus female ratio out of some web analytics data. Of course, what I would do is put the web analytics in first, as the data ingestion source. Then I think the next key step, or the next data source, is of course a database or a data lake system that is keeping this data. Now, as I go ahead, I am not putting in any arrows or connectors right now; firstly, the important thing is just to get all the elements over here. Then I would actually need a script that would do the data preprocessing, so it would be a Python script or something like that. And then, once the data preprocessing is complete, I would probably run some kind of SQL queries. And let's say this was more of a big data situation: I can actually go with BigQuery from Google, and that would help me solve this. So BigQuery is where I have to write the script, and I can write the query over here. Once this query is done, of course, I would have a bunch of data. Then I can use something like Data Studio, or Google Data Studio, to present this visualization; I think it is better to have visualization tools like that. Now, this is a typical setup. What I have to do next is connect it: the data is constantly updated into the database, so that is like a repeating step. Then, once in a while, I will go and pick up this data and pass it through this data pipeline process. So what I'm going to do is, every 24 hours let's say, I would pass this data to my Python script, which will do the data preprocessing and clean the data. Then I will do some kind of querying on it, which will arrive at the visualization, and this would be the ETL that I will set up for the whole time. And this would be the non-technical, objective outcome that any user or any decision maker can see to identify this thing. So this was a very simple example.
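Just to make that flow tangible, here is a minimal sketch of what the scripted part of such a pipeline could look like in Python. The table name, dataset, and query are hypothetical, and it assumes the google-cloud-bigquery client library is installed and credentials are configured.

```python
import pandas as pd
from google.cloud import bigquery

def preprocess(raw_path: str) -> pd.DataFrame:
    """Python preprocessing step: clean the raw analytics export."""
    df = pd.read_csv(raw_path)
    df = df.drop_duplicates().dropna(subset=["user_id", "gender"])
    df["gender"] = df["gender"].str.lower().str.strip()
    return df

def load_and_query(df: pd.DataFrame) -> pd.DataFrame:
    """Load the cleaned data into BigQuery and run the analysis query."""
    client = bigquery.Client()  # uses your default GCP credentials
    table_id = "my_project.analytics.clean_events"  # hypothetical table
    client.load_table_from_dataframe(df, table_id).result()

    sql = f"""
        SELECT gender, COUNT(DISTINCT user_id) AS users
        FROM `{table_id}`
        GROUP BY gender
    """
    return client.query(sql).to_dataframe()

if __name__ == "__main__":
    # Run this on a schedule (e.g. every 24 hours) and point the
    # dashboarding tool (e.g. Data Studio) at the result table.
    cleaned = preprocess("raw_web_analytics.csv")
    print(load_and_query(cleaned))
```

The diagram is essentially this script drawn as boxes and arrows, which is exactly why it helps everyone estimate the work before any code exists.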
Sometimes things will get complex and you would need conditional steps, right? Sometimes the query won't work; then you have to go back to the main data and do the preprocessing again. So the brainstorming that the teams are doing and the planning that the teams are doing should be done on these diagramming steps, which can help people understand what the main goals are and what the main objectives are that they are trying to arrive at. Next, an important section of this presentation is about picking approaches: code and no-code tools... sorry, no-code and low-code tools are very popular.
So I wanted to do a small comparison between these two kinds of tools. Code, of course, we are all familiar with: it has flexibility, scalability, and high functional availability. You can pretty much do everything that you can imagine if you know the right ways to code it. But of course, the cons on this end are that you need to know technically how things are done; talent acquisition is a problem these days; and, again, data gets siloed, because the tech people are not the decision makers and the decision makers don't know the tech. And of course, model complexity is also a problem: if you are using any third-party models to process your data with a machine learning approach, then you don't know how things work inside and you don't have a clear idea of how you can get things done. So there will always be a time when you get stuck and don't know what to do after that. Now, some of these challenges are overcome by no-code tools. First of all, the fun aspect of this is that it's very fast: you can arrive at a conclusion very quickly, because you are not setting up the bare minimum or the base things over here. It has a low learning curve; it's almost very easy to learn these drag-and-drop functions. It's fun and engaging; usually these tools are very fun to use. I have seen different tools like Google Data Studio, or Intersect Labs, or Parabola, so it's very fun to use, and it's visually appealing: you can easily understand what is going on. It increases productivity because you don't have to do different things; it's just simple ingestion of data and presenting it. And also, it's kind of open within the team: anybody can come and see how the different functions are working. But it also has its own cons. It's not too flexible; you cannot do everything with it. You can only do what the tools provide to you on that front, right? And also, you're limited in the sources that you can choose, right? Which also means there are fewer options of these tools available. And also, you are always dependent upon these approaches and these tools going forward: if, let's say, the company shuts down tomorrow, your project is also shut down forever. And you also have scalability issues: if the data grows big, then there might be some issues with whether you can use the tools or not. So let's go one step deeper and look a little bit at the code tools.
What I usually see is that the code tools are divided into three main segments. Data science programming languages, which include Python, R, or Scala; these are the main bare-minimum ways of doing data science. Then you have querying and analysis tools, which include SQL, MATLAB, and BigQuery. And then you have application suites, which are like a bunch of things packed together: Apache Spark, BigML, Hadoop. Again, this is just a general example; there are other tools as well. So anybody who is a beginner can choose one track: they can choose the Python track, add SQL to the stack, and then use Spark to kind of have an application suite. If you want to go in a more generic manner, you can just go with a querying and analysis tool that I've discussed, like BigQuery.
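As a rough illustration of that Python-plus-SQL-plus-Spark track, here is a minimal sketch; the file name and columns are hypothetical, and it assumes PySpark is installed locally.

```python
from pyspark.sql import SparkSession

# Spark as the "application suite" layer of the stack.
spark = SparkSession.builder.appName("gender-ratio").getOrCreate()

# Python track: read the (hypothetical) cleaned analytics export.
events = spark.read.csv("clean_web_analytics.csv", header=True, inferSchema=True)

# SQL track: register the data as a view and query it with plain SQL.
events.createOrReplaceTempView("events")
ratio = spark.sql("""
    SELECT gender, COUNT(DISTINCT user_id) AS users
    FROM events
    GROUP BY gender
""")

ratio.show()
spark.stop()
```

The same SQL would also run largely unchanged on BigQuery, which is what makes the querying layer a good entry point for beginners.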
Then the next set of tools that we have is no-code tools. Again, this is just a generalized way of presenting things; the tools extend beyond this. So the first set of tools you have is easy-to-create dashboard and reporting tools. Popular ones include Google Data Studio, which is completely free to use, Tableau, which has a limited free tier, and then Power BI, which is also kind of free. These tools don't require any technical knowledge whatsoever: you just come, drag and drop data, and then you are able to get things out of them as well. Then you have tools for building and automating data science flows, right? These are more for when you want to do repeat tasks and you don't have time to go and set up the pipeline again, or you don't have the manual time for ingesting the data. Then you can use tools such as explainti, Intersect Labs, and DataRobot. And then there are, again, complete end-to-end data science flow tools, like we were discussing on the code tool side as well. You have Obviously AI and Gyana, which kind of help by providing all the tools in the same place, and that helps you do everything with the data. Again, these are all no-code tools; you can explore them one by one, and that would help you identify what kind of tools would work best for you.
Now you want to make the decision of which tool you should go for, code or no code. This matrix kind of helps you understand what kind of solution would be better for you. If your functional requirements are low and you need the results fast, then the perfect way to go is with the no-code solution, right? Again, if your functional requirements are low and the expectations are still slow, you can still go with a no-code solution, which is ideal for individual projects. So that is the main thing: you can see that the functional requirement is the main thing that decides whether you want to go with the no-code tools or not. Then, if your functional requirements are high and your expectations are fast, I don't think the no-code solution would actually make sense for you. And also, if the expectations are slow, you would still go with the traditional ML technique. So, from what I've seen, whether the functional requirement is low is the key decision maker for the choice of whether you want to go with no-code tools or not.
Next step: I think visualization is also something that everybody in the team needs to be aware of, knowing how to use different charts, methods, and different ways to visualize data when it comes to making these decisions. Some libraries that help you make visualizations are D3, Plotly, and Matplotlib. This slide is kind of a flowchart of how the kind of data you have will help you make the decision. So a common path is: if you have more than one variable, you go down this path. Then, are these variables similar? Yes, you go down this branch. Is there a hierarchy involved in it? If you say no, okay, are they ordered? No, then go with this one. So you can take a screenshot of this slide, and that would help you make a proper judgment of how to present data going forward.
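As a tiny, hedged example of that decision in practice: for one categorical variable with a count per category, like the male versus female ratio from earlier, a simple bar chart is usually the branch the flowchart lands on. Here is what that could look like with Plotly Express; the numbers are made up purely for illustration.

```python
import pandas as pd
import plotly.express as px

# Hypothetical output of the earlier query: one categorical variable plus a count.
counts = pd.DataFrame({"gender": ["male", "female"], "users": [5400, 4600]})

# One categorical variable compared by size -> a bar chart fits the flowchart path.
fig = px.bar(counts, x="gender", y="users", title="Users by gender")
fig.show()
```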
This is a more detailed kind of diagram which helps you see the different charts and when they are used in certain situations. So again, you can take a screenshot of this as well and share it with your internal teams; that could help you devise a proper method for which chart you should use in which situation. Again, the link to this talk will also be available, and you can use this as a kind of playbook for your future projects as well.
Now you are done with the whole planning thing, you are done with the diagramming thing, and you are done with deciding the tool that you have to use. Now, the final step in doing all these things is dashboarding: basically presenting your findings in a very organized manner and making them available for everybody. Dashboarding is divided into four main steps. The first is collecting the visualizations: you have all these one, two, three, four questions, and every question has its own visualization, different kinds of charts. You collect them all together and you try to bring them into one place. Then you organize them based on priority. The first thing to keep in mind is that the most important question always remains at the top, right? The main question always arrives at the top. And the second thing you have to do is, if there is a data connection, like if the first question answers the second, then those graphs need to be organized together. So let's say this was one graph and there was one conclusion out of it; then you need the second graph next to it. This is plain 101 kind of organizing and arranging, and it may not sound like much, but we always skip these steps, and when it comes to dashboarding, we just apply things as they go. Then you have to set an automatic schedule. The tools that I have mentioned over here, like Apache Superset or Airflow or Gramex, kind of have an inbuilt capability of automating and pulling the data from the main source on a time-to-time basis. So if that is kind of a need that you have, you can set these things up: set up a timer for when you want this data to be refreshed again. Again, tools such as Superset, Gramex, Plotly Dash, and QuickSight have functions through which you can include non-tech people, share the access with them, and that would actually help them also come in at any time and do this. So this is kind of the dashboarding process that you need to do as the final step of your journey.
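To give a feel for what the "automatic schedule" piece could look like, here is a minimal, hypothetical Apache Airflow sketch that refreshes the data every 24 hours; the DAG name and task function are stand-ins for whatever preprocessing and querying script you actually run.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def refresh_dashboard_data():
    # Stand-in for your real pipeline: preprocess the raw export,
    # reload the results table, and let the dashboard read from it.
    print("pulling raw data, cleaning it, and rewriting the results table")

with DAG(
    dag_id="dashboard_refresh",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # run every 24 hours
    catchup=False,
) as dag:
    PythonOperator(
        task_id="refresh_data",
        python_callable=refresh_dashboard_data,
    )
```

With something like this in place, the dashboard stays current without the technical team having to rerun anything by hand.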
And then finally, your dashboarding is done. Now you need to document stuff, right? So from day one, create a shared document. People use Notion, Excel, and a whole bunch of stuff; my preference, I think the simplest way to go about it, is just an Excel sheet with a point-by-point description of the different things. Create a shared document, add the progress as you go, then add the findings you were looking for, the main findings from that project, and then collect feedback into the same document. Make this document available in the next sprint, and that kind of becomes an evolutionary cycle that improves the overall steps and the overall planning.
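If you prefer to keep that shared log as something machine-readable alongside the sprint, here is a tiny, purely illustrative sketch of appending entries to a CSV; the columns are just a suggestion, not a fixed format.

```python
import csv
from datetime import date

# Hypothetical columns for a simple sprint log: one row per step, finding, or feedback item.
FIELDS = ["date", "sprint", "step", "owner", "notes"]

def log_entry(path, sprint, step, owner, notes):
    """Append one documented step/finding/feedback item to the shared log."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # write the header only for a brand-new file
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "sprint": sprint,
            "step": step,
            "owner": owner,
            "notes": notes,
        })

log_entry("sprint_log.csv", "sprint-1", "preprocessing",
          "data team", "dropped 12% of rows as duplicates")
```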
Well, that is all we wanted to discuss today. So we started with the problem, with what the different problems are in the data science space. Then we also looked into some of the solutions, which are approaches. Then we covered the code approach and the no-code approach to doing these things. We also saw how you can use visualizations, different techniques of visualization, to present your data; dashboarding, to collect all the visualizations into one organized fashion; and then finally documenting as the final step. And we also explored a new method that is going to be popular these days, that is the sprint, and how you can use diagramming to explain what you want to know. Well, that is more or less what I could fit in this time. I hope I was able to deliver something new and open up some thoughts about it. Again, it was not the code 101 or DIY that you might have been expecting, but this was more about how to bring that exposure of finding insights out of data. This was Aman Sharma. If you really liked the presentation, please let me know your feedback on my handle. And again, the link to this presentation will be available. So until next time, if you have any questions, please drop them to me, and thanks for your time.