Transcript
Good morning, everyone.
It's a real pleasure to be here today. With over 18 years of experience in the data engineering field, I have seen the evolution of data systems in action. And today I'm excited to share how AI is revolutionizing the way we build systems, the way we build data pipelines, and the way we move data from one place to another.
Think of it as a journey.
I'll take you through the evolution of data engineering, from the early days of simple jobs moving data with basic cleansing and basic workflows, to today's intelligent, automated way of moving data. Along the way, I'll share some stories of both challenges and breakthroughs, how we can mitigate those challenges with today's AI, and the current state of data engineering.
Let me start by sharing a little bit of my own journey.
When I began my career, a data pipeline often meant moving a few data files from one system to another, maybe cleaning them, and then running a basic report for leaders, product managers, and the various users of operational reports. Fast forward to today, and the landscape has completely changed.
The data we deal with now is massive, dynamic, fast-changing, and increasingly unstructured, because of how and where it is produced. It's not enough to just move data around anymore. We need systems that can not only handle this complexity but evolve with it. Today's discussion will revolve around how AI is playing a pivotal role in solving these problems. I'll walk you through how AI is reshaping data pipelines, making them smarter, more adaptable, and more efficient.
What is a data pipeline?
Before we dive into AI, let's take a step back and make sure we are all on the same page about what a basic data pipeline is. Picture a traditional data pipeline as a factory line: data comes in from various sources, gets processed, and ends up in a warehouse or a dashboard for analysis, or as a curated data set that data scientists and business intelligence engineers can use for deeper insights. But as data grows larger, more complex, and more varied, this factory line needs to become more sophisticated. The old systems were not designed to handle this complexity, and that's where the challenge lies.
The struggles we see in these traditional pipelines: let me paint a picture of the early days. I remember late nights spent chasing down issues in a pipeline, debugging and finding basic problems like a data type mismatch, one field having a value that is not in line with the way we designed the schema, and everything goes wrong. It would sometimes take hours to debug, find, and fix. Maybe the data source would go offline, or a transformation step would fail because of a basic data type mismatch, and we would be literally scrambling to fix it. Those days were full of firefighting. We were constantly patching things up, only to find new problems the next day. And scaling and updating the system to handle those problems was a headache. We would throw more hardware at the problem if it was an efficiency bottleneck, or build new custom scripts to try to make things faster or handle the bigger problems, like the data type mismatches. But even with all that effort, it was not sustainable. We needed something better.
AI's role in transforming data pipelines. The turning point for me came when we started looking into AI and how we could leverage it. Suddenly, tasks that had once taken hours, even days, were happening automatically. For example, take a hypothetical case: instead of manually checking your data quality, the system would flag anomalies in real time. And the best part: the system was not just reacting, it was even learning from the pipelines, why they fail, how they fail, and what the potential issue could be. When we make a manual correction, reprocess, and feed that information back to the ML algorithms handling the pipelines, it becomes new learning for the system. Imagine a system that learns on its own from the data and gets smarter over time.
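To make that concrete, here is a minimal sketch of what real-time anomaly flagging on a pipeline metric could look like. This is an illustration, not the system from my own story; the row counts and the z-score threshold are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a pipeline-run metric (here, a daily row count) that deviates
    sharply from its recent history, using a simple z-score test."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical usage: row counts from the last ten runs vs. today's run.
recent_counts = [10_120, 9_980, 10_340, 10_050, 9_870,
                 10_210, 10_400, 10_090, 9_950, 10_180]
if is_anomalous(recent_counts, latest=2_300):
    print("Anomaly: today's row count is far outside the recent range.")
```

In a real pipeline, the same check would run on freshness, null rates, and schema drift, and each confirmed manual correction would feed back into the model as new training signal.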
These ML-powered, AI-powered data pipelines can automatically adapt to fluctuations, adjusting resources on the hardware side and improving their performance, all with minimal human intervention.
Key benefits. When we bring ML and AI into data engineering, the impact is profound. We are not just improving speed or accuracy; we are rethinking how we manage these huge data sets and huge pipelines with complex transformation rules. For example, instead of worrying about scaling a pipeline based on its size and the corresponding hardware or resource requirements, the AI models automatically adjust, adding more resources as needed, and the best part is that this happens without human intervention. This addresses the cost of hardware upgrades and manual intervention, and with the cloud coming into the picture, this dynamism lets us focus on higher-level, business-oriented tasks rather than spending a lot of time on technical challenges on a day-to-day basis.
Real-time adaptability. One of the most exciting aspects of AI is real-time adaptability. Imagine a flash sale, say Black Friday, or Boxing Day in the UK: a sale is on and traffic spikes unpredictably. In the past we would worry: we can't predict how much volume there will be, how much hardware or how many resources we need to handle it, or how the systems would cope with these new loads all of a sudden. But with AI, the system can automatically scale and adjust itself. It makes decisions in real time, giving the business a significant competitive advantage. It allows lightning-fast adjustments to ingestion, ensuring that no opportunity is missed and there is no failure in the end-to-end execution, even with huge workloads and sudden spikes.
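As a rough illustration of that scaling decision, here is a minimal sketch. The per-worker throughput, headroom factor, and limits are assumptions for the example, not a production policy.

```python
def workers_needed(events_per_sec, per_worker_capacity=5_000,
                   min_workers=2, max_workers=64):
    """Pick an ingestion worker count from the observed event rate,
    with 20% headroom and a hard floor and ceiling."""
    target = int(events_per_sec * 1.2 / per_worker_capacity) + 1
    return max(min_workers, min(max_workers, target))

# Hypothetical flash-sale spike: the event rate jumps from 8k to 150k/sec.
for rate in (8_000, 40_000, 150_000):
    print(f"{rate:>7} events/sec -> {workers_needed(rate)} workers")
```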
Let me share another one, a real-world scenario. A large financial institution had been struggling to detect fraud in real time. With traditional systems it was really difficult to handle, and one reason is that they are not real time. With AI that can analyze transaction patterns in real time, using models already trained on enough data, fraudulent behavior can be detected as it is happening, before it can cause much damage to the end user. It's like having a real-time security guard for your data, a person sitting there watching what's happening around your financial transactions. It's able to spot the anomalies and take action immediately, containing any further impact.
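Here is a tiny sketch of that idea. A real system would use a trained ML model; this stand-in just learns typical per-feature bands from historical transactions, and the features, margins, and values are all assumptions.

```python
def train(history):
    """'Train' by learning a tolerance band per feature from history."""
    columns = list(zip(*history))
    return [(min(col) * 0.5, max(col) * 1.5) for col in columns]

def is_suspicious(model, txn):
    """Flag a transaction with any feature outside its learned band."""
    return any(not (lo <= value <= hi)
               for value, (lo, hi) in zip(txn, model))

# Feature rows: [amount, hour_of_day, merchant_risk_score]
history = [[25.0, 12, 0.1], [60.0, 18, 0.2], [12.5, 9, 0.1],
           [80.0, 20, 0.3], [33.0, 14, 0.1], [45.0, 11, 0.2]]
model = train(history)

print(is_suspicious(model, [42.0, 13, 0.2]))    # False: looks routine
print(is_suspicious(model, [9_500.0, 3, 0.9]))  # True: flag immediately
```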
Optimizing efficiency. So how can we optimize efficiency in this AI-driven world?
One of the most powerful features of AI-powered data pipelines is that they are not static. As I mentioned earlier, by analyzing past performance data, these machine learning algorithms learn, identify potential bottlenecks, resource underutilization, and inefficiencies, and make proactive adjustments to improve performance. For example, if a pipeline is running slower than usual, the AI system can predict future bottlenecks based on what's happening now and what's happened in the past, and suggest or implement resource adjustments before they impact performance or break things end to end.
How does it work? Imagine an end-to-end system where we have data ingestion, data processing, storage, and the final analysis or usage of the processed information. The AI dynamically selects only the most relevant sources and formats, and its algorithms automatically clean, transform, and even enrich the data, filling in missing values, details missing from upstream systems or never entered by users, or issues in legacy systems. All of this is taken care of while we are processing or transforming the data. Finally, storage is automatically optimized as well: based on future access requirements, the system recommends how to store the data, where to store it, and the level of aggregation and grain we need to maintain.
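To ground those four stages, here is a bare-bones sketch of the ingest, process, store, and analyze flow. Every rule in it, the sample records, the default for a missing city, the print-based storage step, is a placeholder assumption, not a real system's logic.

```python
def ingest():
    # In practice: pull only the most relevant sources and formats.
    return [{"id": 1, "amount": "42.5", "city": None},
            {"id": 2, "amount": "17.0", "city": "Leeds"}]

def process(rows):
    # Clean, transform, enrich: cast types and fill missing details.
    for row in rows:
        row["amount"] = float(row["amount"])
        row["city"] = row["city"] or "UNKNOWN"  # enrichment hook
    return rows

def store(rows):
    # In practice: choose layout, location, and grain from access patterns.
    print(f"storing {len(rows)} rows")
    return rows

def analyze(rows):
    # Surface a headline number before any human looks at the data.
    print("total amount:", sum(r["amount"] for r in rows))

analyze(store(process(ingest())))
```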
And in the end, once we do all this, we obviously need this data for analytics and insights. AI-driven analytics helps surface the most important insights, flagging any issues, anomalies, key patterns, and deviations from expected behavior before any human intervention comes into the picture, before any analyst or business user even looks into the data. This is the future of data engineering: seamless, intelligent, and fully automated.
Let's see a case study, a real-world example. A real client faced delays in generating daily sales forecasts because of slow data processing. Obviously, if someone has to take a decision before, say, 8 a.m., and the pipeline is taking much longer than expected, or we are surprised by failures, it's a massive problem: they can't take decisions on time, and it gives an edge to the competition when the information isn't available on time. By switching to AI-driven pipeline management that could process data in real time, they were able to generate daily forecasts in hours instead of days. This allowed them to adjust pricing strategies, manage inventory requirements effectively, and even personalize marketing efforts, all based on up-to-the-minute data: what happened in the last minute drives what could happen in the next minute. This is the kind of transformation we are seeing in the industry, and AI is playing a major role here, driving meaningful change in how companies should operate and will operate.
What are some of the key technologies driving this? Let's take a look at the technologies behind these AI-powered and ML-powered pipelines. One basic ingredient, obviously, is machine learning and deep learning. These models learn from past issues, adapting and improving based on historical data, making the pipeline, and the end-to-end movement of data, smarter over time. NLP, natural language processing, is another. It allows the pipeline to process unstructured data more effectively: text, logs, or even audio converted to text and then processed, so massive amounts of unstructured data are taken care of here. Another advantage of NLP is that it behaves like a human, understanding the information and then processing it. Take an example: a missing city. Based on the postcode, anyone can tell you the city name, so it can be filled in. Similarly, a misspelled name: say my name is Srinivas, S-R-I-N-I-V-A-S, and someone spells it with the A missing. NLP has the ability to repopulate the correct value based on what it learned in the past.
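A tiny sketch of that kind of cleansing, using a lookup table and fuzzy matching as a stand-in for a learned model; the postcode map and the name list are made up for illustration.

```python
import difflib

# Hypothetical lookups; a real system would learn these from past data.
POSTCODE_TO_CITY = {"SW1A": "London", "M1": "Manchester", "LS1": "Leeds"}
KNOWN_NAMES = ["Srinivas", "Priya", "Daniel"]

def fill_city(record):
    """Fill a missing city from the postcode's outward code."""
    if not record.get("city"):
        record["city"] = POSTCODE_TO_CITY.get(record["postcode"].split()[0])
    return record

def fix_name(name):
    """Snap a misspelled name to the closest previously seen name."""
    match = difflib.get_close_matches(name, KNOWN_NAMES, n=1, cutoff=0.8)
    return match[0] if match else name

print(fill_city({"postcode": "M1 2AB", "city": None}))  # city -> Manchester
print(fix_name("Srinivs"))                              # -> Srinivas
```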
Automated model selection and tuning is another: it automates choosing and tuning the machine learning models for more optimal data transformation. And another is cloud infrastructure. Obviously, to have this dynamism, with the necessary scalability and flexibility in real time, the infrastructure has to support the way we are leveraging it, and the cloud is the main requirement, the foundation for all of this. Together, all these technologies enable that next generation of data engineering.
How do you get started with AI in data engineering? Step one, obviously, is to assess the current infrastructure and current pipelines, identify all the pain points, list everything down in a backlog, and check where we could apply these AI and ML models to have the most impact; we need to bring return on investment into the picture before moving forward, but that's step one. Then select the right tools and frameworks that best suit these new, effective, and efficient data architecture goals. And start small: it's better to implement a small solution, driven by ML models on a limited scale, see whether there is any benefit and how much, and then iterate and apply more models to the pain points to improve things. In doing the scale-up we gain more confidence in the system, and we see more of the benefit, how these are helping both the business and the technical team, and then we expand the scope across the data architecture. The key here is to start with a small, manageable project the team can handle and build from there.
What could the future trends be? One is self-healing pipelines: imagine a system that automatically detects and resolves its own problems without any human input. That's the north star. It's not a day-one thing; it will take long-term effort, and we are still evolving.
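As a small sketch of the self-healing idea, here is a retry loop that applies a known remediation between attempts; the error names and fixes are hypothetical.

```python
import time

# Hypothetical remediations keyed by failure type.
REMEDIATIONS = {
    "schema_mismatch": lambda ctx: ctx.update(cast_types=True),
    "source_offline":  lambda ctx: time.sleep(ctx.get("backoff", 1)),
}

def run_with_healing(step, ctx, max_attempts=3):
    """Run a pipeline step, remediating and retrying on known failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(ctx)
        except RuntimeError as err:
            fix = REMEDIATIONS.get(str(err))
            if fix is None or attempt == max_attempts:
                raise
            fix(ctx)  # apply the remediation, then retry
            print(f"attempt {attempt} failed ({err}); remediated, retrying")

def load(ctx):
    # Toy step that fails until types are cast.
    if not ctx.get("cast_types"):
        raise RuntimeError("schema_mismatch")
    return "loaded"

print(run_with_healing(load, {}))  # heals itself and prints "loaded"
```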
Another is real-time data processing. As of now, we have real-time processing working in a conventional way, but with more advanced ML models we will see even faster processing in more use cases, enabling businesses to make instant decisions based on real-time data processing across systems and across areas. And then augmented engineering: data engineers will eventually work alongside AI systems and ML models, using them as supporting tools to optimize workflows and integrate them into business processes for macro impact. We have just started scratching the surface of what AI can do in the data engineering world, and it's not the final state: we could see more advancements and more surprises in the future.
Finally, the conclusion and next steps. To wrap up: AI is fundamentally transforming data engineering, the way we process data, by making pipelines smarter, more efficient, and more adaptable. We are seeing a shift from manual processes to self-optimizing, intelligent systems. My call to action for all of us today is simple: embrace AI. Use it. Whether you are just starting or already experimenting with AI in your systems, take the next step and build, integrating it with your data architecture. That is the starting point, both in terms of learning and in terms of seeing the benefits. Thank you very much for your time today. I'd be happy to answer any questions or comments, here or offline. Thanks a lot. Thank you very much.